Racial Bias in Code and Data: An Interview with Alex Garcia

As a young data journalist, I was advised to attend NICAR — an annual data journalism conference organized by Investigative Reporters and Editors (IRE) and its suborganization, the National Institute for Computer-Assisted Reporting. In researching the conference, I stumbled upon recordings of the 2019 NICAR Lightning Talks, which are five-minute presentations related to data journalism chosen by popular vote. Last year, Alex Garcia gave a talk called 5 ways to write racist code (with examples). I was able to chat with him last week about his talk, the response he received, and how he’s feeling about it a year later.

Emilia Ruzicka: Thank you so much for agreeing to meet with me! Can we start with an introduction?

Photo credit: Evangelina Rodriguez

Alex Garcia: Sure! My name is Alex. I recently graduated from the University of California, San Diego (UCSD) with a major in computer engineering. I’m from Los Angeles and went to school down in San Diego. I’ve always been interested in computers, and when I started at UCSD I decided, “Oh, computer engineering might be something kind of cool.” The first time I ever programmed or did anything in this field was when I started out in college.

I didn’t know about the data journalism field until about a year and a half, two years ago. I found out through Reddit’s Data Is Beautiful community, where I found all these New York Times articles and whatever else, so that’s how I got into it. I didn’t know too much about the actual field and NICAR until I saw someone randomly tweet about it. I saw it was going to be in Newport Beach and I was like, “Oh, that’s really cool!” In terms of my actual experience in journalism, I honestly have none. There are student newspapers on campus and all that, but I never really got into that, never knew it was available. I did do a little bit of data stuff, but I just really didn’t know much about it.

So during NICAR I met a lot of really cool people, saw what the field was like, got really interested in it. I met someone who goes to UCSD and is interested in journalism. We were actually roommates for this past quarter, which was really cool. Right now, I just graduated in December. I have a couple of months off where I’m not doing too much. I’m going to start a new job at the end of March doing general software engineering stuff. In the future, I hope to get into some sort of newsroom, some kind of data journalism, later down the road.

ER: That’s a really interesting journey, where you started not knowing, entered computer science, and then by association and serendipity found data journalism. Speaking of, last year, you gave a lightning talk at NICAR. Could you talk about your topic?

AG: Yeah, so a little bit of background about that. It was specifically about racial bias in algorithms and racial bias in code. This is a field that at the time I was somewhat interested in because I’d see a tweet or an article here and there that someone wrote. I had friends from different fields who were taking classes and they’d say, “Hey, this is a cool article, why don’t you read it?” and it would be about courtroom justice and how these algorithms would determine whatever. So I was always tangentially interested in it. I always had the idea in the back of my mind that I should just aggregate all these links or stories that I find and have it in one list that people can go to and find. But I never did that because I just never got around to it.

So when I signed up for the conference and saw they had these lightning talks where you can do a few-minute speech about whatever you want, having that idea in my mind, I thought I could either aggregate this list or do this talk. I was specifically excited to do a talk to journalists, too, because I don’t know how many reporters really know about this field. They may know tangentially — kind of like my knowledge of college sports and how students can get paid for playing; I know something about that field, but I don’t know much — so I thought it was the same in this case, where people may have heard stories about courtroom injustice or a Microsoft Twitter bot that went crazy because people took it over, but they may not know the differences between what leads to those things. I thought if I aggregated all these things and showed how diverse this field is, how these different problems arise, and what fields they appear in, it might be something nice to share.

I had a bunch of bookmarks to all these different stories I had, cobbled them together, threw a pitch in, and it was a lot of fun aggregating! I’m not the best public speaker and I’m not the best organizer for all these thoughts, so the night before I was frantically working on the slides. I had a lot of ideas about what I wanted to put in the talk, but since it’s only five minutes, I had to cut things out, cut things short, and move things around. But it was fun! It was definitely nerve-wracking, especially because I knew no one in the audience besides two or three people I had met during the days leading up to it.

ER: You touched on this a little bit, but what inspired your talk? Was there any particular article that you encountered that made you think you needed to do your talk on racial bias in code or was it more of the conglomerate idea that sparked it?

AG: That’s a great question. I think for general inspiration of the talk, it was just a bunch of different links that I saw and stories that I would find. Also, the general — not ignorance, per se — but how people don’t know that this is a problem or that it could exist. One of the things that I don’t think I mentioned in the talk specifically, but one of the links that I had was a Reddit thread about gerrymandering. There was some news article talking about gerrymandering and one of the top comments was, “Oh, this research team or this company is working on an algorithm that could do it automatically. They give it whatever and then the computer will do it, so there will be no bias at all.” A couple of comments after that they were saying, “Why are humans doing this? Computers could do it and it would have no bias.” And somewhere hidden in there, there was one comment saying, “Hey, that’s not really how that works. A computer could do it and it could still be biased and there’s many different ways that could come across.” So I think that thread, in particular, stuck out to me. I’ve seen similar threads since then, whether it’s just random regulatory items or other random stuff where people will say that if a computer could do something, it would be a lot easier or more fair.

There were also other general conversations I would have with friends, not necessarily about whether it would be fair for computers to do something, but more about the actual impacts that these issues might have on people. I think there was also a tweet from Alexandria Ocasio-Cortez. She said something about how algorithms have bias and algorithms could be racist. And then there was a reporter from The Daily Wire saying that code can’t be racist. So it’s a lot of nit-picky things where I don’t know if people really understand this, how it works, and how it manifests.

A slide from Garcia’s talk about Alexandria Ocasio-Cortez and Ryan Saavedra of The Daily Wire

Also, about a year before the talk, I took a small seminar in computer science education, and the professor at UCSD was really interested in K-12 computer science education. Part of what she would talk about, and what I learned more about in later classes, was the importance of knowing the fundamentals of computer science or programming. Not necessarily knowing how to program, but knowing how it works, the way it works, and what it can or can’t do. Think about the general US population and how many people actually know, not how computers work, but what their limits are. That’s the kind of understanding that drives this conversation. If people are ignorant or don’t know that these computers are not unbiased, that can be a problem.

ER: You mentioned briefly how important you felt it was to present this to an audience of journalists. Could you talk more about that and any sort of considerations you made when you were giving your talk, knowing that your audience was journalists and the ethics that are inherently assumed when journalists present information?

AG: I remember one thing I was thinking about while I was making the presentation and noticed specifically at NICAR was that most of the journalists there are journalists first. They learned how to code while working on stories or doing their job. There are some people who are half-journalist and half-engineer and they know more about coding, but most of the audience seemed to be the kind of people who would take a Python or R workshop to learn about them for the first time. So I didn’t want to have anything that was too technical or show too much code. One thing I did to counteract that was to use a lot of headlines or stories by reporters who were in the field who know more about it and would be familiar to the audience. And while I did show some code, I made sure it wouldn’t be too complicated and would be easy to explain.

One of the points was about sentiment analysis and how, if you use the wrong model, a string like “I like Italian food” can score higher sentiment than “I like Mexican food.” So when I did show code, it was very simplified and probably something people were somewhat used to.

A slide from Garcia’s talk about sentiment analysis
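The mechanism behind that example can be sketched in a few lines. The toy word weights below are invented purely for illustration — they stand in for the skewed associations a real sentiment model can absorb from biased training text, which is what produces the gap Garcia describes between two otherwise parallel sentences:

```python
# Illustrative sketch only: the per-word "sentiment weights" here are made up.
# In a real model these would be learned from large text corpora, where words
# like cuisine names can pick up biased associations from how they co-occur
# with positive or negative language online.

WORD_SENTIMENT = {
    "i": 0.0,
    "like": 0.8,
    "food": 0.1,
    "italian": 0.3,   # hypothetical learned weight
    "mexican": -0.4,  # hypothetical learned weight, skewed by biased training text
}

def sentence_sentiment(sentence: str) -> float:
    """Score a sentence by averaging per-word weights; unknown words are neutral."""
    words = sentence.lower().replace(".", "").split()
    return sum(WORD_SENTIMENT.get(w, 0.0) for w in words) / len(words)

print(sentence_sentiment("I like Italian food"))  # higher score
print(sentence_sentiment("I like Mexican food"))  # lower score, same sentence shape
```

Nothing in the scoring function itself treats the two sentences differently; the bias lives entirely in the learned weights, which is exactly why it is easy to miss.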

For the ethical implications, I’m not sure. I did my best to have sources or links that people could go to and follow, but what I didn’t talk about was how you can report on this or how you can find different agencies that may be meddling in this, mostly because I don’t know how to do that. I don’t have a journalism background, so I don’t know how you find sources or what’s the best, most ethical way to go about doing that. I kind of avoided doing that and said: “Here are some stories and headlines that all have something to do with each other and some reasons behind how one event led to another.”

ER: After you gave this talk, what was the response? Were people really interested, and did they want to learn more? If so, has that response continued, or have you seen a continued trend in the media of reports on stories like the ones you used in your talk?

AG: Right after the talk, I would get random Twitter DMs here and there from journalists saying, “This was really cool, I really liked it!” or “I had a small question about a source that you used.” One person wanted to talk about the realm in general — what companies are maybe more susceptible to this or that danger. Personally, it was a great way to meet people and see who is working in this field and who is interested in it.

In terms of what I’ve seen in the media since my talk, I think the field has gotten a little bit worse. There’s a company The Washington Post wrote about where applicants send in videos of job interviews and the company uses AI to judge whether they’re a good candidate by analyzing speech and body patterns. It’s so problematic because there are just so many things that can go wrong, and seeing the amount of money and velocity and power that they have is pretty scary. That’s probably the biggest thing I’ve seen since the talk. I’ve probably seen a couple of other headlines because there’s more and more focus on this, especially an academic focus, but I can’t think of any off the top of my head.

A slide from Garcia’s talk depicting the racial and gender bias of facial recognition algorithms

ER: You mentioned earlier that you had a lot of things you wanted to put in your talk, but because of time constraints you couldn’t. If you had the opportunity to give the talk again without a time limit, what are some things you would have mentioned, both from when you were preparing the talk and from current issues of racial bias in code and data?

AG: I think for each of the five sections I had, there were one or two more articles, so I would have included those to make my points stronger. Also, I had a reach goal when I wrote the slides to use a JavaScript tool to turn my slides into a website. I wanted to run a machine learning algorithm during the presentation to show that you don’t need a big fancy server or computer to have the resources to make biased code. At the end, I would have been able to show that it was running on some NYPD stop-and-frisk data that I had, and how biased the outcome could be with some pretty readily available tools and data. It’s not hard at all for this to happen.

I was trying to make it work, but the logistics weren’t working out and I didn’t want to cause too many difficulties, so I just went with regular slides instead, but I think having an example like that would drive the point home even further. Even the presentation you make for a talk has the power to make automatic, biased decisions for no good reason. I also would have liked to do demos of where things could go wrong, such as the sentiment analysis example I used, so that people could see exactly what was happening instead of just getting the theory. I think recreating that would reinforce my ideas.

ER: Cool! Is there anything else you want to say about the importance of being aware of racial bias in code and data or how people can become more conscious and evaluative of what they’re consuming?

AG: I think one heuristic that can be helpful in noticing when these things happen is watching for when someone says, “Oh yeah, a computer did that” or “a computer made the decision” or even “oh, that can’t be biased because of X, Y, or Z.” That’s something I feel happens a lot from day to day where something happened “automatically,” but for me, that’s a red flag. Those are things to look into a little more and check out how the decision was actually built.

With data visualization specifically, a visualization is only as sound as the data it’s built on top of. If the data has underlying problems, then no matter what you put on top of it, you’re just going to make it worse. For instance, electoral maps: if you look at election results by county for the entire United States, you’re in some ways supporting an older, racist, white supremacist system. The goal might not be to create a racist visualization, but in some ways, you’re biasing the view and integrity of that data.

There are many other examples with data visualization and data analysis, but the point is to know that whatever data you’re using, you’re sitting on top of a historical view of how it came to that point. I think that’s definitely something to consider as you work.

You can listen to Alex Garcia’s full lightning talk via IRE Radio here.

The final slide from Garcia’s talk

Emilia Ruzicka is a data journalist, designer, producer, and storyteller who specializes in health, science, and technology reporting. They are currently pursuing their M.A. in Media, Culture, and Technology at University of Virginia while continuing to work on freelance projects and write their own blog. Outside of data viz, Emilia loves to visit museums, make art, and talk about the USPS. If you have a project proposal, story tips, or want to find out more, visit emiliaruzicka.com.