Far Too Often, Big Data Is Bad Data

We generate 2.5 quintillion bytes of data every day. You might not search or text that much, but together we all search and text that much. Big data has become gigantic data. 

From the NSA’s massive surveillance dragnet to Facebook’s information-gurgling algorithms to welfare systems that track their subjects’ every move, we’ve seen the dark effects of this growing data collection: More doesn’t always mean better. When you collect nearly everything, you collect a lot of crap with it too. Big data becomes dirty data. Add more noise, and signals get buried. Increase the variable count, and correlation gets harder. Scientific causation becomes a casualty; fairness can get left behind, too.

Big, sloppy datasets have reinforced prejudice, engendered black boxes, and contributed to flawed systems: recidivism predictors that recommend disproportionately longer prison sentences for black convicts, chatbots that learn how to be racist and misogynistic in less than 24 hours, cameras that think Asians are always blinking, photo apps that label black people as gorillas, search engines that show men as CEOs at higher rates than women. The list goes on and on. (It has even been suggested that searching for “black names” will show ads for criminal background checks.)

Still, many fall for the seduction that data itself is infallible. It’s not. That insight is hardly new, but it’s a lesson humans need to keep learning over and over.

Newspapers and pollsters still have egg on their faces from 1948’s “Dewey Defeats Truman” headlines, when all of America learned a lesson in bad sample selection. Much more recently, when “big data” became an overnight buzzword for big business, economist and author Tim Harford had to pump the brakes for breathless analysts bragging about their voluminous spreadsheets and their magical cost-cutting tools.

Harford told us in 2015 that big data fans often seem to forget that no matter the enormity of a dataset, it’s still subject to error-causing sample bias. You can load entire server rooms with data, but 5,000 carefully selected survey-takers will provide better results than a billion random Google searches. 
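Harford's point about sample bias can be made concrete with a toy simulation. What follows is a minimal sketch, not anyone's real polling method: the two groups, their support rates, the 90 percent skew, and the function name `simulate` are all hypothetical numbers and names chosen for illustration.

```python
import random


def simulate(seed=0, big_n=200_000, small_n=5_000):
    """Compare a large skewed sample with a small random one (toy model)."""
    rng = random.Random(seed)

    # Assumption: the population splits evenly into two groups that
    # answer "yes" at different rates, so the true rate is 50 percent.
    support = {"A": 0.7, "B": 0.3}
    true_rate = 0.5 * support["A"] + 0.5 * support["B"]

    def draw(prob_group_a, n):
        # Draw n respondents; prob_group_a controls how often group A shows up.
        yes = 0
        for _ in range(n):
            group = "A" if rng.random() < prob_group_a else "B"
            yes += rng.random() < support[group]
        return yes / n

    # Big but biased: group A makes up 90 percent of what gets collected.
    big_biased = draw(0.9, big_n)
    # Small but representative: both groups appear in their true proportions.
    small_random = draw(0.5, small_n)
    return true_rate, big_biased, small_random


if __name__ == "__main__":
    true_rate, big_biased, small_random = simulate()
    print(f"true rate:           {true_rate:.3f}")
    print(f"big biased sample:   {big_biased:.3f}")
    print(f"small random sample: {small_random:.3f}")
```

Under these assumed numbers, the large sample settles near 66 percent no matter how many more records are added, because the extra data repeats the same skew. The 5,000-person random sample wobbles by a point or two but centers on the true 50 percent.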

Nevertheless, when data research told Starbucks executives that they would save a few pennies through “clopenings” (forcing certain workers to close stores at 9 p.m. and open them back up at 4:30 a.m.), the firm charged ahead with the soul-sucking suggestion until public embarrassment made it back down. Forget the inhumanity for a moment; given the resulting employee resentment, clopenings didn’t even make long-term business sense.

“If you impose lots of costs on workers for tiny savings on your (balance sheet), that’s not only evil capitalism, it’s incompetent capitalism,” Harford said.

We need to be very careful about the hidden bias in any data project ... Some things are easy to measure and some things aren’t. We’ve known this for a long time, but we keep making the same mistake.

Big data can become big, bad data in the wrong hands.

The solution is actually one that governments, corporations, and consumers can all get behind: We need to think about “good” data rather than big data. Instead of simply collecting more data, which only invades individual privacy and reinforces the inequalities around us, organizations need to focus on collecting and using good data to empower fair, privacy-preserving, and rights-respecting systems.

As two Stanford professors recently pointed out in Nature, much of the data around us (Wikipedia entries, Google images, article citations) overrepresents white men and countries like the U.S. and, in doing so, underrepresents everyone else. Word embeddings, used in machine learning to represent language, even “quantify 100 years of gender and ethnic stereotypes” in code.

Collecting more of this data isn’t going to fix anything; feeding a face scanner even more images of white people will not address the racist projections that result from these systems. We need to stop thinking in terms of terabytes and start thinking in terms of ethics.

This notion of good data is hardly simple. In fact, it can be painstakingly difficult to create good data samples. Since people of color are disproportionately arrested in the U.S., how do we build ethical crime predictors? Do we deliberately skew the data to show different races committing crimes at the same rate? Because that would raise questions about who does the skewing, and how, as well as how they’re held accountable. Or what if we skew the data beyond recognition, so much so that algorithms then become poor predictors of behavior — for instance, underplaying how socioeconomic factors may impact one’s likelihood of committing robbery? Doing so might result in “fairer” treatment, but at what cost? 

So yes, it’ll be hard. That’s OK. The alternative we have today, in which lazy, packrat programmers collect every piece of data they find and store it in the hopes that they can make (and sell) meaning out of it later, creates a world none of us want to live in.

Technologists have already argued for corrections on the policy side (and in fact, that’s where some of our work falls), but there still need to be corrections on the tech side, before systems are even developed. Every programmer in America should be focused on figuring out how to do more, and better, with less. Otherwise, we’re reinforcing inequality and programming “Dewey Defeats Truman” into the very fabric of our digital lives.

Justin Sherman is a student at Duke University and the Co-Founder and Vice President of Ethical Tech. Bob Sullivan is an author and advisor to Ethical Tech.
