Sunday, May 28, 2017

“Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are” by Seth Stephens-Davidowitz

Stephens-Davidowitz is a data scientist who has made his name analyzing Big Data and extrapolating interesting trends. He mainly uses Google’s data, particularly Google Trends and Google AdWords, to tease out statistics that differ from the information people give to pollsters or use in polite conversation. Unlike in polls, people searching Google have both an incentive to tell the truth and the cover of anonymity. Stephens-Davidowitz goes so far as to suggest “Google searches are the most important dataset ever collected on the human psyche.” However, he cautions against being overly impressed by the size of a given dataset. After all, “the bigger an effect, the fewer the number of observations necessary to see it.” Enormous piles of data require a scientist to tease out what is of most interest. “Frequently, the value of Big Data is not its size; it’s that it can offer you new kinds of information to study- information that had never previously been collected…. If you are going to try to use new data to revolutionize a field, it is best to go into a field where old methods are lousy.” That is why more trends are being found in medicine and education than in stocks on Wall Street. Oftentimes, for commercial success, weak evidence of causation, or even just correlation, is all that you need. Trend spotters are “in the prediction business, not the explanation business…. When trying to make predictions, you needn’t worry too much about why your models work.” Much of the work is just in deciding which piles of data are worth further analysis. “You have to be open and flexible in determining what counts as data…. Consider nontraditional sources of data.”
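To make the quoted point about effect size and observations concrete, here is a minimal sketch using the textbook normal-approximation formula for comparing two group means. Nothing here comes from the book: the function name samples_per_group, the 5 percent significance level, and the 80 percent power target are all my own assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate observations needed per group to detect a difference
    in means of `effect_size` standard deviations (two-sided test).

    Hypothetical helper using the standard normal approximation:
        n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Small, medium, and large effects (in standard-deviation units).
for d in (0.2, 0.5, 0.8):
    print(d, samples_per_group(d))
```

Under these assumptions a large effect needs a few dozen observations per group while a small one needs several hundred, which is the sense in which a big effect does not require a big dataset.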

These days, a doppelgänger search, which finds the people most similar to you and uses their behavior to predict yours, is considered the preeminent way to make accurate predictions. “For a doppelgänger search to be truly accurate, you don’t want to find someone who merely likes the same things you like. You also want to find someone who dislikes the things you dislike…. Amazon uses something like a doppelgänger search to suggest what books you might like. They see what people similar to you select and base their recommendations on that. Pandora does the same in picking what songs you might want to listen to. And this is how Netflix figures out the movies you might like.” When Netflix switched to making suggestions based on its doppelgänger algorithm, as opposed to suggestions from customers’ own movie queues, clicks and return visits to its site increased sharply. The other revolution in Big Data was the proliferation of randomized controlled experiments, or A/B tests. “Facebook now runs a thousand A/B tests per day, which means that a small number of engineers at Facebook start more randomized, controlled experiments in a given day than the entire pharmaceutical industry starts in a year.” Causal effects can also be measured with natural experiments, using regression discontinuity. “Anytime there is a precise number that divides people into two different groups- a discontinuity- economists can compare- or regress- the outcomes of people very, very close to the cutoff.” The biggest worry with Big Data is that there is too much of it. “If you test enough things, just by random chance, one of them will be statistically significant…. The more variables you try, the more humble you have to be. The more variables you try, the tougher the out-of-sample test has to be. It is also crucial to keep track of every test you attempt.” Another problem is the so-called lamppost fallacy. Just because something is important does not mean there is data for it, and just because there is data for something does not mean it is important. “The things we can measure are often not exactly what we care about.”
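The doppelgänger search described above can be sketched in a few lines. This is a toy illustration under my own assumptions, not how Amazon, Pandora, or Netflix actually do it: ratings are +1 for a like and -1 for a dislike, similarity is plain cosine similarity, and every name and title in the data is made up. Because dislikes are scored as -1, someone who likes everything is a worse match than someone who shares your dislikes too.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two rating dicts (+1 like, -1 dislike).

    Only items both people rated contribute to the dot product.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def find_doppelganger(target, others):
    """Return the name of the person whose ratings best match `target`."""
    return max(others, key=lambda name: cosine_similarity(target, others[name]))


def recommend(target, others):
    """Suggest items the doppelgänger liked that `target` has not rated."""
    twin = find_doppelganger(target, others)
    return sorted(
        item
        for item, rating in others[twin].items()
        if rating > 0 and item not in target
    )


# Hypothetical data: "ann" likes everything, "bob" shares the dislikes too.
you = {"Dune": 1, "Emma": -1, "Ulysses": -1}
others = {
    "ann": {"Dune": 1, "Emma": 1, "Ulysses": 1, "Neuromancer": 1},
    "bob": {"Dune": 1, "Emma": -1, "Ulysses": -1, "Foundation": 1},
}

print(find_doppelganger(you, others))  # bob
print(recommend(you, others))          # ['Foundation']
```

Ann agrees with you on one book and disagrees on two, so her similarity is negative; Bob agrees on all three, so he is the doppelgänger and his other favorite becomes the recommendation.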

So what are some of the most interesting trends teased out from Big Data? In the days leading up to a hurricane, strawberry Pop-Tarts sell seven times faster than normal. Areas that supported Trump in the largest numbers were those that made the most Google searches for “nigger.” Having a core common group of Facebook friends with your romantic partner is a strong predictor that your relationship will not last. Better socioeconomic status means a higher chance of making it to the NBA. The Google search most correlated with the national unemployment rate between 2004 and 2011 was the term “Slutload.” The size of the left ventricle of a horse’s heart is a massive predictor of its racing success. A man who searches for “Judy Garland” is three times more likely to search for gay porn than for straight porn. Among women, “gay” is 10 percent more likely to complete searches that begin “Is my husband…” than the second-place word, “cheating.” The states with the highest percentage of women asking this are South Carolina and Louisiana. In fact, in twenty-one of the twenty-five states where this question is most frequently asked, support for gay marriage is lower than the national average. Students who were admitted to Harvard but attended Penn State have the same career incomes as Harvard graduates. Similarly, students who just missed the admittance cutoff for Stuyvesant High School in New York City by a question or two have SAT and AP scores indistinguishable from those of students who were barely admitted.
