If you’re 33 years old and have attended a few family Thanksgivings in a row without a date, the topic of mate choice is likely to arise. And just about everybody will have an opinion.
“Seth needs a crazy girl, like him,” my sister says.
“You’re crazy! He needs a normal girl to balance him out,” my brother says.
“Seth’s not crazy,” my mother says.
“You’re crazy! Of course Seth is crazy,” my father says.
All of a sudden, my shy, soft-spoken grandmother, quiet through the dinner, speaks. The loud, aggressive New York voices go silent, and all eyes focus on the small old lady with short yellow hair and still a trace of an Eastern European accent. “Seth, you need a nice girl. Not too pretty. Very smart. Good with people. Social, so you will do things. Sense of humor, because you have a good sense of humor.”
Why does this old woman’s advice command such attention and respect in my family? Well, my 88-year-old grandmother has seen more than everybody else at the table. She’s observed more marriages, many that worked and many that didn’t. And over the decades, she has catalogued the qualities that make for successful relationships. At that Thanksgiving table, for that question, my grandmother has access to the largest number of data points. My grandmother is Big Data.
Like it or not, data is playing an increasingly important role in all of our lives — and its role is going to get larger. Newspapers now have full sections devoted to data. Companies have teams with the exclusive task of analyzing their data. Investors give startups tens of millions of dollars if they can store more data. Even if you never learn how to run a regression or calculate a confidence interval, you are going to encounter a lot of data — in the pages you read, the business meetings you attend, the gossip you hear next to the watercoolers you drink from.
Many people are anxious over this development. They are intimidated by data, easily lost and confused in a world of numbers. They think that a quantitative understanding of the world is for a select few left-brained prodigies, not for them. As soon as they encounter numbers, they are ready to turn the page, end the meeting, or change the conversation.
But I have spent 10 years in the data analysis business and have been fortunate to work with many of the top people in the field. And one of the most important lessons I have learned is this: Good data science is less complicated than people think. The best data science, in fact, is surprisingly intuitive.
What makes data science intuitive? At its core, data science is about spotting patterns and predicting how one variable will affect another. People do this all the time.
Just think about how my grandmother gave me relationship advice. She utilized the large database of relationships that her brain has uploaded over a near century of life — in the stories she has heard from her family, her friends, her acquaintances. She limited her analysis to a sample of relationships in which the man had many qualities that I have — a sensitive temperament, a tendency to isolate himself, a sense of humor. She zeroed in on key qualities of the woman — how kind she was, how smart she was, how pretty she was. She correlated these key qualities of the woman with a key quality of the relationship — whether it was a good one. Finally, she reported her results. In other words, she spotted patterns and predicted how one variable will affect another. Grandma is a data scientist.
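Grandma's steps (restrict the sample, pick a trait, compare outcomes) can be sketched in a few lines of Python. This is only an illustration: the records are invented, and the field names `man_like_seth`, `woman_kind`, and `lasted` are hypothetical labels, not anything Grandma or the author actually tabulated.

```python
# Invented toy records; each dict stands for one remembered relationship.
RELATIONSHIPS = [
    {"man_like_seth": True, "woman_kind": True, "lasted": True},
    {"man_like_seth": True, "woman_kind": True, "lasted": True},
    {"man_like_seth": True, "woman_kind": False, "lasted": False},
    {"man_like_seth": True, "woman_kind": False, "lasted": True},
    {"man_like_seth": False, "woman_kind": False, "lasted": True},
]


def success_rate(records, trait, value):
    """Share of records with records[trait] == value in which the relationship lasted."""
    matching = [r for r in records if r[trait] == value]
    if not matching:
        return None  # no comparable cases, so no estimate
    return sum(r["lasted"] for r in matching) / len(matching)


# Step 1: limit the sample to men who resemble the person asking.
sample = [r for r in RELATIONSHIPS if r["man_like_seth"]]

# Step 2: relate one quality of the woman to the outcome.
rate_kind = success_rate(sample, "woman_kind", True)
rate_not_kind = success_rate(sample, "woman_kind", False)

# Step 3: report the results.
print(f"kind partner: {rate_kind:.2f}, not kind: {rate_not_kind:.2f}")
```

With only a handful of cases the gap between the two rates means little, which is the point made later about why the size of the data set matters.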
You are a data scientist, too. When you were a kid, you noticed that when you cried, your mom gave you attention. That is data science. When you reached adulthood, you noticed that if you complain too much, people want to hang out with you less. That is data science, too.
Because data science is so natural, the best Big Data studies, I have found, can be understood by just about any smart person. If you can’t understand a study, the problem is probably with the study, not with you.
Want proof that great data science tends to be intuitive? I recently came across a study that may be one of the most important conducted in the past few years. It is also one of the most intuitive studies I’ve ever seen. I want you to think not just about the importance of the study — but how natural and grandma-like it is.
The study was by a team of researchers from Columbia University and Microsoft. The team wanted to find what symptoms predict pancreatic cancer. This disease has a low five-year survival rate — only about 3 percent — but early detection can double a patient’s chances.
The researchers’ method? They utilized data from tens of thousands of anonymous users of Bing, Microsoft’s search engine. They coded a user as having recently been given a diagnosis of pancreatic cancer based on unmistakable searches, such as “just diagnosed with pancreatic cancer” or “I was told I have pancreatic cancer, what to expect.”
Next, the researchers looked at searches for health symptoms. They compared that small number of users who later reported a pancreatic cancer diagnosis with those who didn’t. What symptoms, in other words, predict that, in a few weeks or months, a user will be reporting a diagnosis?
The results were striking. Searching for back pain and then yellowing skin turned out to be a sign of pancreatic cancer; searching for back pain alone made it unlikely someone had pancreatic cancer. Similarly, searching for indigestion and then abdominal pain was evidence of pancreatic cancer, while searching for indigestion without abdominal pain meant a person was unlikely to have it. The researchers could identify 5 to 15 percent of cases with almost no false positives. Now, this may not sound like a great rate, but if you have pancreatic cancer, even a 10 percent shot at possibly doubling your chances of survival would feel like a windfall.
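The comparison at the heart of the study can be sketched roughly as follows. This is not the researchers' code or data: the search histories are invented, the helper names `searched_then` and `diagnosis_rate` are hypothetical, and the real analysis was far more involved. The sketch only shows the idea of comparing how often a later diagnosis follows one search pattern versus another.

```python
# Toy, invented search histories. Each entry is
# (symptom searches in time order, whether the user later reported a diagnosis).
# None of these records come from the actual study.
USERS = [
    (["back pain", "yellowing skin"], True),
    (["back pain", "yellowing skin"], True),
    (["back pain", "yellowing skin"], False),
    (["back pain"], True),
    (["back pain"], False),
    (["back pain"], False),
    (["back pain"], False),
    (["indigestion"], False),
]


def searched_then(history, first, later):
    """Return True if `first` was searched and `later` was searched after it."""
    if first not in history:
        return False
    first_index = history.index(first)
    return later in history[first_index + 1:]


def diagnosis_rate(users, matches):
    """Share of users selected by `matches` who later reported a diagnosis."""
    outcomes = [diagnosed for history, diagnosed in users if matches(history)]
    if not outcomes:
        return None  # no users fit the pattern
    return sum(outcomes) / len(outcomes)


rate_sequence = diagnosis_rate(
    USERS, lambda h: searched_then(h, "back pain", "yellowing skin"))
rate_back_pain_only = diagnosis_rate(
    USERS, lambda h: "back pain" in h and "yellowing skin" not in h)

print(f"back pain, then yellowing skin: {rate_sequence:.2f}")
print(f"back pain alone: {rate_back_pain_only:.2f}")
```

In this made-up sample, the two-step pattern is followed by a diagnosis far more often than back pain alone, which mirrors the direction of the finding described above but says nothing about its real size.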
The paper detailing this study would be difficult for non-experts to fully make sense of. It includes a lot of technical jargon, such as the Kolmogorov-Smirnov test, the meaning of which, I have to admit, I had forgotten. (It’s a way to determine whether a model correctly fits data.)
However, note how natural and intuitive this remarkable study is at its most fundamental level. The researchers looked at a wide array of medical cases and tried to connect symptoms to a particular illness. You know who else uses this methodology in trying to figure out whether someone has a disease? Husbands and wives, mothers and fathers, and nurses and doctors. Based on experience and knowledge, they try to connect fevers, headaches, runny noses, and stomach pains to various diseases. In other words, the Columbia and Microsoft researchers wrote a groundbreaking study by utilizing the natural, obvious methodology that everybody uses to make health diagnoses.
But wait. Let’s slow down here. If the methodology of the best data science is frequently natural and intuitive, as I claim, this raises a fundamental question about the value of Big Data. If humans are naturally data scientists, if data science is intuitive, why do we need computers and statistical software? Why do we need the Kolmogorov-Smirnov test? Can’t we just use our gut? Can’t we do it like Grandma does, like nurses and doctors do?
This gets to an argument intensified after the release of Malcolm Gladwell’s bestselling book Blink, which extols the magic of people’s gut instincts. Gladwell tells the stories of people who, relying solely on their guts, can tell whether a statue is fake; whether a tennis player will fault before he hits the ball; how much a customer is willing to pay. The heroes in Blink do not run regressions; they do not calculate confidence intervals; they do not run Kolmogorov-Smirnov tests. But they generally make remarkable predictions. Many people have intuitively supported Gladwell’s defense of intuition: They trust their guts and feelings. Fans of Blink might celebrate the wisdom of my grandmother giving relationship advice without the aid of computers. Fans of Blink may be less apt to celebrate my studies or other studies which use computers. If Big Data — of the computer type, rather than the grandma type — is a revolution, it has to prove that it’s more powerful than our unaided intuition, which, as Gladwell has pointed out, can often be remarkable.
The Columbia and Microsoft study offers a clear example of rigorous data science and computers teaching us things our gut alone could never find. This is also one case where the size of the data set matters. Sometimes there is insufficient experience for our unaided gut to draw upon. It is unlikely that you, or your close friends or family members, have seen enough cases of pancreatic cancer to tease out the difference between indigestion followed by abdominal pain and indigestion alone. Indeed, it is inevitable, as the Bing data set gets bigger, that the researchers will pick up many more subtle patterns in the timing of symptoms, for this and other illnesses, that even doctors might miss.
Moreover, while our gut may usually give us a good general sense of how the world works, it is frequently not precise. We need data to sharpen the picture. Consider, for example, the effects of weather on mood. You would probably guess that people are more likely to feel gloomy on a 10-degree day than on a 70-degree day. Indeed, this is correct. But you might not guess how big an impact this temperature difference can make. I looked for correlations between an area’s Google searches for depression and a wide range of factors, including economic conditions, education levels, and church attendance. Winter climate swamped all the rest. In winter months, warm climates, such as that of Honolulu, have 40 percent fewer depression searches than cold climates, such as that of Chicago. Just how significant is this effect? An optimistic read of the effectiveness of antidepressants would find that the most effective drugs decrease the incidence of depression by only about 20 percent. To judge from the Google numbers, a Chicago-to-Honolulu move would be at least twice as effective as medication for your winter blues. (Full disclosure: Shortly after I completed this study, I moved from California to New York. Using data to learn what you should do is often easy. Actually doing it is tough.)
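The "twice as effective" claim is simple arithmetic on the two percentages quoted above, shown here as a small sketch. It treats a drop in depression searches as if it were comparable to a drop in depression incidence, which is the rough comparison the text makes and not a clinical equivalence. The variable names are just labels chosen for this example.

```python
# Both percentages are the ones quoted in the text above.
climate_reduction_pct = 40  # fewer depression searches in warm vs. cold climates in winter
drug_reduction_pct = 20     # optimistic reduction in depression incidence from medication

ratio = climate_reduction_pct / drug_reduction_pct
print(f"Climate effect is about {ratio:.1f} times the medication effect")
```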
Sometimes our gut, when not guided by careful computer analysis, can be dead wrong. We can get blinded by our own experiences and prejudices. Indeed, even though my grandmother is able to utilize her decades of experience to give better relationship advice than the rest of my family, she still has some dubious views on what makes a relationship last. For example, she has frequently emphasized to me the importance of having common friends. She believes that this was a key factor in her marriage’s success: She spent most warm evenings with her husband, my grandfather, in their small backyard in Queens, New York, sitting on lawn chairs and gossiping with their tight group of neighbors.
However, at the risk of throwing my own grandmother under the bus, data science suggests that Grandma’s theory is wrong. A team of computer scientists recently analyzed the biggest data set ever assembled on human relationships — Facebook. They looked at a large number of couples who were, at some point, “in a relationship.” Some of these couples stayed “in a relationship.” Others switched their status to “single.” Having a common core group of friends, the researchers found, is a strong predictor that a relationship will not last. Perhaps hanging out every night with your partner and the same small group of people is not such a good thing; separate social circles may help make relationships stronger.
As you can see, our intuition alone, when we stay away from the computers and go with our gut, can sometimes amaze. But it can make big mistakes. Grandma may have fallen into one cognitive trap: We tend to exaggerate the relevance of our own experience. In the parlance of data scientists, we weight our data, and we give far too much weight to one particular data point: ourselves.
Grandma was so focused on her evening schmoozes with Grandpa and their friends that she did not think enough about other couples. She forgot to fully consider her brother-in-law and his wife, who chitchatted most nights with a small, consistent group of friends but who fought frequently and divorced. She forgot to fully consider my parents, her daughter and son-in-law. My parents go their separate ways many nights — my dad to a jazz club or ball game with his friends, my mom to a restaurant or the theater with her friends — yet they remain happily married.
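The idea of giving one data point too much weight can be made concrete with a weighted average. The sketch below uses the two shared-friends couples from the anecdotes above. The weight of 5 on Grandma's own marriage is an assumption picked purely for illustration, and `weighted_success_rate` is a hypothetical helper name.

```python
# Couples with a shared circle of friends, taken from the anecdotes above.
# lasted: 1 if the marriage lasted, 0 if it ended in divorce.
SHARED_FRIENDS_COUPLES = [
    {"who": "Grandma and Grandpa", "lasted": 1},
    {"who": "brother-in-law and his wife", "lasted": 0},
]


def weighted_success_rate(couples, weights):
    """Weighted share of couples that lasted; `weights` maps couple name to weight."""
    total_weight = sum(weights[c["who"]] for c in couples)
    weighted_sum = sum(weights[c["who"]] * c["lasted"] for c in couples)
    return weighted_sum / total_weight


# Every couple counts the same.
EQUAL_WEIGHTS = {"Grandma and Grandpa": 1, "brother-in-law and his wife": 1}
# Assumed for illustration: Grandma's own marriage counts five times as much.
SELF_HEAVY_WEIGHTS = {"Grandma and Grandpa": 5, "brother-in-law and his wife": 1}

even_view = weighted_success_rate(SHARED_FRIENDS_COUPLES, EQUAL_WEIGHTS)
grandma_view = weighted_success_rate(SHARED_FRIENDS_COUPLES, SELF_HEAVY_WEIGHTS)

print(f"equal weights: {even_view:.2f}")
print(f"own marriage over-weighted: {grandma_view:.2f}")
```

The same two couples support very different conclusions about shared friends depending on how heavily the observer's own experience counts.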
When relying on our gut, we can also be thrown off by the basic human fascination with the dramatic. We tend to overestimate the prevalence of anything that makes for a memorable story. For example, when asked in a survey, people consistently rank tornadoes as a more common cause of death than asthma. In fact, asthma causes about 70 times more deaths. Deaths by asthma don’t stand out — and don’t make the news. Deaths by tornadoes do.
We are often wrong, in other words, about how the world works when we rely just on what we hear or personally experience. While the methodology of good data science is often intuitive, the results are frequently counterintuitive. Data science takes a natural and intuitive human process — spotting patterns and making sense of them — and injects it with steroids, often showing us that the world works in a completely different way from how we thought it did.
It took time for the natural sciences to begin changing our lives — to create penicillin, satellites, and computers. It may take time before Big Data leads the social and behavioral sciences to important advances in the way we love, learn, and live. But I believe such advances are coming. I hope, in fact, that some of you reading this help create them.
Seth Stephens-Davidowitz is a Harvard-trained economist, former Google data scientist, and author of The New York Times best-seller Everybody Lies (Dey Street Books, 2017).
This article is featured in the January/February 2018 issue of The Saturday Evening Post.
From the book Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us about Who We Really Are by Seth Stephens-Davidowitz.
Copyright © 2017 by Seth Stephens-Davidowitz. Reprinted by permission of Dey Street Books, an imprint of HarperCollins Publishers.