Big Data Analytics: What It Means and Why We Should Tread Carefully
Amazon has a pretty good idea of what books I like (and buy). Netflix knows which Portlandia-adjacent shows I will watch. My credit card company understands my spending patterns much better than I do. All these revelations are made possible through big data analytics.
I recently began a course at NYU on big data analytics, an area that is not new to the Kauffman Foundation. Our interest shows in our support for and work with national statistics agencies, the opening of the new Kansas City Research Data Center project just around the corner from us, our work in education, and many other efforts.
Image courtesy of Intel Free Press via Flickr.com
What is Big Data?
There are many definitions of big data. Essentially, big data means lots of data, more than a “normal” sample size or a “typical” database. Imagine opening a spreadsheet with so many rows and columns that it breaks Excel. Big data is a data set so large that traditional methods cannot analyze the correlations within it. Machine learning and other algorithms are needed to uncover the valuable information buried deep within the data. Large amounts of data are nothing without the tools to break them down.
In my examples with Amazon, Netflix, and my credit card, three things are happening. First, big data is being collected on a massive scale. Second, an algorithm is written to organize the data and discover patterns of human behavior. That is to say, the algorithm uses my past behavior to predict my future behavior. Finally, the machine applies that algorithm to my individual data to determine what I will want next.
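Those three steps can be sketched in a few lines of Python. This is only a toy illustration, not how Amazon, Netflix, or any credit card company actually works: the customer names, the book titles, and the `recommend` function are all made up, and the "algorithm" is a simple count of what customers with overlapping histories also bought.

```python
from collections import Counter

# Step 1: collected data. Hypothetical purchase histories, invented for illustration.
HISTORIES = {
    "ana": {"Dune", "Neuromancer", "Snow Crash"},
    "ben": {"Dune", "Neuromancer", "Foundation"},
    "cal": {"Dune", "Foundation", "Hyperion"},
    "dee": {"Emma", "Persuasion"},
}


def recommend(user, histories, top_n=2):
    """Steps 2 and 3: find patterns across customers, then apply them to one user."""
    mine = histories[user]
    scores = Counter()
    for other, theirs in histories.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # number of shared past purchases
        if overlap == 0:
            continue  # no shared behavior, so no evidence to use
        for item in theirs - mine:
            scores[item] += overlap  # weight by how similar the other customer is
    # Sort by score (highest first), then by title so ties resolve predictably.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("ana", HISTORIES))  # ['Foundation', 'Hyperion']
```

Notice that the sketch knows nothing about why these books go together. It only counts what tends to be bought together, which is exactly the kind of correlation-only reasoning discussed later in this post.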
Big Data is a Big Deal for Entrepreneurs
Big data analytics is game-changing for entrepreneurs. It can improve how products are advertised and brought to market, help develop platforms for people to work on, and help entrepreneurs test often and fail fast, all of which makes them more efficient.
Local governments can also take advantage of big data to better serve their citizens. They have the ability to better understand which streets are most likely to have potholes, which blocks are highest in crime, or which parts of town could most benefit from attracting a new grocery store.
Thanks to the ever-growing Internet of Things, data points on all things great and small are more readily available than ever before. And so we use that data, because more is always better, and bigger sample sizes can always make us more informed, right?
The Problem With Big Data Analytics
For all the potential power big data analytics promises, there are speed bumps to overcome. Three issues deserve consideration if big data is to be all that it can be: how algorithms handle causation and correlation, sample size and data collection, and biases in how algorithms are developed.
Causation and Correlation
An article from the Financial Times took a contrarian approach to the reliability of big data analytics. Simply put, big data analysis can surface statistical correlations without any thoughtful reflection on causation. That is to say, in the author’s words, “a theory-free analysis of mere correlation is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.”
Thoughtful analysis by clever analysts should not be underestimated. Rather, it should be combined with the new tools we have for big data analytics to provide a more thorough understanding of the data. We can get a lot from correlation, but we need to dig deeper and look at underlying causes to produce the best analysis.
A well-known example of big data analysis getting it wrong is Google Flu Trends (GFT). The idea was that search trends could reveal where the flu was spreading sooner than the CDC could. Wired magazine describes it as a spectacular flop: the GFT algorithm was prone to counting the wrong searches, never unlearned the data it had captured, and was thrown off by increases in flu searches when the public was more afraid of the flu.
Sample Size and Data Collection
Furthermore, in terms of sample size, some believe big data holds all of the data. But is “all” really possible? The Financial Times article continues, “is ‘N=All’ really a good description of most of the found data sets we are considering? Probably not.” We do not need all of the data to have enough data. It’s more important to have a proper, representative sample than a larger but non-representative one.
Likewise, there’s the traditional notion that “what gets measured gets managed,” so we must be careful about what we measure. With so much data, it’s easy to get stuck in the data forest and lose sight of the smaller-but-more-important trees. Algorithms are designed by people, and because of this, correlation may be built into the algorithm, corrupting findings before analysis begins.
The Financial Times article described an attempt to gauge the public’s general mood by analyzing the contents of tweets over a certain period of time. This proved inaccurate, though, as the general public isn’t necessarily on Twitter. There was a clear disconnect between the population studied and the conclusions shared about the general mood of the nation.
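The point about representative samples can be mimicked with a small simulation. The sketch below is purely illustrative and every number in it is an assumption I made up: a population of 100,000 people, 20 percent of whom are "online" and happen to report a higher mood score. It compares a large sample made up only of online users with a small random sample drawn from everyone.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: about 20% are "online" and have a higher average mood.
population = []
for _ in range(100_000):
    online = random.random() < 0.2
    mood = random.gauss(7.0 if online else 5.0, 1.0)
    population.append((online, mood))


def mean(values):
    return sum(values) / len(values)


true_mean = mean([mood for _, mood in population])

# Large but non-representative: every online user (roughly 20,000 people).
big_biased = [mood for online, mood in population if online]

# Small but representative: 500 people drawn at random from everyone.
small_random = [mood for _, mood in random.sample(population, 500)]

biased_error = abs(mean(big_biased) - true_mean)
random_error = abs(mean(small_random) - true_mean)

print(f"true mean mood:         {true_mean:.2f}")
print(f"large biased sample:    off by {biased_error:.2f} (n={len(big_biased)})")
print(f"small random sample:    off by {random_error:.2f} (n={len(small_random)})")
```

Under these made-up assumptions, the large sample misses the true average by a wide margin no matter how many online users it includes, while the 500-person random sample lands close to it. More data does not fix a sample that leaves out most of the population.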
Sometimes algorithms developed for big datasets contain inherent biases. For example, the Harvard Business Review cites an instance from the rollout of a smartphone pothole tracker released by the City of Boston. While this app seemed like a passive solution for helping the city discover where potholes needed fixing based on where citizens tracked potholes, it unearthed a bias towards wealthier neighborhoods where residents had access to smartphones. It’s important to remember that it’s people who create algorithms, which may cause errors or biases, and we shouldn’t trust all analyses simply because a “smart machine” ran the numbers.
Despite its problems, big data analytics is here to stay. It can be a hugely useful tool, but we need to learn to use it thoughtfully. Analysis needs to be fundamentally sound (or at least the weaknesses in the approach well understood) to understand causation and correlation. How data is collected needs to be carefully considered. Biases and human error are built into these analytics as people develop algorithms.
We use sophisticated methodologies to analyze “regular-sized” data. While those methodologies may not yet exist for big data, statisticians, researchers, and analysts need to keep trying to find ways to apply those methods to larger-scale data.
That said, big data analytics will continue to help many entrepreneurs and policymakers act more efficiently and create greater good. That’s all for now; my Netflix account won’t watch itself…yet.