4 Data Science Insights from Facebook Headquarters
Response rates to traditional surveys are declining. Polls relying on calling landline phones are becoming less and less representative. And even though this type of data is great for research, it is getting harder to collect it the traditional way.
On the other hand, data generated online are becoming more ubiquitous and easier to access. Most of us leave a long data trail every day in our online activities, from professional trajectories on LinkedIn to political preferences on Twitter. Researchers do not yet know how to interpret these data, however, and it is still unclear how reliable or insightful they can actually be.
To help bridge the gap between web data ubiquity and actual use in research, Facebook hosted a data science conference for researchers in August at its headquarters in Silicon Valley. I attended to present Kauffman research I co-authored with colleagues Yas Motoyama, Jared Konczal, and Jordan Bell-Masterson (more info about the research in the sidebar). I took home some interesting lessons from the conference and wanted to share them.
People, Not Users
These are all words I have used in the last week alone to refer to people. And it is scary.
The walls of Facebook's offices are more or less covered with posters like the one above, with various messages to their team. This specific message hit home. For my work at the Kauffman Foundation, I often sift through piles of data trying to figure out trends or insights. But it is easy to forget that the data represent real people.
If you are curious, you can see more Facebook posters and other office pictures here. The ones security will let you take pictures of, anyways. They are understandably very careful about what you can and cannot do at their offices.
The insight: People, not data points
The Facebook Data Science Team Toolkit
The Facebook Data Science Team shared the tools they use in their own analyses. Most of them are open-source, and the tools mentioned include:
- Python-based tools
- "Big Data" tools
  - Hadoop (distributed computing – open-source)
  - Apache Hive (data warehouse – originally developed by Facebook, now open-source)
- Network analysis tools
  - Gephi (network analysis and visualization – open-source)
The insight: The data scientist needs an evolving toolkit. If you want to see more about it, check out this deck put together by Software Carpentry.
80% of Data Science is Cleaning and Munging
When I start working with a new dataset, I sometimes get frustrated with how much time I have to spend cleaning and munging the data before I can produce anything really good with it.
This frustration carried a sort of hope, however. I told myself that, in the future, once I knew all the tools and was really good at data science, I would get done so quickly with my data cleaning that it would take almost no time at all.
The talks with the Facebook Data Science team shattered my naivety. Most of the data scientists there, arguably some of the best Silicon Valley can create, highlighted how much time they spend cleaning and munging data. Many said they spend around 80 percent of their time doing just that.
The insight: Data science requires a lot of schlep, and the only way to do it is to get your hands dirty.
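To make the schlep a bit more concrete, here is a minimal sketch of the kind of cleaning that tends to eat up the time: trimming whitespace, normalizing case, unifying the many spellings of "missing," and dropping duplicates. The sample rows, column names, and helper functions are all made up for illustration; this is not a tool or dataset shown at the conference.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace,
# a duplicate row, and several spellings of "missing".
RAW = """name,city,age
 Alice ,kansas city,34
BOB,Kansas City ,n/a
alice,Kansas  City,34
Carol,,41
"""

# Spellings treated as missing values (an assumption for this example).
MISSING = {"", "n/a", "na", "null", "none"}


def clean_value(value):
    """Collapse whitespace and map missing-value spellings to None."""
    collapsed = " ".join(value.split())
    return None if collapsed.lower() in MISSING else collapsed


def clean_rows(text):
    """Parse CSV text, normalize each field, and drop duplicate records."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = clean_value(row["name"])
        city = clean_value(row["city"])
        age = clean_value(row["age"])
        record = {
            "name": name.title() if name else None,
            "city": city.title() if city else None,
            "age": int(age) if age is not None else None,
        }
        # Two rows count as duplicates only after normalization.
        key = (record["name"], record["city"], record["age"])
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    for record in clean_rows(RAW):
        print(record)
```

Even in this toy case, the two "Alice" rows only turn out to be the same person once the casing and spacing are fixed, which is why the cleaning has to come before any counting or analysis.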
What Academia is up to in the Digital Data Space
A big part of the conference was dedicated to academics presenting how they are using social media or digitally collected data in their research. Since the Facebook conference was held the day before the American Sociological Association's annual meeting in San Francisco, most attendees were sociologists.
Here are 3 of the projects presented:
- Going It Alone? Social Connectivity at Reentry from Prison. Naomi F. Sugie (Princeton).
  - Studying the smartphone use of 128 parolees and how their networks help them reintegrate into society (e.g., finding a job).
- Vegetarianism and Twitter Use. Orlando Torres (Florida State University).
  - Looking at Twitter mentions of vegetarianism and using textual analysis to see whether the mentions were positive or negative, then looking at the geographical distribution of positive vs. negative associations with vegetarianism.
- Martin Barron from NORC presented a really good overview of the trade-offs of using social media data. I did not find the presentation online, but here is the summary of what he talked about.
The insight: Big data is increasingly making its way into social science research, but there remains a lot of room for innovation.
A Fun Fact About Facebook Data
Facebook has a huge wall with screens showing Facebook users and their connections.
Besides the interesting fact that Facebook users are all over the world, there is one even more impressive observation: there is no underlying map being shown.
Facebook has so many users that simply plotting latitude and longitude for them in a blank space gives you a visualization that very closely resembles an actual map of the world.
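The idea is simple enough to sketch in a few lines. The snippet below drops latitude/longitude pairs onto a blank character grid with no coastline or border data at all. The function name `render_points`, the grid size, and the handful of approximate city coordinates are my own assumptions for illustration, not anything Facebook showed; with only eight points you get scattered dots, and the map-like outline only emerges once there are many points.

```python
def render_points(points, width=72, height=24):
    """Render (latitude, longitude) pairs onto a blank character grid.

    Longitude maps linearly to columns and latitude to rows (a plain
    equirectangular layout). No underlying map is drawn.
    """
    grid = [[" "] * width for _ in range(height)]
    for lat, lon in points:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # skip invalid coordinates
        # Clamp so that lon == 180 or lat == -90 stays inside the grid.
        col = min(int((lon + 180) / 360 * width), width - 1)
        row = min(int((90 - lat) / 180 * height), height - 1)
        grid[row][col] = "*"
    return ["".join(line) for line in grid]


# Approximate coordinates of a few cities, for illustration only.
SAMPLE_POINTS = [
    (37.77, -122.42),  # San Francisco
    (39.10, -94.58),   # Kansas City
    (-23.55, -46.63),  # Sao Paulo
    (51.51, -0.13),    # London
    (6.52, 3.38),      # Lagos
    (19.08, 72.88),    # Mumbai
    (35.68, 139.69),   # Tokyo
    (-33.87, 151.21),  # Sydney
]

if __name__ == "__main__":
    for line in render_points(SAMPLE_POINTS):
        print(line)
```

Because nothing but the points themselves is drawn, any continent shapes that appear come purely from where the people are, which is exactly what makes the Facebook wall so striking.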
See Arnobio's slide deck from the Facebook event.
Read the full research report.
Read commentary from Forbes, Business Insider and Venture Beat.