I’m in the middle of a great book called Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier. If you’re not yet familiar with big data, it’s all about the switch that technology has allowed us to make from collecting small samples and focusing on causation to collecting ALL the available data and mining it for correlations.
There are loads of relevant examples in the book, but one that’s easily relatable is how Google is able to determine with accuracy (before any health agency can) where the flu is spreading based on people’s search terms in specific geographic areas. This is only possible because Google collects ALL the data about what people are searching for and where they’re searching from. It’s the incredible magnitude of the data points collected that enables Google to make correlations that we may, or may not, expect to find.
Some other applications of big data include Google’s translation system, which is so effective not necessarily because the algorithm is so much better than other translators, but because of the massive amount of data the algorithm was fed. Then there are the creepier examples, such as how Target can tell when a woman is pregnant based on her purchases in the first few weeks of pregnancy and market to her accordingly. They are only able to figure out these correlations by collecting and saving millions of pieces of data to analyze.
I look forward to writing more about the book in a future post, but last night as I was reading I realized there were some great big data buzzwords in the book worth sharing.
Collaborative Filtering – You’re probably already familiar with this concept, but it’s always nice to put a name to something you didn’t know what to call, isn’t it?
Collaborative filtering describes what’s happening behind the scenes when a website, for example, makes recommendations about products or services you might like. It’s literally filtering out which products to show you by mining through loads of data regarding the preferences of you and other users (that’s the collaborative part). So the next time Netflix recommends a terrible movie, you can specifically curse their collaborative filtering method (for the record, I am usually happy with the movies Netflix recommends for me, but I’d like my 2 hours back for watching It’s a Disaster because, frankly, it was). On a side note, did you know that Netflix actually crowdsourced its user rating prediction algorithm? Check out the Netflix Prize if you want to learn more.
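To make the “collaborative” part concrete, here’s a minimal sketch of user-based collaborative filtering in Python. The users, movies, and ratings are all made up, and this is a bare-bones illustration of the idea, not Netflix’s actual (far more sophisticated) algorithm:

```python
from math import sqrt

# Toy ratings: user -> {movie: rating}. All data here is invented for illustration.
ratings = {
    "alice": {"Inception": 5, "Up": 4, "Alien": 1},
    "bob":   {"Inception": 5, "Up": 5, "Alien": 2},
    "carol": {"Inception": 1, "Alien": 5, "Jaws": 4},
}

def similarity(a, b):
    """Cosine similarity over the movies two users have both rated."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in shared)
    norm_a = sqrt(sum(ratings[a][m] ** 2 for m in shared))
    norm_b = sqrt(sum(ratings[b][m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Score unseen movies by the similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        if sim == 0.0:
            continue
        for movie, rating in ratings[other].items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    return sorted(((s / weights[m], m) for m, s in scores.items()), reverse=True)

print(recommend("alice"))  # movies alice hasn't rated, best guess first
```

With real data, the “loads of data” part is exactly what makes this work: the more users and ratings in the table, the more reliable those similarity scores become.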
There are all different types of collaborative filtering systems that look at various data points in different combinations. For example, Amazon was a pioneer in item-to-item collaborative filtering, which they started back in 1998. The “item-to-item” bit means they focus on product-specific data, making associations between products based on which items users view in succession, purchase together, and so on.
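The item-to-item flavor can be sketched in a few lines too. The orders below are invented, and a simple co-occurrence count stands in for the much more refined similarity measures a real retailer would use:

```python
from itertools import combinations
from collections import defaultdict

# Made-up purchase histories -- not real Amazon data.
orders = [
    {"tent", "sleeping bag", "lantern"},
    {"tent", "sleeping bag"},
    {"tent", "hiking boots"},
    {"lantern", "batteries"},
]

# Count how often each pair of products appears in the same order.
co_counts = defaultdict(int)
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_counts[(a, b)] += 1

def related(item, k=3):
    """Products most often bought together with `item`, best match first."""
    scores = defaultdict(int)
    for (a, b), n in co_counts.items():
        if item == a:
            scores[b] += n
        elif item == b:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(related("tent"))  # "customers who bought this also bought..."
```

Note that nothing here asks who *you* are: the associations live entirely between the products, which is what distinguishes item-to-item from user-based filtering.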
There’s also “content-based filtering” which means the data considered is specific to the item rather than a user’s past behavior or the behavior of other users. If you’re familiar with Pandora’s Music Genome Project, that’s a perfect example of content-based filtering. You tell Pandora what music you like, and their system looks at characteristics of those artists or songs you identified and finds other songs and artists whose music has similar characteristics. This is called “content-based” filtering rather than “collaborative” filtering because all the data being used to generate the recommendations is specific to the content – the music – it’s not looking at the activity of other users at all.
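A content-based recommender can be sketched the same way. The songs and the three attributes below are invented stand-ins for the hundreds of musical attributes the Music Genome Project actually catalogs; the point is only that the ranking uses item attributes, never other listeners’ behavior:

```python
from math import sqrt

# Made-up attribute vectors per song: (tempo, energy, acousticness).
songs = {
    "Song A": (0.90, 0.80, 0.10),
    "Song B": (0.85, 0.75, 0.15),
    "Song C": (0.20, 0.10, 0.95),
}

def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similar_to(liked, k=2):
    """Rank the other songs purely by similarity of their own attributes."""
    return sorted((s for s in songs if s != liked),
                  key=lambda s: cosine(songs[liked], songs[s]),
                  reverse=True)[:k]

print(similar_to("Song A"))  # most musically similar songs first
```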
Predictive Analytics – Or as I like to call it, data-driven fortune-telling.
When there is historical data galore to analyze, we can find correlations that help us to make predictions about future outcomes. One example the book gave is how UPS collects data from each of their 60,000 vehicles to determine exactly when certain mechanical parts should be replaced. This is saving them millions of dollars because they’re no longer making unnecessary replacements based on a preset timeframe (or waiting until it’s too late and there’s a mechanical failure) but instead using their own historical data to identify the signs that indicate a part needs replacing. Because they’re carefully monitoring their vehicles and individual parts, they can spot these signs in real-time and make the replacements when needed.
Now, predictive analytics isn’t specific to big data. Any time you use existing data to predict future outcomes you’re conducting predictive analysis. If you’ve ever created a budget based on last year’s numbers, then you, too, are a data-driven fortune teller!
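That budget example can be made literal with a few lines of code: fit a least-squares line through past yearly totals and extrapolate one year ahead. The years and dollar figures are invented, and a straight-line trend is the simplest possible model, but it’s the same “use history to predict the future” idea in miniature:

```python
# Made-up yearly expense totals.
years = [2009, 2010, 2011, 2012]
spend = [1000.0, 1100.0, 1210.0, 1290.0]

# Ordinary least-squares fit of spend = slope * year + intercept.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(spend) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, spend))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x

# Extrapolate the trend to next year's budget.
forecast = slope * 2013 + intercept
print(round(forecast, 2))
```

With these numbers the trend works out to roughly $98 of growth per year, so the model pencils in about $1,395 for 2013. UPS’s version of this is the same move at vastly larger scale: sensor histories in, replacement timing out.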
But big data is taking predictive analytics to a whole new level. With big data, the volume and breadth of historical data being accumulated is growing exponentially. And because there is so much data available, we’re able to find correlations we may never have thought of ourselves. Some of the correlations don’t even make logical sense, but that doesn’t make them any less valid.
For example, the book cites research from the University of Ontario Institute of Technology showing that certain changes in the vital signs of premature babies can predict the onset of a serious infection. And the change they identified that predicts this? The vital signs actually become more stable. Not what you might expect to happen right before a fight for a baby’s life. You’d think there’d be some vital signs that are totally out of whack to indicate a problem on the horizon, but instead it’s the calm before the storm that can now help doctors ring the warning bells and take preemptive action.
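The “stability as a warning sign” idea can be illustrated with a rolling standard deviation: flag the moments when a signal stops varying. The heart-rate readings and the threshold below are entirely made up, and the actual research models many vital signs at once, but the shape of the detection is the same:

```python
import statistics

# Made-up heart-rate readings: normal variability, then unusual steadiness.
heart_rate = [142, 151, 138, 149, 144, 153, 146, 146, 145, 146, 146, 145]

def flag_stability(readings, window=4, threshold=1.0):
    """Return indices where the rolling standard deviation over the last
    `window` readings drops below `threshold` -- i.e. the signal has
    become suspiciously stable."""
    flags = []
    for i in range(window, len(readings) + 1):
        if statistics.stdev(readings[i - window:i]) < threshold:
            flags.append(i - 1)
    return flags

print(flag_stability(heart_rate))  # positions where the "calm" sets in
```

On this toy series the early readings bounce around too much to trigger anything, and the flags only appear once the signal flattens out near the end, which is exactly the counterintuitive pattern the researchers found worth alarming on.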
I look forward to further exploring big data – and the lessons we non-statisticians can learn from it and apply in our own work – in a future post soon.