Big Data Analytics Pitfalls
Notes on Nature’s article “When Google got flu wrong”, link here.
Google Flu Trends’ occasional large differences in predicted flu levels compared to the CDC’s surveillance were due in part to the fact that Google’s estimates were based on flu-related search traffic, data that can be influenced by factors other than whether people actually have flu symptoms. For example, in 2009 and 2013, Google Flu Trends overestimated the percentage of the US population with flu symptoms because of the novelty of swine flu and heavy media coverage of flu season, respectively. This shows that data used for these large-scale algorithms should be monitored and adjusted to control for extraneous factors.
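One way to picture that kind of monitoring is a simple divergence check against an external baseline. The sketch below uses made-up numbers and a hypothetical flag_divergent_weeks helper (not anything from the articles or from Google’s actual pipeline): it compares search-based estimates against surveillance figures and flags weeks where they drift far apart, which can signal that search behaviour is being driven by something other than illness.

```python
# A minimal sketch with made-up numbers: compare a search-based estimate
# against an external surveillance baseline (e.g. CDC ILI rates) and flag
# weeks where the two diverge sharply.
def flag_divergent_weeks(search_estimates, surveillance, threshold=0.5):
    """Return weeks where the search-based estimate deviates from
    surveillance by more than `threshold` times the surveillance value."""
    flagged = []
    for week, (est, obs) in enumerate(zip(search_estimates, surveillance)):
        if obs > 0 and abs(est - obs) / obs > threshold:
            flagged.append((week, est, obs))
    return flagged

# Hypothetical weekly percentages of the population with flu symptoms;
# the search-based series is inflated mid-season by media-driven searches.
search_based = [1.2, 2.5, 4.8, 9.5, 6.0]
cdc_reported = [1.1, 2.3, 3.0, 4.9, 4.5]
for week, est, obs in flag_divergent_weeks(search_based, cdc_reported):
    print(f"week {week}: search estimate {est}% vs surveillance {obs}%")
```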
Overfitting
Notes on Science’s article “Google’s Big Data Flu Flop”, link here.
Another issue with Google Flu Trends’ data is that it selected search terms based on their correlation with flu cases, and because flu is typically seasonal, this pulled in searches for unrelated events that simply happen to occur during flu season. Google Flu Trends shows evidence of overfitting to this unrelated data, and this overfitting leads to errors.
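The sketch below (entirely synthetic data, not Google’s actual method) illustrates the mechanism: when many candidate search terms are ranked purely by correlation with a seasonal flu signal, terms that are merely seasonal but otherwise unrelated can score just as well as genuinely flu-related ones, so a model built on them breaks whenever flu timing or search behaviour departs from the usual season.

```python
# Synthetic demonstration of correlation-based term selection latching
# onto unrelated seasonal signals (overfitting to seasonality).
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(156)  # three synthetic years of weekly data

# "True" flu activity: a winter peak each year plus noise.
season = np.clip(np.cos(2 * np.pi * weeks / 52), 0, None)
flu = season + 0.1 * rng.normal(size=weeks.size)

# Candidate search-term volumes: a few genuinely flu-related terms,
# a block of unrelated-but-seasonal terms, and purely random terms.
related = np.stack([flu + 0.2 * rng.normal(size=weeks.size) for _ in range(5)])
seasonal_only = np.stack([season + 0.3 * rng.normal(size=weeks.size)
                          for _ in range(20)])
random_terms = rng.normal(size=(75, weeks.size))
terms = np.vstack([related, seasonal_only, random_terms])

# Rank terms purely by correlation with flu over a "training" period.
train = slice(0, 104)
corr = np.array([np.corrcoef(t[train], flu[train])[0, 1] for t in terms])
top_terms = np.argsort(corr)[::-1][:10]
print("selected term indices:", top_terms)
# Many winners come from the unrelated-but-seasonal block (indices 5-24):
# they correlate with flu only because both peak in winter.
```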
Final Notes
Both of the aforementioned articles make a larger point that big data solutions, such as Google Flu Trends, are not substitutes for more traditional approaches like the CDC’s data collection, but rather supplements to them. Furthermore, the methods of collection for many sources of big data, such as Twitter or Google searches, are not held to the same standards as data collected for experimental analysis, and if the data itself is flawed, the algorithms that parse it will be flawed as well.