Co-Founder and Director of streamdrill (formerly TWIMPACT). PostDoc in machine learning at the Technische Universität Berlin, Germany.
sellthenews has a must-read, in-depth critique of the statistical analysis in the paper “Twitter mood predicts the stock market”. They uncover a number of methodological errors whose net effect is that the method appears to work much better than it would when applied to real-time data. In machine learning terms, the reported results are heavily biased by overfitting: the authors did not correct for multiple testing, and they reported test errors instead of errors on independent validation sets.
Probably the most amazing shortcoming of the paper is that the results were evaluated on only 15 days in December 2008. That is very little data, covering very few distinct market situations, to base an evaluation on. To see what this means, have a look at Figure 3 (page 4) in the paper. It shows the stock prices and the predictions, with areas highlighted where the authors claim to see significant correlation. Pardon my bluntness, but if I sampled any three random time series I would probably be able to identify the same pattern of correlation.
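To make the point about random time series concrete, here is a minimal simulation sketch. It does not use the paper's data or method: the function names, the 15-point length, and the 0.5 correlation threshold are assumptions chosen purely for illustration. It estimates how often two completely independent series look "correlated" over such a short window, once for random walks (a crude stand-in for price-like series) and once for plain white noise.

```python
import random


def random_walk(n, rng):
    """Cumulative sum of n independent Gaussian steps."""
    level, walk = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        walk.append(level)
    return walk


def white_noise(n, rng):
    """n independent Gaussian draws, with no memory between points."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spurious_rate(make_series, n_days=15, trials=2000, threshold=0.5, seed=0):
    """Fraction of independent pairs whose |correlation| exceeds threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = make_series(n_days, rng)
        b = make_series(n_days, rng)
        if abs(pearson(a, b)) > threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("random walks:", spurious_rate(random_walk))
    print("white noise: ", spurious_rate(white_noise))
```

Independent random walks should exceed the threshold far more often than independent white noise does, which is why a handful of visually matching stretches over 15 days is weak evidence on its own.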
So what can be done to get more robust results? As I’ve discussed already in a blog post, the key is to do proper validation on independent data, and also to test both on data where you believe there is something to be discovered and on data where there isn’t. Publishing your raw data to let others validate your results would help, too.
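A small sketch of why validation on independent data matters, under loudly stated assumptions: the "market" here is a sequence of fair coin flips, the "predictors" are also coin flips, and the names `best_of_many`, `n_select`, `n_holdout`, and `n_candidates` are hypothetical. Nothing in it is taken from the paper. It picks the best of many worthless predictors on a 15-day window (multiple testing without correction) and then scores that same predictor on data it was never selected on.

```python
import random


def accuracy(pred, truth):
    """Fraction of positions where prediction and truth agree."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def best_of_many(n_select=15, n_holdout=1000, n_candidates=200, seed=0):
    """Select the best random predictor on a short window, then re-test it.

    Returns (accuracy on the selection window, accuracy on held-out data).
    """
    rng = random.Random(seed)
    n = n_select + n_holdout

    # Hypothetical up/down market moves: pure coin flips, nothing to predict.
    truth = [rng.choice((0, 1)) for _ in range(n)]

    # Hypothetical candidate predictors: also pure coin flips.
    candidates = [
        [rng.choice((0, 1)) for _ in range(n)] for _ in range(n_candidates)
    ]

    # Multiple testing: keep whichever candidate looks best on the short window.
    best = max(
        candidates,
        key=lambda c: accuracy(c[:n_select], truth[:n_select]),
    )

    in_sample = accuracy(best[:n_select], truth[:n_select])
    held_out = accuracy(best[n_select:], truth[n_select:])
    return in_sample, held_out


if __name__ == "__main__":
    in_sample, held_out = best_of_many()
    print(f"selection window: {in_sample:.2f}, held-out: {held_out:.2f}")
```

The selected predictor looks impressive on the window it was chosen on and falls back to chance level on the held-out days, which is the gap that reporting only in-sample numbers hides.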
Real-time seems to be the next big thing in big data. MapReduce has shown how to perform big analyses on huge data sets in parallel, and the next challenge seems to be to find a similar kind of approach for real-time.
When you look around the web, there are two major approaches out there which try to build something that can scale to Twitter-firehose-scale amounts of data. One starts with a MapReduce framework like Hadoop and somehow finagles real-time, or at least streaming, capabilities onto it. The other starts with some event-driven “streaming” computing architecture and makes it scale on a cluster.
These are interesting and very cool projects. However, from our own experience with retweet analysis at TWIMPACT, I get the feeling that both approaches fall short of providing a definitive answer.
In short: One does not simply scale into real-time.