Stack Exchange data dump
This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. For complete schema information, see the included readme.txt.
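As a rough illustration of working with one of these XML files once it is extracted, here is a minimal Python sketch that streams records instead of loading the whole file into memory. It assumes each record is stored as a `row` element with its fields as XML attributes, and the attribute names (`Id`, `PostTypeId`, `Score`) and the `iter_rows` helper are illustrative only; the included readme.txt remains the authority on the actual schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict, one at a time.

    `source` is a file path or a binary file object. The element name
    `row` is an assumption about the dump layout, not a documented fact.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # drop the parsed element's contents to limit memory use


# Tiny in-memory stand-in for an extracted Posts.xml (made-up values).
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Score="5" />'
    b'<row Id="2" PostTypeId="2" Score="3" />'
    b'</posts>'
)

rows = list(iter_rows(sample))
for row in rows:
    print(row["Id"], row["Score"])
```

For a real archive, the same function could be pointed at the path of an extracted file instead of the in-memory sample.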
Medicare Claims data set
Available when you participate in the new Cloudera challenge.
The new Data Science Challenge: Detecting Anomalies in Medicare Claims will be available starting March 31, 2014. It costs $600 to participate. I guess they are worried about the data being resold or about other potential data leaks. They also want real practitioners (an issue on Kaggle competitions), as students are unlikely to fork out $600. But if you participate, you get a copy of Hadoop to install on your laptop; this copy emulates multi-node Hadoop.
Links to other data sets
- Source code for our Big Data keyword correlation API
- Great statistical analysis: forecasting meteorite hits
- Fast clustering algorithms for massive datasets
- 53.5 billion clicks dataset available for benchmarking and testing
- Over 5,000,000 financial, economic and social datasets*
- New pattern to predict stock prices, multiplies return by factor 5*
- 3.5 billion web pages*
- Another large data set - 250 million data points - available for do...*
- 125 Years of Public Health Data Available for Download*