Friday, November 02, 2007

New York Times uses Hadoop to digitize its archives

Neat use of Hadoop to quickly (24 hours!) churn through terabytes of data. Rather than dynamically generating a PDF for each available article (everything from 1851 to 1980), Derek Gottfrid uses Hadoop and Amazon's S3/EC2 to generate all the PDFs in just one day:

But thanks to the swell people at Amazon, I got access to a few more machines and churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3. (In fact, it worked so well that we ran it twice, since after we were done we noticed an error in the PDFs.)

Very nice, especially considering this is a pretty non-obvious use of the Hadoop framework. Hopefully all of the hadoopists - my god, I hope that term doesn't catch on - will trumpet this work.
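
For a rough sense of scale, here is a back-of-envelope sketch in Python using only the figures from the quote above. It assumes a full 24 hours (the run took slightly less), decimal terabytes, and that the whole 1.5TB is PDF output; the function and constant names are mine, not anything from Derek's code.

```python
# Figures taken from the quote; everything else is an assumption.
ARTICLES = 11_000_000
INSTANCES = 100
HOURS = 24               # "just under 24 hours", rounded up
OUTPUT_BYTES = 1.5e12    # 1.5 TB, assuming decimal terabytes


def throughput(articles, instances, hours):
    """Return (articles per instance-hour, articles per instance-second)."""
    per_hour = articles / (instances * hours)
    return per_hour, per_hour / 3600


def avg_pdf_kb(output_bytes, articles):
    """Average output size per article in kilobytes (1 KB = 1000 bytes)."""
    return output_bytes / articles / 1000


per_hour, per_sec = throughput(ARTICLES, INSTANCES, HOURS)
print(f"{per_hour:,.0f} articles per instance-hour")
print(f"{per_sec:.2f} articles per instance-second")
print(f"{avg_pdf_kb(OUTPUT_BYTES, ARTICLES):.0f} KB per PDF on average")
```

Under those assumptions each instance handled a bit more than one article per second, and the average PDF came out somewhere around 136 KB.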

1 snarky reply:

Mike said...

Wow, that rocks!
Thanks for posting this.
I hope we can convince UWB to promote distributed computation in more courses.