The project was to convert pre-1922 New York Times articles, stored as scanned TIFF images, into PDFs of the articles:
4 TB of data loaded to S3 (TIFF images)
+ Hadoop (+ Java Advanced Imaging and various glue)
+ 100 EC2 instances
+ 24 hours
= 11 million PDFs, 1.5 TB on S3
Unfortunately, the developer (Derek Gottfrid) did not say how much this cost the NYT. But here is my back-of-the-envelope calculation (using the Amazon S3/EC2 FAQ):
EC2: $0.10 per instance-hour x 100 instances x 24hrs = $240
S3: $0.15 per GB-month x 4500 GB x ~1.5/31 months (roughly a day and a half of storage) = ~$33
+ $0.10 per GB of data transferred in x 4000 GB = $400
+ $0.13 per GB of data transferred out x 1500 GB = $195
Total: ~$868
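For completeness, here is the same arithmetic as a tiny Java program, so the assumptions are easy to tweak. The prices are the 2007-2008 list prices quoted above, and the 1.5/31 storage factor (roughly a day and a half of a 31-day month) is my own assumption:

    // Back-of-the-envelope AWS cost estimate for the NYT conversion job.
    public class NytCostEstimate {
        public static void main(String[] args) {
            double ec2 = 0.10 * 100 * 24;                // $/instance-hour x instances x hours
            double s3Storage = 0.15 * 4500 * (1.5 / 31); // $/GB-month x GB x fraction of a month
            double transferIn = 0.10 * 4000;             // $/GB in x GB uploaded
            double transferOut = 0.13 * 1500;            // $/GB out x GB downloaded
            System.out.printf("EC2 $%.0f + S3 $%.0f + in $%.0f + out $%.0f = ~$%.0f%n",
                    ec2, s3Storage, transferIn, transferOut,
                    ec2 + s3Storage + transferIn + transferOut);
        }
    }

Running it prints EC2 $240 + S3 $33 + in $400 + out $195 = ~$868.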
I've known about MapReduce and Hadoop for quite a while now, but this is the first use outside of Google (MapReduce) and Yahoo! (Hadoop), combined with Amazon's services, where I've seen such a real problem solved so smoothly, and one that was neither web indexing nor a toy example.
As much of my work in information retrieval and knowledge discovery involves a great deal of disk space and even more CPU, I am looking forward to experimenting with this sort of environment (Hadoop, locally or in a service cloud) for some of the more extreme experiments I am working on. And by using Hadoop locally, if a problem gets too big for our local resources, we can always buy capacity, as in the NYT example, with a minimum of effort!
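To make the programming model concrete, here is a minimal sketch of what a job like the NYT conversion might look like in Hadoop's MapReduce API of that era (the org.apache.hadoop.mapred classes): a map-only job that reads a manifest of TIFF locations and converts each one. The TiffToPdfJob class and its convertTiffToPdf helper are hypothetical stand-ins of my own; Gottfrid did not publish the actual code, which used Java Advanced Imaging and various glue.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TiffToPdfJob {

        // Map-only task: each input line names one TIFF; the mapper converts
        // it and emits the location of the resulting PDF.
        public static class TiffToPdfMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String tiffPath = value.toString().trim();
                String pdfPath = convertTiffToPdf(tiffPath);
                output.collect(new Text(tiffPath), new Text(pdfPath));
            }

            // Hypothetical helper: the real job would fetch the TIFF, render
            // it (e.g. with Java Advanced Imaging), and write the PDF back.
            private String convertTiffToPdf(String tiffPath) throws IOException {
                return tiffPath.replaceAll("\\.tiff?$", ".pdf");
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(TiffToPdfJob.class);
            conf.setJobName("tiff-to-pdf");

            conf.setMapperClass(TiffToPdfMapper.class);
            conf.setNumReduceTasks(0);            // no reduce phase needed
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            // Input: a manifest of TIFF paths; output: a log of PDFs written.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

Setting the number of reduce tasks to zero is the key trick: the conversion is embarrassingly parallel, so each mapper's output is written straight out with no shuffle or reduce phase at all.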
This is also something that various commercial organizations (and even individuals?) with specific high-CPU, high-storage, or high-bandwidth compute needs should be considering, especially since transfers between S3 and EC2 are free. Of course, security and privacy concerns apply.
- Hadoop support for Amazon S3 [Hadoop Wiki]
- Godwin-Jones, Robert. 2008. Emerging Technologies of Elastic Clouds and Treebanks: New Opportunities for Content-Based and Data-Driven Language Learning. Language Learning & Technology, February 2008, Volume 12, Number 1.
- Ramakrishnan, Raghu. 2008. Web Data Management: Powering the New Web. Australasian Database Conference, University of Wollongong, January 22-25, 2008.
- White, Tom. 2007. Running Hadoop MapReduce on Amazon EC2 and Amazon S3. Amazon Web Services: Developer Connection.
- Grossman, Robert. 2007. Data Grids, Data Clouds and Data Webs: A Survey of High Performance and Distributed Data Mining. Hardware and software for large-scale biological computing in the next decade workshop, December 11-14, 2007, Okinawa, Japan.
- Saso, Steve. 2007. Scaling for the Participation Age. BCNet2007, Vancouver, April 17-18, 2007.
- Nicolaou, Alex. 2007. Deconstructing Google: Building Scalable Software using MapReduce. BCNet2007, Vancouver, April 17-18, 2007.