Hadoop + EC2 + S3 = Super alternatives for researchers (& real people too!)
I recently discovered, and have been inspired by, a real-world, non-trivial (in both space and time) application of Hadoop (the open-source implementation of Google's MapReduce) combined with the Amazon Simple Storage Service (Amazon S3) and the Amazon Elastic Compute Cloud (Amazon EC2).
The project was to convert pre-1922 New York Times articles-as-scanned-TIFF-images into PDFs of the articles:
Recipe:
4 TB of data loaded to S3 (TIFF images)
+ Hadoop (+ Java Advanced Imaging and various glue)
+ 100 EC2 instances
+ 24 hours
= 11 million PDFs, 1.5 TB on S3
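To make the recipe a little more concrete, here is a minimal sketch of what such a conversion map task might look like. This is not Derek Gottfrid's actual code: the input format (a text file of S3 keys, one TIFF per line), the bucket names, and the convertTiffToPdf() helper are assumptions standing in for the Java Advanced Imaging + PDF "glue" mentioned above.

```java
// Hedged sketch: each map() call converts one scanned article (a TIFF on S3) into a PDF.
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TiffToPdfMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private JobConf conf;

    public void configure(JobConf conf) {
        this.conf = conf;
    }

    public void map(LongWritable lineNo, Text tiffKey,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Input value is an S3 path such as s3://nyt-tiffs/1899/article-123.tif (hypothetical bucket names).
        Path in = new Path(tiffKey.toString());
        Path out = new Path(in.toString().replace("nyt-tiffs", "nyt-pdfs")
                                         .replace(".tif", ".pdf"));

        FSDataInputStream tiff = in.getFileSystem(conf).open(in);
        FSDataOutputStream pdf = out.getFileSystem(conf).create(out);
        try {
            convertTiffToPdf(tiff, pdf);   // hypothetical JAI + PDF-library glue
        } finally {
            tiff.close();
            pdf.close();
        }

        // Emit a record of where the PDF went, giving a manifest of outputs.
        output.collect(tiffKey, new Text(out.toString()));
        reporter.incrCounter("tiff2pdf", "converted", 1);
    }

    // Placeholder for the actual image-to-PDF conversion, which is not shown here.
    private void convertTiffToPdf(FSDataInputStream in, FSDataOutputStream out)
            throws IOException {
        throw new UnsupportedOperationException("plug in JAI and a PDF library here");
    }
}
```

A map-only job like this needs no reduce step at all; the 100 EC2 instances simply churn through the list of keys in parallel.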
Unfortunately, the developer (Derek Gottfrid) did not say how much this cost the NYT. But here is my back-of-the-envelope calculation (using the Amazon S3/EC2 FAQ):
EC2: $0.10 per instance-hour x 100 instances x 24 hours = $240
S3: $0.15 per GB-month x 4500 GB x ~1.5/31 of a month (roughly 1.5 days of storage) = ~$33
+ $0.10 per GB of data transferred in x 4000 GB = $400
+ $0.13 per GB of data transferred out x 1500 GB = $195
Total: ~$868

Not unreasonable at all! Of course, this does not include the NYT's own bandwidth costs for uploading and downloading the data.
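For what it's worth, the whole estimate fits in a few lines of code if you want to re-run the arithmetic with your own numbers. The prices are the 2007-era figures quoted above; the storage duration and transfer volumes are my guesses, not NYT-confirmed figures.

```java
// Back-of-the-envelope check of the estimate above (my assumed volumes and 2007-era prices).
public class NytCloudCostEstimate {
    public static void main(String[] args) {
        double ec2 = 0.10 * 100 * 24;               // $/instance-hour x 100 instances x 24 hours = $240
        double s3Storage = 0.15 * 4500 * 1.5 / 31;  // $/GB-month x 4500 GB x ~1.5 days of a 31-day month ~= $33
        double transferIn = 0.10 * 4000;            // $/GB in x 4000 GB uploaded = $400
        double transferOut = 0.13 * 1500;           // $/GB out x 1500 GB downloaded = $195
        System.out.printf("Total: ~$%.0f%n", ec2 + s3Storage + transferIn + transferOut);
        // Prints: Total: ~$868
    }
}
```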
I've known about MapReduce and Hadoop for quite a while now, but this is the first use outside of Google (MapReduce) and Yahoo! (Hadoop), combined with Amazon services, where I've seen such a real problem solved so smoothly, and where the problem wasn't web indexing or a toy example.
As much of my work in information retrieval and knowledge discovery involves a great deal of space and even more CPU, I am looking forward to experimenting with this sort of environment (Hadoop, local or in a service cloud) for some of the more extreme experiments I am working on. And by using Hadoop locally, if the problem gets too big for our local resources, we can always buy capacity, as in the NYT example, with a minimum of effort!
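On the "start local, buy capacity later" point: with Hadoop, switching from a local cluster to EC2 + S3 is largely a matter of where the job's input and output paths point. Below is a hedged driver sketch that reuses the hypothetical TiffToPdfMapper above; the fs.s3.* property names follow the Hadoop S3 wiki page listed in the resources and may differ between Hadoop versions, so treat them as illustrative.

```java
// Hedged sketch of a job driver that runs unchanged against local HDFS or S3.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TiffToPdfJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TiffToPdfJob.class);
        conf.setJobName("tiff-to-pdf");
        conf.setMapperClass(TiffToPdfMapper.class);
        conf.setNumReduceTasks(0);          // map-only job: no reduce phase needed
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // AWS credentials are only needed when the paths are s3:// URIs.
        String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
        String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
        if (accessKey != null && secretKey != null) {
            conf.set("fs.s3.awsAccessKeyId", accessKey);
            conf.set("fs.s3.awsSecretAccessKey", secretKey);
        }

        // Local run:  hadoop jar tiff2pdf.jar TiffToPdfJob hdfs://localhost:9000/keys.txt hdfs://localhost:9000/out
        // Cloud run:  hadoop jar tiff2pdf.jar TiffToPdfJob s3://nyt-tiffs/keys.txt s3://nyt-pdfs/out
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```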
Various commercial organizations (and even individuals?) with specific high-CPU / high-storage / high-bandwidth compute needs (and note that transfers between S3 and EC2 are free) should also be considering this approach. Of course, security and privacy concerns apply.
Additional resources:
- Hadoop support for Amazon S3 [Hadoop Wiki]
- Godwin-Jones, Robert. 2008. Emerging Technologies of Elastic Clouds and Treebanks: New Opportunities for Content-Based and Data-Driven Language Learning. Language Learning & Technology, Volume 12, Number 1, February 2008.
- Ramakrishnan, Raghu. 2008. Web Data Management: Powering the New Web. Australasian Database Conference, University of Wollongong, January 22-25, 2008.
- White, Tom. 2007. Running Hadoop MapReduce on Amazon EC2 and Amazon S3. Amazon Web Services: Developer Connection.
- Grossman, Robert. 2007. Data Grids, Data Clouds and Data Webs: A Survey of High Performance and Distributed Data Mining. Hardware and Software for Large-Scale Biological Computing in the Next Decade workshop, December 11-14, 2007, Okinawa, Japan.
- Saso, Steve. 2007. Scaling for the Participation Age. BCNet2007, Vancouver, April 17-18, 2007.
- Nicolaou, Alex. 2007. Deconstructing Google: Building Scalable Software using MapReduce. BCNet2007, Vancouver, April 17-18, 2007.
Comments
Still, I'm a huge fan of all things AWS, and seeing it used for non-trivial, commercial projects is a really great sign.
So while the infrastructure costs are approaching zero (and the cost of ownership in this case IS zero), the people costs are not.
- RL