Hadoop + EC2 + S3 = Super alternatives for researchers (& real people too!)

I recently discovered, and have been inspired by, a real-world and non-trivial (in both space and time) application of Hadoop (an open source implementation of Google's MapReduce) combined with the Amazon Simple Storage Service (Amazon S3) and the Amazon Elastic Compute Cloud (Amazon EC2).

The project was to convert pre-1922 New York Times articles-as-scanned-TIFF-images into PDFs of the articles:
Recipe:
4 TB of data loaded to S3 (TIFF images)
+ Hadoop (+ Java Advanced Imaging and various glue)
+ 100 EC2 instances
+ 24 hours
= 11 million PDFs, 1.5 TB on S3
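
Gottfrid didn't publish the job's source, but to make the "Hadoop + JAI + glue" line of the recipe concrete, here is a minimal sketch of what a single map task might look like, using Hadoop's (old) mapred API and JAI's "fileload" operator. It assumes each input record is the local path of a TIFF already fetched from S3; convertToPdf() is a hypothetical placeholder for the image-to-PDF glue and is not part of Hadoop or JAI:

import java.io.IOException;
import java.awt.image.RenderedImage;

import javax.media.jai.JAI;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// One map task per scanned page: load the TIFF with JAI, hand it to the
// PDF glue, and emit (tiff path, pdf path) so progress can be tracked.
public class TiffToPdfMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Each input line is assumed to be the path to one TIFF scan,
        // already copied down from S3 to the instance's local disk.
        String tiffPath = value.toString().trim();

        // Java Advanced Imaging: "fileload" reads the image from disk.
        RenderedImage scan = JAI.create("fileload", tiffPath);

        // Hypothetical helper standing in for the "various glue":
        // render the scan into a PDF and return where it was written.
        String pdfPath = convertToPdf(scan, tiffPath);

        output.collect(new Text(tiffPath), new Text(pdfPath));
        reporter.incrCounter("TiffToPdf", "converted", 1);
    }

    private String convertToPdf(RenderedImage scan, String source) {
        // The PDF-rendering glue (an open source PDF library plus the
        // article-region logic) goes here; it is omitted in this sketch.
        throw new UnsupportedOperationException("PDF glue not shown");
    }
}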

Unfortunately, the developer (Derek Gottfrid) did not say how much this cost the NYT. But here is my back-of-the-envelope calculation (using the Amazon S3/EC2 FAQ):
EC2: $0.10 per instance-hour x 100 instances x 24hrs = $240
S3: $0.15 per GB-Month x 4500 GB x ~1.5/31 of a month (~1.5 days of storage) = ~$33
+ $0.10 per GB of data transferred in x 4000 GB = $400
+ $0.13 per GB of data transferred out x 1500 GB = $195
Total = ~$868
Not unreasonable at all! Of course this does not include the cost of bandwidth that the NYT needed to upload/download their data.
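
For anyone who wants to adjust the assumptions, here is the same back-of-the-envelope arithmetic as a trivial snippet (prices are the 2007 list prices quoted above; the storage term assumes the ~4.5 TB sits on S3 for roughly 1.5 days of a 31-day month):

public class NytJobCostEstimate {
    public static void main(String[] args) {
        // EC2: $0.10 per instance-hour, 100 instances, 24 hours
        double ec2 = 0.10 * 100 * 24;                 // $240

        // S3 storage: $0.15 per GB-month, ~4500 GB held for ~1.5 of 31 days
        double s3Storage = 0.15 * 4500 * (1.5 / 31);  // ~$33

        // Data transfer in: $0.10 per GB, 4000 GB of TIFFs uploaded
        double transferIn = 0.10 * 4000;              // $400

        // Data transfer out: $0.13 per GB, 1500 GB of PDFs downloaded
        double transferOut = 0.13 * 1500;             // $195

        double total = ec2 + s3Storage + transferIn + transferOut;
        System.out.printf("Estimated total: ~$%.0f%n", total);  // ~$868
    }
}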

I've known about MapReduce and Hadoop for quite a while now, but this is the first time, outside of Google (MapReduce) and Yahoo (Hadoop), that I've seen them combined with Amazon's services to solve such a real problem so smoothly, and one that wasn't web indexing or a toy example.

As much of my work in information retrieval and knowledge discovery involves a great deal of space and even more CPU, I am looking forward to experimenting with this sort of environment (Hadoop, local or in a service cloud) for some of the more extreme experiments I am working on. And by using Hadoop locally, if the problem gets too big for our local resources, we can always buy capacity, as in the NYT example, with a minimum of effort!

Various commercial organizations (and even individuals?) with specific high-CPU / high-storage / high-bandwidth compute needs should also be considering this solution (oh, and transfers between S3 and EC2 are free). Of course, security and privacy concerns apply.


Comments

Anonymous said…
I would imagine that the time required to code and test the solution greatly exceeds the running time (and cost) of the actual execution. It's cheap to run stuff on EC2/S3, but you still have to write the code.

Still, I'm a huge fan of all things AWS, and seeing it used for non-trivial, commercial projects is a really great sign.
Glen Newton said…
Yes, there is at present no generalizable way of cutting the programming (and programmer) out of the system! :-)

So while the infrastructure costs are approaching zero (and the cost of ownership in this case IS zero), the people costs are not.
Anonymous said…
This article made me weep. WOW.

- RL
