Hadoop + EC2 + S3 = Super alternatives for researchers (& real people too!)

I recently discovered and have been inspired by a real-world and non-trivial (in space and in time) application of Hadoop (Open Source implementation of Google's MapReduce) combined with the Amazon Simple Storage Service (Amazon S3) and the Amazon Elastic Compute Cloud (Amazon EC2).

The project was to convert pre-1922 New York Times articles-as-scanned-TIFF-images into PDFs of the articles:
Recipe:
4 TB of data loaded to S3 (TIFF images)
+ Hadoop (+ Java Advanced Imaging and various glue)
+ 100 EC2 instances
+ 24 hours
= 11 million PDFs, 1.5 TB on S3

Unfortunately, the developer (Derek Gottfrid) did not say how much this cost the NYT. But here is my back-of-the-envelope calculation (using the Amazon S3/EC2 FAQ):
EC2: $0.10 per instance-hour x 100 instances x 24hrs = $240
S3: $0.15 per GB-Month x 4500 GB x ~1.5/31 months = ~$33
+ $0.10 per GB of data transferred in x 4000 GB = $400
+ $0.13 per GB of data transferred out x 1500 GB = $195
Total: = ~$868
Not unreasonable at all! Of course this does not include the cost of bandwidth that the NYT needed to upload/download their data.

I've known about the MapReduce and Hadoop for quite a while now, but this is the first use outside of Google (MapReduce) and Yahoo (Hadoop) and combined with Amazon services that I've such a real problem solved so smoothly and also wasn't web indexing or toy examples.

As much of my work in information retrieval and knowledge discovery involves a great deal of space and even more CPU, I am looking forward to experimenting with this sort of environment (Hadoop, local or in a service cloud) for some of the more extreme experiments I am working on. And by using Hadoop locally, if the problem gets to big for our local resources, we can always buy capacity like the NYT example with a minimum of effort!

This is also something that various commercial organizations (and even individuals?) with specific high CPU / high storage / high bandwidth (oh, transfers between S3 and EC2 are free) compute needs should be considering this solution. Of course security and privacy concerns apply.

Additional resources:

Comments

Anonymous said…
I would imagine that the time required to code and test the solution greatly exceeds the running time (and cost) of the actual execution. It's cheap to run stuff on EC2/S3, but you still have to write the code.

Still, I'm a huge fan of all things AWS, and seeing it used for non-trivial, commercial projects is a really great sign.
Glen Newton said…
Yes, there is at present no generalizable way of cutting the programming (and programmer) out of the system! :-)

So while the infrastructure costs are approaching zero (and the cost of ownership in this case IS zero), the people costs are not.
Anonymous said…
This article made me weep. WOW.

- RL

Popular posts from this blog

Java, MySql increased performance with Huge Pages

IBM on Linux: "Lean, clean, and green"

Lucene indexing performance benchmarks for journal article metadata and full-text