Wednesday, February 27, 2008

Hadoop + EC2 + S3 = Super alternatives for researchers (& real people too!)

I recently discovered and have been inspired by a real-world and non-trivial (in space and in time) application of Hadoop (Open Source implementation of Google's MapReduce) combined with the Amazon Simple Storage Service (Amazon S3) and the Amazon Elastic Compute Cloud (Amazon EC2).

The project was to convert pre-1922 New York Times articles-as-scanned-TIFF-images into PDFs of the articles:
4 TB of data loaded to S3 (TIFF images)
+ Hadoop (+ Java Advanced Imaging and various glue)
+ 100 EC2 instances
+ 24 hours
= 11 million PDFs, 1.5 TB on S3

Unfortunately, the developer (Derek Gottfrid) did not say how much this cost the NYT. But here is my back-of-the-envelope calculation (using the Amazon S3/EC2 FAQ):
EC2: $0.10 per instance-hour x 100 instances x 24hrs = $240
S3: $0.15 per GB-Month x 4500 GB x ~1.5/31 months = ~$33
+ $0.10 per GB of data transferred in x 4000 GB = $400
+ $0.13 per GB of data transferred out x 1500 GB = $195
Total: = ~$868
Not unreasonable at all! Of course this does not include the cost of bandwidth that the NYT needed to upload/download their data.
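
The back-of-the-envelope arithmetic above can be sketched as a small Python function. The prices are the per-unit figures quoted in this post (not current Amazon rates), and the function and parameter names are made up for illustration:

```python
# Per-unit prices as quoted in the post (assumed inputs, not current rates).
EC2_PER_INSTANCE_HOUR = 0.10   # USD per instance-hour
S3_PER_GB_MONTH = 0.15         # USD per GB-month of storage
TRANSFER_IN_PER_GB = 0.10      # USD per GB transferred in
TRANSFER_OUT_PER_GB = 0.13     # USD per GB transferred out


def estimate_cost(instances, hours, stored_gb, storage_days,
                  gb_in, gb_out, days_in_month=31):
    """Return a breakdown of the estimated job cost in USD."""
    ec2 = EC2_PER_INSTANCE_HOUR * instances * hours
    # Storage is billed per GB-month, so scale by the fraction of a month used.
    s3_storage = S3_PER_GB_MONTH * stored_gb * storage_days / days_in_month
    transfer_in = TRANSFER_IN_PER_GB * gb_in
    transfer_out = TRANSFER_OUT_PER_GB * gb_out
    return {
        "ec2": ec2,
        "s3_storage": s3_storage,
        "transfer_in": transfer_in,
        "transfer_out": transfer_out,
        "total": ec2 + s3_storage + transfer_in + transfer_out,
    }


# The NYT figures used in the estimate above.
nyt = estimate_cost(instances=100, hours=24, stored_gb=4500,
                    storage_days=1.5, gb_in=4000, gb_out=1500)
for item, usd in nyt.items():
    print(f"{item:>12}: ${usd:,.2f}")
```

Changing the instance count or the number of storage days makes it easy to see which term dominates; with these figures, data transfer costs more than the compute itself.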

I've known about MapReduce and Hadoop for quite a while now, but this is the first use outside of Google (MapReduce) and Yahoo (Hadoop), and combined with Amazon services, in which I've seen such a real problem solved so smoothly, and one that wasn't web indexing or a toy example.

As much of my work in information retrieval and knowledge discovery involves a great deal of space and even more CPU, I am looking forward to experimenting with this sort of environment (Hadoop, local or in a service cloud) for some of the more extreme experiments I am working on. And by using Hadoop locally, if the problem gets too big for our local resources, we can always buy capacity, as in the NYT example, with a minimum of effort!

This is also a solution that various commercial organizations (and even individuals?) with specific high CPU / high storage / high bandwidth compute needs should be considering (oh, and transfers between S3 and EC2 are free). Of course security and privacy concerns apply.

NSF joins Google/IBM (U.S.-only?) Research Cluster Initiative

When Google and IBM last October announced their Internet-Scale Computing Initiative - which looked to dedicate a cluster of 1600 computers for the use of researchers, for free - it was not clear (to me) whether this was a U.S.-only initiative, or was also available (or would eventually become available) to non-U.S. researchers:
The University of Washington was the first to join the initiative. A small number of universities will also pilot the program, including Carnegie Mellon University, Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley and the University of Maryland. In the future, the program will be expanded to include additional researchers, educators and scientists.

Now with the NSF's announcement that they are partnering with Google and IBM in this initiative, in what they are calling the Cluster Exploratory (CluE), it is even less clear (or maybe more clear that it is only available to U.S. researchers??), with the NSF responsible for selecting who can use the resource:
"NSF will then select the researchers to have access to the cluster and provide support to the researchers to conduct their work."
This initiative is built using Apache Hadoop (primarily a Yahoo project), which includes Open Source implementations of Google's MapReduce and GFS. With more supercomputing / cloud computing resources going commodity, more researchers will be altering their compute job implementations to be more MapReduce-friendly.
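
To give a rough idea of what "MapReduce-friendly" means, here is a minimal single-process sketch in Python. It is not Hadoop's API; all of the function names are invented for illustration. The point is only that a job is expressed as a mapper that emits key/value pairs and a reducer that combines all the values for one key, which is what lets a framework spread the work over many machines:

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key (a framework does this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run the three phases in order, in one process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Example job: word count, the usual toy example.
def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    return sum(counts)


counts = run_job(["the map the reduce", "the cluster"], word_mapper, count_reducer)
print(counts)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, neither needs shared state, and that independence is the property a compute job has to be restructured around.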

Related: Yahoo's Doug Cutting on MapReduce and the Future of Hadoop

Saturday, February 23, 2008

Openness in the library (technology)

Inside Higher Ed (Feb 19) has an article on some of the changes afoot in library technology: Open Minds, Open Books, Open Source.
"Last month, a survey by Marshall Breeding, director for innovative technologies and research at Vanderbilt University’s library, revealed a measure of discord over the options available to librarians for automating their electronic catalogs and databases, software called integrated library systems.....So librarians aren’t exactly reaching for their torches and pitchforks. Still, some libraries, fed up with software that doesn’t fully meet their needs, have decided to take matters, figuratively, into their own hands. With a bit of grant money and some eager developers, institutions have begun creating their own open-source solutions that are fully customizable, free for others to use and compatible with existing systems. The result has been a whole crop of projects that, when combined, could serve as a fully integrated, end-to-end open-source solution for academic libraries, covering the interface, search mechanism, database system, citations and even course management."

Thursday, February 21, 2008

Pervasive, transparent search and inferencing services

This amazing mobile device mock-up (I'd love one for my birthday tomorrow!) is described by Hard Geek as having "advanced search function". Is this how I would describe it, or how the average user would describe it? Rather, by the time this level of hardware technology is available, the concept of "search" will (should?) have disappeared (to the user at least), and devices should instead have a seamless understanding of the world around them, including an intimate semantic understanding of their user's short- and long-term goals.

No (or very very few) explicit search boxes; instead, they will be extremely context aware, where context includes: geography, orientation, weather, user history, user voice conversation, user goal(s), interactions with other users' similar (trusted and untrusted) devices, specific user inquiries etc.

Devices such as this one would be giant (transparent for the most part) mashups, deriving their suggestions and answers from a potentially huge number of data sources and search and inferencing services. Yes, inferencing services. I believe that there will soon be inferencing services which will be able to take large complex semantic networks and inference over them -- themselves drawing on data, search and inferencing services -- to render complex, explainable answers to users' situations and inquiries.

Related: Microsoft Live Lab's Photosynth Project: BBC, Wikipedia, Photosynth on PBS

Update 2008 03 04: The "identify-what-I-am-looking-at" technology needed for this mock-up can be seen at least partially demonstrated in "Cyber Goggles: High-tech memory aid", if perhaps not as elegantly or simply...

Thursday, February 14, 2008


In his blog (An Antic Disposition: Punct Contrapunct, which has a great sub-title: "...thinking the unthinkable, pondering the imponderable, effing the ineffable and scruting the inscrutable"), Rob Weir has a very good discussion about some recent FUD (Burton Group's "What's up, .DOC?") around OOXML and the upcoming JTC1 ballot.

This very shilly document has subtle digs even in its abstract ("The competitive stakes are huge, and the related political posturing is sometimes perplexing.[emphasis added]").

Some points:
  • "On one hand, government agencies and other organizations seeking to use a free, non-Microsoft productivity suite will be happy to use ODF, the file format behind": No. Governments (and other organizations) are not looking for free non-Microsoft software; they are looking for (true) open standards for document formats that will allow them to properly manage, distribute and archive their documents without having to worry if they can open a document created five years ago or to purchase software from more than a single vendor.
  • "OOXML is an extensible standard. It allows vendors and enterprises to extend the standard within an OOXML-defined framework... This built-in ability to augment the OOXML standard is a safety valve for future innovation, allowing new features to be added without forcing vendors to invent yet another separate file format or wait for standards bodies to give their approval. While such extensions initially decrease interoperability, it's Burton Group's belief that this issue will resolve itself over time, as popular extensions are adopted by other vendors or eventually move into the baseline specification." This is an opaque way of saying that to do more interesting things, you have to use another Microsoft proprietary format.
  • They do not deign to address any of the serious issues around the viability of OOXML as an open standard discussed extensively elsewhere. Actually, that is not entirely true: they did mention something about the ISO process and the "several thousand suggestions" concerning OOXML. Suggestions. How quaint.
  • "The debate and scrutiny are not surprising, given Microsoft's historical track record as an extremely aggressive competitor and convicted monopolist, but it's important to understand that Microsoft appears to be sincerely committed to making OOXML a substantive standard...". Sorry, all these words in the same sentence make my head explode.
  • "Broad recognition of OOXML as a legitimate (real and de facto) standard..." So it's not an open standard anymore?

In general, they damn ODF with faint praise, repeatedly pointing out its limitations: "ODF represents laudable design and standards work. It's a clean and useful design, but it's appropriate mostly for relatively unusual scenarios in which full Microsoft Office file format fidelity isn't a requirement".

Wednesday, February 13, 2008

Ranting about maps

Both as a Canadian who has enjoyed the rants of Rick Mercer and as someone who spent a lot of time working with geographers at the Atlas of Canada (and even enjoying some of it!), I am appreciating the convergence of these two things as embodied in the self-described rant by Martin Dodge and Chris Perkins (Reclaiming the map: British Geography and ambivalent cartographic practice).

In their rant, the Royal Geographical Society ("the heart of geography") is roasted for using a ("Mc-Map") Google Maps location map on the back of the RGS-IBG 2007 Annual International Conference programme.

While they are concerned with this, with the apparent decline of mapping in geography (Geographers Don’t Map Anymore Do They?) and with the decline of the sub-discipline of cartography (Cartographers: Who Needs Them Anymore?), I think their conclusion is valid: mapping is actually easier and possibly better, and is making a popular and significant resurgence (Mapping Reinvigorated?) through mashups (including Google Maps), Open Source tools, open geographic data sets, etc. Unfortunately, as they point out, this amateur mapping is happening without the participation or even notice of (in this case, British) academics.

Sunday, February 10, 2008

2nd European Conference on Scientific Publishing in Biomedicine and Medicine Programme Announced

The 2nd European Conference on Scientific Publishing in Biomedicine and Medicine (Oslo, Sept 4 - 6, 2008) has announced its programme:
  • How to get to universal Open Access and why we want to get there.
  • The progress of science: what hinders and what helps in an OA environment?
  • Funding to authors for OA publishing.
  • University Central Funds for Open Access.
  • Licensing & Copyright: major issues for some.
  • Making the repository a researcher's resource.
  • Open access and the commercial biomedical publishers.
  • Science Publishing: the future of journal publishing.
  • Evaluation of Scientific Research by Advanced Quantitative Methods Beyond Impact Factors.
  • The use and misuse of bibliometric indices in evaluating scholarly performance.
  • What linking in CrossRef can tell us about research.

Special Issue: Cyberinfrastructure, scholarship and scholarly communication

There is a special issue (Winter 2008) of Journal of Electronic Publishing entitled "Special Issue on Communications, Scholarly Communications and the Advanced Research Infrastructure". It examines, from a number of different perspectives and interests, the intersection of cyberinfrastructure, scholarship and scholarly communication, and how they are affecting -- and will affect -- scholarly activities.