Posts

Showing posts from February, 2008

Hadoop + EC2 + S3 = Super alternatives for researchers (& real people too!)

I recently discovered, and have been inspired by, a real-world and non-trivial (in space and in time) application of Hadoop (the open-source implementation of Google's MapReduce) combined with the Amazon Simple Storage Service (Amazon S3) and the Amazon Elastic Compute Cloud (Amazon EC2). The project was to convert pre-1922 New York Times articles, scanned as TIFF images, into PDFs of the articles.

Recipe: 4 TB of data loaded to S3 (TIFF images) + Hadoop (+ Java Advanced Imaging and various glue) + 100 EC2 instances + 24 hours = 11 million PDFs, 1.5 TB on S3.

Unfortunately, the developer (Derek Gottfrid) did not say how much this cost the NYT. But here is my back-of-the-envelope calculation (using the Amazon S3/EC2 FAQ):

EC2: $0.10 per instance-hour x 100 instances x 24 hours = $240
S3: $0.15 per GB-month x 4500 GB x ~1.5/31 months = ~$33
+ $0.10 per GB transferred in x 4000 GB = $400
+ $0.13 per GB transferred out x 1500 GB = $195
Total: ~$868

Not unreasonable...
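As a sanity check on the arithmetic, here is a minimal Python sketch of the same estimate. The prices are the 2008 AWS rates quoted above (not current ones), and the instance count and transfer volumes are the estimates from the post:

```python
# Back-of-the-envelope cost estimate for the NYT TIFF-to-PDF job,
# using the 2008 AWS prices and the volumes quoted above.

EC2_HOURLY = 0.10   # $ per instance-hour
S3_STORAGE = 0.15   # $ per GB-month
S3_IN      = 0.10   # $ per GB transferred in
S3_OUT     = 0.13   # $ per GB transferred out

ec2      = EC2_HOURLY * 100 * 24           # 100 instances for 24 hours
storage  = S3_STORAGE * 4500 * (1.5 / 31)  # ~4.5 TB stored for ~1.5 days
xfer_in  = S3_IN * 4000                    # ~4 TB of TIFFs uploaded
xfer_out = S3_OUT * 1500                   # ~1.5 TB of PDFs downloaded

total = ec2 + storage + xfer_in + xfer_out
print(f"EC2 ${ec2:.0f} + storage ${storage:.0f} + in ${xfer_in:.0f} "
      f"+ out ${xfer_out:.0f} = ~${total:.0f}")  # ~ $868
```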

NSF joins Google/IBM (U.S.-only?) Research Cluster Initiative

When Google and IBM announced their Internet-Scale Computing Initiative last October - which looked to dedicate a cluster of 1600 computers to the use of researchers, for free - it was not clear (to me) whether this was a U.S.-only initiative, or whether it was also available (or would eventually become available) to non-U.S. researchers: "The University of Washington was the first to join the initiative. A small number of universities will also pilot the program, including Carnegie Mellon University, Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley and the University of Maryland. In the future, the program will be expanded to include additional researchers, educators and scientists." Now, with the NSF's announcement that it is partnering with Google and IBM in this initiative, in what they are calling the Cluster Exploratory (CluE), it is even less clear (or maybe more clear that it is only available to U.S. researchers??), with the NSF re...

Openness in the library (technology)

Inside Higher Ed (Feb 19) has an article on some of the changes afoot in library technology: Open Minds, Open Books, Open Source. "Last month, a survey by Marshall Breeding, director for innovative technologies and research at Vanderbilt University’s library, revealed a measure of discord over the options available to librarians for automating their electronic catalogs and databases, software called integrated library systems. ... So librarians aren’t exactly reaching for their torches and pitchforks. Still, some libraries, fed up with software that doesn’t fully meet their needs, have decided to take matters, figuratively, into their own hands. With a bit of grant money and some eager developers, institutions have begun creating their own open-source solutions that are fully customizable, free for others to use and compatible with existing systems. The result has been a whole crop of projects that, when combined, could serve as a fully integrated, end-to-end open-source solution...

Pervasive, transparent search and inferencing services

This amazing mobile device mock-up (I'd love one for my birthday tomorrow!) is described by Hard Geek as having an "advanced search function". Is this how I would describe it, or how the average user would describe it? Rather, by the time this level of hardware technology is available, the concept of "search" will (should?) have disappeared (to the user, at least), and devices will instead have a seamless understanding of the world around them, including an intimate semantic understanding of their user's short- and long-term goals. No (or very, very few) explicit search boxes; instead, they will be extremely context aware, where context includes geography, orientation, weather, user history, user voice conversation, user goal(s), interactions with other users' similar (trusted and untrusted) devices, specific user inquiries, etc. Devices such as this one would be giant (transparent, for the most part) mashups, deriving their suggestions and answers fr...

OOXML / ODF FUD

In his blog (An Antic Disposition: Punct Contrapunct, which has a great sub-title: "...thinking the unthinkable, pondering the imponderable, effing the ineffable and scruting the inscrutable"), Rob Weir has a very good discussion of some recent FUD (the Burton Group's "What's up, .DOC?") around OOXML and the upcoming JTC1 ballot. This very shilly document has subtle digs even in its abstract ("The competitive stakes are huge, and the related political posturing is sometimes perplexing." [emphasis added]). Some points: "On one hand, government agencies and other organizations seeking to use a free, non-Microsoft productivity suite will be happy to use ODF, the file format behind OpenOffice.org": No. Governments (and other organizations) are not looking for free non-Microsoft software; they are looking for (true) open standards for document formats that will allow them to properly manage, distribute and archive their documents...

Ranting about maps

Both as a Canadian who has enjoyed the rants of Rick Mercer and as someone who spent a lot of time working with geographers at the Atlas of Canada (and even enjoyed some of it!), I am appreciating the convergence of these two things as embodied in the self-described rant by Martin Dodge and Chris Perkins (Reclaiming the map: British Geography and ambivalent cartographic practice). In their rant, the Royal Geographical Society ("the heart of geography") is roasted for using a ("Mc-Map") Google Maps location map on the back of the RGS-IBG 2007 Annual International Conference programme. While they are concerned with this and with the apparent decline of mapping in geography (Geographers Don’t Map Anymore Do They?) and of the sub-discipline of cartography (Cartographers: Who Needs Them Anymore?), I think their conclusion that mapping is actually easier, and possibly better, and making a popular and significant resurgence (Mapping Reinvigorated?) through mashups (including...

2nd European Conference on Scientific Publishing in Biomedicine and Medicine Programme Announced

The 2nd European Conference on Scientific Publishing in Biomedicine and Medicine (Oslo, Sept 4 - 6, 2008) has announced its programme:

- How to get to universal Open Access and why we want to get there
- The progress of science: what hinders and what helps in an OA environment?
- Funding to authors for OA publishing
- University Central Funds for Open Access
- Licensing & Copyright: major issues for some
- Making the repository a researcher's resource
- Open access and the commercial biomedical publishers
- Science Publishing: the future of journal publishing
- Evaluation of Scientific Research by Advanced Quantitative Methods Beyond Impact Factors
- The use and misuse of bibliometric indices in evaluating scholarly performance
- What linking in CrossRef can tell us about research

Special Issue: Cyberinfrastructure, scholarship and scholarly communication

There is a special issue (Winter 2008) of the Journal of Electronic Publishing entitled "Special Issue on Communications, Scholarly Communications and the Advanced Research Infrastructure". It examines, from a number of different perspectives and interests, the intersection of cyberinfrastructure, scholarship and scholarly communication, and how they are affecting -- and will affect -- scholarly activities.

- Overview: Editor's Note. Judith Axler Turner
- Cyberscholarship: High Performance Computing Meets Digital Libraries. William Y. Arms
- When Authorship Isn't Enough: Lessons from CERN on the Implications of Formal and Informal Credit Attribution Mechanisms in Collaborative Research. Jeremy Birnholtz
- The Virtual Observatory Meets the Library. G. Sayeed Choudhury
- Triple Helix: Cyberinfrastructure, Scholarly Communication, and Trust. Amy Friedlander
- Talk About Talking About New Models of Scholarly Communication. Karla L. Hahn
- Can Universities Dream of Electric Sheep?...