Tuesday, April 22, 2008

"Science 2.0 -- Is Open Access Science the Future?"

"Is posting raw results online, for all to see, a great tool or a great risk?"
Scientific American article. Read it. Nuff said.

Media in Motion Symposium

McGill University's Documentation and Conservation of the Media Arts Heritage (DOCAM) Research Alliance and Media@McGill have a call for papers out for "Media in Motion: The Challenge of Preservation in the Digital Age", the Annual International DOCAM Summit, to be held October 29, 2008 at McGill University. The topics include but are not limited to:
  • Archival Practices
  • Challenges of Audio, Film, Video, and Digital Media Preservation
  • Cultural Influences, Impacts, and Considerations
  • Cultural Property Law
  • Digital Preservation and Cultural Memory
  • Digitization of the Humanities
  • Effects on Artistic Practices
  • Ethical, Social, and Philosophical Concerns
  • Preservation Strategies and Techniques
  • Future Trends and Directions
I found this information on the DIGLIB mailing list, but I can't find the CFP on the DOCAM site, so here is the link to the mailing list announcement.

Wednesday, April 16, 2008

"Wikipedia for Data"

Bret Taylor has a refreshing and rather simple (powerful) idea: Wikipedia for data. All kinds of data. From what I would consider scientific data, to census data to geographic data to TV listings to stock data to CD cover and track data to .....whatever. Cut out the lawyers and the licensing. Open Data for all to use and experiment with.

Coming from a country where you have to buy census data (!!!), this is a wonderful idea. Spread the meme.

"Libraries in the Converging Worlds of Open Data, E-Research, and Web 2.0"

This looks like an interesting article (Libraries in the Converging Worlds of Open Data, E-Research, and Web 2.0, Stuart MacDonald, March/April issue of ONLINE magazine) but I don't have a subscription so I can't really comment much on it. Ironic that the abstract mentions Peter Suber's Open Access News blog.... ;-)

Abstract: "The new forms of research enabled by the latest technologies bring about collaboration among researchers in different locations, institutions, and even disciplines. These new collaborations have two key features -- the prodigious use and production of data. This data-centric research manifests itself in such concepts as e-science, cyberinfrastructure, or e-research. Over the last decade there has been much discussion about the merits of open standards, open source software, open access to scholarly publications, and most recently open data. There are a range of authoritative weblogs that address the open movement, some of which include: 1. DCC's Digital Curation Blog, 2. Peter Suber's Open Access News, and 3. Open Knowledge Foundation Weblog. The data used and produced in e-research activities can be extremely complex, taking different forms depending on the discipline. In the hard sciences, such as biochemistry data can take the form of images and numbers representing the structure of a protein."

Semantic Markup

The Economist has a good article (The Semantic Web: Start Making Sense, April 6, 2008) on the Semantic Web (Web 3.0?). One of the technologies they describe for automatically adding semantic markup to structured and unstructured text is Calais from Reuters:
"Reuters, however, believes it has overcome this problem. It recently launched a service called Calais[1] that takes raw web pages (and, indeed, any other form of data) and does the marking up itself. The acronyms can then get to work. That promises to imbue the streams of unstructured text and data sloshing around the internet with almost instant meaning.
The idea is that any website can send a jumble of text and code through Calais and receive back a list of "entities" that the system has extracted--mostly people, places and companies--and, even more importantly, their relationships. It will, for instance, be able to recognise a pharmaceutical company's name and, on its own initiative, cross-reference that against data on clinical trials for new drugs that are held in government databases. Alternatively, it can chew up a thousand blogs and expose trends that not even the bloggers themselves were aware of."

The examples are pretty cool, especially the Wikipedia + Amazon API + Calais mashup Semantic Book Suggestions. The Powerhouse Museum's example is pretty good too (see "Auto-generated tags: [Beta]" down the right column).
More examples.

Lucene indexing performance benchmarks for journal article metadata and full-text

I posted these journal article metadata & full-text Lucene indexing benchmarks to the Lucene user mailing list using the suggested XML format, but it seems that was not the proper thing to do. One of the list members (Cass Costello) converted it to HTML (thanks :-) ). I've decided to give it a permanent home here. If you have any questions, just let me know. I have some other benchmarks I will be posting with more records (~25 million) but only article metadata, not full-text. The loader that does all of this was developed as part of my Ungava project.

    Hardware Environment

  • Dedicated machine for indexing: yes
  • CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores
  • RAM: 8GB
  • Drive configuration: Dell EMC AX150 storage array fibre channel

    Software Environment

  • Lucene Version: 2.3.1
  • Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
  • Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
  • OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
  • Location of index: Filesystem, on attached storage
    Lucene Indexing Variables

  • Number of source documents: 6,404,464
  • Total filesize of source documents: 141GB; Note that this is only the full-text: the metadata (title, author(s), abstract, keywords, journal name) are in addition to this

  • Average filesize of source documents: 22KB + metadata (see above)

  • Source documents storage location: Filesystem

  • File type of source documents: text (PDFs converted to text then gzipped)

  • Parser(s) used, if any: None, but the text files were gzipped and had to be un-gzipped by the Java application, which also did the indexing

  • Analyzer(s) used: StandardAnalyzer

  • Number of fields per document: 24

  • Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored with no parsing

  • Index persistence: FSDirectory

  • Index size: 83GB

  • Number of terms: 143,298,010
    Figures

  • Time taken (in ms/s as an average of at least 3 indexing runs): 20.5 hours

  • Time taken / 1000 docs indexed: 11.5 seconds

  • Memory consumption: -Xms4000m -Xmx6000m

  • Query speed: average time a query takes, type of queries (e.g. simple one-term query, phrase query), not measuring any overhead outside Lucene: <.01s
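As a sanity check, the throughput figures can be re-derived from the document count and the total time. This is only back-of-the-envelope arithmetic on the numbers reported in this post; the one assumption is that "141GB" means decimal gigabytes.

```python
# Re-derive the reported figures from the raw numbers in this post.
num_docs = 6_404_464        # number of source documents
total_hours = 20.5          # wall-clock indexing time
total_bytes = 141 * 10**9   # 141GB of full-text (assumption: decimal GB)

total_seconds = total_hours * 3600
secs_per_1000_docs = total_seconds / (num_docs / 1000)
docs_per_second = num_docs / total_seconds
avg_kb_per_doc = total_bytes / num_docs / 1000

print(f"{secs_per_1000_docs:.1f} s per 1000 docs")
print(f"{docs_per_second:.0f} docs per second")
print(f"{avg_kb_per_doc:.0f} KB per document")
```

The per-1000-documents time and the average document size come out consistent with the values listed above.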

    Notes

  • These are journal articles, so the additional fields besides the full-text are bibliographic metadata, such as title, authors, abstract, keywords, journal name, volume, issue, start page, year.
  • Java command line directives: -XX:+AggressiveOpts -XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m -Xmx6000m
  • Reading files from the file system and un-gzipping them were both done multithreaded.
  • Eight separate parallel IndexWriters are fed by the pipeline (creation of Document objects occurs in parallel with 64 threads), and the eight indexes are merged at the very end into a single index.
  • Each parallel index had slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB respectively), so that flushing wouldn’t all happen at the same time.
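The shape of the pipeline described in these notes (un-gzip in parallel, build several partial indexes, merge once at the end) can be illustrated with a toy sketch. This is not the Ungava loader and it does not use Lucene: the "index" here is a plain in-memory dictionary from term to document ids, and the names `build_shard`, `merge` and `index_corpus` are made up for illustration. Python threads will not reproduce the performance either; the point is only the structure.

```python
import gzip
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 8  # mirrors the eight parallel IndexWriters


def build_shard(docs):
    """Un-gzip and tokenise each (doc_id, gzipped_bytes) pair into a partial index."""
    index = defaultdict(set)
    for doc_id, blob in docs:
        text = gzip.decompress(blob).decode("utf-8")
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge(shards):
    """Combine the partial indexes into one, done once at the very end."""
    merged = defaultdict(set)
    for shard in shards:
        for term, doc_ids in shard.items():
            merged[term] |= doc_ids
    return merged


def index_corpus(corpus, num_shards=NUM_SHARDS):
    # Deal the documents out round-robin, one bucket per shard.
    buckets = [corpus[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        shards = list(pool.map(build_shard, buckets))
    return merge(shards)


# Toy corpus standing in for the gzipped article text.
corpus = [
    (i, gzip.compress(f"article {i} about lucene indexing".encode("utf-8")))
    for i in range(20)
]
index = index_corpus(corpus)
print(len(index["lucene"]), "documents contain 'lucene'")
```

Because each shard is built by exactly one worker, no locking is needed until the final merge, which is the main attraction of the separate-writers-then-merge design.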

Monday, April 07, 2008

Must-read for Science Librarians: "Open Notebook Science: Implications for the Future of Libraries"

Jean-Claude Bradley's presentation "Open Notebook Science: Implications for the Future of Libraries" is a must-read for all research and science librarians if they want to know how science is starting to be, and will be, done. It should also be read by those who plan the futures of research and science libraries, in order to understand how, for instance, the millennials will be doing science, if they are not already.

Fundamental to this future (and present) are Open Access, Open Data, social (research) networking, the blogging/wiki/GoogleDocs/mailing-list dynamics, wiki versioning (of experimental and other research activities), Second Life for presentations and teaching, and the necessity of machine-to-machine communications and interactions (see my earlier blog entry: "New Open Access Criterion: Support access by machines").

Abstract: Open Notebook Science involves a variety of internet-based techniques for sharing of scientific information, from the use of wikis for experiments, to the Chemspider database, where chemists share molecules in a fashion that is socially (but not technically) similar to Wikipedia. Aspects of Open Notebook Science that are of relevance to librarians are discussed, such as automating of metadata for describing the steps of experiments, and the importance of using a 3rd-party wiki to record Open Notebook Science, so that contributions can be tracked and time-stamped. Bradley predicts movement towards more machine-to-machine communication, which will considerably speed up the research process.

New Open Access Criterion: Support access by machines (m2m)

Related to my last posting (FREE THE ARTICLES! (full-text for researchers & scientists and their machines)) and in the light of Peter Murray-Rust's recent annoying discovery that he cannot text-mine Pubmed Central (Can I data- and Text-mine Pubmed Central?), I would like to suggest an additional criterion to the definition of Open Access:
Open Access must include access by machines:
  • At minimum one must allow crawls of the site/content or (to reduce the impact of badly configured crawlers) create a compressed XML file containing all metadata and either content, or direct links to content and make it available for download (and if bandwidth is still an issue put it on a P2P network like BitTorrent).
  • Preferable is to offer some kind of API (OTMI) or protocol (OAI-PMH) to get at content and metadata and citations.
  • Better is to offer access to the XML of the articles in addition to the PDF and/or HTML; if the XML actually has some semantic content, then we are approaching the optimum.
The end goal is to support and encourage text mining and analysis of the full-text (preferably semantically rich XML), metadata and citations to allow literature-based exploration and discovery in support of the scientific research process.
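To make the protocol option concrete, here is a minimal sketch of what machine access via OAI-PMH looks like from the client side: build a ListRecords request and pull the titles out of the XML that comes back. The base URL and the sample response are made up for illustration (no network call is made), and the helper names `list_records_url` and `titles` are my own; the verb, the `oai_dc` metadata prefix and the XML namespaces are those of OAI-PMH 2.0 and Dublin Core.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def titles(xml_text):
    """Extract the Dublin Core title of every record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    found = []
    for record in root.iterfind(".//oai:record", NS):
        title = record.findtext(".//dc:title", namespaces=NS)
        if title is not None:
            found.append(title)
    return found


# Hypothetical response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))
print(titles(SAMPLE))
```

A real harvester would also follow the `resumptionToken` element to page through large result sets, which is omitted here.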

Thanks to Peter Suber's Open Access News for the pointer to Peter Murray-Rust's difficulties.

Friday, April 04, 2008

FREE THE ARTICLES! (Full-text for researchers & scientists and their machines)

At a recent plenary I gave [earlier post] at the Colorado Association of Research Libraries Next Gen Library Interfaces conference, I went a little off-script and was educating (/haranguing) the mostly librarian audience about the present-and-near-future importance of the accessibility of full-text research articles to their researchers and scientists.

By accessibility of full-text I didn't mean the ability of a human to access the PDF or HTML of an article via a web browser: I was referring to the machine-accessibility of the text contained in the article (and the metadata and the citation information).

I was concerned because of the increasing number of discipline-specific tools that use full-text (& metadata & citations) to allow users (via text mining, semantic analysis, etc.) to navigate, analyze and discover new ideas and relationships, from the research literature. The general label for this kind of research is 'literature-based discovery', where new knowledge hidden in the literature is exposed using text mining and other tools.
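For readers who have not seen literature-based discovery in action, here is a toy sketch of the simplest form of the idea: if term A co-occurs with B in some articles, and B co-occurs with C in others, but A and C never appear together, then A-C is a candidate hidden connection. The three "documents" below are hand-made term sets, loosely modelled on the fish oil example cited at the end of this post, not real extracted data, and the function names are made up for illustration. Real tools work on full-text with far more sophisticated extraction and ranking.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(docs):
    """Map each term to the set of terms it shares at least one document with."""
    links = defaultdict(set)
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hidden_links(docs, start):
    """Terms reachable from `start` only via an intermediate term, with those intermediates."""
    links = cooccurrences(docs)
    direct = links[start]
    candidates = defaultdict(set)
    for b in direct:
        for c in links[b]:
            if c != start and c not in direct:
                candidates[c].add(b)
    return dict(candidates)


# Hand-made toy "articles", each reduced to a set of key terms.
docs = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "raynaud's syndrome"},
    {"fish oil", "platelet aggregation"},
]
print(hidden_links(docs, "fish oil"))
```

Even this crude version needs the terms from every article, which is exactly why machine access to the full-text matters.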

Most publisher licenses do not allow for the sort of access to the full-text that many of these discovery and exploration tools need.

When I asked for a show of hands of how many were aware of this issue, of the ~200 in the audience, no one raised their hand.

I went on to suggest/rant that librarians should expect more of their researcher/scientist patrons to be needing/demanding this sort of access to the full-text of (licensed) journal articles. They need to anticipate this response, and I suggested the following non-mutually-exclusive strategies:
  • demanding licenses from publishers and aggregators that allow them to offer access to full-text for analysis by arbitrary patron tools
  • asking publishers to publish their full-text in the Open Text Mining Interface (OTMI)
  • supporting Open Access journals, which allow for much of this out-of-the-box (but often have very difficult APIs, or none at all and only web pages to get at the content!!)
Recently I retro-discovered an article[1] in The Economist, which explains to the lay-person some of the kinds of things that can be done with access to the literature. It describes a study[2] showing how researchers discovered the biochemical pathway involved in drug addiction from the literature alone. They did no experiments. This discovery was derived from an analysis and extraction of information from more than 1000 articles! This is not the first time this sort of thing has happened[3]. Clearly, this sort of analysis can save time and money in discovering important and relevant scientific knowledge.

[1] Drug Addiction: Going by the book (2008). The Economist, January 10 print issue.
[2] Li, C., Mao, X., Wei, L. (2008). Genes and (Common) Pathways Underlying Drug Addiction. PLoS Computational Biology, 4(1), e2. DOI: 10.1371/journal.pcbi.0040002
[3] Swanson, D. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med, 30:1:7-18.

Additional reading:
Update 2008 April 7: Peter Suber's posts on how OA facilitates meta-analysis and text-mining.

Thanks to Martha Lee UCLA via NGC4LIB.

Tuesday, April 01, 2008

"Places & Spaces: Mapping Science" exhibit @ NRC-CISTI

It is very exciting that the Places and Spaces: Mapping Science exhibit from Indiana University will be on display at NRC-CISTI from April 3 to June 27, 2008.

This is the first time this collection of amazing maps of science is on display outside the U.S.

The diverse and creative collection includes traditional cartographic maps, concept maps and domain maps. These are all physical paper (+other media) maps, and the collection also includes some hands-on maps made specifically for children to interact with.

Congrats to all involved at NRC-CISTI and in particular my CISTI Research colleague Jeff Demaine who was the originator and champion of this initiative.


Update: 2008 April 15: Indiana University SLIS Events News: Mapping Science Exhibit at the National Research Council - Ottawa, Canada