Monday, April 07, 2008

New Open Access Criterion: Support access by machines (m2m)

Related to my last posting (FREE THE ARTICLES! (full-text for researchers & scientists and their machines)) and in light of Peter Murray-Rust's recent, frustrating discovery that he cannot text-mine PubMed Central (Can I data- and Text-mine Pubmed Central?), I would like to suggest an additional criterion for the definition of Open Access:
Open Access must include access by machines:
  • At minimum, one must allow crawls of the site and its content. Alternatively, to reduce the impact of badly configured crawlers, one can offer for download a compressed XML file containing all metadata and either the content itself or direct links to it (and if bandwidth is still an issue, put the file on a P2P network like BitTorrent).
  • Preferable is to offer some kind of API (OTMI) or protocol (OAI-PMH) for getting at content, metadata and citations.
  • Better is to offer access to the XML of the articles in addition to the PDF and/or HTML; if the XML actually has some semantic content, then we are approaching the optimum.
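To make the "at minimum" option concrete, here is a rough Python sketch of the compressed XML dump. It is only an illustration: the element names (articles, article, fulltext), the record fields, and the example.org URLs are all assumptions made up for this sketch, not any repository's actual schema.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical records; a real repository would pull these from its database.
SAMPLE = [
    {"id": "a1", "title": "First article", "authors": ["A. Author"],
     "url": "https://example.org/articles/a1.xml"},
    {"id": "a2", "title": "Second article", "authors": ["B. Author", "C. Author"],
     "url": "https://example.org/articles/a2.xml"},
]

def build_dump(articles, path):
    """Write all metadata, plus a direct link to each full text,
    into a single gzip-compressed XML file at `path`."""
    root = ET.Element("articles")
    for art in articles:
        node = ET.SubElement(root, "article", id=art["id"])
        ET.SubElement(node, "title").text = art["title"]
        for name in art["authors"]:
            ET.SubElement(node, "author").text = name
        # Link to the content rather than embedding it.
        ET.SubElement(node, "fulltext", href=art["url"])
    with gzip.open(path, "wb") as fh:
        fh.write(ET.tostring(root, encoding="utf-8"))
    return path
```

A text miner then needs one download instead of one request per article, which is the point of offering the dump as an alternative to crawling.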
The end goal is to support and encourage text mining and analysis of the full-text (preferably semantically rich XML), metadata and citations to allow literature-based exploration and discovery in support of the scientific research process.
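For a sense of what protocol-level access looks like from the miner's side, here is a minimal Python sketch of the client half of an OAI-PMH ListRecords harvest. The verb, the metadataPrefix and resumptionToken arguments, and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the function names, the base URL, and the sample response are assumptions invented for this illustration, and the actual HTTP fetch is left out.

```python
import urllib.parse
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", token=None):
    """Build a ListRecords request URL. A resumptionToken is an exclusive
    argument, so follow-up requests carry only the verb and the token."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)

def parse_list_records(xml_text):
    """Return (records, next_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    # An empty or missing token means the list is complete.
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)

# Hypothetical response, trimmed to the fields used above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would loop: fetch list_records_url(...), parse the response, and repeat with the returned token until it comes back as None.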

Thanks to Peter Suber's Open Access News for the pointer to Peter Murray-Rust's difficulties.


PA said...

I agree that machine-reading is a fundamental necessity for getting the most out of Open Access. At the moment, however, I hate to say that I think we first need to win the battle of simply getting more literature published under any OA paradigm, and only then worry about how to make these resources accessible to crawlers.

My reasoning is that it's fairly simple to implement code, for instance at the level of PubMedCentral, that will output a readable XML file of any document, whereas the mass conversion of publishing "norms" to Open Access is a much more complicated undertaking.

Stevan Harnad said...

Meet the Old OA Criterion Before Demanding More

I think it would be a big strategic mistake if today, when the cupboards are still 85% bare, we were to start insisting that deposits must all be Cordon Bleu ****.

OA just means free online access to the full-text of refereed journal articles. Please let's not risk getting less by needlessly insisting on more. The rest will come in due time, but what is urgently needed today, and what is still 85% overdue by more than 10 years today, is free online access. Let the Green OA mandates provide that, and the rest will all come naturally with the territory soon enough of its own accord.

But over-reach gratuitously now, and we will just delay the optimal and inevitable, already within our reach, still longer.

Ceterum Censeo: The BBB "definitions" (which were not brought down to us by Moses from On High, but puttered together by muddled mortals, including myself) are not etched in stone, and need some tweaking to get them right.

"Time to Update the BBB Definition of Open Access"

OA is free online access. With that comes, automatically, the individual capability of linking, reading, downloading, storing, printing off, and data-mining (locally).

The further "rights" for 3rd-party databases to data-mine and re-publish will come after universal Green OA mandates generate universal OA (free online access). But you'll never get universal Green OA mandates if you insist in advance that the 3rd-party re-use rights must be part of the mandate! (Notice that the Harvard mandate has an opt-out, which means it's not a mandate.)

"On Patience, and Letting (Human) Nature Take Its Course"

And as to demanding machine-readable XML from authors: 85% of authors cannot now be bothered to do even the few keystrokes it takes to get them to deposit the drafts they already have: Does this sound like a reasonable time to ask them to upgrade their drafts to Cordon Bleu XML?

Stevan Harnad
American Scientist Open Access Forum

Glen Newton said...


I have to agree with your points, but I still believe that we should plant the seeds with both the authors and the publishers about what is still needed. Perhaps we can call it something like OA-NG, where we could collect all of our "important-to-have but we'll put off to the near future" requirements. Having an open discussion about what is needed in the next generation of OA would both lay the groundwork for the core OA community and give notice to the publishers et al. of possible future demands on them. I am sure there are other communities besides the text mining community for which OA is a good thing but does not satisfy all of their interests.