Related to my last posting (FREE THE ARTICLES! (full-text for researchers & scientists and their machines)) and in light of Peter Murray-Rust's recent annoying discovery that he cannot text-mine Pubmed Central (Can I data- and Text-mine Pubmed Central?), I would like to suggest an additional criterion for the definition of Open Access:
Open Access must include access by machines:
The end goal is to support and encourage text mining and analysis of the full text (preferably as semantically rich XML), metadata and citations, so as to allow literature-based exploration and discovery in support of the scientific research process.
- At minimum, one must allow crawls of the site/content or (to reduce the impact of badly configured crawlers) provide a compressed XML file for download that contains all metadata and either the content itself or direct links to the content. If bandwidth is still an issue, put the file on a P2P network like BitTorrent.
- Preferable is to offer some kind of API (OTMI) or protocol (OAI-PMH) to get at content, metadata and citations.
- Better is to offer access to the XML of the articles in addition to the PDF and/or HTML; if the XML actually has some semantic content, then we are approaching the optimum.
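To make the protocol option a little more concrete, here is a minimal Python sketch of what a machine client harvesting over OAI-PMH might look like. It only relies on the parts of the protocol I am sure of: the `ListRecords` verb, the `metadataPrefix` argument, and the `resumptionToken` used for paging. The function names, the `fetch` callable and the `https://example.org/oai` endpoint are my own placeholders, not any particular repository's interface.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (base_url is a hypothetical endpoint)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip()
        for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken means the list is complete.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(base_url, fetch, metadata_prefix="oai_dc"):
    """Yield every record identifier, following resumption tokens.

    `fetch` is an assumed callable that takes a URL and returns the
    response body (str or bytes); injecting it keeps the sketch testable.
    """
    token = None
    while True:
        url = build_list_records_url(base_url, metadata_prefix, token)
        identifiers, token = parse_list_records(fetch(url))
        yield from identifiers
        if not token:
            break
```

Against a live repository, `fetch` could be as simple as `lambda url: urllib.request.urlopen(url).read()`, ideally with a polite delay between requests. The identifiers are only a starting point: the same loop could collect the metadata or full-text links inside each record.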
Thanks to Peter Suber's Open Access News for the pointer to Peter Murray-Rust's difficulties.