Tuesday, April 22, 2008
- Archival Practices
- Challenges of Audio, Film, Video, and Digital Media Preservation
- Cultural Influences, Impacts, and Considerations
- Cultural Property Law
- Digital Preservation and Cultural Memory
- Digitization of the Humanities
- Effects on Artistic Practices
- Ethical, Social, and Philosophical Concerns
- Preservation Strategies and Techniques
- Future Trends and Directions
Wednesday, April 16, 2008
Coming from a country where you have to buy census data (!!!), this is a wonderful idea. Spread the meme.
This looks like an interesting article (Libraries in the Converging Worlds of Open Data, E-Research, and Web 2.0, Stuart MacDonald, March/April issue of ONLINE magazine) but I don't have a subscription so I can't really comment much on it. Ironic that the abstract mentions Peter Suber's Open Access News blog.... ;-)
Abstract: "The new forms of research enabled by the latest technologies bring about collaboration among researchers in different locations, institutions, and even disciplines. These new collaborations have two key features -- the prodigious use and production of data. This data-centric research manifests itself in such concepts as e-science, cyberinfrastructure, or e-research. Over the last decade there has been much discussion about the merits of open standards, open source software, open access to scholarly publications, and most recently open data. There are a range of authoritative weblogs that address the open movement, some of which include: 1. DCC's Digital Curation Blog, 2. Peter Suber's Open Access News, and 3. Open Knowledge Foundation Weblog. The data used and produced in e-research activities can be extremely complex, taking different forms depending on the discipline. In the hard sciences, such as biochemistry data can take the form of images and numbers representing the structure of a protein."
"Reuters, however, believes it has overcome this problem. It recently launched a service called Calais that takes raw web pages (and, indeed, any other form of data) and does the marking up itself. The acronyms can then get to work. That promises to imbue the streams of unstructured text and data sloshing around the internet with almost instant meaning.
The idea is that any website can send a jumble of text and code through Calais and receive back a list of "entities" that the system has extracted--mostly people, places and companies--and, even more importantly, their relationships. It will, for instance, be able recognise a pharmaceutical company's name and, on its own initiative, cross-reference that against data on clinical trials for new drugs that are held in government databases. Alternatively, it can chew up a thousand blogs and expose trends that not even the bloggers themselves were aware of."
The examples are pretty cool, especially the Wikipedia + Amazon API + Calais mashup Semantic Book Suggestions. The Powerhouse Museum's example is pretty good too (see "Auto-generated tags: [Beta]" down the right column).
I posted these journal article metadata & full-text Lucene indexing benchmarks to the Lucene user mailing list using the suggested XML format, but it seems that was not the proper thing to do. One of the list members (Cass Costello) converted it to HTML (thanks :-) ). I've decided to give it a permanent home here. If you have any questions, just let me know. I will be posting some other benchmarks with more records (~25 million), but with article metadata only, not full-text. The loader that does all of this was developed as part of my Ungava project.
- Dedicated machine for indexing: yes
- CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores
- RAM: 8GB
- Drive configuration: Dell EMC AX150 storage array fibre channel
- Lucene Version: 2.3.1
- Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
- Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
- OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
- Location of index: Filesystem, on attached storage
Lucene indexing variables
- Number of source documents: 6,404,464
- Total filesize of source documents: 141GB. This covers only the full-text; the metadata (title, author(s), abstract, keywords, journal name) is in addition to this
- Average filesize of source documents: 22KB + metadata (see above)
- Source documents storage location: Filesystem
- File type of source documents: text (PDFs converted to text then gzipped)
- Parser(s) used, if any: None, but the text files were gzipped and had to be decompressed by the Java application that also did the indexing
- Analyzer(s) used: StandardAnalyzer
- Number of fields per document: 24
- Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored with no parsing
- Index persistence: FSDirectory
- Index size: 83GB
- Number of terms: 143,298,010
- Time taken (in ms/s as an average of at least 3 indexing runs): 20.5 hours
- Time taken / 1000 docs indexed: 11.5 seconds
- Memory consumption: -Xms4000m -Xmx6000m
- Query speed: average time a query takes, type of queries (e.g. simple one-term query, phrase query), not measuring any overhead outside Lucene: <.01s
- These are journal articles, so the additional fields besides the full-text are bibliographic metadata, such as title, authors, abstract, keywords, journal name, volume, issue, start page, year.
- Java command line directives: -XX:+AggressiveOpts -XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m -Xmx6000m
- Highly multithreaded & pipelined architecture using java.util.concurrent.ThreadPoolExecutor
- Reading files from the file system and un-gzipping them are both multithreaded.
- Eight separate parallel IndexWriters are fed by the pipeline (creation of Document objects occurs in parallel with 64 threads) and merged at the very end into a single index.
- Each parallel index had slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB respectively), so that flushing wouldn’t all happen at the same time.
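The pipeline in the notes above can be sketched roughly as follows. This is a Python analogue, not the Java/Lucene loader from the benchmark: the function names are hypothetical, plain lists stand in for the eight IndexWriters, and a dictionary stands in for the merged index. It only illustrates the shape of the design: decompress and build documents in a thread pool, spread them over several writers, and merge once at the end.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

NUM_WRITERS = 8   # the post uses eight parallel IndexWriters
NUM_THREADS = 64  # the post builds Document objects with 64 threads


def build_document(doc_id, gzipped_bytes):
    """Decompress one gzipped full-text blob and wrap it as a document dict."""
    text = gzip.decompress(gzipped_bytes).decode("utf-8")
    return {"id": doc_id, "fulltext": text}


def index_in_parallel(sources, num_writers=NUM_WRITERS, workers=NUM_THREADS):
    """Build documents in a thread pool and route them to separate writers.

    Each "writer" is a plain list standing in for one parallel index
    (an assumption made to keep the sketch self-contained).
    """
    shards = [[] for _ in range(num_writers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        docs = pool.map(lambda item: build_document(*item), sources)
        for n, doc in enumerate(docs):
            shards[n % num_writers].append(doc)  # round-robin routing
    return shards


def merge_shards(shards):
    """Merge the per-writer shards into a single index at the very end."""
    merged = {}
    for shard in shards:
        for doc in shard:
            merged[doc["id"]] = doc["fulltext"]
    return merged


if __name__ == "__main__":
    # Made-up input: twenty tiny gzipped "articles".
    sources = [(i, gzip.compress(f"article {i} full text".encode("utf-8")))
               for i in range(20)]
    index = merge_shards(index_in_parallel(sources))
    print(len(index), "documents indexed")
```

Round-robin routing is just the simplest choice for a sketch; the original loader may distribute documents to its writers differently.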
Monday, April 07, 2008
Fundamental to this future (and present) are Open Access, Open Data, social (research) networking, the blogging/wiki/GoogleDocs/mailing-list dynamics, Wiki versioning (of experimental and other research activities), Second Life for presentations and teaching, and the necessity of machine-to-machine communications and interactions (see my earlier blog entry: "New Open Access Criterion: Support access by machines").
Abstract: Open Notebook Science involves a variety of internet-based techniques for sharing of scientific information, from the use of wikis for experiments, to the Chemspider database, where chemists share molecules in a fashion that is socially (but not technically) similar to Wikipedia. Aspects of Open Notebook Science that are of relevance to librarians are discussed, such as automating of metadata for describing the steps of experiments, and the importance of using a 3rd-party wiki to record Open Notebook Science, so that contributions can be tracked and time-stamped. Bradley predicts movement towards more machine-to-machine communication, which will considerably speed up the research process.
Open Access must include access by machines:
The end goal is to support and encourage text mining and analysis of the full-text (preferably semantically rich XML), metadata and citations to allow literature-based exploration and discovery in support of the scientific research process.
- At minimum one must allow crawls of the site/content or (to reduce the impact of badly configured crawlers) create a compressed XML file containing all metadata and either content, or direct links to content and make it available for download (and if bandwidth is still an issue put it on a P2P network like BitTorrent).
- Preferable is to offer some kind of API (OTMI) or protocol (OAI-PMH) to get at content and metadata and citations.
- Better is to offer access to the XML of the articles in addition to the PDF and/or HTML; if the XML actually has some semantic content, then we are approaching the optimum.
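The "at minimum" option above, a single compressed XML file with metadata and links to content, could look something like the sketch below. The element names, the record fields, and the example.org URLs are all assumptions for illustration; no particular publisher schema is implied.

```python
import gzip
import xml.etree.ElementTree as ET


def build_dump(articles):
    """Serialize article metadata and content links into gzipped XML bytes."""
    root = ET.Element("articles")
    for art in articles:
        node = ET.SubElement(root, "article", id=art["id"])
        ET.SubElement(node, "title").text = art["title"]
        # Direct link to the content rather than the content itself.
        ET.SubElement(node, "fulltext", href=art["url"])
    return gzip.compress(ET.tostring(root, encoding="utf-8"))


def read_dump(dump_bytes):
    """What a harvesting machine would do: decompress, parse, list records."""
    root = ET.fromstring(gzip.decompress(dump_bytes))
    return [(a.get("id"), a.findtext("title"), a.find("fulltext").get("href"))
            for a in root.iter("article")]


if __name__ == "__main__":
    # Hypothetical records; the URLs are placeholders.
    articles = [
        {"id": "a1", "title": "First article",
         "url": "https://example.org/articles/a1.xml"},
        {"id": "a2", "title": "Second article",
         "url": "https://example.org/articles/a2.xml"},
    ]
    dump = build_dump(articles)
    for record in read_dump(dump):
        print(record)
```

One file like this lets a text-mining tool fetch everything in a single download instead of crawling the site page by page.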
Thanks to Peter Suber's Open Access News for the pointer to Peter Murray-Rust's difficulties.
Friday, April 04, 2008
By accessibility of full-text I didn't mean the ability of a human to access the PDF or HTML of an article via a web browser: I was referring to the machine-accessibility of the text contained in the article (and the metadata and the citation information).
I was concerned because of the increasing number of discipline-specific tools that use full-text (& metadata & citations) to allow users (via text mining, semantic analysis, etc.) to navigate, analyze and discover new ideas and relationships in the research literature. The general label for this kind of research is 'literature-based discovery', where new knowledge hidden in the literature is exposed using text mining and other tools.
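To make 'literature-based discovery' a little more concrete, here is a toy sketch of one simple co-occurrence heuristic: if term A appears with term B in some articles, and B appears with C in others, but A and C never appear together, then A and C are a candidate hidden link. The term sets below are made up for illustration and only loosely echo Swanson's fish oil and Raynaud's syndrome example; real tools work on full-text and use far more sophisticated extraction.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(documents):
    """Map each term to the set of terms it shares a document with."""
    links = defaultdict(set)
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hidden_links(documents, start):
    """Find terms never seen with `start` but reachable via a shared term.

    Returns {candidate term: set of bridging terms}.
    """
    links = cooccurrences(documents)
    candidates = defaultdict(set)
    for bridge in links[start]:
        for other in links[bridge]:
            if other != start and other not in links[start]:
                candidates[other].add(bridge)
    return dict(candidates)


if __name__ == "__main__":
    # Each set stands for the key terms of one (imaginary) article.
    docs = [
        {"fish oil", "blood viscosity"},
        {"fish oil", "platelet aggregation"},
        {"raynaud's syndrome", "blood viscosity"},
        {"raynaud's syndrome", "platelet aggregation"},
    ]
    for term, bridges in hidden_links(docs, "fish oil").items():
        print(term, "via", sorted(bridges))
```

Even this toy version needs the terms from every article, which is exactly the kind of machine access to full-text discussed here.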
Most publisher licenses do not allow for the sort of access to the full-text that many of these discovery and exploration tools need.
When I asked for a show of hands from those aware of this issue, no one in the audience of ~200 raised a hand.
I went on to suggest/rant that librarians should expect more of their researcher/scientist patrons to be needing/demanding this sort of access to the full-text of (licensed) journal articles. They need to anticipate this response, and I suggested the following non-mutually-exclusive strategies:
- demanding licenses from publishers and aggregators that allow them to offer access to full-text for analysis by arbitrary patron tools
- asking publishers to publish their full-text in the Open Text Mining Interface (OTMI)
- supporting Open Access journals, which allow for much of this out of the box (but often have very difficult APIs, or none at all and only web pages to get at the content!!)
 Drug Addiction: Going by the book (2008). The Economist, January 10 print issue.
 Li, C., Mao, X., Wei, L. (2008). Genes and (Common) Pathways Underlying Drug Addiction. PLoS Computational Biology, 4(1), e2. DOI: 10.1371/journal.pcbi.0040002
 Swanson, D. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1), 7-18.
- Bourne, P.E., Fink, J.L., Gerstein, M. (2008). Open Access: Taking Full Advantage of the Content. PLoS Computational Biology, 4(3), e1000037. DOI:10.1371/journal.pcbi.1000037
- Demirandasanto, M., Coelho, G., Dossantos, D., Filho, L. (2006). Text mining as a valuable tool in foresight exercises: A study on nanotechnology. Technological Forecasting and Social Change, 73(8), 1013-1027. DOI: 10.1016/j.techfore.2006.05.020
- Džeroski, S., Langley, P., Todorovski, L. (2007). Computational Discovery of Scientific Knowledge. Lecture Notes in Computer Science 4660 DOI:10.1007/978-3-540-73920-3
- Glenisson, P. (2004). Integrating scientific literature with large scale gene expression analysis. PhD Thesis, Katholieke Universiteit Leuven, Belgium.
- Hristovski, D., Peterlin, B., Džeroski, S., Stare, S. (2007). Literature Based Discovery Support System and Its Application to Disease Gene Identification. Lecture Notes in Computer Science, 4660, 307-326. DOI: 10.1007/978-3-540-73920-3_15
- Kostoff, R. (2007). Validating discovery in literature-based discovery (letter to the editor). Journal of Biomedical Informatics, 40(4), 448-450. DOI:10.1016/j.jbi.2007.05.001
- Krallinger, M., Valencia, A. (2005). Text-mining and information-retrieval services for molecular biology. Genome Biology, 6(7), 224. DOI:10.1186/gb-2005-6-7-224
- Krogel, M., Scheffer, T. (2004). Multi-Relational Learning, Text Mining, and Semi-Supervised Learning for Functional Genomics. Machine Learning, 57(1/2), 61-81. DOI: 10.1023/B:MACH.0000035472.73496.0c
- Mack, R. (2002). Text-based knowledge discovery: search and mining of life-sciences documents. Drug Discovery Today, 7(11), S89-S98. DOI:10.1016/S1359-6446(02)02286-9
- Saso Dzeroski, Ljupco Todorovski, Eds. (2007). Computational Discovery of Scientific Knowledge, Introduction, Techniques, and Applications in Environmental and Life Sciences. Lecture Notes in Computer Science 4660 Springer, ISBN 978-3-540-73919-7
- Weeber, M. (2007). Drug Discovery as an Example of Literature-Based Discovery. Lecture Notes in Computer Science, 4660, 290-306. DOI: 10.1007/978-3-540-73920-3_14
- Weeber, M., Kors, J.A., Mons, B. (2005). Online tools to support literature-based discovery in the life sciences. Briefings in Bioinformatics, 6(3), 277-286. DOI:10.1093/bib/6.3.277
- Zhou, D., He, Y. (2008). Extracting interactions between proteins from the literature. Journal of Biomedical Informatics, 41(2), 393-407. DOI:10.1016/j.jbi.2007.11.008
- Zweigenbaum, P., Demner-Fushman, D., Yu, H., Cohen, K.B. (2007). Frontiers of biomedical text mining: current progress. Briefings in Bioinformatics, 8(5), 358-375. DOI:10.1093/bib/bbm045
Thanks to Martha Lee UCLA via NGC4LIB.
Tuesday, April 01, 2008
It is very exciting that the Places and Spaces: Mapping Science exhibit from Indiana University will be on display at NRC-CISTI from April 3 to June 27, 2008.
This is the first time this collection of amazing maps of science is on display outside the U.S.
The diverse and creative collection includes traditional cartographic maps, concept maps and domain maps. These are all physical paper (+other media) maps, and the collection also includes some hands-on maps made specifically for children to interact with.
Jeff Demaine was the originator and champion of this initiative.
- Boyack, K.W., Klavans, R., Börner, K. (2005). Mapping the Backbone of Science. Scientometrics, 64(3), 351-374.