Wednesday, July 29, 2009

Project Torngat: Building Large-Scale Semantic 'Maps of Science' with LuSql, Lucene, Semantic Vectors, R and Processing from Full-Text

Project Torngat is a research project here at NRC-CISTI [Note that I am no longer at CISTI and that I am now continuing this work at Carleton University - GN 2010 04 07] that uses the full text of journal articles to construct semantic journal maps. Among other uses, article search results can be projected onto such a map to visualize them and to support interactive exploration and discovery of related articles, terms, and journals.

Starting with 5.7 million full-text articles from 2200+ journals (mostly scientific, technical, and medical (STM)), and using LuSql, Lucene, Semantic Vectors, R, and Processing, a two-dimensional mapping of a 512-dimension semantic space was created, which revealed an excellent correspondence with the 23 human-created journal categories:


Semantic Journal Space of 2231 Journals
Scaled to Two Dimensions

This initial work set out to find a technique that would scale. Follow-up work is looking at integrating the map with a search interface and evaluating whether better structure is revealed within semantic journal mappings of single categories.
This may be the first time full text at such a large scale has been used in this fashion, without the help of article metadata.

Try out the prototype (needs Java in the browser) [the site appears to be down right now], which displays journals in the 2-D space.

How it was done:
  1. Using a custom LuSql filter, for each of the 2231 journals, concatenate the full text of all of the journal's articles into a single document.
  2. Using LuSql, create a Lucene index of all the journal documents (took ~14 hours, highly multithreaded on a multicore machine, resulting in a 43 GB index)
  3. Using Semantic Vectors BuildIndex, create a docvector index of all journal documents, with 512 dimensions (58 minutes, 3.4 GB index)
  4. Using Semantic Vectors Search, find the cosine distance between all pairs of journal documents (8 minutes)
  5. Build journal-journal distance matrix
  6. Use R's multidimensional scaling (MDS) to scale distance matrix to 2-D
  7. Build visualization using Processing
NB: All of the above software is Open Source.
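
For readers who want to see the shape of steps 4 to 6, here is a minimal sketch in plain Python. It is not the project's code: the real pipeline used Semantic Vectors for the distances and R for the scaling, and the post does not say which R function was called. The sketch assumes classical (Torgerson) MDS, which is what R's cmdscale implements, and uses a simple power iteration to get the top eigenvectors, which in turn assumes the largest-magnitude eigenvalues are positive. All function names and the toy 3-dimension vectors are made up for illustration; the real document vectors had 512 dimensions.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Step 5: symmetric journal-journal distance matrix."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist


def top_eigenpair(matrix, iterations=1000, seed=0):
    """Power iteration: dominant eigenpair of a symmetric matrix."""
    n = len(matrix)
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        nxt = [sum(row[k] * vec[k] for k in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # matrix is (numerically) zero: nothing left to find
            return 0.0, vec
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    mv = [sum(row[k] * vec[k] for k in range(n)) for row in matrix]
    return sum(vec[i] * mv[i] for i in range(n)), vec


def classical_mds(dist, dims=2):
    """Step 6: classical MDS of a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (D is symmetric, so column
    # means equal row means).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        eigval, eigvec = top_eigenpair(b)
        scale = math.sqrt(max(eigval, 0.0))  # negative eigenvalues are dropped
        for i in range(n):
            coords[i][axis] = eigvec[i] * scale
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - eigval * eigvec[i] * eigvec[j] for j in range(n)]
             for i in range(n)]
    return coords


if __name__ == "__main__":
    # Hypothetical toy "journal vectors": two similar pairs.
    toy = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
    for point in classical_mds(distance_matrix(toy)):
        print(point)
```

With a couple of thousand journals the distance matrix is only about 2231 x 2231, so the scaling step is cheap compared with the indexing; the pure-Python loops above are for readability, and a real run would use a numerical library or R itself.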

You can read more about it in the preprint:
Newton, G. & A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009.



Thanks to my collaborators, Alison Callahan and Michel Dumontier, Carleton University.

Built with Open Source Software.

4 comments:

Jonathan Rochkind said...

I think this might make an interesting Code4Lib Journal article, if you're interested.

How did you get the full-text for those 2k journals?

http://journal.code4lib.org/call-for-submissions

Glen Newton said...

Hi Jonathan,

Yes, that is a great idea: I'll send off an abstract to the code4lib journal as-soon-as-I-can. :-)

The journals: until recently* I worked at CISTI which had licensed local holdings of full-text for many S&T journals. I used these.

*CISTI has undergone 70% cuts, and I was one of the many let go....

thanks,
Glen

King said...

Hello Glen,

I have a project at a university in Germany on comparing documents and developing strategies to better delete documents that have the same content!

It would be very helpful for us if you could send us instructions on how to do it.

Regards from Germany

Glen Newton said...

Hello,

The description in the paper is quite explicit about how it uses Lucene and Semantic Vectors.
--Glen