Project Torngat: Building Large-Scale Semantic 'Maps of Science' with LuSql, Lucene, Semantic Vectors, R and Processing from Full-Text
Project Torngat is a research project here at NRC-CISTI [Note that I am no longer at CISTI and am now continuing this work at Carleton University - GN 2010 04 07] that uses the full-text of journal articles to construct semantic journal maps. These maps can be used -- among other things -- to project article search results onto the map, visualizing the results and supporting interactive exploration and discovery of related articles, terms and journals.
Starting with 5.7 million full-text articles from 2200+ journals (mostly science, technology and medical (STM)), and using LuSql, Lucene, Semantic Vectors, R, and Processing, we created a two-dimensional mapping of a 512-dimension semantic space that revealed an excellent correspondence with the 23 human-created journal categories.
This initial work was undertaken to find a technique that would scale. Follow-up work is looking at integrating the maps with a search interface, and at evaluating whether better structure is revealed within semantic journal mappings of single categories.
This may be the first time full-text at this scale has been used in this fashion, without the help of article metadata.
How it was done:
- Using a custom LuSql filter, for each of the 2231 journals, concatenate the full-text of all of a journal's articles into a single document.
- Using LuSql, create a Lucene index of all the journal documents (took ~14 hours, highly multithreaded on a multicore machine, resulting in a 43GB index)
- Using Semantic Vectors BuildIndex, create a docvector index of all journal documents, with 512 dimensions (58 minutes, 3.4GB index)
- Using Semantic Vectors Search, compute the cosine distance between all pairs of journal documents (8 minutes)
- Build the journal-journal distance matrix
- Use R's multidimensional scaling (MDS) to scale the distance matrix to 2-D (an R sketch of these two steps appears after this list)
- Build visualization using Processing
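As a rough illustration of the distance-matrix and MDS steps, here is a minimal R sketch going from pairwise cosine scores to a 2-D layout with classical MDS. The file names, column names, category file, and the distance = 1 - cosine convention are assumptions for illustration only, not the actual Project Torngat scripts or data formats; the real pairwise scores came from the Semantic Vectors searches described above.

```r
## Minimal sketch: pairwise cosine scores -> distance matrix -> 2-D MDS -> quick plot.
## File and column names below are illustrative assumptions.

# One row per journal pair, as might be exported from the Semantic Vectors searches.
# Assumed columns: journal_a, journal_b, cosine
pairs <- read.csv("journal_cosine_similarities.csv", stringsAsFactors = FALSE)

journals <- sort(unique(c(pairs$journal_a, pairs$journal_b)))
n <- length(journals)

# Build a symmetric journal-journal distance matrix,
# using the common convention distance = 1 - cosine similarity.
d <- matrix(0, n, n, dimnames = list(journals, journals))
d[cbind(match(pairs$journal_a, journals),
        match(pairs$journal_b, journals))] <- 1 - pairs$cosine
d <- pmax(d, t(d))   # symmetrise (each pair assumed to appear once)
diag(d) <- 0

# Classical (metric) multidimensional scaling down to 2 dimensions.
coords <- cmdscale(as.dist(d), k = 2)

# Quick look at the layout, coloured by the human-assigned journal category
# (assumed to live in a two-column file: journal, category).
cats <- read.csv("journal_categories.csv", stringsAsFactors = FALSE)
cat_of <- cats$category[match(rownames(coords), cats$journal)]
plot(coords, col = as.factor(cat_of), pch = 19,
     xlab = "MDS dimension 1", ylab = "MDS dimension 2")
```

cmdscale is R's built-in classical MDS; the resulting 2-D coordinates can then be exported for rendering in the Processing visualization.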
You can read more about it in the preprint:
Newton, G., Callahan, A. & Dumontier, M. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries, European Conference on Digital Libraries (ECDL) 2009.
Thanks to my collaborators, Alison Callahan and Michel Dumontier, Carleton University.
Built with Open Source Software.
Comments
How did you get the full-text for those 2k journals?
http://journal.code4lib.org/call-for-submissions
Yes, that is a great idea: I'll send off an abstract to the code4lib journal as soon as I can. :-)
The journals: until recently* I worked at CISTI, which had licensed local holdings of full-text for many S&T journals. I used these.
*CISTI has undergone 70% cuts, and I was one of the many let go....
thanks,
Glen
I have a project at a university in Germany comparing documents and developing strategies for removing documents that have the same content.
It would be very helpful for us if you could send us instructions on how to do it.
Regards from Germany
The description in the paper is quite explicit about how it uses Lucene and Semantic Vectors.
--Glen