Starting with 5.7 million full-text articles from 2200+ journals (mostly scientific, technical and medical (STM)), and using LuSql, Lucene, Semantic Vectors, R, and Processing, a two-dimensional mapping of a 512-dimension semantic space was created. The mapping revealed an excellent correspondence with the 23 human-created journal categories.
The goal of this initial work was to find a technique that would scale. Follow-up work is looking at integrating it with a search interface and evaluating whether better structure is revealed within semantic journal mappings of single categories.
This may be the first time full text at such a large scale has been used in this fashion, without the help of article metadata.
Try out the prototype (requires Java in the browser), which displays journals in the 2-D space [the site appears to be down right now].
How it was done:
- Using a custom LuSql filter, for each of 2231 journals, concatenate the full text of all of a journal's articles into a single document.
- Using LuSql, create a Lucene index of all the journal documents (took ~14 hrs, highly multithreaded on multicore, resulting in a 43 GB index)
- Using Semantic Vectors BuildIndex, create a docvector index of all journal documents, with 512 dimensions (58 minutes, 3.4GB index)
- Using Semantic Vectors Search, find the cosine distance between all journal documents (8 minutes)
- Build journal-journal distance matrix
- Use R's multidimensional scaling (MDS) to scale distance matrix to 2-D
- Build visualization using Processing
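To make the last few steps concrete, here is a minimal sketch in Python (not the R code actually used) of going from document vectors to a journal-journal distance matrix and then to 2-D coordinates. It assumes the distance is one minus the cosine similarity, and it uses classical (Torgerson) MDS, the method behind R's `cmdscale`; whether the original work used that particular MDS variant is an assumption. The journal labels, the toy 4-dimension vectors, and all function names are hypothetical, and the power-iteration eigensolver is only suitable for small examples.

```python
import math
import random

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Build the symmetric journal-journal distance matrix."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])
    return dist

def top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration on a symmetric matrix: returns the eigenvalue of
    largest magnitude (as a Rayleigh quotient) and a unit eigenvector."""
    n = len(matrix)
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    vec = [rng.random() - 0.5 for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # matrix maps vec to zero; nothing left to extract
            break
        vec = [x / norm for x in nxt]
    mv = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    value = sum(vec[i] * mv[i] for i in range(n))
    return value, vec

def classical_mds(dist, dims=2):
    """Classical MDS: double-center the squared distances, then use the
    leading eigenvectors, scaled by sqrt(eigenvalue), as coordinates."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, vec = top_eigenpair(b)
        # A non-positive eigenvalue carries no usable dimension here.
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= value * vec[i] * vec[j]
    return coords

# Hypothetical journals with toy document vectors (the real ones had 512 dimensions).
journals = {
    "Journal A": [0.9, 0.1, 0.0, 0.2],
    "Journal B": [0.8, 0.2, 0.1, 0.1],
    "Journal C": [0.1, 0.9, 0.3, 0.0],
    "Journal D": [0.0, 0.2, 0.9, 0.4],
}
names = list(journals)
xy = classical_mds(distance_matrix([journals[name] for name in names]))
for name, (x, y) in zip(names, xy):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

For a couple of thousand journals the pure-Python loops above would be slow; a numerical library's eigensolver, or R's `cmdscale` applied directly to the distance matrix, would be the more practical choice.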
You can read more about it in the preprint:
Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009.
Thanks to my collaborators, Alison Callahan and Michel Dumontier, Carleton University.
Built with Open Source Software.