Wednesday, July 29, 2009

Project Torngat: Building Large-Scale Semantic 'Maps of Science' with LuSql, Lucene, Semantic Vectors, R and Processing from Full-Text

Project Torngat is a research project here at NRC-CISTI [Note that I am no longer at CISTI and that I am now continuing this work at Carleton University - GN 2010 04 07] that aims to use the full text of journal articles to construct semantic journal maps. Among other things, article search results can be projected onto the map to visualize them and to support interactive exploration and discovery of related articles, terms and journals.

Starting with 5.7 million full-text articles from 2200+ journals (mostly science, technology and medical (STM)), and using LuSql, Lucene, Semantic Vectors, R, and Processing, a two-dimensional mapping of a 512-dimensional semantic space was created, which revealed an excellent correspondence with the 23 human-created journal categories:


Semantic Journal Space of 2231 Journals
Scaled to Two Dimensions

This initial work was undertaken to find a technique that would scale. Follow-up work is looking at integrating it with a search interface, and evaluating whether better structure is revealed within semantic journal mappings of single categories.
This may be the first time full text at such a large scale has been used in this fashion, without the help of article metadata.

Try out the prototype (needs Java in the browser) [the site appears to be down right now], which displays journals in the 2-D space.

How it was done:
  1. Using a custom LuSql filter, for each of 2231 journals, concatenate the full text of all of a journal's articles into a single document.
  2. Using LuSql, create a Lucene index of all the journal documents (took ~14hrs, highly multithreaded on multicore, resulting in 43GB index)
  3. Using Semantic Vectors BuildIndex, create a docvector index of all journal documents, with 512 dimensions (58 minutes, 3.4GB index)
  4. Using Semantic Vectors Search, find the cosine distance between all journal documents (8 minutes)
  5. Build journal-journal distance matrix
  6. Use R's multidimensional scaling (MDS) to scale distance matrix to 2-D
  7. Build visualization using Processing
NB: all of the above software is Open Source.
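
For readers who want to see the shape of steps 4 to 6, here is a minimal sketch in plain Python rather than the Semantic Vectors and R tooling actually used. It assumes the document vectors are already available as lists of floats, takes "cosine distance" to mean 1 minus cosine similarity, and assumes classical (Torgerson) MDS, which is what R's `cmdscale` computes; the post does not say which R function was used. All function names here are made up for illustration, and the power iteration is only sensible for toy-sized inputs.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Step 5: symmetric journal-journal distance matrix."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = cosine_distance(vectors[i], vectors[j])
    return d


def top_eigenpair(m, iterations=500):
    """Power iteration: largest-magnitude eigenvalue and eigenvector of symmetric m."""
    n = len(m)
    v = [1.0 + 0.01 * i for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v


def classical_mds_2d(d):
    """Step 6: classical MDS of distance matrix d down to 2-D coordinates."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        lam, v = top_eigenpair(b)
        scale = math.sqrt(max(lam, 0.0))  # negative eigenvalues contribute nothing
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


if __name__ == "__main__":
    # Three made-up "journal" vectors: the first two are deliberately similar.
    toy_vectors = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [0.0, 1.0, 0.0]]
    for point in classical_mds_2d(distance_matrix(toy_vectors)):
        print(point)
```

At 2231 journals the distance matrix has about 5 million entries, so a real run would lean on an optimized eigen-solver (as R does) rather than a hand-rolled one like this.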

You can read more about it in the preprint:
Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009.



Thanks to my collaborators, Alison Callahan and Michel Dumontier, Carleton University.

Built with Open Source Software.

Tuesday, July 21, 2009

Springer LNCS, or, How not to do alerts!

I subscribe to Springer Lecture Notes in Computer Science (LNCS) alerts. Several times now, I have received alerts when there was no web content at the URLs that they sent me. Very annoying, and a waste of my time. Please, this is 2009: try to make things work!

The latest was yesterday: at Mon, 20 Jul 2009 14:53:35 -0700 (PDT) I got an alert email from Springer LNCS:

Dear Glen Newton,

We are pleased to deliver your requested table of contents alert
for a new volume of "Lecture Notes in Computer Science",
subseries: "Lecture Notes in Artificial Intelligence".

Volume 5632: Machine Learning and Data Mining in Pattern Recognition
by Petra Perner
is now available on the SpringerLink web site at
http://springer.r.delivery.net/r/r?2.1.Ee.2Tp.1gRdiL.ByTshW..N.I9y2.3DBm.bW89MQ%5f%5fDJNcFRf0

Going to this page (as of Tues 14:26 ET July 21 2009, ~24hrs later), or to any of the URLs for the articles (including DOIs, like http://dx.doi.org/10.1007/978-3-642-03070-3_5) gives me - No, not an error page - but a BLANK PAGE:


How hard is it to make sure you don't send out alerts with links to web pages until after the linked-to pages actually have the content you are planning on presenting?

And at least try and have a decent error page when you do not have the content in place.

Oh, and 1999 called, they want their web infrastructure back.


[Of course, by the time you read this entry these URLs may be working...]


Update: 2009 07 21 16:03 ET: Now when I go to the above pages, instead of a blank page I get a general launching-off page for SpringerLink:


Still not good enough! :-)

Update: 2009 07 21 22:40 ET:
Now the URLs for the articles work:

but the URLs for the publication do not:

Wednesday, July 15, 2009

Emacs 'mode' and learning `modes`

I've used emacs as my primary editor, (immersive?) environment and de facto almost-OS for about 20 years now. I read and send my email in it (vm), write/run/debug Java in it (JDEE), edit and compile my LaTeX in it, edit all other files with it, sometimes with complex macros where others would use Perl, and interact with shells inside of it. In the past I've edited and debugged C and C++, HTML and XML in various fooML modes. The only other major thing I have running on my workstation is my web browser (and occasionally OpenOffice for reading Word files). Of course, I will have additional emacs windows open on the 3 or 4 servers I am editing and running code on (and also use tramp to transparently edit remote files).

Yes, I've tried Eclipse. I know it quite well: I've even written an Eclipse plugin and published a paper about it (Takaka: Eclipse Image Processing Plug-in). But it does not work for me like emacs does. If Eclipse works for you: that is great. But I don't think it necessarily works for everyone. Emacs + JDEE IS my IDE for Java (and is other things as well).

And I have no interest in starting any editor war ('IDE war'?)

Before I actually tried Eclipse in a significant way, I thought that I was just unwilling to change due to the learning curve of something new and the momentum of the 'known', and that once I tried it I would find it better, like many other people seemed to (and certainly many of the pundits). But when I did invest in making the change, I discovered that was not the reason. It still didn't work as well for me as emacs (where 'work' for me means helping my productivity: my productivity was lower than when using emacs). Again, this is not a criticism of Eclipse.

My theory is that, just as it has been established that different people learn in different ways (Visual/Verbal, Visual/Nonverbal, Auditory/Verbal, Tactile/Kinesthetic), particular modes of human-machine interaction are better suited to some individuals than others.

I am not an HCI expert, so I don't know if the various modes have been as clearly defined, delineated and validated as in learning, but I could imagine at least one HCI mode mapping to the Microsoft Word/NetBeans/Eclipse mode of mouse-oriented, busy GUIs, and another to the emacs mode [I think I've poorly described the attributes of the former and won't attempt to describe the latter].

This all said, I found it quite interesting to find these two recent blog entries today:

"I think that there is some confusion between mastering emacs, and using emacs. You can learn to use emacs in 1/2 an hour. Is that a shockingly long time? Yes. Great design usually makes its uses obvious.

But emacs makes up for that initial investment with accelerating returns --- where Notepad, or even Eclipse, you stop gaining power and knowledge relatively quickly, emacs is like the universe: no matter how long you look at it, there is always more to learn --- and the best part is the more you learn, the faster you can learn more." - comment by Sam Bleckley on the Learn Emacs in Ten Years post
With a little looking around, I found a couple more blog postings that capture some of the emacs-ness of emacs that might be of interest:
XKCD can have the final word (for now) on emacs: