Thursday, May 15, 2008

Minion: possible research alternative to Lucene

I am excited to learn from The Search Guy about the recently released research-oriented full-text engine from Sun Labs called 'Minion'. Minion the Open Source Search:

"One of the secret weapons underlying the Search Inside the Music project and Project Aura is a high quality search engine called Minion. Minion handles everything that has to do with Text for these projects. In addition to traditional search, we use Minion for document similarity (the core technique used for Tagomendations), item clustering, sense disambiguation, classification and autotagging. Minion is a research-oriented search engine - meaning that it is designed to allow for all sorts of variations. It is ultra-configurable and has a simple API. The big news is that the process to open source the Minion engine is underway. Steve Green (aka the search guy) has created a Minion project on Java.net - and soon, the Minion search engine will be available for all."
Right now there is little detailed information on Minion. The reported performance sounds good, but the comparisons are against Lucene 2.0, which is significantly slower than the recent 2.3.x releases.

I am working with some large corpora: 6.4m full-text documents (about 500GB of PDFs), and 25m documents with only titles and authors (see the earlier post Lucene indexing performance benchmarks for journal article metadata and full-text). I will be trying out Minion on these fairly large corpora in the near future, as well as examining features that Lucene does not offer, such as document similarity, item clustering, sense disambiguation and classification.

In a rather intriguing turn of events, the Search Guy is Stephen Green, whose father worked in the 1970s on a search system called CAN/OLE at NRC-CISTI (where I now work, doing research on digital library IR and other things)! Quite amazing! I hope I can get to talk to Stephen about Minion in the near future.

More info on Minion:
Related (it uses Minion) is another promising project called Aura, described as a "recommender for the rest of us", which sounds somewhat Seinfeldian to me.


Update 2008 09 03: The Minion Search Engine: Search, Text Similarity And Tag Gardening, 2008 JavaOne presentation.