Lucene indexing performance benchmarks for journal article metadata and full-text

I posted these journal article metadata and full-text Lucene indexing benchmarks to the Lucene user mailing list using the suggested XML format, but it seems that was not the proper thing to do. One of the list members (Cass Costello) converted it to HTML (thanks :-) ). I've decided to give it a permanent home here. If you have any questions, just let me know. I have other benchmarks, covering more records (~25 million) but only article metadata rather than full-text, that I will post later. The loader that does all of this was developed as part of my Ungava project.

    Hardware Environment

  • Dedicated machine for indexing: yes
  • CPU: Dual-processor, dual-core Xeon 3.00GHz; hyperthreading ON (8 virtual cores)
  • RAM: 8GB
  • Drive configuration: Dell EMC AX150 Fibre Channel storage array

    Software Environment

  • Lucene Version: 2.3.1
  • Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
  • Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
  • OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
  • Location of index: Filesystem, on attached storage

    Lucene Indexing Variables

  • Number of source documents: 6,404,464
  • Total filesize of source documents: 141GB. Note that this is only the full-text; the metadata (title, author(s), abstract, keywords, journal name) is in addition to this

  • Average filesize of source documents: 22KB + metadata (see above)

  • Source documents storage location: Filesystem

  • File type of source documents: text (PDFs converted to text then gzipped)

  • Parser(s) used, if any: None; the text files were gzipped and had to be un-gzipped by the Java application that also did the indexing (see the sketch after this list)

  • Analyzer(s) used: StandardAnalyzer

  • Number of fields per document: 24

  • Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored without tokenizing (see the sketch after this list)

  • Index persistence: FSDirectory

  • Index size: 83GB

  • Number of terms: 143,298,010
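
For concreteness, here is a minimal sketch of what the per-document work looks like with the Lucene 2.3 Field API: un-gzipping a text file (no parser needed, since the PDFs were already converted to plain text) and building a Document with the field characteristics listed above. The field names and helper class are hypothetical, and the real loader handles 24 fields, not 4.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ArticleDocumentSketch {

        // No parser is needed: the PDFs were already converted to plain
        // text, so the only work is un-gzipping the file into a String.
        static String readGzippedText(File f) throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(f)), "UTF-8"));
            try {
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
                return sb.toString();
            } finally {
                in.close();
            }
        }

        // Hypothetical field layout matching the numbers above: full-text
        // is indexed/tokenized with term vectors but not stored; title and
        // abstract are stored and tokenized with term vectors; plain
        // metadata fields are stored without tokenizing.
        static Document makeDocument(String fullText, String title,
                                     String abstractText, String journal) {
            Document doc = new Document();
            doc.add(new Field("fulltext", fullText,
                    Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
            doc.add(new Field("title", title,
                    Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
            doc.add(new Field("abstract", abstractText,
                    Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
            doc.add(new Field("journal", journal,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }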

    Figures

  • Time taken (average of at least 3 indexing runs): 20.5 hours

  • Time taken / 1000 docs indexed: 11.5 seconds

  • Memory consumption: -Xms4000m -Xmx6000m

  • Query speed: < 0.01s average per query for simple one-term and phrase queries, measuring only Lucene itself with no outside overhead (see the sketch below)
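
For the curious, here is a hedged sketch of how such queries can be timed with the Lucene 2.3 search API (Hits was still current then). The index path and field names are hypothetical:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryTimingSketch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");

            // Simple one-term query against the full-text field.
            Query term = new TermQuery(new Term("fulltext", "lucene"));

            // Phrase query against the title field.
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term("title", "information"));
            phrase.add(new Term("title", "retrieval"));

            for (Query q : new Query[] { term, phrase }) {
                long t0 = System.currentTimeMillis();
                Hits hits = searcher.search(q); // Lucene only, no other overhead
                long elapsed = System.currentTimeMillis() - t0;
                System.out.println(q + ": " + hits.length()
                        + " hits in " + elapsed + " ms");
            }
            searcher.close();
        }
    }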

    Notes

  • These are journal articles, so the additional fields besides the full-text are bibliographic metadata, such as title, authors, abstract, keywords, journal name, volume, issue, start page, year.
  • Java command line directives: -XX:+AggressiveOpts -XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m -Xmx6000m
  • File reading and un-gzipping were performed in multiple threads.
  • Eight separate parallel IndexWriters are fed by the pipeline (Document objects are created in parallel by 64 threads) and merged at the very end into a single index; see the sketch below.
  • Each parallel index had a slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, and 85 MB respectively), so that flushing would not all happen at the same time.
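
Here is a minimal sketch of the eight-writer scheme, using the staggered buffer sizes listed above; the directory names are hypothetical and the 64-thread document pipeline is elided:

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ParallelIndexSketch {
        public static void main(String[] args) throws IOException {
            // Staggered RAM buffer sizes so the eight writers don't all
            // flush to disk at the same time.
            double[] bufferMB = { 64, 67, 70, 73, 76, 79, 83, 85 };

            Directory[] dirs = new Directory[bufferMB.length];
            IndexWriter[] writers = new IndexWriter[bufferMB.length];
            for (int i = 0; i < writers.length; i++) {
                dirs[i] = FSDirectory.getDirectory(new File("index-part-" + i));
                writers[i] = new IndexWriter(dirs[i], new StandardAnalyzer(), true);
                writers[i].setRAMBufferSizeMB(bufferMB[i]);
            }

            // ... 64 worker threads build Documents and spread
            //     writers[i].addDocument(doc) calls across the writers ...

            for (int i = 0; i < writers.length; i++) {
                writers[i].close();
            }

            // Merge the eight partial indexes into the single final index.
            IndexWriter merged = new IndexWriter(
                    FSDirectory.getDirectory(new File("index-final")),
                    new StandardAnalyzer(), true);
            merged.addIndexes(dirs); // merges (and optimizes) the parts
            merged.close();
        }
    }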


Comments

Unknown said…
Thanks for this very useful information. I like this.
Unknown said…
Very cool information.

Did you use RAMDirectory during the test too, or only FSDirectory?

Thanks,
Glen Newton said…
Only RAMDirectory.
Glen Newton said…
Sorry: only _FSDirectory_
