Lucene indexing performance benchmarks for journal article metadata and full-text

April 16, 2008

I posted these journal article metadata and full-text Lucene indexing benchmarks to the Lucene user mailing list using the suggested XML format, but that does not seem to have been the proper thing to do. One of the list members (Cass Costello) converted them to HTML (thanks :-) ), and I've decided to give them a permanent home here. If you have any questions, just let me know. I will be posting some other benchmarks with more records (~25 million), but covering only article metadata, not full-text. The loader that does all of this was developed as part of my Ungava project.

Hardware environment

Dedicated machine for indexing: yes

CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores

RAM: 8GB

Drive configuration: Dell EMC AX150 storage array fibre channel

Software environment

Lucene Version: 2.3.1

Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)

Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)

OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)

Location of index: Filesystem, on attached storage

Lucene indexing variables

Number of source documents: 6,404,464

Total filesize of source documents: 141GB. Note that this is only the full-text; the metadata (title, author(s), abstract, keywords, journal name) is in addition to this.

Average filesize of source documents: 22KB + metadata (see above)

Source documents storage location: Filesystem

File type of source documents: text (PDFs converted to text then gzipped)

Parser(s) used, if any: None, but the text files were gzipped and had to be un-gzipped by the Java application that also did the indexing

Analyzer(s) used: StandardAnalyzer

Number of fields per document: 24

Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored with no parsing

Index persistence: FSDirectory

Index size: 83GB

Number of terms: 143,298,010

Figures

Time taken (average of at least 3 indexing runs): 20.5 hours

Time taken / 1000 docs indexed: 11.5 seconds

Memory consumption: -Xms4000m -Xmx6000m

Query speed (average time a query takes, not measuring any overhead outside Lucene): <0.01s
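
As a sanity check, the per-1000-documents figure can be recomputed from the total time and the document count reported above. The small Java sketch below does that arithmetic and also derives an overall documents-per-second rate, which is not stated in the original figures. The class and method names are made up for illustration.

```java
import java.util.Locale;

class IndexingThroughput {

    // Seconds needed per 1000 documents, given total wall-clock hours and document count.
    static double secondsPerThousandDocs(double hours, long docs) {
        return hours * 3600.0 / docs * 1000.0;
    }

    // Average number of documents indexed per second over the whole run.
    static double docsPerSecond(double hours, long docs) {
        return docs / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double hours = 20.5;       // total time taken, from the figures above
        long docs = 6404464L;      // number of source documents

        // Prints roughly 11.5 seconds per 1000 docs and roughly 86.8 docs per second.
        System.out.println(String.format(Locale.ROOT,
                "%.1f s / 1000 docs", secondsPerThousandDocs(hours, docs)));
        System.out.println(String.format(Locale.ROOT,
                "%.1f docs / s", docsPerSecond(hours, docs)));
    }
}
```
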

Notes

These are journal articles, so the additional fields besides the full-text are bibliographic metadata, such as title, authors, abstract, keywords, journal name, volume, issue, start page, year.

Java command line directives: -XX:+AggressiveOpts -XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m -Xmx6000m

Highly multithreaded & pipelined architecture using java.util.concurrent.ThreadPoolExecutor

File reading from the file system and un-gzipping are performed in multiple threads.
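
For readers unfamiliar with un-gzipping inside the JVM, here is a minimal sketch of the idea using only java.util.zip. It is not the actual loader code: the class name, the helper methods, and the UTF-8 assumption are all illustrative, and it is written for a current JDK rather than the Java 6 used in the benchmark. In a multithreaded setup, each worker thread would call something like readGzippedText on its own file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipText {

    // Decompress a gzipped stream and return its content as text (UTF-8 is an assumption).
    static String readGzippedText(InputStream in) {
        try (Reader reader = new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[8192];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to produce gzipped sample data for the demo below.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Round trip in memory; a real loader would open a FileInputStream instead.
        byte[] compressed = gzip("full-text of one article");
        System.out.println(readGzippedText(new ByteArrayInputStream(compressed)));
    }
}
```
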

Eight separate parallel IndexWriters are fed by the pipeline (creation of Document objects occurs in parallel with 64 threads); the eight resulting indexes are merged at the very end into a single index.

Each parallel index had a slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB respectively), so that the flushes wouldn’t all happen at the same time.
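
The overall shape of the pipeline described in these notes can be sketched with plain java.util.concurrent, without any Lucene dependency. This is an illustration under stated assumptions, not the Ungava loader itself: PartialIndex is a hypothetical stand-in for an IndexWriter and its index, the flush thresholds are counted in characters rather than megabytes, the round-robin assignment of documents to writers is a guess, and the "merge" is just list concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelIndexPipeline {

    /** Hypothetical stand-in for one IndexWriter and the partial index it builds. */
    static final class PartialIndex {
        private final int flushThreshold;   // plays the role of the RAM buffer size
        private final List<String> buffer = new ArrayList<>();
        private final List<String> flushed = new ArrayList<>();
        private int bufferedChars = 0;

        PartialIndex(int flushThreshold) { this.flushThreshold = flushThreshold; }

        // Buffer a document; flush once this index's own threshold is reached.
        synchronized void add(String doc) {
            buffer.add(doc);
            bufferedChars += doc.length();
            if (bufferedChars >= flushThreshold) flush();
        }

        synchronized void flush() {
            flushed.addAll(buffer);
            buffer.clear();
            bufferedChars = 0;
        }

        // Flush whatever is still buffered and return everything written so far.
        synchronized List<String> close() {
            flush();
            return new ArrayList<>(flushed);
        }
    }

    static List<String> run(List<String> sources, int[] flushThresholds, int builderThreads) {
        PartialIndex[] parts = new PartialIndex[flushThresholds.length];
        for (int i = 0; i < parts.length; i++) parts[i] = new PartialIndex(flushThresholds[i]);

        // Fixed-size pool of document builders (64 threads in the benchmark).
        ThreadPoolExecutor builders = new ThreadPoolExecutor(
                builderThreads, builderThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        AtomicInteger next = new AtomicInteger();
        for (String source : sources) {
            builders.execute(() -> {
                // Stand-in for building a Document from the source text.
                String doc = source.trim().toLowerCase(Locale.ROOT);
                // Round-robin over the partial indexes (an assumption).
                int target = Math.floorMod(next.getAndIncrement(), parts.length);
                parts[target].add(doc);
            });
        }
        builders.shutdown();
        try {
            if (!builders.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("document builders did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for builders", e);
        }

        // Final step: merge all partial indexes into one.
        List<String> merged = new ArrayList<>();
        for (PartialIndex part : parts) merged.addAll(part.close());
        return merged;
    }

    public static void main(String[] args) {
        List<String> sources = new ArrayList<>();
        for (int i = 0; i < 1000; i++) sources.add("  Article " + i + "  ");
        int[] thresholds = {64, 67, 70, 73, 76, 79, 83, 85};   // staggered, as in the notes
        System.out.println(run(sources, thresholds, 64).size() + " documents in merged index");
    }
}
```

Because each PartialIndex has a different threshold, the flushes drift apart over time instead of all landing at once, which is the point of the staggered buffer sizes. The order of documents in the merged result is not deterministic, since the builder threads race each other.
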