Lucene indexing performance benchmarks for journal article metadata and full-text
Hardware Environment
- Dedicated machine for indexing: yes
- CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores
- RAM: 8GB
- Drive configuration: Dell EMC AX150 storage array fibre channel
Software environment
- Lucene Version: 2.3.1
- Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
- Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
- OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
- Location of index: Filesystem, on attached storage
Lucene indexing variables
- Number of source documents: 6,404,464
- Total filesize of source documents: 141GB. This covers the full text only; the metadata (title, author(s), abstract, keywords, journal name) is in addition to this.
- Average filesize of source documents: 22KB + metadata (see above)
- Source documents storage location: Filesystem
- File type of source documents: text (PDFs converted to text then gzipped)
- Parser(s) used, if any: None, but the text files were gzipped and had to be un-gzipped by the Java application that also did the indexing
- Analyzer(s) used: StandardAnalyzer
- Number of fields per document: 24
- Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored with no parsing
- Index persistence: FSDirectory
- Index size: 83GB
- Number of terms: 143,298,010
Figures
- Time taken (average of at least 3 indexing runs): 20.5 hours
- Time taken / 1000 docs indexed: 11.5 seconds (20.5 hours over 6,404,464 documents)
- Memory consumption: -Xms4000m -Xmx6000m
- Query speed (average time a query takes, not measuring any overhead outside Lucene): <0.01s
Notes
- These are journal articles, so the additional fields besides the full-text are bibliographic metadata, such as title, authors, abstract, keywords, journal name, volume, issue, start page, year.
- Java command line directives: -XX:+AggressiveOpts -XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m -Xmx6000m
- Highly multithreaded and pipelined architecture using java.util.concurrent.ThreadPoolExecutor
- Reading files from the file system and un-gzipping them were both performed multithreaded.
- Eight separate parallel IndexWriters were fed by the pipeline (creation of Document objects occurred in parallel with 64 threads); the resulting indexes were merged into a single index at the very end.
- Each parallel index had a slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB respectively), so that the flushes would not all happen at the same time.
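The shape of the pipeline described in these notes can be sketched roughly as follows. This is not the benchmark's actual code, only a minimal illustration using the JDK alone. WriterSlot is a hypothetical stand-in for an IndexWriter (in Lucene 2.3 the buffer size would be set with IndexWriter.setRAMBufferSizeMB), the round-robin assignment of documents to writers is an assumption, and the final merge of the eight indexes is left out. It uses Java 8 syntax for brevity, whereas the benchmark itself ran on Java 1.6.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class IndexingPipelineSketch {

    /** Hypothetical stand-in for one parallel IndexWriter; NOT a Lucene class. */
    static final class WriterSlot {
        final double ramBufferMb;   // real code would hand this to the writer's RAM buffer setting
        final AtomicInteger docs = new AtomicInteger();

        WriterSlot(double ramBufferMb) { this.ramBufferMb = ramBufferMb; }

        // Real code would build a Lucene Document from the text and add it to the index.
        void add(String text) { docs.incrementAndGet(); }
    }

    /** Decompresses one gzipped text file held in memory. */
    static String gunzip(byte[] gz) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Helper used only to fabricate demo input. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Fans documents out over a thread pool; returns the document count per slot. */
    static int[] run(List<byte[]> gzippedDocs, double[] bufferSizesMb, int threads) throws Exception {
        WriterSlot[] slots = new WriterSlot[bufferSizesMb.length];
        for (int i = 0; i < slots.length; i++) slots[i] = new WriterSlot(bufferSizesMb[i]);

        ExecutorService pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (int i = 0; i < gzippedDocs.size(); i++) {
                final byte[] gz = gzippedDocs.get(i);
                final WriterSlot slot = slots[i % slots.length];   // round-robin (assumption)
                Callable<Void> task = () -> { slot.add(gunzip(gz)); return null; };
                pending.add(pool.submit(task));
            }
            for (Future<Void> f : pending) f.get();   // wait; rethrows any task failure
        } finally {
            pool.shutdown();
        }
        int[] counts = new int[slots.length];
        for (int i = 0; i < slots.length; i++) counts[i] = slots[i].docs.get();
        return counts;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> docs = new ArrayList<>();
        for (int i = 0; i < 16; i++) docs.add(gzip("full text of article " + i));
        double[] staggered = {64, 67, 70, 73, 76, 79, 83, 85};   // MB, as listed in the notes
        int[] counts = run(docs, staggered, 4);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("writer " + i + " (" + staggered[i] + " MB): " + counts[i] + " docs");
        }
    }
}
```

The demo uses 4 threads and 16 tiny documents only to keep it quick to run; the benchmark used 64 threads for Document creation. Staggering the buffer sizes means each writer reaches its flush threshold after a different amount of buffered data, which is the effect the last note describes.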
Comments
Did you also use RAMDirectory during the test, or only FSDirectory?
Thanks,