I posted these Lucene indexing benchmarks for journal article metadata and full text to the Lucene user mailing list using the suggested XML format, but it seems that was not the proper thing to do. One of the list members (Cass Costello) converted it to HTML (thanks :-) ). I've decided to give it a permanent home here. If you have any questions, just let me know. I will be posting some other benchmarks with more records (~25 million), but covering only article metadata, not full text. The loader that does all of this was developed as part of my Ungava project.
- Dedicated machine for indexing: yes
- CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores
- RAM: 8GB
- Drive configuration: Dell EMC AX150 storage array fibre channel
- Lucene Version: 2.3.1
- Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
- Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
- OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
- Location of index: Filesystem, on attached storage
Lucene indexing variables
- Number of source documents: 6,404,464
- Total filesize of source documents: 141GB. Note that this is only the full text; the metadata (title, author(s), abstract, keywords, journal name) is in addition to this
- Average filesize of source documents: 22KB + metadata (see above)
- Source documents storage location: Filesystem
- File type of source documents: text (PDFs converted to text then gzipped)
- Parser(s) used, if any: None, but the text files were gzipped and had to be un-gzipped by the Java application that also did the indexing
- Analyzer(s) used: StandardAnalyzer
- Number of fields per document: 24
- Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored with no parsing
- Index persistence: FSDirectory
- Index size: 83GB
- Number of terms: 143,298,010
- Time taken (in ms/s as an average of at least 3 indexing runs): 20.5 hours
- Time taken / 1000 docs indexed: 11.5 seconds
- Memory consumption: -Xms4000m -Xmx6000m
- Query speed (average time a query takes; type of queries, e.g. simple one-term query, phrase query; not measuring any overhead outside Lucene): < 0.01 s
- These are journal articles, so the additional fields besides the full-text are bibliographic metadata, such as title, authors, abstract, keywords, journal name, volume, issue, start page, year.
- Java command line directives: -XX:+AggressiveOpts -XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m -Xmx6000m
- Highly multithreaded & pipelined architecture using java.util.concurrent.ThreadPoolExecutor
- Reading files from the file system and un-gzipping them are both multithreaded.
- Eight separate parallel IndexWriters are fed by the pipeline (creation of Document objects occurs in parallel with 64 threads); the resulting indexes are merged at the very end into a single index.
- Each parallel index had a slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB respectively), so that flushing wouldn’t all happen at the same time.
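
The notes above describe the general shape of the loader, and a rough sketch may make it easier to picture. What follows is not the Ungava loader itself: it is a minimal stand-alone illustration using only the JDK, under several assumptions. The class name ParallelIndexSketch, the SubIndex class (a hypothetical stand-in for one Lucene IndexWriter), the helper methods, and the round-robin assignment of documents to writers are all invented for this example. It shows a thread pool un-gzipping documents in parallel and feeding eight sub-indexes that each carry one of the staggered RAM buffer sizes listed above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ParallelIndexSketch {

    /** Hypothetical stand-in for one IndexWriter and the sub-index it builds. */
    static final class SubIndex {
        final double ramBufferMb;  // a real loader would hand this to its IndexWriter
        final AtomicInteger docCount = new AtomicInteger();

        SubIndex(double ramBufferMb) { this.ramBufferMb = ramBufferMb; }

        // Only counts documents; a real writer would analyze and index the text.
        void addDocument(String text) { docCount.incrementAndGet(); }
    }

    /** Helper used to build gzipped sample input. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Decompresses one gzipped document into a string. */
    public static String gunzip(byte[] gz) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes.write(buf, 0, n);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Un-gzips documents on a thread pool and returns the document count per sub-index. */
    public static int[] indexAll(List<byte[]> gzippedDocs, double[] ramBuffersMb, int threads)
            throws Exception {
        SubIndex[] subs = new SubIndex[ramBuffersMb.length];
        for (int i = 0; i < subs.length; i++) {
            subs[i] = new SubIndex(ramBuffersMb[i]);
        }
        // Executors.newFixedThreadPool is backed by a ThreadPoolExecutor.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < gzippedDocs.size(); i++) {
                final byte[] gz = gzippedDocs.get(i);
                final SubIndex target = subs[i % subs.length];  // round-robin: an assumption
                pending.add(pool.submit(() -> {
                    target.addDocument(gunzip(gz));
                    return null;  // returning a value makes this a Callable, so IOException is allowed
                }));
            }
            for (Future<?> f : pending) {
                f.get();  // wait for completion and surface any worker failure
            }
        } finally {
            pool.shutdown();
        }
        // The final merge of the sub-indexes into a single index is not modelled here.
        int[] counts = new int[subs.length];
        for (int i = 0; i < subs.length; i++) {
            counts[i] = subs[i].docCount.get();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> docs = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            docs.add(gzip("full text of article " + i));
        }
        double[] buffers = {64, 67, 70, 73, 76, 79, 83, 85};  // sizes from the notes above
        System.out.println(Arrays.toString(indexAll(docs, buffers, 64)));
    }
}
```

In a real Lucene 2.3 loader, each SubIndex would presumably wrap an IndexWriter configured through setRAMBufferSizeMB, and the sub-indexes would be combined at the end with one of the IndexWriter addIndexes methods. Staggering the buffer sizes means the writers reach their flush thresholds at different points, which is the effect the last note is after.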