Hardware Environment
- Dedicated machine for indexing: yes
- CPU: Dual-processor, dual-core Xeon, 3.00GHz; hyperthreading on, giving 8 virtual cores
- RAM: 8GB
- Drive configuration: Dell EMC AX150 storage array fibre channel
Software environment
- Lucene Version: 2.3.1
- Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
- Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
- OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
- Location of index: Filesystem, on attached storage
Lucene indexing variables
- Number of source documents: 6,404,464
- Total filesize of source documents: 141GB. This covers only the full text; the metadata (title, author(s), abstract, keywords, journal name) is in addition to this
- Average filesize of source documents: 22KB + metadata (see above)
- Source documents storage location: Filesystem
- File type of source documents: text (PDFs converted to text, then gzipped)
- Parser(s) used, if any: None, but the text files were gzipped and had to be un-gzipped by the Java application that also did the indexing
- Analyzer(s) used: StandardAnalyzer
- Number of fields per document: 24
- Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored with no parsing
- Index persistence: FSDirectory
- Index size: 83GB
- Number of terms: 143,298,010
Figures
- Time taken (in ms/s as an average of at least 3 indexing runs): 20.5 hours
- Time taken / 1000 docs indexed: 11.5 seconds
- Memory consumption: -Xms4000m -Xmx6000m
- Query speed: average time a query takes, type of queries (e.g. simple one-term query, phrase query), not measuring any overhead outside Lucene: <0.01s
Notes
- These are journal articles, so the additional fields besides the full-text are bibliographic metadata, such as title, authors, abstract, keywords, journal name, volume, issue, start page, year.
- Java command line directives: -XX:+AggressiveOpts -XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m -Xmx6000m
- Highly multithreaded & pipelined architecture using java.util.concurrent.ThreadPoolExecutor
- Reading files from the file system and un-gzipping them were both multithreaded.
- Eight separate parallel IndexWriters are fed by the pipeline (creation of Document objects occurs in parallel with 64 threads); the eight indexes are merged at the very end into a single index.
- Each parallel index had a slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB respectively), so that flushing wouldn’t all happen at the same time.
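
The pipeline described in these notes could look roughly like the sketch below. It is not the code used for this benchmark: it uses only the JDK, and the hypothetical WriterStub class stands in for a Lucene IndexWriter so the example runs without Lucene on the classpath. The class and method names, the round-robin routing, the bounded queue and the CallerRunsPolicy are all assumptions made for illustration; only the ThreadPoolExecutor, the un-gzip step, the eight writers and the staggered buffer sizes come from the notes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ParallelIndexSketch {
    // RAM buffer sizes (MB) quoted in the notes, one per parallel writer.
    static final int[] RAM_BUFFER_MB = {64, 67, 70, 73, 76, 79, 83, 85};

    /** Hypothetical stand-in for one index writer: it only counts documents. */
    static final class WriterStub {
        final int ramBufferMb; // a real writer would flush when this buffer fills
        final AtomicInteger docs = new AtomicInteger();
        WriterStub(int ramBufferMb) { this.ramBufferMb = ramBufferMb; }
        void addDocument(String text) { docs.incrementAndGet(); }
    }

    /** Helper used only to build gzipped sample input. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Un-gzips one document; this is the step that runs on the worker threads. */
    static String gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                bytes.write(chunk, 0, n);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Decompresses every document on a thread pool and spreads them over the writers. */
    static int[] run(List<byte[]> gzippedDocs, int threads) throws Exception {
        WriterStub[] writers = new WriterStub[RAM_BUFFER_MB.length];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new WriterStub(RAM_BUFFER_MB[i]);
        }
        // Bounded queue plus CallerRunsPolicy gives simple back-pressure. This is an
        // assumption: the post does not say how its pool was configured.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(256),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger next = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (byte[] gz : gzippedDocs) {
                pending.add(pool.submit(() -> {
                    String text = gunzip(gz);
                    // Round-robin routing (assumed) keeps the writers evenly loaded.
                    int slot = next.getAndIncrement() % writers.length;
                    writers[slot].addDocument(text);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // surfaces any failure from a worker thread
            }
        } finally {
            pool.shutdown();
        }
        // The real run merged the eight indexes here; the sketch just reports counts.
        int[] counts = new int[writers.length];
        for (int i = 0; i < writers.length; i++) {
            counts[i] = writers[i].docs.get();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> docs = new ArrayList<>();
        for (int i = 0; i < 64; i++) {
            docs.add(gzip("full text of document " + i));
        }
        int[] counts = run(docs, 8);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("writer " + i + " (" + RAM_BUFFER_MB[i]
                    + " MB buffer): " + counts[i] + " docs");
        }
    }
}
```

In a real run each stub would be an IndexWriter on its own FSDirectory, configured with the matching RAM buffer size, and the thread count would be a tuning parameter (the notes mention 64 threads for Document creation).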


4 comments:
Thanks for this very useful information. I like this.
Very cool information.
Did you use RAMDirectory during the test too, or only FSDirectory?
Thanks,
Only RAMDirectory.
Sorry: only _FSDirectory_