Simultaneous (Threaded) Query Lucene Performance

I've recently had to do some performance query tests on Lucene (v2.3.1) under concurrent request load.

These benchmarks were run on the same machine, VM, and OS described in my earlier Lucene indexing performance benchmarks; however, the index is a little different: it is an index of metadata (title, author, journal name, keywords, etc.; no full-text) for 25.6M journal articles. The index size is 19GB and -- using the same framework as the previous benchmarks -- the indexing time was 4.25 hours. YMMV.

Using a set of 2900 user queries (ranging from single-word queries to queries with >600 characters, using multiple fields and operators; no range queries), Lucene was pre-warmed with 2000 (different) queries. Ten runs were performed and averaged.

Below are the results plotting #requests per second handled vs. #threads making requests. This was all run in the same VM, using an instance of java.util.concurrent.ThreadPoolExecutor to parallelize things:


The best results were for 6 or 7 threads. It is interesting how the throughput flattens out at around 32 threads and stays steady until ~1024 threads.
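
To make the setup concrete, below is a minimal sketch of how such a threaded query harness might look, assuming Lucene 2.3-era APIs, one IndexSearcher shared across all worker threads, and a fixed-size java.util.concurrent.ThreadPoolExecutor. The index path, default field name ("all"), and class name are placeholders, not the actual benchmark code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ThreadedQueryBenchmark {
        public static void main(String[] args) throws Exception {
            final int nThreads = Integer.parseInt(args[0]);       // e.g. 6
            final List<String> queries = loadQueries(args[1]);    // one query string per line

            // One IndexSearcher shared across all threads (IndexSearcher is thread-safe).
            final IndexSearcher searcher = new IndexSearcher("/path/to/index");

            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>());

            long start = System.currentTimeMillis();
            List<Future<Integer>> results = new ArrayList<Future<Integer>>();

            for (final String q : queries) {
                results.add(pool.submit(new Callable<Integer>() {
                    public Integer call() throws Exception {
                        // QueryParser is not thread-safe, so build one per task.
                        QueryParser parser = new QueryParser("all", new StandardAnalyzer());
                        Query query = parser.parse(q);
                        Hits hits = searcher.search(query);
                        return hits.length();
                    }
                }));
            }

            for (Future<Integer> f : results) {
                f.get();  // wait for every query to complete
            }
            double seconds = (System.currentTimeMillis() - start) / 1000.0;
            pool.shutdown();
            System.out.println(queries.size() / seconds + " requests/second with "
                    + nThreads + " threads");
        }

        private static List<String> loadQueries(String file) throws Exception {
            List<String> queries = new ArrayList<String>();
            java.io.BufferedReader in = new java.io.BufferedReader(new java.io.FileReader(file));
            for (String line; (line = in.readLine()) != null; ) {
                queries.add(line.trim());
            }
            in.close();
            return queries;
        }
    }

Throughput is then simply the number of submitted queries divided by the elapsed wall-clock time, measured separately for each thread-pool size.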

Of course, we were interested in the wait times of end users, so below I've plotted the average wait time per user. It is calculated as:
(#threads-making-requests / #handled-requests-per-second) * 0.5
It is an approximation, of course, but good enough to get a general idea.
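For example (with made-up numbers, purely to illustrate the formula): if 64 threads are making requests and the system is handling 256 requests per second, the estimated average wait is (64/256)*0.5 = 0.125 seconds. The 0.5 factor reflects that, on average, a request waits for roughly half of the outstanding requests ahead of it.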

As you can see below, the average wait times of requests are:
  • less than 0.08 seconds for up to 32 requests / second
  • less than 0.5 seconds for up to 192 requests / second
  • less than 1 second for up to 256 requests / second
  • less than 2 seconds for up to 768 requests / second
Not too shabby!


I am running a much larger query set of ~900k queries as we speak, but I don't think it will be finished for another day or so. I will post the results when they are ready, although preliminary results suggest that performance on this query set will be poorer (probably due to the nature of the queries: many prefix/wildcard terms such as "b*").

I am going to clean up the code that does this testing and release it in the next week or so.

PS. The plots were done using gnuplot. Thanks, gnuplot!

Update 2008-06-10: As pointed out in some follow-ups to my original posting of these benchmarks on the Lucene User list, I left out some details:
  • The index format was the compound format (see the sketch after this list).
  • No command-line arguments were passed to the Java VM.
  • One IndexSearcher is shared across all threads.
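
For reference, here is a minimal sketch of how the compound file format can be set explicitly when building such an index, assuming Lucene 2.3-era APIs; the index path and class name are placeholders, not the actual indexing code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class CompoundFormatExample {
        public static void main(String[] args) throws Exception {
            // Placeholder path; create a new index with the standard analyzer.
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            // Store each segment as a single compound (.cfs) file -- the "compound format".
            writer.setUseCompoundFile(true);
            writer.close();
        }
    }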

Comments

Anonymous said…
Hi, your benchmark is very interesting. I have some questions: did you compare with another corpus (bigger than this one, maybe)? I'm curious whether different numbers of documents (and Lucene index sizes) have an effect on the results -- in other words, how the benchmark trends.

thanks Paolo Marocco
paolo.marocco@fastwebnet.it
Glen Newton said…
Paolo,

No, I haven't done this, but I will be re-doing some benchmarks to try out the new Lucene v2.9, so when I do, maybe I will try to use several corpora of different sizes.

thanks,
Glen
Anonymous said…
Thanks very much, I discovered this yesterday evening. Good blog; I have linked to it.
Paolo
