Thursday, June 26, 2008

If you only read one article on Cyberinfrastructure...

...it should probably be this one:

Jelinkova, K., Carvalho, T., Kerian, D., Knosp, B., Percival, K., & Yagi, S. (2008). Creating a Five-Minute Conversation About Cyberinfrastructure. EDUCAUSE Quarterly, 31(2), 78-82.

Thanks Roy Tennant et al., Current Cites June 2008.

Thursday, June 19, 2008

Re-reading "Gödel, Escher, Bach: An Eternal Golden Braid"

I have decided to re-read Douglas Hofstadter's "Gödel, Escher, Bach: An Eternal Golden Braid". When I first read it - 25+ years ago - it significantly changed how I looked at many things in the world, and I would describe it as a seminal book in my development.

What I am wondering is whether I will have new revelations on re-reading it (I am quite sure I will), as both I and the world around me have changed, and what they might be.

I am also curious what others have experienced in re-reading, later in life, books that were personally seminal early in life, and how they interpreted the nature of the new revelations.

And I am sure Dr. Seuss is one author many of us have re-read and re-interpreted many times! :-)

Wednesday, June 11, 2008

Lucene concurrent search performance with 1, 2, 4, and 8 IndexReaders

My last Lucene evaluation (Simultaneous (Threaded) Query Lucene Performance) from a couple of days ago looked at concurrent (multithreaded) queries using a single IndexReader shared across all threads. Due to suggestions/demand from the Lucene User mailing list, I have expanded the evaluation to include multiple IndexReaders.
A single shared IndexReader is known to be a limiting factor in a multithreaded environment, so I ran the same tests with 1, 2, 4, and 8 IndexReaders (more precisely, I create the IndexReaders, wrap each in its own IndexSearcher, and share the IndexSearchers across threads).
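To make the sharing scheme concrete, here is a minimal sketch (not my actual benchmark code) of round-robin selection across N shared searchers. The pool is generic because the Lucene classes are left out; in the benchmark the pooled objects would be IndexSearchers, one per IndexReader, and only the thread-safe selection logic is shown here.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: each request thread grabs the "next" shared
// instance in round-robin order. In the benchmark the pooled objects
// would be Lucene IndexSearchers, each wrapping its own IndexReader.
class RoundRobinPool<T> {
    private final List<T> instances;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinPool(List<T> instances) {
        this.instances = instances;
    }

    // Thread-safe: AtomicInteger hands out a unique counter value per call,
    // so concurrent callers rotate through the instances without locking.
    T acquire() {
        int i = Math.floorMod(next.getAndIncrement(), instances.size());
        return instances.get(i);
    }
}

class RoundRobinDemo {
    public static void main(String[] args) {
        RoundRobinPool<String> pool = new RoundRobinPool<>(
                List.of("searcher-0", "searcher-1", "searcher-2", "searcher-3"));
        for (int i = 0; i < 6; i++) {
            System.out.println(pool.acquire());
        }
        // prints searcher-0 ... searcher-3, then wraps to searcher-0, searcher-1
    }
}
```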
Below are the results. The test environment is the same as in my previous evaluation, except:

  • The thread count goes up to 8192 instead of the original 4096
  • I had to pass -Xmx4000m to the Java VM because it was running out of heap with 8 readers
  • I've made the graph larger

(Click on graphic to see results)

As you can see, 2, 4, and 8 readers significantly improve the query rate over a single shared reader between roughly 10 and 512 threads. The overall winner is 4 readers, which shows a marked improvement over 2 and 8 readers in the range from 16 threads to about 512 threads. Beyond this point, all reader counts perform effectively the same.

I am not sure why 4 readers appear to be the sweet spot for this particular configuration; I will have to re-run this experiment with a finer granularity of reader counts (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, and 16 readers). However, remember that this machine is a dual-CPU, dual-core configuration (4 real cores, hyper-threaded to 8 virtual cores): it may be that matching the number of readers to the number of physical cores improves things, perhaps through less state swapping. I am not an expert; with more evaluations we may be able to comment more intelligently on this.

This is just one data point, but I hope it will be helpful.
I am still planning to release the code for the evaluation and for plotting the results with gnuplot.

I would appreciate any feedback.

This work was done as part of a Lucene evaluation for my employer, CISTI, the National Research Council Canada. I work with Lucene as it is relevant to my research, with an example of some of my Lucene-based research here: Ungava.

Monday, June 09, 2008

Simultaneous (Threaded) Query Lucene Performance

I've recently had to do some performance query tests on Lucene (v2.3.1) under concurrent request load.

These benchmarks were run on the same machine, VM, and OS described in my earlier Lucene indexing performance benchmarks, but the index is a little different: it is an index of title, author, journal name, keyword, and other metadata (no full-text) for 25.6M journal articles. The index size is 19GB, and, using the same framework as the previous benchmarks, the indexing time was 4.25 hours. YMMV.

Using a set of 2900 user queries (ranging from single-word queries to queries with >600 characters using multiple fields and operators; no range queries), Lucene was pre-warmed with 2000 (different) queries. Ten runs were performed and averaged.

Below are the results plotting #requests per second handled vs. #threads making requests. This was all run in the same VM, using an instance of java.util.concurrent.ThreadPoolExecutor to parallelize things:


The best results were for 6 or 7 threads. It is interesting how the response flattens out at around 32 threads and stays steady until ~1024 threads.
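For reference, the shape of the harness (a single ThreadPoolExecutor driving a fixed number of worker threads) can be sketched roughly as follows. The names QUERY_COUNT and runQuery are placeholders of mine, not from the actual test code; in the real benchmark each task executes one Lucene query against the shared IndexSearcher.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ThroughputBench {
    static final int QUERY_COUNT = 10_000; // placeholder workload size

    // Stand-in for running one query against the shared IndexSearcher
    // (e.g. searcher.search(...) in the real benchmark).
    static void runQuery() { }

    // Returns handled requests per second for a given worker-thread count.
    static double measure(int nThreads) throws InterruptedException {
        ExecutorService pool = new ThreadPoolExecutor(
                nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        long start = System.nanoTime();
        for (int i = 0; i < QUERY_COUNT; i++) {
            pool.submit(ThroughputBench::runQuery);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double seconds = (System.nanoTime() - start) / 1e9;
        return QUERY_COUNT / seconds;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int threads : new int[] {1, 2, 4, 8, 16, 32}) {
            System.out.printf("%d threads: %.0f req/s%n", threads, measure(threads));
        }
    }
}
```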

Of course, we were interested in the wait times of end-users, so below I've plotted the average wait times of users, calculated as:
(#threads-making-requests / #handled-requests-per-second) * 0.5
It is an approximation, of course, but good enough to get a general idea.

As you can see below, the average wait times of requests are:

  • less than 0.08 seconds for up to 32 requests / second
  • less than 0.5 seconds for up to 192 requests / second
  • less than 1 second for up to 256 requests / second
  • less than 2 seconds for up to 768 requests / second
Not too shabby!
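As a sanity check on the arithmetic, the wait-time approximation above is easy to compute directly. The 0.5 factor is my reading of the formula: on average a request waits for roughly half of the concurrently queued work ahead of it.

```java
class WaitEstimate {
    // average wait ≈ (#threads making requests / #handled requests per second) * 0.5
    static double averageWaitSeconds(int requestThreads, double handledPerSecond) {
        return (requestThreads / handledPerSecond) * 0.5;
    }

    public static void main(String[] args) {
        // e.g. 256 request threads served at 256 handled requests/second:
        System.out.println(averageWaitSeconds(256, 256.0)); // prints 0.5
    }
}
```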


I am running a much larger query set of ~900k queries as we speak, but I don't think it will be finished for another day or so. I will post the results when they are ready, although preliminary results suggest that the performance on this query set will be poorer (probably due to the nature of the queries: many "b*"-style prefix query terms).

I am going to clean up the code that does this testing and release it in the next week or so.

PS. The plots were done using gnuplot. Thanks, gnuplot!

Update 2008-06-10: As pointed out in some follow-ups to my original posting of these benchmarks on the Lucene User list, I left out some details:
  • The index uses the compound file format
  • No command-line arguments were passed to the Java VM
  • One IndexSearcher is shared across all threads