code4lib update: LuSql talk done; Lucene, Solr links
Gave my LuSql talk today at code4lib2009 and didn't get cut down by any Solr/Lucene dudes! Met Erik Hatcher of Lucene/Solr fame (and now of Lucid Imagination fame), and hopefully we can collaborate on some Lucene/Solr indexing stuff in the future.
I also spoke with Tom Burton-West of UMich about Lucene indexing and search performance for their 1M+ Google Books index (they use Solr). These documents are a lot longer than the STM articles I work with. Their indexes are 220GB in size and, because they have to keep stop words for Humanities phrase searching, they suffer from poor query performance (despite 32GB of RAM). I pointed to some of my previous work on high performance indexing and searching [1, 2, 3]. I'd like to get at their data to examine some performance issues in Lucene, on both the indexing and searching sides.
I was wondering whether the initial/max number of IndexSearchers is configurable in Solr. I couldn't find this in the Solr wiki, but did see information linking caches to IndexSearchers. If it is not configurable, it should be, and a smart Solr should default to no more than the number of cores on the machine (use Runtime.getRuntime().availableProcessors()).
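To make the suggested default concrete, here is a minimal sketch of the capping logic. It is not Solr code and not a Solr setting: the class and method names (SearcherPoolDefaults, cappedSearcherCount, defaultSearcherCount) are hypothetical, and the only real API used is Runtime.getRuntime().availableProcessors().

```java
/**
 * Hypothetical helper (not part of Solr or Lucene): picks a searcher count
 * that never exceeds the number of cores the JVM reports.
 */
final class SearcherPoolDefaults {
    private SearcherPoolDefaults() {}

    /** Caps a requested searcher count at availableCores; the result is always at least 1. */
    static int cappedSearcherCount(int requested, int availableCores) {
        int cores = Math.max(1, availableCores);
        return Math.max(1, Math.min(requested, cores));
    }

    /** Default when nothing is configured: one searcher per core reported by the JVM. */
    static int defaultSearcherCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores reported by the JVM: " + cores);
        System.out.println("searchers for a request of 8: " + cappedSearcherCount(8, cores));
    }
}
```

Whether one searcher per core is actually the right ceiling depends on the workload (I/O-bound queries may behave differently), so treat the cap as a starting point to benchmark rather than a rule.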
[1]Lucene concurrent search performance with 1,2,4,8 IndexReaders
[2]Simultaneous (Threaded) Query Lucene Performance
[3]Lucene indexing performance benchmarks for journal article metadata and full-text
 
Comments
I think it would be a great contribution.
Thanks
I tried creating the index using LuSql. Following the documentation from MS, I gave the right URL for the SQL Server JDBC connection.
Please let me know what else has to be done.
I am not a java developer.
Your help is greatly appreciated.
Thanks,
Gokul
It sounds like you have had some trouble. Can you post the command line you used with LuSql and the resulting output from Java? With this information I can better help you.
Also, what kind of information are you indexing and for what reason?
thanks,
Glen
It looks like the JDBC driver for MS-SQL is not in your CLASSPATH. I would suggest adding the jar for the MS-SQL JDBC driver to your CLASSPATH and retrying LuSql.
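If it helps, a quick way to confirm whether a driver jar is really visible to Java is a tiny check like the one below. It is only a sketch: DriverCheck and isOnClasspath are names made up for illustration and are not part of LuSql. The default class name is the one used by Microsoft's SQL Server JDBC driver.

```java
/** Hypothetical stand-alone check (not part of LuSql): is a class loadable from the CLASSPATH? */
final class DriverCheck {
    private DriverCheck() {}

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, DriverCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String driver = args.length > 0
                ? args[0]
                : "com.microsoft.sqlserver.jdbc.SQLServerDriver";
        System.out.println(driver
                + (isOnClasspath(driver) ? " was found" : " was NOT found on the CLASSPATH"));
    }
}
```

Run it with the same -classpath value you pass to LuSql; if it reports that the class was not found, the jar path is the problem rather than LuSql itself.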
thanks,
Glen
I set up the classpath in the environment variables and also on the command line, as shown below.
One more change I had to make was to use the class name "ca.nrc.cisti.lusql.core.LuSqlMain" instead of "ca.nrc.cisti.lusql.LuSqlMain"; 'core' was missing from the class name. After making this change the program ran, but the error persists.
C:\Gokul\Projects\Lucene_Related\Lusql>java -classpath C:\Gokul\Projects\Lucene_Related\Lusql\lusql-0.901.jar;"C:\Program Files\Microsoft SQL Server JDBC Driver 2.0\sqljdbc_2.0\enu\sqljdbc.jar" ca.nrc.cisti.lusql.core.LuSqlMain \ -c "jdbc:sqlserver://MyServerName;databaseName=MyDBNAME;integratedSecurity=true;"\ -q "select * from Customers" \ -n 100 \ -l tutorial-1 \ -v
Output
Using sql:[select * from Customers]
Using Analyzer:[org.apache.lucene.analysis.standard.StandardAnalyzer]
Using Stop Word FileName:[null]
Using Properties FileName:[null]
Using DB driver name:[com.mysql.jdbc.Driver]
Using DB URL:[jdbc:sqlserver://MyServerName;databaseName=MyDBNAME;integratedSecurity=true;\]
Using Lucene index:tutorial-1
Using Lucene index RAMBUFFER MBs:48.0
Using multithreaded:true
Using Test:false
Using Field parameters:211
Using setting DB fetchsize=0 (see -m)
Using Num documents to add:100
Using Lucene index directory:tutorial-1
Using -Q SQL replacement character:@
Opening Lucene index: tutorial-1
Opening MySQL connection
Querying:select * from Customers
java.sql.SQLException: No suitable driver found for jdbc:sqlserver://MyServerName;databaseName=MyDBNAME;integratedSecurity=true;\
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.commons.dbcp.DriverManagerConnectionFactory.createConnection(DriverManagerConnectionFactory.java:68)
at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:294)
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:974)
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:96)
at ca.nrc.cisti.lusql.core.LuSql.getConnection(LuSql.java:658)
at ca.nrc.cisti.lusql.core.LuSql.run(LuSql.java:507)
at ca.nrc.cisti.lusql.core.LuSqlMain.main(LuSqlMain.java:58)
*********** Elapsed time: 0 seconds
You need to set the JDBC driver class name using the "-d" option. Right now it is not set, so it is using the default which is the MySql driver: [com.mysql.jdbc.Driver] which is incorrect for your database.
Please do this and let me know how it goes. :-)
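For illustration, the mismatch comes down to the URL prefix and the driver class not belonging together. The sketch below is a hypothetical lookup, not something LuSql does for you (which is why "-d" is needed); the class name JdbcDriverGuess and the method driverFor are made up, and the table holds only the two drivers mentioned in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of LuSql): maps a JDBC URL prefix to a driver class name. */
final class JdbcDriverGuess {
    private static final Map<String, String> DRIVERS = new LinkedHashMap<>();
    static {
        // Only the two drivers discussed in this thread; extend as needed.
        DRIVERS.put("jdbc:mysql:", "com.mysql.jdbc.Driver");
        DRIVERS.put("jdbc:sqlserver:", "com.microsoft.sqlserver.jdbc.SQLServerDriver");
    }

    private JdbcDriverGuess() {}

    /** Returns the driver class name for a known URL prefix, or null if the prefix is unknown. */
    static String driverFor(String jdbcUrl) {
        for (Map.Entry<String, String> entry : DRIVERS.entrySet()) {
            if (jdbcUrl.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "jdbc:sqlserver://MyServerName;databaseName=MyDBNAME";
        System.out.println("value to pass with -d: " + driverFor(url));
    }
}
```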
-Glen
Finally the following command line parameters worked.
Good job:). Please update your document to have the "core" in the class name.
Optimizing Index Time: 705 seconds
Total indexing took around 1551 seconds.
Just an FYI, I got the jdbc driver from the following url: http://msdn.microsoft.com/en-us/data/aa937724.aspx
Download Page:
http://www.microsoft.com/downloads/details.aspx?FamilyID=99b21b65-e98f-4a61-b811-19912601fdc9&displaylang=en
C:\Gokul\Projects\Lucene_Related\Lusql>java -classpath C:\Gokul\Projects\Lucene_Related\Lusql\lusql-0.901.jar;"C:\Program Files\Microsoft SQL Server JDBC Driver 2.0\sqljdbc_2.0\enu\sqljdbc4.jar" ca.nrc.cisti.lusql.core.LuSqlMain \
-c "jdbc:sqlserver://MYDBServer;databaseName=Mydbname;user=test;password=test" \ -q "select * from MYDBTable" \ -n 2000000 \
-l tutorial-2 \ -d com.microsoft.sqlserver.jdbc.SQLServerDriver \ -v \ -m 25000
I'm glad you were able to get this working, despite my poor documentation (Sorry about that: I will update the manual for v0.901).
I am finishing up the development of the next version of LuSql and will be working on the documentation soon. I'll try not to make the same mistake as last time.
Let me know if there is anything you feel you might want in the next version (it is possible it might be there...).
thanks,
Glen
PS. If you have an online application that is using the index produced by LuSql, I would appreciate seeing it. Thanks! :-)