Thursday, November 27, 2008

The (near) Future of Research Articles

Rod Page's demo for his Elsevier Grand Challenge submission ("Towards realising Darwin’s dream: setting the trees free") shows the type of enrichment of biological - if not all research - articles that is quickly becoming possible. Starting from a published article ("Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio)"), the demo extracts various biological, geographical and other metadata and adds them to a web page for the article. These include:
  • Map showing all localities mentioned in the paper, with their enclosing
    polygon
  • List of other studies which have samples in area enclosed by the
    study polygon
  • Each of the following is linked through to its underlying
    database (such as NIH accession number and NCBI nucleotide viewer)
    or to its uBio taxonomic name viewer record:
    • List of sequence features (such as genes) in the article
    • List of taxa sequenced in the article
    • List of gene sequences cited by the article
  • An image collage of all biological taxa (organisms) in article
  • List of studies on related organisms
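The "enclosing polygon" of a set of localities can be illustrated with a convex hull. This is only a sketch of one plausible approach: the post does not say how the demo computes its polygon, the function name `enclosing_polygon` is made up for illustration, and the coordinates are invented sample points, not localities from the paper. It also treats longitude/latitude as flat x/y coordinates, which is a reasonable approximation only for small study areas.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in decimal degrees


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def enclosing_polygon(points: List[Point]) -> List[Point]:
    """Convex hull (Andrew's monotone chain), vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Hypothetical sampling localities (longitude, latitude); not taken from the paper.
localities = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.5)]
hull = enclosing_polygon(localities)
```

Here the interior point (2.0, 1.5) is left out of `hull`, and the four corner points remain as the polygon's vertices. Finding other studies with samples inside the polygon would then be a point-in-polygon test against these vertices.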
You can see his whole vision in his submission, which shows some interesting visualizations, such as his Google Earth Phylogenies and his Treemaps of Taxa.

Monday, November 24, 2008

Lucene 2.3.1 vs 2.4 benchmarks using LuSql

I have been doing some indexing performance tests with LuSql, and have some numbers comparing Lucene 2.3.1 with 2.4.

Despite some discussion about 2.4 having poorer indexing performance, my tests with LuSql 0.9 suggest otherwise:

Lucene 2.3.1

Number of records added= 2000000
Optimizing index
Closing index
Optimizing index time: 311 seconds
Closing JDBC: result set
Closing JDBC: statement
Closing JDBC: connection
*********** Elapsed time: 854 seconds
15m 18s

Lucene 2.4

Number of records added= 2000000
Optimizing index
Closing index
Optimizing index time: 322 seconds
Closing JDBC: result set
Closing JDBC: statement
Closing JDBC: connection
*********** Elapsed time: 759 seconds
12m 39s
Index size: 3.7GB.

It is interesting that the overall indexing time is about 11% lower with 2.4 (759 vs. 854 seconds), while the optimizing time is slightly higher (322 vs. 311 seconds).
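For anyone who wants to rework the comparison, the arithmetic can be sketched as below. The numbers are copied from the two log excerpts above; the helper name `percent_change` and the records-per-second figure are additions for illustration, and the throughput is simply records divided by the reported elapsed seconds.

```python
RECORDS = 2_000_000

# Elapsed and optimize times in seconds, copied from the two runs above.
runs = {
    "2.3.1": {"elapsed": 854, "optimize": 311},
    "2.4": {"elapsed": 759, "optimize": 322},
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means faster)."""
    return (new - old) / old * 100.0


elapsed_change = percent_change(runs["2.3.1"]["elapsed"], runs["2.4"]["elapsed"])
optimize_change = percent_change(runs["2.3.1"]["optimize"], runs["2.4"]["optimize"])

# Overall throughput in records per second for each Lucene version.
throughput = {version: RECORDS / r["elapsed"] for version, r in runs.items()}

print(f"elapsed: {elapsed_change:+.1f}%  optimize: {optimize_change:+.1f}%")
for version, rate in throughput.items():
    print(f"Lucene {version}: {rate:.0f} records/s")
```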

Data, hardware and system configuration: as per my previous Lucene benchmarking.

Note that this is a simple benchmark, so YMWV. This benchmark was done with the LuSql default number of threads for the hardware in question, 20.
MySQL version used: v5.0.45 compiled from source, concurrency=8.

Wednesday, November 19, 2008

Asian Digital Libraries 2008 Proceedings

Proceedings of the 11th International Conference on Asian Digital Libraries, ICADL 2008, Bali, Indonesia, December 2-5, 2008 are now available:

  • DL2Go: Editable Digital Libraries in the Pocket. Hyunyoung Kil, Wonhong Nam, Dongwon Lee.

  • Hierarchical Classification of Web Pages Using Support Vector Machine. Yi Wang, Zhiguo Gong.

  • The Prevalence and Use of Web 2.0 in Libraries. Alton Yeow Kuan Chua, Dion Hoe-Lian Goh, Chei Sian Lee.

  • Usability of Digital Repository Software: A Study of DSpace Installation and Configuration. Nils Körber, Hussein Suleman.

  • Developing a Traditional Mongolian Script Digital Library. Garmaabazar Khaltarkhuu, Akira Maeda.

  • Weighing the Usefulness of Social Tags for Content Discovery. Khasfariyati Razikin, Dion Hoe-Lian Goh, Chei Sian Lee, Alton Yeow Kuan Chua.

  • A User Reputation Model for DLDE Learning 2.0 Community. Fusheng Jin, Zhendong Niu, Quanxin Zhang, Haiyang Lang, Kai Qin.

  • Query Relaxation Based on Users Unconfidences on Query Terms and Web Knowledge Extraction. Yasufumi Kaneko, Satoshi Nakamura, Hiroaki Ohshima, Katsumi Tanaka.

  • A Query Language and Its Processing for Time-Series Document Clusters. Sophoin Khy, Yoshiharu Ishikawa, Hiroyuki Kitagawa.

  • Ontology Construction Based on Latent Topic Extraction in a Digital Library. Jian-hua Yeh, Naomi Yang.

  • Towards Intelligent and Adaptive Digital Library Services. Md Maruf Hasan, Ekawit Nantajeewarawat.

  • Searching for Illustrative Sentences for Multiword Expressions in a Research Paper Database. Hidetsugu Nanba, Satoshi Morishita.

  • Query Transformation by Visualizing and Utilizing Information about What Users Are or Are Not Searching. Taiga Yoshida, Satoshi Nakamura, Satoshi Oyama, Katsumi Tanaka.

  • Language Independent Word Spotting in Scanned Documents. Sargur N. Srihari, Gregory R. Ball.

  • Focused Page Rank in Scientific Papers Ranking. Mikalai Krapivin, Maurizio Marchese.

  • Scientific Journals, Overlays and Repositories: A Case of Costs and Sustainability Issues. Panayiota Polydoratou, Martin Moyle.

  • A Collaborative Filtering Algorithm Based on Global and Domain Authorities. Li Zhou, Yong Zhang, Chun-Xiao Xing.

  • Complex Data Transformations in Digital Libraries with Spatio-Temporal Information. Bruno Martins, Nuno Freire, Jose Borbinha.

  • Sentiment Classification of Movie Reviews Using Multiple Perspectives. Tun Thura Thet, Jin-Cheon Na, Christopher S. G. Khoo.

  • Scholarly Publishing in Australian Digital Libraries: An Overview. Bhojaraju Gunjal, Hao Shi, Shalini R. Urs.

  • Utilizing Semantic, Syntactic, and Question Category Information for Automated Digital Reference Services. Palakorn Achananuparp, Xiaohua Hu, Xiaohua Zhou, Xiaodan Zhang.

  • A Collaborative Approach to User Modeling for Personalized Content Recommendations. Heung-Nam Kim, Inay Ha, Seung-Hoon Lee, Geun-Sik Jo.

  • Using a Grid for Digital Preservation. José Barateiro, Gonçalo Antunes, Manuel Cabral, José Borbinha, Rodrigo Rodrigues.

  • A User-Oriented Approach to Scheduling Collection Building in Greenstone. Wendy Osborn, David Bainbridge, Ian H. Witten.

  • LORE: A Compound Object Authoring and Publishing Tool for the Australian Literature Studies Community. Anna Gerber, Jane Hunter.

  • Consolidation of References to Persons in Bibliographic Databases. Nuno Freire, José Borbinha, Bruno Martins.

  • On Visualizing Heterogeneous Semantic Networks from Multiple Data Sources. Maureen, Aixin Sun, Ee-Peng Lim, Anwitaman Datta, Kuiyu Chang.

  • Using Mutual Information Technique in Cross-Language Information Retrieval. Syandra Sari, Mirna Adriani.

  • Exploring User Experiences with Digital Library Services: A Focus Group Approach. Kaur Kiran, Diljit Singh.

  • Beyond the Client-Server Model: Self-contained Portable Digital Libraries. David Bainbridge, Steve Jones, Sam McIntosh, Ian H. Witten, Matt Jones.

  • New Era New Development: An Overview of Digital Libraries in China. Guohui Li, Michael Bailou Huang.

  • Browse&Read Picture Books in a Group on a Digital Table. Jia Liu, Keizo Sato, Makoto Nakashima, Tetsuro Ito.

  • Towards a Webpage-Based Bibliographic Manager. Dinh-Trung Dang, Yee Fan Tan, Min-Yen Kan.

  • Spacio-Temporal Analysis Using the Web Archive System Based on Ajax. Suguru Yoshioka, Masumi Morii, Shintaro Matsushima, Seiichi Tani.

  • Mining a Web2.0 Service for the Discovery of Semantically Similar Terms: A Case Study with Del.icio.us. Kwan Yi.

  • Looking for Entities in Bibliographic Records. Trond Aalberg, Maja Žumer.

  • Protecting Digital Library Collections with Collaborative Web Image Copy Detection. Jenq-Haur Wang, Hung-Chi Chang, Jen-Hao Hsiao.

  • Enhancing the Literature Review Using Author-Topic Profiling. Alisa Kongthon, Choochart Haruechaiyasak, Santipong Thaiprayoon.

  • Article Recommendation Based on a Topic Model for Wikipedia Selection for Schools. Choochart Haruechaiyasak, Chaianun Damrongrat.

  • On Developing Government Official Appointment and Dismissal Databank. Jyi-Shane Liu.

  • An Integrated Approach for Smart Digital Preservation System Based on Web Service. Chao Li, Ningning Ma, Chun-Xiao Xing, Airong Jiang.

  • Personalized Digital Library Framework Based on Service Oriented Architecture. Li Dong, Chun-Xiao Xing, Jin Lin, Kehong Wang.

  • Automatic Document Mapping and Relations Building Using Domain Ontology-Based Lexical Chains. Angrosh M.A., Shalini R. Urs.

  • A Paper Recommender for Scientific Literatures Based on Semantic Concept Similarity. Ming Zhang, Weichun Wang, Xiaoming Li.

  • Network of Scholarship: Uncovering the Structure of Digital Library Author Community. Monica Sharma, Shalini R. Urs.

  • Understanding Collection Understanding with Collage. Sally Jo Cunningham, Erin Bennett.

  • Person Specific Document Retrieval Using Face Biometrics. Vikram T.N, Shalini R. Urs, K. Chidananda Gowda.

  • The Potential of Collaborative Document Evaluation for Science. Joran Beel, Bela Gipp.

  • Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres. Robert B. Allen, Ilya Waldstein, Weizhong Zhu.

  • Arabic Manuscripts in a Digital Library Context. Sulieman Salem Alshuhri.

  • Discovering Early Europe in Australia: The Europa Inventa Resource Discovery Service. Toby Burrows.

  • Mapping the Question Answering Domain. Mohan John Blooma, Alton Yeow Kuan Chua, Dion Hoe-Lian Goh.

  • A Scavenger Grid for Intranet Indexing. Ndapandula Nakashole, Hussein Suleman.

  • A Study of Web Preservation for DMP, NDAP, and TELDAP, Taiwan. Shu-Ting Tsai, Kuan-Hua Huang.

  • Measuring Public Accessibility of Australian Government Web Pages. Yang Sok Kim, Byeong Ho Kang, Raymond Williams.

  • Named Entity Recognition for Improving Retrieval and Translation of Chinese Documents. Rohini K. Srihari, Erik Peterson.

  • Current Approaches in Arabic IR: A Survey. Mohammed Mustafa, Hisham AbdAlla, Hussein Suleman.

  • A Bilingual Information Retrieval Thesaurus: Design and Value Addition with Online Lexical Tools. K. S. Raghavan, A. Neelameghan.

  • Entity-Based Classification of Web Page in Search Engine. Yicen Liu, Mingrong Liu, Liang Xiang, Qing Yang.

  • MobiTOP: Accessing Hierarchically Organized Georeferenced Multimedia Annotations. Thi Nhu Quynh Kim, Khasfariyati Razikin, Dion Hoe-Lian Goh, Quang Minh Nguyen, Yin Leng Theng, Ee-Peng Lim, Aixin Sun, Chew Hung Chang, Kalyani Chatterjea.

  • Social Tagging in Digital Archives. Shihn-Yuarn Chen, Yu-Ying Teng, Hao-Ren Ke.

  • Editor Networks and Making of a Science: A Social Network Analysis of Digital Libraries Journals. Monica Sharma, Shalini R. Urs.

  • Empowering Doctors through Information and Knowledge. Anjana Chattopadhyay.

Monday, November 17, 2008

Software Announcement: LuSql: Database to Lucene indexing

LuSql is a simple but powerful tool for building Lucene indexes from relational databases. It is a command-line Java application for the construction of a Lucene index from an arbitrary SQL query of a JDBC-accessible SQL database. It allows a user to control a number of parameters, including the SQL query to use, individual indexing/storage/term-vector nature of fields, analyzer, stop word list, and other tuning parameters. In its default mode it uses threading to take advantage of multiple cores.

LuSql can handle complex queries, allows for additional per record sub-queries, and has a plug-in architecture for arbitrary Lucene document manipulation. Its only dependencies are three Apache Commons libraries, the Lucene core itself, and a JDBC driver.

LuSql has been extensively tested, including a large 6+ million full-text & article metadata document collection, producing an 86GB Lucene index.

I am the author of the LuSql software.
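The general idea (one SQL query, each result row becomes a document whose columns are fields, rows are analyzed by worker threads) can be sketched in a few lines. This is a toy illustration only: it is not LuSql's code and does not use the Lucene or JDBC APIs. It assumes Python's built-in sqlite3 module in place of a JDBC source and a plain dictionary in place of a Lucene index, and the names `analyze`, `row_to_document`, `build_index` and the `article` table are all hypothetical.

```python
import sqlite3
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

STOP_WORDS = {"a", "an", "the", "of", "for", "in"}  # illustrative stop word list


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def row_to_document(columns, row):
    """Turn one result-set row into a document: field name -> list of terms."""
    return {col: analyze(str(val)) for col, val in zip(columns, row)}


def build_index(conn, query, workers=4):
    """Run the query, analyze rows in worker threads, merge into one inverted index."""
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()  # rows are read on the calling thread only
    index = defaultdict(set)  # (field, term) -> set of document ids (row positions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so doc_id matches the row position.
        docs = pool.map(lambda r: row_to_document(columns, r), rows)
        for doc_id, doc in enumerate(docs):
            for field, terms in doc.items():
                for term in terms:
                    index[(field, term)].add(doc_id)
    return index


# Hypothetical in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO article VALUES (?, ?)",
    [(1, "Benchmarks for Lucene indexing"), (2, "The future of research articles")],
)
index = build_index(conn, "SELECT id, title FROM article ORDER BY id")
```

Looking up `index[("title", "lucene")]` returns the set of matching document ids, while stop words such as "the" never reach the index. In CPython the thread pool mainly mirrors the structure of the pipeline; it does not give the multi-core speed-up that threads give on the JVM.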


Wednesday, November 12, 2008

New Book: Semantic Digital Libraries

I am looking forward to getting hold of this just-announced book, Semantic Digital Libraries. Editors: Sebastian Ryszard Kruk (DERI, NUI Galway) and Bill McDaniel (DERI, NUI Galway). Springer-Verlag, Heidelberg (DE) 2009, XVI, 246 p., 1 illus., Hardcover, ISBN: 978-3-540-85433-3.

The site for the book includes the Tutorial on Semantic Digital Libraries presented at JCDL 2008, as well as a faceted search interface to the (extensive and useful) links described in the book.

  • Introduction
  • Part I - Introduction to Digital Libraries and Semantic Web
    • Digital Libraries and Knowledge Organization
    • Semantic Web and Ontologies
    • Social Semantic Information Spaces
  • Part II - A Vision of Semantic Digital Libraries
    • Goals of Semantic Digital Libraries
    • Architecture of Semantic Digital Libraries
    • Long-time Preservation
  • Part III - Ontologies for Semantic Digital Libraries
    • Bibliographic Ontology
    • Community-aware Ontologies
  • Part IV - Prototypes of Semantic Digital Libraries
    • JeromeDL - the Social Semantic Digital Library
    • The BRICKS Digital Library Infrastructure
    • Semantics in Greenstone
  • Part V - Building the Future - Semantic Digital Libraries in Use
    • Hyperbooks
    • Semantic Digital Libraries for Archiving
    • Evaluation of Semantic and Social Technologies for Digital Libraries
    • Conclusions: The Future of Semantic Digital Libraries

Monday, November 03, 2008

Opportunistic Software Systems Development

In the 25th anniversary issue (November/December 2008, vol. 25, no. 6) of IEEE Software, my NRC colleague Anatol Kark is part of the editorial team for the special issue on "Opportunistic Software Systems Development".

These are all great articles. I particularly like the Jansen et al. article ("Pragmatic and Opportunistic Reuse in Innovative Start-up Companies"), and I feel that almost everyone who is trying to bring their organization's IT into the 21st century should be forced to read the Gamble et al. article ("Monoliths to Mashups: Increasing Opportunistic Assets").