Wednesday, February 25, 2009

code4lib 2009: Day 1+2

  • Day 1

    • LibX2. LibX Edition builder. Build custom version of LibX
    • eXtensible Catalog. Drupal, LMS, NCIP, LMS integration: Blackboard. webcast.
    • scriblio: Social Library System Wordpress based plugin
    • Mark Matienzo. anarchivist. Rich contextual bookmarklet.
    • Emily Lynema. NCSU Libraries. E-Matrix: Open Source ERM
    • Eric Lease Morgan. Alex4.
    • Erik Hatcher. Lucid Imagination. Lucene/SOLR. Index of Lucene apache site.
    • Mike Taylor and Mike. Index Data. Translucent record store=="Torus" pazpar2. Registry of searchable targets? Hard to do. IRSpy:Z39.50
    • Mike Beccaria from Paul Smith's College. Microsoft DeepZoom. "Like microfiche" - audience. Photosynth of library stacks??
    • Dan Chudnov, LOC. BagIt File Package Format
    • Random things heard and seen: citation style language, Open Vocab, UCSD Libraries Digital Asset Management System, SWORDS. Distributed version control: monotone, mercurial, bzr.

  • Day 2

    • A new frontier - the Open Library Environment (OLE) -- Timothy McGeary,
      Lehigh University. National Library of Australia Service Framework adopted in Jan 2009. June 2009: Draft design document: open to all. Modules: Discovery tools outside of OLE modules: 3rd party modules. Design photos.
    • Blacklight as a unified discovery platform. Bess Sadler, University of Virginia. Serendipity not offered by present digital tools. gsearch
    • A New Platform for Open Data -- Introducing Web Services. Joshua Ferraro, LibLime. OpenData. biblios web services.
    • What We Talk About When We Talk About FRBR. -- Jodi Schneider, Appalachian State University; William Denton, York University. Weak to Strong FRBRization. OCLC xISBN service: give an ISBN, get back the ISBNs of other manifestations. LibraryThing does this too, through work by its users: the thingISBN grouping service. OpenLibrary is doing some good FRBR work, with a 'work' identifier. Also the LoC FRBR Display tool. Patrick Le Boeuf, FRBR in RDF: IFLA. Libris linked data & FRBR in RDF.
    • Iterative updates Summa Summa: A Better Library Catalogue / Search Engine without Spending a Dime, Summa white paper, Summa: This Is Not a Demo.
    • Erik Hatcher. Query Parsing info. DataImportHandler, Solr Cell, LuSql, TermVectorRequestHandler, StatsComponent, LocalSolr (geo searching). 40% discount (aupromo40) on new book edition.
    • FreeCite - An Open Source Free-Text Citation Parser -- Chris Shoemaker, Public Display
    • Great facets, like your relevance, but can I have links to Amazon and Google Book Search? -- Richard Wallis, Talis
    • Freebasing for Fun and Enhancement-- Sean Hannan, Johns Hopkins University.
    • Lightning talks
    • Connector Online Index Data/OCLC. Scraper. Firefox extension. Amazing interface for scraping.
    • Eric Lease Morgan. code4lib Annual Award.
    • Becky Yoose, Miami U, Ohio. AutoIt. "Every time you run Macro Express God kills a kitten".
    • Alters/mutates test code. Heckle, Pester, Code Coverage Tools
    • HathiTrust. Shared Digital Repository

code4lib: Sebastian Hammer quote

"If you have something to say, you should release it as code..."
Sebastian Hammer, Index Data

Tuesday, February 24, 2009

code4lib update: LuSql talk done; Lucene, Solr links

Gave my LuSql talk today at code4lib2009 and didn't get cut down by any Solr/Lucene dudes! Met Erik Hatcher of Lucene/Solr fame (and now of Lucid Imagination fame) & hopefully we can collaborate on some Lucene/indexing Solr stuff in the future.

I also spoke with Tom Burton-West of UMich about Lucene indexing and search performance for their 1M+ Google Books index (they use Solr). These are documents that are a lot longer than the STM articles I work with. They have 220GB indexes and, because they have to keep stop words to support phrase searching in their Humanities texts, suffer from poor query performance (despite 32GB RAM). I pointed to some of my previous work on high performance indexing and searching [1, 2, 3]. I'd like to get at their data to examine some performance issues in Lucene, both on the indexing and searching side.

I was wondering if Solr is configurable for the initial/max number of IndexSearchers. I couldn't find this in the Solr wiki, but did see information linking caches to IndexSearchers. If it is not, the configuration should allow this, and a smart Solr should default to no more than the number of cores on the machine (use Runtime.getRuntime().availableProcessors()).
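
The default I have in mind can be sketched in plain Java. This is only a sketch of the idea, not Solr's configuration or API: the class SearcherPoolSizing and the method cappedSearcherCount are hypothetical names of my own, and the only real JDK call is Runtime.getRuntime().availableProcessors().

```java
public class SearcherPoolSizing {

    /**
     * Caps a requested searcher count at the number of cores.
     * Hypothetical helper for illustration; not part of Solr or Lucene.
     */
    public static int cappedSearcherCount(int requested, int availableCores) {
        int cores = Math.max(1, availableCores); // guard against odd values
        if (requested < 1) {
            return 1; // always keep at least one searcher
        }
        return Math.min(requested, cores);
    }

    public static void main(String[] args) {
        // Real JDK call: number of processors available to this JVM.
        int cores = Runtime.getRuntime().availableProcessors();
        // With no argument, default to one searcher per core.
        int requested = args.length > 0 ? Integer.parseInt(args[0]) : cores;
        System.out.println("cores=" + cores
                + " requested=" + requested
                + " used=" + cappedSearcherCount(requested, cores));
    }
}
```

Whether one searcher per core is actually the best ceiling depends on the workload (I/O-bound queries may behave differently), so treat the cap as a starting point for benchmarking.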

[1]Lucene concurrent search performance with 1,2,4,8 IndexReaders
[2]Simultaneous (Threaded) Query Lucene Performance
[3]Lucene indexing performance benchmarks for journal article metadata and full-text

"Elvis impersonators as XML documents"

As heard at code4lib2009:
"Let's say all of these Elvis impersonators are XML documents..."
- Mark A. Matienzo, New York Public Library

Monday, February 23, 2009

code4lib pre-conference: Linked Data et al...

I am at the exciting and arcane code4lib 2009 conference here in Providence, Rhode Island. Right now I am at the pre-conference on Linked Data.

I had forgotten that Rhode Island, and more specifically Providence, is the old stomping ground (and the setting for many short stories and novels) of H.P. Lovecraft. And - this morning - I was talking to Ross Singer about this, and realised how this all made sense: when I first met Ross at an Access conference a number of years ago, the first thing I thought on meeting him was, "Cthulhu"! He of course denied being one of the Elder and then levitated across the room from me. But I think this explains a lot of things... ;-)

We will have to see what other Links I make at this conference. :-)

Oh, BTW I will be giving a presentation tomorrow morning on LuSql. Feel free to drop in. :-)

Thursday, February 19, 2009

Obama in Ottawa and the Obama - Portuguese Water Dog Effect

Today, of course, U.S. president Barack Obama is visiting Ottawa. Much of the city is shut down for the visit, at least from the perspective of getting around the city. Many are excited about his first visit outside of the U.S. as president.

In related but much more minor news, my sister and mom breed and show Portuguese Water Dogs (see MacDuff Kennels Portuguese Water Dogs), a breed that is a candidate for the Obamas' next family dog. While their web site is rather anemic due to my own general neglect of the site, I do have Google Analytics turned on, and we noticed a real spike on the site around the time of the U.S. presidential inauguration on Jan 20. Basically the site traffic doubled from its background level. Here is the graph of the time around the inauguration:

By the way, if you are looking for a PWD puppy, my sister has 2 new litters born Dec 26 (8 puppies, most sold) and Jan 3 (10 puppies, some still available).

Wednesday, February 18, 2009

Java, MySQL increased performance with Huge Pages

[Resources updates: 2010.07.07, 2009.11.03, 2009.05.27]

Long-running, large-memory, high-performance applications often have special needs with respect to their memory management. On Linux, Solaris and other modern OSes, the translation lookaside buffer (TLB), which caches virtual-to-physical page mappings, becomes a scalability issue under these extreme conditions when the page size is the 4K used by default on many CPUs/OSes. To get around this, huge page sizes are used: fewer, larger pages mean fewer TLB misses. This can be of use to installations with large scale Java, MySQL and other large memory applications.

300% improvement: "Well, in my case, I was able to achieve an over 3x improvement in my EJB 3 application, of which fully 60 to 70% of that was due to using large page memory with a 3.5GB heap. Now, a 3.5GB heap without the large memory pages didn't provide any benefit over smaller heaps without large pages. Besides the throughput improvements, I also noticed that GC frequency was cut down by two-thirds, and GC time was also cut down by a similar percentage (each individual GC event was much shorter in duration). Of course, your mileage will vary, but this one optimization is worth looking at for any high throughput application." [20]

25% - 300% improvement: "These memory accesses are then frequently cache misses which introduces a high latency to the memory request. Increasing page sizes from 4K to 16M significantly reduces this problem as the number of tlb misses drops. Typically it will reduce runtimes by 25-30% but in an extreme case I've seen an SPH code run 3x faster simply by enabling large pages." [8]

In this article[11], the too common problem of an intermittent, ephemeral but huge reduction in performance was solved by recognizing that 5GB of memory was taken up by the page tables, made up of 4k pages. Solution: use Linux huge page size support and make pages larger, reducing the page table size to 200MB.
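
To get a feel for why page tables can grow this large, here is a rough back-of-the-envelope sketch. It does not reproduce the numbers in the article. The assumptions are mine: 8-byte page table entries, only the lowest-level entries counted, each process keeping its own page tables for the region, and a hypothetical 8GB shared region mapped by 300 processes. The class and method names are made up for illustration.

```java
public class PageTableEstimate {

    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;

    /** Number of pages needed to map a region (rounded up). */
    public static long pagesNeeded(long regionBytes, long pageBytes) {
        return (regionBytes + pageBytes - 1) / pageBytes;
    }

    /**
     * Rough page table footprint: one entry per page, per process.
     * Assumes every process has its own page tables for the region
     * and ignores the upper levels of the page table tree.
     */
    public static long pageTableBytes(long regionBytes, long pageBytes,
                                      long bytesPerEntry, int processes) {
        return pagesNeeded(regionBytes, pageBytes) * bytesPerEntry * processes;
    }

    public static void main(String[] args) {
        long region = 8 * GB;   // hypothetical shared memory region
        int processes = 300;    // hypothetical number of processes mapping it
        long entry = 8;         // assumed bytes per page table entry

        long small = pageTableBytes(region, 4 * KB, entry, processes);
        long huge = pageTableBytes(region, 2 * MB, entry, processes);

        System.out.println("4K pages: " + small / MB + " MB of page tables");
        System.out.println("2M pages: " + huge / KB + " KB of page tables");
    }
}
```

The point is only the ratio: moving from 4K to 2M pages divides the number of entries by 512, so the page table overhead shrinks by the same factor under these assumptions.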

"Many Java applications, especially those using large heaps, can benefit from what the operating systems call large or huge pages."[10]

"17.26 times faster" (Linux, Java, 64bit) [19]

"While 16 GB pages are intended to only be used in very high-performance environments, 64 KB pages are general-purpose, and most workloads are likely to see a benefit by using 64 KB pages rather than 4 KB pages."[7]

Java on an OS that supports large page sizes performs better for many applications. Which applications? Use tools[43] to measure the TLB hit rate while these applications run. Different CPUs support different page sizes:
  • i386: 4K and 4M (2M in PAE mode)
  • ia64: 4K, 8K, 64K, 256K, 1M, 4M, 16M, 256M
  • PPC64: 4K and 16M
  • POWER5+: 4K, 64K, 16MB, 16GB (!!)
  • UltraSparc III: 8K, 64K, 512K, 4M
  • UltraSparc T2: 8K, 64K, 4M, 256M
Once huge pages are set up in the OS, just add -XX:+UseLargePages to the JVM command line. For CPUs that support multiple sizes, add -XX:LargePageSizeInBytes to define the page size you want to use (e.g. -XX:LargePageSizeInBytes=2m for 2MB pages).
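
A quick way to confirm that a running JVM was actually started with these flags is to read back its input arguments. The sketch below uses the standard java.lang.management API; the class name LargePageFlagCheck and its helper methods are hypothetical. Note that it only reports what was requested on the command line. It does not prove that the OS granted large pages.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class LargePageFlagCheck {

    private static final String SIZE_PREFIX = "-XX:LargePageSizeInBytes=";

    /** True if the large pages flag appears in the given JVM arguments. */
    public static boolean requestsLargePages(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseLargePages");
    }

    /** Returns the requested page size (e.g. "2m"), or null if not given. */
    public static String requestedPageSize(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(SIZE_PREFIX)) {
                return arg.substring(SIZE_PREFIX.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to main().
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("UseLargePages requested: "
                + requestsLargePages(jvmArgs));
        System.out.println("LargePageSizeInBytes: "
                + requestedPageSize(jvmArgs));
    }
}
```

Run it with the flags from above, for example: java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m LargePageFlagCheck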

Additional Huge Page resources

    Friday, February 06, 2009

    ICSTI2009 "Managing Data for Science" Conference in Ottawa

    ICSTI2009 "Managing Data for Science" Conference
    From the site:
    ICSTI's 2009 Public conference will take place on June 9 and 10, 2009 at Library and Archives Canada, 395 Wellington Street, Ottawa, Ontario, Canada.

    Speakers from Canada, the United States and Europe will address:
    • How eScience affects the way libraries, publishers and scientists relate to each other.
    • How the era of "big data" will enable enhanced experimentation and collaboration in science.
    The program includes presentations from:
    • Francine Berman, San Diego Supercomputer Center, California
    • Richard Boulderstone, Director of e-Strategy and Programmes, British Library, UK
    • Jan Brase, German National Library of Science and Technology, EU
    • Lee Dirks, Microsoft, State of Washington
    • Peter Fox, Rensselaer Polytechnic Institute, State of New York
    • Paula Hurtubise, Project Manager, Carleton University, Ottawa
    • Liz Lyon, UKOLN, University of Bath, UK
    • James Mullins, Purdue University, Indiana
    • Tim Smith, CERN-IT Development, EU
    • Paul Uhlir, The National Academies, Washington, DC

    Full disclosure: This conference is organized by my employer, which has no affiliation with zzzoot, my personal blog.

    Thursday, February 05, 2009

    Open Access for Quebec FRSQ funded health research publications

    The Quebec health research fund (Fonds de la recherche en santé du Québec (FRSQ)) has mandated that all fully or partially funded projects publish their research outputs ("peer reviewed publications") to an open access site no later than 6 months after publication. This policy went into effect January 1, 2009.

    While this is excellent news, it would have been even better news if the policy took a broader view of research outputs and included research data in its open access policy, mandating research data management and release, similar to the policies of CIHR, Ontario Institute for Cancer Research (OICR), NIH, Genome Canada, UK Research Councils, the Australian government, the European Research Council (ERC) and others. The benefits of sharing research data - especially health data - are well documented.

    NSF-sponsored workshop on Cyberinfrastructure Software Sustainability

    This workshop - to be held at Indiana University March 26-27, 2009 - examines the question: "given millions of dollars invested in initiating software development, how is software that will be important to the US research and engineering communities identified, maintained, and supported over years to decades?"

    This is, of course, also a question of interest for other countries and their cyberinfrastructure initiatives.

    Workshop Goals:
    The goals of the Cyberinfrastructure Software Sustainability and Reusability workshop are as follows:
    • Examine software evaluation and adoption models by individual research labs and virtual organizations
    • Examine models for long-term software sustainability – the ability to obtain the software one wants with assurance, obtain the information required to use the software, obtain the software and hardware environments required to run the software, and use the software.
    • Discuss mechanisms for supporting sustainability, including direct government support, university-funded consortia, open source (with or without commercial support), community source, and commercialization