Tuesday, March 11, 2008

Extremely Large Databases

The First Workshop on Extremely Large Databases was held at the Stanford Linear Accelerator Center in October 2007. Many of the heavy hitters were there (Google, Yahoo, Microsoft, IBM, Oracle, Terrasoft, SLAC, NCSA, eBay, AT&T, etc.), drawn from industry, academia and science (their classification, not mine).

A report is available and I thought I'd touch on some of the more interesting things I found in it:

Scale:
  • Most had systems with >100TB of data; 20% of scientific databases held >1PB.
  • All industry reps had >100PB of data, and each had at least one system with >1PB.
  • Industry had single tables with >1 trillion rows; science was ~100 times smaller.
  • Multi-trillion-row tables will be needed in <10 years.
  • Peak ingest: 1B rows per hour; 1B rows per day is common.
  • "All users said that even though their databases were already growing rapidly, they would store even more data in databases if it were affordable. Estimates of the potential ranged from ten to one hundred times current usage. The participants unanimously agreed that 'no vendor meets our database needs'."

Usage:
  • The most surprising observation: "...highly unpredictable query loads, with up to 90% of queries being new." Wow: not good for modern learning query optimizers.

Open Source Software:
  • "Both groups often use free and/or open source software such as Linux, MySQL, and PostgreSQL extensively to reduce costs."

MapReduce for some operations:
  • "The map/reduce paradigm has built substantial mind-share thanks to its relatively simple processing model, easy scalability, and fault tolerance. It fits well with the aforementioned need for full table scans. It was pointed out that the join capabilities of this model are limited, with sort/merge being the primary large-scale method being used today". See previous entry on MapReduce/Hadoop.
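
To make the sort/merge remark concrete, here is a minimal single-process sketch of a reduce-side join in the map/reduce style. It is my own illustration, not code from the report or from Hadoop: the function names (`map_phase`, `sort_merge_join`), the "L"/"R" tags and the sample tables are all hypothetical. Each row is tagged with its source table, everything is sorted by join key (standing in for the shuffle), and rows sharing a key are then paired up.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(tag, rows, key_index):
    """Emit (join_key, tag, row) so the sort step can group by key."""
    for row in rows:
        yield (row[key_index], tag, row)


def sort_merge_join(left_rows, right_rows, left_key=0, right_key=0):
    """Inner-join two lists of tuples on the given key columns."""
    # Map: tag each row with the table it came from (hypothetical tags).
    emitted = list(map_phase("L", left_rows, left_key))
    emitted += list(map_phase("R", right_rows, right_key))

    # Shuffle/sort: bring rows with equal keys next to each other.
    emitted.sort(key=itemgetter(0))

    # Reduce: for each key, pair every left row with every right row.
    joined = []
    for key, group in groupby(emitted, key=itemgetter(0)):
        group = list(group)
        left = [row for _, tag, row in group if tag == "L"]
        right = [row for _, tag, row in group if tag == "R"]
        for l_row in left:
            for r_row in right:
                joined.append((key, l_row, r_row))
    return joined


# Made-up sample data: user 3 has no clicks, so it drops out of the join.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
clicks = [(2, "/home"), (1, "/cart"), (2, "/buy")]
result = sort_merge_join(users, clicks)
```

Even in this toy form the limitation is visible: both inputs have to be sorted in full before a single joined row comes out, which is presumably expensive at the trillion-row scale discussed above.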

Important Science Differences:
  • "The longevity of large scientific projects, typically measured in decades, forces scientists to introduce extra layers in order to isolate different components and ease often unavoidable migrations, adding to system complexity. Unfortunately, those layers are typically used only to abstract the storage model and not the processing model."
