Friday, August 28, 2009

VLDB2009: Sloppy by choice

One of the recurring themes at the VLDB2009 conference was how to create massively scalable database systems, by decreasing expressiveness, transaction coverage, data integrity guarantees—or any combination of these. The theoretical justification for this is partly explained by the CAP Theorem. Much has already been written about database key-value/object stores, cloud databases, etc. But there were still a few surprises.

Ramakrishnan's keynote
Yahoo!'s Raghu Ramakrishnan (whose textbook many readers of this article will know) gave a keynote on the subject. It was actually new to me that Yahoo! is also entering the cloud business; it's getting crowded in the sky, for sure. As we know, the business idea is this: A large operation like Amazon already has a massive, distributed IT infrastructure; so letting other companies in isn't that much of an extra burden. And the more users of the hardware, the easier it is to build a cost-efficient setup which can still handle spikes in performance demands (with many users it's very unlikely that their systems run at peak performance at the same time). Nice idea, but it may take a long while before users convert to using software which runs in the cloud, and it remains to be seen how many cloud vendors which can survive that long.

Anyway, Raghu Ramakrishnan presented a much-needed comparison of cloud database solutions (pages 55-60 in this presentation). The consistency model of Yahoo!'s cloud database system, PNUTS, does not provide ACID, but nor is it 'sloppy' to the degree of BASE. Another nice aspect of Yahoo!'s cloud systems is that much of it is based on open source software. Yay Yahoo!

When to be sloppy
At the conference, some claimed that cloud storage can also be used for important data, but no one gave a plausible example. In cloudy times, we should not forget that not all data are about tweets, status updates, weblog postings, etc. There are actually data which is actually important. That's why I liked the Consistency Rationing in the Cloud: Pay only when it matters presentation: It provided at framework for categorizing data in degrees of acceptable sloppiness, based on the cost associated with potential inconsistencies, versus the savings gained from the lower transaction overhead.

Panel on cloud storage
On Wednesday, there was a panel discussion on How Best to Build Web-Scale Data Managers, moderated by Philip Bernstein. Random notes from the discussion:
  • A Surprising, and somewhat strange viewpoint from Bernstein: We should not ditch ACID (not surprising coming from Mr Transaction himself), but we should give hierarchical DBMSes a new chance. According to Bernstein: The reason why it will not fail this time is that we have become so good at handling materialized views, and they allow us to make sure that fast queries are not restricted to restricting/scanning one one dimension. Bernstein failed to alleviate my fear of the return of another major drawback of hierarchical databases: navigational queries.
  • While there's much not-so-important data out there, the phenomenon of moderate amounts of important data hasn't gone away (not every business is a twitter.com). So although the non-ACID, non-relational database systems may have a lot of attention, it doesn't matter for the makers of "traditional" DBMSes, because RDBMS business is doing great.
  • Sub-question: Why do web start-ups seem to make use of key-value stores, and not use Oracle's DBMS (for example) when they need to scale to beyond a single data server? There wasn't much opposition to the view that—in addition to being administration labor intensive—the cost of an Oracle cluster is way out of budget in many businesses. So: If database researchers want to help prevent relational+transactional research from becoming increasingly irrelevant, it's time to help the open source database projects. While I agree that researchers can make a difference in the open source world, but I's skeptical to the perception of RDBMSes being abandoned; whatever numbers I've seen actually indicate the opposite. And while Facebook—for example—has a key-value store, their non-clickstream-data is actually in sharded MySQL databases, as far as I've heard; sharded MySQLs will never win relational database beauty contests, but at least it's tabular data, accessible with SQL, and with node-local transactions.
  • Interesting point: SAP, an application with undisputed business significance, is well known for using nothing but the most basic RDBMS features. With that in mind, one should be cautious to denounce cloud databases for lack of expressiveness.


Finally, a couple of pictures from the nearby TĂȘte d'Or Park:

No comments: