Troels' blog: 2009

Monday, December 28, 2009

PostgreSQL innovation: Exclusion constraints

It seems like the next generation of PostgreSQL will have a new, rather innovative feature: Exclusion constraints. The feature is explained in a video from a presentation at a recent San Francisco PostgreSQL Users' Group; the presentation couples the new feature with the time period data type.

In a perfect World where databases implement the full ISO SQL standard (including non-core features), exclusion constraints could be nicely expressed as SQL assertions, but the perfect World hasn't happened yet. And I think that I know the reason for this: SQL assertions seem like a strong cocktail of NP-hard problems - they are probably very hard to implement in an efficient way.

In our less than perfect World, it's nice that PostgreSQL will soon offer a way to specify exclusion constraints other than the UNIQUE constraint.

The presenter, Jeff Davis, has an interesting blog, by the way. An on the subject of video, Vimeo has a little collection of PostgreSQL video clips that looks interesting.

Sunday, October 04, 2009

SQL comparison update: PostgreSQL 8.4

I finally got around to adjusting my SQL comparison page, so that the improvements in PostgreSQL 8.4 are taken into account. PostgreSQL is now standards-compliant in the sections about limiting result sets; actually, PostgreSQL is the first DBMS that I know of which supports SQL:2008's OFFSET + FETCH FIRST construct for pagination. PostgreSQL's new support for window functions is also very nice.

My page doesn't cover common table expressions (CTEs) yet, but it's certainly nice to see more and more DBMSes (including PostgreSQL, since version 8.4) supporting them. Even non-recursive CTEs are important, because they can really clean up SQL queries and make them more readable.

SQL comparison updates: Oracle 11R2, diagnostic logs

I finally found some time to update my page which compares SQL implementations. I performed a general update regarding Oracle, now that Oracle 11R2 has been released. And I added a new (incomplete) section which describes where the different DBMSes store diagnostic logs.

I turned out that very little has changed in Oracle since generation 10. The only remarkable new SQL feature in Oracle 11 is support for CTEs and proper CTE-based recursive SQL (introduced in version 11, release 2) -- but as I don't cover this topic on my page, updating the Oracle items was mostly a question of updating documentation links. Oracle still doesn't support the standard by having a CHARACTER_LENGTH and a SUBSTRING function, for example. This is simple, low-hanging fruit, Oracle(!) Sigh. It seems like Oracle's market position has made them (arrogantly) ignore the SQL standard.

Monday, August 31, 2009

VLDB2009: Non-cloud storage

Not all of VLDB2009 was related to cloud storage. Luckily, local and SAN storage is still being explored. Here are some notes from selected presentations.

Mustafa Canim: An Object Placement Advisor for DB2 Using Solid State Storage: What data should be put on disks, and what data on solid state disk storage? Solid state drives (SSDs) shine at random read I/O, but in most situations, there's limited funds, so only part of a database will be eligible for SSD placement. It turns out that if a naïve/simplified strategy like let's place all indexes on SSD is used, only a small overall performance gain is measurable, and it's hardly a justification of the expensive SSD storage. But if placement of database objects (tables/indexes) is based on measurements from a representative workload, then data which is often seeked to can be placed on the SSDs. Canim has created a prototype application which uses DB2 profiling information to give advice on the most optimal use for xGB of SSD storage, and he demonstrated very convincing price/performance gains from his tool. The principle should be easy to apply to other DBMSes.

Devesh Agrawal: Lazy-Adaptive Tree: An Optimized Index Structure for Flash Devices: Since SSDs do not shine at random writes, an SSD-optimized index structure would be highly welcome. The Lazy-Adaptive (LA) Tree index is a (heavily) modified B+ tree which buffers writes in a special way, yielding significant performance improvements.

Nedyalko Borisov and Shivnath Babu had a poster and a demonstration about a prototype application they have created: DIADS. DIADS integrates information from DB2 query access plans with knowledge about the storage infrastructure (all the way from files on the disk, through the LVM and SAN layers, to individual disks) to help diagnosing performance bottlenecks. This would certainly be useful for DBAs, and it could probably bridge the worlds of DBAs and storage people. They are considering making it an open source project. By the way: I believe that I've heard Hitachi claim to be selling a tool with a similar objective: Hitachi Tuning Manager.

Finally, a little rule of thumb from Patrick O'Neil's The Star Schema Benchmark and Augmented Fact Table Indexing (part of the TPC Workshop): Throughput of magnetic disks has grown much more than seek latency, so at the moment, 1MB of scanning can justify 1 seek.

Friday, August 28, 2009

VLDB2009: Sloppy by choice

One of the recurring themes at the VLDB2009 conference was how to create massively scalable database systems, by decreasing expressiveness, transaction coverage, data integrity guarantees—or any combination of these. The theoretical justification for this is partly explained by the CAP Theorem. Much has already been written about database key-value/object stores, cloud databases, etc. But there were still a few surprises.

Ramakrishnan's keynote
Yahoo!'s Raghu Ramakrishnan (whose textbook many readers of this article will know) gave a keynote on the subject. It was actually new to me that Yahoo! is also entering the cloud business; it's getting crowded in the sky, for sure. As we know, the business idea is this: A large operation like Amazon already has a massive, distributed IT infrastructure; so letting other companies in isn't that much of an extra burden. And the more users of the hardware, the easier it is to build a cost-efficient setup which can still handle spikes in performance demands (with many users it's very unlikely that their systems run at peak performance at the same time). Nice idea, but it may take a long while before users convert to using software which runs in the cloud, and it remains to be seen how many cloud vendors which can survive that long.

Anyway, Raghu Ramakrishnan presented a much-needed comparison of cloud database solutions (pages 55-60 in this presentation). The consistency model of Yahoo!'s cloud database system, PNUTS, does not provide ACID, but nor is it 'sloppy' to the degree of BASE. Another nice aspect of Yahoo!'s cloud systems is that much of it is based on open source software. Yay Yahoo!

When to be sloppy
At the conference, some claimed that cloud storage can also be used for important data, but no one gave a plausible example. In cloudy times, we should not forget that not all data are about tweets, status updates, weblog postings, etc. There are actually data which is actually important. That's why I liked the Consistency Rationing in the Cloud: Pay only when it matters presentation: It provided at framework for categorizing data in degrees of acceptable sloppiness, based on the cost associated with potential inconsistencies, versus the savings gained from the lower transaction overhead.

Panel on cloud storage
On Wednesday, there was a panel discussion on How Best to Build Web-Scale Data Managers, moderated by Philip Bernstein. Random notes from the discussion:

A Surprising, and somewhat strange viewpoint from Bernstein: We should not ditch ACID (not surprising coming from Mr Transaction himself), but we should give hierarchical DBMSes a new chance. According to Bernstein: The reason why it will not fail this time is that we have become so good at handling materialized views, and they allow us to make sure that fast queries are not restricted to restricting/scanning one one dimension. Bernstein failed to alleviate my fear of the return of another major drawback of hierarchical databases: navigational queries.
While there's much not-so-important data out there, the phenomenon of moderate amounts of important data hasn't gone away (not every business is a twitter.com). So although the non-ACID, non-relational database systems may have a lot of attention, it doesn't matter for the makers of "traditional" DBMSes, because RDBMS business is doing great.
Sub-question: Why do web start-ups seem to make use of key-value stores, and not use Oracle's DBMS (for example) when they need to scale to beyond a single data server? There wasn't much opposition to the view that—in addition to being administration labor intensive—the cost of an Oracle cluster is way out of budget in many businesses. So: If database researchers want to help prevent relational+transactional research from becoming increasingly irrelevant, it's time to help the open source database projects. While I agree that researchers can make a difference in the open source world, but I's skeptical to the perception of RDBMSes being abandoned; whatever numbers I've seen actually indicate the opposite. And while Facebook—for example—has a key-value store, their non-clickstream-data is actually in sharded MySQL databases, as far as I've heard; sharded MySQLs will never win relational database beauty contests, but at least it's tabular data, accessible with SQL, and with node-local transactions.
Interesting point: SAP, an application with undisputed business significance, is well known for using nothing but the most basic RDBMS features. With that in mind, one should be cautious to denounce cloud databases for lack of expressiveness.

Finally, a couple of pictures from the nearby Tête d'Or Park:

VLDB2009: TPC Workshop

Monday, I attended the Transaction Processing Council (TPC) Workshop about the latest developments in database benchmarks. The workshop was kicked off with a keynote from a true database hero, Michael Stonebraker. Mr. Stonebraker has not only published significant research papers—he is also initiated a number of projects and startups: Postgres (the precursor to the great PostgreSQL DBMS), Vertica (one of the pioneers within the "column-oriented"/"column-store" database realm), StreamBase, and others. Stonebraker held it that benchmarks have been instrumental in increasing competition among vendors, but there are aspects to be aware of: As the benchmarks allow the vendors to tune the DBMS (and sometimes create setups not resembling most setups, like using five figure disk counts), this doesn't improve the out-of-box experience—an experience which all too often needs to be significantly tweaked. Stonebraker also criticized the TPC from being too vendor focused, instead of having focus on users and (simple) applications. And he urged the TPC to speed up the development of new benchmark types (I'm thinking: geospatial, recovery, ...), partly by cutting down on organizational politics.

Personally, I'm astonished that some (most?) of the big DBMS vendors prohibit their users from publishing performance findings. This curbs discussion among practitioners, and it decreases reproducibility and detail of research papers ("product A performed like this, product B performed like that"). I doubt that this actually holds in a court of law, but it would certainly take guts (and a big wallet) to challenge it. I'm also annoyed that the vendors don't really support the TPC much: The TPC-E benchmark (OLTP-benchmark, sort-of modernized version of TPC-C) is two years old, yet only one vendor (Microsoft) has published any results yet.

Nevertheless, references to TPC benchmarks were prevalent at the conference, being referred to in several papers.

I'm planning to try running TPC-H on our most important database systems, to see if it is feasible to use it for regular performance measurements—in order to become aware of performance pro-/re-gressions. By the way: It would be of great if IBM (and others) published a table of reference DB2 TPC findings for a variety of normal hardware configurations. That way, a DBA like me could get an indication of whether I've set up our systems properly.

Other speakers had various ideas for new benchmarks, including benchmarks which measure performance per kWh, and benchmarks which expose how well databases handle error situations.

A researcher from VMWare pledged for benchmarks of databases running in virtual environments. He presented some numbers of a TPC-C-like workload running on ESX guests, showing that a DBMS (MSSQL, I believe) can be set up to run at 85%-92% of the native speed. Certainly acceptable. And much in like with what I'm experiencing. I hope that figures like this can gradually kill the myth that DBMSes shouldn't run in virtual hosts—a myth which results in a situation where many organizations don't realize the full potentials of virtualization (increased overall hardware utilization/lower need for manpower/less late-night service windows, as workloads can be switched to other hosts when a server needs hardware service).

I forgot to take a picture from the workshop, so here's a picture of me being sceptical about traditional DBMS vendors—next to the Saôme river.

At VLDB2009

I'm attending the 35th VLDB conference in Lyon: VLDB2009. The conference portrays itself as the premier international forum for database researchers, vendors, practitioners, application developers, and users. Of the 700 people (from 44 countries), I'm one of the few practitioners at the conference; and though there's a risk that the conference will be too research oriented, I've signed up, especially hoping to get the latest updates and thoughts on

probabalistic databases
column-oriented databases
performance quantification
cloud databases

During the next couple of days, I'll share my experiences here.

I—and others—have tweeted a bit from the conference, as well.

Thursday, August 27, 2009

MTU adjustment needed on Orange wifi

I'm on the road. On the hotel, there's wifi, provided by Orange (France Télécom). The connection has been strange: Most of the time, I couldn't connect to Facebook and MSN Messenger (XMPP/Jabber connections worked fine, though); SSH connections were also unstable. Finally found out that it helped changing my network settings such that the MTU parameter was decreased to 1200.

Tuesday, May 26, 2009

Storage system experiences

Background
Some people dislike Storage Area Networks (SANs). This may be due to complexity; and it may be because SAN management is often delegated to a special group of administrators which may then become an organizational bottleneck.

Personally, I very much like SANs. I do not miss messing around with disks and cables and enclosures in the server room. In my opinion, the complexity of a SAN is easily balanced out by easy and timely allocation of storage chunks to servers. And as a database administrator, I'm very fond of being able to share RAIDs between several servers, resulting in storage backed by a large amount of spindles. Also, it's nice to be able to monitor I/O activity across a large set of systems.

At work, we use two different FC SANs. One SAN connects Windows/ESX/Linux Dell servers to a Hitachi AMS500 storage system, while another SAN connects IBM servers to IBM DS4800 controlled storage.

I've had 2½ years of experience with the IBM storage system which keeps 24TB of data. We've had the AMS500 for ½ a year; it stores around 20TB of data. Both storage systems are configured with RAID5/6 logical drives with varying speed profiles, for different kinds of use patterns.

GUI management interface
The GUI management interface for DS4800, Storage Manager (SM), is vastly superior to AMS500's Storage Manager Modular (SMM). SM lets you assign text lables to your LUs, while SMM only works with numeric LUNs; this means that SM lets you get away with less separately maintained systems documentation. Also, it's much easier to view snapshots of I/O activity in IBMs Storage Manager. And IBMs Storage Manager was very easy to install while Hitachi's Storage Manager Modular required a fair amount of tweaks during installation, due to Java runtime issues (seemingly because Hitachi's software is bundled with an ancient JRE).

Command line interface
For both storage systems, the command line tools use rather awkward syntax and calling conventions. I wish that the programmers of these tools would lean more towards modern Linux/unix command line conventions.

Health monitoring
For both storage systems, you are encouraged to install software which regularly checks storage system health; in case of trouble, it alerts you by e-mail and "phones home" to the respective support organizations. Again, the IBM software is easy to install, while we had problems getting the Hitachi software to install on contemporary server software.

The AMS500 lets you turn on SNMP, so that you can easily poll for the health of the system from your central monitoring system; nice. DS4800 doesn't seem to offer this.

Stability
During my 2½ years with the DS4800, we have had three serious breakdowns: Two related to firmware trouble, and one related to a controller hardware defect. This is too much, I think.
We haven't had experienced instability with the AMS500 during the ½ year that we've had it in production.

UPDATE Dec 2009: We got our first stability problem with AMS500 :-(
A misconfigured ESX-cluster generated a large number of I/O requests for a LUN which had been deleted, and the AMS500 started getting periodic absence seizures. In AMS500's defence: Had we been up-to-date with regard to AMS500 firmware, this wouldn't have caused trouble.

Performance
Benchmarking storage systems is hard, especially because you normally don't have the luxury of being able to shut down all other I/O than that generated by the benchmark. So although I've conducted an extensive set of performance measurements, I can only say that the systems seem to perform equally well.

One difference, though: Hitachi recommends that you stick with rather narrow RAIDs; if needed, these may then be joined into larger storage areas using a special LUSE feature, or using logical volume management at the hosts. This is somewhat annoying: We would generally like to have wide RAIDs comprising a large number of spindles; AMS500 makes it a bit more complicated to meet this goal.

Price
At the time of our AMS500 acquisition, Hitachi's storage system was substantially less expensive than IBMs comparable offerings. The comparison may be a bit unfair, though, because our procurement is tightly controlled by government procurement contracts.

Support
We haven't used IBMs support offerings much, because we have historically used consultants from an IBM partner for support of our DS4800. But, regarding software and documentation, IBMs support is mostly very good: It's easy to download updates for both firmware, management software, and documentation. IBMs documentation of how to use DS4800 multipathing with Linux is inconsistent and confusing, though.

Hitachi's support is a mixed experience. They have appointed a technician from their Danish office to us, and this works very well: He is easy to get in touch with, and he provides good answers. On the other hand, Hitachi's distribution of software and documentation is miserable: The software is hidden behind a confusing extranet, and even after months of mail/phone correspondance with the extranet support (and other parts of Hitachi's organization), we haven't been able to log in and download software or documentation. So we have resorted to ask our tech contact to send us CDs via snail mail once in a while. Another Hitachi annoyance is licence keys: Why on Earth do we need to enter licence keys when installing multipath driver software (on Windows); as if there would ever be a black market for that kind of software. And when the going gets tough in operations (e.g. if a new server needs to be quickly installed after a server breakdown), it's frustrating to have to spend time trying to dig out that CD with licence keys. Argh! What are they thinking?

Multipathing
Multipathing with the DS4800 works well with Windows and PowerVM servers, but we never got it to work perfect with Linux on Intel. It's strange that IBM puts a significant amount of work into Linux, but still can't make it simple and easy to integrate an important storage product with Linux.

Multipathing with AMS500 works well (and out of the box) on Linux and ESX if you follow certain conventions (Red Hat knowlegebase article; Hitachi ESX configuration document). However, Hitachi's Windows HDLM drivers have trouble discovering new LUs, resulting in the need for several reboots when a new LU is mapped to a Windows host; when the discovery is up and running, things work fine with Windows, too.

Overall experiences
Pro IBM: Good management software, simple distribution of software and documentation.
Con IBM: Hich price. More breakdowns than should be expected, I think.

Pro Hitachi: Good price, good stability so far.
Con Hitachi: Bad management software, very bad distribution of software and documentation.

Troels' blog