2008 Oct 01
How to Judge a Columnar Database, online Revisited
David M. Raab
DM Review
October 2008

Last December, viagra 40mg this column ran a piece on “How to Judge a Columnar Database.” When someone quoted it to me recently, order I realized it had already become outdated. The reason is that a new generation of vendors, including Vertica, ParAccel, Calpont, and InfoBright, has joined older columnar systems from Sybase IQ, Alterian, SmartFocus, Kx Systems, and 1010Data.

In general, the new systems assign dedicated disk drives to each processor (“shared-nothing”) while the older systems apply multiple processors to a shared storage pool. Each approach has its own strengths and weaknesses, which introduces some new differences to consider. In addition, the broader adoption of multiple processors and 64-bit memory removes some of the performance constraints that impacted earlier systems.

Let’s first revisit the original list of items to see which are still relevant. Then we’ll add a few new ones.

– load time, incremental loads, and data compression. These all reflect the need for a columnar database to restructure data originally stored in another format. They can be critical bottlenecks at large data volumes, and older systems varied widely in their performance. As a result, these were probably the most important considerations when comparing older columnar systems.

Today, multiple processors, larger memory space and more scalable disk storage have greatly improved load and compression rates in nearly all columnar systems. Substantial differences still exist, but performance of even the slower systems is likely to be adequate. As a frame of reference, leading columnar databases several years ago loaded around 10 gigabytes per hour, while today’s best products load 150 to 200 gigabytes per hour. Many can reach whatever load rates are needed by simply adding more processors. Additional processing power also allows greater compression of stored data, since systems can decompress it more quickly after it is read from disk. (Decompression is not always needed: many operations run on the compressed data directly.)

Bottom line: you still need to consider load and compression performance, particularly if you’re dealing with terabytes of data (and aren’t we all?) or need quick incremental updates. But these issues no longer head the list.

– structural limitations: some early columnar databases imposed significant constraints on data structure, such as requiring that all tables use the same primary key. These crude limitations are largely gone. However, some of the newer systems do have more subtle limits, such as performing better on star schemas than normalized architectures. If you expect to use a star schema anyway, you probably won’t have a problem with any modern columnar system. But if you use other structures, check carefully how well a given product will perform on them.

– access techniques: while many early columnar systems were not SQL-compatible, all of today’s products all offer some level of SQL access. (Some still offer their own language for functions that SQL handles poorly, such as time series analysis.) Still, there are many levels of SQL compatibility. You’ll want to dig into the details for each system, particularly if you want to reuse existing SQL queries or SQL-based reporting tools.

– performance: this is one issue that hasn’t changed. Columnar databases are all fast, but performance on particular tasks can vary substantially from system to system. Performance may also depend on system configuration, so it is especially difficult to test. But performance is probably why you’re considering a columnar system in the first place, so you’ll certainly want to be sure you know what you’re getting.

– scalability: any columnar database you’re likely to consider will handle a couple terabytes of data. But not all are proven at the fifty or hundred terabyte level. In addition, some systems are significantly better than others at handling mixed query types and large numbers of simultaneous users. If you have needs like these, make sure your chosen vendor has similar installations in production, or that they can demonstrate the necessary performance in a realistic test.

So much for the old issues. None has vanished but the frame of reference has shifted for many. In addition, here are some new considerations:

– fault tolerance: many of the newer systems store data redundantly, either within or across nodes. This is done largely for performance purposes, but can have side benefits of easy—possibly even interruption-free—recovery from hardware failures. Many columnar databases are used for analytical work where some downtime is acceptable. But if it matters for you, be aware that products differ substantially in this area.

– data types: columnar databases have traditionally analyzed conventional structured data. But a few also support XML, text analysis and even binary objects. As with fault tolerance, you may not need this, but should know that it’s available if you do.

– database administration: one of the traditional benefits of columnar databases was their simplicity. Basically users dumped in the data and the system organized it for them. It’s still possible to work this way, but many systems now provide options such as multiple index types or sort sequences. This lets users tune the system to their requirements, but also means database administrators must make the right decisions. It’s still true that any columnar system should be easier to manage for a given analytical application than a relational database. But you’ll want to assess the differences in administrative workload among the columnar products themselves.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.

Leave a Reply

You must be logged in to post a comment.