David M. Raab
DM Review
December 2007
Columnar databases are an idea whose time has come – again. First introduced in the 1970’s in (still available) products like Model 204 and ADABAS, this approach has resurfaced recently at Vertica (www.vertica.com) and to some extent at QD Technology (www.qdtechnology.com). These join resurgent columnar database vendors Alterian (www.alterian.com) and SmartFocus (www.smartfocus.com), whose original products date to the 1990’s.
As the name implies, columnar databases are organized by column rather than row: that is, all instances of a single data element (say, Customer Name) are stored together so they can be accessed as a unit. This makes them particularly efficient at analytical queries, such as list selections, which often read a few data elements but need to see all instances of these elements. In contrast, a conventional relational database stores data by rows, so all information for a particular record (row) is immediately accessible. This makes sense for transactional queries which typically concern one record at a time.
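To make the distinction concrete, here is a toy illustration in Python. The field names and values are invented for the example and do not reflect any particular product’s storage format; the point is only how the two layouts group data.

    from collections import Counter

    # Row-oriented layout: all fields of one record are stored together.
    rows = [
        {"customer": "Alice", "state": "NY", "balance": 120.0},
        {"customer": "Bob",   "state": "CA", "balance": 75.5},
        {"customer": "Carol", "state": "NY", "balance": 310.0},
    ]

    # Column-oriented layout: all values of one field are stored together.
    columns = {
        "customer": ["Alice", "Bob", "Carol"],
        "state":    ["NY", "CA", "NY"],
        "balance":  [120.0, 75.5, 310.0],
    }

    # An analytical query such as "how many customers per state?" needs only the
    # 'state' column; in the row layout, every field of every record is touched.
    print(Counter(columns["state"]))   # Counter({'NY': 2, 'CA': 1})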
Columnar databases were largely eclipsed by relational systems in the 1980’s and 1990’s, when huge improvements in hardware price / performance allowed relational databases to compete effectively despite their analytical inefficiency. Columnar technology may be re-emerging today because analytic databases are now so large that the hardware required to use relational systems is too expensive even at current prices.
Today’s columnar systems combine the columnar structure with techniques including indexing, compression, and parallelization. But the fundamental questions asked in evaluating these systems are still the same.
– load time: how long does it take to convert source data into the columnar format? This is the most basic question of all. Load times are often measured in gigabytes per hour, which can be unbearably slow where tens or hundreds of gigabytes of data are involved. The question often lacks a simple answer, since load speed can vary with the nature of the data and choices made by the user. For example, some systems can store multiple versions of the same data, sorted in different sequences or at different levels of aggregation. Users can build fewer versions in return for a quicker load, but may later pay a price in slower query times. Realistic tests based on your own data are the best path to a clear answer.
– incremental loads: once a set of data has been loaded, must everything be reloaded every time there is an update? Many columnar systems allow an incremental load, taking in only new or changed records and merging them with the previous data. But close attention to detail is critical, since incremental load functions vary widely. Some incremental loads take as long as a full rebuild; some result in slower performance; some can add records but not change or delete them. Often incremental loads must be supplemented periodically with a full rebuild.
– data compression: some columnar systems greatly compress the source data so the resulting files take a fraction of the original disk space. There may be trade-offs: uncompressing the data to read it can slow performance. Other systems use less compression or store several versions of the compressed data, taking up more disk space but gaining other benefits in return. The most suitable approach will depend on your circumstances. Bear in mind that the difference in hardware requirements can be substantial. (A small sketch of one common compression technique appears after this list.)
– structural limitations: columnar databases use different techniques to mimic a relational structure. Some require the same primary key on all tables, meaning the database hierarchy is limited to two levels. The limits imposed by a particular system may not seem to matter, but remember that your needs may change tomorrow. Constraints that seem acceptable now could prevent you from expanding the system in the future.
– access techniques: some columnar databases can only be accessed using the vendor’s own query language and tools. These can be quite powerful, including capabilities that are difficult or impossible using standard SQL. But sometimes particular functions are missing, such as queries that compare values within or across records. If you do need to access the system with SQL-based tools, determine exactly which SQL functions and dialects are supported. It’s almost always a subset of full SQL; in particular, updates are rarely available. Also be sure to find out whether the performance of SQL queries is comparable to performance with the system’s own query tools. Sometimes the SQL queries run a great deal slower.
– performance: columnar systems will outperform relational systems in most analytical circumstances, but the margin can vary widely. Queries involving calculations or access to individual records may run as slowly as, or slower than, they would on a properly indexed relational system. Create a set of sample queries and test them against a prototype system.
– scalability: the whole point of columnar databases is to get good performance on large databases. But you can’t assume every system will scale to tens or hundreds of terabytes. For example, performance may depend on loading selected indexes into memory, so your hardware must have enough memory to do this. As always, first ask whether the vendor has existing systems running at a scale similar to yours and speak to those references to get the details. If you would be larger than any existing installations, then be sure to test before you buy.
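Returning to the compression question above, the sketch below shows run-length encoding, one common way a sorted column can be compressed. It is purely illustrative and is not meant to describe how any of the vendors mentioned here actually store their data.

    # Run-length encoding of a sorted column: repeated values collapse into
    # (value, count) pairs, so long runs of identical values shrink sharply.
    def rle_encode(values):
        runs = []
        for value in values:
            if runs and runs[-1][0] == value:
                runs[-1] = (value, runs[-1][1] + 1)
            else:
                runs.append((value, 1))
        return runs

    states = ["CA"] * 4 + ["NY"] * 3 + ["TX"] * 2
    print(rle_encode(states))   # [('CA', 4), ('NY', 3), ('TX', 2)] -- 3 runs instead of 9 values

    # The trade-off described above: the column must be decoded (or scanned run
    # by run) before individual values can be read back.
    def rle_decode(runs):
        return [value for value, count in runs for _ in range(count)]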
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.