David M. Raab
DM Review
July 2008
It’s long been clear that relational databases are ill-suited for analytical processing. The problem is fundamental: relational databases are designed to retrieve all elements from a few records, try while analytical queries typically read all records but only a few elements. Until recently, story database developers were able to overcome the problem with clever design and powerful hardware. Today, the largest databases—hundreds or thousands of terabytes—are too big for this to work at an acceptable price. So the quest for alternatives is on.
But let’s back up a bit. Relational databases are inefficient at analyzing all size databases, not just the largest. Companies may have met their critical needs by throwing money at the problem, but they have also rejected less important applications which couldn’t justify the cost. This suggests there is a significant hidden demand for analytical applications of all sizes if costs can be reduced.
The insight is important because many analytical technologies do not handle very large databases. They are sometimes dismissed for this reason. But this is a mistake: if alternative technologies are more cost-effective than conventional databases for smaller scale projects, they still add value in those situations.
With that in mind, let’s look at the major technologies available today for analytical systems.
– multidimensional databases. These systems read cubes of aggregated data. This makes them fundamentally different from the products listed in the rest of this article, which give direct access to the raw details. This is a critical distinction for analytical applications where the required queries are not known in advance. So now that we’ve mentioned products like Business Objects (www.businessobjects.com), Hyperion (http://www.oracle.com/hyperion), Cognos 8 (www.cognos.com) and Applix TM1 (http://www.cognos.com/applix/) – and acknowledged that they have long been the primary alternatives to relational databases for analytical purposes – let’s move ahead..
– in-memory databases. These include Panoratio (www.panoratio.com), PivotLink (www.pivotlink.com), and QlikView (www.qlikview.com). The technical details differ, but in general all these systems load the source data into memory in a compressed, non-relational format and query against it. Part of their value comes from the compression itself, which lets them handle more data simply loading relational tables into memory. But compression varies greatly with both the technology and the data and is sometimes little better than 1:1. At least much value comes from the non-relational structures, which can be more flexible and powerful at analytical queries than SQL-based systems. This makes the alternative systems cheaper to deploy because there is less fine-tuning of data structures. Still, conventional servers rarely run more than 32 or 64 gigabytes of memory. This means that, even with compression, a pure in-memory system would almost never hold more than fifty or one hundred gigabytes of source data.
– simple columnar systems. These are “simple” in the sense of running on conventional servers, not the massively parallel kind. Vendors include database marketing stalwarts Alterian (www.alterian.com) and smartFocus (www.smartfocus.com). These products organize data by columns (i.e., data elements or fields) rather than rows. Because analytical queries typically read a few elements in all records, the columnar structure lets the database engine load only the information it needs. This reduces the amount of disk access and speeds the result. Like the in-memory systems, these tools have query languages that are more flexible than SQL, making them easier to work with. Scalability is still limited, however: these systems rarely exceed a terabyte of source data.
– massive columnar systems. This is the category receiving the most attention today. Vendors include Calpont (www.calpont.com), EXASOL (www.exasol.com), ParAccel (www.paraccel.com) and Vertica (www.vertica.com). Each is unique but these systems share several basic characteristics: SQL compatibility, massively parallel “shared nothing” hardware, and some flavor of a columnar data structure. These vendors generally supplement the columnar approach with other techniques, such as partitioning and storing multiple copies of the same information in different sort sequences. Some support in-memory data storage as well. These systems do scale, although it seems that most implementations are less than 10 terabytes.
– database appliances. These are also massively parallel systems with a SQL-compatible database. Rather than a columnar approach, they focus on innovations to get the best performance from their hardware at volumes into the hundreds of terabytes. Competitors include DATAllegro (www.datallegro.com), Greenplum (www.greenplum.com), Kognitio (www.kognitio.com) and Netezza (www.netezza.com). The claim here is scalability at a much lower cost than MPP leader Teradata (www.teradata.com).
– extraction-based systems. For tasks like monitoring Web traffic, data volumes are so huge that it makes more sense to scan traffic and extract selected attributes than to store it all. Products including Altosoft (www.altosoft.com), Skytide (www.skytide.com) and Visual IQ (www.visualiq.com) offer this capability.
– outliers. There are a few analytical systems that don’t fit into any of the above categories. illuminate Systems (www.i-lluminate.com), InfoBright (www.infobright.com) and QD Technology (www.qdtechnology.com) work in the sub-terabyte range. All are SQL-compatible, disk-based rather than in-memory, and run on standard servers. illuminate uses a very flexible database approach of its own devising: essentially, it stores each value once and attaches index structures that show all the contexts in which that value appears. Infobright and QD Technology employ a conventional relational structure and focus on compression. Sybase IQ (www.sybase.com) is a variation on a columnar database that reaches very high volumes (it cites a test with 155 terabytes of input) without massively parallel hardware.
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
Leave a Reply
You must be logged in to post a comment.