
Open vs. Proprietary Systems
David M. Raab
Relationship Marketing Report
September, 1997-February, 1998

Mention a “proprietary database”, and most systems departments react like a vampire to sunlight. But apart from the sheer entertainment value of making systems people squirm, why would you bring up the topic? And just what is a proprietary database engine, anyway?

It turns out that the first question has an easier answer than the second. People want proprietary databases for a simple reason: performance. These systems are much faster and usually more flexible than their non-proprietary counterparts. Great. So what are they?

Thereby hangs a tale. Once upon a time, when the smallest computer was the size of a refrigerator, all computer systems were “proprietary”–that is, the hardware and software were controlled by the manufacturer. Computers from different manufacturers didn’t work together and software had to be tailored to each product. For sanity, most firms bought all their equipment from the same manufacturer–you were either an IBM shop or a Burroughs shop or maybe a DEC shop–and, because the cost of changing vendors was so high, the vendors could charge very high prices without fear of losing their customers. Kind of like professional football.

This cozy situation had disadvantages both for vendors attempting to break into new accounts and for computer departments stuck with high costs and less-than-competitive service. The reaction was the “open systems” movement, which aimed to develop standards that let the same software operate on computers from different manufacturers. Key standards included the Unix operating system and the Structured Query Language (SQL) used to access relational database management systems.

The open systems movement was a qualified success. It did succeed in driving down prices, particularly for hardware. But Unix never quite achieved true interoperability–each hardware vendor added its own little flourishes–and stalled at about 15% market share. SQL did achieve a vendor-independent standard, although RDBMS vendors added proprietary extensions as well. Microsoft established a de facto SQL standard through its Open Database Connectivity (ODBC) middleware, which translated SQL query statements into the dialects of the various database vendors. Of course, ODBC could only accept SQL functions that were understood by all database management systems, so it was unable to take advantage of vendors’ proprietary extensions. Developers wanting to deliver the best possible performance still wrote “drivers” to address the major relational databases directly.

But however far the industry has strayed from open systems in practice, it remains firmly committed to the ideal. Among both vendors and IT departments, “openness” often ranks with motherhood and lower taxes as a self-evidently desirable goal. Of course, the definition of openness often varies with the convenience of the user. Vendors tend to think of “open” as describing standards they control, so long as they are published. Systems groups tend to equate “open” with their internal corporate standards–even though these may be limited to the products of a single vendor.

All of which explains why a “proprietary” database may be deemed undesirable, even though the term itself has no rigorous definition. The practical definition is clear enough: unacceptable deviation from the standard of a SQL-compatible relational database. Where companies vary is in their definition of “unacceptable”: it may refer to any database that cannot be accessed by standard SQL query tools, any database that lacks an ODBC driver, or any database that is not Oracle. Sometimes the vendor can cure the fault by building an appropriate interface; in other cases, anything other than the company’s chosen “open” tool is too “proprietary”.

Why do companies care whether their database is “proprietary” or not? Some of the logic is, well, illogical–because “openness” is good, because the company’s chosen standard has to be best, because if the alternative product were any good the IT department would have known about it already. But there are also sound business reasons to avoid proprietary systems. They require training in new technologies and operating procedures, which means additional expense. They may not allow the company to add new functions or handle higher data volumes as corporate needs evolve. The vendor might go out of business, fail to keep the system updated, provide poor service or raise prices unreasonably. The systems might not be able to communicate with other corporate systems–an increasingly important concern as marketing databases are integrated more closely with other enterprise functions.

Given all these disadvantages, can the performance advantages of proprietary databases really be worth the bother? In a word, yes–or at least, sometimes. Benchmarks published by proprietary vendors regularly show performance improvements of 10 or 100 or even 1000 to 1: that is, queries that might take two hours using a conventional relational database may take one minute on a proprietary system. These figures are unaudited and doubtless chosen to display the vendor’s products to greatest advantage. But they are also consistent with reports from end-users and are done on the kinds of complex queries that matter most to marketers.

The reason these huge performance advantages are possible is that standard relational databases are terribly ill-suited for many marketing tasks. These databases were designed and optimized for traditional transaction processing–which generally involves finding one record or group of related records quickly, and making a change or addition such as entering a new order or updating somebody’s address. SQL was developed with functions to support these activities and the database’s query and index mechanisms have been tuned to run them. But marketing activities generally involve looking at large numbers of records–often checking the entire database to find records that meet very complex criteria. Doing this in SQL often involves contortions that would make a Yoga master proud.

For example, take a seemingly simple request: Give me a list of good auto insurance customers to offer home-owners insurance. The relational database must first generate a table with the total auto insurance claims by policy holder. It must then link this to a table of customers and then create a new table of customers whose total claims are under a defined threshold. It must then somehow generate a list of customers who do not have home-owners policies–especially tricky in SQL–and select those who appear on both lists. The SQL for this process can easily run for pages. If you want to make it really difficult, have the system select the best customers by finding those with the lowest 10% of total claims (requiring a separate pass through the database to first calculate the distribution of claims), ask it to exclude customers who have been solicited for home-owners insurance in the past six months (another set of calculations to match back to the list), randomly select 5,000 names for a test (there is no “random sample” function in native SQL), and then list the customers in priority sequence, showing a cumulative expected response rate (SQL can’t cumulate as it passes through a set of records). It might take days for a SQL expert to write this kind of query–and hours, or even days more, for the database to execute it. There is also an excellent chance of returning an incorrect result–such as double counting claims by people with more than one policy–that is virtually impossible to detect. But this is exactly the sort of thing marketers want to do every day.
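To make this concrete, here is a rough sketch of just the core selection, written with invented table and column names and an arbitrary claims threshold. The full version, with the decile ranking, promotion suppression, random sample and cumulative report added, would run far longer:

    SELECT c.customer_id, c.name
      FROM customers c
     WHERE NOT EXISTS                          -- no home-owners policy on file
           (SELECT * FROM policies h
             WHERE h.customer_id = c.customer_id
               AND h.policy_type = 'HOME')
       AND c.customer_id IN                    -- total auto claims under the threshold
           (SELECT p.customer_id
              FROM policies p, claims cl
             WHERE p.policy_id   = cl.policy_id
               AND p.policy_type = 'AUTO'
             GROUP BY p.customer_id
            HAVING SUM(cl.claim_amount) < 1000)

Even this stripped-down sketch hides a trap: auto policy holders with no claims at all never appear in the second subquery, so the very best customers silently drop out. That is exactly the kind of error that is nearly impossible to spot in the final counts.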

* * *

Last month’s article described in general terms the problems that conventional relational databases have with common marketing queries. But generalities are not enough–especially in discussions that challenge deeply held beliefs. Questioning the natural and universal superiority of someone’s chosen relational database is sometimes taken as a personal insult.

So, for the record, here is a more detailed (but still not exhaustive) list of relational database shortcomings. Actually, most of these are limitations of standard Structured Query Language (SQL) rather than relational databases themselves. Also for the record, let’s make clear that many of these problems can be overcome by relational systems–given careful design, clever programming, and enough hardware. Relational solutions will be discussed in a later article in this series.

– file scans. This is the process of reading every record in a database, one after the other. It is a time-consuming task that slows performance dramatically. Unfortunately, relational databases tend to do this when they are faced with a query that cannot easily be limited to a small number of records. Remember, relational databases have been tuned for transaction processing, where most queries look for a single record or small group of records with a unique identifier such as an account number. They do this by creating binary tree (“b-tree”) indexes that list all data values present in a single field, with pointers to the location of records with each value and a fast method to find the index entry for a specific value. (I’m being sloppy here, or maybe just showing my age–relational databases have columns, not fields, and rows, not records. For that matter, they store data in tables, not files. But the older terms probably make more sense to readers, so I will continue to use them. The relational purists are mad at me by now anyway.)

B-tree indexes are great at selecting on a single value that occurs on a small number of records–such as an account number in a transaction table, where presumably any one account owns just a tiny proportion of the total records. But they will fail, and the database will revert to the dreaded file scan, if either the lookup is based on a combination of conditions (because the relational database can evaluate only one index at a time) or the specified value occurs on more than a small fraction of the total records. The fraction of records matters because data in a conventional relational database is retrieved by “page”, not by record. Each page typically stores scores or even hundreds of records. Think about it: if each page stored 100 records and a selection included 2% of the file (one out of fifty records), then nearly every page in the database would probably hold at least one qualified record. Using the b-tree index in this situation would actually be more work than a file scan, since you would still load each page at least once and have the additional overhead of reading the index itself. In fact, you would probably load many pages twice to retrieve two different records. Most relational databases are clever enough to recognize this situation and default to the file scan instead of doing the additional work.

As this example suggests, the efficiency of many query operations is determined by the algorithms that tell the system how to execute a query, not the inherent characteristics of relational databases themselves. These algorithms are called the “query optimizer” and are among the most closely guarded secrets of database vendors. The major relational databases have traditionally tuned their optimizers for fast transaction processing, but have now begun to add features to accommodate decision support and marketing-type queries as well. Unfortunately, they still have a long way to go.

– complex joins. The “join” is a fundamental relational database operation. It involves taking two data tables, finding rows in each table that share the same value for a common “key” field such as customer ID, and creating a new table with data from the matching rows. This combined table is known as a “result set”. A simple join involves a one-to-one relationship: that is, each key value appears just once in each of the two original tables. For example, you might have a table of cable TV accounts and a table of addresses; presumably, each account relates to just one address and each address has just one account. A more common join is one-to-many: for example, one customer linked to many transactions. This is still fairly simple–assuming that each transaction is associated with a single customer. The result set of a one-to-many join has as many rows as the larger of the two tables, with data from the smaller table repeated as many times as necessary. For example, if a customer and transaction table were joined, the result set would have one row per transaction, with the customer data repeated on each row.

Unfortunately, the real world often involves joins that are more complex than one-to-one or one-to-many. For example, a selection may require relations among records within the same table, using two different columns as the “key”: this is known as a “self-join”. Imagine an employee file where each employee record holds the employee’s own ID and the ID of the employee’s manager. To produce a simple list of the employees with the names of their managers, you must match the manager ID in each employee’s record with the employee ID in each manager’s record. This is just as confusing as it sounds.
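A sketch makes the trick visible: the same table appears twice, under two different aliases (table and column names invented for illustration):

    SELECT e.name AS employee, m.name AS manager
      FROM employees e, employees m            -- the same table, playing two roles
     WHERE e.manager_id = m.employee_id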

A query may also need to include records from one table that do not have counterparts on the other table (such as orders that are not associated with customers), or to exclude such records. These are known as an “outer join” and an “inner join”, respectively. There are also one-sided variations (left and right outer joins): include orders with no customers, but not customers with no orders.
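In SQL-92 join syntax, that one-sided case might be written roughly as follows (names invented; many databases of this era still expect their own outer-join notation instead, such as Oracle’s “(+)” marker):

    SELECT o.order_id, o.amount, c.name
      FROM orders o LEFT OUTER JOIN customers c
        ON o.customer_id = c.customer_id       -- keeps orders with no customer;
                                               -- customers with no orders drop out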

The correct type of join is often not obvious, and can be difficult to specify correctly even after it is known. The problem is compounded because a query requiring complex SQL is often sent to a technical expert who does not fully understand the intent of the marketer who asked the original question. Nor–precisely because the SQL is so complex–is it likely that the marketer will be able to check the programmer’s logic by reviewing the SQL code after it is created. And finding errors by examining the results is difficult because SQL produces only the final output, so the data cannot be traced through intermediate steps.

– correlated subqueries. This is a term that even SQL experts sometimes have a hard time understanding. It describes a query that is executed by first running separate queries (“subqueries”) to identify certain records, and then by matching (“correlating”) related records from the subquery result sets. For example, a query might need to find total lifetime purchases and lifetime returns for each customer. One subquery would read the order table to calculate lifetime purchases for each customer by cumulating (“grouping” in SQL terms) purchases on customer ID. A second subquery would read the returns table to calculate lifetime returns in a similar fashion. The result would be two tables (purchases and returns), each with a single row per customer. These two tables would then be joined using the customer ID as a common key. The result would be a single table with one row per customer ID and three columns: customer ID, purchases and returns. Once purchases and returns are combined on the same row, they can be manipulated easily in standard SQL operations.
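One reasonably safe way to write this builds each summary as its own grouped subquery and then joins the two results, something like the sketch below (names invented; not every database or query tool of this vintage handles subqueries in the FROM clause):

    SELECT p.customer_id, p.purchases, r.returns
      FROM (SELECT customer_id, SUM(amount) AS purchases
              FROM orders
             GROUP BY customer_id) p,
           (SELECT customer_id, SUM(amount) AS returns
              FROM product_returns
             GROUP BY customer_id) r
     WHERE p.customer_id = r.customer_id       -- customers with purchases but no
                                               -- returns vanish unless this is made
                                               -- an outer join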

This is a fairly simple example, but even it could easily be messed up. A common rookie mistake is to first join the order and returns tables and then do the cumulation–a simpler approach that avoids the subqueries. Trouble is, this gives the wrong answer. The merged database would double count records of customers who had more than one purchase and more than one return, because it would join each order record for a customer to each return record for that customer. For example, if a customer had two orders and one return, the resulting database would have one record showing the first order and the first return, and a second record showing the second order and the first return again.

This particular problem is known as a many-to-one-to-many (or many-to-many) join. Doing it wrong results in what is known as an “unintended cross product”. Like other complex joins, many-to-many joins are a challenge for front-end tools that attempt to shield users from having to write SQL code. The insidious aspect is that the “direct” approach involves perfectly legal SQL; thus it will execute successfully. Worse still, because the system will add together all rows for the same customer before producing the final result, the user will have no way to see that the double counting occurred. If only a few customers have multiple returns–and therefore the numbers do not look especially unreasonable–the user may never realize that a mistake has been made.

– multi-pass operations. In general, SQL specifies relationships among sets of records, but not relationships among records within a single set. Yet many common marketing activities, such as ranking records or producing running totals, require that the records be processed in the proper sequence. These activities typically require multiple passes through a data set: one to sort the records correctly (which SQL does easily), and another to perform whatever calculations are required. Other activities requiring multiple passes include splitting a file into ranked groups such as deciles, calculation of moving averages such as the average size of the last five transactions, assigning sequences such as first or second purchase, and calculation of ratios such as the percentage of total revenue represented by each transaction. Multiple passes can slow down response considerably, because SQL must reread the entire file during each pass.

In addition, because SQL is not designed to perform calculations or comparisons on successive rows in a database, it lacks procedural programming structures such as do/while/end loops. This means that multi-pass operations often involve either advanced SQL tricks or manipulating the data in a conventional programming language. A very common database marketing requirement–splitting a file into mutually exclusive segments, each defined by its own query criteria–typically requires a separate query (often itself needing multiple passes) to create each segment, and then additional passes to remove duplication among the segments.
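One well-worn example of such a trick is brute-force ranking: join a summary table to itself and count, for each customer, how many customers bought as much or more. The sketch below (names invented, assuming a summary table of customer totals already exists) is legal in almost any SQL dialect, but it effectively compares every row against every other row, which is exactly the kind of workload that brings a large file to its knees:

    SELECT a.customer_id, a.total_purchases,
           COUNT(*) AS purchase_rank           -- 1 = biggest buyer
      FROM cust_totals a, cust_totals b
     WHERE b.total_purchases >= a.total_purchases
     GROUP BY a.customer_id, a.total_purchases
     ORDER BY purchase_rank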

– unsupported functions. Standard SQL lacks a variety of functions that marketers often require. These include random and Nth selections, cumulative and rolling sums and averages, and ratios against a whole. Specialized functions for time-series and spatial analysis are also missing. Many group-based selections, such as finding customers with any, every or no record of a given type, can only be done through subqueries. This is a significant practical shortcoming, since it includes queries such as “find customers who have never purchased this product” or “find customers who always pay cash.”
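For instance, “customers who always pay cash” has to be restated as a double negative: find customers for whom no non-cash order exists. A sketch, with invented names:

    SELECT c.customer_id, c.name
      FROM customers c
     WHERE NOT EXISTS
           (SELECT * FROM orders o
             WHERE o.customer_id = c.customer_id
               AND o.payment_type <> 'CASH')   -- note: also returns customers with
                                               -- no orders at all, which may or may
                                               -- not be what was intended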

* * *

Last month’s article described some pretty significant shortcomings of relational databases for marketing applications. Alternatives do exist, though these themselves are far from perfect. These “proprietary” systems fall into two major categories: inverted files and bit vector map (usually shortened to bit map) indexes.

Inverted files are systems where all instances of each data element are stored together–that is, one “record” holds all first names, another holds all last names, another holds all Zip codes, etc. In relational database terms, it is as if the contents of each column were stored together, instead of the contents of each row. The key advantage of the inverted (or “columnar”) approach is that it matches the way most marketing queries work–users typically need to look at all occurrences of a handful of data elements. This is starkly different from a typical transaction processing query, which looks at all data elements associated with a single record. In essence, the inverted database does a “file scan” on selected elements, which is not quite as efficient as reading those elements for selected records only, but definitely beats doing a file scan on all elements of all records–which is where a relational database often ends up. The inverted database simply reads less data, which means it can go faster. Period.

Actually, there are other advantages as well. While relational databases rely on indexes or aggregations for speed, inverted databases do not. This means the inverted databases can give users (reasonably) fast response to questions they never anticipated. This is especially important in marketing applications, where unanticipated questions are frequent and there is usually neither time nor staff to index or aggregate the underlying database before getting the answers. Similarly, because inverted databases are not limited to the capabilities of Structured Query Language (SQL), users can easily execute calculations and comparisons across records that are difficult-to-impossible in SQL itself. Inverted database developers can even create query languages with specific functions, such as group-based qualifiers (“find customers with any transaction in this date range”), that SQL does not allow.

It is also relatively easy to add new elements to an inverted database. Because each data element is stored independently, there is no need to change record size or reallocate existing space within a record layout to squeeze in a new field. This is an advantage even over relational databases, which theoretically allow addition of new columns without changes to a fixed “record layout”: in fact, because the physical storage in a relational database is organized by row, they must physically reload the database to get any new elements placed correctly. In all, judging from vendor proposals for comparable systems, an inverted database probably requires one-quarter to one-half as much technical support as the same system on a conventional relational database. As systems get larger, the advantage probably increases–and some functions that are fairly easy in an inverted database become so unwieldy in a relational database that they are effectively impossible. Just try asking the administrator of a 100 GB data warehouse to add several new columns overnight.

For all their advantages, inverted databases are not perfect. They lack conventional indexing mechanisms, so finding data associated with a single customer is relatively slow: you are still doing the equivalent of a file scan to find the single record. The inverted file is usually updated from a conventional row-oriented source, which implies a time-consuming conversion and loading process. (Current systems run up to several gigabytes per hour, which is reasonably fast but still implies several days of processing for a very large file.) Inverted systems may be slow at, or simply not allow, “incremental” updates that append new records–say, yesterday’s transactions–without rebuilding the entire database. Users typically cannot change data by calling it up on a screen and editing it, although they can usually make replacements via batch processes. Despite the general speed of the systems–5 to 20 million records per minute on PC or Unix servers–their reliance on file scans becomes an increasing disadvantage as files get very large. Even at 20 million records per minute, it would take nearly an hour to scan a single field on a billion-row table, and several hours to resolve a query with several fields. Performance may be much worse if a query involves complex calculations.

The inverted systems also have an essentially hierarchical data structure. This means they allow one-to-many relationships to flow in a single direction only: for example, a system may link one customer to many transactions, each involving a single product, but cannot simultaneously link the transactions to a small table that shows the attributes (line, size, style, color, etc.) of each product. This means that if users want to be able to query on product attributes, each attribute must be stored directly on the transaction records–potentially storing the same information over and over on millions of records. Most inverted databases use compression mechanisms to limit the actual space occupied by such data, which in fact can be compressed very efficiently because it contains a small number of unique values (there may be only five different sizes, eight colors, six product groups, etc.). But even a small amount of space per record adds up when millions of records are involved.

Probably more important, storing the attributes at the “bottom” of the hierarchy means that any attribute-based analysis must read all the detail records: for example, a query involving products with a certain size, color and style combination would need to read those three fields for all transactions (three file scans), rather than looking at a small product table, generating a list of the appropriate products, and then reading the product codes on the transactions (one file scan). This is significantly less efficient–and, in fact, inverted databases often slow considerably when faced with queries that summarize data along dimensions other than the embedded hierarchy (which is usually, but not necessarily, customers-to-transactions).

It also turns out that inverted databases often are limited in the number of layers their hierarchy can manage: some systems are limited to two levels, while others extend to four or more. This isn’t quite as bad as it sounds, since the systems can often allow different types of data–say, transactions and promotions–to coexist on the same level in the hierarchy. They may also be able to simulate additional levels by grouping records within a level: for example, storing individual records with a household ID, and then grouping on household ID in a report. However, this type of grouping does involve compromises in convenience and performance, so it is not uncommon to see companies build several versions of the same database with different physical levels of grouping.

Bit map indexes create a separate string of 1’s and 0’s for each value taken by a given element in a database: 1’s represent records having the value, and 0’s represent records that do not. The number of strings is determined by the number of unique values: a gender field might have three strings, for male, female and unknown; a state field might have 50 strings, one for each state, and so on. Bit maps are very easy to work with: finding the set of records with any of several values simply requires adding the strings for those values. This set-oriented approach makes bit maps inherently compatible with relational databases and Structured Query Language, although SQL functions requiring calculations or comparisons can be problematic. Bit maps are much faster than conventional relational database operations: the bit strings are usually small enough to hold in memory; only the strings related to relevant fields must be read; and the string comparison operations are inherently simpler than operations on the actual underlying data. Even on millions of records, a complex query using bit maps is often returned in a few seconds.

The problem with pure bit maps is that a separate string is needed for each value occurring in a field. This means that when there are a great many different values, there are too many strings to be useful. (The number of unique values is called “cardinality”: a low cardinality field has few different values and a high cardinality field has many different values.) The 1’s and 0’s of a pure bit map also cannot be used for calculations or be read directly to determine the actual values they represent.

These limitations have traditionally restricted the use of bit maps to special situations such as statistical summary files or counting systems. But more recently, products including Sybase IQ and Mercantile IRE have extended bit map approaches to allow representation of high-cardinality data and even allow calculations. While the details of their methods are confidential, the general approach of these vendors is to combine bit maps with other techniques including inverted files, precalculated summary statistics, and data compression. The result is that in an increasing number of situations, queries can be resolved without moving from the index to the underlying relational database. In fact, Sybase IQ–which can actually recreate the original data from the contents of its indexes alone–is often used without any ties to a separate relational data store. Of course, in this case it is not technically an “index”, but an independent database.

Even with the extensions of vendors like Sybase and Mercantile, bit map approaches still have their drawbacks. They do not support the full set of SQL functions, meaning that certain queries must be passed on to the underlying relational database (if it exists) or cannot be processed at all. The bit map structures themselves must be built, which requires some sort of a loading process from the original data source–a process that can take significant amounts of time for a large system. Some products offer limited capability to do incremental data loads (that is, add new records to an existing database), but a full rebuild using all the original data is usually required. Even systems that allow incremental updates do not permit online editing of individual records, as would be done in a transaction processing system.

While bit maps are not quite as flexible as inverted file systems, they are significantly faster and, in some ways even more important, their SQL-compatibility makes them fit more easily into a relational database world. For example, it is possible to use standard relational database query tools to access data held in bit map systems–something that most inverted products do poorly if at all. Some relational database vendors, including Oracle and Red Brick Systems, have actually incorporated bit map indexes into their own systems. These implementations are closer to pure bit maps than the enhanced products of vendors like Sybase and Mercantile, however. Mercantile in particular has incorporated capabilities, such as multiple comparisons within a query string, that go beyond standard SQL functionality.

Other technologies also exist to supplement or replace conventional relational databases. Products like MegaPlex FastCount, HOPS International HOPS, and Brann Viper all involve some combination of compression, inverted files, bit maps, and careful physical sequencing of the data–although details are closely held. They are true databases, with the ability to store and retrieve entire records without reference to an external system. They are usually loaded through batch processes rather than accepting online updates, although HOPS is an exception. HOPS can also accept SQL queries through an ODBC (Open Database Connectivity) driver, while the other systems have their own query languages.

Dynamic Information Systems Corporation (DISC) uses “inverted indexes”, where each unique value occurring in a field is stored along with a list of the records having that value. This is somewhat similar to bit map indexes, since it is organized by unique values. But storing the list of records rather than bit strings lets the system handle high cardinality situations more efficiently. Inverted indexes can select intersecting sets of records without reading the underlying data and can build indexes that incorporate data from multiple tables (“prejoined” indexes). They often store summary statistics about each value (such as the number of associated records), allowing them to resolve some queries almost immediately since no additional processing is required. But the system does not allow calculations and must draw from a conventional database to retrieve the actual records identified by queries against its indexes.

* * *

The story so far: confused and disoriented by the conflicting definitions of openness, our hero has refused to accept the limitations of conventional databases and set off in search of a better way. Stumbling through a weird landscape of proprietary solutions, he finds that each alternative has drawbacks of its own. Frustrated but still determined, he returns to the land of standard systems and asks whether perhaps these couldn’t be improved after all.

Okay, nobody will mistake this for a lost work of Sophocles. But it still offers a certain dramatic tension: what can be done to make relational databases work better? The answer is, quite a bit.

Perhaps the most elegant approach is better indexing. This involves a variety of alternatives to the standard binary tree (“B-tree”) indexes traditionally used by relational databases. Entries in a B-tree index contain the value that will be looked up–the “key”, often a unique identifier such as Customer ID Number–and the number (or physical address) of the record with that value. Entries are stored at the tips of a tree-like branching structure and are reached through a series of comparisons: for example, finding the number 3 might involve first branching on whether the number was greater than five, then branching on whether it was greater than two, and finally branching on whether it was three or four.

Because B-tree indexes use a minimum number of comparisons to find a specific record, they are well suited to the single-record look-ups of online transaction processing. B-trees also accept new values efficiently, since the system only needs to add one more split at the end of the appropriate branch and is unaffected by the physical location of the new record. This means new records can be added without physically rebuilding either the entire index or the underlying data table. The index will eventually become “unbalanced” if many more splits are added to some branches than others. At this point the index is rebuilt.

But the virtues of B-trees are irrelevant in marketing applications which neither look up single records based on known values nor insert new records as they are created. These applications need faster ways to deal with large masses of records, often spread across multiple relational tables.

Broadly speaking, indexes can do this in two ways: they can find the right records more quickly, or they can resolve queries without reading the underlying records at all. Cluster, hash and prejoined indexes primarily do the former, while inverted indexes primarily do the latter. Bitmap indexes do some of both. Here are some details:

– cluster indexes. These indexes place related records in physically adjoining locations. For example, all customer records in the same state may be placed together, or each customer’s transaction records could be stored alongside the customer’s household record. Both strategies aim to improve performance by reducing the number of different disk locations that must be loaded to answer a specific query. Attentive readers will remember that this is a specific problem with relational databases, where data is actually retrieved in “blocks” that may contain hundreds of records each–so increasing the proportion of useful records contained within a single block can reduce the number of blocks read and increase performance significantly.

From a marketing perspective, the obvious problem with cluster indexes is they require a fixed definition of which records are “related”. This often changes from query to query in a marketing situation, where one selection may be based on demographics and the next on transactions. Still, there are some applications where the same selection criteria are nearly always used, such as SIC code in business list rentals. In cases like these, a cluster index can be helpful. The dependence on physical placement of data also makes it inefficient to insert new records into a cluster index, since some number of records must be rearranged to “make room” for the new entry, but this is more of a concern with online transaction processing than most marketing applications. Similarly, the fact that cluster indexes are not as efficient as B-trees at finding individual records is not tremendously important to marketers.

– hash indexes. These indexes use some characteristic of the data itself to determine the physical location. A trivial example would be an index based on record number–where the record number itself defines the physical location of the associated record. Hash indexes require an algorithm that can generate a specific value from the input and relate this value to a fixed physical location. This is only practical in highly structured situations, such as multidimensional databases where data is classified according to a combination of predefined values. Imagine for example a database with entries by month, by region and by product code: all possible combinations are known in advance, and it is possible to map each combination to a physical location similar to a cell in a spreadsheet. This approach is in fact used by multidimensional database vendors, although they must wrestle with the waste implied by allocating some physical space to each combination, including those with no actual value–the equivalent of blank cells in a spreadsheet. The proportion of empty combinations is referred to as “sparsity”, and different multidimensional vendors have their own tightly-guarded methods for reducing the waste to acceptable proportions.

Although multidimensional databases are often used for structured types of marketing analysis, many database marketing applications require a much more fluid approach. This has largely prevented application of hash indexes to a broader set of marketing database situations.

– prejoined indexes store the relationships among records in two tables in a relational database. This saves the work of rediscovering these relationships each time the two tables are joined in a query. In conventional transaction processing, where joins involve only a few records and can be accomplished efficiently with B-tree index searches, the benefit usually isn’t worth the cost of building the prejoin index and keeping it current as the underlying files change. But since marketing queries often join very large numbers of records, the performance benefits from a preexisting join index can be substantial. Marketing files are also usually not being updated in real time, so there is less overhead needed to keep the prejoin index current.

A prejoin index does raise a flexibility issue, since it only makes sense if the tables are consistently joined according to the same relationship. But this imposes little real cost, since the number of meaningful join paths between two tables is usually quite limited: although customers might occasionally be related to transactions on something like Zip code, it’s a safe bet that the two tables will usually be linked by Customer ID. Where multiple paths are more common, it is possible to create alternative join indexes. But in any event, the presence of a join index would never prevent the system from doing a conventional join if one were required.

– bit map indexes have been discussed in a previous article. In brief, a bit map index creates a separate string of bits (1’s and 0’s) for each value that can be taken by a given data element, where each bit represents whether that string’s value occurs in a specific record in the file. That is, the first bit in each string represents the first record, the second bit in each string represents the second record, and so on. The value at any given bit position is set to 0 in all strings except the string representing the actual value in the corresponding record.

Traditionally, bit maps have been viewed as a proprietary alternative to conventional relational databases. But they are now available as an alternative index type within standard relational databases including Oracle and DB2. These systems use simple bit maps, which are limited to numeric and coded elements with relatively few values and cannot be used in calculations. The proprietary bit map databases, including Sybase IQ and Mercantile IRE (now owned by Harte-Hanks), are considerably more powerful. The main advantage of simple bit map indexes is that they are very fast at assembling sets of records–a very useful capability for marketing systems, which usually deal with sets rather than individual records.
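In Oracle, for instance, a bit map index on a low-cardinality field is declared with a single statement (index, table and column names invented; other vendors handle the details differently):

    CREATE BITMAP INDEX ix_customer_gender
        ON customers (gender)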

– inverted indexes–not to be confused with inverted files–create a list of all records with each value of the chosen “key”. For example, a state index might have one entry for each state, containing the record numbers of the customers in the state. Like a cluster index, this works well when it is possible to determine in advance how the records will need to be grouped. But because the inverted index works with lists of records, rather than being dependent on the physical placement of the records themselves, users can accommodate different groupings by creating a different index for each approach. The system then simply selects the appropriate index based on the particular query. This makes the inverted index considerably more flexible than a cluster index.

Inverted indexes have other advantages as well. In addition to the list of records associated with each key, the index entry usually also stores a count of the included records, so that queries requiring only a count can be answered instantly by reporting the precalculated value. Counts involving a range are almost as fast, since the system must merely sum the counts for index entries of the appropriate values. Even counts involving more than one indexed field can still be resolved without looking beyond the indexes themselves, by comparing the record lists in each index. This approach can support any set-oriented operation, such as reporting how many records appear on both lists, either list, or one list but not the other. Because the record numbers themselves are available in addition to the counts, it is easy to go from the count to the actual list of records selected by the query, or to refine an existing query result without querying the entire original file.

Inverted indexes can be built on nearly any kind of data and can even include different types of records–say, the combination of household and transaction records for a certain set of customers. Although their greatest advantage over B-trees comes when there is a limited number of unique key values, they can handle very large numbers of unique values if necessary. In the extreme case, where there is just one record for each key value, an inverted index is essentially the same as a B-tree–depending on the mechanism (which could well be a B-tree) the system uses to locate the index entries themselves. This makes inverted indexes considerably more flexible than conventional bit map indexes, which become unwieldy when they must manage large numbers of unique key values. Of course, inverted indexes are slower than bit maps in many situations, since comparing sets of record numbers takes more processing than comparing strings of bits. Inverted indexes also share simple bit maps’ inability to support mathematical calculations within queries. Again like bit maps, inverted indexes are difficult but not impossible to update as individual records are added to a file.

* * *

Indexes are attractive supplements to conventional relational database technology because they offer huge performance gains without changing the underlying data structure. But sometimes more radical solutions are necessary.

Precalculation is probably the most common approach. At its simplest, precalculation involves identifying frequently-requested calculated values and storing them on the database, so they do not have to be recreated each time they are needed. Common examples would be storing last purchase date or total lifetime purchases on a customer record, so they need not be recalculated from the transaction file. Nearly every marketing database does some of this, sometimes with processes that run for several days.

The challenge is determining which values to precalculate. System designers must predict which values will be used frequently enough to justify the cost of calculating and storing them. And they should regularly review actual usage to check whether new values should be added or current values are no longer used.
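Mechanically, the precalculation itself is usually just a batch job that rebuilds the stored values on some regular cycle, along the lines of this sketch (names invented; customers with no orders would be left with a null that needs separate handling):

    UPDATE customers
       SET lifetime_purchases =
           (SELECT SUM(o.amount)
              FROM orders o
             WHERE o.customer_id = customers.customer_id)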

Aggregation is really a specialized form of precalculation. But while simple precalculation involves placing calculated values on an existing record, aggregation often refers to creating entirely new tables to hold summary data. The strategy is widely used in data warehouse design, where common queries often begin with summary data–say, sales by month by region–and only later drill down to detail such as a specific product in a specific store on a specific day. Like other forms of precalculation, aggregate tables can provide a tremendous performance improvement compared with regenerating the summary information from the underlying details. Also like other types of precalculation, determining which aggregates to create requires careful analysis of actual user behavior. Unlike simple precalculation, aggregate tables require a fair amount of software “intelligence”, so queries will automatically use the appropriate aggregates when they exist. Only a few pieces of specialized software are currently able to do this, although it is reasonable to expect that the function will eventually become part of standard relational database capabilities.
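A minimal sketch of an aggregate table, again with invented names: build it once in batch, then point the routine monthly reports at the small summary rather than the large detail table.

    INSERT INTO sales_by_month_region (sales_month, region, total_sales, order_count)
    SELECT sales_month, region, SUM(amount), COUNT(*)
      FROM sales_detail
     GROUP BY sales_month, region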

Denormalization involves storing data in structures that differ significantly from the “normalized” forms of most transaction processing databases. “Normalization” is a design technique that attempts to store each item of data only once–something that helps to ensure consistency and fast updates in a transaction processing environment.

Normalized databases typically have many tables, since there are so many different entities, each with its own relationship to the others. Imagine, for example, a database with household demographics. The “obvious” organization would probably be to store one record per customer, repeating the household characteristics such as number of children and Zip-based demographics on each record. But a normalized database would split the record into at least three tables: one for household-specific information such as number of children; one for customer-specific information such as name and age; and another for geographically-based information such as average income of the Zip code. A more elaborate design would probably have additional tables, such as a physical address table (in case there is an apartment house with characteristics that apply to more than one household) and an account table (which might relate to some but not all individuals in the household).

Normalization is just shy of a religion among traditional relational database designers, so purposely denormalizing a database strikes many as unnatural, if not Satanic. But fast, consistent online updates are not a primary concern with marketing databases, while the many tables in a normalized design lead to multiple joins, complex query syntax, end-user confusion, and slow response. A design with fewer tables is easier and faster to work with, despite its theoretical shortcomings.
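A minimal sketch of the contrast, using the household example above with invented columns:

    -- Normalized: three tables, joined at query time
    CREATE TABLE households (household_id INTEGER, num_children INTEGER, zip CHAR(5));
    CREATE TABLE customers  (customer_id INTEGER, household_id INTEGER,
                             name CHAR(30), age INTEGER);
    CREATE TABLE zip_demos  (zip CHAR(5), avg_income DECIMAL(10,2));

    -- Denormalized: one flat table, repeating household and Zip data on every row
    CREATE TABLE customer_flat (customer_id INTEGER, name CHAR(30), age INTEGER,
                                num_children INTEGER, zip CHAR(5),
                                avg_income DECIMAL(10,2));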

Star schemas are a specialized type of denormalized design, commonly used for multidimensional and data warehouse applications. A star schema has a large central “fact” table surrounded (like the points of a star) by smaller reference tables with common data classifications (time, geography, product group, customer type, etc.) The fact table holds the actual information being reported (often numerical measures such as price and cost) as well as keys pointing to the reference tables. Often the fact data itself is aggregated–for example, transactions might be summarized by date range and product group–although it can be stored at any level of detail (referred to as “granularity”).

The benefit of the star schema is that most queries are first processed in the reference tables, which are small and thus fast to work with; these generate the keys that can then be used to extract the appropriate rows from the central fact table in a simple selection. This approach is called a “star join”, and is something that conventional relational databases are just beginning to add to their normal operations. Speed improvement for complex analytical queries can be several orders of magnitude–or the difference between finishing and not finishing at all.
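In rough outline, a star join looks like the sketch below (names invented): the conditions on the small dimension tables do most of the filtering, and only then does the query touch the big fact table.

    SELECT t.sales_month, p.product_group, SUM(f.sales_amount)
      FROM sales_fact f, time_dim t, product_dim p
     WHERE f.time_key      = t.time_key
       AND f.product_key   = p.product_key
       AND t.sales_year    = 1997
       AND p.product_group = 'OUTERWEAR'
     GROUP BY t.sales_month, p.product_group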

While the classic star schema has reference tables that are much smaller than the central fact table, versions have also been developed where one of the reference tables is a large customer table. This has never been called a Laurel and Hardy schema but perhaps should be. It takes some special treatment but can be appropriate in marketing database applications.

Another variation on the star is the “snowflake” schema, where the reference tables themselves are divided into a denormalized central table surrounded by its own reference tables. Some star schemas incorporate different levels of aggregate records directly within the same fact table, although it is more common to store the aggregates separately. It is also possible to create a “star index”, which is essentially a prejoined index among the reference tables.

The details of the database design are rarely of interest to end-users. But it is important to realize that not all relational query and reporting software can work with a star schema, and even “star aware” systems may work with only a particular schema variant.

Language extensions include a variety of special functions that database vendors add to standard Structured Query Language. In the past, these extensions often provided support for transaction processing requirements, but today they are increasingly addressed to analytical activities needed for marketing. Common analytical extensions include support for cumulative, ranking or ratio functions, for multidimensional processing, and for time/date manipulation. Of course, these extensions are non-standard by definition, so they make any system that uses them considerably less “open”. But they can offer substantial productivity improvements, both in terms of query simplicity and response time. This makes them hard to resist.

Managed query environments represent an attempt to simplify the use of SQL-based systems, even though they may not actually improve performance. The best known of these is probably Business Objects, although there are many others including Andyne GQL, Brio BrioQuery, and Speedware Esperant. The systems install a “thick semantic layer”, meaning that they present data in ways that are simple and easy for users to understand. This might involve using non-technical field names (perhaps with different users seeing different names for the same element), presenting data in hierarchies that do not match the physical organization of the actual database, limiting the elements presented to a given user, making available different sets of elements representing different views of the file, predefining relationships such as joins so that users don’t have to, incorporating functions that are not part of standard SQL, and allowing users to create new data elements from existing data. Typically these conveniences are set up and maintained by technical staff. The SQL code generated by these systems is usually fairly standard, so they may not be able to take advantage of specialized indexes or design techniques.

Partitioning involves the distribution of data into subsets that are likely to be queried independently. For example, a history file might be stored in separate tables for each month. This would make it easier and faster to run queries that compared one month to another. It would also mean that more extensive indexing could be applied to the older data, since those tables would not change even if the current month were still being updated. Partitioning also simplifies some administrative tasks such as purging of old data and backup of changes. It is widely used for very large databases.
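One common hand-built version of the monthly example is simply a set of separate monthly tables tied together by a view for queries that span the whole history, roughly as sketched below (names invented; whether the optimizer handles such a view gracefully varies from product to product):

    CREATE VIEW orders_all AS
        SELECT * FROM orders_199707
        UNION ALL
        SELECT * FROM orders_199708
        UNION ALL
        SELECT * FROM orders_199709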

Some marketing database applications lend themselves to partitioned designs, often along the lines of customer groups. Unfortunately, like other solutions that depend on predicting how queries will segment the file, partitioning will often benefit only queries that run along the expected lines. Depending on the situation, partitioning may even degrade performance on queries that run in other directions.

Parallel processing is the heavy artillery of relational database performance improvement. The basic idea of parallelism is to split a job into pieces that are executed simultaneously on several processors. The two basic hardware approaches are Symmetrical Multiprocessing (SMP), where processors share memory and disk storage, and Massively Parallel Processing (MPP), where each processor has dedicated memory and storage. A hybrid strategy, Non-Uniform Memory Access (NUMA), allows processors to be tightly connected to some shared memory and loosely connected to others. Parallel processing is often linked to a data partitioning strategy.

Although parallel processing is a hardware solution, it requires software that can take advantage of the multiple processors. MPP traditionally required specialized databases such as Teradata, although other vendors now provide MPP support as well. SMP has been easier to implement with conventional relational database systems. The details and maturity of vendors’ parallel software capabilities vary, but this is an area of rapid development and improvement. Performance improvements compared with non-parallel systems can be tremendous. Unfortunately, high-end parallel systems are often quite expensive.

* * *

“I ask for a meal and you offer a cook book,” grumbles our hero. “There is clearly no shortage of approaches to improve performance of relational database systems for marketing applications. Unfortunately, no single solution provides a perfect answer, and each partial solution adds cost and complexity. How am I to choose?”

By now it should be clear that open and proprietary databases have different strengths and weaknesses. This means the practical question is not which technology is ultimately “better”, but how to choose the appropriate technology in each particular situation. There are many factors to consider.

Response time is the most obvious issue. But measuring response time is not a simple task. Results for a given technique, such as bit map indexes, will vary depending on how it is implemented in different tools. Even the same tool may give different results depending on the hardware, database size, data structure, and skill of the person setting it up or using it. And a tool that performs well at one task may do poorly at another.

In short, the only certain way to assess the response time of a particular system is to test it on your own data, hardware and applications. Taken literally, this would mean deploying the actual system for a trial period–something that involves more cost, time and risk than most businesses can afford. The challenge then becomes determining which elements can be scaled back without distorting the final results. Of course, the answer depends on the situation: if the planned database is very large, it is dangerous to test on a smaller data set because performance often is not smoothly proportionate to database size. Similarly, it is dangerous to test a single-user installation when the actual system will have many simultaneous users.

A more defensible shortcut is to test a subset of the required queries and functions–so long as these are representative of the complete workload. This requires understanding both the specific activities that users will perform and what aspects of those activities are likely to stress the system. For example, here are some typical test queries that are carefully designed to test particular capabilities:

“Show customers whose balance is greater than the average balance.” This requires two passes through the file: one to generate the average balance and a second to select customers against that average. Although fairly simple from a processing standpoint, it might be difficult to generate with a simplistic query tool.

“Find customers with more than two orders having more than two line items.” This requires correlated subqueries–first to identify orders with more than two line items, and then to identify customers with more than two of these. It takes a nasty piece of SQL code (a sketch appears after this list) and could be a significant performance problem.

“List insurance policyholders with no homeowners policy and at least one auto policy with total claims less than $1000.” In addition to the subquery to find total auto claims, this requires “not” logic to find customers with no homeowners policy. “Not” conditions can be difficult to code properly.

“Show me the customer names, the number of times they’ve ordered, and the number of times they’ve complained.” This requires separate queries of the order and complaint tables, since joining the two tables directly would result in duplicate rows for people with more than one order and more than one complaint. Many query generators cannot handle this requirement.
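
One workaround is to count each table in its own scalar subquery rather than joining, which avoids the duplicate-row problem; table and column names here are invented:

    SELECT c.customer_id, c.name,
           (SELECT COUNT(*) FROM orders o
            WHERE  o.customer_id = c.customer_id)   AS order_count,
           (SELECT COUNT(*) FROM complaint cm
            WHERE  cm.customer_id = c.customer_id)  AS complaint_count
    FROM   customer c;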

“Find customers who opened a new account within three months of receiving a promotion.” This requires comparing dates on two separate records. Any cross-record comparison can be difficult, relative date calculations add another challenge, and the combination is a potential performance problem.
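
A sketch, assuming hypothetical promotion and account tables with promo_date and open_date columns; the date arithmetic shown uses the SQL-92 interval syntax, which not every database supports:

    SELECT DISTINCT p.customer_id
    FROM   promotion p, account a
    WHERE  a.customer_id = p.customer_id
    AND    a.open_date  >= p.promo_date
    AND    a.open_date  <= p.promo_date + INTERVAL '3' MONTH;  -- date arithmetic syntax varies by vendor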

“Find customers who have returned more than 25% of their purchases, rank them by total order volume, and select a 500-record Nth sample of the top 50%.” This requires calculations within a query, ranking, selecting a top percentage, and taking an Nth sample. Most of these are problematic in standard SQL; some (like calculations in simple bit map indexes) are impossible in particular systems.
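
The calculation portion can at least be sketched in standard SQL, again with invented orders and return_item tables; the ranking, top-percentage cut and Nth sampling would generally fall to vendor extensions or procedural code:

    SELECT c.customer_id,
           (SELECT SUM(o.order_amount) FROM orders o
            WHERE  o.customer_id = c.customer_id)       AS total_orders,
           (SELECT SUM(r.return_amount) FROM return_item r
            WHERE  r.customer_id = c.customer_id)       AS total_returns
    FROM   customer c
    WHERE  (SELECT SUM(r.return_amount) FROM return_item r
            WHERE  r.customer_id = c.customer_id) >
           0.25 * (SELECT SUM(o.order_amount) FROM orders o
                   WHERE  o.customer_id = c.customer_id);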

It’s pretty easy to restate most of these queries in terms that are relevant to different industries. Still, not every installation will require each class of query. So even though exposing the limits of a product can be fun in its own right, tests should really be restricted to tasks that are part of the actual requirements.

Of course, identifying these requirements is not so simple. The users themselves may not really understand the exact functions they need, particularly if they are new to database marketing. And even if they do know how they will use the system at first, their needs are almost certain to change over time in unanticipated ways.

One partial solution to this problem is to define a few key scenarios that describe expected business processes such as a marketing campaign or profitability analysis. By choosing a concrete project, these scenarios make it easier for novice users to visualize all the tasks that must be performed. Scenarios can identify specific requirements–say, producing output for different segments as separate files–that would elude a more generalized requirements analysis yet can turn out to have significant practical implications. Scenarios also make it much easier for the user to put the different components of system performance into context: while subsecond queries may seem essential in theory, their importance can fade once users see that it takes many minutes to set up each query or hours to run a selection. The number of scenarios needed to cover all major applications of a system is often quite small–no more than a half-dozen in some cases.

As the scenarios will highlight, query performance is just one consideration when comparing solutions. The time to load and index data may also be a critical issue, particularly in large databases where such processes can run for days. There is often a difference to consider between building a file from scratch and making incremental updates that append or modify records in an existing file: some technologies do not permit incremental updates or do them poorly. It is relatively easy to benchmark load and indexing performance, although results in these areas can be particularly sensitive to both hardware and the skill of the person doing the setup.

Data loading tests will also give a good measure of the amount of storage required by a particular technology. Some systems, such as those built on bit map indexes, can compress data so it occupies significantly less space than in its raw form. Other methods, such as multidimensional databases that precalculate a great many computed values, can actually expand the data to many times its original volume. Relational databases usually store the data in something approaching its raw volume, although indexes and workspace can easily double the total amount of storage required.

The relative degree of compression or expansion may also depend on the types of data: some systems do well at compressing numeric data, while others may compress text data better. Technologies such as bit map indexes or multidimensional engines may have problems with “high cardinality” fields (having many unique values for a particular data element). Other systems may have strengths or weaknesses in how they store dates, yes/no values, or missing (null) values. Similarly, systems may vary in the types of manipulation or calculations they can perform with the different data types.

The ability to update single records directly–say, to record the outcome of a telemarketing call–is another critical area to consider. This is a particular strength of relational databases, although even relational systems often end up running their online updates and analytical queries against separate files. This is because fast online updates require a “normalized” data structure (one that avoids redundantly storing the same data), while analytical processing often runs best against “denormalized” structures that allow some redundancy in order to reduce the number of data tables. Proprietary systems often do not allow online updates of single records, although there are some exceptions. Of course, not all projects require such updates.
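
A schematic example of the trade-off, with all table and column names invented: the normalized design stores each fact once and keeps single-record updates cheap, while the denormalized design repeats customer attributes on every row to simplify analytical scans:

    -- Normalized: suited to fast single-record updates
    CREATE TABLE customer     (customer_id INTEGER PRIMARY KEY, name VARCHAR(60));
    CREATE TABLE call_outcome (customer_id INTEGER, call_date DATE, outcome CHAR(1));

    INSERT INTO call_outcome VALUES (1001, DATE '1997-09-01', 'Y');   -- record one telemarketing result

    -- Denormalized: suited to analytical queries, awkward to keep current
    CREATE TABLE customer_history (
        customer_id   INTEGER,
        name          VARCHAR(60),     -- repeated from the customer table
        call_date     DATE,
        outcome       CHAR(1),
        total_orders  DECIMAL(10,2)    -- precomputed summary carried on each row
    );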

Projects will also vary in their requirements for other specific functions, such as the ability to do calculations within queries, to generate specific kinds of reports, or to produce extract files quickly. A system’s ability to provide these functions is determined by a combination of the underlying database technology and the specific capabilities provided by the system developer. As always, there is no substitute for defining an exact set of requirements and then testing them rigorously on the proposed system itself. These tests will reveal not just whether a system has the capability to perform a particular task, but also how much work is involved. There are wide variations among user interfaces, often involving a trade-off between flexibility and ease-of-use. It takes careful testing to determine what balance is appropriate for a particular project–or even whether a single tool can meet the needs of all different users.

In situations where custom software development will be required, the evaluation must also assess the programming language and functions available. This issue is most pressing in proprietary environments, where access may be possible only via the vendor’s own programming language or applications programming interface (API). The flexibility, efficiency and power of such access methods can greatly impact the suitability of a system for a particular purpose. Even among relational databases, a particular vendor’s set of SQL extensions may be more or less relevant to the project at hand. In the best cases, vendor extensions can provide huge improvements over the standard SQL capabilities. Of course, relying on such extensions means giving up the vendor-independence that is part of the rationale for using “open” systems in the first place.

Systems also vary greatly in the administrative effort required to implement and maintain them. Because relational database performance is so sensitive to data structure and indexing, these systems in particular require extensive tuning by database architects and administrators. At the other extreme, inverted databases can be quite simple to modify because they do not rely on indexes and retrieve each data element more or less independently. Although administrative effort is difficult to forecast precisely, it is possible to get some sense of the requirements by looking at existing installations for a particular product. Depending on the technology, staffing for similar systems could range from a part-time administrator to several full-time technicians.

Hardware costs follow the same pattern as administrative costs: almost certainly higher for open solutions, but hard to predict precisely in advance. Again, the best approach is to look at similar existing installations.

Two additional issues are often raised in the proprietary vs. open debate. The first is the matter of “integration”. While it is clearly easier to integrate the marketing database with operational systems if both use relational technology, the difference may not matter unless integration is to be done in near-real-time. Today, “integration” is often limited to monthly posting of data such as model scores from the marketing system onto the operational system. In this type of situation, the relational ease of integration is simply irrelevant.

The second consideration is the highly emotional issue of “captivity”. Here, the advantage of relational databases over proprietary systems may be more apparent than real. As discussed previously, there are many practical barriers to switching from one “open” system to another. So it is not clear that users of proprietary systems are any more locked in to their vendors than users of open systems. This is particularly true today, when even most “proprietary” systems allow access by SQL-based query and reporting tools via ODBC or other drivers.

* * *

Readers of this column may feel like the man who demanded a one-armed lawyer–so he would never again hear, “On the other hand…” Unfortunately, the choice of database technology is a complex task with no simple answers. The best advice is to keep a few points in mind:

– there really are significant advantages and disadvantages to the different technologies. This means it is worth the effort required to find an appropriate solution. In fact, it may mean the difference between project success and failure.

– the technology choice is often a highly emotional decision for many participants. Defusing the emotions is often the key to an effective selection process.

– decisions must be based on the specific requirements of the project at hand. Evaluating potential technologies against real requirements is the only way to reach a sound decision–and will effortlessly switch the discussion from emotional generalities to matters of objective fact. For a decision of this importance, no other basis is acceptable.

Postscript: Reader Jim Kuschill of Frequency Marketing, Inc. informs me that last month’s article failed to make an important distinction between b-tree and binary tree indexes. “Binary trees” have a single value at each node; the system navigates the tree based on whether the value at the current node is higher or lower than the value being sought. “B-tree” nodes point to ranges of values; the system may navigate several levels of increasingly-specific nodes before coming to a final node with the actual values and their related records. Binary trees usually access data stored in memory while b-trees usually find records stored on disk.

Jim also points out that my description of a “clustered index” (data stored so that related records are physically adjacent) is more properly called clustered data storage, since the data is clustered, not the index.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
