
Dun & Bradstreet Information Services Information Warehouse
David M. Raab
DM News
December, 1994

A new class of marketing database is emerging: systems with hundreds of millions of records and billions of bytes of data. These systems are very large because they include individual transactions–every telephone call, every bank check, every item purchased–instead of the traditional customer-level summaries. Smaller direct marketers, whose volumes are more manageable, have built this type of system for some time. But until recently, the sheer scale made it impossible for the largest companies to implement something similar.

The task has been made more difficult by large companies’ desire to use standard relational databases such as Oracle and Sybase. This is partly because MIS departments prefer solutions they know and trust–a perfectly reasonable policy in a discipline whose reigning cliche is “you can tell the pioneers by the arrows in their backs.”

But there is also a more specific reason. Leading-edge practitioners now want two-way communication between their marketing and operational systems. That is, they want the marketing system to see customer transactions as they occur and to respond while the customer is still engaged. (Imagine the “cross sell” prompt in a good telephone order entry system–“we have a lovely scarf that matches the blouse you just ordered”–except that the prompt is seen by the customer, not the order taker.) Such communication virtually requires that the marketing database use the same technology as other corporate systems.

The problem has been that standard relational databases are tuned to process individual transactions, not to handle queries on masses of records. Conventional indexes and Structured Query Language (SQL) provide partial solutions at best, and at worst fail entirely when presented with very demanding requirements. A related problem has been the time to consolidate and load huge files of transaction records into a marketing database–a problem not faced by operational systems, where the transactions are added as they occur.

Companies have adopted two main strategies to overcome these obstacles. One approach is to use more powerful hardware: in particular, “parallel processing” systems where many processors work simultaneously on different parts of a task. The other approach uses software that applies specialized indexes to standard relational database files, giving much faster performance on queries. The two strategies are highly complementary, because parallel processing works best at speeding the data load, while specialized indexes help with queries.

Parallel processors can speed queries as well, but the consensus among people who have worked with them is that the benefits are still limited. Interestingly, although the greatest attention is paid to massively parallel processing (MPP) systems, most actual implementations use a less complex technology called symmetric multi-processing (SMP). The basic difference is that MPP systems assign separate memory and disk storage to each processor, while memory and disk are generally shared in SMP systems. The greater amount of sharing makes SMP somewhat easier to implement.
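The MPP arrangement can be illustrated with a small sketch (Python, with invented data; the worker processes stand in for an MPP machine's separate processors, each owning its own shard of records):

```python
from multiprocessing import Pool

def count_large(shard):
    # MPP-style: each worker scans only its own shard, mimicking
    # separate memory and storage per processor; the partial results
    # are merged at the end.
    return sum(1 for amount in shard if amount > 900_000)

if __name__ == "__main__":
    records = list(range(1_000_000))   # toy transaction amounts
    # Partition the table into disjoint shards, one per processor.
    shards = [records[i::4] for i in range(4)]
    with Pool(4) as pool:
        print(sum(pool.map(count_large, shards)))  # → 99999
```

In an SMP configuration the workers would instead read one shared copy of the table in common memory, which is why SMP is the easier of the two to set up.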

Here are two systems targeted at very large databases. Epsilon’s MarketWise, reviewed previously, would be another.

dbINTELLECT (EDS/dbINTELLECT Technologies, 800-324-3245) is a combination of technologies and services aimed at supporting interactive database marketing systems. The system itself uses a three-level structure with a “transaction engine” at the base, components such as address standardization in the middle, and tailored customer applications at the top.

The core of the system is the transaction engine. This allows conventional software (so long as it is written in C or C++) to handle very large volumes of data by running several copies of the software simultaneously–without modifying the software itself or physically splitting the data. For example, the system allows Postalsoft’s address standardization and merge/purge software to run simultaneously on many processors in an SMP machine. EDS says this allows a $400,000 Unix system to outperform a $10 million mainframe on those tasks.

This approach removes one of the major obstacles to parallel processing: the lack of specialized software to take advantage of the hardware’s power. dbINTELLECT can use conventional software because the transaction engine, rather than the component software, allocates the data so that multiple processors do not conflict. Each copy of the component software simply processes the data as it is presented, without regard for what is happening elsewhere.
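The idea, with the engine partitioning the work while the component code runs unchanged, can be sketched in a few lines (Python; the standardizer function is a hypothetical stand-in for a component such as Postalsoft's, and threads stand in for the multiple processor copies):

```python
from concurrent.futures import ThreadPoolExecutor

def standardize(record):
    # Hypothetical stand-in for unmodified component software: it sees
    # one record at a time and knows nothing about the other copies.
    return record.strip().upper()

def run_component(records, component, n_workers=4):
    # The engine, not the component, divides the work: each worker
    # receives a disjoint share of records, so copies never conflict.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(component, records))

print(run_component(["  123 main st ", " 45 oak ave"], standardize))
# → ['123 MAIN ST', '45 OAK AVE']
```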

Having the component software read data from the transaction engine, rather than the underlying files, also means the components can work with data from any physical hardware or database engine–assuming the transaction engine itself is compatible with the hardware or software involved. The initial version of dbINTELLECT, released in October, is available on the Oracle and Informix relational databases and is being run primarily on symmetric multi-processing systems from Silicon Graphics. It also runs on SMP hardware from Cray, and MPP systems from Encore and (now bankrupt) Kendall Square Research. EDS is working to add other operating systems, including Windows NT; other databases, including Sybase and SAS; and other hardware, including MasPar, Tandem, Teradata and Hewlett-Packard.

The transaction engine includes its own parallel data manipulation language, a data dictionary that can unify data from multiple systems, and the ability to assign descriptions and business rules to specific data items. The data manipulation language provides specialized procedures for marketing database functions such as data conversion and extraction. These functions can be used by the component systems for tasks including data transformation, address standardization, duplicate identification, selection, and data visualization. Applications for individual clients are built by linking and customizing components.

Some of the existing components have been custom-developed by EDS, while others are from third parties. Third-party products include address standardization and name matching from Postalsoft, ModelMAX neural network modeling from ASA, and data visualization tools co-developed by EDS and Silicon Graphics. EDS developed the graphical selection interface, which includes marketing capabilities such as Nth samples and record tagging. The firm also plans to integrate MegaPlex’s Fast Count Engine and Expressway Technologies’ Expressway–extremely fast query tools that employ inverted files (MegaPlex) and bit-map indexes (Expressway) for performance that parallel-processing hardware cannot yet match. (Fast Count and Expressway process queries at rates approaching 100 million records per minute, while EDS’s benchmarks showed Oracle running a test query at 1.5 million records per minute on a single processor and about 10 million records per minute on 16 processors.) The EDS selection interface will be linked to MegaPlex, Expressway or the underlying relational database, depending on which is most appropriate for a particular installation.
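A bit-map index achieves such rates by keeping one bit vector per attribute value, so a count becomes bitwise arithmetic rather than a record-by-record scan. A minimal sketch (Python, with invented data; a real engine such as Expressway would keep packed, compressed vectors on disk):

```python
def build_bitmap_index(values):
    # One bit vector per attribute value: bit i is set when record i
    # has that value. Python's arbitrary-size ints stand in for the
    # packed bit vectors a production engine would use.
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << i)
    return index

# Eight toy customer records: state and gender columns.
states  = ["NY", "NJ", "NY", "CA", "NJ", "NY", "CA", "NY"]
genders = ["F",  "M",  "M",  "F",  "F",  "M",  "F",  "F"]

state_idx  = build_bitmap_index(states)
gender_idx = build_bitmap_index(genders)

# "How many New York females?" is a bitwise AND plus a population
# count; the underlying table is never scanned.
matches = state_idx["NY"] & gender_idx["F"]
print(bin(matches).count("1"))  # → 2
```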

(Incidentally, Expressway was recently purchased by Sybase, which is expected to help expand the product and smooth implementation problems reported by some early users.)

EDS plans to add more marketing functions, such as advanced campaign management and statistical modeling, as they are required by clients. These will initially be built as custom applications, and then embedded in components as appropriate. Users can also write their own components in the C or C++ programming languages, or access data with ODBC-compliant query and reporting tools.

The first production installation of dbINTELLECT is planned for early next year at Neodata, a large computer service bureau that is partly owned by EDS. EDS has built a demonstration system with over 200 gigabytes of data (a total of 3.2 billion records) running Oracle on a 32-processor Silicon Graphics machine. (The EDS White Paper describing the 200 GB test process is an eye-opening primer on the practical challenges of a project of this scale–processes such as data loads and index builds ran up to five days, and would have taken much longer without careful tweaking and use of the multi-processor configuration. Copies of the White Paper are available from Bob Fetter at EDS at 303-530-4033.)

dbINTELLECT is aimed at very large in-house installations and at service bureaus that wish to serve a number of smaller clients. List prices start at $1.25 million for databases up to 10 million records and range up to $5 million for a system with several hundred gigabytes. But actual costs will depend on the components and amount of custom consulting involved.

Information Warehouse (Dun & Bradstreet Information Services, 800-618-3087; 201-605-6270 in New Jersey) is also aimed at 50 GB and larger systems. The company relies on symmetric multi-processing to speed up data loads and indexing, using more traditional UNIX hardware such as HP 9000 and IBM RS/6000 systems with up to four processors. But its main technical thrust has been to incorporate a specialized query engine that uses bit-map indexes on standard Sybase files.

The resulting system provides the speed for interactive analysis on very large files, while retaining the industry-standard data format. D&B reports that on one 110 gigabyte system, Information Warehouse returns 90% of its queries in under 30 seconds, and 95% in under four minutes.

The engine is coupled with a point-and-shoot MS Windows front-end with broad segmentation, selection and reporting capabilities. This front-end–the same as used on D&B’s PC-based MarketSpectrum system–is built with PowerBuilder. It could actually work with any ODBC-compliant database, although D&B does not plan to offer it apart from its own systems. The front-end automatically adjusts to changes in the underlying files, reducing the amount of customization needed for each new installation.

Information Warehouse files are currently all built and housed at D&B, although the firm would license the software to run in-house if a client required it. Consolidation and other building logic is customized for each client, and big systems–with dozens of input files–can require several days to update, even using the symmetric multi-processing systems. D&B is exploring the use of massively parallel systems to speed this portion of the process, although it does not expect MPP by itself to yield substantial improvements in query speed. D&B is also working on incremental updates of files and indexes, which would avoid time-consuming reloads of an entire database.

Although each Information Warehouse system is custom-designed, all use a basic four-tier structure of company, site, individual and event. Each event is dated, allowing the system to “roll back” to an earlier state for tasks such as time-series analysis. Promotions and responses are all treated as events, enabling the system to do response analysis by comparing the two groups of records.
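The event-level design can be sketched in a few lines (Python, with invented records; the field layout is an assumption for illustration, not D&B’s actual schema):

```python
from datetime import date

# Each event row: (individual_id, event_type, event_date). Promotions
# and responses are simply events, as in the four-tier design.
EVENTS = [
    (1, "promotion", date(1994, 3, 1)),
    (2, "promotion", date(1994, 3, 1)),
    (1, "response",  date(1994, 3, 20)),
    (3, "promotion", date(1994, 6, 1)),
]

def as_of(events, cutoff):
    # "Roll back": reconstruct the database as it stood on a past date
    # by keeping only events dated on or before the cutoff.
    return [e for e in events if e[2] <= cutoff]

def response_rate(events):
    # Response analysis: compare the promoted and responding groups.
    promoted  = {i for i, t, _ in events if t == "promotion"}
    responded = {i for i, t, _ in events if t == "response"}
    return len(promoted & responded) / len(promoted)

snapshot = as_of(EVENTS, date(1994, 4, 1))
print(response_rate(snapshot))  # → 0.5 (one of two promoted individuals responded)
```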

The interface allows users to create multi-cell promotion campaigns. It can set priorities among cells, assign key codes, place messages on labels and select by attributes such as titles. D&B can provide custom programming for more sophisticated direct marketing capabilities, such as Nth, random or split Zip Code selections within a cell, or limiting the number of pieces sent to a given address. Queries can be saved and reused, and the company is building a scheduler that will be able to run them automatically.

Built-in reporting is currently limited to tabular lists, labels, prestructured forms and pre­structured cross tabs. Once a selection is run, users can browse a sample of the records selected. They can also view graphs of frequency distributions associated with certain reports. The company is adding user-defined cross tabulations, expanded graphing capabilities, and the ability to do calculations and either store the results or use the results to select records for promotions. The system can also export data to 35 different file formats, including most popular spreadsheets, mapping and statistical systems.

Information Warehouse was launched in 1993. It has six current installations, several in excess of 100 gigabytes of data. Pricing for a complete solution–including software, D&B compiled data, processing and modeling services–can run from $1.5 million to $4 million per year.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
