David M. Raab
Relationship Marketing Report
December, 1999
The last two columns in this series have looked at ways to segment the universe of marketing-related systems. Although no fully satisfactory scheme has emerged, one distinction was present in nearly every attempt: batch vs. real-time systems. The general argument was that the technologies needed for these two types of systems are so radically different that the two must be treated separately.
This proposition is worth closer examination–both to understand the nature of the technical differences, and to see how some systems manage to bridge the gap.
First, let’s get the definitions straight. Batch systems execute a sequence of steps without external inputs, while real-time systems wait for user input between steps in a transaction. Batch systems typically apply the same process–such as calculating a model score or assigning a customer segment–to many records in a single job, while real-time systems typically execute a process against a single record per job.
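The distinction can be sketched in a few lines of code. This is a toy illustration, not any particular product's design; the scoring function and record layout are hypothetical.

```python
# Hypothetical scoring process, applied either in batch or per event.
def score(record):
    return record["purchases"] * 2

records = [{"id": i, "purchases": p} for i, p in enumerate([10, 20, 30])]

# Batch: one job applies the same process to every record, with no
# external input between steps.
batch_scores = [score(r) for r in records]

# Real-time: each job handles the single record named by an
# unpredictable external event (a call, an order, a click).
def handle_event(record_id):
    return score(records[record_id])

print(batch_scores)     # scores for all records in one pass
print(handle_event(1))  # score for one record, on demand
```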
These differences in function result in different goals for system design. For a batch system, the key goal is to move through a large data set as efficiently as possible. The goal for a real-time system is to retrieve and update individual records with minimum delay.
Although batch systems usually process large numbers of records, they generally work with one record at a time: they read the record and its associated data, process it, store the outcome, and then repeat the process for the next record. Efficiency is determined primarily by the time it takes to assemble all the data needed to process each record. In a flat file system, this is done by either combining data from multiple sources into a single record before the process begins, or by sorting multiple files in the same sequence so the system can step through them in parallel without extensive searching. This sort of sequential processing is especially well suited to files stored on tapes rather than disk drives, since it allows the system to physically read the records in the sequence they appear on the tape. If the processing were not sequential, then the system would have to search for each set of records from one end of the tape to the other. (Remember all those images of spinning tapes from TV shows and movies in the 1960’s and 70’s? That’s what was going on.)
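The "sorted files stepped through in parallel" technique is essentially a merge: because both files share the same sort order, each record's related data arrives in sequence and no searching is needed. A minimal sketch, with hypothetical file contents represented as lists of dicts:

```python
# Merge-style sequential batch processing: customers and transactions
# are both sorted ascending on 'account', so one forward pass over each
# assembles all related data without any lookups.
def merge_process(customers, transactions, process):
    txns = iter(transactions)
    txn = next(txns, None)
    for cust in customers:
        related = []
        # Advance through transactions until we pass this account.
        while txn is not None and txn["account"] <= cust["account"]:
            if txn["account"] == cust["account"]:
                related.append(txn)
            txn = next(txns, None)
        process(cust, related)

customers = [{"account": 1, "name": "A"}, {"account": 2, "name": "B"}]
transactions = [{"account": 1, "amount": 10}, {"account": 1, "amount": 5},
                {"account": 2, "amount": 7}]

results = []
merge_process(customers, transactions,
              lambda c, t: results.append((c["account"],
                                           sum(x["amount"] for x in t))))
print(results)  # one total per account, computed in a single pass
```

On tape, this forward-only pattern is the only efficient option; the same logic is why the largest batch jobs still favor sorted extracts even on disk.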
In contrast, a relational database is explicitly designed not to place records in a specific sequence. Instead, relational systems rely on indexes to link the related data and typically load the data itself onto disk drives that can quickly access records that are not physically adjacent. Still, because sequential access is inherently more efficient than even the fastest disk drive, many of the largest-volume batch systems create an ordered extract that is then processed like a flat file.
Relational systems also often improve efficiency by “denormalizing” the data, which means storing the same piece of information in more than one record. This violates a cardinal rule of relational database design, which says each item should be stored only once. The rule exists to ensure data consistency and speed updates. But violating it will reduce the number of tables that must be searched and read to process a record. This can yield major performance gains.
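A small illustration of the trade-off, using hypothetical table contents: the normalized design stores each customer's attributes once, while the denormalized design repeats them on every order so a batch pass never has to join tables.

```python
# Normalized: two tables; the customer's name and segment are stored once.
customers = {101: {"name": "Smith", "segment": "Gold"}}
orders = [{"order_id": 1, "cust_id": 101, "amount": 50},
          {"order_id": 2, "cust_id": 101, "amount": 25}]

# Denormalized: one table; customer data is repeated on each order,
# violating store-it-once but eliminating the per-record join.
orders_denorm = [
    {"order_id": 1, "cust_id": 101, "name": "Smith",
     "segment": "Gold", "amount": 50},
    {"order_id": 2, "cust_id": 101, "name": "Smith",
     "segment": "Gold", "amount": 25},
]

# A batch pass over the denormalized table needs no lookup per record.
gold_total = sum(o["amount"] for o in orders_denorm
                 if o["segment"] == "Gold")
print(gold_total)
```

The cost is the one the rule was written to avoid: if Smith's segment changes, every order row must be corrected, which is why updates are slower and consistency is harder to guarantee.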
Batch systems can get away with denormalization and sequential processing because they are not subject to the same constraints as real-time systems. Most real-time systems don’t know which record will be needed next, because they are reacting to unpredictable events such as which customer will place an order or call for service. Therefore real-time systems need search mechanisms like indexes on account numbers, which allow them to find any particular record quickly. By contrast, a batch system will eventually process all records in its set, so it has no particular need to locate a specific record first. Real-time systems also must be kept internally consistent at all times, since two transactions relating to the same account might occur almost simultaneously, and different kinds of transactions might occur in different sequences. This makes it much more dangerous for real-time systems to violate the relational principle of “normalization”–storing each piece of information only once–than for batch systems, which exist in a much more controlled environment. Similarly, real-time systems are also more focused on the update speed that normalized designs provide.
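Why the index matters can be shown in a few lines. This is only a sketch: a real database would use a B-tree or hash index, but a Python dictionary plays the same role here, and the record layout is hypothetical.

```python
# Without an index, finding an unpredictable account means scanning
# every record; with one, any record is a single lookup away.
records = [{"account": "A17", "balance": 100},
           {"account": "B42", "balance": 250},
           {"account": "C03", "balance": 75}]

# Build the index once...
by_account = {r["account"]: r for r in records}

# ...then serve each incoming event with a direct lookup, regardless
# of where the record sits in the file.
def handle_call(account):
    return by_account.get(account)  # constant-time, not a full scan

print(handle_call("B42")["balance"])
```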
So, to oversimplify a bit, batch systems use sequential processing and denormalized data structures (few tables with some redundant data), while real-time systems use indexes, random access and normalized structures (many tables with no redundant data). While it’s possible for one system to do both, most software is optimized for one or the other. This is why the distinction is so fundamental when attempting to classify different marketing products.
Specifically, traditional data warehouses and database marketing systems tend to use batch processing techniques–after all, most queries are looking for patterns or segments in the entire database, not picking out a single customer or account. By contrast, front-office systems for customer service, sales automation or contact management are real-time systems that must be designed to work with one customer at a time.
The problem, of course, is that today’s goal is to merge the back-end marketing database with the front-office customer contact system. This lets users define customer strategies in the back-end system–which has the rich history data and analytical capabilities–and execute the strategies in the front-office system during the real-time interactions. So designers are being asked to make one system handle both batch and real-time processing.
As with most computer processing challenges, there are two basic solutions: brute force and elegant design. Given the continued drop in hardware costs, brute force is often the best approach. But in some situations, elegant design is still worth the effort.
In dealing with real-time marketing systems, the classic application of brute force is parallel processing: splitting a single batch job into many smaller jobs and running them all simultaneously. IBM’s SP2 and NCR’s Teradata are the most common examples of massively parallel systems, although other vendors have products as well.
Massively parallel systems can deliver high performance on both batch and real-time jobs. But the hardware is expensive, and developers must usually tune the application software and data structure for optimum performance.
This tuning is costly and time-consuming, which is bad enough. But it also means that the resulting system may perform poorly when faced with unanticipated demands. For example, one common tactic in parallel system design is to store data from different date ranges on separate hard drives (each served by its own processor). This works great when queries look across all date ranges, since the different processors can work on the different date ranges simultaneously. But if queries suddenly focus on a single date range, the system will slow considerably because only one processor can access the necessary data. (Reality is a bit less grim, since parallel systems can usually give several processors access to the same data if necessary. But performance will still suffer.)
A newer brute force approach involves “main memory” databases, which essentially move the underlying data from a disk drive into high speed, random access memory. Specialized database management systems that do this include TimesTen (www.timesten.com) and Angara Data Server (www.angara.com). These systems can access records ten to twenty times faster than if the data were stored on a disk drive; they can also employ specialized indexes that reduce the performance impact of bringing together related records from many different tables. The most important current application of this technology is managing Internet interactions, where systems may need to access huge volumes of data in real time. But the fast access provided by the main memory systems allows them to complete batch processes extremely quickly as well.
For companies that are unable or unwilling to apply brute force solutions, the alternative is a system design based on conventional technology. Since the same conventional data tables generally cannot provide adequate performance for both real-time and batch tasks, this usually involves maintaining separate data tables for the two types of applications, and somehow synchronizing them. The simplest approach is to first load all data into a conventional marketing database–structured for batch processing–and periodically create extracts that are structured for access by real-time systems or feed data into the real-time systems’ own tables. The problem with this method is that batch processes are used to update the conventional database and to generate the extracts. This means the marketing system cannot feed adjusted information as a transaction occurs. So the marketing feed itself is something less than real-time.
A slightly more sophisticated approach is to update the table that supports the real-time systems at the same time that the main marketing database is updated. This avoids the lag due to batch extracts, but still must wait for the batch updates of the main database. The only way to avoid this second lag is to update the real-time table directly, rather than filtering data through the main marketing system first. Some systems–particularly those designed for Internet marketing–do maintain a profile database that is updated in real time in this fashion. In addition to simply capturing the new transaction, such a system might recalculate derived values such as cumulative purchases and model scores, and use the adjusted values in managing the interaction. The new data would be periodically added to the main marketing database during its regular batch update. This sort of synchronization is about the best that can be done with conventional technology.
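The synchronization pattern can be sketched as follows. All names here are hypothetical, and the "databases" are in-memory stand-ins: each transaction updates the real-time profile immediately, including derived values such as cumulative purchases, and is also queued so the main marketing database can absorb it during its next batch run.

```python
# Real-time profile table, kept current transaction by transaction.
profiles = {"C1": {"cumulative_purchases": 100.0}}
batch_queue = []  # transactions awaiting the periodic batch load

def record_transaction(cust_id, amount):
    """Update the profile in real time and queue the transaction
    for the main marketing database's next batch update."""
    profile = profiles.setdefault(cust_id, {"cumulative_purchases": 0.0})
    profile["cumulative_purchases"] += amount  # derived value, recalculated
    batch_queue.append({"cust": cust_id, "amount": amount})
    return profile  # adjusted values usable during the interaction itself

def nightly_batch_update(main_db):
    # Periodically fold queued transactions into the main database.
    while batch_queue:
        txn = batch_queue.pop(0)
        main_db.setdefault(txn["cust"], []).append(txn)

p = record_transaction("C1", 25.0)
print(p["cumulative_purchases"])  # already reflects the new purchase
```

The design choice is the one described above: the interaction never waits on the batch cycle, and the main database trades freshness for the efficiency of bulk loading.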
As marketers continue to integrate real-time front-office systems with batch-oriented marketing databases, vendors will face increasing pressure to combine batch and real-time processing in a single system. As we’ve seen, this is a difficult task using today’s standard (relational) technologies. Buyers looking for an integrated system should look carefully at each vendor’s approach to this challenge, to ensure the system they purchase will meet both current and future needs.
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.