DataDelta, Inc. DataDelta
David M. Raab
DM News
August, 2005
After years of obscurity, customer data integration is now in the spotlight. Most attention goes to sophisticated data hubs, which consolidate and share customer information across different systems. These hubs differ primarily in whether data is copied to a central database or read directly from the system of origin. Discussions focus on the merits of each approach and the best way to execute it.
But the true heart of a customer data integration system is its matching engine. This determines which records are associated with each customer. Matching engines are distinct from the data hubs; in fact, many data hub vendors use matching engines built by someone else. Typical choices are well-established matching specialists like Trillium Software, Firstlogic, SAS DataFlux or Identity Systems Inc. (formerly Search Software America). Some data hub vendors don’t provide any matching engine at all, but simply integrate with whichever one their client chooses.
The focus on data hubs diverts attention from the matching engines. Yet matching engines differ significantly in speed, accuracy, flexibility, cost, and ease of use, so choosing the right one is important. The only meaningful way to compare matching engines or to tune them for peak performance is to run them with actual data and examine the matches they produce. This is a painstaking process because most systems require considerable tuning to yield optimal results. Tuning typically involves simultaneous adjustments to several settings, which can have unexpected as well as intended consequences. As a result, users must reexamine all match results to assess the full impact of any change. Many matching engines help by tagging matches with a reason code. But even this does not isolate the matches added or dropped as the result of a particular change.
DataDelta (DataDelta, Inc., www.datadelta.com, 336-510-8885) identifies differences in matching results. It can compare results from two different matching engines or results from the same engine with different settings. Other applications include testing the impact of new data sources on matching results and isolating differences in results when shifting from one matching engine to another.
Input to DataDelta is two or more flat files, each containing output from a matching engine. The system reads only two data elements per input record: a unique record ID, typically a customer or account identifier that ties back to other information such as name and address, and a group ID, which links all records that have been identified as matching each other. The record IDs must be consistent across the input files (that is, the same customer must have the same record ID in each file), but the group IDs can be assigned independently. Because DataDelta works only with record and group IDs, input files do not need to include any personally identifiable information. Even when account numbers are used as record IDs, they can be encrypted without affecting results. This makes it easier to use DataDelta without raising privacy or security issues.
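To illustrate why masking record IDs does not change the comparison, here is a minimal Python sketch. It is not DataDelta's own method: it swaps in a keyed hash (HMAC-SHA-256) in place of reversible encryption, and the key, the account numbers, the group IDs and the function names are all hypothetical. The point is only that the same input ID always yields the same masked ID, so records still line up across files and group membership is unchanged.

```python
import hashlib
import hmac

# Hypothetical secret held by the data owner; not part of DataDelta.
SECRET_KEY = b"example-key-kept-by-the-data-owner"


def pseudonymize(record_id: str) -> str:
    """Replace a record ID with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(SECRET_KEY, record_id.encode("utf-8"), hashlib.sha256).hexdigest()


def groups(rows):
    """Return the set of match groups, each a frozenset of record IDs."""
    members = {}
    for rec, grp in rows:
        members.setdefault(grp, set()).add(rec)
    return {frozenset(m) for m in members.values()}


# Hypothetical match output: (record ID, group ID) pairs from two engine runs.
# Group IDs are assigned independently in each file.
file_a = [("ACCT-1001", "G1"), ("ACCT-1002", "G1"), ("ACCT-1003", "G2")]
file_b = [("ACCT-1001", "7"), ("ACCT-1002", "8"), ("ACCT-1003", "8")]

masked_a = [(pseudonymize(rec), grp) for rec, grp in file_a]
masked_b = [(pseudonymize(rec), grp) for rec, grp in file_b]
```

Because the masking is deterministic, the masked files contain the same groupings as the originals, only with unreadable IDs. Whoever holds the key can recompute the masked IDs later to tie results back to names and addresses.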
DataDelta compares the match groups contained in the two input files. Each group is assigned to one of four classes: same (the same records are grouped together in both files); splits (one file has two or more groups containing records that the other placed in a single group); merges (one file has a single group for records that the other placed in two or more groups); and networks (one file created multiple new groups by splitting and merging groups in another file). All records in a given group are assigned the same case number; for merges and networks, records in all the affected groups are assigned to the same case. Since a split from the perspective of one file is a merge from the perspective of the other, the system lets the user specify which file to treat as the base.
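The four classes can be sketched in a few lines of Python. This is an illustration of the idea, not DataDelta's proprietary engine: it treats every group in either file as a node, links a base group to a comparison group whenever they share a record, and calls each connected cluster of groups a case. The function name, the sample data and the convention that one base group divided among several comparison groups counts as a split are assumptions made for the example.

```python
from collections import defaultdict


def classify_cases(base, other):
    """base, other: dicts mapping record ID -> group ID.

    Returns a dict mapping record ID -> (case number, class).
    """
    if base.keys() != other.keys():
        raise ValueError("both files must cover the same record IDs")

    # Union-find over group nodes; the "b"/"o" tags keep the two
    # independently assigned group ID spaces apart.
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # A shared record links its base group to its comparison group.
    for rec in base:
        union(("b", base[rec]), ("o", other[rec]))

    # Collect the distinct groups from each file inside every case.
    base_groups = defaultdict(set)
    other_groups = defaultdict(set)
    for rec in base:
        root = find(("b", base[rec]))
        base_groups[root].add(base[rec])
        other_groups[root].add(other[rec])

    case_numbers = {}
    result = {}
    for rec in sorted(base):
        root = find(("b", base[rec]))
        case = case_numbers.setdefault(root, len(case_numbers) + 1)
        n_base, n_other = len(base_groups[root]), len(other_groups[root])
        if n_base == 1 and n_other == 1:
            label = "same"
        elif n_base == 1:
            label = "split"    # one base group, several comparison groups
        elif n_other == 1:
            label = "merge"    # several base groups, one comparison group
        else:
            label = "network"  # several groups on both sides
        result[rec] = (case, label)
    return result


# Hypothetical sample: record IDs 1-10 with group IDs from two runs.
base = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "D",
        7: "E", 8: "E", 9: "F", 10: "F"}
other = {1: "x", 2: "x", 3: "y", 4: "z", 5: "w", 6: "w",
         7: "p", 8: "q", 9: "q", 10: "r"}

cases = classify_cases(base, other)
```

In this sample, records 1 and 2 form a "same" case, 3 and 4 a split, 5 and 6 a merge, and 7 through 10 a network. Swapping the two arguments turns the split into a merge and vice versa, which is why the choice of base file matters.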
This may sound simple but it’s hard to do efficiently with conventional database technologies. DataDelta makes the comparisons using a proprietary database engine that can process up to ten million records per minute. Summary reports show the numbers of cases, records and groups for each type of change. The reports also show the average and maximum number of records and groups for each type of case. Detailed reports show distributions of these statistics for groups and cases of different sizes.
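The summary statistics are easy to picture with a small Python sketch. The record layout, the sample values and the function name here are hypothetical, and the sketch counts groups using one file's group IDs only, which may differ from how DataDelta's reports count them.

```python
from collections import defaultdict

# Hypothetical coded records: (record ID, group ID, case number, class).
coded = [
    (1, "x", 1, "same"), (2, "x", 1, "same"),
    (3, "y", 2, "split"), (4, "z", 2, "split"),
    (5, "w", 3, "merge"), (6, "w", 3, "merge"),
    (7, "p", 4, "network"), (8, "q", 4, "network"),
    (9, "q", 4, "network"), (10, "r", 4, "network"),
    (11, "s", 5, "split"), (12, "t", 5, "split"), (13, "u", 5, "split"),
]


def summarize(coded):
    """Return per-class counts plus average and maximum case sizes."""
    records_per_case = defaultdict(int)
    groups_per_case = defaultdict(set)
    class_of_case = {}
    for _rec, group, case, cls in coded:
        records_per_case[case] += 1
        groups_per_case[case].add(group)
        class_of_case[case] = cls

    # Gather (record count, group count) for every case of each class.
    sizes_by_class = defaultdict(list)
    for case, cls in class_of_case.items():
        sizes_by_class[cls].append(
            (records_per_case[case], len(groups_per_case[case]))
        )

    summary = {}
    for cls, sizes in sizes_by_class.items():
        recs = [r for r, _ in sizes]
        grps = [g for _, g in sizes]
        summary[cls] = {
            "cases": len(sizes),
            "records": sum(recs),
            "groups": sum(grps),
            "avg_records": sum(recs) / len(recs),
            "max_records": max(recs),
            "avg_groups": sum(grps) / len(grps),
            "max_groups": max(grps),
        }
    return summary


summary = summarize(coded)
```

With this sample, the split class covers two cases, five records and five groups, with an average of 2.5 records per case and a maximum of three.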
These statistical reports can provide some useful insights. But the real value of DataDelta lies in the coded records themselves. These can be exported in files accessible by any standard data analysis software. Typically they would then be linked through the record ID to additional details such as names, addresses and customer segments. This allows users to see the actual data involved in each match so they can decide whether they agree with the match engine results. Similarly, the results can be linked to reason codes provided by the match engines, letting users determine which rules are producing good or bad results. Analyzing results by segment would let users focus on matches affecting particular groups, such as high-value customers or customers in different geographic regions. Such evaluations must be done with external tools; DataDelta itself does not include analysis software.
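As a rough picture of that linking step, the Python sketch below joins coded records to customer details by record ID and pulls out one class of change for review. The field names, the sample customers and the function name are invented for illustration; a real export would be read from files, and its layout is not described here.

```python
# Hypothetical coded output: record ID -> (case number, change class).
coded = {
    101: (1, "same"),
    102: (1, "same"),
    103: (2, "split"),
    104: (2, "split"),
}

# Hypothetical customer details keyed by the same record IDs.
details = {
    101: {"name": "Ann Lee", "segment": "standard"},
    102: {"name": "Ann Lee", "segment": "standard"},
    103: {"name": "J. Smith", "segment": "high value"},
    104: {"name": "John Smith", "segment": "standard"},
}


def review_rows(coded, details, change_class):
    """List (case, record ID, name, segment) for one class of change."""
    rows = []
    for rec_id, (case, cls) in sorted(coded.items()):
        if cls != change_class:
            continue
        info = details.get(rec_id, {})  # tolerate records with no details
        rows.append((case, rec_id, info.get("name"), info.get("segment")))
    return rows


split_rows = review_rows(coded, details, "split")
```

Here the review list shows that "J. Smith" and "John Smith" were grouped together by one run and separated by the other, which is the kind of case a user would want to inspect by eye.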
Some match engine vendors provide utilities to perform similar analyses. But if one of these is not available, it takes considerable effort to use standard reporting tools to highlight the types of changes identified by DataDelta. Because testing and tuning require many repetitions, the time savings from using DataDelta are substantial. More important, the time savings allow more thorough testing, which can generate more accurate matching results. These can yield cost savings and operational improvements worth much more than the time savings themselves.
DataDelta is sold as licensed software with pricing based on data volume and other factors. Licenses typically cost from $25,000 to $75,000. The system runs on Windows and Unix servers. DataDelta is being launched in August and has been tested in several pilot projects. The vendor also offers a $6,000 “Single Customer View Accuracy Analysis Service”, which provides DataDelta reports and does not require purchase of the software.
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.