David M. Raab
DM News
April, 2000
As enterprise-wide customer management systems become more common, corporate IT groups find themselves responsible for name and address matching. Scalability requirements lead corporate IT to favor mainframes or large Unix systems from vendors like Group 1 Software, Firstlogic, Harte-Hanks Trillium and Innovative Systems Inc. Although these products generally have Windows NT versions for smaller installations, the pricing and technical complexity of the NT versions still place them beyond the reach of a small, unsophisticated buyer.
Lower-cost alternatives exist from firms including PeopleSmith, Mailer’s Software and QMSoft (now part of Sagent). But even these require a fair amount of skill to set up properly.
dfPower (DataFlux Corporation, 919-846-9000, www.dataflux.com) is designed to let non-technical users set up their own data cleaning processes. The system includes two modules: a standardization tool that can transform words by looking them up in a table, and a matching tool that identifies similar name and address records. Both are powerful and remarkably easy to use.
The standardization module lets users clean up common variations in data, such as misspellings, abbreviations, and alternate versions of company names (such as “IBM” for “International Business Machines”), by creating a list of such terms and their standard equivalents.
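To make the idea of a translation table concrete, here is a minimal sketch in Python. It is not DataFlux code: the table contents and the function name `standardize_field` are assumptions invented for illustration. It treats the field as a whole, ignores case and extra spaces, and leaves unknown values alone.

```python
import re

# Hypothetical translation table: variant (lowercased) -> standard form.
COMPANY_TABLE = {
    "ibm": "International Business Machines",
    "i.b.m.": "International Business Machines",
    "intl business machines": "International Business Machines",
}


def standardize_field(value, table):
    """Return the standard form of a whole field value.

    Values that are not in the table come back trimmed but otherwise unchanged.
    """
    cleaned = value.strip()
    # Collapse runs of whitespace and ignore case when looking up the variant.
    key = re.sub(r"\s+", " ", cleaned).lower()
    return table.get(key, cleaned)


print(standardize_field("  IBM ", COMPANY_TABLE))
print(standardize_field("Acme Corp", COMPANY_TABLE))
```

A table like this is only as good as its entries, which is why the tools for building it matter more than the lookup itself.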
Building this type of translation table is an important step in cleaning a customer database, and it can be a great deal of work. dfPower makes it as painless as possible. An analysis function generates a list of all words in a specified field in a database, counts the number of times each occurs, and then displays them ranked by frequency or alphabetically. Users can build the table with no typing: they select one word on the list and then highlight all the other words that will convert to it. The system also has a “smart clustering” ability that will automatically group different forms of the same word, such as “first” and “1st”, and will ignore an initial “the”. Users can choose whether to work with individual words or the field as a whole, and whether to ignore case differences. A “permutation drill down” shows the individual records in which a given word appears, so users can determine whether they should all be treated the same. For example, a drill down on “St” would reveal that it is sometimes an abbreviation for “street” and sometimes for “saint”.
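The analysis and drill-down steps can be approximated in a few lines of Python. This is a rough sketch under assumed names (`word_frequencies`, `ranked`, `drill_down`) and sample data, not a description of how dfPower works internally; it does not attempt the “smart clustering” feature.

```python
from collections import Counter


def word_frequencies(values, ignore_case=True):
    """Count how often each word occurs across all values of one field."""
    counts = Counter()
    for value in values:
        for word in value.split():
            counts[word.lower() if ignore_case else word] += 1
    return counts


def ranked(counts, by="frequency"):
    """List (word, count) pairs alphabetically or by descending frequency."""
    if by == "alphabetical":
        return sorted(counts.items())
    # Ties in frequency are broken alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def drill_down(values, word):
    """Return the records in which a given word appears, ignoring case."""
    target = word.lower()
    return [v for v in values if target in (w.lower() for w in v.split())]


# Made-up sample data for the address field.
addresses = ["12 Main St", "40 main street", "St Paul Ave", "7 Main St."]

print(ranked(word_frequencies(addresses))[:2])
print(drill_down(addresses, "St"))
```

On this sample, the drill down on “St” returns one record where it stands for “street” and one where it stands for “saint”, which is exactly the ambiguity a user would need to see before adding the word to a table.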
Translation tables are built separately for each field, and then combined into sets that can clean different fields in a single pass through the file. These tables can be saved and reused. The current version of the system comes with prebuilt tables for common words in address, company, name, state and city fields. A future version of the system will build new tables automatically by looking at similar elements in records the system has matched. This would be most useful for things like company names and product codes.
dfPower does not standardize addresses by matching against postal tables. DataFlux plans to release a separate product for this in April.
The matching module brings together similar records. Users specify which fields to compare and what level of matching, expressed as a percentage from 0 to 100, is required for each field. The actual matching algorithms are hidden from the user, but incorporate a variety of methods including phonetics, transpositions, string comparisons, and name derivatives (such as “Bob” and “Robert”). Different algorithms are applied depending on the type of data in a field, and more algorithms are applied as the user allows a “looser” match. The results are summarized in a matchkey; records with the exact same matchkey are matches.
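Since the real algorithms are hidden, the following Python sketch only illustrates the general matchkey idea: reduce each field to a simplified key, join the keys, and treat records with identical results as matches. The nickname table and the crude vowel-dropping rule are assumptions chosen for brevity. They are far simpler than what the product is described as doing, and the sketch does not model the adjustable match percentage.

```python
import re
from collections import defaultdict

# Hypothetical table of name derivatives.
NICKNAMES = {"bob": "robert", "bill": "william"}


def field_key(value):
    """Reduce one field to a crude key that tolerates small variations."""
    words = [NICKNAMES.get(w, w) for w in re.findall(r"[a-z0-9]+", value.lower())]
    text = "".join(words)
    if not text:
        return ""
    # Keep the first character, drop later vowels, collapse repeated characters.
    key = text[0] + re.sub(r"[aeiou]", "", text[1:])
    return re.sub(r"(.)\1+", r"\1", key).upper()


def matchkey(record, fields):
    """Join the per-field keys into a single matchkey string."""
    return "|".join(field_key(record[f]) for f in fields)


def group_by_matchkey(records, fields):
    """Group records so that each group shares the exact same matchkey."""
    groups = defaultdict(list)
    for record in records:
        groups[matchkey(record, fields)].append(record)
    return dict(groups)


# Made-up sample records.
records = [
    {"first": "Bob", "last": "Smith"},
    {"first": "Robert", "last": "Smithe"},
    {"first": "Alice", "last": "Jones"},
]

for key, members in group_by_matchkey(records, ["first", "last"]).items():
    print(key, [m["first"] for m in members])
```

Because matching reduces to comparing strings for equality, the keys can be stored alongside the data and reused, which is part of what makes the approach easy to understand.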
This approach is a clever compromise between simple matchcodes that just compare a specified number of characters, and complex schemes that require the user to pick algorithms and assign weights to each field. The dfPower method is considerably more accurate than a simple matchcode, but easier to set up and understand than the systems where the user must control each detail. One problem is that the match must be acceptable for all specified fields: even if three of four fields match perfectly, a significant difference in the remaining field would yield a non-match. The vendor recommends that users work around this limit by creating separate matchcodes on individual fields and then writing a program to compare the different combinations, a workable solution, but probably beyond the capabilities of an unskilled user. Another problem would arise if different types of data are stored in the same field, since the appropriate algorithms wouldn’t be applied.
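The kind of program the vendor has in mind might look something like the Python sketch below. It assumes each record already carries a separate matchcode per field (the column names and values here are invented) and accepts a pair as a match when at least a chosen number of those codes agree. Comparing every pair is fine for a small file but would need some form of blocking on a large one.

```python
from itertools import combinations


def agreeing_fields(a, b, fields):
    """Return the fields whose matchcodes are non-empty and identical."""
    return [f for f in fields if a[f] and a[f] == b[f]]


def loose_matches(records, fields, min_agree):
    """Return index pairs of records that agree on at least min_agree fields."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if len(agreeing_fields(a, b, fields)) >= min_agree:
            pairs.append((i, j))
    return pairs


# Made-up records holding one precomputed matchcode per field.
FIELDS = ["name_key", "street_key", "city_key", "zip_key"]
records = [
    {"name_key": "RBRT SMTH", "street_key": "12 MN", "city_key": "BSTN", "zip_key": "02110"},
    {"name_key": "RBRT SMTH", "street_key": "12 MN", "city_key": "BSTN", "zip_key": "02199"},
    {"name_key": "ALC JNS", "street_key": "40 ELM", "city_key": "BSTN", "zip_key": "02134"},
]

print(loose_matches(records, FIELDS, min_agree=3))
print(loose_matches(records, FIELDS, min_agree=4))
```

With a threshold of three, the first two records match despite their differing ZIP codes; with all four fields required, nothing matches, which mirrors the limitation described above.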
Once the match process is run, users can generate a report with summary match statistics, append the matchkeys to the original database, or create a file that has duplicate records flagged or eliminated. The system will choose the first record in each set as the survivor; other than overriding this manually, users cannot ensure it saves the “best” record. Matching processes can be run on demand or as deferred batch jobs.
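The difference between keeping the first record and keeping the “best” one is easy to show once matchkeys have been appended to the data. In this Python sketch, the first function mirrors the first-record rule described above, while the second applies one possible definition of “best” (the record with the most filled-in fields). That definition, the function names and the sample data are assumptions for illustration, not product features.

```python
def dedupe_keep_first(records, key_field="matchkey"):
    """Keep the first record seen for each matchkey."""
    seen = set()
    survivors = []
    for record in records:
        key = record[key_field]
        if key not in seen:
            seen.add(key)
            survivors.append(record)
    return survivors


def dedupe_keep_fullest(records, key_field="matchkey"):
    """Keep, for each matchkey, the record with the most non-empty fields."""
    best = {}
    for record in records:
        key = record[key_field]
        filled = sum(1 for value in record.values() if value)
        # Strict comparison means the earlier record wins a tie.
        if key not in best or filled > best[key][0]:
            best[key] = (filled, record)
    return [record for _, record in best.values()]


# Made-up records with matchkeys already appended.
records = [
    {"matchkey": "A", "name": "Bob Smith", "phone": ""},
    {"matchkey": "A", "name": "Robert Smith", "phone": "555-0100"},
    {"matchkey": "B", "name": "Alice Jones", "phone": ""},
]

print([r["name"] for r in dedupe_keep_first(records)])
print([r["name"] for r in dedupe_keep_fullest(records)])
```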
Although the functions within dfPower are largely hidden from the user, DataFlux also sells a system development kit that lets programmers embed the functions in other systems. These functions are written in ANSI C and include parsing, data type identification, and gender coding. They can be applied to real-time matching processes such as customer lookup or Internet processing. dfPower and the underlying functions can connect directly to relational database tables through ODBC, making integration much easier than if the data had to be placed in a flat file.
dfPower is priced at $15,000 for both modules plus 20% annual maintenance. It runs on Windows 95 and Windows NT. About 400 copies have been sold since it was introduced in 1998. The system development kit costs $30,000 to $75,000 depending on the functions and application. It runs on Windows and Unix environments. About 100 licenses have been sold. Later this year, DataFlux plans an “application service provider” offering, where users will send their data via the Internet to be processed in real time on DataFlux’s own servers. Charges will be based on the number of names processed.
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.