2001 Feb 01
DataMentors DataFuse
David M. Raab
DM News
February, 2001

People who maintain marketing databases are often held personally responsible for name and address matching errors, which almost always seem to involve an important customer, a friend of the boss, or both. But while the nuances of data matching get painstaking attention once a system is installed, few marketers explore these issues beforehand. The common attitude is that matching systems all give roughly the same results, so detailed evaluation is not worth the effort.

It’s an understandable mistake. Today’s first-rank matching systems, from vendors including Trillium, Group 1, i.d.Centric and Innovative Systems, all take the same general approach: they use key word and pattern tables to split each name-address record into elements, reassemble those elements in a standard format, and link records that match on specified combinations of elements. But many other systems use matchcode or statistical techniques that are much less effective because they lack the knowledge built into massive word and pattern tables. And even among table-based systems, subtle differences yield slightly different results. Small differences matter when millions of names (or the boss’s cousin) are involved.

DataFuse (DataMentors, 813-960-7800, www.datamentors.com) is a new table-based matching system. While similar to its peers, DataFuse offers unusual precision: for example, it can treat the same word differently depending on whether it appears at the beginning, middle or end of a line. The system also provides great flexibility, letting users include any number of elements in a matching rule and control the sensitivity of the rules themselves.

The result, according to the vendor, is significantly greater accuracy than other products. In this context, a “significant” difference is rather small: DataMentors points proudly to a test where it changed six percent of the households identified by another product. The net improvement may have been less, since not every change was necessarily correct.

DataFuse works in a five-step process. The first step is to identify the type of data on each line of a name-address record. The system uses word tables to decide whether a line should be suppressed, contains a street name, or contains a city or state. The city-state table holds common misspellings, abbreviations and variants of geographic names, and replaces these with a standard version to improve later matching. This step also applies existing linkages, such as a customer ID, eliminates common false values such as a date of 11/11/11, and applies no-mail indicators based on missing address data or key words such as “deceased”.
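To make the first step concrete, here is a minimal Python sketch of table-driven line classification. It is not DataFuse code: the tiny word tables, the function names and the rule that a street suffix counts only at the end of a line are all assumptions chosen for illustration, and a production system would rely on far larger tables.

```python
import re

# Hypothetical miniature word tables (assumptions for illustration only).
SUPPRESS_WORDS = {"deceased"}
STREET_SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road"}
STATE_CODES = {"fl", "ny", "ca"}
CITY_VARIANTS = {"tamap": "Tampa", "st pete": "St. Petersburg"}
FALSE_DATES = {"11/11/11"}


def tokens(line):
    """Lowercase a line and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", line.lower())


def classify_line(line):
    """Return 'suppress', 'street', 'city_state' or 'unknown' for one line."""
    words = tokens(line)
    if any(w in SUPPRESS_WORDS for w in words):
        return "suppress"
    # Position matters: "st" signals a street only when it ends the line.
    if words and words[-1] in STREET_SUFFIXES:
        return "street"
    if any(w in STATE_CODES for w in words):
        return "city_state"
    return "unknown"


def standardize_city(city):
    """Replace a known misspelling or variant with its standard form."""
    return CITY_VARIANTS.get(" ".join(tokens(city)), city.strip())


def clean_date(value):
    """Drop placeholder dates that data-entry staff commonly key in."""
    return None if value in FALSE_DATES else value
```

With these tables, "123 Main St" is classified as a street line while "St Pete, FL 33701" is classified as a city-state line, because the same word "St" is read differently depending on where it falls.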

The second step splits the name line into elements such as first name and last name. It uses at least six separate tables. The system first standardizes or deletes common words and phrases. It then codes each word as commercial (Corporation, Marketing), a specific type such as title (Mr., Mrs.), or a generic type such as mixed alpha-numeric. It can also assign actions, such as ignoring the word and whatever word is next. The sequence of codes is then found in a table that defines how each word is treated. For example, “John and Jane Smith” might be coded “FRFA”; the FRFA table entry might treat the first and third words as first names, treat the fourth word as a last name, and create two separate name lines: “John Smith” and “Jane Smith”. Once name parts are identified, the system applies a gender table to the first names to assign gender codes.
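The pattern-table idea can be sketched in a few lines of Python. Everything here is hypothetical: the word codes (F for a known first name, R for a connector such as "and", T for a title, A for any other alphabetic word) are guesses at what the letters in "FRFA" might stand for, and the table layout is an assumption, not the vendor's format.

```python
# Hypothetical word and gender tables; real tables would be far larger.
WORD_CODES = {"john": "F", "jane": "F", "and": "R", "mr": "T", "mrs": "T"}
GENDER = {"john": "M", "jane": "F"}

# Pattern table: code sequence -> (first-name index, last-name index) pairs,
# one pair for each person the name line should produce.
PATTERNS = {
    "FRFA": [(0, 3), (2, 3)],  # "John and Jane Smith" -> two people
    "FA": [(0, 1)],            # "John Smith"
    "TFA": [(1, 2)],           # "Mr. John Smith"
}


def code_word(word):
    """Look up a word's code; unknown words get the generic code 'A'."""
    return WORD_CODES.get(word.lower().strip("."), "A")


def parse_name_line(line):
    """Return the code pattern and the people extracted from a name line."""
    words = line.split()
    pattern = "".join(code_word(w) for w in words)
    people = []
    for first_i, last_i in PATTERNS.get(pattern, []):
        first, last = words[first_i], words[last_i]
        people.append({
            "first": first,
            "last": last,
            "gender": GENDER.get(first.lower(), "U"),  # U = unknown
        })
    return pattern, people
```

Under these assumptions, `parse_name_line("John and Jane Smith")` yields the pattern "FRFA" and two people, John Smith and Jane Smith, each with a gender code. A pattern missing from the table yields no people, which is where a real system would fall back on a default or flag the record for review.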

The third step applies a similar process to address lines: it standardizes and codes each word on the line, looks up the code sequence in a table, and assigns the element types as the table specifies. DataFuse can then call third-party postal software to apply U.S. Zip+4 codes and CASS standardization. Tables can be modified to support international names and addresses, but all records for a given file use the same tables.

The fourth step is data matching. Here DataFuse offers almost total flexibility. There are more than twenty matching methods, such as phonetic comparisons and string comparisons; each returns a score indicating the similarity between two elements. Users can combine these methods into rules that specify which elements are compared, which method is applied to each element, and what score qualifies as an element match. Multiple rules can be applied to the same file, to let different combinations of element matches qualify as a record match. For example, a user might want two records to match if they have either the same house number and street, or the same PO box and Zip code.

Like most matching systems, DataFuse does not compare each record to all others. Instead, the user chooses a sort sequence and how many adjacent records will be compared. Users can sort one file several ways and apply different matching rules to each sort. Users can also apply different levels of matching, such as household vs. individual, in the same run.
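The combination of scored comparison methods, multi-element rules and a sorted comparison window can be illustrated with a short Python sketch. This is an assumption-laden stand-in rather than DataFuse's rule syntax: the rule names, the 0.85 threshold and the record fields are invented, and the standard library's `difflib.SequenceMatcher` substitutes for the vendor's string-comparison methods.

```python
from difflib import SequenceMatcher


def string_score(a, b):
    """Similarity between 0.0 and 1.0 based on shared character runs."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_score(a, b):
    """1.0 when the two values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical rules: each lists (element, method, minimum score).
# Every condition in a rule must pass; any one passing rule is a match.
RULES = {
    "street_address": [("house_number", exact_score, 1.0),
                       ("street", string_score, 0.85)],
    "po_box": [("po_box", exact_score, 1.0),
               ("zip", exact_score, 1.0)],
}


def matching_rule(a, b):
    """Return the name of the first rule that links two records, or None."""
    for name, conditions in RULES.items():
        if all(a.get(el) and b.get(el) and method(a[el], b[el]) >= minimum
               for el, method, minimum in conditions):
            return name
    return None


def find_matches(records, sort_key, window):
    """Sort the records, then compare each one to the next `window` records."""
    ordered = sorted(records, key=sort_key)
    matches = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + 1 + window]:
            rule = matching_rule(rec, other)
            if rule:
                matches.append((rec["id"], other["id"], rule))
    return matches
```

Recording the rule name with each matched pair mirrors the idea of tracing which rule caused a match. The window keeps the number of comparisons proportional to file size, at the cost of missing pairs that the chosen sort does not place near each other, which is why running several sorts over the same file is useful.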

The final step identifies the primary record in each match group, performs calculations such as profitability coding or decile assignments, and optionally applies geodemographic information using third-party software. DataFuse has a powerful scripting language for such calculations. This scripting language is also used for tasks such as saving previous household ID codes, to trace additions and deletions to a household over time.
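As a rough illustration of the kind of calculation this step performs, the Python sketch below picks a primary record for a match group and assigns deciles. It does not use DataFuse's scripting language, and both the selection rule (highest value of a chosen field) and the rank-based decile method are assumptions; the article does not say how the product does either.

```python
def choose_primary(group, key):
    """Pick the record with the highest key value (an assumed rule)."""
    return max(group, key=key)


def assign_deciles(values):
    """Return a decile from 1 (highest values) to 10 for each input value."""
    n = len(values)
    # Indexes of the values, ordered from highest value to lowest.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return deciles
```

For example, passing twenty customer profit figures to `assign_deciles` puts the two most profitable customers in decile 1 and the two least profitable in decile 10.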

Output of DataFuse is a flat file with the coded records. The system also stores which rule caused each match and provides summary reports to show the impact of each rule. Rules and other processes are defined by writing script files, although DataMentors plans to release a graphical user interface later this year.

DataFuse was introduced in early 2000. It currently has four installations, including two service vendors who use the system for multiple clients. The system runs on Windows NT and can process about 200,000 records per hour, depending on the hardware and complexity of the rules. This is roughly comparable with competing products. Pricing is based on the number of records processed and begins at $50,000.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
