David M. Raab
DM News
October, 2003
Most direct marketers probably assume that software to match names and addresses originated with merge/purge systems. But matching technology has a much longer history. For example, the original Soundex algorithm, designed to overcome spelling variations by building phonetic name indexes, was patented in 1918 and used extensively in setting up the original Social Security system files in the 1930s.
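For readers who have not seen Soundex in action, here is a minimal sketch of the commonly used American Soundex variant, which may differ in detail from the 1918 patent. The function name `soundex` is my own choice; nothing here is taken from any vendor's code.

```python
def soundex(name: str) -> str:
    """Return the American Soundex code: first letter plus three digits."""
    # Consonant groups that share a digit; vowels, H, W and Y get no digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeat after a vowel is coded
    return (result + "000")[:4]  # pad or truncate to four characters
```

Under this sketch, "Smith" and "Smyth" both code to S530, which is exactly the kind of spelling variation the index was meant to absorb.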
Government agencies have continued to develop matching systems independently of direct marketers. Apart from the prosaic reason that the two groups had little contact, there is also a subtle but significant difference in their requirements. Merge/purge systems were primarily developed to pool names for mailing lists. Since the price of an error was the cost of a duplicate mail piece, accuracy could be compromised to gain speed and efficiency. Government systems were typically used to search for individuals in a single existing file, whether of criminal suspects, immigration documents, or tax records. Accuracy and real-time response were much more important; handling big batch jobs with disparate sources was not.
Applications such as enterprise-wide customer relationship management actually require functions more similar to governmental search systems than to traditional merge/purge. So it’s not surprising to see an increasing number of commercial products with origins in government applications, as well as an increasing number of products with both types of users. Nor, given the higher priority placed on accuracy, is it surprising to see technical innovations that promise more reliable results.
ChoiceMaker 2.1 (ChoiceMaker Technologies, 646-336-4441, www.choicemaker.com) illustrates these trends nicely. The system was originally developed to help the New York City Department of Health find duplicates in its registry of children’s immunization records. It has since been sold to commercial as well as other government clients. And it employs technology that, according to the vendor, has proven more accurate than competitors in several head-to-head tests.
In fact, ChoiceMaker combines several innovative technologies. At the lowest level, the system is written in Java, which lets it run on nearly any hardware and connect to nearly any data source. Inputs are defined with a schema that not only identifies the available fields, but can also specify relationships across data tables, incorporate validity checks, parse entries into separate elements, and create derived values such as Soundex codes. Processing rules are written either in Java or in ChoiceMaker’s own ClueMaker language. ClueMaker extends Java with specialized matching functions such as field swaps (e.g. comparing first name in one record against last name in another record) and data stacking (allowing multiple values in a field, such as old and new address). ClueMaker statements are automatically converted into Java for execution.
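To make field swaps and data stacking concrete, here is a rough Python sketch of the two ideas. It is not ClueMaker syntax, which I have not reproduced; the record layout (`first`, `last`, `addresses`) and both function names are assumptions made purely for illustration.

```python
def names_swapped(a: dict, b: dict) -> bool:
    """Field swap: first name in one record equals last name in the other."""
    return (a["first"].lower() == b["last"].lower()
            and a["last"].lower() == b["first"].lower())


def any_address_matches(a: dict, b: dict) -> bool:
    """Data stacking: each record carries a list of addresses (old and new),
    and the records agree if any stacked value is shared."""
    stacked_a = {addr.lower() for addr in a["addresses"]}
    stacked_b = {addr.lower() for addr in b["addresses"]}
    return bool(stacked_a & stacked_b)
```

A record for "Lee Kim" with two stacked addresses would pass both checks against a "Kim Lee" record that holds only one of those addresses.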
ChoiceMaker uses this technology to read, parse and standardize input in a fairly conventional fashion. The processed data is then stored in a reference table. When a new record is presented for matching, the system selects records from this table for comparison. Like other systems, ChoiceMaker limits this selection to records that are similar enough to be potential matches. ChoiceMaker adjusts the selection based on the distinctiveness of the input: for an unusual name like Guardado, all records with the same name may be returned; for a common name like Nelson, the selection might be restricted to matches on name plus Zip code. The fields to use in these selections and the maximum number of names to return for each search are specified during system setup. The system automatically determines how many selection criteria are needed, using precalculated statistics on the frequency of different values within the reference table. A handful of other matching systems use similar techniques, but most matching software is much less advanced.
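The frequency-based selection described above might be sketched as follows. This is my own simplified reading, not the vendor's implementation: the functions `build_stats` and `select_candidates`, the field names, and the rule of adding fields in a fixed order until the count falls under the limit are all assumptions.

```python
from collections import Counter


def build_stats(reference, fields):
    """Precalculate value frequencies for each prefix of the selection fields."""
    stats = []
    for n in range(1, len(fields) + 1):
        keys = fields[:n]
        stats.append(Counter(tuple(r[k] for k in keys) for r in reference))
    return stats


def select_candidates(record, reference, fields, stats, max_candidates):
    """Add selection fields until the precalculated count is small enough."""
    for n in range(1, len(fields) + 1):
        keys = fields[:n]
        key = tuple(record[k] for k in keys)
        # Stop at the first prefix that is selective enough, or at the last field.
        if stats[n - 1][key] <= max_candidates or n == len(fields):
            return [r for r in reference
                    if tuple(r[k] for k in keys) == key]
    return []
```

With fields set to last name then Zip code, a rare surname is selected on name alone, while a common one falls through to name plus Zip code.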
Once the candidate records are returned, ChoiceMaker matches these against the input. This is the most unusual, and sophisticated, aspect of ChoiceMaker. The system first evaluates “clues” that indicate whether records match or differ: same first name, phonetically similar last name, different birth years, and so on. These clues are written in ClueMaker and can be quite complex: for example, checking whether a pair of records contains one address in the Midwest or Northeast and another in Florida or Arizona, to find people who head south for the winter. Clues may yield “match”, “differ” or no result if appropriate data is not available. Where some gradation is appropriate, such as degree of near match or match on common name vs. match on unusual name, separate clues are created for each level. This is part of the reason a typical installation uses about 200 clues against many fewer data elements.
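A clue with three possible outcomes can be pictured roughly as below. Again, this is plain Python rather than ClueMaker, and the record fields, the clue names, the short state lists, and the choice to treat the snowbird pattern as evidence for a match are all illustrative assumptions.

```python
MATCH, DIFFER = "match", "differ"

NORTH = {"NY", "NJ", "MA", "OH", "MI", "IL"}  # illustrative subset only
SOUTH = {"FL", "AZ"}


def same_first_name(a, b):
    if not a.get("first") or not b.get("first"):
        return None  # no result: data not available
    return MATCH if a["first"].lower() == b["first"].lower() else None


def different_birth_year(a, b):
    if a.get("birth_year") is None or b.get("birth_year") is None:
        return None
    return DIFFER if a["birth_year"] != b["birth_year"] else None


def snowbird_addresses(a, b):
    sa, sb = a.get("state"), b.get("state")
    if not sa or not sb:
        return None
    if (sa in NORTH and sb in SOUTH) or (sa in SOUTH and sb in NORTH):
        return MATCH  # assumed here to support a match despite differing addresses
    return None


CLUES = [same_first_name, different_birth_year, snowbird_addresses]


def evaluate_clues(a, b):
    """Run every clue on a record pair and collect the outcomes by clue name."""
    return {clue.__name__: clue(a, b) for clue in CLUES}
```

Each clue either fires in one direction or stays silent, which is why gradations need separate clues rather than a single score.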
The system must combine the individual clue results to reach a final decision. ChoiceMaker does this by assigning statistical weights to the clues and comparing the combined weights of the “match” clues vs. “differ” clues. Record pairs with a clear result are classified automatically; others can be flagged for manual review.
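The weighing step could look something like this in outline. The simple additive score, the function name `classify`, the thresholds, and the example weights are all made up for illustration; the vendor's actual scoring is not documented here.

```python
def classify(outcomes, weights, match_threshold=2.0, differ_threshold=-2.0):
    """Combine clue outcomes into a decision: 'match', 'differ' or 'review'.

    outcomes maps clue name -> "match", "differ" or None (clue did not fire).
    weights maps clue name -> a positive weight (hypothetical values).
    """
    score = 0.0
    for clue, outcome in outcomes.items():
        if outcome == "match":
            score += weights[clue]
        elif outcome == "differ":
            score -= weights[clue]
    if score >= match_threshold:
        return "match", score
    if score <= differ_threshold:
        return "differ", score
    return "review", score  # unclear result: flag for manual review
```

Pairs whose score lands between the two thresholds are the ones a clerk would be asked to look at.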
The weights are determined using a machine learning technique called “maximum entropy modeling”. This involves submitting several thousand records with matches already marked; the system then automatically derives the set of weights that most closely predict the marked matches. Such automated training is highly unusual in the world of matching software: even the most advanced systems typically rely on users to manually refine match rules by looking at missed or false matches and making adjustments.
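For a two-way match/differ decision, a maximum entropy model is mathematically equivalent to logistic regression, so the training idea can be sketched in a few lines. This toy version is an assumption-laden illustration, not ChoiceMaker's trainer: the function names, learning rate, epoch count, and the encoding of each marked pair as a list of 0/1 clue indicators are all mine.

```python
import math


def train_weights(examples, n_clues, epochs=500, lr=0.5):
    """Fit clue weights from marked pairs by stochastic gradient ascent.

    examples: list of (features, label); features is a list of 0/1 flags
    saying which clues fired, label is 1 for a marked match and 0 otherwise.
    """
    weights = [0.0] * n_clues
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = bias + sum(w * f for w, f in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted match probability
            err = label - p
            bias += lr * err
            for i, f in enumerate(features):
                weights[i] += lr * err * f
    return weights, bias


def match_probability(features, weights, bias):
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

Given marked pairs, a clue that fires mostly on true matches ends up with a positive weight and one that fires mostly on non-matches ends up negative, with no hand tuning of rules.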
Of course, ChoiceMaker still requires significant human effort: to define input data, specify parsing and standardization rules, build new clues, create test cases, and review results. The vendor says it takes about two weeks of labor to set up a sophisticated matching process. Whether this is more or less than other systems would depend on the circumstances: for unusual matching problems, ChoiceMaker would probably have an advantage. The system includes several tools to help with development, but considerable expertise is still required.
ChoiceMaker was originally developed in 1998 and has several current installations. Pricing depends on the application and can range from $7,500 for a development license to hundreds of thousands of dollars for a large implementation.
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.