DataLever Corporation DataLever
David M. Raab
DM News
February, 2003
Name matching software has graduated from avoiding duplicate catalogs to helping the government watch for terrorists. In this environment, any error can be devastating: a missed match can lead to a fatal attack, while a false alarm can disrupt the life of an innocent person. Unfortunately, both errors will occur: no software is infallible. The good news is that the software used for surveillance is considerably more sophisticated than the merge/purge systems familiar to most direct marketers.
Many of these products have their roots in work previously done for law enforcement or intelligence agencies. Others were developed for commercial applications such as consolidating customer records. But whatever their origins, all matching systems must perform two key tasks: selecting records to compare, and determining which records match. Here are two products that take significantly different approaches.
NameSearch (Intelligent Search Technology, 800-287-0987, www.intelligentsearch.com) focuses on the selection problem. The core of the system is an ability to generate sort keys that bring together the records most likely to match.
The first step is to clean the input records by removing extraneous words and characters, standardizing multi-word phrases, and replacing nicknames and diminutives with standard forms. This sort of processing is performed by nearly every matching system. Like the others, NameSearch relies on tables and rules that specify how to handle particular words and phrases. Recognizing that different rules apply in different cultures, NameSearch has separate sets for Anglo, European and Middle Eastern names. A graphical interface lets users modify the rules as desired.
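To make the cleaning step concrete, here is a minimal Python sketch of table-driven name cleaning. It is not NameSearch code: the tables `NOISE_WORDS`, `PHRASES` and `NICKNAMES` and the function `clean_name` are hypothetical, and a real rule set would be far larger and culture-specific.

```python
import re

# Hypothetical rule tables; a real system ships much larger, culture-specific sets.
NOISE_WORDS = {"MR", "MRS", "MS", "DR", "JR", "SR"}
PHRASES = {"VAN DER": "VANDER", "DE LA": "DELA"}
NICKNAMES = {"BOB": "ROBERT", "BILL": "WILLIAM", "LIZ": "ELIZABETH"}

def clean_name(raw):
    """Return the standardized name elements of a raw name string."""
    # Uppercase, then turn anything that is not A-Z or a space into a space.
    # (This simplification also discards accented letters.)
    text = re.sub(r"[^A-Z ]", " ", raw.upper())
    text = " ".join(text.split())  # collapse runs of spaces
    # Standardize multi-word phrases into single elements.
    for phrase, replacement in PHRASES.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b", replacement, text)
    # Drop extraneous words, then replace nicknames with standard forms.
    words = [w for w in text.split() if w not in NOISE_WORDS]
    return [NICKNAMES.get(w, w) for w in words]

print(clean_name("Dr. Bob van der Berg, Jr."))
```

Keeping the rules in plain tables, rather than in code, is what makes it practical for users to adjust them.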
The second step in the key building process is to replace the name with a phonetic equivalent. This is a particular strength of NameSearch, which uses phoneticization techniques designed specifically to be superior to the common Soundex and NYSIIS algorithms. Phoneticization is applied extensively to the least common names, while common names are lightly phoneticized. This lets the system find as many variations of the uncommon names as possible, while still limiting the number of candidates returned for matches against common names. Common names are listed in frequency tables provided by the vendor. Users can modify these tables or run a utility that calculates frequencies within a particular set of input. Users can also choose among three different phoneticization routines, which provide varying degrees of precision.
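The idea of frequency-dependent phoneticization can be sketched as follows. NameSearch's own routines are proprietary and are not shown here; this example substitutes standard American Soundex for the "extensive" treatment and leaves common names untouched as a stand-in for "light" treatment. The `COMMON_NAMES` table and the function `phonetic_element` are assumptions made for illustration.

```python
# Standard American Soundex digit assignments.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]

# Hypothetical frequency table: names treated as "common".
COMMON_NAMES = {"SMITH", "JOHNSON", "WILLIAMS"}

def phonetic_element(name, common=COMMON_NAMES):
    """Heavy phoneticization for rare names, none for common ones."""
    upper = name.upper()
    if upper in common:
        return upper       # keep common names precise to limit candidates
    return soundex(upper)  # cast a wide net for uncommon names

print(phonetic_element("Smith"), phonetic_element("Tymczak"))
```

The trade-off is visible in the output: a common name stays exact, so it retrieves fewer candidates, while an uncommon name is reduced to a short code that many spelling variants share.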
The keys themselves are created by stringing together the standardized, phoneticized name elements. NameSearch typically generates multiple keys by arranging the elements in different sequences. This lets it automatically find matches when elements appear in different orders on different records, such as Smith, John vs. John Smith.
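Generating keys in several element orders might look like the sketch below. The function `build_keys` and its `max_elements` cap are hypothetical; the vendor's actual key format is not described in enough detail to reproduce.

```python
from itertools import permutations

def build_keys(elements, max_elements=3):
    """String name elements together in every order, one key per ordering.

    max_elements caps the number of elements used, because the number of
    orderings grows factorially with the number of elements.
    """
    chosen = elements[:max_elements]
    return sorted({" ".join(p) for p in permutations(chosen)})

# "Smith, John" and "John Smith" produce the same set of keys.
print(build_keys(["SMITH", "JOHN"]))
print(build_keys(["JOHN", "SMITH"]))
```

Because both records yield the same set of keys, a lookup on any one of them finds the other regardless of the order in which the name was entered.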
Once NameSearch keys have been generated for a set of records, match candidates can be identified by specifying a range of key values. NameSearch can generate multiple ranges for a single input record, representing increasingly broad searches. But it’s up to the user to write the programs that actually find and extract the specified records.
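The extraction program left to the user could be as simple as a binary search over a sorted key index. The sketch below assumes an in-memory list of (key, record id) pairs and derives progressively broader ranges from key prefixes; the key values and the functions `prefix_ranges` and `records_in_range` are hypothetical and do not reflect how NameSearch itself computes its ranges.

```python
from bisect import bisect_left, bisect_right

def prefix_ranges(key, prefix_lengths):
    """Build (low, high) key ranges from shorter and shorter prefixes."""
    # "\uffff" sorts after any ordinary character, so it closes the range.
    return [(key[:n], key[:n] + "\uffff") for n in prefix_lengths]

def records_in_range(index, low, high):
    """Return record ids whose key falls within [low, high].

    index must be a list of (key, record_id) tuples sorted by key.
    """
    keys = [k for k, _ in index]
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return [record_id for _, record_id in index[start:end]]

# Hypothetical keys, sorted so that similar names sit next to each other.
index = sorted([("S530 J500", 1), ("S530 J520", 2),
                ("S532 A400", 3), ("T522 M600", 4)])

for low, high in prefix_ranges("S530 J500", [9, 4, 3]):
    print(repr(low), records_in_range(index, low, high))
```

Each successive range returns a superset of the one before it, so a caller can stop broadening as soon as enough candidates have been found.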
NameSearch does provide a half-dozen comparison routines that return a score to indicate the likelihood that two records match. These also rely on rules and phonetic comparisons and give some control to the user. Again, the user must build a supporting system to make use of the match scores once they are generated.
DataLever (DataLever Corporation, 303-546-7943, www.datalever.com) is a very different product. It provides a complete data manipulation environment, with tools to extract data, analyze file contents, make changes, parse, standardize, index, geocode and generate reports, in addition to finding matches. Users combine these tasks into projects using a graphical flow chart. A new server module will let users schedule projects for automated execution and provide a central repository to share project components.
Within the matching process itself, DataLever focuses primarily on sophisticated comparisons. Selection of names is quite simple: the user specifies a sort sequence and the number of adjacent records to test. In contrast, the comparison process involves detailed evaluation of individual fields, which itself requires that the data has been accurately standardized and parsed. DataLever includes sophisticated tools for the standardization, parsing and comparison functions.
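This style of selection, sorting the file and comparing each record only with its near neighbors, can be sketched in a few lines of Python. The function `candidate_pairs` and its parameters are hypothetical names, not DataLever's.

```python
def candidate_pairs(records, sort_key, window):
    """Yield each record paired with the next `window` records in sort order."""
    ordered = sorted(records, key=sort_key)
    for i, record in enumerate(ordered):
        for neighbor in ordered[i + 1 : i + 1 + window]:
            yield record, neighbor

names = ["SMITH JOHN", "SMYTH JON", "ADAMS AMY", "SMITH JOAN"]
for pair in candidate_pairs(names, sort_key=lambda n: n, window=1):
    print(pair)
```

A larger window tests more pairs and so misses fewer matches, at the cost of more comparisons. Records that sort far apart are never compared at all, which is why the burden falls on the standardization and comparison steps.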
Let’s start at the beginning. Matching in DataLever is treated like any other project, by building a process flow using standard system tools. Some of these tools, including the standardizer, parser and matcher, are themselves constructed with standard DataLever functions, meaning that users can examine and modify them if desired. The standardizer handles both name and address data, including postal standardization for the U.S. and Canada. The parser converts text to a sequence of word types, and then uses a pattern table to identify specific data elements. For example, it might read “J and M James” as the word type sequence “single letter, conjunction, single letter, name”. It would then find this sequence in the pattern table, which might interpret it as “first initial, conjunction, spousal first initial, family name”.
The system comes with standard tables of patterns, name, company and address words. The system can identify records that do not match an existing pattern, so users can create a new pattern if appropriate.
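The word-type and pattern-table approach can be illustrated with a small Python sketch. The word types, the two patterns and the function names below are hypothetical; DataLever's actual tables are far more extensive.

```python
CONJUNCTIONS = {"AND", "&", "OR"}

# Hypothetical pattern table: word-type sequence -> data element labels.
PATTERNS = {
    ("LETTER", "CONJ", "LETTER", "NAME"):
        ("first_initial", "conjunction", "spousal_first_initial", "family_name"),
    ("NAME", "NAME"):
        ("first_name", "family_name"),
}

def word_type(word):
    """Classify a single word as a conjunction, single letter, or name."""
    if word.upper() in CONJUNCTIONS:
        return "CONJ"
    if len(word) == 1 and word.isalpha():
        return "LETTER"
    return "NAME"

def parse(text):
    """Return (label, word) pairs, or None when no pattern fits."""
    words = text.replace(".", " ").split()
    types = tuple(word_type(w) for w in words)
    labels = PATTERNS.get(types)
    if labels is None:
        return None  # flag for review; a new pattern may be needed
    return list(zip(labels, words))

print(parse("J and M James"))
print(parse("J M James"))  # no matching pattern in this small table
```

Returning `None` for an unrecognized sequence mirrors the behavior described above: unmatched records are surfaced so that someone can decide whether a new pattern is warranted.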
Once the data is standardized and parsed, it is sorted and fed to the matching process. This process relies heavily on comparisons between individual data elements, which is why accurate parsing is so important. Users specify the elements to include, and for each element specify one of four comparison methods, a threshold score to qualify as a match, a weight assigned to the element score, and an error penalty if the threshold is not met. Element scores are combined into a total score, which determines whether the record pair is considered a match. Users can specify multiple match rules and different sort sequences. DataLever provides prebuilt templates for consumer, business, and business-contact matches.
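A rough Python sketch of this kind of element-level scoring follows. The arithmetic is an assumption: the exact way DataLever combines element scores is not specified here, so this version simply adds weight times score for elements that meet their threshold and subtracts the penalty for those that do not. Only two stand-in comparison methods are shown, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class ElementRule:
    field: str        # parsed element to compare
    method: str       # "exact" or "fuzzy" (stand-ins for the real methods)
    threshold: float  # minimum element score to count as a match
    weight: float     # contribution of a matching element
    penalty: float    # deduction when the threshold is not met

def element_score(a, b, method):
    """Score two element values between 0.0 and 1.0."""
    a, b = a.upper(), b.upper()
    if method == "exact":
        return 1.0 if a == b else 0.0
    return SequenceMatcher(None, a, b).ratio()

def total_score(rec_a, rec_b, rules):
    """Combine element scores into one record-pair score (assumed formula)."""
    total = 0.0
    for rule in rules:
        score = element_score(rec_a[rule.field], rec_b[rule.field], rule.method)
        if score >= rule.threshold:
            total += rule.weight * score
        else:
            total -= rule.penalty
    return total

# Hypothetical consumer match rule.
RULES = [
    ElementRule("family", "fuzzy", 0.8, 50, 40),
    ElementRule("first", "fuzzy", 0.8, 30, 10),
    ElementRule("zip", "exact", 1.0, 20, 25),
]

a = {"first": "John", "family": "Smith", "zip": "10001"}
b = {"first": "John", "family": "Smith", "zip": "94105"}
print(total_score(a, a, RULES), total_score(a, b, RULES))
```

A record pair would then be declared a match when its total reaches a cutoff chosen by the user, for example 70 with the weights above.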
When matching is complete, DataLever can again use its standard capabilities to combine overlapping match sets, consolidate data from matching records, output all pairs or only survivors, list marginal matches for manual review, generate reports, and perform other types of processing. In addition to running as independent processes, DataLever functions can be embedded in other software.
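Combining overlapping match sets is a matter of grouping records that are linked through any chain of matched pairs. One common way to do this, shown here as an illustration rather than as DataLever's method, is a union-find structure. The function name `merge_match_sets` is hypothetical.

```python
def merge_match_sets(pairs):
    """Group record ids connected by any chain of matched pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two sets

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), set()).add(record_id)
    return sorted(sorted(group) for group in groups.values())

# Records 1-2 and 2-3 matched separately, so 1, 2 and 3 form one set.
print(merge_match_sets([(1, 2), (2, 3), (4, 5)]))
```

Note that chaining can link records that were never directly compared, which is one reason marginal matches are worth listing for manual review.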
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.