by David M. Raab
DM News
June 2007
Most direct marketers think of data matching in terms of merge/purge: a way to identify and remove duplicate names across multiple lists. But merge/purge is rarely a concern in the larger world of data processing. There, matching is a component of customer data integration (identifying data in different systems that belong to the same customer) and master data management (consolidating data relating to all kinds of entities). Matching is also part of search applications that help users find people, products, documents, locations and other entities even when they don’t have complete or fully accurate information.
These are complex applications with many moving parts: multi-table data structures, relationship hierarchies, data acquisition, indexing, ranking and display. But matching remains a critical core function.
The specific purpose of matching is to find records that refer to the same entity, even though the records themselves are different. In a strict sense, matching involves direct comparisons of data strings. But in the real world, this is often supplemented by external reference data such as a list of all known products or all the names used by a business. This external data often allows connections that could never be inferred from strings alone, such as the fact that the John Jones who used to live in Chicago is the same John Jones who now lives in San Diego. For names and addresses, external knowledge allows parsing of data into elements such as first name, last name, and street number, so the same elements can be compared across different records. This external knowledge includes information about specific words (“David” is likely to be a first name, “Nebraska” usually is a state, “Bob” is a nickname for “Robert”) and information about common formats (“the final line in an address is likely to be in the order of city, state, and postal code, unless the first word is ‘attention’”). As that last example suggests, external knowledge implies rules as well as simple lists, and can get very complex.
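To make the role of external knowledge concrete, here is a minimal Python sketch of nickname standardization and last-line address parsing. The lookup tables, function names and the simple "City, State Zip" format rule are all hypothetical illustrations, not taken from any product; a production system would rely on far larger reference lists and many more format rules.

```python
# Hypothetical reference data; real systems use much larger lists.
NICKNAMES = {"bob": "robert", "dave": "david", "bill": "william"}
STATES = {"nebraska": "NE", "illinois": "IL", "california": "CA"}


def standardize_first_name(name):
    """Map a nickname to its formal form; leave unknown names unchanged."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def parse_last_line(line):
    """Split a 'City, State Zip' line into elements (assumed format)."""
    city, _, rest = line.partition(",")
    parts = rest.split()
    # Treat the final token as a postal code only if it starts with a digit.
    postal = parts[-1] if parts and parts[-1][0].isdigit() else ""
    state_words = parts[:-1] if postal else parts
    state = " ".join(state_words).lower()
    return {
        "city": city.strip().title(),
        # Spelled-out states are converted via the list; others are upper-cased.
        "state": STATES.get(state, state.upper()),
        "postal_code": postal,
    }
```

Once both records have been reduced to the same elements, "Bob" and "Robert", or "Nebraska" and "NE", can be compared as equals rather than as different strings.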
In practice, parsing and standardization based on external knowledge are critical to successful name and address matching. But even the most sophisticated knowledge-based processing cannot remove all errors in a set of data. In fact, standardization and parsing can introduce errors of their own. To make matters worse, external knowledge may not be available once you move beyond well-understood structures like mailing addresses. So, in the end, there is always a need to compare two strings and decide whether they are similar enough to call them a match.
What differentiates matching engines is how they make this comparison. Simple matching systems often create a “match key” by extracting a few significant characters (say, first name initial, first three consonants in the last name, house number, city and state) and allowing a match if these are the same. Other systems use phonetic standardization such as Soundex to compensate for spelling errors. Some allow a match if strings have no more than a specified number or percentage of differences among the characters. Still others apply statistical techniques that take into account not only the similarity of the strings, but how common they are: so a common name like David Jones might not be considered a likely match for David James, while an unusual name like Zydrunas Ilgauskas might match with Sid Iglakis. Often the systems assign separate match scores for different elements and then use weights or rules to assign a match score for the record as a whole.
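The first three approaches can be sketched in a few lines of Python. This is an illustration only: the match key layout follows the example in the text, the Soundex function follows the standard American Soundex rules, and the character-difference measure is assumed here to be Levenshtein edit distance. The function names are made up for this sketch.

```python
def match_key(first, last, house_number, city, state):
    """Build a key from initial, first three last-name consonants, and address."""
    consonants = [c for c in last.upper() if c.isalpha() and c not in "AEIOU"]
    return "|".join([first[:1].upper(), "".join(consonants[:3]),
                     str(house_number), city.upper(), state.upper()])


# Standard Soundex digit groups.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

Note that "Jones" and "James" receive the same Soundex code and sit only two edits apart, which is exactly why purely string-based methods can over-match common names and why frequency-aware statistical scoring is sometimes layered on top.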
Netrics Matching Engine (Netrics, 609-683-4002, www.netrics.com) applies a mathematical technique called “bipartite graph matching” to measure the similarity of strings. The general idea is to mimic human decisions by finding similar sequences of letters, even if they occur at different locations within two strings. This can compensate for data entry errors and deal with information that has not been parsed into separate fields. It also means the method can be applied to problems other than name and address matching. Netrics says its approach is more accurate than simpler methods such as matchkeys and Soundex, and more efficient than character-difference comparisons.
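The article does not describe how Netrics implements its technique, so the following Python sketch should be read only as a generic illustration of the bipartite-matching idea, not as the vendor's algorithm. It assumes the units being matched are whitespace-separated tokens, scores each candidate pairing with the standard library's difflib.SequenceMatcher, and finds the best one-to-one assignment by brute force, which is practical only for short strings.

```python
from difflib import SequenceMatcher
from itertools import permutations


def token_similarity(a, b):
    """Similarity of two tokens between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def bipartite_similarity(text_a, text_b):
    """Score two unparsed strings by the best one-to-one pairing of tokens.

    Tokens of one string form the left side of a bipartite graph, tokens of
    the other the right side; edge weights are token similarities. The
    maximum-weight matching is found by trying every assignment.
    """
    left = text_a.lower().split()
    right = text_b.lower().split()
    if not left or not right:
        return 0.0
    if len(left) > len(right):
        left, right = right, left  # keep the shorter side on the left
    best = 0.0
    for perm in permutations(right, len(left)):
        total = sum(token_similarity(a, b) for a, b in zip(left, perm))
        best = max(best, total)
    # Dividing by the longer side penalizes tokens left unmatched.
    return best / len(right)
```

Because the pairing ignores position, "John Jones" and "Jones John" score as identical even though no field parsing has taken place. A production engine would need a far more efficient assignment algorithm than exhaustive search.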
Like other matching engines, the Netrics engine returns a score that shows the similarity of the strings it compares. The system can also highlight matching blocks of text, making it easier for people to review why the system found a similarity.
Netrics also provides a Decision Engine that can use similarity scores to decide whether a pair of records is considered a match. The Decision Engine starts with examples of known matches and non-matches. With name and address records, these would typically be parsed into separate elements, although they could also be unparsed text blocks. The sample records are run through the Matching Engine and then the Decision Engine, which infers the decision rules (basically, weights and cut-off ranges for element similarity scores) that distinguish matches from non-matches. The system automatically adjusts its rules until its own decisions are acceptably consistent with the “correct” answers provided as part of the input. Users can provide additional examples of particular types of matches if the system performs poorly at identifying them. A couple thousand sample pairs are typically required for training. The Netrics approach is considerably easier than having users specify the rules explicitly.
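The article does not say which learning method the Decision Engine uses. As a stand-in, the Python sketch below uses a simple perceptron to infer element weights and a cut-off (the bias term) from labelled pairs. The sample scores, the two-element layout (name similarity, address similarity) and the function names are hypothetical, and a real training set would contain thousands of pairs rather than six.

```python
# Each example: ((name_similarity, address_similarity), is_match).
# Hypothetical scores, as if produced by a matching engine.
EXAMPLES = [
    ((0.95, 0.90), True),
    ((0.90, 0.85), True),
    ((0.85, 0.95), True),
    ((0.90, 0.20), False),
    ((0.30, 0.90), False),
    ((0.20, 0.10), False),
]


def is_match(scores, weights, bias):
    """Apply the learned rule: weighted sum of element scores above the cut-off."""
    return sum(w * s for w, s in zip(weights, scores)) + bias > 0


def train(examples, epochs=1000, rate=0.1):
    """Adjust weights until decisions agree with the labelled answers."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for scores, label in examples:
            if is_match(scores, weights, bias) != label:
                # Nudge the rule toward the correct answer for this pair.
                step = rate if label else -rate
                weights = [w + step * s for w, s in zip(weights, scores)]
                bias += step
                errors += 1
        if errors == 0:  # every training pair is now classified correctly
            break
    return weights, bias


weights, bias = train(EXAMPLES)
```

Adding further examples of a poorly handled match type to the training list and retraining mirrors the corrective step described above.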
Netrics is used both to search for individual records in a reference file and for batch deduplication such as merge/purge. It loads the data into system memory, which allows quick performance. The system has been tested on databases with hundreds of millions of records, returning as many as 25 matches per second. The product was released in 2000 and has more than 100 installations, mostly in healthcare and government agencies. About half the installations involve name and address matching, while the balance involve other types of data. The software is usually purchased through business partners, such as applications providers and systems integrators, who incorporate it into products they deliver to their clients. Pricing is based on the number of processors in the host computer, starting at $50,000 for a two-processor server.