David M. Raab
DM Review
June-October, 2001
Underlying every grand plan for customer relationship management is a centralized customer database–one that consolidates information about each customer from sources throughout the company. This complete picture of each customer is the foundation for understanding what actions will get the greatest value from the relationship.
No one with meaningful CRM experience would minimize the effort needed to build such a database. In fact, most practitioners quickly acknowledge that it is by far the greatest technical challenge involved in a new CRM project. (Most, but not all: vendors of integrated front office systems often assume their standard operational database will be the central customer repository. But this breezy confidence typically crumbles when they are told they must integrate data from back office operations and from whatever touchpoints–there are always a few–run outside of the main front office system.) Still, once the difficulty of building the database is solemnly noted, attention usually swings to more exciting tasks like choosing vendors and fighting political battles. The nuts and bolts of customer data consolidation are set aside as problems to resolve during implementation. The unspoken assumption is that the available tools all give roughly equivalent performance, so there is no point in assessing them in detail.
In an IT industry where people have strong opinions on all conceivable, and a few incomprehensible, technical issues, this relaxed indifference is a remarkable anomaly. Though it’s tempting to see it as evidence of hitherto-unsuspected reserves of human rationality, it more likely reflects ignorance of the issue at hand. Inept marketing by vendors who have failed to effectively distinguish their products could play a role as well.
In fact, there are significant differences among customer data consolidation tools and techniques. These relate to the very specific task of matching customer names and addresses–an esoteric process that is unfamiliar to most corporate IT groups, even if they have experience with the types of consolidation required for non-customer data. The difference is that most non-customer consolidation revolves around exact matches, conversion rules, or translation tables. These processes are by no means trivial: researching, building and maintaining them is a major task in a company with many complex systems. But they do ultimately produce a set of rules that determine unambiguously whether or not two records match (at least in most cases).
Name and address matches are inherently less certain. There is often no way to know when looking at two customer records whether or not they refer to the same person. The names are spelled similarly–is it a real difference or a data entry error? City names differ within the same postal code–is the code wrong or is one city name a colloquial variation or vanity address? Women with two different last names share the same address: are they separate people or one woman with her married and unmarried names? The same name appears at two different addresses: is it two people, one person who moved, or one person with two addresses? Two dissimilar records have the same phone number: are they the same person, or has one moved and the number been reassigned? Even a unique identifier like Social Security Number can be misreported, miskeyed or just plain missing. As privacy concerns and regulations accumulate, identifiers like telephone and Social Security number will be less available, so using them as a matching shortcut will be even less useful. And as life becomes generally more complicated, people have more non-matching attributes: how many phone numbers do you have? How many e-mail addresses? Do you receive mail at a Post Office box for privacy or business reasons?
And let’s not even get started on households or business matching.
In short, there is no way to create a straightforward mechanical process for name and address matching. But systems to provide approximate matches do exist. In fact, there are three levels of such systems, each building on the foundation of its predecessors.
The most basic matching systems were built to identify (merge) and remove (purge) duplicate names on mailing files. Major vendors are Group 1 Software and FirstLogic i.d.Centric (Postalsoft); other competitors include Sagent and SAS’s DataFlux subsidiary. Low-end, PC-based alternatives are available from Mailer’s Software and Peoplesmith.
What these systems essentially do is compare one record with another. But it would be horribly ineffective to simply treat each record as one large string and compare the strings to each other: there are so many variations in how addresses may be formatted that legitimate matches would be rejected because the strings didn’t line up. So merge/purge systems first split the input records into standard fields, such as first name, last name, street, city, and state. There are usually about a dozen such categories, including things like title (Mr., Mrs.), generation (Jr., Sr., III), street type (St., Ave., Blvd.), directional (North, South) and apartment number.
Often the input record is already split into such fields–hopefully, the data was captured that way in the first place, which is by far the most effective approach. If not, the merge/purge system will parse the record into components, looking for key words, standard formats (e.g. a 5 digit string is probably a Zip code), positions within the record (usually the name line comes first, then the street line, then the city/state/Zip) and positions within each line (usually the first name comes before the last name). Parsing is not perfect, particularly when records in the same file have been entered in different formats (e.g., a mix of last-name-first and first-name-first) or when business addresses are involved. But most systems can get most records parsed correctly.
The second preparation step is standardization, which mostly involves looking up words and their equivalents in huge translation tables. Some standardization changes nicknames and variations to a standard name: so Elizabeth, Liz, Beth and Betty are all changed to Elizabeth. Titles might also be standardized to change Mister to Mr. In addition, postal standardization is applied to ensure street and city names are spelled consistently and, when possible, to ensure that the postal code matches the rest of the address. This requires more than simple table translations; in fact, postal standardization involves complicated parsing, string matching and validation processes, which are generally embedded in systems outside the merge/purge product. Postal standardization is sometimes run before the merge/purge process begins: the output would be a file in which the postal elements were parsed and standardized, although the merge/purge system would still parse and standardize the name line.
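To make the mechanics concrete, here is a minimal sketch in Python of the parse-and-standardize steps just described. The tables and field names are invented for illustration; commercial products rely on far larger, professionally maintained keyword and postal tables and far more elaborate parsing logic.

    # Toy illustration of parsing a name line and standardizing its elements.
    # The tables below are tiny stand-ins for the huge translation tables
    # real merge/purge systems use.
    NICKNAMES = {"liz": "elizabeth", "beth": "elizabeth", "betty": "elizabeth",
                 "bill": "william", "bob": "robert"}
    TITLES = {"mister": "mr", "mr": "mr", "mrs": "mrs", "ms": "ms", "dr": "dr"}

    def parse_name_line(name_line):
        """Naive parse: first word may be a title, last word is the last name."""
        words = name_line.lower().replace(".", "").split()
        record = {"title": "", "first": "", "last": ""}
        if words and words[0] in TITLES:
            record["title"] = TITLES[words[0]]
            words = words[1:]
        if words:
            record["first"] = words[0]
            record["last"] = words[-1] if len(words) > 1 else ""
        return record

    def standardize(record):
        """Replace nicknames and variations with a standard form."""
        record["first"] = NICKNAMES.get(record["first"], record["first"])
        return record

    print(standardize(parse_name_line("Mister Bill Smith")))
    # {'title': 'mr', 'first': 'william', 'last': 'smith'}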
Once the data is parsed and standardized, the merge/purge system sorts it to bring together records that are likely to be matches. This avoids having to compare every record to every other record, which would be cost-prohibitive. Most systems generate a sort key based on components of selected elements–say the first three characters of the Zip code, first three letters of the street name and house number. Some systems generate multiple keys and run through the file several times in the different sequences; this avoids missing matches because of a flaw in one element of the key, such as a bad Zip code.
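The multiple-key idea might be sketched roughly as follows; the key recipes are invented for this example, not taken from any particular product.

    # Build two alternative sort keys per record, so a flaw in one element
    # (say, a bad Zip code) does not keep likely matches from sorting together.
    def sort_keys(rec):
        zip3 = rec.get("zip", "")[:3]
        street3 = rec.get("street", "").lower()[:3]
        house = rec.get("house", "")
        last3 = rec.get("last", "").lower()[:3]
        return [zip3 + street3 + house,      # pass 1: geography-oriented key
                last3 + street3 + house]     # pass 2: name-oriented key

    records = [
        {"last": "Smith", "house": "12", "street": "Maple", "zip": "19103"},
        {"last": "Smyth", "house": "12", "street": "Maple", "zip": "19013"},  # bad Zip
    ]
    # The bad Zip pushes the records apart on the first key, but the
    # name-oriented second key still sorts them next to each other.
    for r in records:
        print(r["last"], sort_keys(r))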
In older merge/purge systems, the matches themselves were often identified by a match key that was essentially an extended version of the sort key: for example, it might have the first three characters of the Zip code, first three letters of the street name, house number, and first, third and fourth letters of the last name. Records with the same key were assumed to be actual matches. While very efficient from a processing viewpoint, this method is not very accurate–it rejects records because of minor differences and accepts records that are obviously different but happen to share identical keys. Changing the composition of the key usually means trading false matches against missed matches without reaching a satisfactory level of both. This method is no longer used by major merge/purge products, although it still sometimes appears in less sophisticated systems.
Today’s standard approach is to move through the sorted file comparing groups of records. One method is to compare all records within a certain distance of each other (e.g., up to ten records away). Another is to compare all records within a “break group”, which is a set of records sharing key elements, such as the same Zip code and street name. The break group method is more flexible, since it will look at all similar records even when there are many sharing a particular set of values. But some systems limit the size of the break group itself, in which case even adjacent records may not be compared if they fall on different sides of an arbitrary intra-group split.
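A minimal sketch of the break-group approach: records sharing the key elements (assumed here to be Zip code and street name) are grouped, and only the pairs inside each group are compared.

    from itertools import combinations
    from collections import defaultdict

    # Group records by a break key and compare only pairs within each group.
    def break_key(rec):
        return (rec["zip"], rec["street"].lower())

    def candidate_pairs(records):
        groups = defaultdict(list)
        for rec in records:
            groups[break_key(rec)].append(rec)
        for group in groups.values():
            for a, b in combinations(group, 2):
                yield a, b

    records = [
        {"id": 1, "last": "Smith", "street": "Maple", "zip": "19103"},
        {"id": 2, "last": "Smyth", "street": "Maple", "zip": "19103"},
        {"id": 3, "last": "Jones", "street": "Oak",   "zip": "19103"},
    ]
    for a, b in candidate_pairs(records):
        print(a["id"], "vs", b["id"])   # only records 1 and 2 are compared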
The comparisons themselves measure the degree of similarity between each pair of elements within the parsed records and then apply a business rule to determine whether this combination of similarities constitutes a match. Similarity is determined by comparing the strings of text within each element. Of course, exact equality is easy to find, so the challenge is to identify and rank near matches. Systems may calculate the number of characters that are the same, check for sequences that are the same, adjust for transpositions, or even allow for letters that are adjacent on a typewriter keyboard. Some systems look for phonetic equivalents. Some preprocess the string by removing vowels or double letters. Some treat numbers differently. Different matching methods are applied to different field types: it makes no sense to apply phonetic matching to a Zip code. Sometimes the user can define the algorithm that applies to each field; in other cases, the algorithm is predetermined. Generally the algorithms produce a score that indicates the closeness of the match between the two strings. Sometimes the system automatically combines the element scores into a record-level score, and the only job of the user is to decide what record-level score will count as a match. In other cases, users specify what scores count as element matches and how these are combined to qualify for a match. For example, a rule may be that the system must match on last name, street address, and at least five of eight other elements. Even systems that allow precise user control will provide default settings that the majority of users accept in practice.
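The scoring step might look roughly like the sketch below, which uses Python’s built-in SequenceMatcher as a stand-in for the proprietary string-comparison algorithms, plus an invented rule combining element weights and thresholds.

    from difflib import SequenceMatcher

    # Stand-in similarity measure; commercial systems use their own algorithms
    # (phonetic codes, transposition handling, keyboard adjacency, etc.).
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Invented weights and thresholds, purely for illustration.
    WEIGHTS = {"last": 0.4, "first": 0.2, "street": 0.2, "house": 0.1, "zip": 0.1}

    def is_match(rec_a, rec_b, threshold=0.85):
        scores = {f: similarity(rec_a[f], rec_b[f]) for f in WEIGHTS}
        record_score = sum(WEIGHTS[f] * scores[f] for f in WEIGHTS)
        # Example business rule: last names must be reasonably close AND the
        # overall weighted score must clear the user-chosen threshold.
        return scores["last"] >= 0.8 and record_score >= threshold

    # Nickname standardization has already mapped "Liz" to "Elizabeth" here.
    a = {"first": "Elizabeth", "last": "Smith", "street": "Maple", "house": "12", "zip": "19103"}
    b = {"first": "Elizabeth", "last": "Smyth", "street": "Maple", "house": "12", "zip": "19103"}
    print(is_match(a, b))   # True with these invented weights and thresholds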
The trial-and-error labor needed to fine-tune match settings is rarely worthwhile in the world of direct mail list processing, where many different lists are processed together and time is usually short. Companies building a customer database may be more willing to invest in improving long-term results by tailoring the system to the quirks of their own source files.
Once the matches are identified, a merge/purge system is designed to choose one record to keep and to discard the others. In mailing list preparation, the selection often depends on which source would charge least to use the record–so merge/purge systems generally let users specify list priorities, and sometimes have distribution functions to randomly allocate matches across sets of source files. These systems also have standard reports to show duplication across input systems, again to help with direct mail analysis and list rental payments. Matching for in-house customer databases has a different set of concerns. Users are more likely to require tools to consolidate data from the different sources and to pick the best information where conflicting data appears. Such functions are not necessarily available in merge/purge systems. But they are important capabilities of the more advanced customer matching systems that will be discussed next month.
* * *
The first article in this series described the most basic type of customer matching software, merge/purge systems. These parse incoming addresses into elements such as first name, last name, house number, street name, city, state and postal code. They then standardize these elements, correcting for variations such as misspellings, nicknames, and alternate place names. Finally, they compare the elements in pairs of records, calculate a similarity score, and flag as matches any pair scoring above a user-specified level.
Merge/purge systems are relatively fast, cheap, and easy to set up. But applying the same scoring formula to all records inherently fails to take into account significant differences in particular situations. For example, matching an uncommon last name should count for more than matching a common one. The second class of customer matching software is able to take such differences into account.
These systems work by looking for patterns in the input records and applying different rules to different patterns. Patterns are applied at two levels: to identify data elements and to determine treatment of record pairs. Pattern-based element identification is particularly good at working with complex name lines, such as “John Smith and Jane Doe”, “Jane Doe Smith” and “Mr. and Mrs. John Smith”. A simple parsing routine would look at the first and last word on the line, and come up with first and last names of “John Doe”, “Jane Smith” and either “John Smith” or “Mr. Smith”. That is, it would conclude each name is significantly different, and miss the presence of two individuals altogether.
A pattern-based parser would recognize common first names, last names, titles and conjunctions, look at the patterns these are forming, and apply rules to identify the elements correctly. Such a parser would also adjust for generational indicators such as Jr., Sr. and III, industry terms identifying relationships such as “ITF” for “In Trust For”, and business aliases such as “John Smith dba Smith Supplies”.
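A greatly simplified sketch of the idea: each word on the name line is classified against keyword tables, and the resulting pattern of word types determines how names are extracted. The tables and the handful of patterns handled here are invented; real systems recognize hundreds of patterns.

    # Classify each word, then apply a rule keyed to the resulting pattern.
    FIRST_NAMES = {"john", "jane", "mary"}
    TITLES = {"mr", "mrs", "ms"}
    CONJUNCTIONS = {"and", "&"}

    def classify(word):
        w = word.lower().rstrip(".")
        if w in TITLES: return "T"
        if w in CONJUNCTIONS: return "C"
        if w in FIRST_NAMES: return "F"
        return "L"            # assume anything unrecognized is a last name

    def parse(name_line):
        words = name_line.split()
        pattern = "".join(classify(w) for w in words)
        if pattern == "FLCFL":              # "John Smith and Jane Doe"
            return [(words[0], words[1]), (words[3], words[4])]
        if pattern == "TCTFL":              # "Mr. and Mrs. John Smith"
            return [(words[3], words[4]), ("", words[4])]
        if pattern == "FLL":                # "Jane Doe Smith"
            return [(words[0], words[2])]   # middle word flagged elsewhere as possible last name
        return [(words[0], words[-1])]      # fallback: naive first/last parse

    print(parse("John Smith and Jane Doe"))   # [('John', 'Smith'), ('Jane', 'Doe')]
    print(parse("Mr. and Mrs. John Smith"))   # [('John', 'Smith'), ('', 'Smith')]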
As with most standardization and parsing processes, this approach relies heavily on key word tables that identify how different words are commonly used in different contexts. The scope and variety of these tables are critical to the accuracy of the parsing process. Most pattern-based systems let users modify these tables to reflect conditions in their particular files, such as specialized industry terms, company-standard abbreviations or local geography. The pattern tables themselves can also be modified to accommodate known input peculiarities, such as a practice of flagging the last name with a special character (“Henry @James”). In effect, key word and pattern tables provide the knowledge that a human reviewer would intuitively bring to bear. Since they have greater memory capacity and behave consistently regardless of personality or fatigue, the tables are in some ways superior to human reviewers, particularly on routine processes. (But where accuracy is critical, most firms still rely on manual review and research to resolve ambiguous cases.)
Pattern-based matching rules rely on elements identified at the parsing step, and apply different rules to different element patterns. These patterns may look at the sequence of element types: for example, a pattern that identified a female first name followed by two possible last names (“Jane Doe Smith”) might trigger a rule to treat the middle name as a potential last name for matching purposes. Or rules might take into account which elements are present–for example, giving higher weight to a matching first name if there is also a matching middle initial.
Different systems take different approaches to how rules and patterns are defined. Some are highly structured, offering fixed elements, match types (e.g. perfect, close or none), and outcome classes (e.g. accept, reject, or ambiguous); in this case, the user must only determine how to classify each of the large but finite number of possible combinations of element match types. Other systems let users write rules in a scripting language that defines what to look for and how to react; this gives almost total flexibility. Whatever process a vendor applies, nearly all systems provide a default set of patterns and rules to help get started. Because users can identify exactly which rule was applied to accept or reject a given match, it is relatively easy to modify the default rules by reviewing outcomes and making adjustments over time.
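In the structured style, the user’s job reduces to classifying combinations of element match types, roughly as in this invented decision table (the elements, grades and outcomes are assumptions for the sketch):

    # Each element comparison is graded perfect/close/none, and the user
    # classifies each combination as accept, reject or ambiguous.
    OUTCOMES = {
        ("perfect", "perfect", "perfect"): "accept",
        ("perfect", "perfect", "close"):   "accept",
        ("perfect", "close",   "close"):   "ambiguous",   # route to extra review
        ("close",   "close",   "close"):   "reject",
    }

    def classify_pair(last_grade, street_grade, first_grade):
        return OUTCOMES.get((last_grade, street_grade, first_grade), "reject")

    print(classify_pair("perfect", "close", "close"))   # ambiguous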
The rule-based approach also lets users apply additional processing only to ambiguous matches–thus allowing a more detailed review of the available data when needed, without performing unnecessary processing on simple cases. One application of such processing is to resolve cases of “chaining”–where record A matches record B and record B matches record C, but records A and C do not match each other. Users may define rules to determine when to accept such matches and when to reject them. This sort of incremental processing combined with the greater inherent accuracy of pattern-based matching lets pattern-based systems find 90% to 95% of possible matches, compared with rates of 50% to 70% for merge/purge systems. Of course, your mileage may vary.
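One common way to think about chaining is as a grouping problem: the sketch below collapses pairwise matches into connected components, which amounts to accepting chained links. This is only one possible policy; a stricter rule might require every pair within a group to match directly, and the choice is exactly the kind of decision left to the user.

    # Accept transitive links by computing connected components over the
    # pairwise matches (one possible chaining policy, not the only one).
    def connected_components(ids, matched_pairs):
        parent = {i: i for i in ids}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        def union(a, b):
            parent[find(a)] = find(b)
        for a, b in matched_pairs:
            union(a, b)
        groups = {}
        for i in ids:
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    # A matches B and B matches C, but A and C were not matched directly.
    print(connected_components(["A", "B", "C"], [("A", "B"), ("B", "C")]))
    # [['A', 'B', 'C']]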
On the other hand, merge/purge systems run faster: multiple millions of records per hour, compared with one million or fewer per hour for pattern-based matching. These figures are crude guidelines, since speed varies greatly for all types of matching software depending on the hardware and algorithms involved.
Pattern-based and merge/purge systems also differ in ways other than matching techniques themselves. Because the pattern-based systems were designed primarily to match customer records, they maintain persistent customer identifiers from one update to the next. This is unnecessary in a merge/purge system, which is built largely to remove duplicates from a group of lists that are rented for one-time use. Maintaining a persistent customer ID is relatively straightforward, since it largely involves appending the ID to the input records in each matching session and carrying it through to the output. But it does involve some nuances, such as ensuring that the same ID is applied if a customer vanishes for a few cycles and then reappears, or if the customer moves and a record later shows up at the old address. When IDs are applied to households as well as individuals, things get more complicated still–now you need rules to handle household mergers such as weddings, and household splits such as divorces or children leaving for college. In fact, household definition is often a very contentious part of the database development process, since different users have different definitions that make sense for their own purposes. Multiple household definitions, each with its own set of IDs, are quite common in large consumer marketing databases.
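In outline, persisting an ID across update cycles might look like the sketch below: incoming records that match an existing customer reuse that customer’s ID, while new customers receive a fresh one. The match_key function is a placeholder for the full matching process, used only to keep the sketch short.

    import itertools

    # match_key() stands in for the real matching logic (an assumption).
    def match_key(rec):
        return (rec["last"].lower(), rec["zip"])

    id_counter = itertools.count(1)
    known_ids = {}          # match key -> persistent customer ID

    def assign_id(rec):
        key = match_key(rec)
        if key not in known_ids:
            known_ids[key] = next(id_counter)
        rec["customer_id"] = known_ids[key]
        return rec

    for cycle in (
        [{"last": "Smith", "zip": "19103"}],
        [],                                          # customer absent this cycle
        [{"last": "Smith", "zip": "19103"}],         # reappears with the same ID
    ):
        print([assign_id(r)["customer_id"] for r in cycle])   # [1], [], [1]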
The desire to build a permanent customer database also leads pattern-matching vendors to include extensive facilities for data consolidation. These range from simple functions to aggregate values such as purchases recorded in different billing systems, to complex rules to select the “best” version of an element such as a Social Security Number or primary address. Although this sort of consolidation does not rely directly on pattern-based matching, it may use the system’s assessment of the quality of different input records to help determine which record to treat as most reliable.
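A consolidation rule might be sketched along these lines: purchase values are aggregated across sources, and the “best” version of a conflicting element is chosen by an invented priority rule (most trusted source first, then most recent update).

    # Consolidate matched records: sum purchases across sources and pick the
    # "best" address by an assumed rule (source priority, then recency).
    SOURCE_PRIORITY = {"billing": 2, "web": 1}     # assumption for this sketch

    def consolidate(matched_records):
        total_purchases = sum(r["purchases"] for r in matched_records)
        best = max(matched_records,
                   key=lambda r: (SOURCE_PRIORITY.get(r["source"], 0), r["updated"]))
        return {"purchases": total_purchases, "address": best["address"]}

    records = [
        {"source": "web",     "updated": "2001-03-01", "address": "12 Maple St",     "purchases": 40},
        {"source": "billing", "updated": "2001-01-15", "address": "12 Maple Street", "purchases": 60},
    ]
    print(consolidate(records))
    # {'purchases': 100, 'address': '12 Maple Street'}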
Pattern-based systems are also much more likely than merge/purge software to provide an API for real-time processing of individual records. This is commonly used to integrate the matching process with operational systems such as order entry or customer service, to quickly identify individuals as existing or new customers. Most operational systems provide their own, simple matching routines, but it makes sense to leverage an advanced pattern matching system if the enterprise has already purchased one. This provides results that are both more accurate and more consistent than the operational system would provide by itself, as well as ensuring that searches are made against the entire customer universe rather than only the customer records residing in a particular operational silo.
Major vendors of pattern-based matching systems include Harte-Hanks Trillium Software, Innovative Systems Inc., Group 1, and Postalsoft i.d.Centric. The latter two vendors also sell merge/purge systems, but their pattern-matching software uses different technology. Vality also sells pattern-based matching software, but relies on users to build their own keyword and pattern tables–a major undertaking that the other vendors avoid by providing users with prebuilt tables and rules. A newcomer to the market is DataMentors, which draws on its founders’ experience building pattern-based matching systems at pioneering marketing database vendor OKRA Marketing.
* * *
The purpose of name and address matching software is to identify sets of records that refer to the same person. The simplest matching systems do this by directly comparing the records to each other. Certainly this is the most obvious approach. But as matching software evolved, developers found that external data can help the process considerably. Even basic merge/purge systems rely on tables of names, business terms, cities, and other information for parsing and standardization. Address standardization in particular relies not simply on tables of common terms and spellings, but on files that list all known valid addresses. In the U.S. and many other countries such files are prepared by the local postal service. Sometimes they must be gathered or updated through other means.
The main advantage of a fixed reference table is accuracy. It provides a way to determine whether two similar records really refer to the same entity: if the closest match for both is the same reference record, they can be assumed to be the same. Of course, there are limits to this approach, since the reference table itself may be missing a valid entry or the input record may be so badly mangled that no reasonably close match is found. So most systems allow input records without a near match on the reference table to retain a separate identity. Sometimes these records are added to the reference table itself with a special code to indicate their origin. That way if a similar record appears again, the system will at least recognize it as matching the previous record.
Reference tables can also yield significant processing economies, particularly if the same table is shared across multiple installations. It is obviously more efficient to build a comprehensive address table once and then share the copies, than for each firm to assemble an address table on its own. Similarly, it is more efficient for a service bureau to run the records of many clients against the same reference table than to load a separate reference table for each client. This is true even if the client-specific reference tables, which would presumably be limited to that client’s customers and prospects, were each smaller than the single common reference table. Running against a common reference table also lets the service bureau keep that table loaded constantly rather than loading and unloading the individual client tables on, say, a monthly basis. This means each client’s records can be processed more often–nightly or perhaps even in real time. In addition, the common reference table could itself be updated continuously with new and corrected data, so each client would get the benefit of the most current information.
But there is a fly in the reference table ointment. Processing records against an address reference table alone will not identify duplicates among individuals. This requires comparing names as well as addresses. If name-level matching is needed, then a name-level reference table is needed as well. Even merge/purge and pattern-based matching systems that use address reference tables must still load the client’s own customer and prospect tables for name matching. So the full advantages of reference-based matching are not available to these systems.
Over the past few years, a handful of vendors including Acxiom, Experian and Donnelley Marketing/InfoUSA have introduced name-level reference table matching. The challenge in developing these systems is to build the reference table itself: after all, this involves nothing less than a database with every individual in the country. No government agency provides such a file in the U.S. Thus each vendor needed to assemble its own database from a variety of sources. These include public records such as telephone directories, voter registrations and real estate listings, as well as private sources such as catalog merchants and financial institutions. While this is a costly and complicated process, it is certainly possible with today’s technology.
The basic process is that each vendor runs the records from its various sources through a conventional matching process. Records identified as belonging to a unique individual are assigned a fixed ID. The reference table thus consists of all significant variations among input records: where several versions exist for the same individual, there will be several reference table records with the same ID. When clients submit their own files, these are matched against the master table and the system returns the original record plus the matching standard ID. The reference table itself never leaves the custody of the vendor, and clients see only the information they provide plus the ID the vendor has assigned. This contrasts with address reference tables, which are frequently installed on in-house systems.
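Conceptually the lookup works like the sketch below: the vendor’s reference table holds every known variation of a name and address, each tagged with a persistent ID, and a client record matching any of those variations picks up that ID. The data and field layout are invented for illustration.

    # Toy reference table: several stored variations map to one persistent ID.
    REFERENCE = {
        ("elizabeth smith", "12 maple st", "19103"):     "ID-0001",
        ("liz smith",       "12 maple st", "19103"):     "ID-0001",
        ("e smith",         "12 maple street", "19103"): "ID-0001",
    }

    def lookup(name, street, zip_code):
        key = (name.lower(), street.lower(), zip_code)
        # Real systems fall back to near-match logic when no stored variation
        # hits exactly; here unmatched records simply come back without an ID.
        return REFERENCE.get(key)

    print(lookup("Liz Smith", "12 Maple St", "19103"))   # ID-0001
    print(lookup("Robert Jones", "5 Oak Ave", "19103"))  # None (no ID assigned)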
Because the master table may contain several records describing the same individual in different formats, an input record using any of these formats can be matched directly. This reduces the amount of processing-intensive “near match” logic, providing faster and more efficient performance. Even real-time processing of individual records is possible, although most reference-based matching still runs in batch.
The mix of inputs as well as processing techniques vary from vendor to vendor, so results from the different reference-based systems are not necessarily the same. But all vendors report significant improvements–often, nearly double the match rates of conventional merge/purge or pattern-based systems. On a reasonably well maintained file, this might translate into two to eight additional duplicates per hundred records input. A reference-based system also eliminates the much smaller number of false duplicates that occur when two records are similar enough to match but actually refer to different individuals.
Why do reference-based systems find so many additional duplicates? There is more involved than greater precision in matching. Specifically, the reference tables can include a history of the same individual at different addresses or under different names (e.g. before and after a marriage). These connections, derived from change of address transactions, legal records, financial institutions and similar sources, cannot possibly be made by comparing name and address records directly. While some false connections are inevitable, each vendor has tuned its rules to keep errors at what it considers an acceptable minimum. Users with different preferences cannot change these rules directly, although most vendors let clients apply their own splitting or combination rules after the standard processing. This contrasts with merge/purge and pattern-based matching systems, which let clients tighten or loosen matching rules to meet their individual purposes. The reference-based matching vendors argue this is unnecessary because their standard processes yield such accurate results. Clients can also propose corrections to the reference tables, although not all clients are willing to share such information and the vendors decide whether or not to accept a proposed change. When corrections are made, vendors can notify clients by publishing the list of affected IDs. Because the vendors keep track of which IDs have matched to each client’s input, they can send each client only the list of relevant IDs.
In addition to providing greater accuracy and operational efficiency, reference-based systems hugely simplify the sharing of data among different companies. The standard ID is the key. When two list owners wish to combine information on common customers, they need only compare their lists of IDs–an easier and more accurate process than conventional matching, and one that does not require sharing actual names and addresses. In practice, such comparisons would be done by the reference table vendor rather than the companies themselves, because license agreements forbid sharing the standard IDs with outside firms.
Standard IDs provide similar efficiencies for appending data from third-party sources to in-house lists–again, the third-party data list is coded with the standard IDs and these are matched against the IDs provided by the list owner. This sort of matching could be done on a periodic basis, or list owners could be notified when any interesting data appeared about one of their customers. This opens up some intriguing, if Orwellian, marketing possibilities.
In fact, the privacy implications of reference-based matching have received relatively little public discussion. The vendors argue these systems enhance privacy because they yield more accurate matches and, by linking all related records, make it easier to comply with opt-out requests. But widespread use of the same reference table also means that any errors in that table will be propagated widely rather than limited to a single company’s internal systems. Easier and cheaper cross-company matching also encourages firms to share data more widely, leading to more comprehensive customer profiles that could easily be misused by the inept or abused by the malevolent. Because the reference-based systems are technically designed for matching rather than data sharing, they do not appear to be governed by existing privacy regulations. They are affected indirectly, however, as reduced access to data such as credit records makes the tables themselves potentially less accurate. As such systems are more widely understood, they may eventually be subject to the same rules as other lists for individual disclosure, review and opt-out. But, at least in the U.S., it’s hard to imagine any regulations being passed that significantly diminish these systems’ effectiveness.
In sum, reference-based matching is often more accurate, more efficient and easier to deploy than merge/purge or pattern-based matching systems. On the other hand, prices are higher than for other technologies and some enterprises may balk at sending their customer list to an outside vendor. But where circumstances permit, reference-based matching is an option well worth exploring.
* * *
Let’s assume you’ve decided to invest some serious effort in choosing a customer matching system. How do you go about it?
You’ll start with technical specifications, like hardware, operating systems and integration methods. These may eliminate some contenders, but today most systems run on all the common platforms. You might try to narrow the field further by considering just one of the three classes of matching systems described in previous articles–string-based, pattern-based or reference-based. But while it’s generally true that reference-based systems are the most accurate and string-based systems the least, simply knowing this does not mean that one class of product is more appropriate than another. This is because the difference in performance depends on the applications and specific data involved. For example, a power utility’s list of current customers is likely to be quite accurate, while a list of inactive catalog buyers will contain many duplicate accounts and outdated addresses. If a file is highly accurate to begin with, moving to a more powerful system may not increase performance enough to justify the higher acquisition and operating costs. And even if you did limit yourself to a single class of systems, there are still significant differences among the products within each group.
In short, there is no way to make a really sound decision without testing each product against your own data. The process has three main steps:
– assemble test data. This is often the hardest part of the project because the data is not readily available and IT resources to assemble it are scarce. Ideally, the test data would include complete files from each system that will eventually provide inputs. This would test the matching system’s ability to handle data gathered through different processes and stored in different formats. It would also provide the highest possible number of duplicates to detect. In fact, the test data should really include several sets of input from each system, taken at different dates. This would ensure the data contains old and new versions of customers who have moved, changed their names, opened or closed accounts, and gone through other transformations the matching system may be intended to detect.
Alas, comprehensive data is rarely available. Even if it is, the volume is likely to be greater than the matching software vendors are willing to include in a test. So some form of sampling is usually necessary.
Constructing a sample for a matching test is unusually tricky. The statistician’s usual instinct is to take a random or Nth sample–but this is exactly the worst thing to do for matching tests. These methods tend to remove adjacent records, which are the most likely to be duplicates or members of the same household. A better approach is to select all names in a limited geographic area, such as a state or metropolitan region. A relatively large area will also catch many people who have moved, although those who entered or left the region will be missed. More than one geographic region should be chosen to get a mix of urban and rural areas and to include any regional differences. This is particularly important in companies where different areas are served by different operational systems–a common situation at firms that have grown by acquisition. For these companies, using multiple regions ensures that inputs from all those systems are represented.
If the volume remains too high even when the sample is limited to a handful of regions, it may be further reduced by selecting on last name–say, all names beginning from A through F. This will still include most duplicates, although it will likely lose women who have changed their name after marriage or divorce.
It is also worth inserting records known to contain special situations, such as tricky parsing problems, name changes, frequent movers, household splits, or multiple generations (i.e., Sr., Jr. and III). These can be fictional records to test string- and pattern-based matching, but should be real people when testing reference-based systems. To avoid having such records stand out during processing, they should be physically mixed in with the other data and in exactly the same format. This may require constructing plausible values for fields that are populated in other records in the same file, such as account ID or telephone number. The number of such fields should be limited, since data not used for matching should be removed from the test file to reduce security risks and processing costs. Any individual or household link that comes from a system that would be replaced by the new matching software should also be removed. Such links should not be discarded, however, since they can later be compared with links created by the new systems.
Each record should include a source system indicator and file date, since the matching system might need different rules for records from different sources or from the same source at different times. Every record should also be assigned a unique identifier to simplify later analysis of how the matching systems performed.
The final step in test file preparation is creation of record layouts and counts needed to help load the data into the matching system itself. Some users prepare two test files: one for initial system setup and tuning, and the other to generate test results. This is analogous to the standard approach of predictive modelers, who build a model on one data sample and then validate it against a separate data set. In both modeling and matching, the purpose is to ensure the system is not generating unrealistic results by tuning itself to anomalies in the test data. This is generally not an issue for matching systems, however, so split test files are rarely used.
– run the tests. In most cases, the tests will actually be run by the vendor. This is faster and easier than installing the software in-house. But you will still need to provide instructions regarding matching rules and household definition. You also want to get some idea of the effort involved in setting up the system. It may not be practical to watch the vendor’s staff set up your particular job, because the work is performed in small steps by different people over several days or weeks. But it should be possible to walk through the operation, seeing each task performed on whatever data happens to be active. This will give some idea of the system features and staff skills involved. It should also be possible to get statistics on the computer resources and staff time consumed in working on your job.
– compare the results. Each system will have its own standard reports. Data conversion, standardization and parsing will generate statistics on missing data elements, address corrections, postal coding, and similar items. Individual records are sometimes coded to show the exact changes that were applied. This makes it easy to find records that had specific types of changes, so you can verify their accuracy. The matching portion of the system will show the number of records input, number of unique individuals identified, and (usually) number of unique households. Most systems also classify the matches, either by certainty level or by the reason they were considered to match. The systems should also provide listings of records that were matched, again typically grouped by category. Visual inspection is very useful for string- and pattern-based matching, but less helpful when reference-based systems bring together records that are superficially unrelated.
While the most obvious statistic to compare across systems is the number of matches found, it is important to realize that matches may be incorrect–so a higher match rate is not necessarily a better result. In fact, there are three statistics to balance: correct matches, incorrect matches, and missed matches. Unfortunately, the “truth” is usually not known for all matches on a file, with the important exception of test cases inserted for this very reason. So the primary method of comparing systems is to look for situations where one system has identified a match and another system hasn’t, and to determine which system is correct. This misses situations where all systems have made the same error. But it does allow a meaningful comparison of the different systems to each other.
Identifying the disagreements among systems requires getting a file from each vendor with the original data plus whatever individual and household IDs have been assigned to link records that match. Since each record will also contain its original unique ID, the files can be joined to allow comparison. The comparison report takes a bit of work to create, although some matching vendors have written programs to do it automatically.
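The disagreement report might be built along these lines: each vendor’s output is keyed by the unique test-record ID assigned earlier, and record pairs that one system linked and another did not are flagged for manual review. The field names and vendor labels are assumptions for the sketch.

    from itertools import combinations

    # Each vendor's output maps the test record's unique ID to the individual
    # ID that vendor assigned. Flag record pairs the vendors disagree about.
    vendor_a = {"r1": "A-1", "r2": "A-1", "r3": "A-2"}   # vendor A links r1 and r2
    vendor_b = {"r1": "B-7", "r2": "B-8", "r3": "B-9"}   # vendor B links nothing

    def disagreements(out_a, out_b):
        diffs = []
        for x, y in combinations(sorted(out_a), 2):
            linked_a = out_a[x] == out_a[y]
            linked_b = out_b[x] == out_b[y]
            if linked_a != linked_b:
                diffs.append((x, y, "A only" if linked_a else "B only"))
        return diffs

    print(disagreements(vendor_a, vendor_b))
    # [('r1', 'r2', 'A only')]  -> review this pair manually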
Except when the correct answer is known because of test cases or pre-researched linkages, judging which system is correct about any given match is a challenge of its own. Users mostly rely on a visual comparison, particularly where string- and pattern-based systems are involved. In some situations, users actively research the questionable matches via telephone calls or other validation methods.
Once the relative accuracy of the different systems has been established, there is still a business analysis to be done. This weighs the costs of the different systems against the values of found, missed and false matches. These values depend on the business situation–a false match has little cost when sending a clothing catalog, but could cause a lawsuit where financial accounts are concerned. Such priorities should be discussed with vendors in advance, since most systems can be tuned to adjust the balance between false hits and misses.
While accuracy and business value will be the primary factors in selecting a matching system, they are not the only ones. Some buyers reject reference-based systems because they require off-site processing and on-going service relationships. Some focus on processing speed, or computer resource consumption, or the staff effort required. Some care deeply about the quality of reports, options to review and override questionable matches, or control over matching rules and reference tables. Some need to handle international data or perform complex transformations. Nearly every decision is affected by salesmanship, customer service and vendor background.
Systems differ significantly along all these dimensions. Unfortunately, too many buyers focus on these other issues and neglect to test the performance of the software itself. Given the major differences in accuracy among the different products, this can be a big mistake.
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.