1997 Jun 01
Qualitative Marketing Software Inc. Centrus
DupeKiller, Inc. DupeKiller VPP
David M. Raab
DM News
June, 1997

Data warehouses may be the latest trend in computer systems, but they are governed by the oldest rule: garbage in, garbage out. Many companies have unhappily discovered that the latest technology and most brilliant design are worthless if their warehouse is filled with unreliable data. The result has been a new surge of interest in data quality issues.

Among the very thorniest of these issues is matching records belonging to the same customer. This involves several interrelated tasks: identifying name and address data elements that may be mixed together in one or more fields; checking for valid addresses against postal files; and finding matches at the individual, household or business levels.

Specialized software to perform such tasks has traditionally come from two sets of vendors. One group, including Group 1 Software, i.d.Centric (The Company Formerly Known As Postalsoft) and Pitney Bowes Software Systems, draws on technology developed for mail preparation and duplicate elimination. The second group, including Innovative Systems Inc., Harte-Hanks Data Technology, Harland Company and Group 1’s NADIS product, uses technology created to build consolidated customer information (CIF) files for banks and other organizations. Both sets of vendors identify data elements (first name, last name, street address, city, state, Zip Code) within a record through a combination of rules and tables showing how specific words are likely to be used. But vendors in the first set also use postal tables to create consistent addresses that match postal standards. Vendors in the second set use algorithms to find records that match on key elements, with or without postal standardization.
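
To make the "rules and tables" idea concrete, here is a minimal Python sketch of how a parser might label the words in an address line. The word tables, the label names and the function label_tokens are all invented for illustration; they are not taken from any vendor's product, and a commercial parser would use far larger tables and many more rules.

```python
import re

# Hypothetical word tables; real products ship far larger ones.
STREET_SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road", "blvd", "dr", "drive"}
STATES = {"FL", "NY", "CA", "TX"}


def label_tokens(line):
    """Label each word of an address line using simple rules plus lookup tables."""
    labels = []
    for token in line.replace(",", " ").split():
        bare = token.strip(".").lower()
        if token.isdigit() and not labels:
            # Rule: a leading number is treated as the house number.
            label = "house_number"
        elif re.fullmatch(r"\d{5}(-\d{4})?", token):
            # Rule: five digits, optionally followed by a +4 extension.
            label = "zip"
        elif token.upper() in STATES:
            # Table lookup: two-letter state abbreviations.
            label = "state"
        elif bare in STREET_SUFFIXES:
            # Table lookup: words that usually end a street name.
            label = "street_suffix"
        else:
            label = "word"
        labels.append((token, label))
    return labels


print(label_tokens("123 Main St., Tampa, FL 33601"))
```

Even this toy version shows why the job is hard: "Dr" could be a street suffix or a professional title, so a word's position and its neighbors matter as much as the tables do.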

Centrus (Qualitative Marketing Software Inc., 800-782-7988 or 813-725-9727, www.qmsoft.com) combines the postal-based approach with something different: street address files prepared for mapping systems. These files are based on the TIGER files prepared for the Census Bureau and enhanced by commercial organizations such as GDT, ETAK, and BLR.

These files offer the advantage of a second, independent source of data that sometimes is more current than postal records. QMSoft, in a massive job that requires about 1,000 hours of computer processing, merges the two files into a combined database. As a result, it is able both to find more matches against the postal file and to provide more accurate latitude/longitude coding (“geocoding”). In one test performed by the firm, it found Zip+4-level geocodes for 98% of a standard set of test records, compared with an 87% hit rate by another standard package. The system performs both geocoding and USPS CASS-certified address standardization in a single pass through the file, at rates of 800,000 to 1.2 million records per hour on a Windows NT server. In addition to postal and latitude/longitude codes, it applies Census tract and block group information.

Centrus is provided as modules which are accessed through a graphical user interface. The foundation module is Address Coding, which performs the postal and geographic coding and also splits addresses into their elements. The system’s Name Parsing Module isolates individual and business names and titles, using tables of first names, last names, gender, nicknames, professional titles and national (ethnic) origin. Name Matching will automatically create both informal (“Dear Bill”) and formal (“Dear Mr. Clinton”) salutations, derive formal first names from nicknames, and create salutations like “Dr. Welby” from “Marcus Welby, M.D.”
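
The salutation logic described above can be sketched in a few lines of Python. This is only an illustration of the general table-driven approach, not how Centrus itself works: the tables, the function names parse_name and salutations, and the assumption that input arrives as "First Last" or "First Last, Suffix" are all hypothetical.

```python
# Hypothetical lookup tables; a real product would ship much larger ones.
NICKNAMES = {"bill": "William", "bob": "Robert", "peggy": "Margaret"}
GENDER_TITLES = {"william": "Mr.", "robert": "Mr.", "marcus": "Mr.", "margaret": "Ms."}
PROFESSIONAL_SUFFIXES = {"m.d.": "Dr.", "ph.d.": "Dr.", "d.d.s.": "Dr."}


def parse_name(raw):
    """Split 'First Last' or 'First Last, Suffix' and derive a formal name and title."""
    name_part, _, suffix = raw.partition(",")
    tokens = name_part.split()
    first, last = tokens[0], tokens[-1]
    # Derive the formal first name from a nickname when the table knows it.
    formal_first = NICKNAMES.get(first.lower(), first)
    # A professional suffix takes priority over a gender-based title.
    title = PROFESSIONAL_SUFFIXES.get(suffix.strip().lower())
    if title is None:
        title = GENDER_TITLES.get(formal_first.lower(), "")
    return {"first": first, "formal_first": formal_first, "last": last, "title": title}


def salutations(raw):
    """Return an (informal, formal) pair of salutations for one name."""
    parts = parse_name(raw)
    informal = f"Dear {parts['first']}"
    if parts["title"]:
        formal = f"Dear {parts['title']} {parts['last']}"
    else:
        # No usable title: fall back to the full name.
        formal = f"Dear {parts['first']} {parts['last']}"
    return informal, formal


print(salutations("Bill Clinton"))
print(salutations("Marcus Welby, M.D."))
```

The sketch ignores middle names, empty input and names written last-name-first, which is where a production parser earns its keep.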

There is also a Spatial Coding Module which can perform radial analyses, such as finding the dealer nearest to a given address; do point-in-polygon analysis, such as finding what sales territory an address belongs to; and do extractions based on Zip+4 areas displayed by the system on a map. A Demographic Coding Module can append demographics and cluster codes from Claritas or Equifax NDS. The system provides an online “QuickFind” function that lets the user enter an unformatted address and immediately receive the formatted address, demographic and geographic codes, and the nearest major street intersection.
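
For readers unfamiliar with the two spatial operations, the following Python sketch shows one common way to do each: great-circle (haversine) distance for the nearest-dealer search, and ray casting for the point-in-polygon test. These are standard textbook techniques, not necessarily what Centrus uses. The function names and sample coordinates are made up for illustration, and the polygon test treats coordinates as flat x/y values, which is a reasonable simplification only for small territories.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_dealer(customer, dealers):
    """Return the name of the dealer closest to customer, given as (lat, lon)."""
    return min(dealers, key=lambda name: haversine_miles(*customer, *dealers[name]))


def point_in_polygon(point, polygon):
    """Ray-casting test: is point (x, y) inside the polygon's list of vertices?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


# Hypothetical sample data.
dealers = {"Orlando": (28.54, -81.38), "Miami": (25.76, -80.19)}
print(nearest_dealer((27.95, -82.46), dealers))

territory = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), territory))
```

A linear scan over all dealers is fine for a few hundred locations. Larger files would call for a spatial index, which is beyond the scope of this sketch.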

In August, the vendor is scheduled to add a PAVE-certified postal presort module and an interface to the Postal Service’s NCOA/FastForward file of address change notices. Additional modules due by the end of 1997 will perform merge/purge, householding, and analysis of primary market areas based on the location of existing customers.

The Centrus user interface runs on Windows 95 and Windows NT workstations and servers. System functions are also available as libraries that can run on Windows or Unix servers, where they can be embedded in other applications via DLL or OCX calls or through Remote Procedure Calls from a mainframe.

The cost of Centrus depends on the modules and data purchased. A basic system, including Address Coding, the ability to link to external databases via ODBC, basic file handling and the user interface, costs about $20,000 for a single user. This includes postal plus GDT data with a year of bi-monthly updates. In later years, maintenance and updates cost $1,500 to $2,000 for postal data only or $7,500 for postal plus geocoding data. Other modules cost about $5,000 each with additional charges for data.

QMSoft began business in 1989 as a marketing service bureau and introduced its original geocoding product, StarCoder, in 1992. The firm offers several standalone products that are the equivalent of Centrus modules, as well as geographic databases and services. Centrus was introduced in October 1996 and currently has 350 users at over 140 companies.

DupeKiller VPP (DupeKiller, Inc., 888-434-2129) is something of an oddity: a merge/purge product that does not rely on address standardization. Instead, the system uses proprietary techniques to sort records so that likely dupes are next to each other and then compares each pair of adjacent records using “variable path processing”. This method applies different matching rules to each record depending on the characteristics it identifies. It is somewhat similar to the algorithms used by the CIF vendors, although DupeKiller has received a patent for its approach. The vendor reports that in one benchmark, it found 78,000 duplicates including just 1,000 false dupes, compared with 45,000 duplicates including 13,000 false dupes found by a “conventional” merge/purge product. That works out to 77,000 correct matches versus 32,000. DupeKiller has five installations, including three service bureaus, and is priced at $100,000 plus $1,000 per month in maintenance.
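
DupeKiller's actual techniques are proprietary and patented, so the Python sketch below is emphatically not its method. It illustrates only the general idea of sorting records so that likely duplicates land side by side, comparing adjacent pairs, and choosing a different rule depending on what kind of record is being compared. The sort key, field names, thresholds and function names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def sort_key(rec):
    # Hypothetical key: Zip Code, then the first four letters of the last name.
    return (rec["zip"], normalize(rec["last"])[:4])


def similarity(a, b):
    """Fuzzy similarity between 0 and 1 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a, b):
    """Pick a comparison rule based on the characteristics of the two records."""
    if a["zip"] != b["zip"]:
        return False
    address_ok = similarity(a["address"], b["address"]) >= 0.8
    if a.get("company") and b.get("company"):
        # Business path: compare company names.
        return address_ok and similarity(a["company"], b["company"]) >= 0.85
    # Individual path: compare last names.
    return address_ok and similarity(a["last"], b["last"]) >= 0.85


def find_adjacent_dupes(records):
    """Sort, then compare each record only with its immediate neighbor."""
    ordered = sorted(records, key=sort_key)
    return [(a["id"], b["id"]) for a, b in zip(ordered, ordered[1:]) if is_match(a, b)]


records = [
    {"id": 1, "last": "Smith", "address": "123 Main St", "zip": "33601"},
    {"id": 3, "last": "Jones", "address": "45 Oak Ave", "zip": "33601"},
    {"id": 2, "last": "Smith", "address": "123 Main Street", "zip": "33601"},
    {"id": 4, "last": "", "company": "Acme Corp", "address": "9 Dock Rd", "zip": "10001"},
    {"id": 5, "last": "", "company": "ACME Corp.", "address": "9 Dock Road", "zip": "10001"},
]
print(find_adjacent_dupes(records))
```

Note that neither address was standardized before matching. The fuzzy comparison absorbs "St" versus "Street", which is the trade-off such a design makes: less dependence on postal tables, more dependence on well-chosen sort keys and thresholds.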

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
