1993 Aug 01

List Cleaning on the PC
David M. Raab
DM News
August, 1993

Anyone who works with mailing lists soon learns that all lists are not created equal. Some are all capital letters and others are mixed upper/lower case; some have the first and last name in a single field and some have them separate; some have city, state and Zip on one line and others break them apart. The variations are endless.

But before a list can be processed through merge/purge, personalization or address standardization, the variations must be ended. Vendors like Group 1 and Pitney Bowes’ LPC have mainframe software to do this, but not everyone has a mainframe or wants to hire a mainframe service bureau. Many PC users have written their own systems for these functions. Peoplesmith Software (617-545-7300) provides three products that form an alternative.

– Right Fielder takes unstructured label files and puts each type of information (name, company, street, city, state, zip, etc.) in its own field.

– Personator splits names into prefix, first name, middle name, last name, and suffix/title, and also determines the gender by looking up the first name in a genderizing table. (A basic table of 7,500 names is standard with the product. A 100,000 name table, including common misspellings, can be purchased separately under the name of GenderBase).

– DataLift improves the appearance of lists, by converting to mixed upper/lower case, expanding abbreviations and correcting punctuation.

All three products come in standard and “professional” editions. The major differences are speed (the professional versions can access expanded or extended RAM), and that the professional versions can work directly with ASCII files, while the standard systems require conversion into a dBASE format. (Conversion utilities are included with all products. They are so fast–between 3 and 10 million records per hour–that some people use the Peoplesmith software for ASCII to dBASE conversion alone.)

The professional versions can also be run from DOS batch files, although the set-ups must be predefined using the interactive menus.

The programs are all written primarily in the Clipper programming language, with some C and Assembler code for speed in critical areas. The systems all have an excellent user interface, with Lotus-style menus, on-line context-sensitive help, and clear listing of available function key options when applicable.

Manuals are equally well done, with detailed explanantions of how each system works and surprisingly frank discussions of the limitations. In fact, sections of the Peoplesmith manuals could serve as a primer on the real-world complexities of list cleaning.

A typical list project would begin by using Right Fielder to take unstructured data and put it into a file where each type of information (name, address, city, etc.) is in a specific field. Right Fielder can work with fixed field, flat or comma delimited ASCII data, as well as single- or multi-column print image files and files (such as bank statements) where the address is in a fixed position.

To define a fixed field ASCII file, the user enters the length of each field in sequence. If you don’t already know the layout, this can be pretty inconvenient, since it can take a fair amount of trial and error to get it right. (Other systems give an image of the first few records and let the user simply mark where one field ends and the next begins.) On the other hand, setting up a multi-column print image file is simplicity itself: the system shows the first several lines, and the user just moves a vertical bar to where the second column begins. The system takes it from there.

Once a file has been defined, Right Fielder must be told how to process it. All setup is handled on a single screen, and the settings can be saved for later reuse either with another list or in batch processing.

Up to nine fields can be analyzed from the input file, and their contents can be assigned to as many as 15 output fields, classified by name, department, company, address, city, state, zip, country and telephone number. Output fields are always separate from the input data, so the original information is never lost. If you are certain a particular type of data is always in a particular location, you can tell the system to ignore that field.

The setup gives an impressive amount of control to the user. Choices include whether to stress speed or accuracy in processing; whether to discard unidentified fields; and how to handle exceptions including unrecognized lines, truncations, questionable records and international addresses.

The user can also specify whether to flag records when a particular output field is blank (e.g., no name, no state); instruct the system to ignore a fixed number of right-most characters on a line (real world application: strip out a keycode that is right-justified on the same line as the name); define whether the ampersand (&) is most likely to indicate a name, company or department line (or ignore it); and determine whether to give the last line special treatment.

In addition, the user can indicate whether a file contains mostly business or residential addresses, and what sequence the information is expected to be in. These last two elements give the system some ability to make “judgements” beyond those based on looking up key words in a table. Table look-up–that is, parsing each field into words, looking for each word in a table, and then finding the type of line the table says that word is typically associated with–is the primary technique of the program. It also makes the basic assumption that city, state, zip, country and phone number will appear at the bottom of each input record, and is not fooled if the last line is something like “Attn: Hillary”.

Our tests showed the sequence option to be something of a mixed blessing. When the option was employed, the system made some mistakes that it avoided when no sequence was assumed. On the other hand, using a sequence allows the system to flag records that appear to be out of sequence. These can later be called up for manual review and correction.

The program has other limits as well. It cannot handle split apart name, address and city/state/Zip when they are in a single record; it cannot recognize continuations of long titles when they are spread over two consecutive lines; and its tables work only with single words, not phrases.

But over all, the performance is quite impressive. Our test data was a conglomeration of dirty records from many sources, which we keep around to torture products like this. The system certainly made mistakes that were obvious on human review. But it was pretty amazing to see how our extremely sloppy file transformed into a mostly usable database.

Speed was also more than adequate: about 20,000 records per hour running the professional version on a 386/25, and nearly 90,000 records per hour on a 486/66 with 8 megabytes of RAM. The system is clever enough to locate and use the additional memory without any effort by the user. If a RAM disk is present, it can also be told to load lookup tables into it.

Once a run is complete, users have the option to call up any flagged records for review and editing. Alternately, the system can be set to pause during processing after each record is completed, or to pause only when questionable or problem records are encountered.

The system also includes a half-dozen utilities that handle ASCII conversion; file maintenance including search and replace, browse and edit, and dBASE structure modification; lookup table maintenance; and the ability to copy some or all records on a file. Most of these utilities are provided with all the Peoplesmith products.

Right Fielder is priced at $395 for the standard version and $695 for the professional edition. Peoplesmith provides unlimited, toll-paid technical support from 8:30 to 5:00 Monday through Friday.

Personator and DataLift add other capabilities to the list cleaning tool kit. Once the name field has been identified by Right Fielder, Personator will split it into prefix, first name, middle name, last name, and suffix/title. Like Right Fielder, it is largely table-driven. For speed, the system first checks against a table of the 5,000 most common names, which account for 80% of the cases. If no match is found, it then checks against a larger table, either the 7,500 name file provided with Personator, or the 100,000 name GenderBase file.

The names have also been classified by sex, which allows the system to “genderize” the file and assign a gender-specific prefix (Mr./Ms.). It can then create a separate salutation field in either formal (“Dear Mr. Clinton”) or informal (“Dear Bill”) format. The system is clever enough to automatically default to formal if no first name (or only a first initial) is available. It can also default to a user-defined slug (“Dear Friend”) if no name is found.

Personator shows the same attention to detail and flexibility as Right Fielder. Users can determine what to do when truncation occurs (often, truncation is a sign that the system has made an error); whether to put a comma and space before a suffix (e.g. “John Smith, MD”, vs. “John Smith MD”); and whether the name field holds first-last-suffix, last-suffix-first or last-first-suffix. In the genderization set-up, the user can assign male, female and neutral prefixes; define whether records without first names and those with first names not found in the table are considered neutral; determine how to handle truncations and neutral records; decide whether to correct apparent spelling errors; and choose whether to overwrite existing prefixes.

The system can also split single-field city/state/zip entries into multiple fields.

The three steps–name splitting, genderization and salutation-creation–are each done in separate passes through the input file, although each pass follows the other automatically. On our 486/66, each step ran at about 140,000 records per hour.

Personator costs $345 for the standard version and $695 for the professional system. GenderBase costs $595.

DataLift is designed primarily to improve the appearance of personalized mailings and labels. It does this by changing to mixed upper/lower case, by expanding common abbreviations, and by improving punctuation.

The basic system setup requires the user to select the fields to be processed and define the field type (since different rules will apply) for each. In operation, the system first applies general casing rules (somewhat more sophisticated than the typical “capitalize the first letter following each space”); processes exceptions such as “McDonald”, “du Pont”, “O’Hara”, “3rd”, and “of”; and then checks the lookup tables (different for each field type) for abbreviations to expand. Thus, “GM” in a title field might be expanded to “General Manager”, while “GM” in a company field might be expanded to “General Motors”.

The system lets users control the contents of their lookup tables, since the meaning of some abbreviations will be different based on region, industry, etc. It also has a “custom” field type that can be applied against a user-defined table. This might contain abbreviations particular to a specific company, such as part numbers to be expanded to full descriptions.

Because of the danger of truncation, each lookup record actually contains three entries: the original item, an expanded equivalent, and a standard abbreviation. Thus, “V.P.” might be expanded to “Vice President” if space permitted, or replaced by “VP” if it did not.

In about two months, the firm expects to release a new version of the program that has an expanded set of tools to help force contractions when a file is being prepared for a small format, such as a cheshire label.

Users can determine whether to retain any existing punctuation, or to strip out periods and commas. These would then be replaced with the punctuation determined by DataLift.

In our test, DataLift ran at about 30,000 records per hour on a 486/66. As with the other Peoplesmith products, the actual speed depends on the number of fields per record, how many lookups are needed per field, and the details of the hardware.

DataLift costs $345 for the standard version and $695 for the professional edition.

* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.

Leave a Reply

You must be logged in to post a comment.