1994 Jan 01
More Segmentation Tools
David M. Raab
DM News
January, 1994

More and more, direct marketers are looking for alternatives to traditional statistical segmentation methods like regression analysis. Better performance would be nice, but what they really want is something that is easier to use–so that the work can be done more quickly and cheaply, by in-house business analysts rather than expensive statistical specialists. The ideal solution would allow models to be built in hours instead of weeks, so that advanced segmentation can be applied to problems that are too narrow to justify the time and cost of a traditional model-building effort.

Neural networks and automated interaction detection (“tree analysis”) systems provide some of the better known alternatives. Here are two more.

DataLogic (Reduct Systems, 306-586-9408) is a PC-based software package designed to let non-technical users build, evaluate and implement “rough sets” models. Developed in the 1980s, “rough sets” finds patterns in data, which it then uses to develop rules that define the boundaries of different file segments or “sets”. The technique is called “rough sets” because the boundaries between the sets can be imprecise, or “rough”. (“Rough sets” is quite different from “fuzzy logic”, despite the similarity in names.)
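To make the idea of a “rough” boundary concrete, here is a minimal sketch of the lower and upper approximations used in standard rough set theory. It is an illustration only, not DataLogic's actual code; the customer records, field names and the `approximations` function are all hypothetical.

```python
from collections import defaultdict

def approximations(records, attributes, target):
    """Return (lower, upper) approximations of the record ids in `target`.

    Records with identical values on `attributes` cannot be told apart,
    so they are grouped into one block.
    """
    blocks = defaultdict(set)
    for rec_id, rec in records.items():
        key = tuple(rec[a] for a in attributes)
        blocks[key].add(rec_id)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:   # every record in the block is in the target set
            lower |= block
        if block & target:    # at least one record in the block is in the target set
            upper |= block
    return lower, upper

# Hypothetical customer file: id -> attribute values
customers = {
    1: {"recency": "recent", "orders": "many"},
    2: {"recency": "recent", "orders": "many"},
    3: {"recency": "recent", "orders": "few"},
    4: {"recency": "old", "orders": "few"},
    5: {"recency": "recent", "orders": "few"},
}
responders = {1, 2, 3}

lower, upper = approximations(customers, ["recency", "orders"], responders)
boundary = upper - lower  # records that cannot be classified with certainty
print("certain responders:", sorted(lower))
print("possible responders:", sorted(upper))
print("rough boundary:", sorted(boundary))
```

Customers 3 and 5 look identical on the two attributes, yet only one of them responded, so both fall into the boundary region. That region is the “rough” part of the set.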

“Rough sets” has some interesting advantages for direct marketers: its rules are easier to understand than the mathematical equations of regression or neural networks; it automatically eliminates redundant data elements; and it can handle imprecise, inconsistent, missing or ambiguous data by developing separate rules to handle different types of cases. Nor does the process require the user to find interactions, define non-linear relationships, normalize the distribution of variables, or perform similar tasks required by many other techniques. Users must still perform basic tasks such as file import and creation of important derived variables like average order size or lifetime total purchases, however.

The performance of “rough sets” is roughly comparable to that of other segmentation methods, according to Reduct; in numerous tests, they found that results were usually close, with the “winning” technique determined by the skill of the user and the nature of the particular data set. Even though DataLogic has automated the model-building process itself, user skill still matters. In addition to data preparation, users must set the size of the segments (“roughness”) and the minimum probability for a rule to be accepted (“precision”). A typical project involves building two or three models until the optimal segment size and probability are discovered.

Reduct has found that the best results are achieved when only the strongest rules are used. On cases those rules do not cover, the system simply reports “no decision”. This avoids the “over fitting” that sometimes occurs when modeling systems give answers based on weak evidence.
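The effect of a minimum “precision” and of reporting “no decision” can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Reduct's algorithm: it treats every combination of condition values as a candidate rule, ignores the “roughness” setting entirely, and uses hypothetical field names and data.

```python
from collections import Counter, defaultdict

def build_rules(cases, conditions, decision, precision):
    """Keep one rule per combination of condition values whose most common
    outcome has an observed probability of at least `precision`."""
    outcomes = defaultdict(Counter)
    for case in cases:
        key = tuple(case[c] for c in conditions)
        outcomes[key][case[decision]] += 1

    rules = {}
    for key, counts in outcomes.items():
        outcome, hits = counts.most_common(1)[0]
        probability = hits / sum(counts.values())
        if probability >= precision:  # weaker rules are discarded
            rules[key] = (outcome, probability)
    return rules

def classify(case, rules, conditions):
    """Apply the surviving rules; cases no rule covers get 'no decision'."""
    key = tuple(case[c] for c in conditions)
    if key in rules:
        return rules[key][0]
    return "no decision"

# Hypothetical training cases
cases = [
    {"recency": "recent", "orders": "many", "responded": "yes"},
    {"recency": "recent", "orders": "many", "responded": "yes"},
    {"recency": "recent", "orders": "few", "responded": "yes"},
    {"recency": "recent", "orders": "few", "responded": "no"},
    {"recency": "old", "orders": "few", "responded": "no"},
]
conditions = ["recency", "orders"]
rules = build_rules(cases, conditions, "responded", precision=0.8)

strong = classify({"recency": "recent", "orders": "many"}, rules, conditions)
weak = classify({"recency": "recent", "orders": "few"}, rules, conditions)
print(strong, "/", weak)
```

The recent, few-orders group splits evenly between responders and non-responders, so no rule for it survives the 0.8 threshold and such cases come back as “no decision” rather than as a guess.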

DataLogic uses a character-based interface with Lotus-style menus. It lacks an on-line help function, but is very easy to use even without one. No particular statistical or computer skills are required, beyond enough understanding of probability concepts to interpret the reports.

A DataLogic project begins by importing ASCII data, in either comma-delimited or fixed-record-length format. For each field, DataLogic automatically reads and defines the data type (symbolic, numeric, logical, categorized or continuous). Optionally, for continuous data, the user can define a precision level and break-points that seem important. The system will try to use these break-points, but discards them if they turn out to be irrelevant. The system will automatically find up to 100 values for a categorical item when the data is imported.

Users can edit the values in a data set after it is loaded, or directly enter new cases. The maximum file size is 10,000 cases with up to 150 variables for DataLogic/R, and 64,000 cases with up to 2,000 variables for DataLogic/R+.

Before actually running a model, the user must select which data elements to consider (the default being everything, since the system will automatically eliminate irrelevant or redundant items), which variable to predict (that is, the dependent or “decision” variable) and which values of the decision variable to look for (the default being all). Because “rough sets” predicts membership in a set, rather than a specific value, continuous decision variables must be grouped into ranges either by the user or DataLogic. The system can model only one decision variable at a time, although rerunning with a different decision variable merely requires calling up the variable list and placing an X next to the new item.

The final settings are for the fraction of the file to set aside as a validation sample (optional), and the “roughness” and “precision” to be used in the particular model run.

Actual processing time depends on the amount of data and complexity of the results. A test database of 25 records processes in a minute or so; 10,000 records might run for one or two hours on a high-end PC. While processing, the system automatically discards the redundant or irrelevant variables and data, generates the classification rules, and then validates the rules against the sample it has automatically set aside.

After a run is complete, the user can examine the rules that were generated, the strength of individual variables in contributing to each rule, the probability of each rule giving a correct result, and the number of cases applying to each rule. Several rules might lead to the same result, but each is listed separately and may describe a different market segment–making it relatively easy to understand how the “decision” was made for each case, the key characteristics of each segment, and whether a particular rule is reasonable. Users can also view the entire data set in a report showing which rule applied to each case. A separate report shows results of applying the rules to the validation set. All information can be viewed on the screen, printed or saved to a text file.

Depending on results, the user might accept the rules that were generated or change the roughness and precision measures and generate another set of rules.

Once the user has acceptable results, the system can apply the rules to classify new records. Where the outcomes are already known, this provides a further check on the system’s accuracy; where outcomes are not yet known, results can be used for scoring or segmentation. When more than one rule applies to a given record, the system can automatically select the rule with the highest probability of being correct.
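Choosing among several applicable rules amounts to a simple selection step, sketched below. The rule format, field names and probabilities are hypothetical and are not taken from DataLogic's own reports.

```python
def best_rule(record, rules):
    """Return the matching rule with the highest probability, or None."""
    matches = [
        rule for rule in rules
        if all(record.get(field) == value for field, value in rule["when"].items())
    ]
    if not matches:
        return None  # no rule covers this record
    return max(matches, key=lambda rule: rule["probability"])

# Hypothetical rules, each with the probability of giving a correct result
rules = [
    {"when": {"recency": "recent"}, "outcome": "respond", "probability": 0.62},
    {"when": {"recency": "recent", "orders": "many"}, "outcome": "respond", "probability": 0.91},
    {"when": {"orders": "few"}, "outcome": "no response", "probability": 0.75},
]

chosen = best_rule({"recency": "recent", "orders": "many"}, rules)
uncovered = best_rule({"recency": "old", "orders": "many"}, rules)
print(chosen["outcome"], chosen["probability"])
print(uncovered)
```

Two rules match the first record, and the more specific one wins because its probability is higher. The second record matches nothing, so the function returns `None`.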

DataLogic can accept new data either from a file or the keyboard. With direct entry, DataLogic can function as an interactive “expert system” for procedures such as lead classification, credit screening or fault diagnosis. In fact, although DataLogic has been used by some direct marketers, most of its 100-plus installations have been for industrial and scientific applications such as process control and medical diagnosis.

The system also has an optional module to create C-language computer programs that incorporate the rules it has generated. The programs can be embedded in other applications or used as the core of stand-alone expert systems.

Although DataLogic does a good job of explaining the rules, showing their reliability and highlighting exceptions, the system lacks analytical reports for direct marketers–such as ranking of segments by projected response rate, or producing a “gains” chart of the estimated, cumulative response and mail quantity for different cut-off points. The necessary information is contained in the existing reports, however. DataLogic also lacks any batch processing capability, either to automatically generate several runs with different roughness and precision settings, or to make multiple runs against different data sets. Reduct can add these capabilities on a custom basis.
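A “gains” chart of this kind is straightforward to assemble once each segment's mail quantity and projected responses are known. The sketch below assumes those figures have been copied out of the existing reports by hand; the segment names and numbers are invented for illustration.

```python
def gains_chart(segments):
    """segments: list of (name, mail_quantity, projected_responses).

    Returns rows ranked by response rate, with cumulative totals.
    """
    ranked = sorted(segments, key=lambda s: s[2] / s[1], reverse=True)
    total_mail = sum(s[1] for s in ranked)
    total_resp = sum(s[2] for s in ranked)

    rows, cum_mail, cum_resp = [], 0, 0
    for name, mail, resp in ranked:
        cum_mail += mail
        cum_resp += resp
        rows.append({
            "segment": name,
            "rate": resp / mail,
            "cum_mail": cum_mail,
            "cum_responses": cum_resp,
            "cum_rate": cum_resp / cum_mail,
            "pct_of_mail": cum_mail / total_mail,
            "pct_of_responses": cum_resp / total_resp,
        })
    return rows

# Hypothetical segments: (name, mail quantity, projected responses)
segments = [("A", 2000, 40), ("B", 1000, 50), ("C", 5000, 50), ("D", 2000, 60)]
rows = gains_chart(segments)

for row in rows:
    print(f'{row["segment"]}  rate={row["rate"]:.1%}  '
          f'cum mail={row["cum_mail"]:>6}  cum resp={row["cum_responses"]:>4}  '
          f'cum rate={row["cum_rate"]:.1%}')
```

Reading down the cumulative columns shows what each possible cut-off would yield. In this made-up example, mailing only the two best segments reaches 55 percent of the projected responses with 30 percent of the mail quantity.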

DataLogic comes in two versions: DataLogic/R (limited to 150 variables and 10,000 cases in the training data) and DataLogic/R+ (limited to 2,000 variables and 64,000 cases). A single-user license costs $475 for the DataLogic/R system and $1,695 for /R+. In addition, a special module to handle missing data without substituting estimated values costs $195. The ES-Shell module, which creates C code for a stand-alone expert system, costs upward of $200, depending on the customization required. Toll-paid technical support is free for three months and by service contract thereafter, and is available from 8:00 to 4:30 Central time. On-site training and consulting are available, although most users need only the computer-based tutorial and telephone support.

A demonstration disk–fully functional except that you cannot load new data–is available for $10.

Matchkey (The Matchkey Corporation, 415-856-9988) takes a lower tech approach to segmentation: it uses the simplest method there is, cross tabulation, and provides just a hint of automated assistance in making the selections. The logic behind this approach is that a knowledgeable marketer can ultimately do a better job of real-world segmentation than any automated procedure, so what’s really needed is a tool that makes it as efficient as possible for that marketer to manipulate the data. At the same time, Matchkey gives some discreet help–such as highlighting only performance differences that are statistically significant–so that marketers are not tricked by quirks in the data.

Although Matchkey is ultimately a cross tabulation system, it offers several advantages over the cross tab tools in general statistical packages such as SAS or Minitab. The system is considerably faster, handles larger files, and has specialized reports and functions for segmentation analysis. Perhaps most important, Matchkey automatically calculates a “criterion” measure for each segment, such as response rate or sales per thousand names. This criterion is the measure that the analyst wants to maximize, and Matchkey will automatically identify segments that are in the low, break-even or high ranges for this value. The criterion formula is hard-coded into each client’s Matchkey system before it is shipped, while the low, break-even and high levels can be set by the end-user for each run.

Matchkey works on a PC in a Microsoft Windows environment, with an interface that takes about a day’s training to introduce and a month to master. No on-line help is available, although there is a written manual and telephone support from 8:00 to 6:00 Pacific time.

A Matchkey project starts by preparing the data. Typically this involves several files–say, a mailing list, a list of responders, and history or demographic data–that must be sorted, merged and placed in a fixed length ASCII format. Matchkey comes with utilities to help this process, which must be completed before the data is loaded into Matchkey itself.

After the data is loaded, Matchkey will produce a “code book” that lists all the values found for each variable, and the number of records with each value. This gives the user an easy way to check the incoming data for errors and ambiguities. After the “code book” is created, the user can group records into ranges by simply highlighting the end of each range. Alternatively, the user can tell the system to report on only certain values of a symbolic variable (such as a cluster code or product type), and combine all the others. Variables can be given text labels that make them easy to understand in system reports.

Once the variables are defined, Matchkey will automatically create a series of tables. Each table lists all the value-ranges for one variable, and for each value-range shows the number of records, their percentage of the total file, the average criterion value, and the number and percentage of records that “meet” the criterion (that is, have a sale or response). Rows are labelled if they have statistically significant criterion averages that fall into the high, break-even and low categories.
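A table of this kind can be approximated with a short script. The sketch below is an assumption-laden illustration rather than Matchkey's method: it uses response rate as the criterion, and because the significance test Matchkey applies is not described here, it substitutes a simple normal-approximation z-test of each row's rate against the rate for the whole file. The data, thresholds and function name are hypothetical.

```python
import math
from collections import defaultdict

def crosstab(records, variable, low, high, z_crit=1.96):
    """One row per value of `variable`: count, share of file, response rate,
    and a label only when the rate differs significantly from the file rate."""
    counts = defaultdict(lambda: [0, 0])  # value -> [records, responders]
    for rec in records:
        cell = counts[rec[variable]]
        cell[0] += 1
        cell[1] += rec["responded"]

    total = sum(n for n, _ in counts.values())
    overall = sum(r for _, r in counts.values()) / total

    rows = []
    for value, (n, responders) in sorted(counts.items()):
        rate = responders / n
        # Assumed test: z-score of the row rate against the overall file rate.
        se = math.sqrt(overall * (1 - overall) / n)
        z = (rate - overall) / se if se > 0 else 0.0
        label = ""
        if abs(z) >= z_crit:
            if rate >= high:
                label = "high"
            elif rate <= low:
                label = "low"
            else:
                label = "break-even"
        rows.append({"value": value, "records": n, "pct_of_file": n / total,
                     "responders": responders, "rate": rate, "label": label})
    return rows

# Hypothetical file: 1 = responded, 0 = did not respond
records = (
    [{"region": "east", "responded": 1}] * 30 + [{"region": "east", "responded": 0}] * 70
    + [{"region": "west", "responded": 1}] * 5 + [{"region": "west", "responded": 0}] * 95
    + [{"region": "north", "responded": 1}] * 2 + [{"region": "north", "responded": 0}] * 8
)
table = crosstab(records, "region", low=0.10, high=0.25)
for row in table:
    print(f'{row["value"]:<6} n={row["records"]:>4} '
          f'share={row["pct_of_file"]:.1%} rate={row["rate"]:.1%} {row["label"]}')
```

The small “north” group shows a 20 percent response rate but goes unlabelled, because ten records are too few to distinguish it from the file average. This is the kind of quirk the significance check is meant to screen out.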

Once the initial tables are complete, the user can select a row or group of rows that make up an interesting segment, and then create a new set of tables that show the other variables against those records only. It might take 10 to 15 minutes for a 486/33 PC to generate a set of tables on a 30,000 record file with 20 variables.

This process usually continues to three or four levels of variables in a single segment. Once a desired segment has been finalized, it can be extracted from the rest of the file, and the analyst can repeat the process on the remaining records. A typical project results in 15 to 20 segments, and takes an experienced user about two days.

When the work is complete, the user can save the segmentation and run it against a separate sample of records (automatically set aside by the system) to check its accuracy. The system also gives a summary report that shows the number of records, criterion average and records meeting the criterion for each segment. This is not precisely a “gains report”, because it does not automatically rank the segments by performance or show cumulative results. But the needed data is there, and Matchkey has recently added the ability to transfer reports into spreadsheet files for further manipulation.

Users can implement a Matchkey segmentation by importing a file, having Matchkey append segment codes, and exporting the result. Or Matchkey can list the logical rules used to create the segments, which can be transformed into a computer program for another system. Matchkey users regularly score files of well over a million records within the system.

Matchkey was originally built on a mainframe computer and converted to DOS in 1987. The Windows version was launched in early 1993 and has been sold to about 50 users. Cost is $1,995 including six months of telephone support and a year of upgrades. Training costs $750 per day, either at the client’s site or Matchkey’s offices.

There is a free Matchkey demonstration diskette, which is a fully functional system but cannot import new data. The demo comes with a limited printed tutorial, and full documentation can be purchased for $25.

The Matchkey Corporation offers database consulting and file segmentation services in addition to the software.

* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
