2004 Feb 01
DM STAT-1 Consulting GenIQ
David M. Raab
DM News
February, 2004

Playing with a new technology just because it’s cool is a luxury that few can afford. But if a useful technology just happens to be interesting–well, who’s to complain?

So it is with genetic algorithms and predictive modeling. The genetic approach evolves models by introducing random variations and letting the fittest survive. The analogy to genetics is exact: each element of a prediction formula is a gene; exchange of genes among successful models is breeding; random changes in formulas are mutations; and the best-performing models have the highest chance of reproducing.

Genetic systems start by randomly combining predictive variables and mathematical functions to build many formulas. Each formula is tested against cases with known outcomes and given a performance score. After all formulas have been tested, the most accurate exchange random elements (breed) and undergo a few random changes (mutate) to produce a new generation. The system then scores the new formulas and repeats the cycle. As the process randomly discovers and then retains the most powerful variables and relationships, the models become more accurate. The rate of improvement slows over time, as fewer untried elements remain, and eventually the best surviving model is chosen.
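For readers who like to see the cycle spelled out, here is a minimal Python sketch of the build, score, breed and mutate loop just described. It illustrates the general technique only and is not any vendor's implementation. Everything in it is an assumption made for the example: each formula is a short list of genes, each gene pairs one input variable with one of three hypothetical functions, performance is scored as negative mean squared error, and the top fifth of each generation survives to breed.

```python
import math
import random

# Hypothetical building blocks: a gene pairs one input variable with one function.
FUNCS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "log1p": lambda x: math.log1p(abs(x)),
}

def random_gene(n_vars, rng):
    return (rng.randrange(n_vars), rng.choice(sorted(FUNCS)))

def predict(formula, row):
    # A formula's prediction is the sum of its genes applied to the row.
    return sum(FUNCS[name](row[i]) for i, name in formula)

def fitness(formula, rows, outcomes):
    # Higher is better: negative mean squared error on cases with known outcomes.
    errors = [(predict(formula, r) - y) ** 2 for r, y in zip(rows, outcomes)]
    return -sum(errors) / len(errors)

def breed(a, b, rng):
    # Exchange genes: head of one parent, tail of the other (needs >= 2 genes).
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(formula, n_vars, rng, rate=0.1):
    # Replace each gene with a fresh random one with probability `rate`.
    return [random_gene(n_vars, rng) if rng.random() < rate else g for g in formula]

def evolve(rows, outcomes, n_genes=3, pop_size=50, generations=20, seed=0):
    rng = random.Random(seed)
    n_vars = len(rows[0])
    pop = [[random_gene(n_vars, rng) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []  # best score seen in each generation
    for _ in range(generations):
        ranked = sorted(pop, key=lambda f: fitness(f, rows, outcomes), reverse=True)
        history.append(fitness(ranked[0], rows, outcomes))
        survivors = ranked[: pop_size // 5]  # the fittest survive unchanged
        children = [
            mutate(breed(rng.choice(survivors), rng.choice(survivors), rng), n_vars, rng)
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    best = max(pop, key=lambda f: fitness(f, rows, outcomes))
    return best, history
```

Calling `evolve(rows, outcomes)` on a small table of numeric inputs returns the best surviving formula and the best score from each generation. Because the survivors are carried forward unchanged, that score can never get worse from one generation to the next.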

It’s important to recognize that “survival of the fittest” is what allows genetic systems to function efficiently, by passing useful features to succeeding generations. An approach that simply created and tested random formulas could run virtually forever without homing in on the best results. Even with the assistance of evolution, genetic systems create tens of thousands of models before they are ready to declare a winner.

The genetic approach has two grand advantages over traditional model development: it takes much less effort by the modeler and it produces better results. The labor savings are obvious: no need to preselect variables, identify appropriate data transformations, define likely relationships among variables, or assess alternative models. Hands-on effort is literally reduced from days to minutes.

Better results share the same origins. The system can test more options and ultimately find variables, transformations and relationships that work better than the more obvious choices. Nor is it constrained by the preconceptions and rules of thumb that human modelers must apply to work efficiently. For example, the system may pick the less common of several closely correlated variables or find multi-way interactions among several variables. Vendors of genetic systems report that their models consistently outperform those built by experienced statisticians by anything from 5% to 20%.

So if genetic systems are so great, why haven’t more companies adopted them? It’s not because users fear for their job security: most statisticians would be delighted to find a tool that let them produce better models with less work. This could only enhance their value to their employers. But the random and hidden nature of genetic model building makes some people nervous. This is compounded by the difficulty of interpreting models containing odd variables or calculations. Accepting these uncertainties requires violating one of the most basic rules of modeling: don’t use a model you don’t understand, because it might contain a hidden error.

But there are ways to address these issues, and the benefits of genetic approaches are too compelling to ignore altogether. So developers keep trying.

GenIQ (DM STAT-1 Consulting, 800-367-8281, www.dmstat1.com) takes a very pure genetic approach, letting the system try any mathematical relationship among any variables. Scoring formulas are built from two variables connected by a mathematical operator (add, subtract, multiply, etc.). Each variable may itself represent another variable/operator/variable combination, and the variables within those combinations may be combinations themselves, and so on. The resulting model formula is thus a set of nested calculations. A typical GenIQ model runs several layers deep and uses about a dozen input variables.
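The nesting is easier to picture with a small example. The Python sketch below is purely illustrative and is not GenIQ's internal format: it assumes a formula is stored as a tuple of (operator, left, right), where each side is either a variable name or another such tuple. The variable names and the sample record are invented for the example.

```python
import operator

# Only the operators named in the text; a real system would offer more.
OPS = {"add": operator.add, "subtract": operator.sub, "multiply": operator.mul}

def evaluate(node, record):
    # A leaf is a variable name; anything else is (operator, left, right).
    if isinstance(node, str):
        return record[node]
    op, left, right = node
    return OPS[op](evaluate(left, record), evaluate(right, record))

def depth(node):
    # Number of operator layers between the root and the deepest variable.
    if isinstance(node, str):
        return 0
    _, left, right = node
    return 1 + max(depth(left), depth(right))

def inputs(node):
    # The distinct input variables used anywhere in the formula.
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return inputs(left) | inputs(right)

# Hypothetical formula: (recency * frequency) + (income - (age * recency))
formula = (
    "add",
    ("multiply", "recency", "frequency"),
    ("subtract", "income", ("multiply", "age", "recency")),
)
record = {"recency": 2.0, "frequency": 3.0, "income": 50.0, "age": 4.0}

score = evaluate(formula, record)   # 2*3 + (50 - 4*2) = 48.0
layers = depth(formula)             # 3
used = inputs(formula)              # the four variable names
```

Breeding in such a scheme amounts to swapping one sub-tuple for another, which is why nested formulas lend themselves to genetic methods.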

In addition to basic genetic techniques, GenIQ applies sophisticated methods to handle missing data, avoid overfitting to data anomalies, and remove unnecessary complexity. Users can control the details of these and other options during model development, although the default settings usually suffice.

GenIQ usually builds 250 models per generation and runs about 20 generations. While most modeling systems, genetic and otherwise, aim to make the most accurate predictions across all cases, GenIQ focuses on doing the best job of finding the top-responding file segments. This is measured by “lift”: the response rate for the top few deciles compared with the rate for the entire group. Maximizing lift is typically the real goal of direct response modeling, so focusing on lift directly lets GenIQ build the most useful model possible.
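As a rough illustration of the measure, the Python sketch below ranks cases by model score and divides the response rate in the top deciles by the response rate for the whole file. Expressing lift as a ratio is a common convention and is assumed here; the function name, its parameters and the sample data are invented for the example, and this is not GenIQ's own calculation.

```python
def lift(scores, responses, top_deciles=1):
    """Response rate in the top-scoring deciles divided by the overall rate."""
    if not 1 <= top_deciles <= 10:
        raise ValueError("top_deciles must be between 1 and 10")
    if not scores or len(scores) != len(responses):
        raise ValueError("scores and responses must be non-empty and equal in length")
    # Rank cases from highest model score to lowest, keeping only the outcomes.
    ranked = [r for _, r in sorted(zip(scores, responses), key=lambda p: p[0], reverse=True)]
    overall_rate = sum(ranked) / len(ranked)
    if overall_rate == 0:
        raise ValueError("lift is undefined when nobody responded")
    n_top = max(1, len(ranked) * top_deciles // 10)
    top_rate = sum(ranked[:n_top]) / n_top
    return top_rate / overall_rate

# Hypothetical file of 20 cases, already in descending score order; 1 = responded.
scores = list(range(20, 0, -1))
responses = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1] + [0] * 10

top_decile_lift = lift(scores, responses, top_deciles=1)
top_two_deciles_lift = lift(scores, responses, top_deciles=2)
```

In this made-up file, 4 of 20 cases responded, for an overall rate of 20%. Both cases in the top decile responded, so the top decile responds at about five times the overall rate. Three of the four cases in the top two deciles responded, giving a lift of about 3.75.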

GenIQ displays the model formula as a branching tree, making it easy to read. But it is still virtually impossible to understand, since many calculations will involve apparently unrelated inputs or intuitively meaningless derived values. GenIQ does provide some comfort by giving a report that shows the importance attached to each input used in the model. But users hoping for a comprehensible explanation of the underlying logic will not be satisfied.

Users who are more interested in results, speed and ease of use are more likely to be pleased. GenIQ runs on a Windows PC and can build a model in 15 minutes on 20,000 test cases. The system accepts flat file input and requires virtually no data preparation. Setting up a model requires specifying the target and predictor variables and selecting other parameters or simply accepting the defaults. Displays include the model tree, gains chart, and variable importance ranks. The model formula can be exported in SAS, SPSS, XML, SQL or Basic formats to use in production scoring. Since there is no preprocessing of test data to create transformations and derived variables, there is no need to recreate such preprocessing on production data before scoring.

GenIQ is priced from $30,000 to $60,000 for a perpetual license, depending on the capabilities provided. The $60,000 version allows unlimited numbers of variables and records and supports both binary and continuous target variables. The system has been evolving since 1994 and has been sold to more than a dozen clients.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
