
Visual Data Analysis
David M. Raab
DM News
September, 1994

Maybe it’s just my imagination, but the debate comparing neural networks with traditional regression methods seems to have led many more people to understand something that statisticians have long known, and would even admit after a drink or two: that choice of modeling technique has much less impact on a segmentation project than the quality of initial data preparation and analysis. In fact, the most exciting developments in data analysis today concern “data visualization tools”, which replace columns of numbers with graphical displays.

(Personally, I attribute the whole trend to MTV–even statisticians are no longer willing to tease an elusive relationship from a stack of computer printouts. Instead, they want dancing, singing information that leaps out at them with the brazenness of Madonna in concert.)

Here are three tools that help analysts understand their data without the traditional drudgery.

dbPROFILE (Advanced Software Applications, 412-429-1003) combines simple conventional analysis with automated clustering and the visual exploration of the relationships among variables. Like ASA’s successful ModelMAX neural network, the product runs on a Windows PC and is designed for marketers with little knowledge of statistics. But instead of producing specific recommendations for which names to mail, dbPROFILE helps to define and understand the characteristics of customer groups. This can identify market segments with different needs and interests, determine which products are most often sold together, or find which variables are most important for traditional segmentation. The system can also find differences among segments defined by external modeling systems, such as ModelMAX itself.

Using dbPROFILE requires a bare minimum of effort. The user points to data stored in any ODBC-compliant database such as Informix, Sybase, dBASE, Paradox or FoxPro, and the system automatically displays a list of fields in that file. The user selects the fields to include, and can choose to import the entire file or a random sample of a specified size. There is no explicit limit to the number of records or fields per record, although the practical limit is probably related to speed: it takes about 5 minutes to load a 5,000 record database. The system has been tested with sets as large as 40,000 records.

Once data is imported, the user can create new variables by applying math functions (add, multiply, subtract, divide, logarithm) to the existing data, and can examine the data through SQL queries. The initial release of the system requires the user to write actual SQL statements, although ASA expects to add a point-and-shoot SQL generator within a few months.

The system also allows the user to choose two variables and see their relation displayed in a cross-tabulation report. The cross tab shows both counts and percentages for each cell, but does not allow the user to place other data in the cells, create a column that combines several values of a variable, or limit the analysis to specific subsets of the data. A separate report can show the mean, minimum and maximum values for each numeric variable in the database.
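A cross-tab of this kind boils down to counting each combination of two variables' values and expressing each cell as a share of the whole file. Here is a minimal sketch in Python; the field names and records are hypothetical, not taken from dbPROFILE itself:

```python
from collections import Counter

# Hypothetical records; dbPROFILE would read these from an ODBC source.
records = [
    {"region": "East", "buyer": "Y"},
    {"region": "East", "buyer": "N"},
    {"region": "West", "buyer": "Y"},
    {"region": "West", "buyer": "Y"},
]

# Count each (row value, column value) combination.
cells = Counter((r["region"], r["buyer"]) for r in records)
total = sum(cells.values())

# Each cell carries both a count and a percentage of the whole file.
for (row, col), n in sorted(cells.items()):
    print(f"{row:5s} {col}: count={n}, pct={100 * n / total:.1f}%")
```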

When the user is ready to run the clustering process itself, the only choices are which variables to include and the number of clusters to create (the default is 16). The user indicates whether each chosen variable is “continuous” (typically numeric values or dates) or “categorical” (a limited number of unordered values). The system then automatically determines the unique values present for the categorical variables, and groups the continuous variables into ten equal-sized ranges. Then it creates the clusters themselves, using an “unstructured” neural network algorithm (meaning that it is not trying to predict an already-known outcome). On a 5,000 record database with 30 variables, this might take about 15 minutes.
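One plausible reading of "ten equal-sized ranges" is equal-width binning: divide the span between a variable's minimum and maximum into ten intervals of equal width and assign each value a range index. The sketch below illustrates that reading; the function name and sample values are the author's own, not part of dbPROFILE:

```python
def bin_continuous(values, bins=10):
    """Group a continuous variable into `bins` equal-width ranges,
    returning a 0-based range index for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # guard against a constant column
    # The maximum value would otherwise land in bin `bins`, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Hypothetical income values; the extremes land in the first and last ranges.
incomes = [12000, 25000, 38000, 54000, 99000]
print(bin_continuous(incomes))  # [0, 1, 2, 4, 9]
```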

When the process is complete, the user checks that each cluster contains about the same number of records–if not, the process should be rerun with a different number of clusters, which will yield a different result. The system also gives a report showing the “centroid” (a Euclidean measure of central value–think of it as an average that went to college) of each variable in each cluster, which gives an idea of the characteristics of the records that cluster contains.
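In the simplest case, a cluster's centroid for a variable is just the mean of that variable over the records assigned to the cluster. The sketch below computes such a per-cluster mean table; the record layout and variable names are hypothetical:

```python
# Hypothetical clustered records: each carries its assigned cluster code.
records = [
    {"cluster": 0, "age": 30, "income": 40},
    {"cluster": 0, "age": 40, "income": 60},
    {"cluster": 1, "age": 60, "income": 90},
]

def centroids(records, variables):
    """Mean of each variable within each cluster (the cluster 'centroid')."""
    sums, counts = {}, {}
    for r in records:
        c = r["cluster"]
        counts[c] = counts.get(c, 0) + 1
        for v in variables:
            sums[(c, v)] = sums.get((c, v), 0) + r[v]
    return {(c, v): s / counts[c] for (c, v), s in sums.items()}

print(centroids(records, ["age", "income"]))
```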

Now the visual fun begins. Instead of reading the table of centroids, the user can create a “perception map” that plots each cluster on an X/Y chart, where the X and Y axes represent the centroids of two user-selected variables. This shows at a glance how the values of those variables change for each cluster. To understand how the different variables relate to each other, the user can then select a different X variable and see the clusters plotted with its value instead. Plots of several X variables can be shown on the same chart, with different colors for each. The clusters themselves can get somewhat jumbled, but they are labeled and, because the Y variable doesn’t change, each cluster remains on the same horizontal line. By changing X and Y variables, the user can gain a good understanding of the cluster characteristics and how different variables interact.

Keeping with the visual theme, the system also lets the user move a “cross hair” on the perception map, and then select all records (regardless of cluster) that exceed the X and Y values at the current cross hair location. Because the system also stores the cluster code on each of the input records, the user can write SQL statements that select records based on cluster code as well. But the system does not provide the actual definitions of the clusters themselves–so there is no direct way to put all records in a larger file into the same clusters. At best, the user would have to load those records into dbPROFILE and process them through the system. This should yield similar clusters assuming the data is similar to the original set. Once selected with SQL, the records can be exported to another system.

dbPROFILE was released in July, 1994. It is priced at $10,000 for a single user system, plus annual maintenance of 16.5%. The system itself requires virtually no training, but ASA offers a two-day course in applying its results for $695. ASA is preparing a “slide show” demonstration disk which should be ready by late September.

IDIS (IntelligenceWare, 310-216-6177) applies conventional statistical methods to the discovery of data relationships, using artificial intelligence to find unexpected patterns without any user intervention. The system marries this discovery capability with a set of tools for visualization and for analysis of data quality and irregularities.

IDIS exists as both a stand-alone Windows PC product and in a client/server version that can work with Unix, DEC VAX and IBM mainframe systems. The system reads data directly from files in ASCII, dBASE, Lotus, Paradox, Oracle, Sybase, DB2 and other formats. Multi-table data must be joined into a single file before it can be used, however.

The PC system typically handles databases of 50,000 to 100,000 records, while larger databases can exceed 50 gigabytes or 20 million records. Analysis of a small PC database might take from a few minutes to half an hour, while a 40 million record database might process for a full day on a mainframe. The system works with both continuous and categorical variables directly. It also lets users group variables into categories and create derived variables such as ratios.

At the simplest level, all a user must do is point IDIS at a specific file, accept default parameters, and let it run. The system will examine the distributions of values within the data, form hypotheses regarding relationships among variables, test these, modify them as appropriate, and then continue the cycle until it has come up with a set of rules that describe significant relationships. The rules are expressed in if/then/else statements that are easy to understand, although they are not in any specific programming language. The rules can be viewed within the system or exported as ASCII text files.

Users can take additional control as desired. Instead of allowing the system to “roam” freely among the data, they can direct the system to examine the relationship of data to a specific variable–such as promotion response or purchases. (A true predictive modeling module is scheduled for release by the end of this year.) Sophisticated users can also adjust up to twenty parameters that cover such things as the level of detail examined by the system, confidence levels, minimum sample sizes, rule complexity and margin of error.

Once the system has developed its rules, users can see statistics for each rule such as margin of error and proportion of applicable cases, browse the underlying data, or see specific records that support the rule. The data quality portion of the system will find records that are exceptions to common rules or represent statistical outliers, so these can be examined more closely.

The system’s visualization module offers a variety of pie, bar and three-dimensional box diagrams that can illustrate the patterns discovered by the system or show the importance of different variables for different rules. The system automatically selects the appropriate graph for a particular type of data and allows the user to modify the graphs, examine specific regions more closely, and plot variations.

IDIS was launched in 1992, although the “discovery” portion of the system has been available as a separate module since 1988, and has sold thousands of copies. The system is particularly adept at finding irregularities in data, so it is often used for applications such as auditing in financial, insurance and other industries. Direct marketers have used it to help identify important predictive variables that would otherwise be overlooked by traditional statistical methods. According to the company, the most successful users have tended to be either non-statisticians or very experienced statisticians–because both groups are willing to accept the system on its own terms, rather than viewing it as a competitor of traditional statistical methods.

The stand-alone PC version of IDIS costs $1,900, while client/server prices depend on the size of the maximum database. A half-million record version costs $15,000 per server, which steps up to $150,000 for a system that handles 50 million records or more. Annual maintenance is 20% of the initial license. Training times vary from one-half day for experienced statisticians to two to four days for novice users, and the company provides toll-free user support. IntelligenceWare offers several courses in data analysis and mining procedures, as well as using its software.

In addition to IDIS, IntelligenceWare offers products including Iconic Query, a graphical SQL query generator, and Corporate Vision, a tool for multi-dimensional, drill-down data analysis.

TempleMVV (Mihalisin Associates, 215-646-3814) is built on the theory that the human brain is better than any statistical or computer technique at detecting patterns, so long as data is displayed in graphical formats the mind can understand. The system applies patented graphical display tools that use size, shape, color and position to embed four to six dimensions of data in a single image. Users can then look for changes in the visual patterns, rather than at actual numbers, to identify situations that are worth closer examination.

Underlying the product is a proprietary database system that imports ASCII data, converts it to a binary format, and then creates a “tree” storing the number of records for each combination of values for up to ten variables. The tree is loaded into the random access memory (RAM) of the user’s computer, where data is aggregated and displayed in a second or two.

Because TempleMVV stores counts for the combinations of variables, the size of the file is determined by the number of variables and values per variable, rather than how many records are in the original data set. The system can handle about a million combinations on a powerful Unix workstation with 512 megabytes of RAM; this would translate to six values for each of ten variables, or a larger number of values if fewer variables are used. The number of input records can be in the tens of millions.

(The reliance on RAM is one major distinction between TempleMVV and Cross/Z International’s Fractal DBMS, which functions somewhat similarly and may be familiar to readers of this column. Cross/Z stores its compressed data on disk, which lets it handle a billion combinations compared to the million or so that TempleMVV can fit into RAM.)

Since the original data may contain continuous variables with many unique values, TempleMVV includes facilities to group those values into ranges. The system can do this automatically or the user can define the values explicitly. The current release of the system cannot scan the input data to find unique values, so the user must also enter the list of the values allowed for existing categorical variables. The user can also create derived variables by performing mathematical and logical operations on existing data.

Once the data is ready, the user selects up to ten variables and lets the system build a “tree”. The user also has the ability to define a “dependent” variable–say, revenue–that will be stored in the tree along with segment counts; this allows the user to develop reports based on either counts or the other variable. The system can also store a list of the actual records associated with each unique combination of variables, so the system can later export a list of records associated with a given set of characteristics. This list occupies additional memory (about 4 bytes per record), though, so it reduces the number of combinations that can be analyzed.
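The essential idea of such a tree can be sketched as a table keyed by the tuple of (grouped) variable values, holding a record count and, optionally, a running sum of one dependent variable. The variable names and records below are hypothetical, and this is the author's simplification rather than TempleMVV's actual data structure:

```python
# Hypothetical input: (region, age range, revenue) per record.
records = [
    ("East", "18-34", 120.0),
    ("East", "18-34", 80.0),
    ("West", "35-54", 200.0),
]

# The "tree": one entry per combination of variable values, storing
# a record count plus the sum of the dependent variable (revenue).
tree = {}
for region, age, revenue in records:
    key = (region, age)
    count, total = tree.get(key, (0, 0.0))
    tree[key] = (count + 1, total + revenue)

# Reports can now be driven off counts or the dependent variable
# without rereading the original file.
print(tree[("East", "18-34")])  # (2, 200.0)
```

Because only one entry exists per combination, the structure's size depends on the number of variables and values, not on how many records were read–which matches the sizing behavior the column describes.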

Running on a small Unix system (Sun Sparc1), the system takes about one and a half hours to import an eighty megabyte, three million record database, and an additional ten minutes to build a tree for that file. Although the tree resides in RAM during analysis, it can be stored on disk for later reuse.

The actual charts produced by TempleMVV can take literally millions of different forms, although the basic approach is a set of “nested” bar or balloon charts that display several levels of data in a user-defined hierarchy. One of the simpler displays might show three rows with four boxes each, where each box represents a different combination of education and age levels. Within each box would be a square representing the total number of people in the group, plus three color-coded bar charts, each showing the number of people in six income ranges for a particular household size. This is a tremendous amount of information, although it takes a practiced eye to understand what’s going on. Mihalisin Associates reports that most users of the product have been senior statisticians–not because the system requires any statistical knowledge, but simply because of the sophistication required to understand what to look for and how to recognize something interesting when it appears.

The system does attempt to help the user by suggesting the appropriate type of chart for a particular analysis, and by automatically limiting the detail presented in the chart to accommodate the resolution of the computer monitor. A variety of tools make it easy to change the scale, level of detail, colors, and other aspects of the display. The system also lets the user choose whether graphs represent counts or other measures, including minimum, maximum, mean, or standard deviation. The user can focus on a particular area of a chart to see more detail, view the underlying statistics in a numeric table, or even examine the underlying records themselves (assuming the linkage is in place).

The system has no explicit predictive capabilities, although a “boolean” capability allows it to display the probability of membership in a group–for example, the proportion of buyers in each region of the chart.
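Given the counts already stored in the tree, that probability display reduces to dividing the number of records meeting the boolean condition by the total for each chart region. A minimal sketch, with entirely hypothetical region names and counts:

```python
# Hypothetical counts per chart region: (buyers, total records).
regions = {
    "East": (120, 400),
    "West": (90, 450),
}

# The "boolean" display would plot each region's membership probability:
# buyers as a proportion of all records in that region.
for name, (buyers, total) in sorted(regions.items()):
    print(f"{name}: {buyers / total:.0%} buyers")  # East: 30%, West: 20%
```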

A “presentation” feature allows the user to annotate and store a sequence of charts so they can be presented to other users. These are live charts, so they can be modified or examined more closely in the course of the presentation. A new tree can be applied to an existing presentation, to produce a standard package of reports that are updated regularly with fresh data.

TempleMVV runs on Windows and Unix hardware. The database engine is written in standard C code and the interface is developed in a cross-platform tool that would support PC, Unix and Macintosh installations.

Pricing depends on the hardware platform, and begins at $25,000 for a Windows version. Maintenance costs 20% per year, and buyers receive customized training depending on their needs–usually two or three days. A training version of the system, including a detailed manual and tutorial, is available for $100, and a screen-show demonstration disk is free. TempleMVV was released in early 1994, and has over 20 installations to date. Current users include catalog and insurance direct marketers, as well as pharmaceutical and telecommunications firms. Mihalisin Associates was founded in 1987 to exploit research into data visualization tools done at Temple University. Its original product was a more conventional graphical analysis package called TempleGraph.

* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
