David M. Raab
DM News
August, 1999
.
Relational databases are good for many things, but quick-and-dirty data analysis by non-technical users is not among them. Designing a data structure, importing the data, writing in Structured Query Language (SQL), and massaging the query results all take substantial effort by skilled technicians for any but the most trivial tasks. Good query performance against large databases also requires computers that are too big and expensive to let amateurs access directly. In addition to the obvious cost in time and money, working through technical experts increases the chance of error due to miscommunication between the people asking the questions and those generating the answers.
Quite a few solutions have been offered to this problem. Query generators can help create SQL code. Specialized database designs, notably the “star schema”, allow simple queries to answer complicated questions. Views and metadata layers can perform sophisticated queries without the user knowing how they are written. Advanced index technologies provide superior performance on less expensive hardware. At the extreme, specialized database engines like Oracle Express and Hyperion Essbase load data into non-relational structures.
While these approaches often make it easier to operate the final system, they still require significant technical skill to set up in the first place. And if users want to do something that was significantly outside the boundaries of the initial approach–like add a new data source or change the level of detail–the technicians must be called in once more. So the dependence on external resources still remains.
Digital Archaeology Discovery Suite 2.0 (Digital Archaeology, 913-438-9444, www.digarch.com) promises to let non-technical users perform the entire job of complex, ad hoc analysis by themselves. Using a proprietary database technology called X-Set, the system lets users access and manipulate data without defining its structure in advance. It also provides a powerful programming language with abilities beyond standard SQL, such as conditional execution and recursive processing.
Taken together, these features let users load, manipulate and report on data without the help of database experts. The company cites several cases where it was able to provide information in days that was still unavailable after years of requests to technology staff using conventional methods. Such comparisons can be misleading–presumably the work could have been done in a reasonable time using conventional tools if the technical staff had given it priority. But the larger point is still valid: the technical staff had not done the work because its limited resources were deployed elsewhere; Digital Archaeology got users their answers because it bypassed the resource bottleneck.
X-Set works by building an ordered map of the physical location of each piece of data in a database–thereby allowing fast, direct access to information without indexes or external access methods. The results of each query are stored as another map, which later queries may read instead of the original data map. This yields faster performance over time as query results are reused.
When the original data is stored in a comma-delimited text file, Digital Archaeology builds its map without importing the file itself–a process that ran in just over one hour on a 5.2 gigabyte, 45 million row test performed by the company. Loading the same data into a standard relational database took considerably longer. When the data is already stored in a relational database, Digital Archaeology must first import the data before it builds the map. This took seven hours on the 5.2 gigabyte test data, running a dual Pentium Windows NT server.
The system imports relational data automatically, reading the relational database’s field and table definitions. A wizard guides users through the definition of text file sources.
Because the X-Sets store the location of every data item, they take up considerable space: to accommodate the initial map plus additional query results, Digital Archaeology recommends having available eight to ten times as much disk space as the input data itself. Relational databases commonly need two to four times the raw data for work space, although they sometimes can reach the eight to ten multiplier as well.
Once the data is loaded, queries are returned very quickly–and speed increases over time as the results of earlier queries become available for reuse. The company cited a 20 second response time for the initial query against a 4.5 million row database.
Users build Digital Archaeology queries with a graphical interface, drawing on a palette of operators and functions to lay out each query step in a flow chart. The current release has 72 operators, including a broad range of set manipulation, filter, math, sorting, aggregation, statistical, string, data flow and cleaning tasks. Users can build, save and share their own functions, which can incorporate operators or other functions as input. Results can be summarized in a graph or tabular report, as well as exported as data files. A query can incorporate multiple data sources and outputs.
Building a Digital Archaeology query does not take the technical expertise of a SQL program, but it does require detailed understanding of the source data and the step-by-step data manipulations–such as different types of file joins–needed to reach an answer. Working with the tool at this level is most likely a task for a sophisticated business analyst, rather than a casual user. The system does let advanced users create applications with fill-in-the-blank parameters and predefined reports for others to execute. An integrated scheduler lets data loads, queries, reports and other functions repeat automatically. A packaged application, called c-Discovery, provides a bundle of functions for customer segmentation, Web site analysis, purchase behavior, trends, loyalty, list selection and campaign reporting.
Digital Archaeology runs on Windows NT and Sun Solaris servers, linked to clients running Windows 95/98, Windows NT or any standard Internet browser. Pricing is based on the number of users and starts at $85,000 for Discovery Suite (the base system) and $130,000 for Discovey Suite plus c-Discover. Discovery Suite was initially released in December 1998 and currently has about a dozen installations. c-Discover was released in July 1999.
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
Leave a Reply
You must be logged in to post a comment.