Business Intelligence and the New Analytical Databases
David M. Raab
Information Management
September / October 2009
Business intelligence software has evolved to fit an environment dominated by relational databases. This mostly meant overcoming the inherent inefficiencies of executing analytical queries against normalized data models, typically by reorganizing the data into star schemas, cubes, indexes or other specialized structures. Sometimes the restructured data itself resides in a relational database and sometimes it doesn’t.
Because so many enterprises have invested heavily in these business intelligence applications, new database products must be compatible with them. But products built specifically for analytical use (ParAccel, Vertica, Netezza, Sybase IQ, etc.) often don’t face the constraints that the BI applications were designed to overcome. Attaching one of those engines to standard BI software is like towing a farm wagon with a dump truck: you get better performance, but you’re not taking full advantage of the technology.
So let’s do a little thought experiment and imagine a BI system designed with analytical databases in mind. Probably the most important feature of these databases is that they are relatively insensitive to data structure. (Caveat: each product is different; some are more sensitive to structure than others.) In many cases this means the system can efficiently execute queries against data in its original table structure, without the denormalization, indexing and aggregation used by conventional BI tools. This offers tremendous savings in time and effort, since much of the work in conventional BI systems is designing and populating these alternative structures. What’s worse, at least from an end-user viewpoint, this restructuring must be done by technical specialists. This means that any project must go through a time-consuming cycle of specification, development, review and modification as users struggle to understand their needs and the technicians struggle to understand the users.
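To make that contrast concrete, here is a minimal sketch in Python, using SQLite purely as a stand-in for an analytical engine; the customers and orders tables are invented for illustration. The conventional path bakes one anticipated question into a pre-built summary table, while the analytical path asks new questions, including ones nobody anticipated, directly against the normalized tables.

```python
# Illustrative sketch only: SQLite stands in for an analytical engine, and the
# customers/orders tables are hypothetical sample data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, '2009-01-15', 100.0),
                              (11, 1, '2009-02-03', 250.0),
                              (12, 2, '2009-02-10', 75.0);
""")

# Conventional BI path: a technician anticipates the question and pre-builds a
# summary structure; users can only query what was anticipated in advance.
con.execute("""
    CREATE TABLE sales_by_region AS
    SELECT c.region, SUM(o.amount) AS total_sales
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

# Analytical-database path: an unanticipated question (sales by month and
# region) is asked directly against the normalized tables, with no
# restructuring step in between.
for row in con.execute("""
    SELECT strftime('%Y-%m', o.order_date) AS month, c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY month, c.region
"""):
    print(row)
```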
But just because the analytical systems can query complex data structures, it doesn’t follow that end-users can understand those structures. Only a sadist would copy hundreds of database tables from an enterprise system and ask an end-user to query them directly. To take advantage of the analytical databases, a BI system needs a metadata layer to present the underlying tables in a way that makes sense to non-specialists. Indeed, one of the best reasons to retain existing BI software is that at least some of the products already contain extensive facilities to decouple the underlying physical data structure from the structures that are presented to users. To the extent that existing BI systems are not dependent on particular physical structures, they can provide excellent models for features that make end-users more productive.
One area where existing BI systems won’t provide much help is allowing end-users to connect to new data sources. In the current systems, this is a task for technicians, so the interfaces are designed with technical users in mind. But a major advantage of the new databases is how easily they can integrate new data with existing tables. A BI system designed to take advantage of this would allow end-users to do it for themselves. This implies easy features to physically load the data, explore it, define relationships between tables, and set rules for how key values are matched.
Data exploration is a particularly important opportunity, because the analytical databases let end users examine the new inputs immediately. This means users can quickly judge the data in each field, uncovering good and bad surprises that might otherwise remain hidden until weeks of technical effort had been invested in deploying them. A BI interface has to support this by providing appropriate data profiling and analysis tools. It must also support the next step, relationship definition, by making it simple to define trial matches between different tables and see how closely they correspond.
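As a rough illustration of what such an interface would be doing behind the scenes, the sketch below profiles a newly loaded feed and runs a trial match against an existing customer table. The field names, sample records and matching rule are all hypothetical.

```python
# Hypothetical sample feed; in practice this would be loaded from a file or
# directly into the analytical database.
from collections import Counter

new_feed = [
    {"cust_no": "00123", "state": "NY", "amount": "150.00"},
    {"cust_no": "00456", "state": "",   "amount": "75.50"},
    {"cust_no": "00123", "state": "ny", "amount": "n/a"},
]

# Quick profile: distinct values, blanks, and the most common entries per
# field, so surprises (like "n/a" in an amount column) show up immediately.
for field in new_feed[0]:
    values = [row[field] for row in new_feed]
    counts = Counter(values)
    blanks = sum(1 for v in values if not v.strip())
    print(f"{field}: {len(counts)} distinct, {blanks} blank, top: {counts.most_common(2)}")

# Trial match: how many keys in the new feed correspond to existing customers?
existing_customers = {"123", "789"}
feed_keys = {row["cust_no"].lstrip("0") for row in new_feed}
matched = feed_keys & existing_customers
print(f"match rate: {len(matched)} of {len(feed_keys)} feed keys found in customer table")
```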
Value matching is particularly important as systems incorporate increasingly heterogeneous data sources. Exact matches between two columns are easy, but something as simple as a customer number stored as a numeric field in one system and a text field in another can cause trouble with conventional databases. Alternative date formats are another trivial difference that can still be difficult to overcome. Fuzzy matches, such as variant forms of names and addresses, are much more challenging but very important as users incorporate a wider range of inputs. In the conventional BI world, removing these discrepancies was another task handled by the technical experts as they set up the database. If the technical experts themselves are removed from the process, this matching must be handled automatically by the database or the BI system, or end-users must have tools to manage it for themselves.
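The sketch below illustrates the three kinds of discrepancies just described: keys stored as text in one system and as numbers in another, alternative date formats, and fuzzy name matching. The normalization rules and threshold are assumptions chosen for illustration; production matching engines are far more sophisticated.

```python
# Illustrative matching helpers; the rules and thresholds are assumptions.
from datetime import datetime
from difflib import SequenceMatcher

def normalize_key(value):
    """Reconcile a customer number stored as text ('00123') with one stored
    as a number (123) by stripping whitespace and leading zeros."""
    return str(value).strip().lstrip("0")

def parse_date(value):
    """Try a few common date formats instead of assuming a single one."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None

def fuzzy_match(name_a, name_b, threshold=0.85):
    """Crude similarity test for variant names; real name-and-address
    matching uses parsing and standardization, not just string similarity."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio() >= threshold

print(normalize_key("00123") == normalize_key(123))          # True
print(parse_date("2009-09-01") == parse_date("09/01/2009"))  # True
print(fuzzy_match("Jon Smith", "John Smith"))                # True
print(fuzzy_match("Jon Smith", "Jane Doe"))                  # False
```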
Let’s add one more requirement. Although some analytical databases allow only a single logical data structure for each physical set of tables, most resemble SQL in letting users define table relationships within each query. Databases that allow this flexibility need interfaces that make it available to end-users, either when they are formulating queries or when they are setting up the presentation layer.
All of these features support the fundamental goal of letting end-users ask unanticipated questions without needing a technician to build a new cube, reorganize the data tables, or expose details that were lost in a previous aggregation. Today’s analytical databases make this possible, but new business intelligence systems are needed for them to deliver.
Next month’s column will look at some existing BI systems that offer these capabilities today.
******************************
Last month’s column described how business intelligence software could be adapted to take full advantage of today’s analytical databases. The premise was that current BI systems are designed to overcome weaknesses that applied to conventional relational databases but not the newer analytical engines.
It wasn’t very hard to come up with the list of features for these next-generation BI systems because they already exist. The pioneers are vendors like QlikTech, illuminate Solutions, ADVIZOR Solutions, Tableau Software, TIBCO Spotfire, Lyzasoft, smartFocus and Alterian, who did not originally constrain themselves to be compatible with SQL. Instead, they developed interfaces that let them take advantage of their proprietary database engines in ways that SQL wouldn’t allow. Some of these vendors provide APIs that let external systems access their data, and some of those APIs do support SQL queries. But a SQL front end cannot take full advantage of their database features.
The main advantage these systems have over conventional BI software is the power they provide to business users. Adding a new source or building a new report doesn’t require a technician to modify the data structure or generate a new cube. Most of these products also provide at least some data preparation capability through a scripting language or process flow. This won’t replace the integration needed to build an enterprise data warehouse but it lets users outside of IT create applications that go well beyond simple reporting. Some technical skill is needed, but it’s on par with an Excel power user, not a database architect.
Of the systems listed above, only QlikView – with more than 11,000 installations and over $120 million revenue in 2008 – can really be considered a major force in the industry. The other products have much smaller bases, even though several have been around for more than a decade. This raises an obvious question: if the systems are really so great, then why haven’t they been more widely adopted?
One constraint is scalability. Many of these products originally ran on 32 bit architectures that limited the in-memory systems to two gigabytes of data and the disk-based systems to 2 billion rows. Load speeds were also originally measured in tens of gigabytes per hour. Those specs are more than adequate for many purposes, and probably larger than the data cubes underlying many conventional BI systems. Moreover, many of these constraints are now removed by 64 bit architectures and parallel processing. But handling the hundreds of terabytes in even a small enterprise data warehouse is still an issue. So, at best, these systems must coexist with a conventional database rather than simply replacing it.
The more important reason is probably IT department resistance. IT departments are generally conservative when it comes to unfamiliar vendors. But, beyond that, these vendors’ decision not to match SQL standards makes their databases seem dangerously isolated from the rest of the corporate infrastructure. Any data loaded into the systems can only be accessed via the vendors’ own tools or APIs, which raises a red flag for IT departments wary of reliance on a single product. In addition, the unique technology of these tools means that special training is needed to use them – an investment that IT departments seek to avoid. The small installed base of these products also means that IT can’t plan to hire experienced staff that someone else has trained already.
Of course, IT departments do sometimes embrace new technologies, particularly when they appear to solve a pressing problem for the IT department. But remember that the main advantage of these tools is that they empower business users to do more for themselves. In theory this would be an advantage to IT departments, because it would reduce their workload or let them shift resources to other tasks. But in practice, if an IT department has already invested in conventional databases and BI applications, these systems offer few new capabilities. Asked to examine the new systems, IT people often honestly shrug, “I can already do that.” Giving similar capabilities to other people doesn’t strike them as particularly important. And no one likes to make their existing skills obsolete.
As a result, adoption of these new tools has largely been restricted to situations where IT isn’t involved or IT itself lacks an incumbent alternative. For the lower cost systems, this often means they are purchased by business departments for their own use, outside the IT budget and with minimal IT support. Another sweet spot is small to mid-size companies with few IT resources and little existing business intelligence infrastructure. On the relatively rare occasions when a large IT shop purchases one of these tools directly, it is usually because its technology solves a particular problem that has proven impervious to conventional methods.
The SQL-compatible analytical databases avoid most of these obstacles because they mesh more easily with existing infrastructure. But, precisely because they continue to rely on conventional, IT-centric BI applications, these systems offer fewer benefits to business users than the more radical alternatives. The vendors selling these systems and the business people eager to use them share the challenge of convincing IT departments and senior management to approve them. Next month’s column will discuss approaches to help make this happen.
******************************
The story so far: analytical databases and new business intelligence interfaces deliver radical improvements in BI price, performance and ease of use. But IT departments often fail to appreciate the benefits of systems that reduce the need for skilled technical support. Far-sighted business and technology managers need ways to present the advantages of these systems more effectively.
Before going further, it’s important to be clear that IT departments are not the villains in this tale. They have sound reasons to be cautious in accepting any new technology, since they will be held responsible if it fails. Systems that promise to give more control to end-users are legitimately worrisome because end-users will make mistakes that IT professionals would avoid and may undertake projects they do not realize are beyond their true competence. IT managers worry, often correctly, that they will still be blamed for any problems that result.
Of course, IT departments make mistakes too, including some that end-users would not. This is why the groups must work together. The goal is finding the right balance of tasks for the two groups. New technologies call for an adjustment of this balance. That is precisely the challenge posed by the new analytical systems.
Enough philosophizing. Companies don’t assess technologies in terms of the proper balance between end-users and IT. Most have a much simpler approach that boils down to assessing benefits vs. costs.
In the case of business intelligence systems, the benefits are often somewhat vague – things like “better decisions” – and particularly difficult to quantify in advance. When the question is one BI technology vs. another, companies can easily assume that the outputs, and therefore the benefits, will be the same regardless of which technology is chosen. Even though they may recognize this is not completely true, it’s a helpful simplifying assumption because it lets them focus solely on comparing the costs of the competing solutions.
One approach to promoting the new analytical systems, therefore, is showing that they have a substantially lower total cost of ownership (TCO).
As with any TCO calculation, the trick is in knowing which elements to include. The main benefit of TCO is that it helps buyers look beyond the out-of-pocket investment – typically the license fee in the case of software – to include the cost of services and on-going support. But this definition itself is still too narrow, because it doesn’t capture the time end-users spend working with the system. One of the main benefits of the new analytical technologies is giving end-users a greater ability to do things for themselves. Although this might seem to increase the amount of time they spend, in practice it eliminates the effort devoted to explaining their needs to technical specialists, reviewing what those specialists produce, and then often revising the requirements based on the results. The result may well be a net decrease in the time that business people spend to complete a given project.
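A back-of-the-envelope calculation shows how the arithmetic changes once end-user hours are included. Every figure in the sketch below is a placeholder, not a benchmark; the point is only that business-user time belongs in the model alongside license fees, services and IT support.

```python
# Extended TCO sketch: all numbers are made-up placeholders.
def total_cost(license_fee, services, it_hours, user_hours,
               it_rate=120.0, user_rate=90.0):
    """Total cost including both IT and business-user labor."""
    return license_fee + services + it_hours * it_rate + user_hours * user_rate

# Conventional stack: heavy IT involvement for specification, cube builds,
# and the repeated rounds of rework described above.
conventional = total_cost(license_fee=100_000, services=50_000,
                          it_hours=2_000, user_hours=800)

# Analytical stack: users do more of the work themselves, yet spend fewer
# total hours because the specify-review-revise cycle largely disappears.
analytical = total_cost(license_fee=150_000, services=20_000,
                        it_hours=400, user_hours=600)

print(f"conventional: ${conventional:,.0f}")
print(f"analytical:   ${analytical:,.0f}")
```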
There is also an intangible, but quite real, benefit from letting business people work directly with the data. Doing this provides an opportunity to gain insights that they miss if they only review the results of someone else’s labor. Even though the value may be hard to calculate, it’s intuitively clear that an hour spent analyzing data is a more valuable use of a business analyst’s time than an hour spent explaining his specifications to IT.
In fact, it’s possible to argue that a business analyst generates value only while doing analysis and generates cost while doing anything else. By this logic, a system that lets analysts spend more time on analysis and less on meetings simultaneously increases value and reduces cost. That’s a rather extreme position to take with the new analytical systems, however, because much of the added “analysis” time is really spent doing system development. Still, most analysts would probably feel that working through development issues is more productive, in terms of learning new information, than writing up specs and explaining them.
Finally, although it should be obvious that shifting work from IT to business users will reduce IT costs, a conventional TCO analysis may not show this. It’s quite easy, and in many ways only fair, to prepare an “apples to apples” comparison that assumes IT will continue to provide the services it does today. But an experienced BI specialist can often perform a given task almost as quickly with the existing systems as with the new ones. Comparing those two times ignores the fact that with the new system the task would be performed by a business user instead.
In short, to properly capture the benefits of the new analytical systems, traditional TCO analysis must be extended to include the costs of end-user time and to capture the shift in labor to end-users from IT.
Conventional TCO analyses also ignore the cost of waiting for an answer. In cases like inventory and price optimization, the company can calculate the precise penalty for every second of delay. Most BI applications are less time-sensitive, but there is still a real difference between getting an answer today and getting that same answer one month from now. Next month’s column will look at ways to factor this time element into the system evaluation.
******************************
Last month’s column described how total cost of ownership (TCO) analyses must be expanded to properly capture the savings of new analytical technologies. It’s tempting to measure cost alone because quantifying the value from competing BI solutions is so difficult. But value is too important to ignore completely. Even if you can’t measure it directly, it’s worth finding an indirect approach.
One method is to consider time: even if we don’t know what an answer is worth, we are pretty certain its value is higher if we get it sooner. Phrases like “time to answer” or “time to insight” are useful here because they make clear that we are measuring the end-to-end process rather than individual components such as response time.
This is important for the new technologies because many of their advantages come from eliminating non-computer tasks such as design time or meetings. These are not captured in traditional benchmarks or processing specifications. Traditional analyses also often ignore the time spent waiting for technical staff to deliver a solution or the delays from several rounds of rework as users react to initial results. A good “time to result” analysis might provide two broad measures:
– time to first query. This measures elapsed time from the decision to use a technology to when the system can process its first query. It includes contract negotiation, hardware acquisition, software installation, database design, data loading and indexing, front-end tool integration, and training for both technical and business staff. Although it may seem odd to include something as non-technical as contract talks, bear in mind that some cloud-based solutions can literally reduce these from months to minutes. That is a difference worth capturing.
– time to answer. This measures the time to complete an actual BI project, including problem specification, data preparation, result analysis, and subsequent rounds of revisions.
Both of these analyses would include conventional measures of labor hours and machine processing time. But they also incorporate the waiting times – for hardware, training, meetings, approvals, and fitting tasks into a busy schedule – that are often ignored but determine when a project is really completed.
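A simple tally along these lines might look like the sketch below. The task names and durations, in business days, are invented; what matters is that waiting time and rework cycles count toward the clock just as much as labor or machine time.

```python
# Hypothetical "time to result" tally; all durations are placeholders in
# business days.
time_to_first_query = {
    "contract negotiation": 30,
    "hardware acquisition": 20,
    "software installation": 3,
    "database design and loading": 15,
    "front-end integration and training": 10,
}

time_to_answer = {
    "problem specification": 5,
    "waiting in the IT queue": 15,
    "data preparation": 5,
    "result analysis": 3,
    "revision cycles": 10,
}

print("time to first query:", sum(time_to_first_query.values()), "days")
print("time to answer:", sum(time_to_answer.values()), "days")
```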
Like any benchmark, each “time to result” analysis must be tailored to a company’s specific situation. Analysts will need to define scenarios with the particular data sources, staff skills, and questions involved. One practical way to develop these scenarios is to build a narrative based on previous projects, starting with the original specifications and tracking how these evolved. This narrative can then be converted into specific tasks, such as adding new sources, refreshing existing data, redesigning queries, and creating new reports. Wait times in particular are difficult to assess, because they depend on the priority assigned to any one project. But records of past service levels should provide an estimate of what can be expected on average.
The “time to result” analysis must also distinguish set-up tasks from repetitions. Hardware for a new system is only ordered once, while data will probably be refreshed at regular intervals. The theoretical ideal would be to simulate a year’s worth of projects, taking into account the frequency of different tasks as well as the time for each task. In practice, few companies are likely to be so thorough. But they should still keep these distinctions at least informally in mind.
Set-up tasks also highlight the difficulty of comparing incumbent technologies with new solutions. The first project with a new technology will incur training costs and delays, while doing the project with an existing technology will not. This is a real difference, so it’s wrong to simply ignore the benefits of using existing resources. But it’s also only a one-time cost, which can legitimately be amortized over several years of projects. Again, this requires envisioning a project portfolio of some sort so you can estimate how many projects to include.
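The amortization itself is simple arithmetic. The sketch below assumes a hypothetical portfolio of ten projects a year over three years, with placeholder costs; the one-time set-up cost of the new tool is spread across every project that benefits from it.

```python
# Amortization sketch with placeholder figures.
def cost_per_project(setup_cost, per_project_cost, projects_per_year, years=3):
    """Spread a one-time set-up cost across a multi-year project portfolio."""
    total_projects = projects_per_year * years
    return setup_cost / total_projects + per_project_cost

incumbent = cost_per_project(setup_cost=0, per_project_cost=12_000,
                             projects_per_year=10)
new_tool = cost_per_project(setup_cost=60_000, per_project_cost=4_000,
                            projects_per_year=10)

print(f"incumbent, per project: ${incumbent:,.0f}")   # $12,000
print(f"new tool, per project:  ${new_tool:,.0f}")    # $6,000
```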
Note that set-up costs can sometimes work in favor of a new technology. If the company has steadily expanding or rapidly fluctuating needs, then a technology that makes it easy to add new capacity or train new users will eventually be more economical than an existing technology that is harder to expand. Cloud-based systems, with their low set-up costs and brief waiting periods, are the extreme example of how set-up costs can provide an advantage over the incumbent.
Of the two measures proposed above, “time to first query” applies more to database engines and “time to answer” applies more to BI applications. Even though the new analytical database engines are broadly similar in using columnar data structures on “shared nothing” parallel hardware, there are still substantial differences in how they deal with database design, compression, indexing, incremental updates, new data sources and other tasks. These can impact both the processing and waiting components of “time to first query”. The impact of these differences also varies greatly depending on the volumes and complexity of the data being processed.
“Time to answer” is generally more sensitive to the amount of work that end-users can perform for themselves. This depends primarily on the interface built into the BI application. Today, the most user-empowering interfaces are attached to the non-SQL analytical databases like QlikView, iLuminate, and ADVIZOR. But as BI systems take advantage of the flexibility provided by the SQL-based analytical engines, time to answer will be an increasingly important measure for those databases as well.
Summary
This series started with a simple question: what would business intelligence applications look like if they weren’t designed around the constraints of conventional relational databases? We’ve seen that the systems would be much easier and cheaper to operate and that they would make end-users more productive because they could do much more for themselves. We also described some existing systems that match this description and looked at some of the obstacles faced by advocates trying to get them adopted. Finally, we suggested how two evaluation methods – expanded Total Cost of Ownership and new “time to result” measures – can highlight the new systems’ advantages.
That’s a lot of ground to cover, but consider the topic: reinvention of business intelligence as we know it. Companies must make many changes to adapt to the new technologies, but the results – a huge improvement in the value delivered by business intelligence systems and a huge reduction in cost – are well worth the effort.
* * *
David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.