Autonomy Corporation: Autonomy
David M. Raab
DM News
May, 2003

Data warehouse experts used to joke about “write-only” databases–systems that were useless because it was impossible to access their contents. (OK, it isn’t much of a joke–there’s a reason these people are technologists, not comedians.) Happily, the data warehouse industry has now evolved tools and techniques to overcome most of its data access problems.

But data warehouses work with highly structured data, stored in the records and fields of conventional files or the rows and columns of relational databases. The world also contains huge amounts of unstructured text in word processing files, email messages, Web sites, spreadsheets and presentations. Accessing this type of data poses a different set of challenges. (Non-text data, such as sound and image files, is yet another issue.)

The central challenge in managing text data is applying structure. This can be applied to the documents themselves, by assigning them to categories: for example, news articles mentioning Bill Gates. Structure can also be applied to information within each document, by extracting specific facts: for example, Bill Gates is married to Melinda Gates. Document classification generally uses statistical techniques to identify the characteristics of documents in each category; new documents are then classified by measuring how closely they match these characteristics. Extraction systems typically apply semantic analysis, which uses sentence structures and word definitions, to identify specific information.
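
As a rough illustration of the difference, here is how the two kinds of output might look for a single article. The data structures and field names below are hypothetical, not taken from any particular product:

```python
# Hypothetical structures for a single news article, contrasting the
# two ways of applying structure; field names are illustrative only.

# Document-level classification: the whole article is assigned to categories.
classified_document = {
    "doc_id": "article-001",
    "categories": ["Technology", "People / Bill Gates"],
}

# Fact extraction: specific statements inside the article become data.
extracted_facts = [
    ("Bill Gates", "married_to", "Melinda Gates"),  # (subject, relation, object)
]

print(classified_document["categories"])
print(extracted_facts[0])
```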

Each method has its advantages. Statistical techniques are fast, language-independent, and can automatically identify new categories or concepts when documents do not fit an existing pattern. Semantic methods require preliminary effort to identify the rules and vocabulary of each language, but give more precise results. Vendors tend to focus on one method or the other, and of course feel strongly that their choice is superior. But in practice, both approaches can be made to perform most of the same functions. So from a user’s perspective, it’s more important to evaluate individual systems against specific requirements than to just look at general techniques.

Certainly both types of systems can perform document classification, which is the core capability of any text analysis system. Classification enables many specific applications: searching for and retrieving documents on a particular topic; picking the most suitable reply to an inquiry; generating or extracting summary information about a document; identifying individuals with an interest or expertise in specific topics; alerting users when new information appears on a topic of interest.

Both types of systems can also generate and manage taxonomies, which are structures that define relationships among the categories themselves. These make it easier for users to navigate a body of data by providing a map of its contents. Often more than one taxonomy applies: for example, a collection of business news articles might be classified independently by geography, industry, company, date and topic. Most text analysis systems can generate taxonomies automatically, although the results would typically be reviewed and refined by human experts. In practice, automated taxonomy generation is probably less common than starting with a prebuilt taxonomy that reflects established ways of viewing a particular topic. This makes access more intuitive for users who are already familiar with the standard structure.
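
To make the idea concrete, here is a small sketch of several independent taxonomies applied to one article, with a helper that traces a category back to its root for navigation. The categories and code are illustrative, not drawn from any vendor's system:

```python
# Illustrative taxonomies: each is a simple parent -> children tree.
taxonomies = {
    "geography": {"World": ["North America", "Europe"],
                  "North America": ["United States"],
                  "Europe": ["United Kingdom"]},
    "industry":  {"All Industries": ["Software", "Finance"]},
    "topic":     {"All Topics": ["Mergers", "Earnings", "Products"]},
}

# One business-news article classified independently under each taxonomy.
article_tags = {
    "geography": "United States",
    "industry": "Software",
    "topic": "Products",
}

def path_to_root(taxonomy, node):
    """Walk upward from a category to the taxonomy root, for navigation."""
    parents = {child: parent
               for parent, children in taxonomy.items()
               for child in children}
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return list(reversed(path))

for name, tag in article_tags.items():
    print(name, "->", " / ".join(path_to_root(taxonomies[name], tag)))
```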

Once a taxonomy is established, the classification mechanism is typically trained by providing examples of documents known to belong to each category. The system then identifies characteristics these documents have in common–that is, it develops models that predict whether a given document will belong to a particular category. These models are then applied to new documents as they are submitted for categorization.
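
A toy example of this training-and-scoring cycle, written as a simple naive Bayes word-count model in Python, shows the general statistical approach; it is not Autonomy's actual algorithm, and the training documents are invented:

```python
import math
from collections import Counter

# Minimal naive-Bayes sketch of classifier training: learn word statistics
# from example documents, then score new documents against each category.

training_examples = {
    "technology": ["software vendors release new search products",
                   "the database system indexes documents quickly"],
    "finance":    ["the company reported quarterly earnings growth",
                   "investors bought shares after the revenue report"],
}

def train(examples):
    """Count word frequencies per category; these counts are the 'model'."""
    word_counts = {cat: Counter() for cat in examples}
    for cat, docs in examples.items():
        for doc in docs:
            word_counts[cat].update(doc.lower().split())
    return word_counts

def classify(model, document):
    """Score the document against each category and return the best match."""
    words = document.lower().split()
    vocab = {w for counts in model.values() for w in counts}
    scores = {}
    for cat, counts in model.items():
        total = sum(counts.values())
        # Log-probability with add-one smoothing for words unseen in training.
        scores[cat] = sum(math.log((counts[w] + 1) / (total + len(vocab)))
                          for w in words)
    return max(scores, key=scores.get)

model = train(training_examples)
print(classify(model, "new software search system released"))    # technology
print(classify(model, "earnings and revenue rose this quarter"))  # finance
```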

Text analysis displays the typical characteristics of an emerging industry. There are dozens of small firms, none with a dominant market position and each arguing for the technical superiority of its approach. Product configurations vary widely, from suites with a broad range of capabilities to point solutions performing a single function. There are few standards shared by different systems, although XML is commonly used for category tags. Implementations are still mostly limited to specialized tasks; for text analysis, these include Web search, personalized news reporting and, lately, anti-terrorism surveillance. Other than early adopters, few potential users understand the basic nature or value of the products.

Autonomy (Autonomy Corporation, 415-243-9955, www.autonomy.com) is one of the largest text analysis vendors–although with annual revenues around $50 million, it is far from huge. It uses Bayesian statistics to identify word patterns that signify concepts within documents. But–following the classic strategy of an early leader in an emerging industry–Autonomy positions itself more broadly, as providing an “infrastructure” that makes unstructured data available throughout the enterprise. It supports this claim with connectors and applets that display Autonomy outputs within third-party systems. Perhaps the most interesting example is ActiveKnowledge, which automatically displays a list of documents related to whatever the user is viewing in a third-party application. Autonomy can also provide external applications with profiles of user interests, personalized news feeds or alerts, lists of people with similar interests or expertise in a particular field, and keyword as well as concept-based searches.

Other enterprise-level features include connections to more than 200 data types, sophisticated integration with external security systems to control document access, and classification speeds measured in thousands of documents per second.

Specialized Autonomy products include automated email response, Web site personalization, and fact extraction. Autonomy can also provide speech recognition technology that converts audio and video feeds into transcripts, which it then analyzes like any other type of text document. Because Autonomy uses statistical rather than semantic techniques, it is language-independent and can automatically identify new concepts as these start appearing in new documents. These capabilities make it particularly suited for surveillance applications like monitoring telephone conversations. The system has indeed been sold to several security agencies, although how they use it is not made public.

Pricing of Autonomy depends on the number of users and system functions. A typical large installation costs about $400,000. The product was first released in 1996. It has since been sold to over 600 clients and embedded in software from more than 50 other vendors.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
