2002 Aug 01
Automated Text Analysis
David M. Raab
DM News
August, 2002

Marketers are rightfully wary of claims from vendors of new technologies. But while being first is no longer enough reason to try something new, some new things are still worth trying. One area worth considering is automated text analysis.

Some forms of text analysis have been around for many years. Much of the initial work was grounded in academic research on artificial intelligence, a field that also encompassed speech processing, computer vision, language translation, neural networks and other components. While all of these have met some success, text analysis has progressed much further than the others in its practical applications.

What pulled text analysis out of the lab was the Internet. Two crucial Internet functions require text analysis: search engines and email response management. At the height of the tech boom, text analysis specialists raced to develop products that drew on their skills. The search engines and automated response systems they created are now so much a part of our everyday lives that we don’t think of them as based on advanced technologies.

Of course, some people might argue that a really advanced technology would produce fewer irrelevant search results and automated replies. Many text analysis gurus would actually agree, because the most sophisticated text analysis approaches are not yet embedded in common Internet products.

The technical details of these methods are best left to specialists. But in general their approach is to move beyond scanning for specific “key words”–still the most common approach for search engines and automated response systems–to classifying documents based on the concepts expressed in their text. Some systems derive classifications from general information such as dictionaries and grammars; some develop custom classifications from input documents; and some rely on classification schemes provided by the user. Many use a combination of these methods.

Once the categories are established, they can be applied to new documents, which can then be accessed by category rather than key words. This avoids key word problems such as missed matches due to synonyms and irrelevant results when words have multiple meanings.
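To make the synonym problem concrete, here is a minimal sketch in Python. It is an illustration only, not how any product mentioned in this column works. The concept lexicon, the sample documents and all function names are hypothetical, and matching terms against a hand-built word list is far cruder than the methods described above. In particular, it does nothing about words with multiple meanings.

```python
import re

# Hypothetical concept lexicon: each category lists terms that signal it.
# A real system would derive this from dictionaries, sample documents or the user.
CONCEPTS = {
    "automobile": {"car", "cars", "auto", "automobile", "sedan", "vehicle"},
    "billing": {"invoice", "bill", "charge", "refund", "payment"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_search(docs, keyword):
    """Return indexes of documents that contain the exact key word."""
    return [i for i, doc in enumerate(docs) if keyword.lower() in tokenize(doc)]

def categorize(text):
    """Return every category whose term list overlaps the document's words."""
    tokens = set(tokenize(text))
    return {name for name, terms in CONCEPTS.items() if tokens & terms}

def category_search(docs, category):
    """Return indexes of documents assigned to the given category."""
    return [i for i, doc in enumerate(docs) if category in categorize(doc)]

# Hypothetical sample documents.
DOCS = [
    "My car broke down last week.",
    "The sedan needs new tires.",
    "Please refund the charge on my invoice.",
]

print(keyword_search(DOCS, "car"))          # finds only the first document
print(category_search(DOCS, "automobile"))  # finds the first two documents
```

The key word search misses the second document because it says "sedan" rather than "car," while the category search finds both.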

In addition to assigning documents to categories, most systems can establish relationships among the categories themselves. This lets them identify concepts and documents that are similar to others, and sometimes even arrange these concepts and documents in hierarchies. Related functions can identify the most important sentences within a document and use these to build document summaries.
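The similarity and summary functions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any vendor's method: it compares raw word counts with a cosine measure and picks summary sentences by word frequency, which is one common textbook approach. The stop word list and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop word list; real systems use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "on", "my"}

def term_counts(text):
    """Count the words in a text, ignoring case and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine_similarity(a, b):
    """Score two texts from 0.0 (no shared words) to about 1.0 (same word mix)."""
    va, vb = term_counts(a), term_counts(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

def summarize(text, n=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = term_counts(text)

    def score(sentence):
        # A sentence scores higher when it uses words frequent in the whole text.
        return sum(freq[w] for w in term_counts(sentence))

    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]
```

Note that this scoring favors long sentences and treats synonyms as unrelated words. Those are exactly the weaknesses that concept-based systems aim to overcome.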

Some systems also identify specific bits of information within a document, such as names, dates and locations. This capability, called feature extraction, can convert unstructured text into a structured database record. This is extremely useful, since the records can then be processed with conventional data management tools.
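As a rough illustration of feature extraction, the Python sketch below turns a sentence into a structured record. Everything in it is an assumption made for the example: the patterns recognize only one date format and names preceded by a title, and the location list is a hypothetical stand-in for the much larger reference lists a real product would use.

```python
import re

# Matches dates written like "March 4, 2002" only.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)

# Matches a title followed by one or two capitalized words, e.g. "Dr. Jane Smith".
NAME_RE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)?")

# Hypothetical list of known locations.
KNOWN_LOCATIONS = ("Chicago", "Boston", "San Diego")

def extract_record(text):
    """Convert unstructured text into a record of names, dates and locations."""
    return {
        "names": NAME_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "locations": [loc for loc in KNOWN_LOCATIONS if loc in text],
    }

record = extract_record(
    "Dr. Jane Smith visited our Boston office on March 4, 2002 and met Mr. Lee."
)
print(record)
```

Once the text is in this form, each field can be loaded into a database column and handled like any other structured data.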

While these methods have not been widely adopted, they have been available for years in specialized products. For example, Autonomy (www.autonomy.com) and Semio (www.semio.com) have long provided search tools using advanced text categorization and similarity measures. Wider use has been limited by practical obstacles such as cost, scalability and difficulty of deployment.

These barriers will fall as the technologies mature. So now is the time to imagine what a marketer’s world will look like when advanced text analysis is readily available.

One change should be an improvement in existing applications. More intelligent search engines should make life easier in general, and allow advances in tools that scan the Web for specific information–say, new competitors or prospects–and summarize the results. (This column reviewed one scanning product, Intarka, two years ago, but the system is apparently no longer available.) More accurate automated responses should also cut customer service costs and improve satisfaction.

But the real change should come from new applications. Perhaps the most intriguing is mining customer comments for trends and opportunities, in the same way that companies today mine their structured data.

One system offered for this purpose today is PolyAnalyst (Megaputer Intelligence, www.megaputer.com, 812-330-0110). PolyAnalyst is actually a set of modules that analyze both text and conventional data. Text functions include categorization and feature extraction. This means the system could identify common themes in customer comments, code individual records with these themes, and then prepare a detailed statistical analysis of the records. Such tight integration of text and data analysis is unusual and obviously convenient. It helps that PolyAnalyst’s conventional data analysis functions are themselves extensive and impressive.

Megaputer offers a separate text analysis product, TextAnalyst, using a different approach from PolyAnalyst. While PolyAnalyst provides detailed analysis of individual records, TextAnalyst is oriented to organizing groups of documents. Megaputer is also working on yet another product, due in several months, which will do text processing such as classifying and routing emails.

Island Data (www.islanddata.com, 760-517-4100) already offers text processing based on concepts in combination with key words. The vendor’s flagship product, Express Response, is used for online customer service such as automated email response and message routing. Using concepts, the system can classify messages in terms such as tone and urgency and can identify situations such as sales opportunities or attrition risks. This can happen in real time, allowing immediate response when an opportunity presents itself. Most other text analysis products work as batch processes. Island Data is working on a new product that provides similar capabilities but is oriented toward marketing applications rather than customer service operations.

Island Data works as an application service provider–that is, messages are routed to its computers and processed there–rather than selling its software for operation by its customers.

Text Analysis International (www.textanalysis.com) takes the opposite approach, offering tools for users who want to build their own text processing systems. Capabilities include categorization, summarization, natural language queries, text analysis, indexing and data extraction. The vendor also provides a programming language tailored for natural language processing, a knowledge base management system, a rule generation engine and a runtime text analyzer. Clients can combine these to build and access databases of information extracted from unstructured text.

These are just a few of the vendors with interesting text analysis products. One place to look for a more complete list is www.kdnuggets.com/software/text.html.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
