2004 Jul 01
Provalis Research WordStat
David M. Raab
DM News
July, 2004

Unstructured data management is a very broad term. Most applications involve written texts, but the field also includes sound, images, maps, video and instrumentation streams. Major text management functions include classification (assigning texts to categories), search (finding documents related to specified topics), extraction (identifying facts within documents), and profiling (identifying people’s interests). No single product performs all the possible functions, although a few of the major vendors try.

Understanding this context is important when assessing unstructured data management software. Most products perform just one major function, and often are limited to an even narrower sub-specialty. This means any assessment must consider both how well a product performs its intended function and how easily it fits into a complete solution.

WordStat (Provalis Research, 514-899-1672, www.simstat.com) is an impressive bit of software, but definitely has its limits. Mostly, WordStat does something that seems quite simple: identify words within a document. This is a fundamental prerequisite for more advanced activities such as classification and search. And it turns out to be not so simple after all.

The challenge is twofold. First, one word can take many forms. In English, verbs change based on the subject and tense; nouns have singular and plural forms; adjectives and adverbs apply suffixes to common roots. Other languages can be even more complicated. The practical issue is that text analysis works poorly unless these variations are removed. For example, simply searching for the word “Canada” would not return documents with the word “Canadian”, even though these would probably be relevant to a request. Linguists have developed standard techniques to deal with these issues, and given them cool names like stemming and lemmatization (conversion to a root form, or lemma). WordStat performs these transformations automatically.
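To make the idea concrete, here is a minimal sketch of word normalization in Python. It is not WordStat's algorithm: the exception table, the suffix rules and the function names are all assumptions chosen for illustration, and real stemmers and lemmatizers are considerably more careful.

```python
import re

# Hypothetical exception table: forms that suffix rules cannot handle,
# mapped by hand to a chosen root form (lemma).
LEMMAS = {"canadian": "canada", "canadians": "canada", "ran": "run", "mice": "mouse"}

# Toy suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "y"),
    ("ss", "ss"),   # keep words like "class" intact
    ("ing", ""),
    ("ed", ""),
    ("ly", ""),
    ("s", ""),
]

MIN_STEM = 3  # do not strip a suffix if fewer than 3 letters would remain


def normalize(word):
    """Reduce a word to a rough root form."""
    w = word.lower()
    if w in LEMMAS:
        return LEMMAS[w]
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM:
            return w[:-len(suffix)] + replacement
    return w


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def matches(query, text):
    """True if any word in the text shares a root form with the query."""
    target = normalize(query)
    return any(normalize(token) == target for token in tokenize(text))
```

With this sketch, matches("Canada", "A Canadian firm sells software.") is true, even though the literal word "Canada" never appears in the text.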

The second challenge is that different words have related meanings. “Angry”, “mad” and “furious” all share a root meaning of “annoyance”; “dog”, “cat” and “hamster” are all types of mammals, as well as common pets. Meaningful text analysis needs to be aware of these relationships. This can only be done through dictionaries that link words to concepts. WordStat includes several public domain dictionaries and thesauruses plus tools to customize these with a user’s own vocabulary.
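A dictionary of this kind can be pictured as a table that links each word to one or more concepts. The Python sketch below uses the examples from the paragraph above; the data structure and function names are assumptions for illustration and say nothing about how WordStat stores its dictionaries.

```python
import re
from collections import Counter, defaultdict

# Illustrative concept dictionary: concept -> words that express it.
CONCEPTS = {
    "annoyance": {"angry", "mad", "furious"},
    "mammal": {"dog", "cat", "hamster"},
    "pet": {"dog", "cat", "hamster"},
}

# Invert the dictionary so each word lists every concept it belongs to.
WORD_TO_CONCEPTS = defaultdict(set)
for concept, words in CONCEPTS.items():
    for word in words:
        WORD_TO_CONCEPTS[word].add(concept)


def concept_counts(text):
    """Count how often each concept occurs in the text."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        # Words missing from the dictionary contribute nothing.
        for concept in WORD_TO_CONCEPTS.get(token, ()):
            counts[concept] += 1
    return counts
```

For the sentence "The angry dog chased a furious cat.", this counts two occurrences each of annoyance, mammal and pet, although none of those three words appears in the text.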

In fact, even though WordStat includes some impressively advanced analytical functions, its dictionary building features are arguably the most important. Good dictionaries are the foundation of most text analysis, and building and maintaining dictionaries can be the largest part of a text analysis project. WordStat makes this about as efficient as possible, with a graphical user interface that lets users assign words to categories, build hierarchical category structures, distinguish among different senses for the same word, use fragments and wildcards to remove variations, define phrases to treat as a single word, specify words to include or exclude from an analysis, and identify frequently or infrequently used words as candidates for special attention. A particularly helpful feature called “keyword in context” can display the text surrounding each occurrence of a specified word, so users can see how the word is being used.
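The keyword-in-context idea is easy to sketch. The Python below is an assumption-laden illustration, not WordStat's implementation: it uses the standard library's shell-style wildcard matching, and the function name and the width parameter are invented for the example.

```python
import re
from fnmatch import fnmatchcase


def kwic(text, pattern, width=3):
    """Return (left context, keyword, right context) for each match.

    pattern may contain shell-style wildcards such as "canad*";
    width is the number of words shown on each side.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, token in enumerate(tokens):
        if fnmatchcase(token.lower(), pattern.lower()):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows
```

Running kwic on "The bank raised rates. We sat on the river bank today." with the pattern "bank" and a width of 2 returns one row with "raised rates" to the right and another with "the river" to the left, which is enough to tell the two senses of the word apart.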

Dictionaries are built and applied to either single documents or sets of cases. These can be imported from spreadsheets, text files, or several word processing formats. A case can have multiple data elements including text, numeric and categorical variables. Users can view, edit and code text from within the system.

WordStat’s analytical capabilities build on its core function of word identification. The simplest analysis is a frequency report, which shows how often each word or concept occurs. Inclusion and exclusion dictionaries can limit the analysis to only words of interest. At the next level of complexity are matrix reports, which can count the number of cases containing each word, the number of cases containing different pairs of words, or the frequency of each word in cases with other characteristics. This last type of report, unusual for a text analysis system, can show how word frequencies vary among different authors or for the same author over time. Results can be displayed as counts or statistical measures, and can be viewed in tables or several types of graphs.
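The reports described above can be sketched in a few lines of Python. The sample cases, the inclusion list and the function names are all invented for illustration; the sketch shows the kind of counting involved, not how WordStat computes or presents its reports.

```python
import re
from collections import Counter
from itertools import combinations

# Invented sample cases: each has a text field and a categorical variable.
CASES = [
    {"author": "Smith", "text": "Price is high but service is good"},
    {"author": "Smith", "text": "Good price good service"},
    {"author": "Jones", "text": "Service was slow"},
]

# Inclusion list: only these words are counted.
KEYWORDS = {"price", "service", "good", "slow"}


def words(text):
    """Words of interest in a text, in order, with repeats."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w in KEYWORDS]


def frequency(cases):
    """Frequency report: total occurrences of each word."""
    return Counter(w for c in cases for w in words(c["text"]))


def case_counts(cases):
    """Number of cases containing each word at least once."""
    return Counter(w for c in cases for w in set(words(c["text"])))


def pair_counts(cases):
    """Number of cases containing each pair of words."""
    pairs = Counter()
    for c in cases:
        pairs.update(combinations(sorted(set(words(c["text"]))), 2))
    return pairs


def by_variable(cases, variable):
    """Word frequencies broken out by another case variable."""
    table = {}
    for c in cases:
        table.setdefault(c[variable], Counter()).update(words(c["text"]))
    return table
```

Calling by_variable(CASES, "author") yields separate word counts for Smith and Jones, which is the sort of breakdown the last report type provides.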

WordStat provides even more sophisticated analyses, including clustering to identify similar words or cases; proximity maps to show distance between one word or concept and others; concept maps to show relative positions of many words; and heat maps to illustrate relationships between words and independent variables. Clustering could help to extend the dictionary by identifying likely categories for new words, but is not reliable enough to be a fully automated solution. Clustering could also provide a limited form of document categorization. More precise categorization, based on training with previously categorized cases, is planned for the next release, due by the end of 2004.

So what’s the catch? Certainly not price: WordStat is an astonishing bargain at $595. This includes a copy of SimStat, a robust, general-purpose statistical package that WordStat uses as a base. Nor is it scalability: the software runs on Windows workstations and has been tested with up to several hundred thousand cases. And the system is relatively mature, having sold about 300 copies since its introduction in 1998. SimStat itself is even better established, with sales of nearly 4,000 copies since 1989.

The problem is integration. WordStat dictionaries are stored in a specially formatted text file that is viewable, but not designed for external access. Nor can the system automatically load a document or set of cases, identify the words and concepts, and output the results. At best the user could import the data, run a word frequency report, and save the output into a database. This is adequate for periodic research projects, but not, say, automatic routing of email inquiries.

Happily, Provalis is addressing both these issues. The next release will be able to export dictionaries in an XML format, making them easily readable by other systems. It will also allow automated processing of individual documents or case files. Once these capabilities are added, WordStat may transform itself from an impressive but isolated text analysis tool into a valuable part of a true production system.

* * *

David M. Raab is a Principal at Raab Associates Inc., a consultancy specializing in marketing technology and analytics. He can be reached at draab@raabassociates.com.
