(Note: The Term Extraction and Term Lookup transformations are available in Integration Services only in the Enterprise edition of Microsoft SQL Server 2005 and 2008, and in the Developer and Evaluation editions which are variants of Enterprise.)
The Term Extraction and Term Lookup transformations in Integration Services were developed for SQL Server 2005 by Microsoft Research in China. They are unchanged in SQL Server 2008. I'm fascinated by data mining and text mining, but haven't had much time to play with these technologies, or to explore connection points between SSIS and SQL Server Data Mining. Especially not since my data mining technical writer buddy Seth left our team to become a rock star. But I digress.
Early in 2005 (!), I asked the MSR China team to describe the text mining scenarios that they envisioned for their components. The old email message is long gone. Fortunately I printed it out. Unfortunately my scanner doesn't have OCR software. For now I'm just going to share the information that I received (with minor English corrections), while waiting for the leisure to play around with this. I'm glad that my typing skills have lasted longer than my Russian skills.
As for using Term Extraction and Term Lookup transformations in Text Mining, here are the scenarios we envision:
1. Key term extraction from texts
Given a collection of texts, we can use the Term Extraction transformation to extract "key terms" from the texts. We sort the extracted terms by their scores and use the top N terms as key terms. Thus we get a roush sense about the contents of the texts.
For example, if the texts are emails from customers to a company, then by using these transformations, it is possible to find the topics discussed in the emails, and to take business actions based on the customer feedback.
2. Automatic categorization of texts
Given a collection of texts, we can use the transformations to extract terms from the texts (using Term Extraction), and use them as features of the texts (using Term Lookup to feature the texts). In this way, each text is represented as a feature vector. We can then use the Data Mining tools in SQL Server (Decision Tree or Naive Bayes) to train a mining model, and conduct automatic classification of new texts by using the model.
For example, if the texts are customer feedback, then we can use the tools to perform automatic, routine categorization of the feedback into a number of predetermined categories.
3. Analysis of texts (similar to previous, but different goals and algorithms)
Given a collection of texts, we can use the transformations to extract terms from the texts (using Term Extraction), and use them as features of the texts (using Term Lookup to feature the texts). In this way, each text is represented as a feature vector. We can then use the Data Mining tools in SQL Server (Clustering or Association Rule) to train a mining model, and analyze the contents of the texts.
For example, if the texts are survey results from a Web site, we can use these tools to conduct analysis of the survey results. For example, by grouping similar opinions together in a single cluster, and by determining how many clusters there are.
4. Combining text mining data with tabular data
If one column in a database table contains texts and the other columns contain structured tabular data, then we can extract terms from the text column and add the extracted terms in additional columns. The additional columns and the existing columns become a larger table.
For example, imagine a table in which each row represents information about a person - age, gender, and so forth. It also contains the text of a message from the person. The terms extracted from the message can become additional "properties" of the person.
See also this older blog post on Text Mining in SSIS by SSIS veteran Donald Farmer, who's now working on the next generation of Data Mining for SQL Server.