Earlier today I published a blog post about Looping over document similarity details in Semantic Search on the MSDN blog of the SQL Server Full-Text Search team. That post introduces the 3 built-in rowset functions that are provided by Semantic Search in SQL Server 2012 “Denali.” It also describes how to do something for which we don’t provide a built-in set-based function – that is, iterate over the documents that are similar to a source document, and ask why each document is similar. Please take a look!
I had fun playing with semantic search and testing my code. Here are a few more examples of the queries that you can perform with semantic search. Don't laugh at my SQL code! I haven't written much T-SQL while working mostly with SSIS for several years.
The examples in this post use the semanticsimilaritydetails_foreach function created in the blog post linked above.
The examples in this post also use a collection of SQL Server white papers published on MSDN and TechNet – mostly in Microsoft Word format – saved into a SQL Server database by using a SQL Server 2012 FileTable. With this approach, I can easily save BLOBs into the database simply by dragging and dropping the files in Windows Explorer. The database table is named WhitePapers, and it has the fixed schema common to all FileTables.
1. How many white papers are about ETL?
Query:
SELECT COUNT(*) AS 'Number of ETL White Papers'
FROM semantickeyphrasetable (WhitePapers, *) AS SKP
WHERE SKP.keyphrase = 'etl'
GO
Results:
Number of ETL White Papers
--------------------------
30
(1 row(s) affected)
These results tell me that my table contains 30 documents in which “etl” was identified as a statistically significant key phrase. (In version 1 of semantic search, only one-word terms are supported.)
2. What are the top matching white papers that are also about ETL?
Query:
DECLARE @Title NVARCHAR(255)
DECLARE @DocumentID hierarchyid
-- Source white paper - "We Loaded 1TB in 30 Minutes with SSIS, and So Can You."
SET @Title = '1TBin30MwithSSIS.docx'
-- Get the ID of the source document from its title.
SELECT @DocumentID = path_locator
FROM WhitePapers
WHERE name = @Title
-- Get the top n key phrases for each matching document.
DECLARE @DetailsCount smallint = 3
-- Get the top n matching white papers that are also about ETL.
SELECT TOP (10) LEFT(WP.name, 32) AS 'Document',
LEFT(SSD.keyphrase, 16) AS 'Key Phrase',
SSD.score AS 'Score'
FROM WhitePapers AS WP
CROSS APPLY
semanticsimilaritydetails_foreach(@DocumentID, WP.path_locator, @DetailsCount)
AS SSD
WHERE WP.path_locator <> @DocumentID
AND SSD.keyphrase = 'etl'
ORDER BY Score DESC
GO
Results:
Document Key Phrase Score
-------------------------------- ---------------- -------------
HighVolETLConsid.docx etl 0.6152592
Implementing a Microsoft PDW usi etl 0.4083139
HubSpokeOverview.docx etl 0.3958832
DataSrcISPkgs.docx etl 0.3768547
SSIS_die_zweite.docx etl 0.3722264
HubSpokeImpl.docx etl 0.3580182
BestPractDWSQL2008.docx etl 0.3551803
ScaleUpDWinSQL2008.docx etl 0.3497615
OracleSSIS.docx etl 0.3446308
SSIS2008Connectivity.docx etl 0.3446308
(10 row(s) affected)
These results show me the top 10 white papers that are similar to the source document, where “etl” is among the top 3 key phrases that were used to identify the white paper as similar to the source document.
(Note that the semantic indexing also captured a German-language article about ETL.
3. What are the similarities between my source document and other documents?
Query:
DECLARE @Title NVARCHAR(255)
DECLARE @DocumentID hierarchyid
-- Source white paper - "We Loaded 1TB in 30 Minutes with SSIS, and So Can You."
SET @Title = '1TBin30MwithSSIS.docx'
-- Get the ID of the source document from its title.
SELECT @DocumentID = path_locator
FROM WhitePapers
WHERE name = @Title
-- Get the top n key phrases for each matching document.
DECLARE @DetailsCount smallint = 3
-- Get the top n key phrases in matching white papers.
SELECT TOP (10) LEFT(SSD.keyphrase, 16) AS 'Key Phrase',
COUNT(SSD.keyphrase) AS 'Count'
FROM WhitePapers AS WP
CROSS APPLY
semanticsimilaritydetails_foreach(@DocumentID, WP.path_locator, @DetailsCount)
AS SSD
WHERE WP.path_locator <> @DocumentID
GROUP BY SSD.keyphrase
ORDER BY COUNT DESC
GO
Results:
Key Phrase Count
---------------- -----------
sql 149
server 111
sqlserver 30
cpu 24
etl 23
msdn 21
aspx 15
data 11
microsoft 10
tcp 10
(10 row(s) affected)
These results show me the top 10 key phrases that were used to identify other white papers as similar to the source document, where those key phrases are among the top 3 shared key phrases.
Have fun with Semantic Search! For more information see the documentation.
Be sure to check out the killer demo of FileTable and Semantic Search from SQL Server evangelist Roger Doherty and his team. This example includes a tag cloud to visualize the relative importance of key phrases and to find similar documents at a glance.