Full text search primer
Full text search (FTS) is a way of searching a table for strings more effectively than using LIKE filters. When you use a LIKE filter, SQL Server basically scans the whole column looking for matches. If your filter has a starting filter, then the database engine can use an index. So, for instance,
SELECT EmployeeID, Firstname, Lastname FROM Employee WHERE Lastname LIKE 'Smith%'
returns not only John Smith but also Johan Smithson because the % matches everything. And, since the engine knows the start of the dtring it can use an index on Lastname if one exists.
If the % is at the beginning of the criteria, '%mit%' for example, then the engine will have to search all of the rows for a match. Not only that but it would have to check for a match starting with the first character, then with the second and so on for every row. Which is a very expensive task.
So we have Full Text Search (FTS). FTS is a tool that you can use to perform searches of text data that is far more flexible and yet still has issues. To begin understanding FTS, you need to know that it operates separately from your data. I’m not going to go into step by step directions on how to set up FTS since there are already many guides who will help with that. FTS is first enabled for a database. Once that is done, it is enabled for specific columns on tables in a database. And that’s where we’ll begin our exploration. Please note, the following is not an exact representation of SQL Server’s implementation of FTS. That would be far too complex. It is a necessarily simplification to give a general idea of what FTS is doing.
When a table has FTS enabled on it, SQL Server basically creates an index of that data, far different from traditional indices. Let’s say that we have a single row in an FTS enabled table and that table only has an identity column and a single nvarchar(2048) column. The data in that table looks like this:
1 Full Text Search is the wave of the future.
To begin with SQL Server will parse that into 8 potential index entries:
1 1 Full
2 1 Text
3 1 Search
4 1 is
5 1 the
6 1 wave
7 1 of
8 1 future
There are only 8 potential entries because “the” was in there twice. Next, SQL Server removes “noise” words. Words like is, the, of are all words of little value in FTS. This leaves it with a simple index of 5 words. When you’re reading through the books online about FTS, you’ll encounter the word token, which is a more accurate representation of what they are storing, as you’ll see shortly. And the reality of this is far more complex because it actually stores data about how close the word Full is to the word Search as well as linguistic meaning, but I don’t want to get too far afield here.
SQL Server then provides two basic mechanisms for accessing FTS data: FREETEXT and CONTAINS (and their table variants FREETEXTTABLE and CONTAINSTABLE). CONTAINS is the more powerful of the two, allowing not just simple searches but thesaurus lookups (so a search on car can return matches for van, automobile, and taxi as well), inflection (hit could return hits and hitting matches as well), a form of LIKE (so Florida sem* would return Florida seminaries and Florida Seminoles but not Florida State Seminoles), and some weighting and proximity lookups.
FREETEXT, while simpler to use, isn’t quite as powerful. It just looks for word matches, and inflectional and thesaurus versions of the words.
The real problem with either FREETEXT or CONTAINS is that people just don’t think in the format required for using either of these. If you put a simple input box so people can search your database (as I have for my quotes database), most people are going to believe that they can simply put in “life adventure” and find all quotes related to life and adventure. Try to execute something like this:
SELECT * FROM quotes WHERE CONTAINS(*,' life adventure')
That returns an error:
Msg 7630, Level 15, State 3, Line 1
Syntax error near 'adventure' in the full-text search condition ' life adventure'.
The reason for this is the way that CONTAINS parses the search conditions. While Google has trained us that two words next to each other contain an implicit AND, SQL’s FTS requires an explicit one. So, for CONTAINS to work, the search phrase needs some modification. One possibility, the first I tried, was to replace the spaces with an AND. Which does a good job of transforming life adventure in life AND adventure, but it gets more complicated when you try to parse the search phrase
life adventure "sure to be short"
In this case, there are really three tokens: life, adventure and the phrase “sure to be short”. To make that work with my simplistic adding of AND ended up incorrectly breaking up the three phrase token. So my next step was to try to add some tokenizing based on quotation marks. Unfortunately, SQL Server isn’t the greatest at string manipulation and parsing in a UDF is possible but gets unwieldy very quickly. So I ended up writing a UDF that took the easy way out and simply replaced spaces with asterisks outside of quoted text for the CONTAINS and stripped out any punctuation.
CONTAINS and FREETEXT are used within the WHERE filter of a SELECT. But you can also use CONTAINSTABLE and FREETEXTTABLE, both of which return a table containing two columns: Key and Rank. Key is the primary key value of the table being searched, rank is the relevance, where a higher rank indicates a better match. In addition, you need to specify the table that you are looking for. All of that wrapped up in a table function that can be easily joined:
SELECT * FROM quotes q INNER JOIN CONTAINSTABLE(quotes, *,' “life*adventure” ') ct
ON q.quoteid = ct.[key] ORDER BY ct.Rank DESC
Note: Key is a reserved word so you have to wrap it in square brackets.
A couple of things I learned from this exercise. You need to have spaces around the quoted text in contains. I don’t know why SQL doesn’t handle that but it will throw an error without the spaces. Also, this won’t return all the matches that exist. To attempt to get more, I also used FREETEXT and UNION ALL, so that my final query looked something like this:
SELECT * FROM quotes q INNER JOIN
( SELECT [key], max(rank) Rank FROM (SELECT [key], rank from CONTAINSTABLE(quotes, *,' "life*adventure" ')
UNION ALL
SELECT [key], rank from FREETEXTTABLE(quotes, *,' "life adventure" ')) a
GROUP BY [key])ct
ON q.quoteid = ct.[key] ORDER BY ct.Rank DESC
Between those two I generally got most of the hits I needed, though it still wasn’t perfect. And the reason it wasn’t perfect is because the Google like version of what CONTAINS is looking for would be this:
CONTAINS(‘( FORMSOF (INFLECTIONAL, life) AND FORMSOF (INFLECTIONAL, adventure) )’
And, to get that, I highly recommend you check out Michael Coles' A Google Like Full Text Search. The only problem with this solution is that it can’t be run within SQL Server, not even as a CLR. The Irony parsing engine has some issues. So you’ll need to have the calling program implement that piece of code and pass it in. In my case, I modified the web service backend to parse the search request before calling the database.
So, if you need to include FTS, your best bet is to format your query the way that SQL wants it, which requires formatting it outside the database. Your next option is to modify the search to something more like what FTS can use. Feel free to modify the functions below to meet your needs.
-- =============================================
-- Author: Josef Finsel
-- Create date: 3/18/2008
-- Description: Make a FTS Search String
-- =============================================
CREATE FUNCTION [FTS_SearchFormatter_ContainsText]
(
-- Add the parameters for the function here @Search nvarchar(1024)
)
RETURNS nvarchar(1024)
AS
BEGIN -- Declare the return variable here declare @Search2 nvarchar(354)
set @Search = rtrim(ltrim(@search))
set @search = replace(@search, '.','*')
set @search = replace(@search, ',','*')
set @search = replace(@search, '!','*')
set @search = replace(@search, '?','*')
set @search = replace(@search, ':','*')
set @search = replace(@search, ';','*')
-- If the search phrase has spaces but no quotation marks
if @search like '% %' and @search not like '%"%'
begin
set @Search2 = '"' + replace(@Search,' ','*') + '"'
end
-- It didn't have spaces without quotation marks so know we
-- wrap it in quotation marks if it isn't already
if @Search2 is null and @Search not like '%"%'
set @Search2 = @search
-- If it hasn't met any criteria, set it to what was passed in
if @search2 is null
set @Search2 = @Search
set @Search2 = replace(ltrim(rtrim(@Search2)),' ','*')
WHILE Charindex('**',@Search2)>0
set @Search2 = replace(@Search2, '**','*')
return @Search2
END
GO
-- =============================================
-- Author: Josef Finsel
-- Create date: 3/18/2008
-- Description: Make a FTS Search String
-- =============================================
CREATE FUNCTION [widget].[FTS_SearchFormatter_FreeText]
(
-- Add the parameters for the function here @Search nvarchar(1024)
)
RETURNS nvarchar(1024)
AS
BEGIN -- Declare the return variable here set @Search = rtrim(ltrim(@search))
set @search = replace(@search, '.',' ')
set @search = replace(@search, ',',' ')|
set @search = replace(@search, '!',' ')
set @search = replace(@search, '?',' ')
set @search = replace(@search, ':',' ')
set @search = replace(@search, ';',' ')
set @search = replace(@search, '*',' ')
declare @Search2 nvarchar(354)
set @Search = rtrim(ltrim(@search))
if @search like '% %' and @search not like '%"%'
begin
set @Search2 = ' "' + @Search + '" '
end
if @search2 is null
set @Search2 = @Search
set @Search2 = replace(@Search2, '**', '*')
return @Search2
END