Thursday, December 01, 2005

IFilter error - byDesign or byBug?

Full-text indexing of binary columns in MS SQL 2005 is – to say the least – somewhat overly vulnerable.

The root of possible problems is the TYPE COLUMN, specifying the type of the binary value (file) that one tries to build a full-text index (FTI) for. The TYPE column is used by the full-text indexing service (FTS) to select an appropriate IFilter to parse the binary value and create a proper FTI.

E.g.: a value of ".doc" in the TYPE column would invoke the IFilter for Microsoft Word documents, and a value of ".pdf" would instruct the FTS to use the Adobe PDF IFilter (if available). No problems here.

Oh, really?

Actually, all is well as long as the value of the TYPE column corresponds to the actual binary value – if the TYPE column says "doc", there better be a MS Word document stored in the binary column, or else the entire row is not processed.

That's right – not just the 'offending' column as is the practice in cases where an appropriate IFilter is not available. The entire row is not processed. Imagine a document management system, where one document is stored per row – this document will never be found through the use of full-text search, not even by title, if the title is stored in another column in the same table, and the latter is also full-text indexed.

The error is, however, reported in the SQL Server Error Log, and is represented with the following text:

Errors were encountered during full-text index population
for table or indexed view '%table/view name%', database
'%database name%' (table or indexed view ID '%object id%',
database ID '%database id%').
Please see full-text crawl logs for details.

Details are available in the full-text crawl log, located (by default) in the following folder:

%ProgramFiles%\Microsoft SQL Server\MSSQL.1\MSSQL\LOG

The name of the crawl log is defined by the following naming scheme.

Any volunteers to design a CLR function to help programatically track possible mishaps?

So far I've experienced the following error (copied from the crawl log):

Error '0x80030050' occurred during full-text index population
for table or indexed view '%table/view name%' (table or
indexed view ID '%object id%', database ID '%database id%'),
full-text key value %key value (binary)%.
Failed to index the row.

...paired up with this one:

The component 'offfilt.dll' reported error while indexing.
Component path 'C:\WINDOWS\system32\offfilt.dll'.

In fact the same goes for the rest of the supported office documents – if any of the IFilters in offfilt.dll (version: 2003.5.28.0) fail, the row is not full-text indexed.

In conclusion: use the TYPE COLUMN carefully – i.e. do not let the user choose the document type, read it from the document header. Incidentaly – why does the SQL Server not detect the actual content type this way? Is this by design or is it a bug?

Any volunteers to design a CLR function to help programatically detect *proper* content type?


ML