I coffee, therefore IFilter (or, Language-specific processing #1)

by Michael S. Kaplan, published on 2005/03/08 11:59 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/08/389675.aspx

Apologies for the title; I still cannot resist that sort of thing. Maybe one day....
If you have not read it yet, look at Language-specific processing #0 for more info about this series!

IFilter is one interface that you can use to lower the barriers between the engines that do the work of indexing and the data that may be sitting in proprietary formats. The documentation probably explains it better than I could here:

The IFilter interface scans documents for text and properties (also called attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document. IFilter provides the foundation for building higher-level applications such as document indexers and application-independent viewers.
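To make that description a little more concrete, here is a minimal sketch of the calling pattern a consumer follows: ask for a chunk, drain its text in buffer-sized pieces, and repeat until there are no more chunks. This is a portable mock and not the real COM interface. `MockFilter`, `Status`, and `ExtractAllText` are hypothetical names of my own, standing in for the roles that IFilter::GetChunk, IFilter::GetText, and their status codes (such as FILTER_E_NO_MORE_TEXT and FILTER_E_END_OF_CHUNKS) play in the real thing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the statuses a real filter reports via HRESULT.
enum class Status { Ok, NoMoreText, EndOfChunks };

// Hypothetical mock filter: each chunk is a run of text with formatting
// already stripped. A real filter would parse a proprietary file format.
class MockFilter {
public:
    explicit MockFilter(std::vector<std::string> chunks)
        : chunks_(std::move(chunks)) {}

    // Advance to the next chunk (the role IFilter::GetChunk plays).
    Status GetChunk() {
        if (next_ >= chunks_.size()) return Status::EndOfChunks;
        current_ = chunks_[next_++];
        offset_ = 0;
        return Status::Ok;
    }

    // Hand back at most `capacity` characters of the current chunk
    // (the role IFilter::GetText plays).
    Status GetText(std::string& out, std::size_t capacity) {
        if (capacity == 0 || offset_ >= current_.size()) {
            return Status::NoMoreText;
        }
        out = current_.substr(offset_, capacity);
        offset_ += out.size();
        return Status::Ok;
    }

private:
    std::vector<std::string> chunks_;
    std::string current_;
    std::size_t next_ = 0;
    std::size_t offset_ = 0;
};

// Consumer loop: outer loop over chunks, inner loop over text pieces.
std::string ExtractAllText(MockFilter& filter, std::size_t bufferSize) {
    std::string result;
    while (filter.GetChunk() == Status::Ok) {
        std::string piece;
        while (filter.GetText(piece, bufferSize) == Status::Ok) {
            result += piece;
        }
        result += '\n';  // keep chunk boundaries visible in the output
    }
    return result;
}
```

An indexer would feed each piece to its word breaker instead of concatenating it, but the two nested loops are the part worth noticing.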

Several shipping products that build on this feature will immediately come to mind: Full Text Search in SQL Server, SharePoint, Exchange, and Index Server for starters. And then there are those like MSN Desktop Search as well. These are all cases where search supports additional file formats. Imagine being able to get in on the fun and make sure your own format is supported for some type of indexing/searching!

This is a COM interface, so to implement it you have to implement AddRef/Release/QueryInterface as always. The additional methods you have to implement:

IFilter::Init
IFilter::GetChunk
IFilter::GetText
IFilter::GetValue
IFilter::BindRegion

The general topic about the IFilter interface has pointers to summaries, samples, instructions on building, applying and testing filters, as well as methods to bind to already existing IFilter implementations.

It is also nice to see such a great effort on the security side -- links and information to help guarantee that ISVs who write code against this interface do it securely. Throughout there are good warnings:

Caution    IFilters for Indexing Service run in the Local System security context. They should be written to manage buffers and to stack correctly. All string copies must have explicit checks to guard against buffer overruns. You should always verify the allocated size of the buffer. You should always test the size of the data against the size of the buffer.

That and a link to secure code practices to consider when implementing these interfaces are a welcome touch as far as I am concerned (as it does no good for Microsoft to write secure code if an ISV writes a component with a security issue!).
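As a small illustration of the kind of explicit check that caution is asking for, here is a sketch of a bounded string copy that tests the size of the data against the size of the buffer before writing anything. `CopyChecked` is a hypothetical helper name of my own, not something from the SDK, and a real filter would more likely lean on an existing safe string library.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: copies `src` into `dst` only if the whole string,
// including its terminator, fits in `dstCount` characters.
// On failure it returns false and leaves `dst` as an empty string.
bool CopyChecked(wchar_t* dst, std::size_t dstCount, const wchar_t* src) {
    if (dst == nullptr || dstCount == 0) return false;
    dst[0] = L'\0';                      // always leave a valid string behind
    if (src == nullptr) return false;

    const std::size_t needed = std::wcslen(src) + 1;  // text plus terminator
    if (needed > dstCount) return false;              // would overrun: refuse

    std::wmemcpy(dst, src, needed);
    return true;
}
```

Refusing the copy outright, instead of silently truncating, is a design choice for this sketch; either way the point is that the buffer size is checked on every copy.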

Now note that this interface, this IFilter, is not really about language-specific processing as much as it is about format-specific processing. But one of the greatest strengths of a service like MS Search is the ability to apply it to different file formats. It makes IFilter a very important interface to stretch the boundaries of what can be searched.

And it gives the future topics that deal with the more linguistic aspects of language-specific processing a much wider reach than they would otherwise have. So I will give IFilter an honorary "cool" status that I would usually reserve for things more linguisticalish :-)


This post was sponsored by "F" (U+0046, a.k.a. LATIN CAPITAL LETTER F)
A letter that realized it would never get to sponsor any of the fun "F" words while I am working for Microsoft, so it thought it should take "Filter" while it was available.

# Jonathan Payne on 9 Mar 2005 12:04 AM:

Why does the Indexing Service call IFilter using the Local System security context? Wouldn't it make more sense to try to call third-party code with the minimal security level needed to get the job done? (I realize that this might be hard to determine - perhaps implementers of IFilter could be called with no privileges and be spoon-fed the contents of a file unless they requested a more privileged security context.)

# bg on 9 Mar 2005 12:13 AM:

Probably too much work to do - but you could write a sample to show us how it's done. You could, for instance, produce a filter for VB 6.0 .bas/.cls files!

(Yes, I know you just need to put a PersistentHandler key in the registry, but that's too easy!)



# RIO - Randektív Informatikai Oldal on 9 Mar 2005 3:56 AM:

Istanbul has been given a new name: http://www.betanews.com/article/Microsoft_Unveils_Office_Communicator/1110303540

# Michael Kaplan on 9 Mar 2005 3:59 AM:

Jonathan -- The service itself runs with those permissions; I imagine it would be complex to navigate making it callable in variable contexts (though I think that would be worthwhile!).

bg -- I would be a lot more likely to do something with some of the future classes, the more linguistic ones. Stay tuned. :-)

# Michael Kaplan on 12 Mar 2005 9:22 PM:

Lest people think I am making up the potential usefulness here about IFilter implementations:



# Stephen on 7 Oct 2008 3:24 AM:

A lot of the MSDN links on this subject seem dead now - could someone write a little update on this subject now?


# Michael S. Kaplan on 7 Oct 2008 9:16 AM:

Lots of IFilter updates in various other blogs on the server -- for SharePoint, for SQL Server, for others, especially Filter Central. There are like over 2,600 in Google, for example....

# Jerry Camel on 17 Dec 2008 11:08 AM:

I can't find a whole lot in the way of examples for developing an IFilter...  Even the SDK references samples that don't seem to exist anymore.

I'm looking specifically for how to pass an embedded document on to its appropriate IFilter.

Can you point me to some sample code?  I suspect BindIFilterToStream will be involved, but I can't figure out exactly how...



# Michael S. Kaplan on 17 Dec 2008 2:48 PM:

I would suggest looking over at http://blogs.msdn.com/ifilter/ for more information here....

# Prakash Tandukar on 28 Dec 2008 12:39 AM:

I am able to read the properties of a *.docx file using IFilter, but it does not read any properties of a *.doc file (Microsoft Word 2003). What should be changed in the IFilter code to read the properties of *.doc files as well?



referenced by

2006/11/12 If I wasn't watching her blog, could you really claim I was filtering the list of blogs I read?

2005/03/21 Linguistic and Unicode considerations (or Language-specific Processing #4)

2005/03/14 You toucha my letters, IWordBreaker you face (or, Language-specific processing, #3)

2005/03/13 IStemmer'ed the tide (or, Language-specific processing #2)
