Searching PDF files in .NET using Microsoft Indexing Service

by James Crowe 19. August 2008 14:32
Searching the contents of PDF files is a common requirement, yet there is often confusion about how best to implement it.

A simple and flexible solution is to use the Adobe PDF IFilter. IFilter is a Microsoft interface specification: an IFilter implementation scans a document for its text and properties, allowing Microsoft’s Indexing Service to extract and index that data.


Step 1 – Download the PDF IFilter from Adobe and install it, accepting all the default options.


Step 2 – Configure the Indexing Service.

Basic steps include:

  • Open Computer Management > Services and Applications > Indexing Service
  • Create a new catalog, giving it a name and a folder location (the code in Step 4 uses a catalog named PDFSearch)
  • Expand the new catalog and add the directory containing the PDFs to be indexed
  • Start the Indexing Service
  • Under the Directories section, right-click the directory containing the PDFs and select Rescan (full)
  • Select ‘Query the catalog’. From this page you should be able to search for text contained in the PDF files. For more specific results you can query the metadata contained in the PDF files. Standard attributes include Title, Subject, Author and Keywords.
  • If the queries fail to return results, try restarting the server and re-indexing the PDF directory. For further details refer to the Adobe PDF IFilter installation readme file.
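The metadata queries mentioned above can also be built in code. The following is a minimal sketch, assuming the PDF attributes Title, Subject, Author and Keywords are exposed under the Indexing Service friendly names DocTitle, DocSubject, DocAuthor and DocKeywords (check the IFilter readme to confirm this for your installation). The class and method names are hypothetical and are not part of the ixsso library.

```csharp
using System;

// Hypothetical helper for composing Indexing Service query strings.
public static class IndexQueryBuilder
{
    // Builds a property clause such as: @DocAuthor "James Crowe"
    public static string PropertyClause(string propertyName, string value)
    {
        if (string.IsNullOrEmpty(propertyName))
            throw new ArgumentException("A property name is required.", "propertyName");
        if (value == null)
            throw new ArgumentNullException("value");

        // Simplification (assumption): embedded double quotes are dropped
        // rather than escaped, so the value is always one quoted phrase.
        string cleaned = value.Replace("\"", string.Empty).Trim();
        return "@" + propertyName + " \"" + cleaned + "\"";
    }

    // Joins clauses so that every one of them must match.
    public static string All(params string[] clauses)
    {
        return string.Join(" AND ", clauses);
    }
}
```

For example, IndexQueryBuilder.All(IndexQueryBuilder.PropertyClause("DocAuthor", "James Crowe"), "invoice") produces the string @DocAuthor "James Crowe" AND invoice, which can be pasted into the ‘Query the catalog’ page to test it before using it from code.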

Step 3 – Add a reference to the ixsso Control Library


In Visual Studio, choose Add Reference, open the COM tab and select ‘ixsso Control Library’.


Step 4 – Write .NET code to query the Indexing Service


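// Required namespaces (place these at the top of the code-behind file):
// using System.Data; using System.Data.OleDb; using Cisso;

// Indexing Service objects (from the ixsso Control Library reference)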
CissoQueryClass query = new CissoQueryClass();
CissoUtilClass util = new CissoUtilClass();

OleDbDataAdapter dataAdaptor = new OleDbDataAdapter();
DataSet resultsDs = new DataSet("IndexServerResults");

string pdfFolder = @"C:\PDFS\";

// Search query
query.Query = txbSearchValue.Text;

//  Catalog Name
query.Catalog = "PDFSearch";

// Columns to return
query.Columns = "Filename, Path, Size";

// Adds search path to query
// 'deep' will search subdirectories
// Or replace with 'shallow' to search specified folder only
util.AddScopeToQuery(query, pdfFolder, "deep");

// Create recordset
object recordSet = query.CreateRecordset("nonsequential");

// Populate the dataset from the returned ADO recordset
dataAdaptor.Fill(resultsDs, recordSet, "IndexServerResults");

// Bind results to gridview
grdSearchResults.DataSource = resultsDs;
grdSearchResults.DataBind();
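The code above passes the text box value straight to the query. Indexing Service may reject an empty query, or one made up only of ignored (noise) words, so it can help to check the input first. The sketch below is one possible guard. The SearchInput class and its 200-character limit are hypothetical choices made for illustration.

```csharp
using System;

// Hypothetical input guard to call before assigning query.Query.
public static class SearchInput
{
    // Assumed upper bound on query length; adjust to suit the application.
    private const int MaxLength = 200;

    // Returns true and the trimmed text when the input is usable as a query.
    // Returns false for null, empty, whitespace-only or over-long input.
    public static bool TryNormalize(string raw, out string normalized)
    {
        normalized = raw == null ? string.Empty : raw.Trim();
        return normalized.Length > 0 && normalized.Length <= MaxLength;
    }
}
```

In the page code this would wrap the existing logic: call SearchInput.TryNormalize(txbSearchValue.Text, out searchText) and only build and run the query when it returns true. Catching System.Runtime.InteropServices.COMException around CreateRecordset is a reasonable extra precaution for queries the service still refuses.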

[Screenshot: search results in GridView]

The code sample above relates to PDF files because of the type of IFilter used. If other IFilters are installed, other document types can be indexed just as easily.

Further Reading


Microsoft IFilters
Adobe PDF IFilter


Comments

11/14/2008 12:10:59 PM #

Is a new catalog required if indexing is already working on the server but only finds .doc and .htm results? thanks

Nick |
