Microsoft and Enterprise Search Q&A

Damir Dobric Posts

What is Consistent Search?

Office SharePoint Server 2007 and Windows SharePoint Services 3.0 now use a common implementation of Microsoft Search. This means that the search experience is consistent across both Office SharePoint Server 2007 and Windows SharePoint Services 3.0.

What can be indexed?

SharePoint servers, Web sites, file shares, Exchange Public Folders, Lotus Notes databases, and line-of-business (LOB) applications.

What is the relation between Enterprise Search and Desktop Search?

Enterprise Search uses index technology similar to that used in Windows® Desktop Search.

What is Content Ranking?

Content ranking is a number that defines how important a document is. There are two types of content ranking: static and dynamic. Static ranking is not affected by content; examples are ClickDistance, URL-Depth, Language and FileType-Biasing.
Dynamic ranking depends on property values and content; examples are URL Matching, Anchor Text and Title Extraction.

What is ClickDistance?

Click distance is the number of links between a content item and an "expert" (authoritative) page that links to it. The more links the crawler must follow from an authoritative page to reach the content item, the lower the relevance score. If there are multiple paths to a content item, relevance is calculated from the shortest path, that is, the one with the fewest links from the authoritative page to the content item.
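
The shortest-path idea can be illustrated with a breadth-first search over a link graph. This is only a sketch in Python, not the way Enterprise Search implements it; the link graph, the page names and the function click_distance are invented for illustration.

```python
from collections import deque

def click_distance(links, authoritative, target):
    """Return the fewest links from any authoritative page to target, or None."""
    queue = deque((page, 0) for page in authoritative)
    seen = set(authoritative)
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target is not reachable from an authoritative page

# Hypothetical link graph: page -> pages it links to.
links = {
    "/home": ["/docs", "/news"],
    "/docs": ["/docs/plan"],
    "/news": ["/archive"],
    "/archive": ["/docs/plan"],
}

# Two paths lead to /docs/plan; the shorter one (via /docs) wins.
print(click_distance(links, ["/home"], "/docs/plan"))  # -> 2
```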

What is URL Depth?

URL depth refers to how many levels deep within a site the content item is found. The level is determined by counting the slash ("/") characters in the URL; the more slash characters in the URL path, the deeper the URL is for that content item.
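
Counting slashes is easy to try out. The Python sketch below assumes that only the path portion of the URL is counted (so the "//" after the scheme is ignored); the function name url_depth and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Count the slash characters in the path portion of the URL."""
    return urlparse(url).path.count("/")

print(url_depth("http://server/default.aspx"))          # -> 1
print(url_depth("http://server/sites/sales/plan.docx"))  # -> 3
```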

What is FileType Biasing?

In most search scenarios, certain file types are more relevant than others. For example, HTML pages and Word documents are usually more relevant to a user's search than an Excel spreadsheet or a plain text file.

What is Anchor Text?

Anchor text is the text that is included with a hyperlink to describe the target content of that hyperlink. When Enterprise Search crawls the content item, this text is included in the index for that content. Anchor text only influences rank, and is not the determining factor for including a content item in the result set. For example, if all the query terms are found only in the anchor text and not in the actual content of the item, the link may be obsolete, so the content item is not included in the results.

What is URL Matching?

URL matching is the process by which Enterprise Search checks content item URLs for a direct match with the specified search terms.
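
As a rough illustration, a direct match can be checked by splitting the URL into tokens and comparing them with the query terms. This Python sketch is an assumption about what "direct match" could look like, not the actual matching logic; url_match_terms and the sample URL are invented.

```python
import re

def url_match_terms(url, query):
    """Return the query terms that appear as whole tokens in the URL."""
    url_tokens = set(re.findall(r"[a-z0-9]+", url.lower()))
    return [term for term in query.lower().split() if term in url_tokens]

print(url_match_terms("http://server/sites/marketing/plan.docx", "Marketing plan 2007"))
# -> ['marketing', 'plan']
```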

What is Title Extraction?

Title extraction, or using the title value in the relevance calculation, can help return highly relevant content if the content item is appropriately named. However, there are scenarios where the value in the title property does not accurately reflect the content, so the title provides no valuable information about it.

What metadata is included in the relevance calculation?

Several metadata tags are included in the relevancy calculations:

  • Click Distance: Browsing distance from authoritative sites (shorter distances tend to be more relevant).
  • Anchor Text: Hyperlinks act as annotations on their target. In addition, they tend to be highly descriptive.
  • URL Depth: URLs higher in the hierarchy tend to be more relevant.
  • URL Matching: Direct matches on text that's in URLs.
  • Metadata Extraction: Automatically extracts titles and authors from document text if they are missing.
  • Automatic Language Detection: Helps create preference for results in your language.
  • File Type Biasing: Certain file types tend to be more relevant (for example, PPT files are often more relevant than XLS files).
  • Text Analysis: Traditional text ranking based on such factors as matching terms, term frequencies, and word variants.

What are the main components of the search engine?

Two main components make up the index: a content index and a properties store.

What is a Content Index?

The content index includes the actual text contained in files, together with an associated inverted index of the words that occur in the enterprise's content.
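
An inverted index maps each word to the documents that contain it. The following Python sketch shows the general data structure only; the document IDs, the sample texts and the function build_inverted_index are made up, and the real content index is of course far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "marketing plan for sales", 2: "sales report"}
index = build_inverted_index(docs)

print(sorted(index["sales"]))  # -> [1, 2]
print(sorted(index["plan"]))   # -> [1]
```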

What does the property store contain?

It contains metadata properties such as author, date created and document type. Think of the property store as a table of properties and their values, where each row corresponds to a separate document in the full-text index. The property store also maintains and enforces document-level security, which is gathered when a document is indexed.
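
The table picture can be sketched in Python as one row of properties per document ID, with the security information stored alongside. The property names, users and the "allowed" field are assumptions made for this illustration and do not reflect the real storage schema.

```python
# One row per document in the full-text index (hypothetical data).
property_store = {
    1: {"author": "John Smith", "filetype": "docx", "allowed": {"alice", "bob"}},
    2: {"author": "Jane Doe", "filetype": "xlsx", "allowed": {"alice"}},
}

def visible_documents(store, user):
    """Return the IDs of the documents the given user may see."""
    return [doc_id for doc_id, row in store.items() if user in row["allowed"]]

print(visible_documents(property_store, "alice"))  # -> [1, 2]
print(visible_documents(property_store, "bob"))    # -> [1]
```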

What is an IFilter?

An IFilter is an add-in that enables the index engine to open, read, and index the contents of new file types it would not otherwise be able to fully index. IFilters extract the text and the metadata for each document and then pass the stream back to the index engine. Document properties are then stored in the properties store, and the actual text of the document is placed in the content index.

An IFilter extracts chunks of text from documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document.
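
To make the idea of text chunks and value chunks more concrete, here is a Python sketch that filters a small HTML document. It only imitates the concept; real IFilters are COM components with a different interface, and the class HtmlFilterSketch and the function filter_html are invented names.

```python
from html.parser import HTMLParser

class HtmlFilterSketch(HTMLParser):
    """Collects text chunks (with their position) and document properties."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []  # list of (position, text)
        self.properties = {}   # value chunks, e.g. the title
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.properties["title"] = text  # goes to the property store
        else:
            # Formatting tags are dropped; only the text and its order remain.
            self.text_chunks.append((len(self.text_chunks), text))

def filter_html(html):
    parser = HtmlFilterSketch()
    parser.feed(html)
    parser.close()
    return parser.text_chunks, parser.properties

chunks, props = filter_html(
    "<html><head><title>Plan</title></head>"
    "<body><p>Sales <b>grow</b> fast</p></body></html>"
)
print(chunks)  # -> [(0, 'Sales'), (1, 'grow'), (2, 'fast')]
print(props)   # -> {'title': 'Plan'}
```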

What is Continuous Propagation?

The index engine uses continuous propagation, which allows the index to be built almost immediately. With continuous propagation, the index continues to be built even as the crawling process moves through the content sources. By contrast, indexing a large amount of content with SharePoint Portal Server 2003 could take days, because the index was only propagated when the crawl was completed.
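
The difference can be pictured with a Python generator that updates the index after every crawled document, so that queries can already run while the crawl is in progress. This is a conceptual sketch only; index_continuously and the sample documents are hypothetical.

```python
def index_continuously(crawled_documents, index):
    """Add each document to the index as soon as it has been crawled."""
    for doc_id, text in crawled_documents:
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
        yield doc_id  # at this point the index is already queryable

index = {}
steps = index_continuously([(1, "alpha beta"), (2, "beta gamma")], index)

next(steps)                   # crawl and index the first document only
print(sorted(index["beta"]))  # -> [1]
remaining = list(steps)       # let the crawl finish
print(sorted(index["beta"]))  # -> [1, 2]
```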

What is a Wordbreaker?

A wordbreaker breaks text into words and phrases. Each language can have a language-specific wordbreaker; otherwise the neutral (invariant) wordbreaker is used. Wordbreaking improves the relevance and effectiveness of the results returned by the query.

What is a Stemmer?

A stemmer generates inflected forms of a given word. Stemming improves the relevance and effectiveness of the results returned by the query.

What is a two-phase query?

When a user enters a query, the engine first passes the entered text to a language-specific or neutral wordbreaker. After the text has been broken into words, the engine passes them to a stemmer (if stemming is enabled). This two-step process improves the relevance and effectiveness of the results returned by the query.
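
The two phases can be sketched in Python as follows. The wordbreaker here simply splits on non-word characters and the "stemmer" only toggles a trailing "s", which is a deliberately naive stand-in for the real language-specific components; all function names are invented.

```python
import re

def wordbreak(text):
    """Phase 1: break the entered text into lowercase words."""
    return re.findall(r"\w+", text.lower())

def stem_variants(word):
    """Phase 2 (toy version): generate a singular/plural variant of the word."""
    variants = {word}
    if word.endswith("s") and len(word) > 3:
        variants.add(word[:-1])
    else:
        variants.add(word + "s")
    return variants

def expand_query(text, stemming=True):
    terms = wordbreak(text)
    if not stemming:
        return [{term} for term in terms]
    return [stem_variants(term) for term in terms]

expanded = expand_query("Marketing plans")
# One set of acceptable forms per query word:
# [{'marketing', 'marketings'}, {'plans', 'plan'}]
```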

How is the query processed?

If the query (the result of wordbreaking and stemming) specifies property information, the content index is checked first for matches, which are paired with documents in the property store; the properties in the query are then checked again to ensure a match. Finally, the query engine applies an additional level of filtering to remove results that the user does not have permission to access.
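
The three steps (content index, property check, security filtering) might be sketched in Python like this. The data structures and the function run_query are assumptions for illustration and say nothing about how the query engine is really built.

```python
def run_query(terms, property_filters, user, content_index, property_store):
    # 1. Content index: documents that contain every query term.
    postings = [content_index.get(term, set()) for term in terms]
    candidates = set.intersection(*postings) if postings else set(property_store)
    results = []
    for doc_id in sorted(candidates):
        row = property_store[doc_id]
        # 2. Property store: the properties named in the query must match.
        if any(row.get(name) != value for name, value in property_filters.items()):
            continue
        # 3. Security: drop results the user has no permission to access.
        if user not in row["allowed"]:
            continue
        results.append(doc_id)
    return results

# Hypothetical sample data.
content_index = {"plan": {1, 2, 3}, "marketing": {1, 3}}
property_store = {
    1: {"author": "John Smith", "allowed": {"alice"}},
    2: {"author": "John Smith", "allowed": {"alice"}},
    3: {"author": "John Smith", "allowed": {"bob"}},
}

print(run_query(["marketing", "plan"], {"author": "John Smith"}, "alice",
                content_index, property_store))  # -> [1]
```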

What is the Fast Query Scoping Subsystem?

Search scopes help users broaden or narrow the scope of their content searches. Enterprise Search scopes in MOSS are now decoupled from content sources and can be based on arbitrary content properties such as URL, type, and author. A scope can be based on a single rule, such as "All Marketing Plans," or on multiple rules, such as "All Marketing Plans on the North American Sales Web Site, authored by John Smith." In addition, search scopes can be defined globally at the Shared Services Provider (SSP) level and shared across sites, or at the site collection level.
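
A scope can be thought of as a list of rules that a document's properties must all satisfy. The Python sketch below models the two example scopes from the text; the property names, the URL prefix http://sales-na/ and the function in_scope are hypothetical.

```python
def in_scope(properties, rules):
    """A document is in scope when every rule accepts its properties."""
    return all(rule(properties) for rule in rules)

# "All Marketing Plans": a single rule.
marketing_plans = [lambda p: p.get("type") == "Marketing Plan"]

# "All Marketing Plans on the North American Sales Web Site,
# authored by John Smith": multiple rules.
na_plans_by_john = marketing_plans + [
    lambda p: p.get("url", "").startswith("http://sales-na/"),
    lambda p: p.get("author") == "John Smith",
]

doc = {
    "type": "Marketing Plan",
    "url": "http://sales-na/plans/q1.docx",
    "author": "Jane Doe",
}
print(in_scope(doc, marketing_plans))   # -> True
print(in_scope(doc, na_plans_by_john))  # -> False
```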

What is the Business Data Catalog Protocol Handler?

Enterprise Search combined with the Business Data Catalog makes it easy to index and search any relational database or other information store accessible by Microsoft ADO.NET or a Web service, for example, data in a customer relationship management (CRM) system.

Note that you do not need to write custom protocol handlers or IFilters or create searchable HTML representations of information in a database.

Search results from the Business Data Catalog can be highly customized and fully integrated with Enterprise Search scopes and other Search Center features.

What is Custom Security Trimming?

The Enterprise Search Query object model includes the ISecurityTrimmer interface, which developers can implement to create custom security trimmers that trim search results at query time. This supports trimming of search results based on custom authentication types, and it makes it possible to perform up-to-date security trimming without re-crawling the content.

Here is an example for developers: http://msdn2.microsoft.com/En-US/library/aa981173.aspx
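
The real trimmer is a .NET class that implements ISecurityTrimmer; the linked article shows it. Purely to illustrate the idea of deciding per result URL at query time, here is a conceptual analogue in Python. The class DenyListTrimmer, its method check_access and the deny list are all invented and are not part of the SharePoint API.

```python
class DenyListTrimmer:
    """Decides for each result URL whether the user may see it."""

    def __init__(self, denied_prefixes_by_user):
        self.denied = denied_prefixes_by_user

    def check_access(self, urls, user):
        # One boolean per URL, in the same order as the input.
        prefixes = tuple(self.denied.get(user, ()))
        return [not url.startswith(prefixes) for url in urls]

trimmer = DenyListTrimmer({"bob": ["http://server/hr/"]})
urls = ["http://server/hr/salaries.xlsx", "http://server/docs/plan.docx"]

print(trimmer.check_access(urls, "bob"))    # -> [False, True]
print(trimmer.check_access(urls, "alice"))  # -> [True, True]
```

Because the decision is made when the query runs, a change to the deny list takes effect immediately, without a new crawl.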

Can I customize Enterprise Search by using an API?

Yes, Enterprise Search provides a powerful API: http://msdn2.microsoft.com/en-us/library/microsoft.office.server.search.query.aspx

The following example shows how to build a Web Part that uses the query API: http://msdn2.microsoft.com/en-us/library/ms551453.aspx

Can I customize Enterprise Search by using web services?

Yes, Enterprise Search provides a web service that can be used for this purpose. How to do that is described in more detail here:
http://msdn2.microsoft.com/en-us/library/ms543175.aspx
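
The web service expects the query as an XML request. The Python sketch below builds such a request with the standard library only. The element names (QueryPacket, Query, Context, QueryText) and the namespace urn:Microsoft.Search.Query are written from memory and should be treated as assumptions to verify against the documentation linked above; sending the request to the service is left out.

```python
import xml.etree.ElementTree as ET

NS = "urn:Microsoft.Search.Query"  # assumed namespace, check the documentation

def build_query_packet(text, language="en-US"):
    """Build a minimal keyword query request as an XML string."""
    ET.register_namespace("", NS)
    packet = ET.Element(f"{{{NS}}}QueryPacket")
    query = ET.SubElement(packet, f"{{{NS}}}Query")
    context = ET.SubElement(query, f"{{{NS}}}Context")
    query_text = ET.SubElement(
        context, f"{{{NS}}}QueryText", {"type": "STRING", "language": language}
    )
    query_text.text = text
    return ET.tostring(packet, encoding="unicode")

packet_xml = build_query_packet("marketing plan")
print(packet_xml)
```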

 


Posted Jun 06 2007, 02:25 PM by Damir Dobric
developers.de is a .Net Community Blog powered by daenet GmbH.