SharePoint Search – How features work, part 1
By peter.stilgoe
Word breakers
A word breaker is a component used by the query and index engines to break compound words and phrases into individual words or tokens. If there is no word breaker for a specific language, the neutral word breaker is used, in which case word breaking occurs at the white space between words and phrases. At indexing time, if there is any locale information associated with the document (for example, a Word document contains locale information for each text chunk), the index engine will try to use the word breaker for that locale. If the document does not contain any locale information, the user locale of the computer on which the indexer is installed is used instead. At query time, the locale (HTTP_ACCEPT_LANGUAGE) of the browser from which the query was sent is used to perform word breaking on the query. Additional information about the language availability of the word breaker component is available in Appendix B: Search Language Considerations.
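The fallback behaviour described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the SharePoint implementation: the function names, the `breakers` lookup table and the way the first tag is taken from the Accept-Language value are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

WordBreaker = Callable[[str], List[str]]


def neutral_word_breaker(text: str) -> List[str]:
    """Break on runs of white space only, as described for the neutral word breaker."""
    return text.split()


def select_word_breaker(locale: Optional[str],
                        breakers: Dict[str, WordBreaker]) -> WordBreaker:
    """Return the breaker registered for the locale, else fall back to the neutral one.

    `breakers` is a hypothetical table of language-specific word breakers.
    """
    if locale is not None and locale in breakers:
        return breakers[locale]
    return neutral_word_breaker


def query_locale(accept_language: str) -> Optional[str]:
    """Take the first language tag from an Accept-Language header value (assumed rule)."""
    first = accept_language.split(",")[0].split(";")[0].strip()
    return first or None


if __name__ == "__main__":
    locale = query_locale("en-GB,en;q=0.8")
    breaker = select_word_breaker(locale, breakers={})
    print(locale, breaker("enterprise  search\tserver"))
```

With an empty `breakers` table every locale ends up on the neutral breaker, which is the situation the paragraph describes for languages without their own word breaker.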
Stemming
Stemming is a feature of the word breaker component used only by the query engine to determine where the word boundaries are in the stream of characters in the query. A stemmer extracts the root form of a given word. For example, “running,” “ran,” and “runner” are all variants of the verb “to run.” In some languages, a stemmer expands the root form of a word to alternate forms. Stemming is turned off by default. Stemmers are available only for languages that have morphological expansion; this means that, for languages where stemmers are not available, turning on this feature in the Search Result Page (CoreResult Web Part) will not have any effect. Additional information about language availability for the Stemmer feature is available in Appendix B: Search Language Considerations.
Noise words dictionary
Noise words are words that do not add value to a query, such as “and,” “the,” and “a.” The indexing engine filters them to save index space and to increase performance. Noise word files are customizable, language-specific text files. These files are a simple list of words, one per line. If a noise word file is changed, you must perform a full update of the index to incorporate the changes. Additional information about the noise words dictionary and how to customize it is available at www.microsoft.com.
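As a rough illustration of the filtering step, the sketch below reads a one-word-per-line list and drops matching tokens. The function names and the case-insensitive comparison are assumptions for the example; the source does not say how the indexing engine compares words.

```python
from typing import Iterable, List, Set


def parse_noise_words(lines: Iterable[str]) -> Set[str]:
    """Build a noise word set from lines holding one word each; blank lines are skipped."""
    return {line.strip().lower() for line in lines if line.strip()}


def drop_noise_words(tokens: Iterable[str], noise: Set[str]) -> List[str]:
    """Keep only the tokens that are not noise words (case-insensitive, by assumption)."""
    return [token for token in tokens if token.lower() not in noise]


if __name__ == "__main__":
    noise = parse_noise_words(["and\n", "the\n", "a\n", "\n"])
    print(drop_noise_words(["The", "cat", "and", "a", "dog"], noise))
```

Because the filtering happens when content is indexed, editing the list has no effect on documents already in the index, which is why a full update is needed after a change.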
Custom dictionary
The custom dictionary file contains values that the search server must include at index and query times. Custom dictionary lists are customizable, language-specific text files. These files are used by Search in both the index and query processes to identify exceptions to the noise word dictionaries. A word such as “AT&T,” for example, will never be indexed by default because the word breaker breaks it into single noise words. To avoid this, the user can add “AT&T” to the custom dictionary file; as a result, this word will be treated as an exception by the word breaker and will be indexed and queried. These files contain a simple list of words, one per line. If the custom dictionary file is changed, you must perform a full update of the index to incorporate the changes. By default, no custom dictionary file is installed during Office SharePoint Server 2007 Setup. Additional information about the custom dictionary file and how to customize it is available at www.microsoft.com.
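The “AT&T” case can be reproduced with a toy word breaker. Everything here is an assumption made for illustration: the splitting rule (break on any non-alphanumeric character), the upper-case comparison against the custom dictionary, and the sample noise words. It only shows why the exception list changes what reaches the index.

```python
import re
from typing import List, Set


def break_words(text: str, custom: Set[str]) -> List[str]:
    """Toy word breaker: custom dictionary terms stay whole, everything else
    is split on non-alphanumeric characters."""
    tokens: List[str] = []
    for chunk in text.split():
        if chunk.upper() in custom:
            tokens.append(chunk)
        else:
            tokens.extend(part for part in re.split(r"[^0-9A-Za-z]+", chunk) if part)
    return tokens


def index_terms(text: str, noise: Set[str], custom: Set[str]) -> List[str]:
    """Return the tokens that would be indexed once noise words are filtered out."""
    return [
        token
        for token in break_words(text, custom)
        if token.upper() in custom or token.lower() not in noise
    ]


if __name__ == "__main__":
    noise = {"at", "t", "the"}  # sample noise words chosen for the example
    print(index_terms("the AT&T contract", noise, custom=set()))
    print(index_terms("the AT&T contract", noise, custom={"AT&T"}))
```

Without the custom entry, “AT&T” is split into “AT” and “T”, both of which are filtered as noise; with the entry, the term survives intact.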
Thesaurus
There is a configurable thesaurus file for each language that Search supports. Using the thesaurus, you can specify synonyms for words and also automatically replace words in a query with other words that you specify. The thesaurus used will always be in the language of the query, not necessarily the server’s user locale. If a language-specific thesaurus is not available, a neutral thesaurus (tseneu.xml) is used. Additional information about the thesaurus file and how to customize it is available at www.microsoft.com.
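The two thesaurus behaviours, synonym expansion and word replacement, can be sketched as follows. This models the idea with plain Python dictionaries rather than the thesaurus file itself; the function names and the sample entries are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def build_expansions(groups: Iterable[Iterable[str]]) -> Dict[str, Set[str]]:
    """Map every word in a synonym group to the whole group."""
    table: Dict[str, Set[str]] = {}
    for group in groups:
        members = {word.lower() for word in group}
        for word in members:
            table[word] = members
    return table


def apply_thesaurus(terms: Iterable[str],
                    expansions: Dict[str, Set[str]],
                    replacements: Dict[str, str]) -> List[List[str]]:
    """For each query term, return the list of alternatives to search for.

    A replacement swaps the term for another one; an expansion searches for the
    term and all of its synonyms; any other term is left unchanged.
    """
    result: List[List[str]] = []
    for term in terms:
        key = term.lower()
        if key in replacements:
            result.append([replacements[key]])
        elif key in expansions:
            result.append(sorted(expansions[key]))
        else:
            result.append([term])
    return result


if __name__ == "__main__":
    expansions = build_expansions([["car", "automobile"]])
    replacements = {"nt5": "windows 2000"}
    print(apply_thesaurus(["NT5", "car", "red"], expansions, replacements))
```

Replacements are checked first in this sketch so that a replaced word is never searched for in its original form; whether the real query engine orders the two rules this way is not stated in the source.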
Language Auto Detection
The Language Auto Detection (LAD) feature generates a best guess about the language of a text chunk based on the Unicode range and other language patterns. It is used by the index engine for relevance calculation and in queries sent from the Advanced Search Web Part, where the user can restrict the language of the documents returned by a query.
Did You Mean?
The Did You Mean? feature is used by the query engine to catch possible spelling errors and to provide suggestions for queries. The Did You Mean? feature builds suggestions by using three components:
· Query log: Information tracked in the query log includes the query terms used, when the search results were returned for search queries, and the pages that were viewed from search results. This search usage data helps you understand how people are using search and what information they are seeking. You can use this data to help determine how to improve the search experience for users.
· Dictionary lexicon: A dictionary of most-used lexicons provided at installation time.
· Custom lexicon: A collection of the most frequently occurring words in the corpus, built at query time by the query engine from indexed information.
The Did You Mean? suggestions are available only for English, French, German, and Spanish.
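To make the idea concrete, here is a minimal sketch of a suggestion lookup that merges a dictionary lexicon with a custom lexicon. It is not the algorithm SharePoint uses, which the source does not describe: the similarity matching is borrowed from Python's standard `difflib` module, the language codes and the 0.8 cutoff are assumptions, and the query log component is left out entirely.

```python
import difflib
from typing import Iterable, Optional

# Assumed codes for English, French, German and Spanish.
SUPPORTED_LANGUAGES = {"en", "fr", "de", "es"}


def did_you_mean(term: str,
                 language: str,
                 dictionary_lexicon: Iterable[str],
                 custom_lexicon: Iterable[str]) -> Optional[str]:
    """Suggest the closest known word for a possibly misspelled query term.

    Returns None when the language is unsupported, the term is already known,
    or nothing in the combined vocabulary is similar enough.
    """
    if language not in SUPPORTED_LANGUAGES:
        return None
    vocabulary = {w.lower() for w in dictionary_lexicon} | {w.lower() for w in custom_lexicon}
    word = term.lower()
    if word in vocabulary:
        return None
    matches = difflib.get_close_matches(word, sorted(vocabulary), n=1, cutoff=0.8)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(did_you_mean("sharepiont", "en", ["search", "server"], ["sharepoint"]))
```

The custom lexicon matters here because a product name such as “sharepoint” would not normally appear in a general dictionary; it only becomes a possible suggestion once it has been collected from the indexed content.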
Definition Extraction
The Definition Extraction feature finds definitions for candidate terms and identifies acronyms and their expansions by examining the grammatical structure of sentences that have been indexed (for example, NASA, radar, modem, and so on). It is only available for English.
How the default Document Properties in Office documents are used in the default MOSS 2007 SearchCenter search results
By peter.stilgoe
1) MOSS 2007 uses the “Comments” field (from Document Properties) as its description.
2) MOSS 2007 does not search the keywords assigned to an Office document via Document Properties.
3) For Word, Excel and PowerPoint documents, if you search for a word that appears in the document’s description (aka the “Comments” field), then the description is displayed in the search results with the search term highlighted in bold.
4) For Word, Excel and PowerPoint documents, if you search for a word that only appears in the body of the document, then:
# for Office 2003 documents, a snippet from the body of the document is displayed in the search results, with the search term highlighted in bold. The description is not displayed, even when it exists.
# for Office 2007 documents, the description is displayed in the search results.
5) If you search for a term that only appears in the document’s “Keywords” field, then nothing is found in the search.
6) If an Office document has a title assigned in its Document Properties, then the title is used in the search results. If no title is specified, then the document’s filename (including the file extension) is used instead.
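The observations above amount to a small set of display rules, which the following sketch encodes. It is a model of the observed behaviour only, not of how MOSS 2007 is implemented: the `OfficeDoc` class, the field names and the snippet width are invented for the example, bold highlighting is omitted, and only the fields mentioned in the observations (body, Comments, Keywords, title) are modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OfficeDoc:
    filename: str
    body: str
    title: str = ""
    comments: str = ""      # the "Comments" field, shown as the description
    keywords: str = ""      # stored, but not searched (observations 2 and 5)
    office_2007: bool = False


def result_title(doc: OfficeDoc) -> str:
    """Observation 6: use the title if set, otherwise the filename with extension."""
    return doc.title or doc.filename


def snippet(body: str, term: str, width: int = 30) -> str:
    """Return the text around the first occurrence of the term."""
    pos = body.lower().find(term.lower())
    start = max(0, pos - width)
    return body[start:pos + len(term) + width]


def result_summary(doc: OfficeDoc, term: str) -> Optional[str]:
    """Return the text shown under a hit, or None if the document is not found."""
    needle = term.lower()
    if needle in doc.comments.lower():
        return doc.comments                      # observation 3
    if needle in doc.body.lower():
        if doc.office_2007:
            return doc.comments                  # observation 4, Office 2007
        return snippet(doc.body, term)           # observation 4, Office 2003
    return None                                  # keywords only: nothing found


if __name__ == "__main__":
    doc = OfficeDoc("plan.doc", "Quarterly budget figures",
                    comments="Finance summary", keywords="confidential")
    print(result_title(doc), "|", result_summary(doc, "budget"))
```

Searching the sample document for “confidential” returns None, matching the observation that a term present only in the Keywords field is not found.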
http://www.thismuchiknow.co.uk/?p=41
Create an XML Test Page for Search Results
By peter.stilgoe
October 28th, 2009
