SharePoint Crawl Rules Appear to Ignore Some URL Protocols

I recently came across an issue relating to crawling people information in SharePoint and the use of crawl rules to exclude certain content.

The issue revolved around a requirement to exclude content contained within people’s MySites, but include user profile information so that people searches could still be conducted. The following crawl rule had been configured and was successfully excluding MySite content, but it was also excluding the user profile data (crawled using the sps3s:// protocol):

URL                              Exclude or Include
https://mysite.domain.com/*      Exclude

The crawl rule test facility indicated that, while SharePoint treats http:// and https:// differently, https:// and sps3s:// appear to be treated the same as far as crawl rules are concerned. With the above crawl rule in place, items in the MySite root site collection with either an https:// or an sps3s:// prefix will not be crawled, so user profile data and people search will not be available:

[Screenshot: crawl rule test, taken from a lab SharePoint 2010 system. The same tests have been performed against SharePoint 2013 and 2016 with the same results.]

What is actually happening is that the sps3s:// prefix tells SharePoint which connector to use. For people search, this is translated into a call to a web service at the specified host, i.e. https://mysite.domain.com/_vti_bin/spscrawl.asmx. The final call is therefore made to an https:// address, which matches the exclusion rule, and that is why the people data is not crawled.

Replacing the above crawl rule with the following rule corrects the issue, allowing people data stored in the MySite root site collection to be indexed and therefore made available for users to search:

URL                                      Exclude or Include
https://mysite.domain.com/personal/*     Exclude
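
The behaviour described above can be illustrated with a small Python sketch. It is only a model of the observed behaviour, not SharePoint's actual implementation: the function names are hypothetical, the wildcard matching is approximated with Python's fnmatch module, and the mapping of sps3:// to http:// is an assumption (only sps3s:// was tested here).

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Assumed mapping from people-crawl schemes to the scheme of the
# web service call that the connector finally makes.
PEOPLE_SCHEMES = {"sps3": "http", "sps3s": "https"}


def effective_url(start_address: str) -> str:
    """Return the URL the crawler ends up requesting (hypothetical helper)."""
    parts = urlsplit(start_address)
    scheme = parts.scheme.lower()
    if scheme in PEOPLE_SCHEMES:
        # People crawls are translated into a call to spscrawl.asmx on the host.
        return f"{PEOPLE_SCHEMES[scheme]}://{parts.netloc}/_vti_bin/spscrawl.asmx"
    return start_address


def is_excluded(start_address: str, exclude_pattern: str) -> bool:
    """Check an exclude rule against the effective URL, ignoring case."""
    url = effective_url(start_address).lower()
    return fnmatchcase(url, exclude_pattern.lower())


# The broad rule also catches the people crawl ...
print(is_excluded("sps3s://mysite.domain.com", "https://mysite.domain.com/*"))           # True
# ... while the narrower rule leaves it alone.
print(is_excluded("sps3s://mysite.domain.com", "https://mysite.domain.com/personal/*"))  # False
```

Under these assumptions, the narrower rule still excludes personal site content such as https://mysite.domain.com/personal/someuser/default.aspx (a made-up example address), while the profile crawl is left alone.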

Incorrect Title Shown for Office 2007/2010 Documents in SharePoint 2010 Search Results

Office 2007/2010 format documents stored in SharePoint 2010 which have their title field populated show an incorrect title in the default search results. The screen shot below shows the search results for the following document:

Filename: Test document 1.docx
Title: Test Document 1 Title
First line of document: First line of test document 1

[Screenshot: Search results Title 01a]

The title that is displayed as the link to the document at the top of the individual search result is the first line of text from the document, not the title (metadata) field. It should be noted that if the title field for the document is not set, the filename is displayed as the link to the document at the top of the individual search result, not the first line of text from the document.

Luckily, correcting this particular ‘feature’ is simple:

  • Open Registry Editor (regedit)
  • Navigate to HKLM\SOFTWARE\Microsoft\Office Server\14.0\Search\Global\Gathering Manager
  • Edit the ‘EnableOptimisticTitleOverride’ value and set it to 0:
    [Screenshot: Search results Title 02a]
  • Restart the SharePoint Server Search 14 service by opening an administrative command prompt and issuing the following commands:
    net stop osearch14
    net start osearch14
    [Screenshot: Search results Title 04a]
  • Repeat the above steps on all SharePoint servers in the farm
  • Perform a full crawl of the SharePoint content source(s)

Once the full crawl has completed, performing the search again gives the title field as the search result title for a document which has the title field populated (note that in the screen shot below, test document 1 has the title field set, test document 2 doesn’t):

[Screenshot: Search results Title 07a]

The following can be copied into a .reg file to automate setting the value; alternatively, it could be set using PowerShell:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search\Global\Gathering Manager]
"EnableOptimisticTitleOverride"=dword:00000000
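
For anyone who prefers scripting the change, here is a minimal sketch that sets the same value. PowerShell is the option mentioned above; this sketch uses Python's standard winreg module instead. It assumes it is run from an elevated prompt on each SharePoint server and that the key already exists; the function name is hypothetical. The search service restart and full crawl are still needed afterwards.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

# Key path and value name taken from the .reg file above.
KEY_PATH = r"SOFTWARE\Microsoft\Office Server\14.0\Search\Global\Gathering Manager"
VALUE_NAME = "EnableOptimisticTitleOverride"


def disable_optimistic_title_override() -> None:
    """Set EnableOptimisticTitleOverride to 0 under HKLM (hypothetical helper)."""
    if winreg is None:
        raise RuntimeError("winreg is only available on Windows")
    # KEY_WOW64_64KEY makes a 32-bit Python write to the 64-bit registry view.
    access = winreg.KEY_SET_VALUE | winreg.KEY_WOW64_64KEY
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, access) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, 0)


# Usage (on a SharePoint server, as administrator):
#   disable_optimistic_title_override()
```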

“Access is Denied” when crawling content on MOSS 2007 hosted on Windows Server 2008

One of the SharePoint farms we’ve built recently runs on Windows Server 2008 and SQL Server 2008. As usual, the installation uses a least-privilege account setup, with individual accounts running the various services and app pools. The farm is also patched to the latest level.

We’ve experienced one or two issues with this setup, but the most persistent one has been to do with crawling. Whenever crawls ran, they consistently failed with the following error in the Event Viewer:

“The start address <https://site.domain.com> cannot be crawled.

Context: Application ‘SharedServices1’, Catalog ‘Portal_Content’

Details:
    Access is denied. Check that the Default Content Access Account has access to this content, or add a crawl rule to crawl this content.   (0x80041205)”

In addition, the following error appeared in the SharePoint logs:

***** Couldn’t retrieve server https://site.domain.com policy, hr = 80041205 – File:d:\office\source\search\search\gather\protocols\sts3\sts3util.cxx Line:548

And the crawl logs showed only errors, each having the following description:

“Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify the account you are using has “Full Read” permissions on the SharePoint Web Application being crawled.(The item was deleted because it was either not found or the crawler was denied access to it.)”

We’d checked all of the usual suspects, including web application permissions for the account used by search, database permissions, etc., with no success.

The solution was to disable the loopback check on the servers hosting SharePoint. Adding the hostnames served to the BackConnectionHostNames list in the registry on the SharePoint servers wasn’t enough; the loopback check had to be disabled completely.

As an aside, another issue we’d experienced, in which an InfoPath form with code behind failed to load correctly on these servers, was also solved by disabling the loopback check.

For instructions on disabling the loopback check, see KB896861.