SharePoint Crawl Rules Appears to Ignore Some URL Protocols
I recently came across an issue relating to crawling people information in SharePoint and the use of crawl rules to exclude certain content. The issue revolved around a requirement to exclude content contained within peoples’ MySites, but include user profile information so that people searches could still be conducted. The following crawl rule had been configured and was successfully excluding MySite content, but was also excluding the user profile data (crawled using the sps3s:// protocol):
URL
Exclude or Include
Exclude
Using the crawl rule test facility indicated that while SharePoint treats http:// and https:// differently, https:// and sps3s:// appear to be treated the same as far as crawling is concerned, so if the above crawl rule is in place, items in the MySite root site collection, both with an https:// and sps3s:// prefix, will not be crawled, and therefore user profile data and people search will not be available: [Screen shot from lab SharePoint 2010 system. however the same tests have been performed against SharePoint 2013 and 2016 with the same results] In fact what is happening is that the sps3s:// prefix tells SharePoint which connector to use, and in the case of people search, this is translated into a call to a web service at the host specified, i.e. https://mysite.domain.com/_vti_bin/spscrawl.asmx, so the final call that is made is in fact to an https:// prefix, hence the reason that the people data is not crawled. Replacing the above crawl rule with the following rule corrects the issue allowing people data stored in the MySite root site collection to be indexed and therefore be available for users to search:
URL
Exclude or Include
https://mysite.domain.com/personal/*
Exclude