SharePoint Connector and Datasource Configuration

The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.

To retrieve content from cloud-based SharePoint repositories, see the SharePoint Online connector.

This connector can access a SharePoint repository running on the following platforms:

  • Microsoft SharePoint 2010

  • Microsoft SharePoint 2013

  • Microsoft SharePoint 2016

See this tutorial about configuring a SharePoint datasource and enabling security trimming:

When crawling, the connector discovers SharePoint contents in the following order: sites, then sub-sites (children). A site may contain:

  • Sub-sites

  • Generic Lists

    • List Items

      • Attachments

  • Document Libraries

    • Folders

    • Documents
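The discovery order described above can be pictured as a walk over a site tree in which each site is visited before its sub-sites. The following is a minimal sketch, not the connector's implementation: the `Site` class and `discovery_order` function are hypothetical names, and the breadth-first strategy and the order of lists relative to libraries are assumptions made only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Site:
    """Hypothetical in-memory model of a SharePoint site."""
    url: str
    lists: List[str] = field(default_factory=list)      # generic lists
    libraries: List[str] = field(default_factory=list)  # document libraries
    subsites: List["Site"] = field(default_factory=list)


def discovery_order(root: Site) -> List[str]:
    """Return URLs so that every site appears before its sub-sites."""
    order: List[str] = []
    queue = deque([root])
    while queue:
        site = queue.popleft()
        order.append(site.url)
        # Containers that belong to this site (assumed order: lists, then libraries).
        for name in site.lists + site.libraries:
            order.append(f"{site.url}/{name}")
        # Sub-sites are queued, so they are visited after the current site.
        queue.extend(site.subsites)
    return order
```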

When the connector re-crawls a SharePoint repository, each previously crawled URL is accessed before any newly discovered objects, but no order is guaranteed. The connector uses a cache to store retrieved parent objects to avoid unnecessary requests. The last modified date of each object is retrieved to determine if it has changed since the last crawl. If it has changed, a new request is made to retrieve the changes. If it has not changed, the object is skipped and no additional request is made.
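The re-crawl decision described above comes down to comparing last modified dates. The sketch below illustrates that logic under stated assumptions: `needs_refetch`, `recrawl_plan`, and the `crawl_state` mapping are hypothetical names, and the connector's real bookkeeping is not shown in this document.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def needs_refetch(url: str, last_modified: datetime,
                  crawl_state: Dict[str, datetime]) -> bool:
    """True if the object is new or has changed since the last crawl."""
    previous = crawl_state.get(url)
    return previous is None or last_modified > previous


def recrawl_plan(previous_urls: Iterable[str],
                 discovered_urls: Iterable[str]) -> List[str]:
    """Previously crawled URLs first, then newly discovered ones.

    The order inside each group is arbitrary in practice; this sketch
    simply preserves the input order.
    """
    previous = list(previous_urls)
    known = set(previous)
    return previous + [u for u in discovered_urls if u not in known]
```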

The connector uses SOAP to connect to and retrieve documents, lists, and other objects for indexing. It does not access a SharePoint site in the same way that a regular user does, and it needs additional privileges to use the SOAP interface to SharePoint.

The connector can be configured to work with Active Directory (AD) or LDAP to retrieve the ACLs for each object, which can then be used for security trimming at query time. To use security trimming to restrict user access to SharePoint objects, the authenticated user must have sufficient privileges to read every document in the system and determine which users can access them. The permissions requirements are explained below.

SharePoint Permissions

SharePoint security trimming restricts access to documents based on user permissions. There are two types of permissions in SharePoint:

  • Site permissions, which are:

    • managed by SharePoint

    • customizable for each site or subsite

    • inherited by subsites as the default permissions

    • grantable to users and groups

  • User permissions, which are:

    • assigned by group membership, when groups have been configured and provided permissions

    • assigned directly to the user

These permissions are stored as ACLs. When the SharePoint server is configured with security trimming set to "true", documents retrieved from SharePoint have the set of all ACLs stored in an acl_ss field on each document.

At search time, the ACLs are used to verify if a user has access to a document. This is configured in a query pipeline with a Security Trimming Query Stage.

To crawl all the sites and subsites, the authenticated user must belong to the site administrators group. If not, Fusion can still crawl and complete the job, but the crawled data will be limited by the user’s privileges. In addition, a WARNING message will appear in connectors.log indicating that the user is not a site administrator and is therefore unable to get sites from site collections. The message starts with Authorization Error (401).

Required Permissions

The SharePoint datasource must be configured with the name of a user who has sufficient permissions to crawl the entire site. These permissions require use of a custom Permission Policy. The required permissions correspond to the concept of Site Collection Auditor, a permission type which is not the same as a Site Administrator, but requires almost all of the Site Administrator privileges.

You will need to work with your SharePoint administrator to ensure that the account used by Fusion has all of the permissions listed in the following table:

| Permission Type | Permission | Description |
|---|---|---|
| Site Collection Auditor | | Full Read access for the entire site collection, including reading permissions and configuration data. |
| List | View Items | View items in lists and documents in document libraries. |
| List | Open Items | View the source of documents with server-side file handlers. |
| List | View Versions | View past versions of a list item or document. |
| Site | Browse Directories | Enumerate files and folders in a Web site using SharePoint Designer and WebDAV interfaces. |
| Site | View Pages | View pages in a Web site. |
| Site | Enumerate Permissions | Enumerate permissions on the Web site, list, folder, document, or list item. |
| Site | Browse User Information | View information about users of the Web site. |
| Site | Use Remote Interfaces | Use SOAP, WebDAV, Client Object Model, or SharePoint Designer interfaces to access the Web site. |
| Site | Open | Open a Web site, list, or folder in order to access items inside that container. |
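When reviewing the crawl account with your SharePoint administrator, it can help to compare the permissions actually granted against the List and Site permissions named above. The following is a small sketch of that comparison. `missing_permissions` is a hypothetical helper, and how you obtain the list of granted permissions from SharePoint is outside its scope.

```python
from typing import Iterable, List

# List and Site permissions named in the requirements above.
REQUIRED_PERMISSIONS = frozenset({
    "View Items",
    "Open Items",
    "View Versions",
    "Browse Directories",
    "View Pages",
    "Enumerate Permissions",
    "Browse User Information",
    "Use Remote Interfaces",
    "Open",
})


def missing_permissions(granted: Iterable[str]) -> List[str]:
    """Return the required permissions that are not in `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))
```

An empty result means every List and Site permission above is present. It does not verify the Site Collection Auditor policy itself.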

Troubleshooting Permission Issues

When the connector is configured using a SharePoint username without sufficient privileges, the Fusion connectors log file fusion/4.1.x/var/log/connectors/connectors.log contains an error like the following:

crawler.common.sharepoint.exception.SharePointException: Server was unable to process request. ---> Attempted to perform an unauthorized operation.
    at crawler.common.sharepoint.service.BaseService.analyzeResponse(BaseService.java:194) ~[classes/:?]
    at crawler.common.sharepoint.service.SiteDataService.getContentBySiteOrList(SiteDataService.java:169) ~[classes/:?]
    at com.lucidworks.permissions.Main.test1(Main.java:50) [classes/:?]
    at com.lucidworks.permissions.Main.main(Main.java:32) [classes/:?]

This user’s permissions may be sufficient to connect via SOAP and read the documents, but not sufficient to get the ACLs and other associated metadata. This may result in complete lack of access to documents, or access to unauthorized documents. Confirm that the configured SharePoint user has the required privileges.
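To check quickly whether a crawl hit permission problems, you can scan the connectors log for the two messages mentioned in this document. The sketch below is one way to do that. `find_permission_errors` is a hypothetical helper, not part of Fusion, and it matches only on the message text quoted above.

```python
from typing import Iterable, List

# Message fragments quoted in this document.
PERMISSION_MARKERS = (
    "Attempted to perform an unauthorized operation",
    "Authorization Error (401)",
)


def find_permission_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain a known permission-error message."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in PERMISSION_MARKERS)
    ]
```

To use it, open the connectors log file and pass the file object as `lines`; the function reads it line by line.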

Cache User Groups to Improve Search Performance

When a user performs a search query with SharePoint security trimming enabled, the security trimming process starts by fetching the user's groups. Two types of groups can be referenced by SharePoint document ACLs:

  • SharePoint groups

  • Active Directory LDAP security groups

Fusion creates a security filter using the user’s loginName and the groups that the user belongs to. The security filter is a Solr fq filter. Once the security filter is created, queries from this user return only the documents that the user is permitted to see.
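To make the idea of a security filter concrete, the sketch below builds a Solr fq string that matches documents whose acl_ss field contains the user's login name or any of the user's groups. This is an illustration only. The exact filter syntax that Fusion generates is not shown in this document, and `build_security_filter` is a hypothetical name.

```python
from typing import Iterable


def build_security_filter(login_name: str, groups: Iterable[str],
                          field: str = "acl_ss") -> str:
    """Build an fq clause matching any of the user's principals (assumed format)."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes inside a quoted Solr term.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'

    principals = [login_name] + sorted(set(groups))
    clause = " OR ".join(quote(p) for p in principals)
    return f"{field}:({clause})"
```

For a user `alice` in groups `Sales` and `HR`, this produces `acl_ss:("alice" OR "HR" OR "Sales")`.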

By default, Fusion looks up a user’s SharePoint groups and LDAP groups every time a search query is performed. But fetching groups for a user is expensive and can hurt query time.

Note, too, that each SharePoint site collection that is part of a datasource has its own unique SharePoint groups. This means that if Fusion has crawled multiple SharePoint site collections, it must look up a user’s groups from each site collection. This is done in parallel for speed, but it dominates query times if there are many site collections to query.
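The parallel lookup across site collections can be sketched with a thread pool. This is a simplified illustration, not Fusion's code. `lookup_groups_parallel` and the `fetch_groups` callback are hypothetical, and `fetch_groups` stands in for whatever call retrieves a user's groups from one site collection.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Set


def lookup_groups_parallel(
    login_name: str,
    site_collections: Iterable[str],
    fetch_groups: Callable[[str, str], Iterable[str]],
    max_workers: int = 8,
) -> Set[str]:
    """Fetch the user's groups from every site collection concurrently."""
    collections = list(site_collections)
    if not collections:
        return set()
    groups: Set[str] = set()
    workers = min(max_workers, len(collections))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One lookup per site collection; results are merged into one set.
        for result in pool.map(lambda sc: fetch_groups(sc, login_name), collections):
            groups.update(result)
    return groups
```

Even in parallel, the total time is bounded below by the slowest site collection, which is why many site collections still dominate query time.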

Another consideration is unplanned SharePoint or LDAP downtime. If group lookups are performed in real time during an outage, the user's queries result in missing documents because the security filter is not available.

To help alleviate these issues, Fusion offers a few different caching options. Consider using these caching options if you have many site collections, need extremely fast search, or cannot tolerate SharePoint or LDAP outages.

Security Filter Cache

With a security filter cache, once the query filter for a user has been generated, Fusion reuses the filter for this user for subsequent queries. The cache_expiration_time parameter dictates how long Fusion reuses the filter until generating it again. The cache_max_size parameter dictates the maximum number of items to hold in the security filter cache.
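The interaction of the two parameters can be sketched as a small time-limited, size-limited cache. This is an illustrative model rather than Fusion's implementation: the class name is hypothetical, the expiration time is assumed to be in seconds, and evicting the oldest entry when the cache is full is an assumption.

```python
import time
from collections import OrderedDict
from typing import Callable


class SecurityFilterCache:
    """Per-user filter cache with expiry and a maximum size (sketch)."""

    def __init__(self, cache_expiration_time: float, cache_max_size: int,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = cache_expiration_time   # assumed unit: seconds
        self._max = cache_max_size
        self._clock = clock
        self._entries = OrderedDict()       # user -> (expires_at, filter)

    def get_or_build(self, user: str, build_filter: Callable[[str], str]) -> str:
        now = self._clock()
        entry = self._entries.get(user)
        if entry is not None and entry[0] > now:
            return entry[1]                 # still fresh: reuse the filter
        value = build_filter(user)          # expired or missing: regenerate
        self._entries[user] = (now + self._ttl, value)
        self._entries.move_to_end(user)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the oldest entry (assumption)
        return value
```

A longer expiration time reduces group lookups but delays the effect of permission changes, and a larger maximum size trades memory for fewer regenerated filters.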

There are two flavors of security filter caches.

  • Local - This security filter is used only locally for this datasource. All other SharePoint datasources in your SharePoint security trimming query pipeline do not have access to this security filter.

  • Global - This security filter is shared among multiple SharePoint datasources. If groups have already been looked up for LDAP or for a SharePoint site collection by another datasource, they do not have to be looked up again.

To enable security filter caches:

  1. In the Fusion UI, navigate to the SharePoint datasource configuration.

  2. Check the boxes labeled Enable local security filter cache and Enable global security filter cache.

User Group Cache

To make queries significantly faster and also prevent security trimming from failing if SharePoint or LDAP happens to be down, you can enable user group caching.

The biggest bottleneck in security trimming for the SharePoint connector is looking up each user’s groups. Caching user groups means that when a security trimming query is performed, a single query to a Solr collection looks up the user’s LDAP and SharePoint groups, instead of going to the LDAP and SharePoint services to get them.

Every time Fusion crawls the SharePoint datasource, it updates the user group cache.

There are some costs of user group caching:

  • Increased indexing time - Fusion needs to build a user group cache while indexing.

  • Stale user groups - If Fusion does not recrawl the SharePoint datasource often enough, the user group cache can get out of date. The more often Fusion recrawls, the closer it is to a realtime user group lookup.

To enable user group caching:

  1. In the Fusion UI, navigate to the SharePoint datasource configuration.

  2. Check the box labeled Enable User Group Caching in Solr. At crawl time, each user’s LDAP and SharePoint groups will be fetched and stored in a Solr collection.

  3. (Optional) Set User Group Cache Solr Collection Name to the same name for all of your SharePoint datasources (for example, sp_usr_grp). The default is sp_usr_grp_<datasource>, where <datasource> is the ID of your datasource. Several SharePoint datasources can share the same Solr collection, and performing this step prevents multiple collections from being created.
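The naming rule in step 3 can be expressed as a one-line fallback. The helper below is a hypothetical illustration of how the collection name resolves, not a Fusion API.

```python
from typing import Optional


def user_group_cache_collection(datasource_id: str,
                                configured_name: Optional[str] = None) -> str:
    """Return the configured collection name, or the per-datasource default."""
    if configured_name:
        return configured_name
    return f"sp_usr_grp_{datasource_id}"
```

With no name configured, datasources `hr` and `legal` resolve to `sp_usr_grp_hr` and `sp_usr_grp_legal`. Setting `sp_usr_grp` on both makes them share one collection.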

Feature history

| Fusion version | Release date | Features |
|---|---|---|
| 4.0.2 | May 2018 | Security trimming performance improvements. Global cache of security filters shared among all SharePoint datasources. |
| 3.1.5 | April 2018 | Improved logging to help troubleshoot document onboarding. |
| 4.0.0 | February 2018 | Security trimming: support for groups, domain/username. XML parsing updates. Active Directory Federation Services (ADFS) support. |
| 3.1.3 | December 2017 | Multi-domain Active Directory security trimming no longer retries failed crawls by default. |
| 3.1.0 | June 2017 | New configuration parameter, parserRetryCount, which specifies the maximum number of times the configured parser will try getting content before giving up. |
| 3.0.1 | April 2017 | SharePoint 2016 support. |
| 2.4.4 | April 2017 | Changed the default setting for "Maximum Number of Child Elements" from 50,000 to 5,000,000. This helps avoid an error when SharePoint returns a huge number of groups from a single XML element. Added a new system property, org.apache.cxf.stax.maxChildElements, that, if specified, overrides "Maximum Number of Child Elements". |

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
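The difference between the two forms is easiest to see by serializing a value. The sketch below assumes that the API request body is JSON and uses a hypothetical property name, `delimiter`. Neither detail comes from this page.

```python
import json

# What you would type into a UI field: a backslash followed by "t" (two characters).
ui_value = r"\t"

# Serializing the same value for a JSON request body doubles the backslash.
api_body = json.dumps({"delimiter": ui_value})
print(api_body)  # {"delimiter": "\\t"}

# Decoding the body recovers the original two-character value.
assert json.loads(api_body)["delimiter"] == ui_value
```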