SharePoint Online Connector and Datasource Configuration

The SharePoint Online connector retrieves data from cloud-based SharePoint repositories. Authentication requires a SharePoint user who has permission to access SharePoint via the SOAP API. This user must be registered with the SharePoint Online authentication server; it is not necessarily the same as the user in Active Directory or LDAP.

To retrieve data from an on-premises SharePoint installation, see the SharePoint connector.

When crawling, the connector discovers SharePoint contents in the following order: sites, then sub-sites (children). A site may contain:

  • Sub-sites

  • Generic Lists

    • List Items

      • Attachments

  • Document Libraries

    • Folders

    • Documents
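The parent-before-child discovery described above can be illustrated with a small traversal. This is only a sketch: the text states that sites are discovered before their sub-sites, and the queue-based walk, the `discovery_order` function, and the dictionary layout with `name` and `subsites` keys are assumptions made for illustration, not the connector's actual implementation.

```python
from collections import deque


def discovery_order(site):
    """Return site names so that every site appears before its sub-sites.

    `site` is a hypothetical dict: {"name": str, "subsites": [site, ...]}.
    """
    order = []
    queue = deque([site])
    while queue:
        current = queue.popleft()
        order.append(current["name"])
        # Children are queued only after their parent has been visited.
        queue.extend(current.get("subsites", []))
    return order


example = {
    "name": "root",
    "subsites": [
        {"name": "a", "subsites": [{"name": "a1"}]},
        {"name": "b"},
    ],
}
print(discovery_order(example))
```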

When the connector re-crawls a SharePoint repository, each previously crawled URL is accessed before any newly discovered objects, but no order is otherwise guaranteed. The connector caches retrieved parent objects to avoid unnecessary requests. The last modified date of each object is retrieved to determine whether it has changed since the last crawl. If it has changed, a new request is made to retrieve the changes. If it has not changed, the object is skipped and no additional request is made.
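The last-modified check can be sketched as follows. This is a minimal illustration under stated assumptions, not the connector's code: `CrawlState`, `needs_fetch`, `record`, and `recrawl` are hypothetical names, and the state is kept in memory only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class CrawlState:
    """Hypothetical record of the last modified date seen for each URL."""
    last_modified: Dict[str, datetime] = field(default_factory=dict)

    def needs_fetch(self, url: str, modified: datetime) -> bool:
        """Return True if the object is new or has changed since the last crawl."""
        previous = self.last_modified.get(url)
        return previous is None or modified > previous

    def record(self, url: str, modified: datetime) -> None:
        """Remember the last modified date seen for this URL."""
        self.last_modified[url] = modified


def recrawl(state, listing):
    """Return the URLs that would be fetched again, given (url, modified) pairs."""
    fetched = []
    for url, modified in listing:
        if state.needs_fetch(url, modified):
            fetched.append(url)  # a real crawler would request the content here
            state.record(url, modified)
        # Unchanged objects are skipped and trigger no additional request.
    return fetched
```

Running `recrawl` twice with the same listing returns an empty list the second time, because no object has a newer last modified date.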

The connector uses SOAP to connect to and retrieve documents, lists, and other objects for indexing. It does not access a SharePoint site in the same way that a regular user does, and it needs additional privileges to use the SOAP interface to SharePoint.

The connector can be configured to work with Active Directory (AD) or LDAP to retrieve the ACLs for each object, which can then be used for security trimming at query time. To use security trimming to restrict user access to SharePoint objects, the authenticated user must have sufficient privileges to read every document in the system and determine which users can access them. The permissions requirements are explained below.

SharePoint Permissions

SharePoint security trimming restricts access to documents based on user permissions. There are two types of permissions in SharePoint:

  • Site permissions, which are:

    • managed by SharePoint

    • customizable for each site or subsite

    • inherited by subsites as the default permissions

    • grantable to users and groups

  • User permissions, which are:

    • assigned by group membership, when groups have been configured and provided permissions

    • assigned directly to the user

These permissions are stored as ACLs. When the SharePoint server is configured with security trimming set to "true", each document retrieved from SharePoint stores the set of all its ACLs in an acl_ss field.

At search time, the ACLs are used to verify if a user has access to a document. This is configured in a query pipeline with a Security Trimming Query Stage.
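Conceptually, trimming compares the ACL entries stored on a document with the identities of the querying user. The sketch below illustrates that idea only. It assumes access is granted when any of the user's identities appears in `acl_ss`. The ACL value format (`group:sales`, `user:bob`) and the function names are hypothetical, and deny rules are not modelled. In Fusion this check is performed by the Security Trimming Query Stage, not by client code.

```python
def user_can_see(doc, user_acls):
    """Return True if any of the user's ACL entries appears in the document's acl_ss field."""
    doc_acls = doc.get("acl_ss", [])
    return any(acl in user_acls for acl in doc_acls)


def trim(docs, user_acls):
    """Keep only the documents the user is allowed to see."""
    return [doc for doc in docs if user_can_see(doc, user_acls)]


docs = [
    {"id": "1", "acl_ss": ["group:sales"]},
    {"id": "2", "acl_ss": ["user:bob"]},
    {"id": "3"},  # no ACLs stored: hidden from everyone in this sketch
]
visible = trim(docs, {"user:alice", "group:sales"})
print([doc["id"] for doc in visible])
```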

To crawl all the sites and subsites, the authenticated user must belong to the site administrators group. If not, Fusion can still crawl and complete the job, but the crawled data will be limited by the user’s privileges. In addition, a WARNING message will appear in connectors.log indicating that the user is not a site administrator and is therefore unable to get sites from site collections. The message starts with Authorization Error (401).

Required Permissions

The SharePoint datasource must be configured with the name of a user who has sufficient permissions to crawl the entire site. These permissions require use of a custom Permission Policy. The required permissions correspond to the concept of Site Collection Auditor, a permission type which is not the same as a Site Administrator, but requires almost all of the Site Administrator privileges.

You will need to work with your SharePoint administrator to ensure that the account used by Fusion has all of the permissions listed in the following table:

Permission Type          Permission               Description

Site Collection Auditor                           Full Read access for the entire site collection, including reading permissions and configuration data.
List                     View Items               View items in lists and documents in document libraries.
List                     Open Items               View the source of documents with server-side file handlers.
List                     View Versions            View past versions of a list item or document.
Site                     Browse Directories       Enumerate files and folders in a Web site using SharePoint Designer and WebDAV interfaces.
Site                     View Pages               View pages in a Web site.
Site                     Enumerate Permissions    Enumerate permissions on the Web site, list, folder, document, or list item.
Site                     Browse User Information  View information about users of the Web site.
Site                     Use Remote Interfaces    Use SOAP, WebDAV, Client Object Model, or SharePoint Designer interfaces to access the Web site.
Site                     Open                     Open a Web site, list or folder in order to access items inside that container.
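When reviewing an account with your SharePoint administrator, the List and Site permissions above can be treated as a checklist. The sketch below is a hypothetical helper, not part of Fusion or SharePoint: it assumes you have already written down which permissions the account has been granted and reports the ones still missing. The Site Collection Auditor policy is left out because the table lists no individual permission name for it.

```python
# Required permissions, copied from the table above.
REQUIRED_PERMISSIONS = {
    "List": {"View Items", "Open Items", "View Versions"},
    "Site": {
        "Browse Directories",
        "View Pages",
        "Enumerate Permissions",
        "Browse User Information",
        "Use Remote Interfaces",
        "Open",
    },
}


def missing_permissions(granted):
    """Return {permission type: sorted missing names} for a granted mapping.

    `granted` maps a permission type ("List", "Site") to the names granted.
    """
    missing = {}
    for permission_type, names in REQUIRED_PERMISSIONS.items():
        gap = names - set(granted.get(permission_type, ()))
        if gap:
            missing[permission_type] = sorted(gap)
    return missing


granted = {"List": {"View Items", "Open Items", "View Versions"},
           "Site": {"Open", "View Pages"}}
print(missing_permissions(granted))
```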

Troubleshooting Permission Issues

When the connector is configured using a SharePoint username without sufficient privileges, the Fusion connectors log file $FUSION/var/log/connectors/connectors.log contains an error like the following:

crawler.common.sharepoint.exception.SharePointException: Server was unable to process request. ---> Attempted to perform an unauthorized operation.
    at crawler.common.sharepoint.service.BaseService.analyzeResponse(BaseService.java:194) ~[classes/:?]
    at crawler.common.sharepoint.service.SiteDataService.getContentBySiteOrList(SiteDataService.java:169) ~[classes/:?]
    at com.lucidworks.permissions.Main.test1(Main.java:50) [classes/:?]
    at com.lucidworks.permissions.Main.main(Main.java:32) [classes/:?]

This user’s permissions may be sufficient to connect via SOAP and read the documents, but not sufficient to get the ACLs and other associated metadata. This may result in a complete lack of access to documents, or in access to unauthorized documents. Confirm that the configured SharePoint user has the required privileges.
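To spot these problems quickly, the log can be searched for the two messages quoted in this document. The following is a rough sketch: the marker strings come from the text above, while the function names are hypothetical and the log is assumed to be a UTF-8 text file.

```python
# Message fragments quoted in this document.
MARKERS = (
    "Attempted to perform an unauthorized operation",
    "Authorization Error (401)",
)


def find_permission_errors(lines):
    """Yield (line_number, line) for each log line containing a known marker."""
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            yield number, line.rstrip("\n")


def scan_log(path):
    """Read a log file and return all matching (line_number, line) pairs."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return list(find_permission_errors(handle))
```

For example, calling `scan_log` with the full path to connectors.log returns the numbered lines that mention either message.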

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
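The difference comes from JSON string escaping. The short sketch below shows one reading of the tip: the two characters typed in the UI (a backslash followed by t) need a doubled backslash once they are serialized into a JSON request body. The property name `delimiter` is hypothetical and used only for illustration.

```python
import json

# The two-character sequence backslash + "t", as it would be typed in the UI.
ui_value = "\\t"

# "delimiter" is a hypothetical property name, not a documented Fusion setting.
payload = json.dumps({"delimiter": ui_value})

# The JSON text contains a doubled backslash: {"delimiter": "\\t"}
print(payload)

# Decoding the JSON gives back the original two characters.
assert json.loads(payload)["delimiter"] == ui_value
```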