SharePoint Online Connector and Datasource Configuration

The SharePoint Online connector retrieves data from cloud-based SharePoint repositories. Authentication requires a SharePoint user who has permission to access SharePoint via the SOAP API. This user must be registered with the SharePoint Online authentication server; it is not necessarily the same as the user in Active Directory or LDAP.

Note
As of Fusion 4.2.4, an “optimized” version of this connector replaces the previous version. See the 4.2.4 release notes for details about the differences between the optimized connector and the earlier versions. This topic describes the non-optimized connector.

To retrieve data from an on-premises SharePoint installation, see the SharePoint connector.

This connector can access a SharePoint repository running on the following platforms:

  • Microsoft SharePoint 2010

  • Microsoft SharePoint 2013

  • Microsoft SharePoint 2016

See this tutorial about configuring a SharePoint datasource and enabling security trimming.

When crawling, the connector discovers SharePoint contents in the following order: sites, then sub-sites (children). A site may contain:

  • Sub-sites

  • Generic Lists

    • List Items

      • Attachments

  • Document Libraries

    • Folders

    • Documents
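The parent-before-children discovery order can be pictured as a walk over the site tree. The sketch below is only an illustration, not the connector's implementation: the `Site` class and `discovery_order` function are hypothetical names, and the breadth-first order among sibling sub-sites is an assumption, since the text above only states that a site is discovered before its sub-sites.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical in-memory stand-in for a SharePoint site."""
    url: str
    subsites: list = field(default_factory=list)  # child Site objects


def discovery_order(root):
    """Return site URLs so that every site appears before its sub-sites."""
    order = []
    queue = deque([root])
    while queue:
        site = queue.popleft()
        order.append(site.url)       # the site itself is discovered first
        queue.extend(site.subsites)  # its children are queued for later
    return order
```

Lists, document libraries, and their items would hang off each `Site` in the same way; they are left out here to keep the ordering rule visible.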

When the connector re-crawls a SharePoint repository, each previously crawled URL is accessed before any newly discovered objects, but no order is guaranteed. The connector uses a cache to store retrieved parent objects to avoid unnecessary requests. The last modified date of each object is retrieved to determine if it has changed since the last crawl. If it has changed, a new request is made to retrieve the changes. If it has not changed, the object is skipped and no additional request is made.
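The re-crawl decision described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the connector's code: `crawl_state` is a hypothetical dictionary mapping each URL to the last-modified timestamp seen in the previous crawl, and timestamps are any comparable values (for example, epoch seconds).

```python
def needs_refetch(url, last_modified, crawl_state):
    """Decide whether an object must be requested again."""
    previous = crawl_state.get(url)
    if previous is None:
        return True                  # newly discovered object
    return last_modified > previous  # changed since the last crawl


def recrawl(current, crawl_state):
    """Check previously crawled URLs first, then new ones; return URLs to fetch.

    current maps URL -> last-modified timestamp reported by the repository.
    """
    known = [url for url in current if url in crawl_state]
    new = [url for url in current if url not in crawl_state]
    to_fetch = []
    for url in known + new:
        if needs_refetch(url, current[url], crawl_state):
            to_fetch.append(url)
        crawl_state[url] = current[url]  # remember the timestamp for next time
    return to_fetch
```

Unchanged objects never reach `to_fetch`, which mirrors the "skipped and no additional request is made" behavior.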

The connector uses SOAP to connect to and retrieve documents, lists, and other objects for indexing. It does not access a SharePoint site in the same way that a regular user does, and it needs additional privileges to use the SOAP interface to SharePoint.

The connector can be configured to work with Active Directory (AD) or LDAP to retrieve the ACLs for each object, which can then be used for security trimming at query time. In order to use security trimming to restrict user access to SharePoint objects, the authenticated user must have sufficient privileges to read every document in the system and determine which users can access them. The permissions requirements are explained below.

SharePoint Permissions

SharePoint security trimming restricts access to documents based on user permissions. There are two types of permissions in SharePoint:

  • Site permissions, which are:

    • managed by SharePoint

    • customizable for each site or subsite

    • inherited by subsites as the default permissions

    • grantable to users and groups

  • User permissions, which are:

    • assigned by group membership, when groups have been configured and provided permissions

    • assigned directly to the user

These permissions are stored as ACLs. When the SharePoint server is configured with security trimming set to "true", documents retrieved from SharePoint have the set of all ACLs stored in an acl_ss field on each document.

At search time, the ACLs are used to verify if a user has access to a document. This is configured in a query pipeline with a Security Trimming Query Stage.
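Conceptually, the query-time check amounts to intersecting a user's identities with the values in acl_ss. The sketch below illustrates that idea only; it is not how the Security Trimming Query Stage is implemented, and the principal strings (such as "user:alice") and function names are hypothetical.

```python
def is_visible(doc, user_principals):
    """Return True if any of the user's principals appears in the document ACLs."""
    return bool(set(doc.get("acl_ss", [])) & set(user_principals))


def trim(results, user_principals):
    """Keep only the documents the user is allowed to see."""
    return [doc for doc in results if is_visible(doc, user_principals)]
```

For example, a user whose principals are `["user:alice", "group:finance"]` would see a document whose acl_ss contains `"group:finance"`, but not one that lists only `"group:hr"`. A document with no acl_ss values is hidden in this sketch, which is a design choice of the example rather than a documented behavior.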

To crawl all the sites and subsites, the authenticated user must belong to the site administrators group. If not, Fusion can still crawl and complete the job, but the crawled data will be limited by the user’s privileges. In addition, a WARNING message will appear in the connector.log indicating that the user is not a site administrator and is therefore unable to get sites from site collections. The message starts with Authorization Error (401).

Configuring a Non-administrative Crawl Account in SharePoint Online

The steps below describe how to configure a crawl account in SharePoint Online without giving the account administrative access.

Create a Service Account

Log in as a SharePoint administrator, and go to your admin center. Then create the account in one of the following ways:

  • If you are using an on-premises Active Directory synced to SharePoint Online, create an Active Directory account and license that account on SharePoint Online.

  • If you are using SharePoint Online user accounts, add a user named “Lucidworks Fusion Service Account”.

Create the account as User (no administrator access).

Add a Crawl Permissions Level

To create a new permission level, click the gear symbol and go to Site Settings > Site permissions. Select Permission Levels, and click Add a Permission Level. Name the new permission level "Lucidworks Fusion Service Permission", and assign the following site permissions:

  • View Items: View items in lists and documents in document libraries.

  • Open Items: View the source of documents with server-side file handlers.

  • View Versions: View past versions of a list item or document.

  • View Application Pages: View forms, views, and application pages. Enumerate lists.

  • View Web Analytics Data: View reports on Web site usage.

  • Browse Directories: Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces.

  • View Pages: View pages in a Web site.

  • Enumerate Permissions: Enumerate permissions on the Web site, list, folder, document, or list item.

  • Browse User Information: View information about users of the Web site.

  • Use Remote Interfaces: Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site.

  • Open: Allows users to open a Web site, list, or folder in order to access items inside that container.

  • Edit Personal User Information: Allows a user to change his or her own user information, such as adding a picture.
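When reviewing a permission level by hand, it can help to compare what was granted against the permissions listed above. The following sketch does that with a set difference; the permission names are taken from the list above, while the function name and the idea of collecting the granted names into a Python list are assumptions made for illustration.

```python
REQUIRED_PERMISSIONS = {
    "View Items",
    "Open Items",
    "View Versions",
    "View Application Pages",
    "View Web Analytics Data",
    "Browse Directories",
    "View Pages",
    "Enumerate Permissions",
    "Browse User Information",
    "Use Remote Interfaces",
    "Open",
    "Edit Personal User Information",
}


def missing_permissions(granted):
    """Return the required permission names absent from `granted`, sorted."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))
```

An empty result means the permission level covers everything in the list.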

Create a Fusion Crawl Group

For each top-level site you want to be able to crawl, you must create a site permissions group and assign the permissions level you created previously. Go to Site Settings > Site permissions. Click the Create Group symbol and name the new group "Lucidworks Fusion Crawl Accounts". Add the “Lucidworks Fusion Service Account” user, and any other user that you wish to have crawl permissions, to this group.

The “Lucidworks Fusion Service Account” user should now be able to crawl without administrator rights.

Limitations of a Non-administrative Crawl Account in SharePoint Online

There are important limitations to crawling SharePoint Online with a non-administrative account. Only administrators are permitted to list site collections from SharePoint Online. To crawl multiple site collections from your SharePoint Online tenant, you must either:

  1. List the site collections explicitly in the Start Links, or

  2. Provide a SharePoint administrator account when crawling SharePoint Online.
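For the first option, the explicit Start Links can be generated from a list of site collection names. This is a small sketch, assuming the site collections live under the /sites/ managed path as in https://lucidworks.sharepoint.com/sites/sitecol; the `start_links` helper is hypothetical and is not part of Fusion.

```python
def start_links(tenant_url, site_collections):
    """Build one explicit start link per site collection."""
    base = tenant_url.rstrip("/")
    return [f"{base}/sites/{name.strip('/')}" for name in site_collections]
```

Calling `start_links("https://lucidworks.sharepoint.com", ["sitecol"])` yields a single link, `https://lucidworks.sharepoint.com/sites/sitecol`, which can then be entered in the datasource's Start Links. Site collections under a different managed path would need their URLs listed by hand.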

Figure: Non-admin Crawl Permissions (illustrates what information a non-administrator user can crawl).

Note
Although a non-administrator user can be allowed to list sub-sites in a site collection, the user cannot list the site collections of the tenant URL. For example, a non-administrator user may list the sub-sites in https://lucidworks.sharepoint.com/sites/sitecol, such as /sitecol/subsite1 and /sitecol/subsite2. However, only an administrator can list the site collections in https://lucidworks.sharepoint.com.

Troubleshooting Permission Issues

When the connector is configured using a SharePoint username without sufficient privileges, the Fusion connectors log file fusion/4.0.x/var/log/connectors/connectors.log contains an error like the following:

crawler.common.sharepoint.exception.SharePointException: Server was unable to process request. ---> Attempted to perform an unauthorized operation.
    at crawler.common.sharepoint.service.BaseService.analyzeResponse(BaseService.java:194) ~[classes/:?]
    at crawler.common.sharepoint.service.SiteDataService.getContentBySiteOrList(SiteDataService.java:169) ~[classes/:?]
    at com.lucidworks.permissions.Main.test1(Main.java:50) [classes/:?]
    at com.lucidworks.permissions.Main.main(Main.java:32) [classes/:?]

This user’s permissions may be sufficient to connect via SOAP and read the documents, but not sufficient to get the ACLs and other associated metadata. This may result in a complete lack of access to documents, or in access to unauthorized documents. Confirm that the configured SharePoint user has the required privileges.
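To locate such entries quickly, the log can be scanned for the two messages mentioned in this topic. The sketch below is a convenience script rather than a Fusion tool; the function name is hypothetical, and the marker strings are copied from the error and warning text quoted in this topic.

```python
PERMISSION_MARKERS = (
    "Attempted to perform an unauthorized operation",
    "Authorization Error (401)",
)


def find_permission_errors(lines):
    """Return (line_number, text) pairs for lines that hint at missing privileges."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in PERMISSION_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Because the function accepts any iterable of lines, an open file object for the connectors log can be passed to it directly.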

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
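One way to read this tip is that the API takes JSON, and JSON serialization doubles the backslash. The sketch below shows the effect with Python's standard json module; the property name "delimiter" is hypothetical and stands in for whichever configuration key holds the value.

```python
import json

# The value as typed in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# Hypothetical property name; the real key depends on the datasource schema.
payload = json.dumps({"delimiter": ui_value})

# Decoding the request body recovers exactly what was typed in the UI.
round_trip = json.loads(payload)["delimiter"]
```

Here `payload` contains the text `\\t` inside the JSON string, and `round_trip` is again the two-character value `\t`.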