SharePoint and SharePoint Online Connectors

The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.

Platform versions

V1 connectors

Before the release of Fusion 4.2.4, SharePoint and SharePoint Online connectors were offered in platform version V1. For configuration information on the V1 connectors:

V2 connectors

Starting in Fusion 5.1.0, a SharePoint V2 connector is available.

In addition to the features and benefits provided by V1 connectors, V2 connectors offer:

  • Security access control lists (ACLs) that are kept separate from content

  • Support for SSL/TLS security

  • Improved scalability, depending on the connector

    • Jobs can be scaled by simply adding instances of the connector

    • The fetching process supports distributed fetching, allowing many instances to contribute to the same job

  • Connectors can be hosted within Fusion, or can run remotely

    • Hosted connectors are cluster-aware, allowing connectors on separate nodes to become aware of new connectors

    • Remote connectors become clients of Fusion: they run a lightweight process and communicate with Fusion using an efficient messaging format

    • Remote connectors can be located wherever the data is located, which might be required for performance or security and access

  • Google’s fast and efficient framework gRPC is used as the underlying client/server technology

    • Increased flexibility in the way services and their methods are defined

    • HTTP/2 based transport

    • Efficient serialization format for data handling (protocol buffers)

    • Support for bidirectional, multiplexed streams

SharePoint (on-premises)

This connector can access a SharePoint repository running on the following platforms:

  • Microsoft SharePoint 2010

  • Microsoft SharePoint 2013

  • Microsoft SharePoint 2016

  • Microsoft SharePoint 2019

Understanding incremental crawls

After your first successful crawl (one that completes with no errors), all subsequent crawls are "incremental crawls".

Incremental crawls use SharePoint’s Changes API. For each site collection, the connector uses the change token (timestamp) to get all additions, updates, and deletions since the full crawl was started.

If you are crawling an entire SharePoint Web application and a site collection was deleted since the last crawl, then the incremental crawl removes it from your index.
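
The change-token idea described above can be illustrated with a minimal sketch. It does not call the SharePoint Changes API and is not the connector's implementation: the `Change` record, the integer tokens (standing in for timestamps), and the `apply_changes` function are all hypothetical names chosen for illustration. It only shows how additions, updates, and deletions newer than a stored token might be replayed against an index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical change-log entry; not a real SharePoint type."""
    token: int         # assumed to increase over time, like a timestamp
    kind: str          # "add", "update", or "delete"
    item_id: str
    content: str = ""


def apply_changes(index, changes, last_token):
    """Apply every change newer than last_token to index and return the newest token seen."""
    newest = last_token
    for change in sorted(changes, key=lambda c: c.token):
        if change.token <= last_token:
            continue  # already reflected in the index by an earlier crawl
        if change.kind == "delete":
            index.pop(change.item_id, None)
        else:
            # "add" and "update" both store the latest content
            index[change.item_id] = change.content
        newest = change.token
    return newest


# State left behind by an earlier crawl, recorded at token 4 (illustrative data).
index = {"doc-1": "v1", "doc-2": "v1"}
change_log = [
    Change(5, "update", "doc-1", "v2"),
    Change(6, "delete", "doc-2"),
    Change(7, "add", "doc-3", "v1"),
]
new_token = apply_changes(index, change_log, last_token=4)
```

Storing the returned token and passing it to the next run means only newer changes are processed, which is why replaying the same change log a second time leaves the index unchanged.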

Important
If you are filtering on fields, be sure to leave the lw fields in place. These fields are required for successful incremental crawling.

Throttling or rate limiting

SharePoint Online is a cloud API. As such, it necessarily has rate limiting policies, which can be an issue during crawling.

Ideally, a SharePoint Online crawl runs as fast as possible. In practice, rate limiting can prevent this. The SharePoint Online documentation has some important information about this.

This section explains how to identify the errors that indicate that throttling is taking place, and how to adjust your connector’s configuration to help avoid it.

When SharePoint Online performs rate limiting, you may see one of two types of errors in the Log Viewer:

  • 429 - Too many requests

    This is by far the most common rate limiting error you will see in the logs. This is SharePoint Online’s main mechanism to protect itself from service interruptions due to denial-of-service (DoS) attacks.

  • 503 - Server too busy

    This error is less common, but the result is the same.
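
As an illustration of how a client might react to these two status codes, here is a minimal sketch. It is not how the Fusion connector is implemented, and the names `is_throttled` and `retry_delay` and the `base` and `cap` defaults are assumptions made for this example. It treats 429 and 503 as throttling signals, waits for the number of seconds given in the standard HTTP `Retry-After` header when one is present, and otherwise falls back to capped exponential backoff with random jitter.

```python
import random

# Status codes treated as throttling signals in this sketch.
THROTTLE_STATUSES = {429, 503}


def is_throttled(status_code):
    """Return True if the HTTP status code indicates rate limiting."""
    return status_code in THROTTLE_STATUSES


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Return the number of seconds to wait before retrying a throttled request.

    headers is assumed to be a plain dict keyed by the exact header name;
    attempt is the zero-based retry count. base and cap are illustrative defaults.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)  # jitter spreads out simultaneous retries
```

For example, `retry_delay({"Retry-After": "30"}, attempt=0)` returns `30.0`, while a response with no `Retry-After` header on the fourth attempt yields a random delay of at most 8 seconds.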

User permission configuration options

The SharePoint connectors provide a variety of configuration options for accessing SharePoint and SharePoint Online. Permissions settings should follow the principle of least privilege, as described in the Microsoft SharePoint docs:

Follow the principle of least-privileged: Users should have only the permission levels or individual permissions they must have to perform their assigned tasks.

SharePoint

Account type | Account config | Description

Active Directory Service Account | Account is set up as a Site Collection Auditor | Allows you to list all site collections.

Active Directory Service Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. Additionally, noindex tags are ignored. Sites will always be indexed regardless of their noindex settings.

See Configure A SharePoint V1 Optimized Datasource for configuration instructions.

SharePoint Online

Account type | Account config | Description

Full Admin | Azure App Only | Allows you to list all site collections in the tenant.

Full Admin | OAuth App Only | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually.

ADFS Account | Account is set up as a Site Collection Auditor | Allows you to list all site collections if the user is a tenant administrator.

ADFS Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. Use this option if your deployment requires the Lucidworks crawl account to have the fewest privileges possible.

See Configure A SharePoint Online V1 Optimized Datasource for configuration instructions.