SharePoint and SharePoint Online Connectors

The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.

Note
As of Fusion 4.2.4, the V1 platform version of these connectors is replaced by a new platform version, V1 Optimized. See the SharePoint and SharePoint Online connector platform versions for details about the differences between V1 and V1 Optimized platform versions of this connector.

Platform versions

V1 connectors

Before the release of Fusion 4.2.4, SharePoint and SharePoint Online connectors were offered in platform version V1. For configuration information on the V1 connectors:

V1 Optimized connectors

As of Fusion 4.2.4, SharePoint and SharePoint Online connectors are offered in platform version V1 Optimized. The V1 Optimized platform version is designed to replace the V1 platform version, and users are encouraged to upgrade to Fusion 4.2.4 or higher to take advantage of the V1 Optimized benefits. For configuration information on the V1 Optimized connectors:

V2 connectors

Starting in Fusion 5.1.0, a SharePoint V2 connector is available.

In addition to the features and benefits provided by V1 connectors, V2 connectors offer:

  • Security access control lists (ACLs) that are stored separately from content

  • Fusion connectors support SSL/TLS security

  • Improved scalability, depending on the connector

    • Jobs can be scaled by simply adding instances of the connector

    • The fetching process supports distributed fetching, allowing many instances to contribute to the same job

  • Connectors can be hosted within Fusion, or can run remotely

    • Hosted connectors are cluster-aware, allowing connectors on separate nodes to become aware of new connectors

    • Remote connectors become clients of Fusion, run a lightweight process, and communicate with Fusion using an efficient messaging format

    • Remote connectors can be located wherever the data is located, which might be required for performance or security and access

  • Google’s fast and efficient framework gRPC is used as the underlying client/server technology

    • Increased flexibility in the way services and their methods are defined

    • HTTP/2 based transport

    • Efficient serialization format for data handling (protocol buffers)

    • Support for bidirectional, multiplexed streams

Key differences between V1 and V1 Optimized

CSOM REST API

V1 platform version

The V1 platform version uses SOAP API. This API style has been deprecated as of SharePoint 2013.

V1 Optimized platform version

The V1 Optimized platform version uses CSOM REST API. This API style provides a variety of benefits not found with SOAP API:

  • CSOM REST API supports bulk operations for faster crawl operations.

  • CSOM REST API uses traffic decorating and is therefore less susceptible to throttling.

  • CSOM REST API is considerably more efficient, resulting in less data being transferred during crawl operations.

Active Directory Connector for ACLs dependency

V1 platform version

The V1 platform version has a key limitation in regard to LDAP/ActiveDirectory access. In order to look up user group memberships, each SharePoint datasource was required to perform LDAP queries. If multiple SharePoint datasources utilized a single LDAP/ActiveDirectory backend, however, multiple LDAP lookup operations took place unnecessarily, and the user would suffer from excessive LDAP overhead.

V1 Optimized platform version

In Fusion 4.2.4, the Active Directory (AD) Connector for ACLs was introduced.

The SharePoint V1 Optimized connector works in tandem with the AD Connector for ACLs to create a sidecar collection, which is used in graph security trimming queries. As a result, all LDAP/ActiveDirectory operations are fully dependent on the AD Connector for ACLs.

Important
If you are using SharePoint Online, and it’s not backed by Azure Active Directory or Active Directory Federation Services (ADFS), the V1 Optimized connector does not depend on the AD Connector for ACLs.

Changes API

V1 platform version

The V1 platform version does not use the SharePoint Changes API. As a result, the recrawl process must revisit all items in order, and incremental crawls of large SharePoint collections take an excessive amount of time.

V1 Optimized platform version

The V1 Optimized platform version is able to take advantage of the Changes API to perform incremental crawls. The Changes API tracks all additions, updates, and deletions since the previous crawl operation for a collection.

Because only changed items are revisited, incremental crawls are significantly faster.
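
The idea of a change-driven incremental crawl can be illustrated with a minimal Python sketch. It is not the connector's implementation and does not use the SharePoint wire format: the `Change` record, the integer `token`, and the `apply_changes` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One entry in a hypothetical change log (not the SharePoint format)."""
    token: int          # assumed to increase with every change
    kind: str           # "add", "update", or "delete"
    item_id: str
    content: str = ""


def apply_changes(index, changes, last_token):
    """Apply only the changes newer than last_token; return the new token."""
    newest = last_token
    for change in sorted(changes, key=lambda c: c.token):
        if change.token <= last_token:
            continue  # already handled by an earlier crawl
        if change.kind == "delete":
            index.pop(change.item_id, None)
        else:
            # "add" and "update" both overwrite the indexed copy
            index[change.item_id] = change.content
        newest = change.token
    return newest


# State left behind by a previous crawl that ended at token 2.
index = {"a": "old", "b": "keep"}
log = [
    Change(1, "add", "a", "old"),
    Change(2, "add", "b", "keep"),
    Change(3, "update", "a", "new"),
    Change(4, "delete", "b"),
    Change(5, "add", "c", "fresh"),
]
token = apply_changes(index, log, last_token=2)
print(index, token)
```

Only the three entries after token 2 are processed, which is why the cost of a recrawl depends on the number of changes rather than on the size of the collection.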

Graph security trimming

V1 platform version

The security trimming approach used by the V1 platform version had notable drawbacks:

  • LDAP/ActiveDirectory information is stored in an inefficient manner. When a document is fetched for indexing, it returns the users and groups with permission to view the document. However, SharePoint doesn’t explicitly list these users and groups. The security trimming approach requires that all nested LDAP/ActiveDirectory groups be fetched and added to the document ACLs.

    As a result, if the nested LDAP/ActiveDirectory group relationships change, the content is sometimes required to be reindexed despite not changing in SharePoint. This can lead to massive reindexing operations.

  • Each SharePoint datasource requires a separate Solr filter. With the V1 platform version, SharePoint datasources are unable to share the same security filter, even if they are pointing to the same SharePoint farm. This restriction can be severely inefficient.

    In a use case with five SharePoint datasources, for example, five Solr filter queries (fqs) would be required. The more fqs you have, the more work is required from Solr while performing queries, resulting in slower queries. This inefficiency scales with the number of SharePoint datasources, and it is not uncommon to have 30-50 datasources in an application.

  • SharePoint security filters cannot be shared with other connectors. For example, if a SharePoint datasource and an SMB2 datasource are backed by the same ActiveDirectory, you are still required to have an individual security filter for each datasource. Again, this inefficiency scales with the number of datasources you have.

V1 Optimized platform version

Unlike the V1 platform version, the V1 Optimized platform version uses a Solr graph query approach. Advantages include:

  • LDAP/ActiveDirectory information is not stored in nested groups on the content document ACL fields.

  • ACLs in SharePoint content documents are stored in a field. Each SharePoint document that you crawl contains ACLs. As the document is indexed by Fusion, a field is populated with any role assignments attached to the document to ensure only users with appropriate permissions can view it. For example, when performing a security-trimmed query, you supply the username of the user performing the search, and a Solr fq is formed with the values that match the ACL field on each document. The documents that are returned are restricted to what the user is permitted to view.

  • A single filter can perform a security trimming query against datasources backed by the same ActiveDirectory instance. This is not restricted to the SharePoint V1 Optimized connector. Other connectors, such as the SMB2 connector, can use the same filter.

  • Group membership lookups (LDAP queries) are separated from the SharePoint connector. Now, the AD Connector for ACLs is used to create a separate ACL Solr sidecar collection. First, a Solr graph query is performed to obtain a user’s groups and nested groups from the sidecar collection. Then, a join query is used to match the ACL fields on the content documents.

    Note
    This process is performed behind-the-scenes. The V1 Optimized connector uses the security trimming stage like all other connectors.
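
The two-step lookup described above (expand a user's groups, then match them against document ACL fields) can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not Solr query syntax and not the connector's code: the `MEMBER_OF` mapping stands in for the sidecar collection, `DOCS` stands in for content documents, and all names and sample data are hypothetical.

```python
from collections import deque

# Hypothetical sidecar data: principal -> groups it is a direct member of.
MEMBER_OF = {
    "alice": ["engineering"],
    "engineering": ["staff"],
    "staff": [],
    "bob": ["contractors"],
    "contractors": [],
}

# Hypothetical content documents with an ACL field of allowed principals.
DOCS = [
    {"id": "doc1", "acl": ["staff"]},
    {"id": "doc2", "acl": ["engineering"]},
    {"id": "doc3", "acl": ["contractors"]},
    {"id": "doc4", "acl": ["bob", "alice"]},
]


def expand_principals(user, member_of):
    """Breadth-first walk: the user plus all direct and nested groups."""
    seen = {user}
    queue = deque([user])
    while queue:
        current = queue.popleft()
        for group in member_of.get(current, []):
            if group not in seen:  # also guards against membership cycles
                seen.add(group)
                queue.append(group)
    return seen


def visible_docs(user, docs, member_of):
    """Keep documents whose ACL field shares a principal with the user."""
    principals = expand_principals(user, member_of)
    return [doc["id"] for doc in docs if principals.intersection(doc["acl"])]


print(visible_docs("alice", DOCS, MEMBER_OF))
print(visible_docs("bob", DOCS, MEMBER_OF))
```

Because group membership lives only in `MEMBER_OF`, changing a nested group relationship changes the result of the lookup without touching any entry in `DOCS`, which mirrors why content does not need to be reindexed when group relationships change.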

Multiple crawl phases

V1 platform version

The V1 platform version does not support multiple crawl phases.

V1 Optimized platform version

The V1 Optimized platform version performs crawl operations in two phases:

  • Pre-fetch phase - The pre-fetch phase utilizes the CSOM REST API to fetch all relevant metadata in large batches. This creates a pre-fetch database, which is exported for use by the post-fetch phase.

    Note
    The pre-fetch phase does not download the file content of list items. It only fetches the metadata.

  • Post-fetch phase - After the pre-fetch phase has completed, the crawl operation is ready to index documents during the post-fetch phase. The crawl will iterate through all items identified in the pre-fetch phase and index them into the pipeline. If there is file content associated with a pre-fetch list item, that content will be downloaded and parsed using the Fusion parser.
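
The division of work between the two phases can be sketched as follows. This is a simplified Python illustration, not the connector's code: `REMOTE_ITEMS`, `REMOTE_FILES`, `in_batches`, `pre_fetch`, and `post_fetch` are hypothetical stand-ins, and a plain list plays the role of the pre-fetch database.

```python
# Hypothetical stand-ins for SharePoint list items and their file content.
REMOTE_ITEMS = [
    {"id": 1, "title": "Plan", "has_file": True},
    {"id": 2, "title": "Task", "has_file": False},
    {"id": 3, "title": "Report", "has_file": True},
]
REMOTE_FILES = {1: b"plan text", 3: b"report text"}


def in_batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def pre_fetch(remote_items, batch_size):
    """Phase 1: collect metadata only, in batches. No file content is read."""
    database = []
    for batch in in_batches(remote_items, batch_size):
        database.extend(dict(item) for item in batch)
    return database


def post_fetch(prefetch_db, download):
    """Phase 2: walk the pre-fetch database and build documents to index."""
    documents = []
    for item in prefetch_db:
        doc = {"id": item["id"], "title": item["title"]}
        if item["has_file"]:
            # File content is downloaded only now, and only when it exists.
            doc["body"] = download(item["id"]).decode("utf-8")
        documents.append(doc)
    return documents


prefetch_db = pre_fetch(REMOTE_ITEMS, batch_size=2)
documents = post_fetch(prefetch_db, REMOTE_FILES.__getitem__)
print(documents)
```

Here `post_fetch` never talks to the item listing again; it relies entirely on what `pre_fetch` recorded, which is the property the two-phase design depends on.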

SharePoint (on-premises)

This connector can access a SharePoint repository running on the following platforms:

  • Microsoft SharePoint 2010

  • Microsoft SharePoint 2013

  • Microsoft SharePoint 2016

  • Microsoft SharePoint 2019

Understanding incremental crawls

After you have performed your first successful crawl (it successfully completed with no errors), all subsequent crawls are "incremental crawls".

Incremental crawls use SharePoint’s Changes API. For each site collection, this uses the change token (timestamp) to get all additions, updates, and deletions since the full crawl was started.

If you are crawling an entire SharePoint Web application and a site collection was deleted since the last crawl, then the incremental crawl removes it from your index.

Important
If you are filtering on fields, be sure to leave the lw fields in place. These fields are required for successful incremental crawling.

Throttling or rate limiting

SharePoint Online is a cloud API. As such, it necessarily has rate limiting policies, which can be an issue during crawling.

Ideally, you want to have a SharePoint Online crawl that runs as fast as possible. But practically, this is not always possible. The SharePoint Online documentation has some important information about this.

This section explains how to identify the errors that indicate that throttling is taking place, and how to adjust your connector’s configuration to help avoid it.

When SharePoint Online performs rate limiting, you may see one of two types of errors in the Log Viewer:

  • 429 - Too many requests

    This is by far the most common rate limiting error you will see in the logs. This is SharePoint Online’s main mechanism to protect itself from service interruptions due to denial-of-service (DoS) attacks.

  • 503 - Server too busy

    This error is less common, but the result is the same.
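
To make the two status codes concrete, the following Python sketch shows one common way a client can react to them: wait, then retry. It is an illustration only and does not describe how the connector behaves internally. The `send` callable, the `call_with_backoff` function, and its parameters are hypothetical, and the sketch assumes that any `Retry-After` header carries a number of seconds.

```python
import time

THROTTLE_STATUSES = {429, 503}


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until the response is not throttled or attempts run out.

    send is a hypothetical callable returning (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in THROTTLE_STATUSES:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point in sleeping after the final attempt
        # Prefer the server's hint (assumed to be seconds) when present,
        # otherwise fall back to exponential backoff.
        retry_after = headers.get("Retry-After")
        delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
        sleep(delay)
    raise RuntimeError(f"still throttled after {max_attempts} attempts")


# Simulated responses: throttled twice, then successful.
responses = iter([
    (429, {"Retry-After": "3"}, ""),
    (503, {}, ""),
    (200, {}, "ok"),
])
delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep=delays.append` keeps the demonstration instant; with the default `time.sleep`, the same call would pause between attempts.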