- Platform versions
- Key differences
- SharePoint (on-premises)
The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.
Note: As of Fusion 4.2.4, the V1 platform version of these connectors is replaced by a new platform version, V1 Optimized. See the SharePoint and SharePoint Online connector platform versions for details about the differences between V1 and V1 Optimized platform versions of this connector.
Before the release of Fusion 4.2.4, SharePoint and SharePoint Online connectors were offered in platform version V1. For configuration information on the V1 connectors:
V1 Optimized connectors
As of Fusion 4.2.4, SharePoint and SharePoint Online connectors are offered in platform version V1 Optimized. The V1 Optimized platform version is designed to replace the V1 platform version, and users are encouraged to upgrade to Fusion 4.2.4 or higher to take advantage of the V1 Optimized benefits. For configuration information on the V1 Optimized connectors:
CSOM REST API
The V1 platform version uses SOAP API. This API style has been deprecated as of SharePoint 2013.
The V1 Optimized platform version uses CSOM REST API. This API style provides a variety of benefits not found with SOAP API:
CSOM REST API supports bulk operations for faster crawl operations.
CSOM REST API uses traffic decorating and is therefore less susceptible to throttling.
CSOM REST API is considerably more efficient, resulting in less data being transferred during crawl operations.
Active Directory Connector for ACLs dependency
The V1 platform version has a key limitation in regard to LDAP/ActiveDirectory access. In order to look up user group memberships, each SharePoint datasource was required to perform LDAP queries. If multiple SharePoint datasources utilized a single LDAP/ActiveDirectory backend, however, multiple LDAP lookup operations took place unnecessarily, and the user would suffer from excessive LDAP overhead.
In Fusion 4.2.4, the Active Directory (AD) Connector for ACLs was introduced.
The SharePoint V1 Optimized connector works in tandem with the AD Connector for ACLs to create a sidecar collection, which is used in graph security trimming queries. As a result, all LDAP/ActiveDirectory operations are fully dependent on the AD Connector for ACLs.
Note: If you are using SharePoint Online, and it’s not backed by Azure Active Directory or Active Directory Federation Services (ADFS), the V1 Optimized connector does not depend on the AD Connector for ACLs.
The V1 platform version does not use the SharePoint Changes API. As a result, the recrawl process requires all items to be revisited in order. For large SharePoint collections, incremental crawls take an excessive amount of time.
The V1 Optimized platform version is able to take advantage of the Changes API to perform incremental crawls. The Changes API tracks all additions, updates, and deletions since the previous crawl operation for a collection.
Because only changed items are processed, incremental crawls complete significantly faster.
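To illustrate why a change log makes recrawls cheaper than revisiting every item, here is a minimal Python sketch. It is not the connector's code and makes no SharePoint calls. The `Change` record, the integer change token, and `apply_changes` are hypothetical names that model a log of additions, updates, and deletions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    token: int                     # hypothetical, monotonically increasing change token
    kind: str                      # "add", "update", or "delete"
    item_id: str
    content: Optional[str] = None  # absent for deletions


def apply_changes(index: Dict[str, str], changes: List[Change], last_token: int) -> int:
    """Apply only the changes newer than last_token and return the newest token seen."""
    newest = last_token
    for change in sorted(changes, key=lambda c: c.token):
        if change.token <= last_token:
            continue  # already handled by an earlier crawl
        if change.kind == "delete":
            index.pop(change.item_id, None)
        else:
            # "add" and "update" both upsert the item
            index[change.item_id] = change.content or ""
        newest = change.token
    return newest


# State left behind by a previous crawl that stopped at token 2.
index = {"doc1": "v1", "doc2": "v1"}
log = [
    Change(1, "add", "doc1", "v1"),
    Change(2, "add", "doc2", "v1"),
    Change(3, "update", "doc1", "v2"),
    Change(4, "delete", "doc2"),
    Change(5, "add", "doc3", "v1"),
]
token = apply_changes(index, log, last_token=2)
print(index, token)
```

Only the three entries after token 2 are touched. The cost of the recrawl depends on the number of changes, not on the size of the collection.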
Graph security trimming
The security trimming approach used by the V1 platform version had notable drawbacks:
LDAP/ActiveDirectory information is stored in an inefficient manner. When a document is fetched for indexing, it returns the users and groups with permission to view the document. However, SharePoint doesn’t explicitly list these users and groups. The security trimming approach requires that all nested LDAP/ActiveDirectory groups be fetched and added to the document ACLs.
As a result, if the nested LDAP/ActiveDirectory group relationships change, the content is sometimes required to be reindexed despite not changing in SharePoint. This can lead to massive reindexing operations.
Each SharePoint datasource requires a separate Solr filter. With the V1 platform version, SharePoint datasources are unable to share the same security filter, even if they are pointing to the same SharePoint farm. This restriction can be severely inefficient.
In a use case with five SharePoint datasources, for example, five Solr filter queries (fqs) would be required. The more fqs you have, the more work is required from Solr while performing queries, resulting in slower queries. This inefficiency scales with the number of SharePoint datasources, and it is not uncommon to have 30-50 datasources in an application.
SharePoint security filters cannot be shared with other connectors. For example, if a SharePoint datasource and an SMB2 datasource are backed by the same ActiveDirectory, you are still required to have an individual security filter for each datasource. Again, this inefficiency scales with the number of datasources you have.
Unlike the V1 platform version, the V1 Optimized platform version uses a Solr graph query approach. Advantages include:
LDAP/ActiveDirectory information is not stored in nested groups on the content document ACL fields.
ACLs in SharePoint content documents are stored in a field. Each SharePoint document that you crawl contains ACLs. As the document is indexed by Fusion, a field is populated with any role assignments attached to the document to ensure only users with appropriate permissions can view it. For example, when performing a security-trimmed query, you can supply the username of the user performing the search, and a Solr fq is formed with the values that match the ACL field on each document. The documents that are returned are restricted to what the user is permitted to view.
A single filter can perform a security trimming query against datasources backed by the same ActiveDirectory instance. This is not restricted to the SharePoint V1 Optimized connector. Other connectors, such as the SMB2 connector, can use the same filter.
Group membership lookups (LDAP queries) are separated from the SharePoint connector. Now, the AD Connector for ACLs is used to create a separate ACL Solr sidecar collection. First, a Solr graph query is performed to obtain a user’s groups and nested groups from the sidecar collection. Then, a join query is used to match the ACL fields on the content documents.
Note: This process is performed behind the scenes. The V1 Optimized connector uses the security trimming stage like all other connectors.
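The two-step lookup described above (resolve a user's direct and nested groups, then match them against document ACLs) can be sketched in plain Python. This is an in-memory illustration only, not the Solr graph or join query that Fusion issues. The `sidecar` and `docs` structures and both function names are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Set


def expand_groups(user: str, member_of: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk over membership edges to collect direct and nested groups."""
    seen: Set[str] = set()
    queue = deque([user])
    while queue:
        principal = queue.popleft()
        for group in member_of.get(principal, []):
            if group not in seen:  # also guards against cyclic group nesting
                seen.add(group)
                queue.append(group)
    return seen


def visible_docs(user: str, member_of: Dict[str, List[str]],
                 doc_acls: Dict[str, List[str]]) -> List[str]:
    """Return the documents whose ACL names the user or any of the user's groups."""
    principals = expand_groups(user, member_of) | {user}
    return [doc_id for doc_id, acl in doc_acls.items() if principals & set(acl)]


# Hypothetical stand-in for the sidecar collection: principal -> groups it belongs to.
sidecar = {
    "alice": ["engineering"],
    "engineering": ["staff"],
    "bob": ["sales"],
    "sales": ["staff"],
}
# Hypothetical content documents: document id -> ACL field values.
docs = {
    "handbook": ["staff"],
    "design-spec": ["engineering"],
    "forecast": ["sales"],
    "private-note": ["bob"],
}
print(visible_docs("alice", sidecar, docs))
```

Because the ACLs on `docs` name only the directly assigned principals, moving `engineering` under a different parent group changes `sidecar` alone. No content document has to be reindexed.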
Multiple crawl phases
The V1 platform version does not support multiple crawl phases.
The V1 Optimized platform version performs crawl operations in two phases:
Pre-fetch phase - The pre-fetch phase utilizes the CSOM REST API to fetch all relevant metadata in large batches. This creates a pre-fetch database, which is exported for use by the post-fetch phase.
Note: The pre-fetch phase does not download the file content of list items. It only fetches the metadata.
Post-fetch phase - After the pre-fetch phase has completed, the crawl operation is ready to index documents during the post-fetch phase. The crawl will iterate through all items identified in the pre-fetch phase and index them into the pipeline. If there is file content associated with a pre-fetch list item, that content will be downloaded and parsed using the Fusion parser.
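The two phases can be pictured with a small Python sketch. It is a simplified model, not the connector's implementation: the SQLite table layout, the `pre_fetch` and `post_fetch` functions, and the `fake_download` helper are all hypothetical, and no SharePoint API is called.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List, Optional

Metadata = Dict[str, Optional[str]]


def pre_fetch(db: sqlite3.Connection, batches: Iterable[List[Metadata]]) -> None:
    """Phase 1: store item metadata in batches; no file content is downloaded."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, title TEXT, file_url TEXT)"
    )
    for batch in batches:
        db.executemany(
            "INSERT OR REPLACE INTO items VALUES (:id, :title, :file_url)", batch
        )
    db.commit()


def post_fetch(db: sqlite3.Connection, download: Callable[[str], str]) -> List[Dict[str, str]]:
    """Phase 2: walk the stored items and fetch content only where a file exists."""
    docs = []
    rows = db.execute("SELECT id, title, file_url FROM items ORDER BY id")
    for item_id, title, file_url in rows:
        doc = {"id": item_id, "title": title}
        if file_url is not None:
            doc["body"] = download(file_url)
        docs.append(doc)
    return docs


def fake_download(url: str) -> str:
    # Stand-in for downloading and parsing a file.
    return f"content of {url}"


db = sqlite3.connect(":memory:")
pre_fetch(db, [[
    {"id": "1", "title": "Budget", "file_url": "https://example.invalid/budget.xlsx"},
    {"id": "2", "title": "Announcement", "file_url": None},
]])
docs = post_fetch(db, fake_download)
print(docs)
```

Separating the phases means the full list of items is known before any file content is requested.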
This connector can access a SharePoint repository running on the following platforms:
Microsoft SharePoint 2010
Microsoft SharePoint 2013
Microsoft SharePoint 2016
Microsoft SharePoint 2019
Understanding incremental crawls
After you have performed your first successful crawl (one that completed with no errors), all subsequent crawls are "incremental crawls".
Incremental crawls use SharePoint’s Changes API. For each site collection, this uses the change token (timestamp) to get all additions, updates, and deletions since the full crawl was started.
If you are crawling an entire SharePoint Web application and a site collection was deleted since the last crawl, then the incremental crawl removes it from your index.
If you are filtering on fields, be sure to leave the
Throttling or rate limiting
SharePoint Online is a cloud API. As such, it necessarily has rate limiting policies, which can be an issue during crawling.
Ideally, you want to have a SharePoint Online crawl that runs as fast as possible. But practically, this is not always possible. The SharePoint Online documentation has some important information about this.
This section explains how to identify the errors that indicate that throttling is taking place, and how to adjust your connector’s configuration to help avoid it.
When SharePoint Online performs rate limiting, you may see one of two types of errors in the Log Viewer:
429 - Too many requests
This is by far the most common rate limiting error you will see in the logs. This is SharePoint Online’s main mechanism to protect itself from service interruptions due to denial-of-service (DoS) attacks.
503 - Server too busy
This error is less common, but the result is the same.
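For readers unfamiliar with how a client typically reacts to these two status codes, the following Python sketch shows a generic retry-with-backoff loop. It does not describe how the connector handles throttling or which settings it exposes. The `send` callable, the `Response` tuple, and the parameter names are assumptions for illustration; `Retry-After` is the standard HTTP header a server may send alongside a 429 or 503 response.

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status code, headers, body)


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on 429/503, honoring Retry-After when present, else backing off exponentially."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in (429, 503):
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # no retries left
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)           # server-specified wait in seconds
        else:
            delay = base_delay * (2 ** attempt)  # 0.5, 1.0, 2.0, ...
        sleep(delay)
    raise RuntimeError(f"still throttled after {max_attempts} attempts")


# Simulated server: throttles twice, then succeeds.
responses = iter([
    (429, {"Retry-After": "2"}, ""),
    (503, {}, ""),
    (200, {}, "ok"),
])
waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the example instantaneous to run. In real use, the default `time.sleep` would pause between attempts.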