The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.

Platform versions

V1 connectors

The SharePoint V1 connectors were deprecated in Fusion 5.2.

V1 Optimized connectors

The SharePoint V1 Optimized connectors can only be used in Fusion 4.2.

V2 connectors

The SharePoint V2 connector can be used in Fusion 5.1 through Fusion 5.5. This connector is deprecated as of June 19, 2023, and is no longer available in Fusion 5.6 and later. However, the SharePoint Optimized V2 Connector can be used in Fusion 5.6 and later.

Key differences between V1 and V1 Optimized

This section is only relevant to Fusion 4.x and earlier.

CSOM REST API

V1 platform version: The V1 platform version uses the SOAP API, which was deprecated as of SharePoint 2013.

V1 Optimized platform version: The V1 Optimized platform version uses the CSOM REST API, which provides a variety of benefits not found with the SOAP API (a minimal request example follows this list):
  • CSOM REST API supports bulk operations for faster crawl operations.
  • CSOM REST API uses traffic decorating and is therefore less susceptible to throttling.
  • CSOM REST API is considerably more efficient, resulting in less data being transferred during crawl operations.
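For context, a CSOM REST call is a plain HTTP request that returns only the fields asked for, as JSON. The following is a minimal sketch in PowerShell; the site URL and credentials are placeholders, and this is not how Fusion itself issues the calls:

# Minimal sketch of a CSOM REST-style request; site URL and credentials are placeholders.
$siteUrl = "https://lucidworks.sharepoint.local/sites/site1"
$cred    = Get-Credential

# A single round trip returns only the requested fields, as JSON.
Invoke-RestMethod "$siteUrl/_api/web?`$select=Title,LastItemModifiedDate" `
    -Credential $cred `
    -Headers @{ "Accept" = "application/json;odata=verbose" }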

Active Directory Connector for ACLs dependency

V1 platform version: The V1 platform version has a key limitation with regard to LDAP/Active Directory access. To look up user group memberships, each SharePoint datasource had to perform its own LDAP queries. If multiple SharePoint datasources used a single LDAP/Active Directory backend, redundant LDAP lookups took place and users suffered excessive LDAP overhead.

V1 Optimized platform version: Fusion 4.2.4 introduced the Active Directory (AD) Connector for ACLs. The SharePoint V1 Optimized connector works in tandem with the AD Connector for ACLs to create a sidecar collection that is used in graph security trimming queries. As a result, all LDAP/Active Directory operations are handled entirely by the AD Connector for ACLs.
Important: If you are using SharePoint Online and it is not backed by Azure Active Directory or Active Directory Federation Services (ADFS), the V1 Optimized connector does not depend on the AD Connector for ACLs.

Changes API

V1 platform version: The V1 platform version does not use the SharePoint Changes API. As a result, the recrawl process had to revisit every item in order, so incremental crawls of large SharePoint collections took an excessive amount of time.

V1 Optimized platform version: The V1 Optimized platform version takes advantage of the Changes API to perform incremental crawls. The Changes API tracks all additions, updates, and deletions in a collection since the previous crawl operation, which significantly improves incremental crawl speed.

Graph security trimming

V1 platform version: The security trimming approach used by the V1 platform version had notable drawbacks:
  • LDAP/ActiveDirectory information is stored in an inefficient manner. When a document is fetched for indexing, it returns the users and groups with permission to view the document. However, SharePoint does not expand nested groups, so the security trimming approach requires that all nested LDAP/ActiveDirectory groups be fetched and added to the document ACLs.
    As a result, if the nested LDAP/ActiveDirectory group relationships change, the content is sometimes required to be reindexed despite not changing in SharePoint. This can lead to massive reindexing operations.
  • Each SharePoint datasource requires a separate Solr filter. With the V1 platform version, SharePoint datasources cannot share the same security filter, even when they point to the same SharePoint farm. This restriction can be severely inefficient.
    In a use case with five SharePoint datasources, for example, five Solr filter queries (fqs) would be required. The more fqs you have, the more work Solr performs per query, resulting in slower queries. This inefficiency scales with the number of SharePoint datasources, and it is not uncommon to have 30-50 datasources in an application.
  • SharePoint security filters cannot be shared with other connectors. For example, if a SharePoint datasource and an SMB2 datasource are backed by the same ActiveDirectory, you are still required to have an individual security filter for both datasources. Again, this inefficiency scales with the number of datasources you have.
V1 Optimized platform version: Unlike the V1 platform version, the V1 Optimized platform version uses a Solr graph query approach. Advantages include:
  • LDAP/ActiveDirectory information is not stored in nested groups on the content document ACL fields.
  • ACLs in SharePoint content documents are stored in a field. Each SharePoint document that you crawl contains ACLs. As the document is indexed by Fusion, a field is populated with any role assignments attached to the document, ensuring only users with appropriate permissions can view it. For example, when performing a security-trimmed query, you supply the username of the user performing the search, and a Solr fq is formed from the values that match the ACL field on each document. The documents returned are restricted to what that user is permitted to view.
  • A single filter can perform a security trimming query against datasources backed by the same ActiveDirectory instance. This is not restricted to the SharePoint V1 Optimized connector. Other connectors, such as the SMB2 connector, can use the same filter.
  • Group membership lookups (LDAP queries) are separated from the SharePoint connector. The AD Connector for ACLs is used to create a separate ACL Solr sidecar collection. First, a Solr graph query obtains a user’s groups and nested groups from the sidecar collection. Then, a join query matches the ACL fields on the content documents; see the sketch after this list.
    This process is performed behind the scenes. The V1 Optimized connector uses the security trimming stage like all other connectors.
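To make the two-step flow concrete, the following sketch shows the general shape of those Solr requests. The collection names (acl_sidecar, content) and field names (member_of, acl_ss) are hypothetical, and Fusion actually expresses the second step as a single join query behind the scenes rather than two separate requests:

# Illustrative two-step security-trimming flow against Solr.
# Collection and field names are hypothetical; Fusion's real queries differ.
$solr = "http://localhost:8983/solr"

# Step 1: a graph query against the ACL sidecar collection expands a user's
# direct and nested group memberships.
$groups = Invoke-RestMethod "$solr/acl_sidecar/select" -Body @{
    q  = '{!graph from=member_of to=id}id:"jdoe@example.com"'
    fl = "id"
    wt = "json"
}

# Step 2: restrict content documents to those whose ACL field matches any of
# the groups found above (assumes at least one group was returned).
$ids = ($groups.response.docs | ForEach-Object { '"{0}"' -f $_.id }) -join " OR "
Invoke-RestMethod "$solr/content/select" -Body @{
    q  = "*:*"
    fq = "acl_ss:($ids)"
    wt = "json"
}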

Multiple crawl phases

V1 platform version: The V1 platform version does not support multiple crawl phases.

V1 Optimized platform version: The V1 Optimized platform version performs crawl operations in two phases:
  • Pre-fetch phase - This phase:
    • Utilizes the CSOM REST API to fetch all relevant metadata in large batches. This creates a pre-fetch database, which is exported for use by the post-fetch phase.
    • Does not download the file content of list items. It only fetches the metadata.
    • Logs to $FUSION_HOME/var/log/connectors/connectors-classic/connectors-classic.log and $FUSION_HOME/var/log/connectors/connectors-classic/sharepoint-exporter-DSID.log, where DSID is the SharePoint Optimized datasource ID.
    The counters in the data source job status window only increase when the content documents begin to index.
  • Post-fetch phase - After the pre-fetch phase has completed, the crawl operation is ready to index documents during the post-fetch phase. The crawl will iterate through all items identified in the pre-fetch phase and index them into the pipeline. If there is file content associated with a pre-fetch list item, that content will be downloaded and parsed using the Fusion parser.
When a crawl is performed with the V1 Optimized connector, a SharePoint export database file is created. This file contains various metadata related to the SharePoint data; it does not store file contents.

A Java web viewer, sharepoint-exporter.jar, is included to browse the export database file. The web viewer is located in the following directory:
${FUSION_HOME}/apps/connectors/connectors-classic/plugins/{viewer-directory}/assets/sharepoint-exporter/sharepoint-exporter.jar
For example:
  • The SharePoint Optimized connector export viewer utility is located at: ${FUSION_HOME}/apps/connectors/connectors-classic/plugins/lucidworks.sharepoint-optimized/assets/sharepoint-exporter/sharepoint-exporter.jar
  • The SharePoint Online Optimized connector export viewer utility is located at: ${FUSION_HOME}/apps/connectors/connectors-classic/plugins/lucidworks.sharepoint-online-optimized/assets/sharepoint-exporter/sharepoint-exporter.jar
The web viewer is launched with the following arguments:
  • -exportDirectoryPath - The full path to the export database file.
  • -port - The port on which the web viewer server runs. If unassigned, a random port is selected.
java -cp /opt/fusion/latest/apps/connectors/connectors-classic/plugins/{viewer-directory}/assets/sharepoint-exporter/sharepoint-exporter.jar  com.lucidworks.fusion.connector.plugins.sharepoint.exporter.SharepointExportWeb -port 5000 -exportDirectoryPath /opt/fusion/latest/data/connectors/connectors-classic/{export-directory}/example_spo
The files are viewed by navigating the directory in any browser (for example, http://localhost:5000 for the command above).

SharePoint (on-premises)

This connector can access a SharePoint repository running on the following platforms:
  • Microsoft SharePoint 2013
  • Microsoft SharePoint 2016
  • Microsoft SharePoint 2019

Understanding incremental crawls

After your first successful crawl (that is, one that completed with no errors), all subsequent crawls are “incremental crawls”. Incremental crawls use SharePoint’s Changes API. For each site collection, the connector uses the change token (a timestamp) to get all additions, updates, and deletions since the previous crawl started. If the Limit Documents > Fetch all site collections checkbox is selected, you are crawling an entire SharePoint Web application, and a site collection was deleted since the last crawl, then the incremental crawl removes it from your index.
Important: If you are filtering on fields, be sure to leave the lw fields in place. These fields are required for successful incremental crawling.
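For reference, the Changes API is exposed over SharePoint’s REST interface. The following is a minimal sketch of such a call; the site URL and credentials are placeholders, and the connector manages change tokens internally rather than issuing requests like this:

# Minimal sketch of a SharePoint Changes API call; placeholders throughout.
$siteUrl = "https://lucidworks.sharepoint.local/sites/site1"
$cred    = Get-Credential

# REST POSTs require a form digest; /_api/contextinfo returns one.
$ctx = Invoke-RestMethod "$siteUrl/_api/contextinfo" -Method 'POST' -Credential $cred `
    -Headers @{ "Accept" = "application/json;odata=verbose" }
$digest = $ctx.d.GetContextWebInformation.FormDigestValue

# Ask the site collection for additions, updates, and deletions recorded in
# its change log. A real incremental crawl would also pass ChangeTokenStart.
$body = '{"query":{"__metadata":{"type":"SP.ChangeQuery"},"Add":true,"Update":true,"DeleteObject":true,"Item":true,"List":true,"Web":true}}'

$changes = Invoke-RestMethod "$siteUrl/_api/site/getchanges" -Method 'POST' -Credential $cred -Body $body `
    -Headers @{ "Accept" = "application/json;odata=verbose"; "Content-Type" = "application/json;odata=verbose"; "X-RequestDigest" = $digest }
$changes.d.results | Select-Object ChangeType, Time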

Throttling or rate limiting

SharePoint Online is a cloud service and, as such, enforces rate limiting policies that can be an issue during crawling. Ideally, you want a SharePoint Online crawl that runs as fast as possible, but practically this is not always possible. The SharePoint Online documentation has some important information about this.
These instructions describe how to design your crawl so that Microsoft does not throttle you. Use them if you are starting a new deployment or experiencing chronic throttling.

Here are the key parts of the strategy. SharePoint Online is a shared cloud service, so indexing content from SharePoint Online is much slower than crawling SharePoint On-Premises servers: SharePoint Online throttles traffic heavily to keep the SharePoint services responsive and healthy. For further understanding, Microsoft has published a purposely vague article, which does not include a throttling algorithm: How to Avoid Throttling. Based on the information provided, we will discuss the techniques currently used to eliminate throttling. These recommendations are subject to change, as this is an evolving topic.

Each SharePoint datasource requires your company name in the user agent string. Example:
ISV|AcmeInc|Fusion/5.3
This “decorates” your traffic uniquely, so Microsoft knows exactly whom API calls are coming from. Do not use the default user agent string; it will not uniquely identify your company’s traffic.
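For illustration only, this is what a decorated request looks like from PowerShell; the URL is a placeholder and authentication is omitted. In Fusion you set this value in the datasource’s User agent property rather than in code:

# Illustrative only; authentication omitted. Fusion sends the value from the
# datasource's "User agent" property on every request it makes.
Invoke-RestMethod "https://tenant.sharepoint.com/sites/test1/_api/web" `
    -UserAgent "ISV|AcmeInc|Fusion/5.3" `
    -Headers @{ "Accept" = "application/json;odata=verbose" }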

Divide and conquer with multiple service accounts

SharePoint Online rate limits are triggered when multiple threads access SharePoint APIs concurrently; SharePoint sees that sort of traffic and throttles it heavily. To reduce the traffic and throttling:
  • Split SharePoint site collections between datasources.
  • Create multiple SharePoint Online service accounts and spread them across the datasources.
Example - a configured SharePoint Online Optimized connector datasource, SPO_DS_1:
Start links:
https://tenant.sharepoint.com/sites/test1
https://tenant.sharepoint.com/sites/test2
https://tenant.sharepoint.com/sites/test3
https://tenant.sharepoint.com/sites/test4
https://tenant.sharepoint.com/sites/test5

Service account: service.account@tenant.onmicrosoft.com
Fetch Threads: 5
Number of prefetch threads: 10
User agent: ISV|YourCompanyName|Fusion/5.x.x
Crawling begins at great speed; eventually, however, errors are generated in the logs:
Error message: [429 TOO MANY REQUESTS]

Divide and conquer

  1. Create five service accounts, exactly like the first.
  2. Create five datasources. Give each datasource its very own service account and a subset of the SharePoint site collections that are to be crawled.
  3. Set the fetch threads to 1.
Configuration results:
  • SPO_DS_1:
    Start links: https://tenant.sharepoint.com/sites/test1
    Service account: service.account1@tenant.onmicrosoft.com
    Fetch Threads: 1
    Number of prefetch threads: 1
  • SPO_DS_2:
    Start links: https://tenant.sharepoint.com/sites/test2
    Service account: service.account2@tenant.onmicrosoft.com
    Fetch Threads: 1
    Number of prefetch threads: 1
  • SPO_DS_3:
    Start links: https://tenant.sharepoint.com/sites/test3
    Service account: service.account3@tenant.onmicrosoft.com
    Fetch Threads: 1
    Number of prefetch threads: 1
  • SPO_DS_4:
    Start links: https://tenant.sharepoint.com/sites/test4
    Service account: service.account4@tenant.onmicrosoft.com
    Fetch Threads: 1
    Number of prefetch threads: 1
  • SPO_DS_5:
    Start links: https://tenant.sharepoint.com/sites/test5
    Service account: service.account5@tenant.onmicrosoft.com
    Fetch Threads: 1
    Number of prefetch threads: 1
The idea is to reduce rate limiting: Microsoft sees requests coming from five accounts rather than a single account. To prevent multiple SharePoint jobs from running concurrently, chain the jobs. The key is to prevent Fusion from making too many concurrent SharePoint API requests to the SharePoint Online servers. If the crawl is too aggressive, Microsoft will throttle it.

Lucidworks job scheduler to limit concurrent datasources

In addition to limiting how many connections a single datasource makes at once, limit the number of SharePoint Online datasources running concurrently. To do this, use the Lucidworks Job Scheduler to chain SharePoint datasources so they run one at a time, via the “Trigger job after another data source completes” feature. Example using a single datasource at a time:
SPO_DS_1 Schedule: Every day at 06:00:00
SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1
SPO_DS_3 Schedule: Trigger job upon completion of SPO_DS_2
SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3
SPO_DS_5 Schedule: Trigger job upon completion of SPO_DS_4
In the above case, Job 2 (SPO_DS_2) is triggered when Job 1 (SPO_DS_1) completes, and so on. For practical purposes, crawling one datasource at a time may be too slow. To increase the number of concurrent jobs, use MaxConcurrentSharePointJobs. Example using MaxConcurrentSharePointJobs=3:
SPO_DS_1 Schedule: Every day at 06:00:00
SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1
SPO_DS_3 Schedule: Every day at 06:00:00
SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3
SPO_DS_5 Schedule: Every day at 06:00:00
SPO_DS_6 Schedule: Trigger job upon completion of SPO_DS_5
Read Avoid SharePoint throttling to identify the errors that indicate that throttling is taking place, and adjust your connector’s configuration to help avoid it.
  • 429. Too many requests
    This is by far the most common rate limiting error you will see in the logs. It is SharePoint Online’s main mechanism for protecting itself from service interruptions due to denial-of-service (DoS) attacks.
  • 503. Server too busy
    This error is less common, but the result is the same.
These instructions describe what to tweak in the connector if you are seeing 429 and 503 responses. When using a SharePoint connector to crawl SharePoint Online, rate limiting can be an issue; learn more about throttling in SharePoint Online. You have a few options to avoid throttling:

Decrease the number of threads

If you see many 429/503 errors, you are probably hitting SharePoint Online with too many concurrent fetchers. To decrease the number of threads:
  1. Set Crawl Performance > Fetch Threads to a lower value.
  2. Set Crawl Performance > Prefetch Threads to a lower value.

Stagger the datasource jobs

If you have multiple SharePoint Online datasource jobs that run at the same time, use the job scheduler to stagger their schedules instead.

Increase the number of retries

By default, the connector is configured with retries, which gives rate-limited requests a chance to run again. You can increase the number of retries and the interval between retries. The process is called exponential backoff: it gradually increases the delay between retries to improve the chances of a successful retry, which helps prevent missing documents due to rate limiting. For SharePoint Online V1 Optimized, retry configuration parameters include:
  • Retry attempts
  • Retry maximum wait
  • Retryer backoff delay (milliseconds)
  • Retryer backoff max delay (milliseconds)
  • Retryer backoff multiplier (decimal)
For SharePoint Optimized V2, retry configuration parameters include:
  • Retry Delay
  • Maximum Retries
  • Delay Factor
  • Maximum Delay Time
  • Maximum Time Limit
If you are receiving many rate limiting errors, it is likely that too many requests are being sent too frequently, and retrying alone may not help; one option is to decrease your traffic instead. If you want to continue sending the maximum number of requests, configure the Retryer backoff multiplier so the delay grows after every retry. The crawler slows significantly, allowing SharePoint to relax the throttling.
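As a rough illustration of how these parameters interact, the sketch below assumes the common backoff formula delay = base × multiplier^attempt, capped at a maximum. The variable names map loosely onto the V1 Optimized parameters and are not the connector’s internals:

# Illustrative exponential backoff schedule; not the connector's actual code.
$baseDelayMs   = 1000    # cf. "Retryer backoff delay (milliseconds)"
$maxDelayMs    = 60000   # cf. "Retryer backoff max delay (milliseconds)"
$multiplier    = 2.0     # cf. "Retryer backoff multiplier (decimal)"
$retryAttempts = 6       # cf. "Retry attempts"

for ($attempt = 0; $attempt -lt $retryAttempts; $attempt++) {
    $delayMs = [Math]::Min($baseDelayMs * [Math]::Pow($multiplier, $attempt), $maxDelayMs)
    Write-Host ("Attempt {0}: wait {1} ms before retrying" -f ($attempt + 1), $delayMs)
    # A real retry loop would Start-Sleep -Milliseconds $delayMs, then reissue the request.
}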

User permission configuration options

The SharePoint connectors provide a variety of configuration options for accessing SharePoint and SharePoint Online. Permissions settings should follow the principle of least privilege, as described in the Microsoft SharePoint docs:
Follow the principle of least-privileged: Users should have only the permission levels or individual permissions they must have to perform their assigned tasks.

SharePoint

Account type | Account config | Description
Active Directory Service Account | Account is set up as a Site Collection Auditor | Allows you to list all site collections.
Active Directory Service Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application; you must manually list each site collection you want to crawl. Additionally, noindex tags are ignored, so sites are always indexed regardless of their noindex settings.
See the following resources for configuration instructions:

Decide what to crawl

Determine what to crawl and select one of the following:

How to crawl an entire SharePoint Web application

  1. Verify the Limit Documents > Fetch all site collections option is selected (default).
  2. Specify the Web application URL as a site. For example: https://lucidworks.sharepoint.local/
Administrative access to SharePoint is required to crawl an entire SharePoint Web application.

How to crawl a subset of SharePoint site collections

  1. Uncheck the Limit Documents > Fetch all site collections option.
  2. Specify a “Start Link” for each site collection to crawl. Examples include:
    • https://lucidworks.sharepoint.local/sites/site1
    • https://lucidworks.sharepoint.local/sites/site2
    • https://lucidworks.sharepoint.local/sites/site3

How to crawl a specific sub-site, list, or list item:

  1. Uncheck the Limit Documents > Fetch all site collections option.
  2. Specify a “Start Link” for each site collection that contains the item to fetch.
  3. Specify a non-wildcard Inclusive Regular Expression for each parent. For example, if you want to crawl https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite, then you must include inclusive regexes for all parents (a quick way to sanity-check these patterns is shown after this list):
    https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol
    https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite
    https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite
    https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite\/.*
    
    If you exclude a parent item of the site, the connector does not crawl the site because it will not spider down to it during the crawl process.
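As a quick sanity check outside Fusion, you can test the URLs you expect the crawler to encounter against your inclusion patterns. This sketch uses PowerShell’s -match operator with anchored, exact-match semantics; Fusion’s own matching behavior may differ:

# Sanity-check inclusion regexes against URLs the crawler will encounter.
$patterns = @(
    'https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol',
    'https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite',
    'https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite',
    'https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite\/.*'
)

$urls = @(
    'https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite',
    'https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite/Shared Documents'
)

foreach ($url in $urls) {
    $included = $false
    foreach ($p in $patterns) {
        if ($url -match "^$p$") { $included = $true; break }
    }
    Write-Host ("{0} -> included: {1}" -f $url, $included)
}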

Create permission and user policy for the crawl

The options are:
  • Set up an on-prem crawl account with only as much permission as it needs. This approach has the security advantage of granting Fusion minimal access. However, the crawl account cannot retrieve the list of site collections behind a Web application URL.
  • Set up an online crawl account with only as much permission as it needs. This approach has the same security advantage. However, the crawl account cannot access the SharePoint Tenant Admin API to list all the site collections on your tenant; you must enter each site collection to crawl in Start Links.
  • Provide administrative access to crawl.

How to set up an on-prem crawl account

Create a permission policy level

  1. Navigate to Central Administration > Manage web application > Permission Policy.
  2. Select Add permission policy level. In this example, the permission level is named fusion_crawl_policy.
  3. If you need to list all site collections in a SharePoint web application, select the Site Collection Auditor option.
  4. Grant the following permissions:
    List Permissions
    • View Items - View items in lists and documents in document libraries.
    • Open Items - View the source of documents with server-side file handlers.
    • View Versions - View past versions of a list item or document.
    • View Application Pages - View forms, views, and application pages. Enumerate lists.
    Site Permissions
    • Browse Directories - Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces.
    • View Pages - View pages in a Web site.
    • Enumerate Permissions - Enumerate permissions on the Web site, list, folder, document, or list item.
    • Browse User Information - View information about users of the Web site.
    • Use Remote Interfaces - Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site.
    • Open - Allows users to open a Web site, list, or folder in order to access items inside that container.

Grant user permission to the user policy

  1. Navigate to Central Administration > Manage web application > User Policy > Add Users.
  2. Create a new user with the new fusion_crawl_policy permission level selected.

How to set up an online crawl account

Create a permission policy level

  1. Navigate to Site settings > Site permissions > Advanced Permission Settings.
  2. Select New permission level. In this example, the permission level is named fusion_crawl_policy.
  3. Grant the following permissions:
    List Permissions
    • View Items - View items in lists and documents in document libraries.
    • Open Items - View the source of documents with server-side file handlers.
    • View Versions - View past versions of a list item or document.
    • View Application Pages - View forms, views, and application pages. Enumerate lists.
    Site Permissions
    • Browse Directories - Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces.
    • View Pages - View pages in a Web site.
    • Enumerate Permissions - Enumerate permissions on the Web site, list, folder, document, or list item.
    • Browse User Information - View information about users of the Web site.
    • Use Remote Interfaces - Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site.
    • Open - Allows users to open a Web site, list, or folder in order to access items inside that container.

Grant user permission

  1. Navigate to Site settings > Site permissions > Advanced Permission Settings.
  2. Select Grant permissions.
  3. Enter the new user name and add the user.
  4. Select a value in the Select a permission level field.
  5. Select Share.
  6. In the Edit Permissions > Choose Permissions section, select the following check boxes:
    • Read. Can view pages and list items and download documents.
    • LW Fusion.
  7. Select OK to save the information.
If you grant the service account the Site Collection Auditor permission, the Lucidworks Fusion SharePoint connector has write-level permission and can list:
  • Sites in Site Collections
  • SharePoint Site Collection site metadata

How to provide admin access to crawl

See the SharePoint documentation for instructions.

Test user permissions

The following PowerShell script verifies permissions on the user account created to crawl SharePoint from Fusion.
The script must be run by the user account on which the permissions were set. If rights were granted:
  • On your account, you must run the script to verify the user rights are set correctly.
  • On a different user account, the owner of that account must run the script.
  1. Save the script with the following file name: test-sharepoint-permissions.ps1.
  2. Enter the first of the site collection URLs to crawl in the $site_col_url field of the script.
  3. Save the changes.

Permission verification script

$site_col_url="https://your.sharepoint.local/sites/mysitecollection"

$cred = (Get-Credential)

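# If not already defined, compile a .NET callback that ignores TLS certificate
# validation errors (useful for self-signed certificates in test environments).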
if (-not ([System.Management.Automation.PSTypeName]'ServerCertificateValidationCallback').Type)
{
$certCallback = @"
    using System;
    using System.Net;
    using System.Net.Security;
    using System.Security.Cryptography.X509Certificates;
    public class ServerCertificateValidationCallback
    {
        public static void Ignore()
        {
            if(ServicePointManager.ServerCertificateValidationCallback ==null)
            {
                ServicePointManager.ServerCertificateValidationCallback +=
                    delegate
                    (
                        Object obj,
                        X509Certificate certificate,
                        X509Chain chain,
                        SslPolicyErrors errors
                    )
                    {
                        return true;
                    };
            }
        }
    }
"@
    Add-Type $certCallback
 }

[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12;
[ServerCertificateValidationCallback]::Ignore()

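# Build the SOAP headers used to request a form digest token.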
$headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$headers.Add("Content-Type", "text/xml")
$headers.Add("SOAPAction", "http://schemas.microsoft.com/sharepoint/soap/GetUpdatedFormDigestInformation")
$headers.Add("X-RequestForceAuthentication", "true")
$headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f")

$body = "<?xml version=`"1.0`" encoding=`"utf-8`"?>`n<soap:Envelope xmlns:xsi=`"http://www.w3.org/2001/XMLSchema-instance`" xmlns:xsd=`"http://www.w3.org/2001/XMLSchema`" xmlns:soap=`"http://schemas.xmlsoap.org/soap/envelope/`">`n  <soap:Body>`n    <GetUpdatedFormDigestInformation xmlns=`"http://schemas.microsoft.com/sharepoint/soap/`" />`n  </soap:Body>`n</soap:Envelope>"

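# Call the sites.asmx SOAP service to obtain an updated form digest.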
$response = Invoke-RestMethod "${site_col_url}/_vti_bin/sites.asmx" -Method 'POST' -Headers $headers -Body $body -Credential $cred

$digest_value = $response.Envelope.Body.GetUpdatedFormDigestInformationResponse.FirstChild.DigestValue


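# Build headers for the CSOM ProcessQuery call, authenticated with the form digest.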
$headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$headers.Add("Content-Type", "text/xml")
$headers.Add("X-RequestForceAuthentication", "true")
$headers.Add("X-RequestDigest", $digest_value)
$headers.Add("Accept", "application/json")
$headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f")

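# CSOM XML query: enumerate the web's subsites plus its role definitions and role assignments.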
$body = @'
<Request AddExpandoFieldTypeSuffix="true" SchemaVersion="14.0.0.0" LibraryVersion="16.0.0.0"
         ApplicationName=".NET Library" xmlns="http://schemas.microsoft.com/sharepoint/clientquery/2009">
    <Actions>
        <ObjectPath Id="2" ObjectPathId="1"/>
        <ObjectPath Id="4" ObjectPathId="3"/>
        <Query Id="5" ObjectPathId="3">
            <Query SelectAllProperties="false">
                <Properties>
                    <Property Name="Webs" SelectAll="true">
                        <Query SelectAllProperties="false">
                            <Properties/>
                        </Query>
                    </Property>
                    <Property Name="Title" ScalarProperty="true"/>
                    <Property Name="ServerRelativeUrl" ScalarProperty="true"/>
                    <Property Name="RoleDefinitions" SelectAll="true">
                        <Query SelectAllProperties="false">
                            <Properties/>
                        </Query>
                    </Property>
                    <Property Name="RoleAssignments" SelectAll="true">
                        <Query SelectAllProperties="false">
                            <Properties/>
                        </Query>
                    </Property>
                    <Property Name="HasUniqueRoleAssignments" ScalarProperty="true"/>
                    <Property Name="Description" ScalarProperty="true"/>
                    <Property Name="Id" ScalarProperty="true"/>
                    <Property Name="LastItemModifiedDate" ScalarProperty="true"/>
                </Properties>
            </Query>
        </Query>
    </Actions>
    <ObjectPaths>
        <StaticProperty Id="1" TypeId="{3747adcd-a3c3-41b9-bfab-4a64dd2f1e0a}" Name="Current"/>
        <Property Id="3" ParentId="1" Name="Web"/>
    </ObjectPaths>
</Request>
'@

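# Execute the CSOM query and print the full response as JSON.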
$response = Invoke-RestMethod "${site_col_url}/_vti_bin/client.svc/ProcessQuery" -Method 'POST' -Headers $headers -Body $body -Credential $cred
$response | ConvertTo-Json -Depth 100

Successful query response

If the test script executes successfully, metadata is returned. The following is a sample of a successful response:
test-sharepoint-permissions.ps1
cmdlet Get-Credential at command pipeline position 1
Supply values for the following parameters:
[
    {
        "SchemaVersion":  "14.0.0.0",
        "LibraryVersion":  "16.0.10337.12109",
        "ErrorInfo":  null,
        "TraceCorrelationId":  "c419a69f-1c06-b07f-b69b-4d7720fd7756"
    },
    2,
    {
        "IsNull":  false
    },
    4,
    {
        "IsNull":  false
    },
    5,
    {
        "_ObjectType_":  "SP.Web",
        "_ObjectIdentity_":  "c419a69f-1c06-b07f-b69b-4d7720fd7756|740c6a0b-85e2-48a0-a494-e0f1759d4aa7:site:8992a373-cdf0-4262-b240-9527c7174682:web:2080d74c-e181-43df-829f-ad5bee97b6f8",
        "Webs":  {
                     "_ObjectType_":  "SP.WebCollection",
                     "_Child_Items_":  [
                                           {
                                               "_ObjectType_":  "SP.Web",
       ... truncated for brevity ...

        "LastItemModifiedDate":  "\/Date(1603731388000)\/"
    }
]

Failed query response

If the test script fails, either:
  • An error code is generated. For example, an error code 401.
  • An error message with explanatory information is returned. The following is a sample of a failed response:
Credential
Invoke-RestMethod : The remote server returned an error: (401) Unauthorized.
At C:\Users\nicho\Documents\test-sharepoint-permissions.ps1:47 char:13
+ $response = Invoke-RestMethod "${site_col_url}/_vti_bin/sites.asmx" - ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc
   eption
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand

Invoke-RestMethod : The remote server returned an error: (401) Unauthorized.
At C:\Users\nicho\Documents\test-sharepoint-permissions.ps1:100 char:13
+ $response = Invoke-RestMethod "${site_col_url}/_vti_bin/client.svc/Pr ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc
   eption
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand

SharePoint Online

Important: For the V2 connector, when access to SharePoint Online is affected by a Conditional Access Policy (CAP), it is recommended to set a proper user-agent value (depending on the CAP configuration) in the connector configuration (toggle Advanced properties): Requests settings > User agent.
Account type | Account config | Description
Full Admin | Azure App Only | Allows you to list all site collections in the tenant.
Full Admin | OAuth App Only | Does not allow you to list site collections in your SharePoint web application. You must manually list each site collection you want to crawl.
ADFS Account | Account is set up as a Site Collection Auditor | Allows you to list all site collections if the user is a tenant administrator.
ADFS Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application. You must manually list each site collection you want to crawl. Use this option if your deployment requires the Lucidworks crawl account to have the fewest privileges possible.