> ## Documentation Index > Fetch the complete documentation index at: https://doc.lucidworks.com/llms.txt > Use this file to discover all available pages before exploring further. # SharePoint and SharePoint Online Connectors export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => { const [isLoaded, setIsLoaded] = useState(false); useEffect(() => { const timer = setTimeout(() => { setIsLoaded(true); }, 500); return () => clearTimeout(timer); }, []); return

{isLoaded && ` }} />} Powered by Lucidworks Agent Studio

; }; [localhost link]: http://localhost:3000/docs/fusion-connectors/concepts/sharepoint [mintlify link]: https://doc.lucidworks.com/docs/fusion-connectors/concepts/sharepoint [old doc.lw link]: https://doc.lucidworks.com/fusion-connectors/21 The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository. ## Platform versions ### V1 connectors The SharePoint V1 connectors were deprecated in Fusion 5.2. However, the: * [SharePoint V1 connector](/docs/fusion-connectors/connectors/v1/sharepoint) can be used in Fusion 4.x and Fusion 5.1 - Fusion 5.4. * [SharePoint Online V1 connector](/docs/fusion-connectors/connectors/v1/sharepoint-online) can be used in Fusion 4.x and Fusion 5.1 - Fusion 5.3. ### V1 Optimized connectors SharePoint V1 Optimized connectors can only be used in Fusion 4.2. * To retrieve data from on-premises SharePoint installation, [SharePoint V1 Optimized connector and datasource configuration](/docs/fusion-connectors/connectors/v1/sharepoint-v1-optimized) * To retrieve content from cloud-based SharePoint repositories, [SharePoint Online V1 Optimized connector and datasource configuration](/docs/fusion-connectors/connectors/v1/sharepoint-online-v1-optimized) ### V2 connectors The [SharePoint V2 connector](/docs/fusion-connectors/connectors/sharepoint-v2) can be used in Fusion 5.1 - Fusion 5.5. This connector is deprecated as of June 19, 2023 and is no longer available in Fusion 5.6 and later. However, the [SharePoint Optimized V2 Connector](/docs/fusion-connectors/connectors/sharepoint-v2-optimized) can be used in Fusion 5.6 and later. ## Key differences between V1 and V1 Optimized This section is only relevant to Fusion 4.x and earlier. ### CSOM REST API **V1 platform version** The V1 platform version uses SOAP API. This API style was deprecated as of SharePoint 2013. **V1 Optimized platform version** The V1 Optimized platform version uses CSOM REST API. This API style provides a variety of benefits not found with SOAP API: * CSOM REST API supports bulk operations for faster crawl operations. * CSOM REST API uses [traffic decorating](https://docs.microsoft.com/en-us/sharepoint/dev/general-development/how-to-avoid-getting-throttled-or-blocked-in-sharepoint-online#how-to-decorate-your-http-traffic-to-avoid-throttling) and is therefore less susceptible to throttling. * CSOM REST API is considerably more efficient, resulting in less data being transferred during crawl operations. ### Active Directory Connector for ACLs dependency **V1 platform version** The V1 platform version has a key limitation in regard to LDAP/ActiveDirectory access. In order to look up user group memberships, each SharePoint datasource was required to perform LDAP queries. If multiple SharePoint datasources utilized a single LDAP/ActiveDirectory backend, however, multiple LDAP lookup operations took place unnecessarily, and the user would suffer from excessive LDAP overhead. **V1 Optimized platform version** In Fusion 4.2.4, the [Active Directory (AD) Connector for ACLs](/docs/fusion-connectors/concepts/ad-acl) was introduced. The SharePoint V1 Optimized connector works in tangent with the AD Connector for ACLs to create a sidecar collection which is used in [graph security trimming queries](#graph-security-trimming). As a result, all LDAP/ActiveDirectory operations are fully dependent on the AD Connector for ACLs. **Important** If you are using SharePoint Online, and it is *not* backed by Azure Active Directory or Active Directory Federation Services (ADFS), the V1 Optimized connector does not depend on the AD Connector for ACLs. ### Changes API **V1 platform version** The V1 platform version does not use the SharePoint Changes API. As a result, the recrawl process required all items to be revisited in order. For large SharePoint collections, incremental crawls took an excessive amount of time. **V1 Optimized platform version** The V1 Optimized platform version is able to take advantage of the Changes API to perform incremental crawls. The Changes API tracks all additions, updates, and deletions since the previous crawl operation for a collection. This improved crawl operation process significantly improves incremental crawl speed. ### Graph security trimming **V1 platform version** The security trimming approach used by the V1 platform version had notable drawbacks: * **LDAP/ActiveDirectory information is stored in an inefficient manner.** When a document is fetched for indexing, it returns the users and groups with permission to view the document. However, SharePoint does not explicitly list these users and groups. The security trimming approach requires that all nested LDAP/ActiveDirectory groups be fetched and added to the document ACLs.\ As a result, if the nested LDAP/ActiveDirectory group relationships change, the content is sometimes required to be reindexed despite not changing in SharePoint. This can lead to massive reindexing operations. * **Each SharePoint datasource requires a separate Solr filter.** With the V1 platform version, SharePoint datasources are unable to share the same security filter, even if they are pointing to the same SharePoint farm. This restriction can be Severely inefficient.\ In a use case with five SharePoint datasources, for example, five Solr filter queries (fqs) would be required. The more fqs you have, the more work is required from Solr while performing queries, resulting in slower queries. This inefficiency scales with the number of SharePoint datasources, and it is not uncommon to have 30-50 datasources in an application. * **SharePoint security filters cannot be shared with other connectors.** For example, if a SharePoint datasource and an SMB2 datasource are backed by the same ActiveDirectory, you are still required to have an individual security filter for both datasources. Again, this inefficiency scales with the number of datasources you have. **V1 Optimized platform version** Unlike the V1 platform version, the V1 Optimized platform version uses a Solr graphy query approach. Advantages include: * **LDAP/ActiveDirectory information is not stored in nested groups on the content document ACL fields.** * **ACLs in SharePoint content documents are stored in a field.** Each SharePoint document that you crawl contains ACLs. As the document is indexed by Fusion, a field is populated with any role assignments attached to the document to ensure only users with appropriate permissions can view it. For example when doing a security trimmed query, you can input the username that is performing the search, and a Solr fq is formed with the values that match the ACL field on each document. The documents that are returned are restricted to what the user is permitted to view. * **A single filter can perform a security trimming query against datasources backed by the same ActiveDirectory instance.** This is not restricted to the SharePoint V1 Optimized connector. Other connectors, such as the SMB2 connector, can use the same filter. * **Group membership lookups (LDAP queries) are separated from the SharePoint connector.** Now, the [AD Connector for ACLs](#active-directory-connector-for-acls-dependency) is used to create a separate ACL Solr sidecar collection. First, a Solr graph query is performed to obtain a user’s groups and nested groups from the sidecar collection. Then, a join query is used to match the ACL fields on the content documents. This process is performed behind-the-scenes. The V1 Optimized connector uses the security trimming stage like all other connectors. ### Multiple crawl phases **V1 platform version** The V1 platform version does not support multiple crawl phases. **V1 Optimized platform version** The V1 Optimized platform version performs crawl operations in two phases: * **Pre-fetch phase** - This phase: * Utilizes the CSOM REST API to fetch all relevant metadata in large batches. This creates a pre-fetch database, which is exported for use by the post-fetch phase. * Does *not* download the file content of list items. It only fetches the metadata. * Is saved in `$FUSION_HOME/var/log/connectors/connectors-classic/connectors-classic.log` and `$FUSION_HOME/var/log/connectors/connectors-classic/sharepoint-exporter-DSID.log` where `DSID` is the SharePoint optimized datasource ID. The counters in the data source job status window only increase when the content documents begin to index. * **Post-fetch phase** - After the pre-fetch phase has completed, the crawl operation is ready to index documents during the post-fetch phase. The crawl will iterate through all items identified in the pre-fetch phase and index them into the pipeline. If there is file content associated with a pre-fetch list item, that content will be downloaded and parsed using the Fusion parser. The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository. When a crawl is performed with the V1 Optimized connector, a SharePoint export database file is created. This file contains various metadata related to the SharePoint data. It does not store file contents from the SharePoint data. A Java web viewer, `sharepoint-exporter.jar`, is included to browse the export database file. The web viewer is located in the following directory: ```bash wrap theme={"dark"} ${FUSION_HOME}/apps/connectors/connectors-classic/plugins/{viewer-directory}/assets/sharepoint-exporter/sharepoint-exporter.jar ``` For example, the: * SharePoint optimized connector export viewer utility is located at: `${FUSION_HOME}/apps/connectors/connectors-classic/plugins/lucidworks.sharepoint-optimized/assets/sharepoint-exporter/sharepoint-exporter.jar` * SharePoint online optimized connector export viewer utility is located at: `${FUSION_HOME}/apps/connectors/connectors-classic/plugins/lucidworks.sharepoint-online-optimized/assets/sharepoint-exporter/sharepoint-exporter.jar` The web viewer is launched with the following arguments: * `-exportDirectoryPath` - The full path to the export database file. * `-port` - The port which the web viewer server will run on. If unassigned, a random port will be selected. ```bash wrap theme={"dark"} java -cp /opt/fusion/latest/apps/connectors/connectors-classic/plugins/{viewer-directory}/assets/sharepoint-exporter/sharepoint-exporter.jar com.lucidworks.fusion.connector.plugins.sharepoint.exporter.SharepointExportWeb -port 5000 -exportDirectoryPath /opt/fusion/latest/data/connectors/connectors-classic/{export-directory}/example_spo ``` The files are viewed by navigating the directory with any browser. SharePoint V1 Optimized Export Database File Viewer

SharePoint V1 Optimized Export Database File Viewer

## SharePoint (on-premises) This connector can access a SharePoint repository running on the following platforms: * Microsoft SharePoint 2013 * Microsoft SharePoint 2016 * Microsoft SharePoint 2019 ### Understanding incremental crawls After you have performed your first successful crawl (it successfully completed with no errors), all subsequent crawls are "incremental crawls". Incremental crawls use SharePoint’s Changes API. For each site collection, this uses the change token (timestamp) to get all additions, updates, and deletions since the full crawl was started. If the **Limit Documents > Fetch all site collections** checkbox selected, you are crawling an entire SharePoint Web application, and a site collection was deleted since the last crawl, then the incremental crawl removes it from your index. **Important** If you are filtering on fields, be sure to leave the `lw` fields in place. These fields are required for successful incremental crawling. ### Throttling or rate limiting SharePoint Online is a cloud API. As such, it necessarily has rate limiting policies, which can be an issue during crawling. Ideally, you want to have a SharePoint Online crawl that runs as fast as possible. But practically, this is not always possible. The [SharePoint Online documentation](https://docs.microsoft.com/en-us/sharepoint/dev/general-development/how-to-avoid-getting-throttled-or-blocked-in-sharepoint-online) has some important information about this. These instructions are for how to design your crawl so Microsoft will not throttle you. Use this if you are starting a new deployment or getting chronic throttling. Here are the key parts of the strategy: * Identify traffic with a [custom user agent tied to an Azure app](#update-the-user-agent-string-to-link-to-an-azure-app). * [Divide and conquer](#divide-and-conquer-with-multiple-service-accounts) by splitting sites across multiple datasources and multiple service accounts. * Set [very low threads per datasource](#divide-and-conquer), but spread the load horizontally across accounts and datasources. * [Shape concurrency at the job level](#lucidworks-job-scheduler-to-limit-concurrent-datasources) by chaining jobs with the scheduler and controlling how many SharePoint jobs run at once. SharePoint Online is a shared cloud service. Indexing content with SharePoint Online is much slower than crawling SharePoint On-Premises servers. SharePoint Online throttles the traffic heavily, keeping the SharePoint services responsive and healthy. Microsoft has published a purposely vague article, which does not include a throttling algorithm, for further understanding [How to Avoid Throttling](https://docs.microsoft.com/en-us/sharepoint/dev/general-development/how-to-avoid-getting-throttled-or-blocked-in-sharepoint-online). Based on the information provided, we will discuss the techniques currently used to eliminate throttling. These recommendations are subject to change, as this is an *evolving topic*. ### Update the user agent string to link to an Azure app Each SharePoint datasource requires your company name in the user agent string. **Example:** ```bash theme={"dark"} ISV|AcmeInc|Fusion/5.3 ``` This “decorates” traffic uniquely so Microsoft will know exactly who API calls are coming from. ***Do not use*** the default user agent string. It will not *uniquely identify* your company’s traffic. ### Divide and conquer with multiple service accounts SharePoint Online rate limits are due to multiple threads accessing SharePoint APIs concurrently. SharePoint sees that sort of traffic and throttles it heavily. To reduce the traffic and throttling: * Split SharePoint site collections between datasources. * Create multiple SharePoint Online service accounts and spread them across the datasources. **Example** - configured SharePoint online optimized connector datasource **SPO\_DS\_1**: ```bash theme={"dark"} Start links: https://tenant.sharepoint.com/sites/test1 https://tenant.sharepoint.com/sites/test2 https://tenant.sharepoint.com/sites/test3 https://tenant.sharepoint.com/sites/test4 https://tenant.sharepoint.com/sites/test5 Service account: service.account@tenant.onmicrosoft.com Fetch Threads: 5 Number of prefetch threads: 10 User agent: ISV|YourCompanyName|Fusion/5.x.x ``` The crawling will begin with great speed. However, eventually errors are generated in the logs: ```bash theme={"dark"} Error message: [429 TOO MANY REQUESTS] ``` #### Divide and conquer 1. Create five service accounts, exactly like the first. 2. Create five datasources. Give each datasource its very own service account and a subset of the SharePoint site collections that are to be crawled. 3. Set the fetch threads to 1. ##### Configuration results ```bash expandable theme={"dark"} * SPO_DS_1:\ Start links: **https://tenant.sharepoint.com/sites/test1**\ Service account: **\service.account@tenant1.onmicrosoft.com**\ Fetch Threads: **1**\ Number of prefetch threads: **1** * SPO_DS_2:\ Start links: **https://tenant.sharepoint.com/sites/test2**\ Service account: **\service.account@tenant2.onmicrosoft.com**\ Fetch Threads: **1**\ Number of prefetch threads: **1** * SPO_DS_3:\ Start links: **https://tenant.sharepoint.com/sites/test3**\ Service account: **\service.account@tenant3.onmicrosoft.com**\ Fetch Threads: **1**\ Number of prefetch threads: **1** * SPO_DS_4:\ Start links: **https://tenant.sharepoint.com/sites/test4**\ Service account: **\service.account@tenant4.onmicrosoft.com**\ Fetch Threads: **1**\ Number of prefetch threads: **1** * SPO_DS_5:\ Start links: **https://tenant.sharepoint.com/sites/test5**\ Service account: **\service.account@tenant5.onmicrosoft.com**\ Fetch Threads: **1**\ Number of prefetch threads: **1** ``` The idea is to reduce rate limiting. Microsoft will see requests coming from five accounts rather than a single account. To prevent multiple SharePoint jobs from running concurrently, chain jobs. The key is to prevent Fusion from making too many SharePoint API requests to the SharePoint Online servers concurrently. If too aggressive, Microsoft will throttle. ### Lucidworks job scheduler to limit concurrent datasources In addition to limiting a single datasource from making to many connections as once, limit the number of SharePoint Online datasources running concurrently. To avoid too many concurrent datasources running, use the Lucidworks Job Scheduler to chain SharePoint datasources to run one at a time. Use the **“Trigger job after another data source completes”** feature. **Example using a single datasource at a time:** ```bash theme={"dark"} SPO_DS_1 Schedule: Every day at 06:00:00\ SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1\ SPO_DS_3 Schedule: Trigger job upon completion of SPO_DS_2\ SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3\ SPO_DS_5 Schedule: Trigger job upon completion of SPO_DS_4 ``` In the above case Job 2 (`SPO_DS_2`) is triggered when Job 1 (`SPO_DS_1`) completes and so on. For practical purposes one datasource crawling may be too slow. To increase the number of concurrent jobs use `MaxConcurrentSharePointJobs`. **Example using `MaxConcurrentSharePointJobs=3`:** ```bash theme={"dark"} SPO_DS_1 Schedule: Every day at 06:00:00\ SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1\ SPO_DS_3 Schedule: Every day at 06:00:00\ SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3\ SPO_DS_5 Schedule: Every day at 06:00:00\ SPO_DS_6 Schedule: Trigger job upon completion of SPO_DS_5 ``` Read **Avoid SharePoint throttling** to identify the errors that indicate that throttling is taking place, and adjust your connector’s configuration to help avoid it. * `429. Too many requests`\ This is by far the most common rate limiting error you will see in the logs. This is SharePoint Online’s main mechanism to protect itself from service interruptions due to denial-of-service (DOS) attacks. * `503. Server too busy`\ This error is less common, but the result is the same. {/* // https://lucidworks.atlassian.net/wiki/spaces/FPO/pages/2335703183/SharePoint+Online+-+How+to+avoid+throttling+rate+limiting */} These instructions are for what to tweak in the connector if you are seeing `429` and `503` responses. When using a SharePoint connector to crawl SharePoint Online, rate limiting can be an issue. Learn more about [throttling in SharePoint Online](https://learn.microsoft.com/en-us/sharepoint/dev/general-development/how-to-avoid-getting-throttled-or-blocked-in-sharepoint-online). You have a few options to avoid throttling: * [Decrease the number of threads](#decrease-the-number-of-threads) * [Stagger the datasource job schedules](#stagger-the-datasource-jobs) * [Increase the number of retries](#increase-the-number-of-retries) ## Decrease the number of threads If you see many `429`/`503` errors, you are probably hitting SharePoint Online with too many concurrent fetchers. **How to decrease the number of threads** 1. Set **Crawl Performance** > **Fetch Threads** to a lower value. 2. Set **Crawl Performance** > **Prefetch Threads** to a lower value. ## Stagger the datasource jobs If you have multiple SharePoint Online datasource jobs that run at the same time, use the job scheduler to stagger their schedules instead. * [Fusion 4.x Scheduler](/docs/4/fusion-server/concepts/jobs/schedules) * [Fusion 5.x Scheduler](/docs/5/fusion/operations/jobs-and-scheduling/schedules) ## Increase the number of retries By default, the connector is configured with retries. This provides a chance for the requests that were rate-limited to run again. You can increase the number of retries and the interval between retries. The process is called [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff), which gradually increases the delays between retries to increase the chances of a successful retry. This helps prevent missing documents due to rate limiting. For SharePoint Online V1 Optimized, retry configuration parameters include: * Retry attempts * Retry maximum wait * Retryer backoff delay (milliseconds) * Retryer backoff max delay (milliseconds) * Retryer backoff multiplier (decimal) For SharePoint Optimized V2, retry configuration parameters include: * Retry Delay * Maximum Retries * Delay Factor * Maximum Delay Time * Maximum Time Limit When you are receiving too many rate limiting errors, it is likely too many requests are being sent too frequently. Retrying may not help. One option is to decrease your traffic instead. If you want to continue sending the maximum number of requests, configure the **Retryer backoff multiplier** so it gets larger after every retry. The crawler will slow significantly and allow SharePoint to relax the throttling. ## User permission configuration options The SharePoint connectors provide a variety of configuration options for accessing SharePoint and SharePoint Online. Permissions settings should follow the principle of least privilege, as described in the [Microsoft SharePoint docs](https://docs.microsoft.com/en-us/sharepoint/sites/plan-site-permissions-in-sharepoint-server): > Follow the principle of least-privileged: Users should have only the permission levels or individual permissions they must have to perform their assigned tasks. ### SharePoint | Account type | Account config | Description | | -------------------------------- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Active Directory Service Account | Account is set up as a **Site Collection Auditor** | Allows you to list all site collections. | | Active Directory Service Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. Additionally, **noindex** tags are ignored. Sites will always be indexed regardless of their **noindex** settings. | See the following resources for configuration instructions: ## Decide what to crawl Determine what to crawl and select one of the following: * [An entire SharePoint Web application](#how-to-crawl-a-subset-of-sharepoint-site-collections) (all site collections in a specific SharePoint URL). * [A subset of SharePoint site collections](#how-to-crawl-a-subset-of-sharepoint-site-collections). * [A specific sub-site, list, or list item](#how-to-crawl-a-specific-sub-site-list-or-list-item). ### How to crawl an entire SharePoint Web application 1. Verify the **Limit Documents > Fetch all site collections** option is selected (default). 2. Specify the Web application URL as a site. For example: `https://lucidworks.sharepoint.local/` Administrative access to SharePoint is required to crawl an entire SharePoint Web application. ### How to crawl a subset of SharePoint site collections 1. Uncheck the **Limit Documents > Fetch all site collections** option. 2. Specify a "Start Link" for each site collection to crawl. Examples include: * `https://lucidworks.sharepoint.local/sites/site1` * `https://lucidworks.sharepoint.local/sites/site2` * `https://lucidworks.sharepoint.local/sites/site3` ### How to crawl a specific sub-site, list, or list item: 1. Uncheck the **Limit Documents > Fetch all site collections** option. 2. Specify a "Start Link" for each site collection that contains the item to fetch. 3. Specify a non-wildcard **Inclusive Regular Expression** for each parent. For example, if you want to crawl `https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite`, then you must include inclusive regexes for all parents: ``` https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/somesite https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/somesite\/.* ``` If you exclude a parent item of the site, the connector does not crawl the site because it will not spider down to it during the crawl process. ## Create permission and user policy for the crawl The options are: * [Set up an on-prem crawl account](#how-to-set-up-an-on-prem-crawl-account) with only as much permission as it needs. This approach has the security advantage of providing minimal access to Fusion. However, the crawl account cannot retrieve the list of site collections behind a Web application URL. * [Set up an online crawl account](#how-to-set-up-an-online-crawl-account) with only as much permission as it needs. This approach has the security advantage of providing minimal access to Fusion. However, the crawl account cannot retrieve the list of site collections behind a Web application URL. * [Provide administrative access to crawl](#how-to-provide-admin-access-to-crawl) ### How to set up an on-prem crawl account #### Create a permission policy level 1. Navigate to **Central Administration > Manage web application > Permission Policy**. 2. Select **Add permission policy level**. In this example, the permission level is named **fusion\_crawl\_policy**. 3. If you need to list all site collections in a SharePoint web application, select the **Site Collection Auditor** option. Fusion SharePoint Crawl Permissions

4. Grant the following permissions: * **View Items** - View items in lists and documents in document libraries. * **Open Items** - View the source of documents with server-side file handlers. * **View Versions** - View past versions of a list item or document. * **View Application Pages** - View forms, views, and application pages. Enumerate lists. **Site Permissions** * **Browse Directories** - Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces. * **View Pages** - View pages in a Web site. * **Enumerate Permissions** - Enumerate permissions on the Web site, list, folder, document, or list item. * **Browse User Information** - View information about users of the Web site. * **Use Remote Interfaces** - Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site. * **Open** - Allows users to open a Web site, list, or folder in order to access items inside that container. #### Grant user permission to the user policy 1. Navigate to **Central Administration > Manage web application > User Policy > Add Users**. 2. Create a new user with the new **fusion\_crawl\_policy** permission level selected: SharePoint Permission Policy Level

### How to set up an online crawl account #### Create a permission policy level 1. Navigate to **Site settings > Site permissions > Advanced Permission Settings**. 2. Select **New permission level**. In this example, the permission level is named **fusion\_crawl\_policy**. 3. Grant the following permissions: * **View Items** - View items in lists and documents in document libraries. * **Open Items** - View the source of documents with server-side file handlers. * **View Versions** - View past versions of a list item or document. * **View Application Pages** - View forms, views, and application pages. Enumerate lists. **Site Permissions** * **Browse Directories** - Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces. * **View Pages** - View pages in a Web site. * **Enumerate Permissions** - Enumerate permissions on the Web site, list, folder, document, or list item. * **Browse User Information** - View information about users of the Web site. * **Use Remote Interfaces** - Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site. * **Open** - Allows users to open a Web site, list, or folder in order to access items inside that container. #### Grant user permission 1. Navigate to **Site settings > Site permissions > Advanced Permission Settings**. 2. Select **Grant permissions**. 3. Enter the new user name and add the user. 4. Select a value in the **Select a permission level** field. 5. Select **Share**. 6. In the **Edit Permissions > Choose Permissions** section, select the following check boxes: * **Read.** Can view pages and list items and download documents. * **LW Fusion.** 7. Select **OK** to save the information. If you grant the service account the **Site Collection Auditor** permission, the Lucidworks Fusion SharePoint connector has write-level permission and can list: \* Sites in Site Collections \* SharePoint Site Collection site metadata ### How to provide admin access to crawl See [the SharePoint documentation](https://docs.microsoft.com/en-us/sharepoint/sites/change-site-collection-administrators) for instructions. ## Test user permissions The following PowerShell script verifies permissions on the user account created to crawl SharePoint from Fusion. The script *must* be run by the user account on which the permissions were set. If rights were granted: \* On your account, *you* must run the script to verify the user rights are set correctly. \* On a different user account, the owner of that account must run the script. 1. Save the script with following file name: `test-sharepoint-permissions.ps1`. 2. Enter the first of the site collection URLs to crawl in the `$site_col_url` field of the script. 3. Save the changes. ### Permission verification script ```java theme={"dark"} $site_col_url="https://your.sharepoint.local/sites/mysitecollection" $cred = (Get-Credential) if (-not ([System.Management.Automation.PSTypeName]'ServerCertificateValidationCallback').Type) { $certCallback = @" using System; using System.Net; using System.Net.Security; using System.Security.Cryptography.X509Certificates; public class ServerCertificateValidationCallback { public static void Ignore() { if(ServicePointManager.ServerCertificateValidationCallback ==null) { ServicePointManager.ServerCertificateValidationCallback += delegate ( Object obj, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors ) { return true; }; } } } "@ Add-Type $certCallback } [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12; [ServerCertificateValidationCallback]::Ignore() $headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]" $headers.Add("Content-Type", "text/xml") $headers.Add("SOAPAction", "http://schemas.microsoft.com/sharepoint/soap/GetUpdatedFormDigestInformation") $headers.Add("X-RequestForceAuthentication", "true") $headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f") $body = "`n`n `n `n `n" $response = Invoke-RestMethod "${site_col_url}/_vti_bin/sites.asmx" -Method 'POST' -Headers $headers -Body $body -Credential $cred $digest_value = $response.Envelope.Body.GetUpdatedFormDigestInformationResponse.FirstChild.DigestValue $headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]" $headers.Add("Content-Type", "text/xml") $headers.Add("X-RequestForceAuthentication", "true") $headers.Add("X-RequestDigest", $digest_value) $headers.Add("Accept", "application/json") $headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f") $body = @' '@ $response = Invoke-RestMethod "${site_col_url}/_vti_bin/client.svc/ProcessQuery" -Method 'POST' -Headers $headers -Body $body -Credential $cred $response | ConvertTo-Json -Depth 100 ``` ### Successful query response If the test script executes successfully, metadata is returned. The following is a sample of a successful response: ``` test-sharepoint-permissions.ps1 cmdlet Get-Credential at command pipeline position 1 Supply values for the following parameters: [ { "SchemaVersion": "14.0.0.0", "LibraryVersion": "16.0.10337.12109", "ErrorInfo": null, "TraceCorrelationId": "c419a69f-1c06-b07f-b69b-4d7720fd7756" }, 2, { "IsNull": false }, 4, { "IsNull": false }, 5, { "_ObjectType_": "SP.Web", "_ObjectIdentity_": "c419a69f-1c06-b07f-b69b-4d7720fd7756|740c6a0b-85e2-48a0-a494-e0f1759d4aa7:site:8992a373-cdf0-4262-b240-9527c7174682:web:2080d74c-e181-43df-829f-ad5bee97b6f8", "Webs": { "_ObjectType_": "SP.WebCollection", "_Child_Items_": [ { "_ObjectType_": "SP.Web", ... truncated for brevity ... "LastItemModifiedDate": "\/Date(1603731388000)\/" } ] ``` ### Failed query response If the test script fails, either: * An error code is generated. For example, an error code 401. * An error message with explanatory information is returned. The following is a sample of a failed response: ``` Credential Invoke-RestMethod : The remote server returned an error: (401) Unauthorized. At C:\Users\nicho\Documents\test-sharepoint-permissions.ps1:47 char:13 + $response = Invoke-RestMethod "${site_col_url}/_vti_bin/sites.asmx" - ... + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc eption + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand Invoke-RestMethod : The remote server returned an error: (401) Unauthorized. At C:\Users\nicho\Documents\test-sharepoint-permissions.ps1:100 char:13 + $response = Invoke-RestMethod "${site_col_url}/_vti_bin/client.svc/Pr ... + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc eption + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand ``` The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository. {/* // tag::body[] */} {/* // tag::common-1[] */} ## Decide what you need to crawl The first and most important thing to do is determine what you are trying to crawl, and to pick your “Start Links” accordingly. Choose one of the following: * [An entire SharePoint Web application](#how-to-crawl-an-entire-sharepoint-web-application) (all site collections in a specific SharePoint URL). * [A subset of SharePoint site collections](#how-to-crawl-a-subset-of-sharepoint-site-collections). * [A specific sub-site, list, or list item](#how-to-crawl-a-specific-sub-site-list-or-list-item). ### How to crawl an entire SharePoint Web application 1. Leave the **Limit Documents > Fetch all site collections** option checked (as it is by default). 2. Specify the Web application URL as a site. For example: `https://lucidworks.sharepoint.local/` Crawling an entire SharePoint Web application requires administrative access to SharePoint. ### How to crawl a subset of SharePoint site collections 1. Uncheck the **Limit Documents > Fetch all site collections** option. 2. Specify a "Start Link" for each site collection that you want to crawl. Examples: `https://lucidworks.sharepoint.local/sites/site1`, `https://lucidworks.sharepoint.local/sites/site2`, `https://lucidworks.sharepoint.local/sites/site3` ### How to crawl a specific sub-site, list, or list item: 1. Uncheck the **Limit Documents > Fetch all site collections** option. 2. Specify a "Start Link" for each site collection that contains the item you want to fetch. 3. Specify a non-wildcard **Inclusive Regular Expression** for each parent. For example, if you want to crawl `https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite` then you must include inclusive regexes for all parents along the way: ``` https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/somesite https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/somesite\/.* ``` If you exclude a parent item of the site, the connector will not crawl the site because it will never spider down to it during the crawl process. {/* // end::common-1[] */} ## Set up permissions for the crawl {/* // tag::common-2[] */} You have two options here: * [Set up a crawl account](#how-to-set-up-a-crawl-account) with only as much permission as it needs. This approach has the security advantage of providing minimal access to Fusion. However, the crawl account cannot retrieve the list of site collections behind a Web application URL. It cannot access the SharePoint Tenant Admin API to list all the site collections on your tenant. If you use this authentication method, you must enter each site collection to crawl in **Start Links**. * [Provide administrative access to crawl](#how-to-provide-admin-access-to-crawl) {/* // end::common-2[] */} ### How to set up a crawl account #### 1. Create a Lucidworks Fusion crawl permission 1. Navigate to **Central Administration > Manage web application > Permission Policy**. 2. Click **Add permission policy level**. In this example, the permission level is named "fusion\_crawl\_policy". 3. If you need to list all site collections in a SharePoint web application, select the option **Site Collection Auditor**: Fusion SharePoint Crawl Permissions

4. Grant the following permissions: * **View Items** - View items in lists and documents in document libraries. * **Open Items** - View the source of documents with server-side file handlers. * **View Versions** - View past versions of a list item or document. * **View Application Pages** - View forms, views, and application pages. Enumerate lists. **Site Permissions** * **Browse Directories** - Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces. * **View Pages** - View pages in a Web site. * **Enumerate Permissions** - Enumerate permissions on the Web site, list, folder, document, or list item. * **Browse User Information** - View information about users of the Web site. * **Use Remote Interfaces** - Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site. * **Open** - Allows users to open a Web site, list, or folder in order to access items inside that container. #### 2. Grant user permission to the user policy 1. Navigate to **Central Administration > Manage web application > User Policy > Add Users**. 2. Create a new user with the new policy permission level, "fusion\_crawl\_policy", selected: SharePoint Permission Policy Level

### How to provide admin access to crawl See [the SharePoint documentation](https://docs.microsoft.com/en-us/sharepoint/sites/change-site-collection-administrators) for instructions. ## Test user permissions The following PowerShell script verifies permissions on the user account created to crawl SharePoint from Fusion. The script *must* be run by the user account on which the permissions were set. If rights were granted: \* On your account, *you* must run the script to verify the user rights are set correctly. \* On a different user account, the owner of that account must run the script. 1. Save the script with following file name: `test-sharepoint-permissions.ps1`. 2. Enter the first of the site collection URLs to crawl in the `$site_col_url` field of the script. 3. Save the changes. ### Permission verification script ```java theme={"dark"} $site_col_url="https://your.sharepoint.local/sites/mysitecollection" $cred = (Get-Credential) if (-not ([System.Management.Automation.PSTypeName]'ServerCertificateValidationCallback').Type) { $certCallback = @" using System; using System.Net; using System.Net.Security; using System.Security.Cryptography.X509Certificates; public class ServerCertificateValidationCallback { public static void Ignore() { if(ServicePointManager.ServerCertificateValidationCallback ==null) { ServicePointManager.ServerCertificateValidationCallback += delegate ( Object obj, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors ) { return true; }; } } } "@ Add-Type $certCallback } [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12; [ServerCertificateValidationCallback]::Ignore() $headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]" $headers.Add("Content-Type", "text/xml") $headers.Add("SOAPAction", "http://schemas.microsoft.com/sharepoint/soap/GetUpdatedFormDigestInformation") $headers.Add("X-RequestForceAuthentication", "true") $headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f") $body = "`n`n `n `n `n" $response = Invoke-RestMethod "${site_col_url}/_vti_bin/sites.asmx" -Method 'POST' -Headers $headers -Body $body -Credential $cred $digest_value = $response.Envelope.Body.GetUpdatedFormDigestInformationResponse.FirstChild.DigestValue $headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]" $headers.Add("Content-Type", "text/xml") $headers.Add("X-RequestForceAuthentication", "true") $headers.Add("X-RequestDigest", $digest_value) $headers.Add("Accept", "application/json") $headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f") $body = @' '@ $response = Invoke-RestMethod "${site_col_url}/_vti_bin/client.svc/ProcessQuery" -Method 'POST' -Headers $headers -Body $body -Credential $cred $response | ConvertTo-Json -Depth 100 ``` ### Successful query response If the test script executes successfully, metadata is returned. The following is a sample of a successful response: ``` test-sharepoint-permissions.ps1 cmdlet Get-Credential at command pipeline position 1 Supply values for the following parameters: [ { "SchemaVersion": "14.0.0.0", "LibraryVersion": "16.0.10337.12109", "ErrorInfo": null, "TraceCorrelationId": "c419a69f-1c06-b07f-b69b-4d7720fd7756" }, 2, { "IsNull": false }, 4, { "IsNull": false }, 5, { "_ObjectType_": "SP.Web", "_ObjectIdentity_": "c419a69f-1c06-b07f-b69b-4d7720fd7756|740c6a0b-85e2-48a0-a494-e0f1759d4aa7:site:8992a373-cdf0-4262-b240-9527c7174682:web:2080d74c-e181-43df-829f-ad5bee97b6f8", "Webs": { "_ObjectType_": "SP.WebCollection", "_Child_Items_": [ { "_ObjectType_": "SP.Web", ... truncated for brevity ... "LastItemModifiedDate": "\/Date(1603731388000)\/" } ] ``` ### Failed query response If the test script fails, either: * An error code is generated. For example, an error code 401. * An error message with explanatory information is returned. The following is a sample of a failed response: ``` Credential Invoke-RestMethod : The remote server returned an error: (401) Unauthorized. At C:\Users\nicho\Documents\test-sharepoint-permissions.ps1:47 char:13 + $response = Invoke-RestMethod "${site_col_url}/_vti_bin/sites.asmx" - ... + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc eption + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand Invoke-RestMethod : The remote server returned an error: (401) Unauthorized. At C:\Users\nicho\Documents\test-sharepoint-permissions.ps1:100 char:13 + $response = Invoke-RestMethod "${site_col_url}/_vti_bin/client.svc/Pr ... + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc eption + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand ``` {/* // end::body[] */} ### SharePoint Online **Important** For the V2 connector, when the access to SharePoint Online is affected by a Conditional Access Policy (CAP), it’s recommended to set a proper user-agent value (depending on the CAP configuration) in the connector configuration (toggle advanced properties): **Requests settings** > **User agent**. | Account type | Account config | Description | | ------------ | -------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Full Admin | Azure App Only | Allows you to list all site collections in tenant. | | Full Admin | OAuth App Only | **Removed April 2, 2026.** Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. | | ADFS Account | Account is set up as a **Site Collection Auditor** | Allows you to list all site collections *if the user is a tenant administrator*. | | ADFS Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. Use this option if your deployment requires the Lucidworks crawl account to have the fewest privileges possible. |