Crawl SharePoint Online with Minimal Throttling
SharePoint Online is a shared cloud service, and indexing its content is much slower than crawling SharePoint On-Premises servers. SharePoint Online throttles traffic heavily to keep its services responsive and healthy.
Microsoft's guidance on this topic, How to Avoid Throttling, is deliberately vague and does not disclose the throttling algorithm. Based on the information that is available, this article discusses techniques currently used to minimize throttling. These recommendations are subject to change, as this is an evolving topic.
Update the User Agent String to Link to an Azure App
Each SharePoint datasource requires your company name in the user agent string.
Example:
ISV|AcmeInc|Fusion/5.3
This "decorates" your traffic uniquely, so Microsoft knows exactly where the API calls are coming from.
Do not use the default user agent string. It will not uniquely identify your company's traffic.
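The required format can be expressed as a small helper. This is an illustrative sketch, not part of the Fusion connector; the function name is hypothetical, but the ISV|CompanyName|AppName/Version pattern matches the example above:

```python
def build_user_agent(company: str, app: str, version: str) -> str:
    """Build a decorated user agent string in the ISV|Company|App/Version form."""
    for part in (company, app, version):
        if not part or "|" in part:
            raise ValueError("segments must be non-empty and must not contain '|'")
    return f"ISV|{company}|{app}/{version}"

print(build_user_agent("AcmeInc", "Fusion", "5.3"))  # ISV|AcmeInc|Fusion/5.3
```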
Divide and Conquer with Multiple Service Accounts
SharePoint Online rate limiting is triggered when multiple threads access the SharePoint APIs concurrently. SharePoint sees that traffic pattern and throttles it heavily.
To reduce the traffic and throttling:
- Split SharePoint site collections between datasources.
- Create multiple SharePoint Online service accounts and spread them across the datasources.
Example: a SharePoint Online Optimized connector datasource, SPO_DS_1, configured as follows:
Start links:
https://tenant.sharepoint.com/sites/test1
https://tenant.sharepoint.com/sites/test2
https://tenant.sharepoint.com/sites/test3
https://tenant.sharepoint.com/sites/test4
https://tenant.sharepoint.com/sites/test5
Service account: service.account@tenant.onmicrosoft.com
Fetch Threads: 5
Number of prefetch threads: 10
User agent: ISV|YourCompanyName|Fusion/5.x.x
The crawl begins quickly, but eventually throttling errors appear in the logs:
Error message: [429 TOO MANY REQUESTS]
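When 429 responses appear, a client should back off before retrying, preferring the server-suggested Retry-After delay. A minimal sketch, assuming a generic `fetch` callable that returns a status, headers, and body; this is illustrative and not the Fusion connector's actual retry logic:

```python
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0):
    """Call fetch(url) and retry on HTTP 429, honoring Retry-After when present."""
    for attempt in range(max_retries):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        # Prefer the server-suggested delay; fall back to exponential backoff.
        delay = float(headers.get("Retry-After", base_delay * (2 ** attempt)))
        time.sleep(delay)
    raise RuntimeError(f"still throttled after {max_retries} attempts: {url}")
```

Backing off on 429 helps an individual client recover, but the recommendations below aim to avoid triggering throttling in the first place.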
Divide and conquer:
- Create five service accounts, configured exactly like the first.
- Create five datasources. Give each datasource its own service account and a subset of the SharePoint site collections to be crawled.
- Set the fetch threads to 1.
Configuration results:

SPO_DS_1:
Start links: https://tenant.sharepoint.com/sites/test1
Service account: service.account@tenant1.onmicrosoft.com
Fetch Threads: 1
Number of prefetch threads: 1

SPO_DS_2:
Start links: https://tenant.sharepoint.com/sites/test2
Service account: service.account@tenant2.onmicrosoft.com
Fetch Threads: 1
Number of prefetch threads: 1

SPO_DS_3:
Start links: https://tenant.sharepoint.com/sites/test3
Service account: service.account@tenant3.onmicrosoft.com
Fetch Threads: 1
Number of prefetch threads: 1

SPO_DS_4:
Start links: https://tenant.sharepoint.com/sites/test4
Service account: service.account@tenant4.onmicrosoft.com
Fetch Threads: 1
Number of prefetch threads: 1

SPO_DS_5:
Start links: https://tenant.sharepoint.com/sites/test5
Service account: service.account@tenant5.onmicrosoft.com
Fetch Threads: 1
Number of prefetch threads: 1
The idea is to reduce rate limiting: Microsoft sees requests coming from five accounts rather than a single account. To prevent multiple SharePoint jobs from running concurrently, chain the jobs.
The key is to prevent Fusion from making too many concurrent SharePoint API requests to the SharePoint Online servers. If the crawl is too aggressive, Microsoft will throttle it.
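Splitting the start links evenly across the datasources can be sketched with a simple round-robin partition. This helper is purely illustrative; in practice the datasources are configured in the Fusion UI:

```python
def partition_sites(sites, num_datasources):
    """Round-robin site collections into one bucket per datasource/account."""
    buckets = [[] for _ in range(num_datasources)]
    for i, site in enumerate(sites):
        buckets[i % num_datasources].append(site)
    return buckets

sites = [f"https://tenant.sharepoint.com/sites/test{n}" for n in range(1, 6)]
for idx, bucket in enumerate(partition_sites(sites, 5), start=1):
    print(f"SPO_DS_{idx}: {bucket}")
```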
Use the Lucidworks Job Scheduler to Limit Concurrent Datasources
In addition to limiting the number of connections a single datasource makes at once, limit the number of SharePoint Online datasources running concurrently. Use the Lucidworks Job Scheduler to chain SharePoint datasources so they run one at a time.
See Schedule a Job
Use the “Trigger job after another data source completes” feature.
Example: one datasource at a time:
SPO_DS_1 Schedule: Every day at 06:00:00
SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1
SPO_DS_3 Schedule: Trigger job upon completion of SPO_DS_2
SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3
SPO_DS_5 Schedule: Trigger job upon completion of SPO_DS_4
In the above case Job 2 (SPO_DS_2) is triggered when Job 1 (SPO_DS_1) completes and so on.
In practice, crawling with a single datasource at a time may be too slow. To increase the number of concurrent jobs, raise 'MaxConcurrentSharePointJobs'.
Example with 'MaxConcurrentSharePointJobs=3':
SPO_DS_1 Schedule: Every day at 06:00:00
SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1
SPO_DS_3 Schedule: Every day at 06:00:00
SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3
SPO_DS_5 Schedule: Every day at 06:00:00
SPO_DS_6 Schedule: Trigger job upon completion of SPO_DS_5
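The chaining pattern above can be sketched as a small helper that splits an ordered job list into the desired number of chains: the first job in each chain gets a clock schedule, and every later job is triggered by the previous one. The helper and its names are illustrative, not a Fusion API; the actual schedules are configured in the Fusion UI:

```python
def build_chains(jobs, max_concurrent):
    """Map each job name to its schedule, forming max_concurrent chains."""
    size = -(-len(jobs) // max_concurrent)  # ceiling division: jobs per chain
    chains = [jobs[i:i + size] for i in range(0, len(jobs), size)]
    schedule = {}
    for chain in chains:
        for pos, job in enumerate(chain):
            if pos == 0:
                schedule[job] = "Every day at 06:00:00"
            else:
                schedule[job] = f"Trigger job upon completion of {chain[pos - 1]}"
    return schedule

for job, when in build_chains([f"SPO_DS_{i}" for i in range(1, 7)], 3).items():
    print(f"{job} Schedule: {when}")
```

With six jobs and three chains, this reproduces the schedule shown above: three jobs start at 06:00:00 and each of the other three is triggered by its predecessor.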