Fusion 5.11

    Crawl SharePoint Online with Minimal Throttling

    SharePoint Online is a shared cloud service, so indexing its content is much slower than crawling on-premises SharePoint servers: Microsoft throttles traffic heavily to keep the shared SharePoint services responsive and healthy.

    For background, see Microsoft's How to Avoid Throttling article. It is purposely vague and does not disclose the throttling algorithm.

    Based on the information available, this article describes techniques currently used to minimize throttling. These recommendations are subject to change, as this is an evolving topic.

    Each SharePoint datasource requires your company name in the user agent string.

    Example:

    ISV|AcmeInc|Fusion/5.3

    This “decorates” your traffic uniquely, so Microsoft can identify exactly where the API calls are coming from.

    Do not use the default user agent string. It will not uniquely identify your company’s traffic.
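    For illustration, here is how a decorated user agent might be attached to a raw SharePoint REST request. This is a hedged sketch, not part of Fusion: the company name, version, endpoint, and the helper names are placeholders for whatever values apply to your tenant.

```python
import urllib.request


def make_user_agent(company: str, product: str, version: str) -> str:
    """Build a decorated user-agent string in Microsoft's ISV|Company|App/Version format."""
    return f"ISV|{company}|{product}/{version}"


def build_request(site_url: str) -> urllib.request.Request:
    """Create a SharePoint REST request that carries the decorated user agent."""
    return urllib.request.Request(
        f"{site_url}/_api/web/title",
        headers={
            "User-Agent": make_user_agent("AcmeInc", "Fusion", "5.11.0"),
            "Accept": "application/json;odata=verbose",
        },
    )
```

    Fusion sets this header for you once the User agent field is configured on the datasource; the sketch only shows what the decorated traffic looks like on the wire.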

    Divide and Conquer with Multiple Service Accounts

    SharePoint Online rate limiting is typically triggered by many threads calling the SharePoint APIs concurrently under a single account. SharePoint detects that traffic pattern and throttles it heavily.

    To reduce the traffic and throttling:

    • Split SharePoint site collections between datasources.

    • Create multiple SharePoint Online service accounts and spread them across the datasources.

    Example: a SharePoint Online Optimized connector datasource, SPO_DS_1, configured as follows:

    Start links:
    https://tenant.sharepoint.com/sites/test1
    https://tenant.sharepoint.com/sites/test2
    https://tenant.sharepoint.com/sites/test3
    https://tenant.sharepoint.com/sites/test4
    https://tenant.sharepoint.com/sites/test5
    
    Service account: service.account@tenant.onmicrosoft.com
    Fetch Threads: 5
    Number of prefetch threads: 10
    User agent: ISV|YourCompanyName|Fusion/5.x.x

    Crawling begins quickly, but eventually errors like this appear in the logs:

    Error message: [429 TOO MANY REQUESTS]
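    When a 429 arrives, Microsoft's guidance is to wait for the interval in the Retry-After response header before retrying. A minimal sketch of that backoff logic, assuming plain HTTP GETs (the URL, helper names, and retry limit are illustrative, not part of Fusion, which manages its own retries):

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def retry_delay(retry_after: Optional[str], attempt: int) -> int:
    """Seconds to wait after a 429: honor Retry-After if present, else back off exponentially."""
    return int(retry_after) if retry_after is not None else 2 ** attempt


def fetch_with_backoff(url: str, max_retries: int = 5) -> bytes:
    """Retry a throttled GET, sleeping for the server-requested interval on HTTP 429."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            time.sleep(retry_delay(err.headers.get("Retry-After"), attempt))
    raise RuntimeError(f"still throttled after {max_retries} retries: {url}")
```

    Even with correct backoff, sustained throttling means the overall request rate is too high, which is what the divide-and-conquer approach below addresses.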

    Divide and conquer:

    1. Create five service accounts, configured exactly like the first.

    2. Create five datasources, giving each its own service account and a subset of the SharePoint site collections to be crawled.

    3. Set the fetch threads on each datasource to 1.

    Configuration Results:
    • SPO_DS_1:
      Start links: https://tenant.sharepoint.com/sites/test1
      Service account: service.account@tenant1.onmicrosoft.com
      Fetch Threads: 1
      Number of prefetch threads: 1

    • SPO_DS_2:
      Start links: https://tenant.sharepoint.com/sites/test2
      Service account: service.account@tenant2.onmicrosoft.com
      Fetch Threads: 1
      Number of prefetch threads: 1

    • SPO_DS_3:
      Start links: https://tenant.sharepoint.com/sites/test3
      Service account: service.account@tenant3.onmicrosoft.com
      Fetch Threads: 1
      Number of prefetch threads: 1

    • SPO_DS_4:
      Start links: https://tenant.sharepoint.com/sites/test4
      Service account: service.account@tenant4.onmicrosoft.com
      Fetch Threads: 1
      Number of prefetch threads: 1

    • SPO_DS_5:
      Start links: https://tenant.sharepoint.com/sites/test5
      Service account: service.account@tenant5.onmicrosoft.com
      Fetch Threads: 1
      Number of prefetch threads: 1
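    The even split above can be produced mechanically by round-robin assignment. A sketch under the assumption of one datasource per service account (the partition_sites helper is hypothetical, not a Fusion API):

```python
def partition_sites(sites: list, accounts: list) -> dict:
    """Round-robin site collections across service accounts, one datasource per account.

    Each datasource gets a single fetch thread and prefetch thread, matching
    the low-and-slow configuration recommended above.
    """
    datasources = {}
    for i, account in enumerate(accounts):
        datasources[f"SPO_DS_{i + 1}"] = {
            "start_links": sites[i::len(accounts)],  # every len(accounts)-th site
            "service_account": account,
            "fetch_threads": 1,
            "prefetch_threads": 1,
        }
    return datasources
```

    With five sites and five accounts this reproduces the one-site-per-datasource layout shown above; with more sites than accounts, each datasource simply receives several start links.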

    The idea is to reduce rate limiting. Microsoft will see requests coming from five accounts rather than a single account. To prevent multiple SharePoint jobs from running concurrently, chain jobs.

    The key is to prevent Fusion from making too many concurrent API requests to the SharePoint Online servers. If Fusion is too aggressive, Microsoft will throttle the traffic.

    Lucidworks Job Scheduler to Limit Concurrent Datasources

    In addition to limiting how many connections a single datasource makes at once, limit the number of SharePoint Online datasources running concurrently. To do this, use the Lucidworks Job Scheduler to chain the SharePoint datasources so they run one at a time.

    Use the “Trigger job after another data source completes” feature.

    Example single datasource at a time:
    SPO_DS_1 Schedule: Every day at 06:00:00
    SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1
    SPO_DS_3 Schedule: Trigger job upon completion of SPO_DS_2
    SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3
    SPO_DS_5 Schedule: Trigger job upon completion of SPO_DS_4

    In the above case Job 2 (SPO_DS_2) is triggered when Job 1 (SPO_DS_1) completes and so on.

    In practice, crawling one datasource at a time may be too slow. To increase the number of concurrent jobs, use 'MaxConcurrentSharePointJobs'.

    Example 'MaxConcurrentSharePointJobs=3':
    SPO_DS_1 Schedule: Every day at 06:00:00
    SPO_DS_2 Schedule: Trigger job upon completion of SPO_DS_1
    SPO_DS_3 Schedule: Every day at 06:00:00
    SPO_DS_4 Schedule: Trigger job upon completion of SPO_DS_3
    SPO_DS_5 Schedule: Every day at 06:00:00
    SPO_DS_6 Schedule: Trigger job upon completion of SPO_DS_5
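    The interleaved schedule above follows a simple pattern: the first job of each chain starts at a fixed time, and every other job triggers on completion of its predecessor in the chain. A sketch of generating such a schedule (build_schedule is a hypothetical helper, not a Fusion API; the start time is illustrative):

```python
import math


def build_schedule(datasources: list, max_concurrent: int) -> dict:
    """Chain datasources into at most max_concurrent chains that run in parallel.

    Chain heads get a fixed daily start time; each remaining job triggers on
    completion of the previous job in its chain, so no more than
    max_concurrent SharePoint jobs ever run at once.
    """
    chain_len = math.ceil(len(datasources) / max_concurrent)
    schedule = {}
    for i, ds in enumerate(datasources):
        if i % chain_len == 0:
            schedule[ds] = "Every day at 06:00:00"  # chain head
        else:
            schedule[ds] = f"Trigger job upon completion of {datasources[i - 1]}"
    return schedule
```

    Applied to six datasources with max_concurrent=3, this yields the three two-job chains shown in the example above.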