SharePoint Online V1 Optimized Connector and Datasource Configuration
- How to use this connector
- 1. Decide what you need to crawl
- 2. Set up permissions for the crawl
- How to set up a crawl account
- How to provide admin access to crawl
- Understanding incremental crawls
- Throttling or rate limiting
- View the SharePoint export database file
The SharePoint Online V1 Optimized connector retrieves data from cloud-based SharePoint repositories. Authentication requires a SharePoint user who has permission to access SharePoint via the SOAP API. This user must be registered with the SharePoint Online authentication server; it is not necessarily the same as the user in Active Directory or LDAP.
|As of Fusion 4.2.4, the V1 platform version of this connector is replaced by a new platform version, V1 Optimized. See the SharePoint and SharePoint Online connector platform versions for details about the differences between the V1 and V1 Optimized platform versions of this connector.|
To retrieve data from an on-premises SharePoint installation, see the SharePoint V1 Optimized connector.
This connector can access a SharePoint repository running on the following platforms:
Microsoft SharePoint 2010
Microsoft SharePoint 2013
Microsoft SharePoint 2016
Microsoft SharePoint 2019
How to use this connector
1. Decide what you need to crawl
The first and most important thing to do is determine what you are trying to crawl, and to pick your “Start Links” accordingly.
Choose one of the following:
How to crawl an entire SharePoint Web application
Leave the Limit Documents > Fetch all site collections option checked (as it is by default).
Specify the Web application URL as a Start Link.
|Crawling an entire SharePoint Web application requires administrative access to SharePoint.|
How to crawl a subset of SharePoint site collections
Uncheck the Limit Documents > Fetch all site collections option.
Specify a "Start Link" for each site collection that you want to crawl.
How to crawl a specific sub-site, list, or list item
Uncheck the Limit Documents > Fetch all site collections option.
Specify a "Start Link" for each site collection that contains the item you want to fetch.
Specify a non-wildcard Inclusive Regular Expression for each parent.
For example, if you want to crawl https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite, then you must include inclusive regexes for all parents along the way:
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite\/.*
|Important: If you exclude a parent item of the site, the connector will not crawl the site, because it will never spider down to it during the crawl process.|
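If you want to sanity-check a set of inclusive regexes before starting a crawl, you can test them against every URL on the path to your target site using standard Java regex matching. The sketch below uses the example patterns and URLs from above; it is an illustrative check only, and the exact matching semantics the connector applies are an assumption here.
import java.util.List;
import java.util.regex.Pattern;

public class InclusiveRegexCheck {
    public static void main(String[] args) {
        // The inclusive regexes from the example above: one per parent site,
        // plus a wildcard for everything under the target site.
        List<Pattern> inclusive = List.of(
                Pattern.compile("https\\:\\/\\/lucidworks\\.sharepoint\\.local\\/sites\\/mysitecol"),
                Pattern.compile("https\\:\\/\\/lucidworks\\.sharepoint\\.local\\/sites\\/mysitecol\\/myparentsite"),
                Pattern.compile("https\\:\\/\\/lucidworks\\.sharepoint\\.local\\/sites\\/mysitecol\\/myparentsite\\/somesite"),
                Pattern.compile("https\\:\\/\\/lucidworks\\.sharepoint\\.local\\/sites\\/mysitecol\\/myparentsite\\/somesite\\/.*"));

        // Every URL on the path down to the target must match at least one
        // pattern, or the crawler never spiders down to the target site.
        List<String> urls = List.of(
                "https://lucidworks.sharepoint.local/sites/mysitecol",
                "https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite",
                "https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite",
                "https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite/Shared%20Documents/report.docx");

        for (String url : urls) {
            boolean matched = inclusive.stream().anyMatch(p -> p.matcher(url).matches());
            System.out.println((matched ? "crawled: " : "skipped: ") + url);
        }
    }
}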
2. Set up permissions for the crawl
You have two options here, described in the next two sections: set up a dedicated non-administrative crawl account, or provide admin access for the crawl.
How to set up a crawl account
1. Create a service account and license the account (if needed)
If you are crawling SharePoint Online, you may need to assign a license to the crawl account.
Log in as a SharePoint administrator, and go to your admin center.
If you are using an on-premises Active Directory synced to SharePoint Online, create an Active Directory account and license that account in SharePoint Online.
If you are using SharePoint Online user accounts, add a user as the “Lucidworks Fusion service account”.
Add the user as “User (no administrator access)”.
2. Create a Lucidworks Fusion crawl permission
Create a new Crawl permission group by going to Site Settings.
Click Site Permissions.
Click Permission Levels.
Click Add a Permission Level.
Name the new permission level “Lucidworks Fusion Service Permission” and assign these permissions:
View Items - View items in lists and documents in document libraries.
Open Items - View the source of documents with server-side file handlers.
View Versions - View past versions of a list item or document.
View Application Pages - View forms, views, and application pages. Enumerate lists.
View Web Analytics Data - View reports on Web site usage.
Browse Directories - Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces.
View Pages - View pages in a Web site.
Enumerate Permissions - Enumerate permissions on the Web site, list, folder, document, or list item.
Browse User Information - View information about users of the Web site.
Use Remote Interfaces - Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site.
Open - Allows users to open a Web site, list, or folder in order to access items inside that container.
Edit Personal User Information - Allows a user to change his or her own user information, such as adding a picture.
3. Create a Fusion crawl group and assign the crawl service account to it
For each top-level site you want to be able to crawl, go to Site Settings.
Click Site Permissions.
Click Create Group.
Give the group the Lucidworks Fusion Service Permission you created earlier.
Add the service account to the Crawl group.
At this point, the user should be able to crawl without needing administrator rights.
Limitations of crawling SharePoint Online with a non-administrative account
There is one important drawback of crawling SharePoint Online with a non-administrative account: Only SharePoint Online Administrators are allowed to list site collections from SharePoint Online.
So if you want to crawl multiple site collections from your SharePoint Online tenant, you must either
list them in the Start Links explicitly, or
provide a SharePoint administrator account when crawling SharePoint Online.
A non-administrator can be configured to list the sub-sites in a site collection, but a non-administrative user cannot list the site collections given only the tenant URL.
For example, a non-admin user can list the sub-sites in https://lucidworks.sharepoint.com/sites/sitecol (any sub-site under that path), but only an admin can list the site collections in the tenant itself, https://lucidworks.sharepoint.com.
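As a hedged illustration of this distinction, the sketch below lists the sub-sites of one site collection through SharePoint's REST API (the /_api/web/webs endpoint), a call that a non-admin crawl account with the permissions above can make. Token acquisition is out of scope, so the acquireToken() helper is hypothetical; there is no equivalent non-admin call for enumerating site collections at the tenant root.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ListSubSites {
    public static void main(String[] args) throws Exception {
        String siteCollection = "https://lucidworks.sharepoint.com/sites/sitecol";
        String token = acquireToken(); // hypothetical helper; returns an OAuth bearer token

        // A non-admin account can enumerate sub-sites of a site collection it can read.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(siteCollection + "/_api/web/webs?$select=Title,Url"))
                .header("Authorization", "Bearer " + token)
                .header("Accept", "application/json;odata=verbose")
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON listing of sub-site titles and URLs
    }

    private static String acquireToken() {
        // Placeholder: obtain a bearer token for your tenant here.
        throw new UnsupportedOperationException("token acquisition not shown");
    }
}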
How to provide admin access to crawl
You have two options for giving Fusion administrative access to crawl your accounts:
Create a service account with admin access (not recommended).
Create an app-only authentication key. If you choose this approach, you can use either OAuth or the JWT private key option.
Understanding incremental crawls
After your first successful crawl (one that completes with no errors), all subsequent crawls are "incremental crawls".
Incremental crawls use SharePoint’s Changes API. For each site collection, this uses the change token (timestamp) to get all additions, updates, and deletions since the full crawl was started.
If you are crawling an entire SharePoint Web application and a site collection was deleted since the last crawl, then the incremental crawl removes it from your index.
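Conceptually, the incremental bookkeeping looks like the sketch below. Every type and method name here is an illustrative stand-in rather than the connector's actual API; the point is the per-site-collection change token that is saved after each crawl and passed to the Changes API on the next one.
import java.util.List;

public class IncrementalCrawlSketch {

    interface TokenStore {                        // persisted between crawls
        String lastChangeToken(String siteCollectionUrl);
        void saveChangeToken(String siteCollectionUrl, String token);
    }

    interface ChangesApi {                        // stand-in for SharePoint's Changes API
        String currentChangeToken(String siteCollectionUrl);
        List<Change> changesSince(String siteCollectionUrl, String changeToken);
    }

    record Change(String itemUrl, Kind kind) {}
    enum Kind { ADD, UPDATE, DELETE }

    static void incrementalCrawl(String siteCollectionUrl, TokenStore store, ChangesApi api) {
        // Capture the new token first, so changes made while this crawl runs
        // are picked up by the next incremental crawl.
        String nextToken = api.currentChangeToken(siteCollectionUrl);
        String lastToken = store.lastChangeToken(siteCollectionUrl);

        for (Change change : api.changesSince(siteCollectionUrl, lastToken)) {
            switch (change.kind()) {
                case ADD, UPDATE -> reindex(change.itemUrl());     // fetch and index the item
                case DELETE -> removeFromIndex(change.itemUrl());  // drop the item from the index
            }
        }
        store.saveChangeToken(siteCollectionUrl, nextToken);
    }

    static void reindex(String url) { System.out.println("index " + url); }
    static void removeFromIndex(String url) { System.out.println("delete " + url); }
}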
If you are filtering on fields, be sure to leave the
Throttling or rate limiting
SharePoint Online is a cloud API and, as such, enforces rate-limiting policies that can become an issue during crawling.
Ideally, you want to have a SharePoint Online crawl that runs as fast as possible. But practically, this is not always possible. The SharePoint Online documentation has some important information about this.
This section explains how to identify the errors that indicate that throttling is taking place, and how to adjust your connector’s configuration to help avoid it.
When SharePoint Online performs rate limiting, you may see one of two types of errors in the Log Viewer:
429 - Too many requests
This is by far the most common rate-limiting error you will see in the logs. It is SharePoint Online’s main mechanism for protecting itself from service interruptions due to denial-of-service (DoS) attacks.
503 - Server too busy
This error is less common, but the result is the same.
How to avoid throttling
You have a few options to avoid throttling, described below.
Decrease the number of threads
If you see many 503 errors, you are probably hitting SharePoint Online with too many concurrent fetchers.
Set Crawl Performance > Fetch Threads to a lower value.
Set Crawl Performance > Prefetch Threads to a lower value.
Stagger the datasource jobs
If you have multiple SharePoint Online datasource jobs that run at the same time, use the Scheduler to stagger their schedules instead.
Increase the number of retries
By default, the connector is configured to retry failed requests. This gives requests that were rate-limited a chance to run again.
You can increase the number of retries and the interval between retries. This helps prevent missing documents due to rate limiting.
|When you are receiving many rate limiting errors, retrying is unlikely to help. Decrease your traffic instead.|
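For reference, a retry loop with backoff typically looks like the sketch below: honor the Retry-After header when SharePoint Online sends one, and otherwise back off exponentially. This is a generic illustration, not the connector's actual implementation, and it assumes the Retry-After value is given in seconds (it can also be an HTTP date).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ThrottleAwareFetch {

    static HttpResponse<String> fetchWithRetries(HttpClient client, HttpRequest request,
                                                 int maxRetries) throws Exception {
        long backoffMillis = 1_000;               // initial retry interval
        for (int attempt = 0; ; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            int status = response.statusCode();
            if (status != 429 && status != 503) {
                return response;                  // not throttled
            }
            if (attempt >= maxRetries) {
                throw new IllegalStateException("still throttled after " + maxRetries + " retries");
            }
            // Prefer the server's Retry-After hint (assumed to be in seconds).
            long waitMillis = response.headers().firstValue("Retry-After")
                    .map(s -> Long.parseLong(s.trim()) * 1_000)
                    .orElse(backoffMillis);
            Thread.sleep(waitMillis);
            backoffMillis *= 2;                   // exponential backoff
        }
    }

    public static void main(String[] args) throws Exception {
        // Unauthenticated example request; a real call needs an Authorization header.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://lucidworks.sharepoint.com/sites/sitecol/_api/web"))
                .header("Accept", "application/json;odata=verbose")
                .GET().build();
        System.out.println(fetchWithRetries(HttpClient.newHttpClient(), request, 5).statusCode());
    }
}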
View the SharePoint export database file
When a crawl is performed with the V1 Optimized connector, a SharePoint export database file is created. This file contains various metadata related to the SharePoint data. It does not store file contents from the SharePoint data.
A Java web viewer, sharepoint-exporter.jar, is included to browse the export database file. The viewer is located under the connector plugin's assets/sharepoint-exporter directory, as shown in the example command below.
The web viewer is launched with the following arguments:
-exportDirectoryPath: The full path to the export database file.
-port: The port on which the web viewer server runs. If unassigned, a random port is selected.
java -cp /opt/fusion/4.2/apps/connectors/connectors-classic/plugins/lucidworks.sharepoint-online-optimized/assets/sharepoint-exporter/sharepoint-exporter.jar com.lucidworks.fusion.connector.plugins.sharepoint.exporter.SharepointExportWeb -port 5000 -exportDirectoryPath /opt/fusion/4.2/data/connectors/connectors-classic/lucid.sharepoint-online-optimized/example_spo
Once the viewer is running, navigate the exported data with any browser (with the command above, at http://localhost:5000).
When entering configuration values in the UI, use unescaped characters, such as
The connectors-classic log directory, $FUSION_HOME/fusion/var/log/connectors/connectors-classic/ by default, contains the log file <ds-name>.log. Look here for diagnostic information to help with troubleshooting. This file is created the first time a crawl is started with the SharePoint Online V1 Optimized connector.