The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.
1. Decide what you need to crawl
First, determine what you need to crawl, then choose your "Start Links" accordingly.
Choose one of the following:
-
An entire SharePoint Web application (all site collections in a specific SharePoint URL).
How to crawl an entire SharePoint Web application
-
Leave the Limit Documents > Fetch all site collections option checked (as it is by default).
-
Specify the Web application URL as a site.
For example:
https://lucidworks.sharepoint.local/
Note: Crawling an entire SharePoint Web application requires administrative access to SharePoint.
How to crawl a subset of SharePoint site collections
-
Uncheck the Limit Documents > Fetch all site collections option.
-
Specify a "Start Link" for each site collection that you want to crawl.
Examples:
https://lucidworks.sharepoint.local/sites/site1
https://lucidworks.sharepoint.local/sites/site2
https://lucidworks.sharepoint.local/sites/site3
How to crawl a specific sub-site, list, or list item
-
Uncheck the Limit Documents > Fetch all site collections option.
-
Specify a "Start Link" for each site collection that contains the item you want to fetch.
-
Specify a non-wildcard Inclusive Regular Expression for each parent.
For example, if you want to crawl
https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite
then you must include inclusive regexes for all parents along the way:
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite
https\:\/\/lucidworks\.sharepoint\.local\/sites\/mysitecol\/myparentsite\/somesite\/.*
Important: If you exclude a parent item of the site, the connector will not crawl the site because it will never spider down to it during the crawl process.
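Writing one regex per parent by hand is error-prone, so the rule above can be sketched as a small helper. This is only an illustration: the function name `parent_inclusive_regexes` and its parameters are hypothetical and not part of Fusion or the connector. It assumes the target URL lies inside the given site collection, and it relies on Python's `re.escape`, which typically leaves `:` and `/` unescaped; those characters are literal in a regex either way, so the patterns should be equivalent to the hand-written form.

```python
import re
from typing import List


def parent_inclusive_regexes(target_url: str, site_collection_url: str) -> List[str]:
    """Build one literal regex per level from the site collection down to
    the target, plus a final wildcard for everything under the target.

    Hypothetical helper for illustration; not a Fusion API.
    """
    target = target_url.rstrip("/")
    root = site_collection_url.rstrip("/")
    if target != root and not target.startswith(root + "/"):
        raise ValueError("target_url must be inside site_collection_url")

    # Path segments between the site collection and the target.
    remainder = target[len(root):].strip("/")
    segments = [s for s in remainder.split("/") if s]

    current = root
    patterns = [re.escape(current)]           # the site collection itself
    for segment in segments:
        current = current + "/" + segment
        patterns.append(re.escape(current))   # each parent, then the target
    patterns.append(re.escape(current) + "/.*")  # everything below the target
    return patterns


if __name__ == "__main__":
    for pattern in parent_inclusive_regexes(
        "https://lucidworks.sharepoint.local/sites/mysitecol/myparentsite/somesite",
        "https://lucidworks.sharepoint.local/sites/mysitecol",
    ):
        print(pattern)
```

Only the last pattern contains a wildcard, so sibling sites under the same parents are not pulled in.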
2. Set up permissions for the crawl
You have two options here:
-
Set up a crawl account with only as much permission as it needs.
This approach has the security advantage of providing minimal access to Fusion. However, the crawl account cannot retrieve the list of site collections behind a Web application URL.
How to set up a crawl account
1. Create a Lucidworks Fusion crawl permission
-
Navigate to Central Administration > Manage web application > Permission Policy.
-
Click Add permission policy level. In this example, the permission level is named "fusion_crawl_policy".
-
If you need to list all site collections in a SharePoint web application, select the Site Collection Auditor option.
-
Grant the following permissions:
-
View Items - View items in lists and documents in document libraries.
-
Open Items - View the source of documents with server-side file handlers.
-
View Versions - View past versions of a list item or document.
-
View Application Pages - View forms, views, and application pages. Enumerate lists.
Site Permissions
-
Browse Directories - Enumerate files and folders in a Web site using SharePoint Designer and Web DAV interfaces.
-
View Pages - View pages in a Web site.
-
Enumerate Permissions - Enumerate permissions on the Web site, list, folder, document, or list item.
-
Browse User Information - View information about users of the Web site.
-
Use Remote Interfaces - Use SOAP, Web DAV, the Client Object Model or SharePoint Designer interfaces to access the Web site.
-
Open - Allows users to open a Web site, list, or folder in order to access items inside that container.
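The permission list above can also be kept as a simple checklist when reviewing a policy level. The sketch below is an assumption-laden illustration: the permission names are copied from the list above, but `REQUIRED_PERMISSIONS` and `missing_permissions` are hypothetical names, and the granted permissions are expected to be entered by hand (nothing here talks to SharePoint).

```python
# Permission names copied from the list above; all identifiers are hypothetical.
REQUIRED_PERMISSIONS = frozenset({
    "View Items",
    "Open Items",
    "View Versions",
    "View Application Pages",
    "Browse Directories",
    "View Pages",
    "Enumerate Permissions",
    "Browse User Information",
    "Use Remote Interfaces",
    "Open",
})


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


if __name__ == "__main__":
    # Example: a policy level where only two permissions were ticked so far.
    for name in missing_permissions(["View Items", "Open"]):
        print("Still to grant:", name)
```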
2. Grant user permission to the user policy
-
Navigate to Central Administration > Manage web application > User Policy > Add Users.
-
Add a new user and select the new permission policy level, "fusion_crawl_policy".
How to provide admin access to crawl
See the SharePoint documentation for instructions.