Fusion 5.9

    SharePoint V1 Optimized Connector Configuration Reference

    The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.

    Deprecation and removal notice

    This connector is deprecated as of Fusion 4.2 and is removed or expected to be removed as of Fusion 5.0. Use the SharePoint Optimized V2 connector instead.

    For more information about deprecations and removals, including possible alternatives, see Deprecations and Removals.

    Test NTLM permissions to successfully crawl a site collection

    This is only applicable to SharePoint on-premises deployments.

    To verify the NTLM account has appropriate permissions to crawl a site collection using the SharePoint V2 connector:

    1. Copy the check-ntlm-account-can-crawl-sharepoint-site-collection.ps1 PowerShell script below to a folder on your computer.

    $site_col_url="https://your.sharepoint-site.com/sites/mysitecol"
    
    $cred = (Get-Credential)
    
    if (-not ([System.Management.Automation.PSTypeName]'ServerCertificateValidationCallback').Type)
    {
    $certCallback = @"
        using System;
        using System.Net;
        using System.Net.Security;
        using System.Security.Cryptography.X509Certificates;
        public class ServerCertificateValidationCallback
        {
            public static void Ignore()
            {
                if(ServicePointManager.ServerCertificateValidationCallback ==null)
                {
                    ServicePointManager.ServerCertificateValidationCallback +=
                        delegate
                        (
                            Object obj,
                            X509Certificate certificate,
                            X509Chain chain,
                            SslPolicyErrors errors
                        )
                        {
                            return true;
                        };
                }
            }
        }
    "@
        Add-Type $certCallback
     }
    
    [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12;
    [ServerCertificateValidationCallback]::Ignore()
    
    $headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
    $headers.Add("Content-Type", "text/xml")
    $headers.Add("SOAPAction", "http://schemas.microsoft.com/sharepoint/soap/GetUpdatedFormDigestInformation")
    $headers.Add("X-RequestForceAuthentication", "true")
    $headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f")
    
    $body = "<?xml version=`"1.0`" encoding=`"utf-8`"?>`n<soap:Envelope xmlns:xsi=`"http://www.w3.org/2001/XMLSchema-instance`" xmlns:xsd=`"http://www.w3.org/2001/XMLSchema`" xmlns:soap=`"http://schemas.xmlsoap.org/soap/envelope/`">`n  <soap:Body>`n    <GetUpdatedFormDigestInformation xmlns=`"http://schemas.microsoft.com/sharepoint/soap/`" />`n  </soap:Body>`n</soap:Envelope>"
    
    $response = Invoke-RestMethod "${site_col_url}/_vti_bin/sites.asmx" -Method 'POST' -Headers $headers -Body $body -Credential $cred
    
    $digest_value = $response.Envelope.Body.GetUpdatedFormDigestInformationResponse.FirstChild.DigestValue
    
    $headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
    $headers.Add("Content-Type", "text/xml")
    $headers.Add("X-RequestForceAuthentication", "true")
    $headers.Add("X-RequestDigest", $digest_value)
    $headers.Add("Accept", "application/json")
    $headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f")
    
    $body = @'
    <Request AddExpandoFieldTypeSuffix="true" SchemaVersion="14.0.0.0" LibraryVersion="16.0.0.0"
             ApplicationName=".NET Library" xmlns="http://schemas.microsoft.com/sharepoint/clientquery/2009">
        <Actions>
            <ObjectPath Id="2" ObjectPathId="1"/>
            <ObjectPath Id="4" ObjectPathId="3"/>
            <Query Id="5" ObjectPathId="3">
                <Query SelectAllProperties="false">
                    <Properties>
                        <Property Name="Webs" SelectAll="true">
                            <Query SelectAllProperties="false">
                                <Properties/>
                            </Query>
                        </Property>
                        <Property Name="Title" ScalarProperty="true"/>
                        <Property Name="ServerRelativeUrl" ScalarProperty="true"/>
                        <Property Name="RoleDefinitions" SelectAll="true">
                            <Query SelectAllProperties="false">
                                <Properties/>
                            </Query>
                        </Property>
                        <Property Name="RoleAssignments" SelectAll="true">
                            <Query SelectAllProperties="false">
                                <Properties/>
                            </Query>
                        </Property>
                        <Property Name="HasUniqueRoleAssignments" ScalarProperty="true"/>
                        <Property Name="Description" ScalarProperty="true"/>
                        <Property Name="Id" ScalarProperty="true"/>
                        <Property Name="LastItemModifiedDate" ScalarProperty="true"/>
                    </Properties>
                </Query>
            </Query>
        </Actions>
        <ObjectPaths>
            <StaticProperty Id="1" TypeId="{3747adcd-a3c3-41b9-bfab-4a64dd2f1e0a}" Name="Current"/>
            <Property Id="3" ParentId="1" Name="Web"/>
        </ObjectPaths>
    </Request>
    '@
    
    $response = Invoke-RestMethod "${site_col_url}/_vti_bin/client.svc/ProcessQuery" -Method 'POST' -Headers $headers -Body $body -Credential $cred
    $response | ConvertTo-Json -Depth 100

    2. Change the value in the first line, $site_col_url="https://your.sharepoint-site.com/sites/mysitecol", to the URL of your site collection.

    3. Execute the script. If the result is:

      • A JSON output of your site’s metadata, the account permissions are set correctly.

      • An error such as a 403, 401, or other error, the account permissions are not set correctly. Set permissions correctly and run the script again to verify it executes successfully.

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
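
One way to read this rule is as ordinary JSON string encoding: a backslash inside a JSON string must itself be escaped. The following is a minimal sketch of that reading using Python's standard `json` module. The field name `delimiter` is hypothetical and is not part of this connector's schema.

```python
import json

# What you would type in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# "delimiter" is a hypothetical field name used only for illustration.
payload = json.dumps({"delimiter": ui_value})

# In the JSON text sent to the API, the backslash appears doubled.
print(payload)
```

Decoding the payload with `json.loads` returns the original two-character value, so nothing is lost by the extra backslash.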

    An Optimized Connector for SharePoint 2010, 2013, 2016, 2019 and SharePoint Online

    description - string

    Optional description

    <= 125 characters

    pipeline - string, required

    Name of the IndexPipeline used for processing output.

    >= 1 characters

    Match pattern: ^[a-zA-Z0-9_-]+$
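
The match pattern above can be checked before a configuration is submitted. This is a minimal sketch in Python. The helper name `is_valid_name` is an assumption for illustration, and it uses Python's regex engine, which may differ from the one Fusion uses in edge cases.

```python
import re

# Pattern quoted from this reference; it also applies to the id field.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def is_valid_name(value: str) -> bool:
    """Return True if value is non-empty and contains only letters, digits, '_' or '-'."""
    # fullmatch rejects a trailing newline that a bare '$' would tolerate.
    return NAME_PATTERN.fullmatch(value) is not None


print(is_valid_name("sharepoint-docs_01"))
print(is_valid_name("my pipeline"))
```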

    diagnosticLogging - boolean

    Enable diagnostic logging; disabled by default

    Default: false

    parserId - string, required

    The Parser to use in the associated IndexPipeline.

    coreProperties - Core Properties

    Common behavior and performance settings.

    fetchSettings - Fetch Settings

    System level settings for controlling fetch behavior and performance.

    pluginInstances - number

    Maximum number of plugin instances for distributed fetching. Only the specified number of plugin instances will do fetching. This is useful for distributing load between different instances.

    >= 1

    <= 1

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 1

    Multiple of: 1

    asyncParsing - boolean

    When enabled, content will be indexed asynchronously.

    Default: false

    fetchResponseScheduledTimeout - number

    The maximum amount of time for a response to be scheduled. The task will be canceled if this setting is exceeded.

    >= 1000

    <= 500000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 300000

    Multiple of: 1

    indexingInactivityTimeout - number

    The maximum amount of time to wait for indexing results (in seconds). If exceeded, the job will fail with an indexing inactivity timeout.

    >= 60

    <= 691200

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 86400

    Multiple of: 1

    pluginInactivityTimeout - number

    The maximum amount of time to wait for plugin activity (in seconds). If exceeded, the job will fail with a plugin inactivity timeout.

    >= 60

    <= 691200

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 600

    Multiple of: 1

    indexMetadata - boolean

    When enabled, the metadata of skipped items will be indexed to the content collection.

    Default: false

    indexContentFields - boolean

    When enabled, content fields will be indexed to the crawl-db collection.

    Default: false

    numFetchThreads - number

    Maximum number of fetch threads; defaults to 20. This setting controls the number of threads that call the connector's fetch method. Higher values can, but do not always, help with overall fetch performance.

    >= 1

    <= 500

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 20

    Multiple of: 1

    indexingThreads - number

    Maximum number of indexing threads; defaults to 4. This setting controls the number of threads in the indexing service used for processing content documents emitted by this datasource. Higher values can sometimes help with overall fetch performance.

    >= 1

    <= 10

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 4

    Multiple of: 1
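
The numeric limits for the fetch settings above can be checked before saving a datasource. The following is a minimal sketch in Python. The limits are copied from this reference, but the helper `check_fetch_settings` and the flat dictionary layout are assumptions for illustration, not part of Fusion.

```python
# (minimum, maximum, default) for each numeric fetch setting in this reference.
FETCH_SETTING_LIMITS = {
    "pluginInstances": (1, 1, 1),
    "fetchResponseScheduledTimeout": (1000, 500000, 300000),
    "indexingInactivityTimeout": (60, 691200, 86400),
    "pluginInactivityTimeout": (60, 691200, 600),
    "numFetchThreads": (1, 500, 20),
    "indexingThreads": (1, 10, 4),
}


def check_fetch_settings(settings: dict) -> list:
    """Return a list of problems; an empty list means every checked value is in range."""
    problems = []
    for name, value in settings.items():
        if name not in FETCH_SETTING_LIMITS:
            continue  # booleans and unknown settings are not range-checked here
        low, high, _default = FETCH_SETTING_LIMITS[name]
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{name}: expected an integer, got {value!r}")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} is outside [{low}, {high}]")
    return problems


print(check_fetch_settings({"numFetchThreads": 20, "indexingThreads": 11}))
```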

    id - string, required

    A unique identifier for this Configuration.

    >= 1 characters

    Match pattern: ^[a-zA-Z0-9_-]+$

    properties - SharePoint properties

    Plugin specific properties.

    webApplication - Web application config

    The SharePoint Web application to crawl.

    webApplicationUrl - string

    >= 1 characters

    fetchSiteCollections - boolean

    This feature requires site collection administrator rights on your SharePoint instance. If enabled, the SharePoint crawler will fetch all site collections from the web application automatically. If not enabled, you must explicitly list all site collections in the siteCollections parameter.

    Default: true

    forceFullCrawl - boolean

    Enable this to force a full crawl each time you run this datasource.

    Default: false

    siteCollections - array[string]

    A list of site collections to crawl. Because only site collection administrators or site collection auditors can list the site collections in a SharePoint web application, you can use this when you are crawling as a user that is not an admin/auditor. This allows you to explicitly list site collections you want to crawl. Specify paths relative to the web application URL, such as /sites/site1.

    Default:
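
To illustrate how relative site collection paths relate to webApplicationUrl, here is a minimal Python sketch. The helper `site_collection_urls` is hypothetical, and the simple string join is an assumption about how the paths resolve, not a description of the connector's internals.

```python
def site_collection_urls(web_application_url: str, site_collections: list) -> list:
    """Join relative paths such as '/sites/site1' onto the web application URL."""
    base = web_application_url.rstrip("/")
    urls = []
    for path in site_collections:
        if not path.startswith("/"):
            raise ValueError(f"expected a path relative to the web application URL, got {path!r}")
        urls.append(base + path)
    return urls


print(site_collection_urls("https://your.sharepoint-site.com/", ["/sites/site1", "/sites/mysitecol"]))
```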

    includedFileExtensions - array[string]

    Set of file extensions to be fetched. If specified, all non-matching files will be skipped.

    Default:

    excludedFileExtensions - array[string]

    A set of all file extensions to be skipped from the fetch.

    Default:

    inclusiveRegexes - array[string]

    Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.

    Default:

    exclusiveRegexes - array[string]

    Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.

    Default:
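
The combined effect of inclusiveRegexes and exclusiveRegexes can be sketched as a simple filter. This Python sketch makes three assumptions that the reference does not confirm: patterns may match anywhere in the URI, an empty inclusive list allows everything, and exclusions take precedence over inclusions. It also uses Python's regex syntax, which may differ from the connector's.

```python
import re


def uri_is_included(uri: str, inclusive: list, exclusive: list) -> bool:
    """Keep a URI if it matches an inclusive pattern (when any are set) and no exclusive pattern."""
    if inclusive and not any(re.search(p, uri) for p in inclusive):
        return False
    return not any(re.search(p, uri) for p in exclusive)


uri = "https://your.sharepoint-site.com/sites/site1/Shared Documents/report.docx"
print(uri_is_included(uri, [r"/sites/site1/"], [r"\.tmp$"]))
```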

    includeContentsExtensions - array[string]

    Only files with these file extensions will have their contents downloaded when indexing this item. For files with other extensions, the list item metadata will still be indexed but the file contents will not. The comparison is not case sensitive, and you do not have to specify the '.', but it still works if you do. For example, "zip" and ".zip" are both acceptable. Whitespace is also trimmed.

    Default:

    excludeContentsExtensions - array[string]

    File extensions of files that will not have their contents downloaded when indexing this item. The list item metadata will still be indexed but the file contents will not. The comparison is not case sensitive, and you do not have to specify the '.', but it still works if you do. For example, "zip" and ".zip" are both acceptable. Whitespace is also trimmed.

    Default:
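
The matching rules described for these extension lists (case-insensitive, optional leading '.', whitespace trimmed) can be sketched as follows. The function names are hypothetical and the code is only an illustration of the stated rules, not the connector's implementation.

```python
def normalize_extension(ext: str) -> str:
    """Trim whitespace, lower-case, and drop one optional leading '.'."""
    cleaned = ext.strip().lower()
    return cleaned[1:] if cleaned.startswith(".") else cleaned


def contents_excluded(filename: str, excluded_extensions: list) -> bool:
    """Return True if the file's extension is in the excluded list after normalization."""
    excluded = {normalize_extension(e) for e in excluded_extensions}
    _, dot, ext = filename.rpartition(".")
    # A filename without a '.' has no extension and is never excluded here.
    return bool(dot) and normalize_extension(ext) in excluded


print(contents_excluded("archive.ZIP", [" .zip "]))
```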

    restrictToSpecificItems - array[string]

    Instead of specifying regular expressions to restrict the SharePoint items that are crawled, this allows you to specify specific SharePoint item URLs of the resources that are to be crawled. The crawl will then be restricted to only include these specified SharePoint items URLs. You can specify list, sub-site, folder, and list item URLs.

    Default:

    apiQueryRowLimit - number

    >= 1

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 5000

    Multiple of: 1

    changeApiQueryRowLimit - number

    >= 1

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 2000

    Multiple of: 1

    aclCommitAfter - number

    When doing a Solr update to the ACL collection, specify the commitWithin parameter to use when updating.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 60000

    Multiple of: 1

    siteCollectionDeletionThreshold - number

    Site collections will be removed from the index after they are no longer available for this many hours. Set to 0 for immediate deletion. Default is 2 weeks.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 336

    Multiple of: 1
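
Because siteCollectionDeletionThreshold is expressed in hours, it can help to convert from days when choosing a value. This short Python sketch confirms that the default of 336 hours is 2 weeks. The helper `hours_for_days` is a hypothetical convenience function.

```python
from datetime import timedelta

DEFAULT_THRESHOLD_HOURS = 336  # default from this reference

threshold = timedelta(hours=DEFAULT_THRESHOLD_HOURS)
print(threshold.days)  # number of whole days in the default threshold


def hours_for_days(days: int) -> int:
    """Convert a retention period in days to the hours value this setting expects."""
    return days * 24


print(hours_for_days(14))
```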

    solrSocketTimeout - number

    Socket timeout when performing Solr operations.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 60000

    Multiple of: 1

    moderationStatusFilter - array[number]

    If specified, only items with the listed moderation statuses are indexed. Valid values are: 0 = The list item is approved, 1 = The list item has been denied approval, 2 = The list item is pending approval, 3 = The list item is in the draft or checked out state, 4 = The list item is scheduled for automatic approval at a future date.
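
The numeric codes can be easier to work with when given names. The following Python sketch maps the values listed above to an enum and builds a filter for approved and pending items. The enum and its member names are assumptions chosen for readability; only the numbers come from this reference.

```python
from enum import IntEnum


class ModerationStatus(IntEnum):
    """Numeric moderation statuses as listed in this reference."""
    APPROVED = 0
    DENIED = 1
    PENDING = 2
    DRAFT = 3
    SCHEDULED = 4


# Example: index only approved and pending items.
moderation_status_filter = [int(ModerationStatus.APPROVED), int(ModerationStatus.PENDING)]
print(moderation_status_filter)  # [0, 2]
```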

    fetchTaxonomies - boolean

    Fetch taxonomy data from SharePoint.

    Default: false

    siteCollectionTaxonomyCacheSize - number

    To make the connector faster, when the taxonomy terms for a site collection are needed, they are cached to avoid looking up from disk again. This is the size of that cache.

    >= 1

    <= 10000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 10

    Multiple of: 1

    fetchACLs - boolean

    Fetch Access Control Data

    Default: true

    asyncParsing - boolean

    Enable only if Tika Async is configured in the Fusion environment. Note: To enable async-parsing, check Core Properties -> Fetch Settings -> Async Parsing (since Fusion 5.8.0)

    Default: false

    zkHosts - string

    Solr ZooKeeper (zk) hosts string used for direct connections to Solr.

    contentCommitAfter - number

    When doing a Solr update to the content collection, specify the commitWithin parameter to use when updating.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 60000

    Multiple of: 1

    zkChroot - string

    Solr ZooKeeper (zk) chroot string used for direct connections to Solr.

    solrConnectionTimeout - number

    Connection timeout when performing Solr operations.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 60000

    Multiple of: 1

    includedListBaseTypes - array[string]

    If specified, the only SharePoint lists that will be fetched are the ones that match one of these base types. Accepts values (not case sensitive): [None, GenericList, DocumentLibrary, Unused, DiscussionBoard, Survey, Issue]

    includedObjectTypes - array[string]

    If specified, only fetch specific SharePoint objects. SharePoint object types that can be specified (not case sensitive): [Site, List, List_Item, Folder, Attachment]

    proxyProperties - Proxy options

    A set of options for configuring the proxy.

    username - string

    Proxy username

    >= 1 characters

    password - string

    Proxy password

    >= 1 characters

    url - string

    The proxy URL

    >= 1 characters

    ntlmProperties - NTLM Authentication settings

    user - string

    User

    >= 1 characters

    password - string

    Password

    >= 1 characters

    domain - string

    Domain

    >= 1 characters

    workstation - string

    Workstation

    >= 1 characters

    sharepointOnlineAuthProperties - SharePoint Online Authentication

    Settings relevant only when crawling SharePoint Online.

    account - string

    Your Microsoft SharePoint Online Account name which takes the form of username@domain.com

    >= 1 characters

    password - string

    Password for your Microsoft SharePoint Online Account.

    >= 1 characters

    sessionExpirationMs - number

    How long, in milliseconds, before new SharePoint Online authentication cookies should be fetched.

    >= 1

    <= 172800000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 7200000

    Multiple of: 1

    userAgent - string

    The user agent header decorates the HTTP traffic. This is important for preventing hard rate limiting by SharePoint Online.

    Default: ISV|Lucidworks|Fusion/5.x

    capUserAgent - string

    When "O365 Conditional Access Policy (CAP) setting" is enabled, we need to use a compliant User-Agent that matches one of the supported devices when doing O365 STS authentication. For example if iOS is a supported platform, set this to 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) CriOS/60.0.3112.89 Mobile/14G60 Safari/602.1'

    <= 4000 characters

    >= 1 characters

    appAuthClientId - string

    Applicable to SharePoint Online App-Auth Public/Private Service Account. The Azure client ID of your application.

    <= 100 characters

    >= 1 characters

    appAuthPkcs12KeystoreBase64String - string

    Applicable to SharePoint Online App-Auth only. This is the base64 string of your PKCS12 keystore loaded with the PFX certificate file supplied by Azure AD. To get this value, first take the Azure AD yourcert.pfx you received from Azure and convert it to PKCS12 keystore format (for example, "keytool -importkeystore -srckeystore yourcert.pfx -srcstoretype pkcs12 -destkeystore yourcert.p12 -deststoretype pkcs12"). Next, convert yourcert.p12 to a base64 string.

    <= 10000 characters

    >= 1 characters
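
The final step, converting yourcert.p12 to a base64 string, can be done with any base64 tool. The following is a minimal Python sketch using only the standard library. The function names are hypothetical, and the length check simply mirrors the 10000-character limit stated for this field.

```python
import base64
from pathlib import Path

MAX_LENGTH = 10000  # character limit stated for this field


def encode_keystore_bytes(data: bytes) -> str:
    """Return the keystore bytes as a single-line base64 string."""
    encoded = base64.b64encode(data).decode("ascii")
    if len(encoded) > MAX_LENGTH:
        raise ValueError(f"encoded keystore is {len(encoded)} characters; the limit is {MAX_LENGTH}")
    return encoded


def keystore_to_base64(p12_path: str) -> str:
    """Read a PKCS12 keystore file (for example yourcert.p12) and base64-encode it."""
    return encode_keystore_bytes(Path(p12_path).read_bytes())
```

For example, `keystore_to_base64("yourcert.p12")` returns the value to paste into this field, assuming the file is in the current directory.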

    appAuthPkcs12KeystorePassword - string

    Applicable to SharePoint Online App-Auth Public/Private Service Account. Password of the PKCS12 keystore.

    <= 100 characters

    >= 1 characters

    appAuthClientSecret - string

    Applicable to SharePoint Online OAuth App-Auth only. The Azure client secret of your application.

    <= 100 characters

    >= 1 characters

    appAuthRefreshToken - string

    Applicable to SharePoint Online OAuth App-Auth only. This is a refresh token which is reusable for up to 12 hours. You must obtain a new token using the OAuth login process if the token expires.

    <= 1000 characters

    >= 1 characters

    appAuthTenant - string

    Applicable to SharePoint Online App-Auth only. The Office365 tenant name to use when authenticating with Azure AD.

    <= 2083 characters

    >= 1 characters

    appAuthAzureLoginEndpoint - string

    Applicable to SharePoint Online App-Auth Public/Private Service Account. The Azure login endpoint to use when authenticating.

    <= 2083 characters

    >= 1 characters

    Default: https://login.windows.net

    jsAuthConfigJson - string

    JS Auth config JSON file containing a list of WebCredential entries used for the web driver login process.

    jsAuthLoginUrl - string

    JS Auth login URL to use during the login process.

    jsAuthSeleniumUrl - string

    URL of the Selenium grid service to use while performing WebDriver auth to SharePoint Online.

    maximumItemLimitConfig - Item Count Limit

    maxItems - number

    Limits the number of items emitted to the configured IndexPipeline. The default is no limit (-1).

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: -1

    Multiple of: 1

    sizeLimitProperties - Item Size Limits

    For documents which do not meet the maximum/minimum size limits, only metadata will be indexed, without the body. These documents indicate the reason why content is not indexed with the field '_lw_contents_excluded_s: file size'.

    maxSizeBytes - number

    Used for excluding items when the item size is larger than the configured value.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: -1

    Multiple of: 1

    minSizeBytes - number

    Used for excluding items when the item size is smaller than the configured value.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 1

    Multiple of: 1
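
The interaction of maxSizeBytes and minSizeBytes can be sketched as a small decision function. This Python sketch assumes that a negative limit disables that limit, which matches the -1 default for maxSizeBytes but is not stated explicitly in this reference. The function name is hypothetical.

```python
def content_exclusion_reason(size_bytes: int, max_size_bytes: int = -1, min_size_bytes: int = 1):
    """Return 'file size' if the body would be skipped under these limits, else None.

    Assumption: a negative limit means that limit is disabled.
    """
    too_large = max_size_bytes >= 0 and size_bytes > max_size_bytes
    too_small = min_size_bytes >= 0 and size_bytes < min_size_bytes
    return "file size" if (too_large or too_small) else None


# With the defaults, an empty file is below minSizeBytes and a 10-byte file is not.
print(content_exclusion_reason(0))
print(content_exclusion_reason(10))
```

Under these assumptions, a document flagged this way would still have its metadata indexed, with '_lw_contents_excluded_s' set to the returned reason.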

    fetchRetryProperties - Retry Options

    A set of options for configuring retry behavior.

    delayMs - number

    Sets the delay between retries, exponentially backing off to the maxDelayTimeMs and multiplying successive delays by the delayFactor

    >= 1

    <= 9223372036854776000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 1000

    Multiple of: 1

    maxDelayTimeMs - number

    The maximum wait time between successive retries.

    >= 1

    <= 600000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 300000

    Multiple of: 1

    maxTimeLimitMs - number

    This setting is used to limit the maximum amount of time spent on retries. Note: this will be ignored if "Maximum Retries" is specified.

    >= 1

    <= 28800000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 600000

    Multiple of: 1

    errorExclusions - array[string]

    Optional regex list that will be matched against the exception class and message of failed attempts. If any regex matches, the request is not retried. This is needed to prevent the retryer from retrying non-recoverable errors that were not already ignored by the connector implementation.

    delayFactor - number

    The retryer will retry failed operations in the case that they might succeed if attempted again. The retryer will sleep an exponential amount of time after the first failed attempt and retry in exponentially incrementing amounts after each failed attempt up to the maximumTime. nextWaitTime = exponentialIncrement * multiplier.

    >= 1

    <= 9999

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 2

    Multiple of: 1

    maxRetries - number

    The retryer will retry failed operations in the case that they might succeed if attempted again. This parameter states the number of attempts to retry until giving up. This parameter, if specified, will override the "Stop retrying after time (milliseconds)" parameter.

    <= 100

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 3

    Multiple of: 1
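
To see how delayMs, delayFactor, maxDelayTimeMs, and maxRetries might combine, the following Python sketch computes a retry schedule. The formula (each wait is delayMs multiplied by delayFactor once per previous attempt, capped at maxDelayTimeMs) is an assumption based on the descriptions above; the retryer's exact behavior may differ.

```python
def retry_delays(delay_ms=1000, delay_factor=2, max_delay_time_ms=300000, max_retries=3):
    """Return the wait before each retry: delay_ms * delay_factor**n, capped at max_delay_time_ms."""
    return [min(delay_ms * delay_factor ** n, max_delay_time_ms) for n in range(max_retries)]


print(retry_delays())  # [1000, 2000, 4000]
print(retry_delays(max_retries=10)[-1])  # 300000
```

With the defaults, the three retries wait 1, 2, and 4 seconds. The cap only comes into play with a larger maxRetries or delayFactor.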

    connections - Http client options

    A set of options for configuring the HTTP client.

    maxConnections - number

    The maximum number of connections

    >= 1

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 5000

    Multiple of: 1

    maxPerRoute - number

    Defines a connection limit per HTTP route. In simple cases you can understand this as a per-target-host limit. Under the hood things are a bit more interesting: HttpClient maintains a couple of HttpRoute objects, each of which represents a chain of hosts, like proxy1 -> proxy2 -> targetHost. Connections are pooled on a per-route basis. In simple cases, when you're using the default route-building mechanism and provide no proxy support, your routes are likely to include the target host only, so the per-route connection pool limit effectively becomes a per-host limit.

    >= 1

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 1000

    Multiple of: 1

    ignoreSSLValidationExceptions - boolean

    Do not attempt to do an SSL handshake and do not verify the hostname of SSL certificates. Use this when accessing an HTTPS URL with a self-signed or enterprise certificate authority that you do not want to put in the Java keystore.

    Default: false

    readTimeoutMs - number

    >= -1

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 60000

    Multiple of: 1

    connectTimeoutMs - number

    >= -1

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 300000

    Multiple of: 1

    debug - Debug options

    Special properties used for debugging the connector.

    onlyFetchAcls - boolean

    Do a full crawl where only ACLs are crawled. Also, when the ACLs are all fully indexed, clear any old ACL documents from previous crawls for this datasource. This gives you fresh SharePoint ACLs without affecting the content.

    Default: false

    logThreadDumpEveryNSeconds - number

    For diagnostic purposes, write a thread dump to logs every N seconds. If set <= 0, no dump is taken.

    >= -1

    <= 9999999

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: -1

    Multiple of: 1

    simulate429ErrorsEveryNRequests - number

    If > 0, simulate a SharePoint 429 status (too-many-requests) error such that there will be one error per this many requests.

    >= -1

    <= 999999

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: -1

    Multiple of: 1

    preserveFullExportDb - boolean

    The list* tables are normally cleared prior to saving the crawl database. This gives the option to leave these tables for analysis. This parameter is ignored if using a persistent volume to store the crawl DB because the data will always be saved in that case.

    Default: false

    onlyFetchMetadata - boolean

    For diagnostic purposes, do a dry run where the connector will only generate the metadata SharePoint export database and index the ACL records in the ACL collection, but will not fetch content.

    Default: false

    logAclInserts - boolean

    For diagnostic purposes, log all documents inserted into the ACL collection.

    Default: false

    security - 

    collectionId - string

    Id of the collection to be used for storing ACL records. If not specified, ACL collection name will be generated automatically using pattern '<datasource_id>_access_control_hierarchy'.
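
The naming rule for the ACL collection can be sketched in a few lines of Python. The function name and the example datasource ID are hypothetical; only the '<datasource_id>_access_control_hierarchy' pattern comes from this reference.

```python
from typing import Optional


def acl_collection_name(datasource_id: str, collection_id: Optional[str] = None) -> str:
    """Return the explicit collectionId if set, else '<datasource_id>_access_control_hierarchy'."""
    return collection_id or f"{datasource_id}_access_control_hierarchy"


print(acl_collection_name("sharepoint-docs_01"))
```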