Fusion 5.9

    AWS S3 V2 Connector Configuration Reference

    The AWS S3 V2 connector crawls items in a single bucket. You must specify the bucket name and AWS region in which that bucket is located.

    You may crawl specific items in a bucket. If no items are specified, the entire bucket will be crawled.

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
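
The difference between the UI and API forms can be illustrated with a short Python sketch. It assumes the API payload is JSON; the field name `delimiter` is hypothetical and used only for illustration.

```python
import json

# In the UI you would type two characters: a backslash followed by "t".
ui_value = r"\t"

# JSON-encoding that value for an API request doubles the backslash.
# "delimiter" is a hypothetical field name, not a documented property.
api_payload_fragment = json.dumps({"delimiter": ui_value})
print(api_payload_fragment)
```

Decoding the JSON again yields the same two-character value that was typed in the UI.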

    This connector includes an Enable Stray Content Deletion option. When stray content deletion is enabled, content that was removed from the data source is also deleted from the index in Fusion. When it is disabled, content that was removed from the data source remains in the index.
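
Conceptually, stray content is whatever is in the index but was not found in the latest crawl. The following is a minimal sketch of that idea as a set difference. It is not the connector's implementation, and the function name and example URIs are hypothetical.

```python
def find_stray_ids(indexed_ids, crawled_ids):
    """Return IDs present in the index but not seen in the latest crawl."""
    return set(indexed_ids) - set(crawled_ids)


# Hypothetical document IDs, for illustration only.
indexed = {"s3://bucketname/a.txt", "s3://bucketname/b.txt"}
crawled = {"s3://bucketname/a.txt"}

# With stray content deletion enabled, these would be removed from the index.
stray = find_stray_ids(indexed, crawled)
```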

    Required permissions

    The connector requires ListBucket and GetObject permissions.

    The following is an IAM policy example. When you set permissions, replace bucketname with the value used in your implementation.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::bucketname/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::bucketname"
                ],
                "Effect": "Allow"
            }
        ]
    }
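
If you generate this policy for several buckets, a small helper can substitute the bucket name for you. This is a sketch only: the function name is hypothetical, and the `Version` element is the standard IAM policy language version rather than something specific to this connector.

```python
import json


def build_read_only_policy(bucket_name):
    """Build an IAM policy granting GetObject and ListBucket on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level permission applies to keys inside the bucket.
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
                "Effect": "Allow",
            },
            {
                # Listing is a bucket-level permission.
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
                "Effect": "Allow",
            },
        ],
    }


policy = build_read_only_policy("bucketname")
print(json.dumps(policy, indent=4))
```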

    Retry logic

    The retryCount field sets the number of times the S3 client connection retries when a document fails to index. Issues with AWS connectivity might leave the S3 connector unable to crawl all of the data. The default is 3 retries. If you are having trouble with AWS connectivity, try setting this field to a higher value, such as 10.
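
The general idea of a bounded retry can be sketched as follows. This is an illustration of the concept, not the connector's internal logic. The function name, the `delay_seconds` parameter, and the choice of `ConnectionError` as the retryable error are all assumptions.

```python
import time


def call_with_retries(operation, retry_count=3, delay_seconds=0.0):
    """Run operation(); on ConnectionError, retry up to retry_count times."""
    last_error = None
    # One initial attempt plus retry_count retries.
    for attempt in range(retry_count + 1):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            if attempt < retry_count:
                time.sleep(delay_seconds)
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

A higher `retry_count` gives a flaky connection more chances to recover, at the cost of a slower failure when the problem is persistent.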

    Remote connectors

    V2 connectors support running remotely in Fusion versions 5.7.1 and later. Refer to Configure Remote V2 Connectors.

    Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:

    additionalVolumes:
    - name: fusion-data1-pvc
      persistentVolumeClaim:
        claimName: fusion-data1-pvc
    - name: fusion-data2-pvc
      persistentVolumeClaim:
        claimName: fusion-data2-pvc
    additionalVolumeMounts:
    - name: fusion-data1-pvc
      mountPath: "/connector/data1"
    - name: fusion-data2-pvc
      mountPath: "/connector/data2"

    You may also need to specify the user that is authorized to access the file system, as in this example:

    securityContext:
        fsGroup: 1002100000
        runAsUser: 1002100000

    Connector to index content in AWS S3 buckets.

    properties - S3 properties

    Plugin specific properties.

    application - S3 Application

    bucketName - string

    region - string

    The AWS region in which the bucket is located.

    Default: us-west-2

    Allowed values: ap-south-1, eu-south-1, us-gov-east-1, ca-central-1, eu-central-1, us-west-1, us-west-2, af-south-1, eu-north-1, eu-west-3, eu-west-2, eu-west-1, ap-northeast-2, ap-northeast-1, me-south-1, sa-east-1, ap-east-1, cn-north-1, us-gov-west-1, ap-southeast-1, ap-southeast-2, us-iso-east-1, us-east-1, us-east-2, cn-northwest-1, us-isob-east-1, aws-global, aws-cn-global, aws-us-gov-global, aws-iso-global, aws-iso-b-global

    objectKeys - array[string]

    Limits the crawl to a set of files or folders inside the bucket. Folder names must end with '/'. Valid input examples: 'folderName/', 'folder/subFolder/', 'file.txt', 'folder/file.txt'
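
The trailing-slash rule can be checked before you submit a configuration. The helper below is a hypothetical pre-check, not part of the connector. It only separates entries by whether they end with '/'.

```python
def classify_object_keys(object_keys):
    """Split objectKeys entries into folders (end with '/') and files."""
    folders = [key for key in object_keys if key.endswith("/")]
    files = [key for key in object_keys if not key.endswith("/")]
    return folders, files


# The valid input examples from the property description.
folders, files = classify_object_keys(
    ["folderName/", "folder/subFolder/", "file.txt", "folder/file.txt"]
)
```

An entry such as 'folderName' without the trailing slash would be classified as a file here, which is a useful signal that a folder entry was mistyped.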

    authenticationConfig - S3 Authentication settings

    awsBasicAuthConfig - AWS Basic Authentication settings

    accessKey - string

    An AWS Access Key ID that can access the content.

    secretKey - string

    The AWS Secret Key associated with the Access Key.

    awsSessionAuthenticationConfig - AWS Session Authentication settings

    accessKey - string

    An AWS Access Key ID that can access the content.

    secretKey - string

    The AWS Secret Key associated with the Access Key.

    sessionToken - string

    awsInstanceCredentialsAuthConfig - AWS Instance Credentials Authentication settings

    instanceCredentials - boolean

    Use AWS instance credentials rather than an AWS key. Requires that Fusion 5 be hosted in Amazon EKS. You can specify another AWS region through the region property in the S3 Application settings.

    Default: false
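
The three authentication options above can be summarized in a small builder. This is a sketch under assumptions: the property names come from this reference, but the nesting shown is inferred from the order of the listing, the function name is hypothetical, and the rule that only one option is used at a time is also an assumption.

```python
def build_authentication_config(access_key=None, secret_key=None,
                                session_token=None, instance_credentials=False):
    """Return one authentication block based on the inputs provided."""
    if instance_credentials:
        # No keys needed; credentials come from the hosting environment.
        return {"awsInstanceCredentialsAuthConfig": {"instanceCredentials": True}}
    if not (access_key and secret_key):
        raise ValueError("accessKey and secretKey are required for key-based authentication")
    if session_token:
        return {
            "awsSessionAuthenticationConfig": {
                "accessKey": access_key,
                "secretKey": secret_key,
                "sessionToken": session_token,
            }
        }
    return {"awsBasicAuthConfig": {"accessKey": access_key, "secretKey": secret_key}}


# Placeholder values, for illustration only.
basic = build_authentication_config("EXAMPLE_KEY_ID", "EXAMPLE_SECRET")
```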

    proxyConfig - S3 Proxy settings

    proxyEndpoint - string

    The optional proxy protocol, host and endpoint through which the client will connect

    proxyUsername - string

    The optional username to use when connecting through a proxy

    proxyPassword - string

    The optional password to use when connecting through a proxy

    enableStrayContentDeletion - boolean

    When enabled, folders and files that are not found will be deleted.

    Default: true

    maximumItemLimitConfig - Item Count Limits

    maxItems - number

    Limits the number of items emitted to the configured IndexPipeline. The default is no limit (-1).

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: -1

    Multiple of: 1

    sizeLimitProperties - Item Size Limits

    Options for including or excluding items based on size, in bytes.

    maxSizeBytes - number

    Used for excluding items when the item size is larger than the configured value.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: -1

    Multiple of: 1

    minSizeBytes - number

    Used for excluding items when the item size is smaller than the configured value.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 1

    Multiple of: 1
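
The two size limits combine into a simple range check. The sketch below uses the documented defaults and assumes that a negative maxSizeBytes means no upper limit, which this reference does not state explicitly. The function name is hypothetical.

```python
def passes_size_limits(size_bytes, min_size_bytes=1, max_size_bytes=-1):
    """Return True if an item of size_bytes would not be excluded by size."""
    # Items smaller than the minimum are excluded.
    if size_bytes < min_size_bytes:
        return False
    # Assumption: a negative maximum disables the upper bound.
    if max_size_bytes >= 0 and size_bytes > max_size_bytes:
        return False
    return True
```

With the defaults, empty (0-byte) items are excluded and everything else passes.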

    regexConfig - Regular expression rules

    inclusiveRegexes - array[string]

    Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.

    Default:

    exclusiveRegexes - array[string]

    Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.

    Default:
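
The interaction of the two lists can be sketched as follows. The reference does not specify the matching semantics, so this sketch assumes that a pattern matching anywhere in the URI counts as a match, that exclusions take precedence, and that an empty inclusive list includes everything. The function name and URIs are hypothetical.

```python
import re


def uri_is_included(uri, inclusive_regexes=(), exclusive_regexes=()):
    """Apply exclusive rules first, then inclusive rules if any are set."""
    if any(re.search(pattern, uri) for pattern in exclusive_regexes):
        return False
    if inclusive_regexes:
        return any(re.search(pattern, uri) for pattern in inclusive_regexes)
    # Assumption: no inclusive rules means no restriction.
    return True


pdf_only = [r".*\.pdf$"]
no_tmp = [r"/tmp/"]
```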

    regexCacheSize - number

    The number of regex matches to cache when evaluating regular expressions. For example, if you exclude files by filename, each filename's regex result is cached so that if the same filename comes up again, the cached result is reused.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 10000

    Multiple of: 1
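
The caching behavior described for regexCacheSize resembles a bounded memoization cache. The sketch below shows the idea with Python's `functools.lru_cache`. It is an analogy rather than the connector's implementation, and the exclusion pattern is hypothetical.

```python
import re
from functools import lru_cache

REGEX_CACHE_SIZE = 10000  # mirrors the documented default

# Hypothetical exclusion rule: skip files ending in ".tmp".
EXCLUDE_PATTERN = re.compile(r".*\.tmp$")


@lru_cache(maxsize=REGEX_CACHE_SIZE)
def is_excluded(filename):
    """Evaluate the regex once per distinct filename; reuse cached results."""
    return EXCLUDE_PATTERN.match(filename) is not None
```

Repeated filenames are answered from the cache, so a larger cache helps most when the same names recur often during a crawl.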

    extensionConfig - File Extension rules

    includedFileExtensions - array[string]

    Set of file extensions to be fetched. If specified, all non-matching files will be skipped.

    Default:

    excludedFileExtensions - array[string]

    A set of all file extensions to be skipped from the fetch.

    Default:
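
Extension rules can be reasoned about in the same way as the regex rules. This sketch assumes extensions are compared without the leading dot and without regard to case, and that exclusions take precedence. None of these details are stated in this reference, and the function name is hypothetical.

```python
import posixpath


def extension_is_allowed(key, included_file_extensions=(), excluded_file_extensions=()):
    """Return True if the object key's extension passes both lists."""
    # S3 keys use '/' separators, so posixpath is used regardless of platform.
    extension = posixpath.splitext(key)[1].lstrip(".").lower()
    included = {ext.lstrip(".").lower() for ext in included_file_extensions}
    excluded = {ext.lstrip(".").lower() for ext in excluded_file_extensions}
    if extension in excluded:
        return False
    if included:
        # When an include list is set, non-matching files are skipped.
        return extension in included
    return True
```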

    regexCacheSize - number

    The number of regex matches to cache when evaluating regular expressions. For example, if you exclude files by filename, each filename's regex result is cached so that if the same filename comes up again, the cached result is reused.

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 10000

    Multiple of: 1

    requestConfig - Request Settings

    Options to configure the client

    pageSize - number

    Maximum number of items per paginated request

    >= -2147483648

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 1000

    Multiple of: 1

    retryCount - number

    Number of times the S3 client connection should retry

    <= 2147483647

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 3

    Multiple of: 1

    documentsConfig - Indexing settings

    Options to control how documents will be indexed

    indexFolderMetadata - boolean

    Enable indexing of folder metadata. Each folder will be represented by a document in the collection.

    Default: false

    parserId - string, required

    The Parser to use in the associated IndexPipeline.

    id - string, required

    A unique identifier for this Configuration.

    >= 1 characters

    Match pattern: ^[a-zA-Z0-9_-]+$

    pipelineId - string, required

    Name of the IndexPipeline used for processing output.

    >= 1 characters

    Match pattern: ^[a-zA-Z0-9_-]+$

    description - string

    Optional description

    <= 125 characters
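
The constraints on these top-level fields are easy to check before submitting a configuration. The following is a hypothetical pre-check based only on the constraints listed above: the required fields, the match pattern, and the description length limit. The function name and sample values are illustrative.

```python
import re

ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")


def validate_identity_fields(config):
    """Return a list of problems found in id, pipelineId, parserId, description."""
    errors = []
    for field in ("id", "pipelineId"):
        value = config.get(field, "")
        # fullmatch enforces the ^...$ pattern over the entire string.
        if not ID_PATTERN.fullmatch(value):
            errors.append(f"{field} must match ^[a-zA-Z0-9_-]+$")
    if not config.get("parserId"):
        errors.append("parserId is required")
    if len(config.get("description", "")) > 125:
        errors.append("description must be 125 characters or fewer")
    return errors


# Hypothetical values, for illustration only.
sample = {"id": "my-s3-source", "pipelineId": "my_pipeline", "parserId": "my_parser"}
problems = validate_identity_fields(sample)
```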

    diagnosticLogging - boolean

    Enable diagnostic logging; disabled by default

    Default: false

    coreProperties - Core Properties

    Common behavior and performance settings.

    fetchSettings - Fetch Settings

    System level settings for controlling fetch behavior and performance.

    numFetchThreads - number

    Maximum number of fetch threads; defaults to 5. This setting controls the number of threads that call the connector's fetch method. Higher values can sometimes, but not always, improve overall fetch performance.

    >= 1

    <= 500

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 5

    Multiple of: 1

    indexingThreads - number

    Maximum number of indexing threads; defaults to 4. This setting controls the number of threads in the indexing service used for processing content documents emitted by this datasource. Higher values can sometimes help with overall fetch performance.

    >= 1

    <= 10

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 4

    Multiple of: 1

    pluginInstances - number

    Maximum number of plugin instances for distributed fetching. Only the specified number of plugin instances will fetch. This is useful for distributing load between different instances.

    <= 500

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 0

    Multiple of: 1

    fetchResponseScheduledTimeout - number

    The maximum amount of time for a response to be scheduled. The task will be canceled if this setting is exceeded.

    >= 1000

    <= 500000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 300000

    Multiple of: 1

    indexingInactivityTimeout - number

    The maximum amount of time to wait for indexing results (in seconds). If exceeded, the job will fail with an indexing inactivity timeout.

    >= 60

    <= 691200

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 86400

    Multiple of: 1

    pluginInactivityTimeout - number

    The maximum amount of time to wait for plugin activity (in seconds). If exceeded, the job will fail with a plugin inactivity timeout.

    >= 60

    <= 691200

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 600

    Multiple of: 1
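
The documented minimum and maximum values for the fetch settings above can be collected into a simple range check. This is a hypothetical helper that covers only the settings listing both bounds, and it assumes the bounds are inclusive, as the exclusiveMinimum and exclusiveMaximum values of false suggest.

```python
# (minimum, maximum) pairs copied from the fetch settings listed above.
FETCH_SETTING_RANGES = {
    "numFetchThreads": (1, 500),
    "indexingThreads": (1, 10),
    "fetchResponseScheduledTimeout": (1000, 500000),
    "indexingInactivityTimeout": (60, 691200),
    "pluginInactivityTimeout": (60, 691200),
}


def out_of_range_settings(fetch_settings):
    """Return {name: (value, minimum, maximum)} for values outside their range."""
    problems = {}
    for name, value in fetch_settings.items():
        if name not in FETCH_SETTING_RANGES:
            continue  # settings without both bounds are not checked here
        minimum, maximum = FETCH_SETTING_RANGES[name]
        if not minimum <= value <= maximum:
            problems[name] = (value, minimum, maximum)
    return problems
```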

    indexMetadata - boolean

    When enabled, the metadata of skipped items is indexed to the content collection.

    Default: false

    indexContentFields - boolean

    When enabled, content fields will be indexed to the crawl-db collection.

    Default: false

    asyncParsing - boolean

    When enabled, content will be indexed asynchronously.

    Default: false