Google Cloud Storage (GCS) V2 Connector Configuration Reference

Table of Contents

Features
Remote connectors
Configuration

The Fusion Google Cloud Storage (GCS) V2 connector enables indexing datasets from GCS buckets into Fusion 5. The connector leverages the Google Cloud API for authentication and fetching content and metadata.

The GCS connector can index:

CSV
JSON
PDF
Word docs
Other rich text formats

Features

Service account authentication
Full crawl of storage buckets and objects
Recrawl buckets and objects
Remove deleted objects
Update objects
Cascade deletion of objects in deleted buckets
Document parsing support
Bucket and object filtering

Authentication

The GCS connector supports JSON key authentication. The full JSON key content must be copied and pasted into the service account JSON key box.
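Before pasting a key, it can help to confirm that the text is a complete service account key. The following is a minimal sketch, not part of the connector: the function name `check_service_account_key` is hypothetical, and the checked field names are those found in a standard Google service account key file.

```python
import json

# Fields found in a standard Google service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw: str) -> list[str]:
    """Return a list of problems found in the pasted key; empty means it looks usable."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["key must be a JSON object"]
    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)]
    # A key of another credential type is not a service account key.
    if key.get("type") not in (None, "service_account"):
        problems.append("type must be 'service_account'")
    return problems
```

An empty result only means the key is well formed; it does not confirm that the account has the permissions the connector needs.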

Full crawl

The GCS connector crawls all available buckets in a project. To do so, the service account used in the configuration needs the storage.buckets.list permission.

Crawl specific buckets

If the account has limited permissions, or if you want to crawl only specific buckets, use the Specify buckets to crawl setting. First add the names of the buckets you want to crawl, then download the bucket objects and metadata.

Recrawl

The GCS connector picks up content changes (added/updated/deleted) to keep the Solr index up to date.
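Conceptually, a recrawl compares what was seen in the previous crawl with what the bucket contains now. The sketch below illustrates that comparison only; it is not the connector's implementation. The function name `classify_changes` is hypothetical, and each snapshot is assumed to map an object name to some version marker, such as a hash or a generation number.

```python
def classify_changes(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {object_name: version_marker} snapshots and group the differences."""
    # Present now but not before.
    added = sorted(current.keys() - previous.keys())
    # Present before but gone now; these would be removed from the index.
    deleted = sorted(previous.keys() - current.keys())
    # Present in both snapshots with a different version marker.
    updated = sorted(
        name for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return {"added": added, "updated": updated, "deleted": deleted}
```

Objects that appear in both snapshots with the same marker are left alone, which is what keeps a recrawl cheaper than a full crawl.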

Remote connectors

V2 connectors support running remotely in Fusion versions 5.7.1 and later. Refer to Configure Remote V2 Connectors.

Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:

additionalVolumes:
- name: fusion-data1-pvc
  persistentVolumeClaim:
    claimName: fusion-data1-pvc
- name: fusion-data2-pvc
  persistentVolumeClaim:
    claimName: fusion-data2-pvc
additionalVolumeMounts:
- name: fusion-data1-pvc
  mountPath: "/connector/data1"
- name: fusion-data2-pvc
  mountPath: "/connector/data2"

You may also need to specify the user that is authorized to access the file system, as in this example:

securityContext:
    fsGroup: 1002100000
    runAsUser: 1002100000

Configuration

Name	Title	Description
`authenticationProperties`	Authentication settings	Connect to the bucket store using a service account. The service account requires the following permissions: storage.buckets.list to crawl all available buckets; storage.objects.list and storage.objects.get to access the objects in the buckets.
`applicationProperties`	Limit documents	Bucket and Object filtering options.
`jsonKey`	Service account JSON key	JSON key contents from an authorized service account.
`buckets`	Bucket list	Add the bucket names to crawl. Leave blank to crawl all the available buckets.
`includedFileExtensions`	Included file extensions	Set of file extensions to be fetched. If specified, all non-matching files will be skipped.
`excludedFileExtensions`	Excluded file extensions	A set of all file extensions to be skipped from the fetch.
`inclusiveRegexes`	Inclusive regexes	Regular expressions for bucket or object name patterns to include. This will limit this datasource to only items that match the regular expression.
`exclusiveRegexes`	Exclusive regexes	Regular expressions for bucket or object name patterns to exclude. This will limit this datasource to only items that do not match the regular expression.
`maxSizeBytes`	Maximum File Size	Excludes objects whose size is larger than the configured value.
`minSizeBytes`	Minimum File Size	Excludes objects whose size is smaller than the configured value.
`bucketPrefix`	Bucket prefix	Filter results to buckets whose names begin with this prefix. Useful only when 'Bucket List' property is empty.
`blobsPrefix`	Object prefix	Filter results to objects whose names begin with this prefix.
`pageSize`	Buckets and Objects page size	Maximum number of buckets or objects returned per page.
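The object-level filters in the table can be read as a chain of checks. The sketch below shows one plausible way they combine; it is an assumption, not the connector's documented behavior. The function name `should_fetch` is hypothetical, the order of the checks is a guess, a negative `maxSizeBytes` is treated as "no upper limit", and Python's `re` module stands in for whatever regular expression dialect the connector uses.

```python
import re


def should_fetch(name: str, size: int, props: dict) -> bool:
    """Decide whether an object passes the filters in `props` (hypothetical semantics)."""
    prefix = props.get("blobsPrefix", "")
    if prefix and not name.startswith(prefix):
        return False

    # Extension checks: an empty include list means "no restriction".
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    included = [e.lower().lstrip(".") for e in props.get("includedFileExtensions", [])]
    excluded = [e.lower().lstrip(".") for e in props.get("excludedFileExtensions", [])]
    if included and ext not in included:
        return False
    if ext in excluded:
        return False

    # Regex checks: at least one inclusive pattern must match, and no exclusive one may.
    inclusive = props.get("inclusiveRegexes", [])
    if inclusive and not any(re.search(p, name) for p in inclusive):
        return False
    if any(re.search(p, name) for p in props.get("exclusiveRegexes", [])):
        return False

    # Size checks: a negative maxSizeBytes is treated as "no upper limit" (assumption).
    max_size = props.get("maxSizeBytes", -1)
    min_size = props.get("minSizeBytes", 1)
    if max_size >= 0 and size > max_size:
        return False
    return size >= min_size
```

For example, with `includedFileExtensions` set to `["pdf", "csv"]` and `exclusiveRegexes` set to `["^tmp/"]`, the object `reports/a.pdf` would pass while `tmp/a.csv` and `reports/a.docx` would be skipped.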

Connector for Google Cloud Storage

description - string

Optional description

<= 125 characters

pipeline - string, required

Name of the IndexPipeline used for processing output.

>= 1 characters

Match pattern: ^[a-zA-Z0-9_-]+$

diagnosticLogging - boolean

Enable diagnostic logging; disabled by default

Default: false

parserId - string, required

The Parser to use in the associated IndexPipeline.

coreProperties - Core Properties

Common behavior and performance settings.

fetchSettings - Fetch Settings

System level settings for controlling fetch behavior and performance.

indexingInactivityTimeout - number

The maximum amount of time to wait for indexing results (in seconds). If exceeded, the job will fail with an indexing inactivity timeout.

>= 60

<= 691200

exclusiveMinimum: false

exclusiveMaximum: false

Default: 86400

Multiple of: 1

pluginInactivityTimeout - number

The maximum amount of time to wait for plugin activity (in seconds). If exceeded, the job will fail with a plugin inactivity timeout.

>= 60

<= 691200

exclusiveMinimum: false

exclusiveMaximum: false

Default: 600

Multiple of: 1

indexContentFields - boolean

When enabled, content fields will be indexed to the crawl-db collection.

Default: false

asyncParsing - boolean

When enabled, content will be indexed asynchronously.

Default: false

indexingThreads - number

Maximum number of indexing threads; defaults to 4. This setting controls the number of threads in the indexing service used for processing content documents emitted by this datasource. Higher values can sometimes help with overall fetch performance.

>= 1

<= 10

exclusiveMinimum: false

exclusiveMaximum: false

Default: 4

Multiple of: 1

pluginInstances - number

Maximum number of plugin instances for distributed fetching. Only the specified number of plugin instances will do fetching. This is useful for distributing load between different instances.

<= 500

exclusiveMinimum: false

exclusiveMaximum: false

Default: 0

Multiple of: 1

fetchResponseScheduledTimeout - number

The maximum amount of time for a response to be scheduled. The task will be canceled if this setting is exceeded.

>= 1000

<= 500000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 300000

Multiple of: 1

indexMetadata - boolean

When enabled, the metadata of skipped items will be indexed to the content collection.

Default: false

numFetchThreads - number

Maximum number of fetch threads; defaults to 20. This setting controls the number of threads that call the connector's fetch method. Higher values can, but do not always, help with overall fetch performance.

>= 1

<= 500

exclusiveMinimum: false

exclusiveMaximum: false

Default: 20

Multiple of: 1

id - string, required

A unique identifier for this Configuration.

>= 1 characters

Match pattern: ^[a-zA-Z0-9_-]+$
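The match pattern above allows only letters, digits, underscores, and hyphens, and requires at least one character. As a quick illustration, the sketch below checks candidate values against that pattern; the function name `is_valid_name` and the sample values are hypothetical.

```python
import re

# Same character class as the documented pattern ^[a-zA-Z0-9_-]+$.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")


def is_valid_name(value: str) -> bool:
    """Return True when the whole value consists of allowed characters."""
    # fullmatch anchors both ends, so a trailing newline is rejected too.
    return NAME_PATTERN.fullmatch(value) is not None
```

Under this check, a value such as `gcs-prod_01` is accepted, while `gcs prod` (contains a space) and an empty string are rejected.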

properties - GCS properties

Plugin specific properties.

authenticationProperties - Authentication settings

jsonKey - string

JSON key contents from an authorized service account

applicationProperties - Limit documents

Bucket and Object filtering options

buckets - array[string]

Add the bucket names to crawl. Leave blank to crawl all the available buckets

Default:

includedFileExtensions - array[string]

Set of file extensions to be fetched. If specified, all non-matching files will be skipped.

Default:

excludedFileExtensions - array[string]

A set of all file extensions to be skipped from the fetch.

Default:

inclusiveRegexes - array[string]

Regular expressions for bucket or object name patterns to include. This will limit this datasource to only items that match the regular expression.

Default:

exclusiveRegexes - array[string]

Regular expressions for bucket or object name patterns to exclude. This will limit this datasource to only items that do not match the regular expression.

Default:

maxSizeBytes - number

Excludes objects whose size is larger than the configured value.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: -1

Multiple of: 1

minSizeBytes - number

Excludes objects whose size is smaller than the configured value.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 1

Multiple of: 1

bucketPrefix - string

Filter results to buckets whose names begin with this prefix. Useful only when 'Bucket List' property is empty

blobsPrefix - string

Filter results to objects whose names begin with this prefix

pageSize - number

Maximum number of buckets or objects returned per page

>= 1

<= 1000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 1000

Multiple of: 1
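The numeric limits listed in this reference can be checked before a configuration is submitted. The sketch below is one way to do that, assuming the settings have been collected into a flat dictionary; the function name `out_of_range` and that flat layout are assumptions made for illustration, while the bounds themselves are copied from the entries above and are inclusive.

```python
# (minimum, maximum) bounds copied from the reference; None means no bound is listed.
NUMERIC_BOUNDS = {
    "indexingInactivityTimeout": (60, 691200),
    "pluginInactivityTimeout": (60, 691200),
    "indexingThreads": (1, 10),
    "pluginInstances": (None, 500),
    "fetchResponseScheduledTimeout": (1000, 500000),
    "numFetchThreads": (1, 500),
    "pageSize": (1, 1000),
}


def out_of_range(settings: dict) -> list[str]:
    """Return one message per numeric setting that falls outside its listed bounds."""
    errors = []
    for name, value in settings.items():
        if name not in NUMERIC_BOUNDS:
            continue  # not a bounded numeric setting covered by this sketch
        low, high = NUMERIC_BOUNDS[name]
        if low is not None and value < low:
            errors.append(f"{name}={value} is below the minimum {low}")
        elif high is not None and value > high:
            errors.append(f"{name}={value} is above the maximum {high}")
    return errors
```

Settings that are not in the table, such as string or boolean properties, are ignored rather than reported.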