Google Cloud Storage (GCS) V2 Connector Configuration Reference
The Fusion Google Cloud Storage (GCS) V2 connector enables indexing datasets from GCS buckets into Fusion 5. The connector leverages the Google Cloud API for authentication and fetching content and metadata.
- CSV
- JSON
- PDF
- Word docs
- Other rich text formats

- Service account authentication
- Full crawl of storage buckets and objects
- Recrawl buckets and objects
- Remove deleted objects
- Update objects
- Cascade deletion of objects in deleted buckets
- Document parsing support
- Bucket and object filtering
- Jenkins build support
The GCS connector supports JSON key authentication. Copy the full contents of the JSON key and paste them into the Service account Json key field.
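Because the whole key must be pasted, a truncated or malformed key is an easy mistake to make. The following is a minimal sketch of a pre-paste sanity check, not part of the connector. The function name `check_service_account_key` is hypothetical, and the field list reflects what Google service account key files normally contain rather than what the connector itself validates.

```python
import json
from typing import List

# Fields a Google service account JSON key file normally contains.
# Assumption: the connector's own validation rules are not documented here.
REQUIRED_KEY_FIELDS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
)


def check_service_account_key(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the key looks plausible."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    # Report every required field that is absent or empty.
    problems = [
        f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)
    ]
    # A service account key declares its type explicitly.
    if key.get("type") and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']!r}")
    return problems
```

Running the check on the text you are about to paste catches a partial copy (invalid JSON) or a key of the wrong kind before the datasource is saved.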
By default, the GCS connector crawls all available buckets in a project. To crawl all available buckets, the service account used in the configuration needs the storage.buckets.list permission.
If the account has limited permissions, or if you want to crawl only specific buckets, use the Specify buckets to crawl setting. Add the names of the buckets you want to crawl; the connector then downloads the objects and metadata from those buckets.
The GCS connector picks up content changes (added/updated/deleted) to keep the Solr index up to date.
Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:
additionalVolumes:
  - name: fusion-data1-pvc
    persistentVolumeClaim:
      claimName: fusion-data1-pvc
  - name: fusion-data2-pvc
    persistentVolumeClaim:
      claimName: fusion-data2-pvc
additionalVolumeMounts:
  - name: fusion-data1-pvc
    mountPath: "/connector/data1"
  - name: fusion-data2-pvc
    mountPath: "/connector/data2"
You may also need to specify the user that is authorized to access the file system, as in this example:
securityContext:
  fsGroup: 1002100000
  runAsUser: 1002100000
| Name | Title | Description |
| --- | --- | --- |
| authenticationProperties | Authentication settings | Connect to the bucket store using a service account. The service account requires the following permissions: storage.buckets.list to crawl all the available buckets; storage.objects.list and storage.objects.get to access the objects in the buckets. |
| applicationProperties | Limit documents | Bucket and object filtering options. |
| jsonKey | Service account Json key | JSON key contents from an authorized service account. |
| buckets | Bucket list | Add the bucket names to crawl. Leave blank to crawl all the available buckets. |
| includedFileExtensions | Included file extensions | Set of file extensions to be fetched. If specified, all non-matching files will be skipped. |
| excludedFileExtensions | Excluded file extensions | Set of file extensions to be skipped from the fetch. |
| inclusiveRegexes | Inclusive regexes | Regular expressions for bucket or object name patterns to include. This limits the datasource to only items that match the regular expression. |
| exclusiveRegexes | Exclusive regexes | Regular expressions for bucket or object name patterns to exclude. This limits the datasource to only items that do not match the regular expression. |
| maxSizeBytes | Maximum File Size | Excludes objects whose size is larger than the configured value. |
| minSizeBytes | Minimum File Size | Excludes objects whose size is smaller than the configured value. |
| bucketPrefix | Bucket prefix | Filter results to buckets whose names begin with this prefix. Useful only when the 'Bucket List' property is empty. |
| blobsPrefix | Object prefix | Filter results to objects whose names begin with this prefix. |
| pageSize | Buckets and Objects page size | Maximum number of buckets or objects returned per page. |
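The filtering options above can be read as a chain of checks applied to each object. The sketch below illustrates one plausible reading; it is not the connector's implementation. The `Filters` class, the `should_fetch` function, the order of the checks, the use of substring regex search, and the treatment of a negative maximum size as "no limit" are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Filters:
    """Hypothetical container mirroring the filter options in the table."""
    included_extensions: List[str] = field(default_factory=list)
    excluded_extensions: List[str] = field(default_factory=list)
    inclusive_regexes: List[str] = field(default_factory=list)
    exclusive_regexes: List[str] = field(default_factory=list)
    max_size_bytes: int = -1  # assumption: a negative value means no upper limit
    min_size_bytes: int = 1
    object_prefix: str = ""


def extension_of(name: str) -> str:
    """Lower-cased extension of the last path segment, or '' if there is none."""
    base = name.rsplit("/", 1)[-1]
    return base.rsplit(".", 1)[-1].lower() if "." in base else ""


def should_fetch(name: str, size: int, f: Filters) -> bool:
    """Return True if the object passes every configured filter."""
    if f.object_prefix and not name.startswith(f.object_prefix):
        return False
    ext = extension_of(name)
    included = {e.lower().lstrip(".") for e in f.included_extensions}
    excluded = {e.lower().lstrip(".") for e in f.excluded_extensions}
    if included and ext not in included:
        return False  # an include list skips all non-matching files
    if ext in excluded:
        return False
    if f.inclusive_regexes and not any(re.search(p, name) for p in f.inclusive_regexes):
        return False
    if any(re.search(p, name) for p in f.exclusive_regexes):
        return False
    if f.max_size_bytes >= 0 and size > f.max_size_bytes:
        return False
    return size >= f.min_size_bytes
```

For example, with `Filters(included_extensions=["pdf"], exclusive_regexes=[r"^tmp/"])`, an object named `docs/report.pdf` passes, while `tmp/report.pdf` and `docs/report.docx` are skipped. Note that the default minimum size of 1 in this sketch also skips zero-byte objects.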
Connector for Google Cloud Store
description - string
Optional description
<= 125 characters
pipeline - string, required
Name of the IndexPipeline used for processing output.
>= 1 characters
Match pattern: ^[a-zA-Z0-9_-]+$
diagnosticLogging - boolean
Enable diagnostic logging; disabled by default
Default: false
parserId - string, required
The Parser to use in the associated IndexPipeline.
coreProperties - Core Properties
Common behavior and performance settings.
fetchSettings - Fetch Settings
System level settings for controlling fetch behavior and performance.
numFetchThreads - number
Maximum number of fetch threads; defaults to 5. This setting controls the number of threads that call the connector's fetch method. Higher values can, but do not always, improve overall fetch performance.
>= 1
<= 500
exclusiveMinimum: false
exclusiveMaximum: false
Default: 5
Multiple of: 1
indexingThreads - number
Maximum number of indexing threads; defaults to 4. This setting controls the number of threads in the indexing service used for processing content documents emitted by this datasource. Higher values can sometimes help with overall fetch performance.
>= 1
<= 10
exclusiveMinimum: false
exclusiveMaximum: false
Default: 4
Multiple of: 1
pluginInstances - number
Maximum number of plugin instances for distributed fetching. Only the specified number of plugin instances will do fetching. This is useful for distributing load between different instances.
<= 500
exclusiveMinimum: false
exclusiveMaximum: false
Default: 0
Multiple of: 1
fetchResponseScheduledTimeout - number
The maximum amount of time for a response to be scheduled. The task will be canceled if this setting is exceeded.
>= 1000
<= 500000
exclusiveMinimum: false
exclusiveMaximum: false
Default: 300000
Multiple of: 1
indexingInactivityTimeout - number
The maximum amount of time to wait for indexing results (in seconds). If exceeded, the job will fail with an indexing inactivity timeout.
>= 60
<= 691200
exclusiveMinimum: false
exclusiveMaximum: false
Default: 86400
Multiple of: 1
pluginInactivityTimeout - number
The maximum amount of time to wait for plugin activity (in seconds). If exceeded, the job will fail with a plugin inactivity timeout.
>= 60
<= 691200
exclusiveMinimum: false
exclusiveMaximum: false
Default: 600
Multiple of: 1
indexMetadata - boolean
When enabled, the metadata of skipped items will be indexed to the content collection.
Default: false
indexContentFields - boolean
When enabled, content fields will be indexed to the crawl-db collection.
Default: false
asyncParsing - boolean
When enabled, content will be indexed asynchronously.
Default: false
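The numeric fetch settings above each list an allowed range and a default. As a rough illustration, the sketch below encodes those listed bounds and checks a set of overrides against them before a configuration is submitted. The names `FETCH_SETTING_LIMITS`, `validate_fetch_settings`, and `with_defaults` are hypothetical helpers, not part of Fusion; only the setting names and numbers come from this reference.

```python
from typing import Dict, List, Optional, Tuple

# (minimum, maximum, default) as listed in this reference.
# None means the reference lists no bound on that side.
FETCH_SETTING_LIMITS: Dict[str, Tuple[Optional[int], Optional[int], int]] = {
    "numFetchThreads": (1, 500, 5),
    "indexingThreads": (1, 10, 4),
    "pluginInstances": (None, 500, 0),
    "fetchResponseScheduledTimeout": (1000, 500000, 300000),
    "indexingInactivityTimeout": (60, 691200, 86400),
    "pluginInactivityTimeout": (60, 691200, 600),
}


def validate_fetch_settings(settings: Dict[str, int]) -> List[str]:
    """Return one message per setting that is unknown, non-integer, or out of range."""
    errors = []
    for name, value in settings.items():
        if name not in FETCH_SETTING_LIMITS:
            errors.append(f"{name}: unknown setting")
            continue
        low, high, _default = FETCH_SETTING_LIMITS[name]
        # "Multiple of: 1" means whole numbers; bool is rejected explicitly
        # because Python treats it as a subclass of int.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{name}: must be an integer")
            continue
        if low is not None and value < low:
            errors.append(f"{name}: {value} is below the minimum {low}")
        if high is not None and value > high:
            errors.append(f"{name}: {value} is above the maximum {high}")
    return errors


def with_defaults(settings: Dict[str, int]) -> Dict[str, int]:
    """Fill in the documented default for every setting not overridden."""
    merged = {name: limits[2] for name, limits in FETCH_SETTING_LIMITS.items()}
    merged.update(settings)
    return merged
```

For instance, `validate_fetch_settings({"indexingThreads": 11})` reports a single out-of-range error, because the listed maximum for that setting is 10.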
id - string, required
A unique identifier for this Configuration.
>= 1 characters
Match pattern: ^[a-zA-Z0-9_-]+$
properties - GCS properties
Plugin specific properties.
authenticationProperties - Authentication settings
Connect to the bucket store using a service account. The service account requires the following permissions:
storage.buckets.list to crawl all the available buckets;
storage.objects.list and storage.objects.get to access the objects in the buckets
jsonKey - string
Json key contents from authorized service account
applicationProperties - Limit documents
Bucket and Object filtering options
buckets - array[string]
Add the bucket names to crawl. Leave blank to crawl all the available buckets
Default:
includedFileExtensions - array[string]
Set of file extensions to be fetched. If specified, all non-matching files will be skipped.
Default:
excludedFileExtensions - array[string]
A set of all file extensions to be skipped from the fetch.
Default:
inclusiveRegexes - array[string]
Regular expressions for bucket or object name patterns to include. This will limit this datasource to only items that match the regular expression.
Default:
exclusiveRegexes - array[string]
Regular expressions for bucket or object name patterns to exclude. This will limit this datasource to only items that do not match the regular expression.
Default:
maxSizeBytes - number
Excludes objects whose size is larger than the configured value.
>= -2147483648
<= 2147483647
exclusiveMinimum: false
exclusiveMaximum: false
Default: -1
Multiple of: 1
minSizeBytes - number
Excludes objects whose size is smaller than the configured value.
>= -2147483648
<= 2147483647
exclusiveMinimum: false
exclusiveMaximum: false
Default: 1
Multiple of: 1
bucketPrefix - string
Filter results to buckets whose names begin with this prefix. Useful only when 'Bucket List' property is empty
blobsPrefix - string
Filter results to objects whose names begin with this prefix
pageSize - number
Maximum number of buckets or objects returned per page
>= 1
<= 1000
exclusiveMinimum: false
exclusiveMaximum: false
Default: 1000
Multiple of: 1
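To show how the properties in this reference might fit together, the sketch below assembles a datasource configuration as a Python dictionary and serializes it to JSON. The nesting (fetch settings under coreProperties.fetchSettings, the JSON key under properties.authenticationProperties, and the filters under properties.applicationProperties) is inferred from the order of this reference and should be treated as an assumption. The function name and all sample values are hypothetical, and any fields Fusion requires that are not listed in this reference are omitted.

```python
import json
import re
from typing import Any, Dict, List, Optional

# The reference lists the pattern ^[a-zA-Z0-9_-]+$ for id and pipeline.
ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")


def build_gcs_datasource(
    ds_id: str,
    pipeline: str,
    parser_id: str,
    json_key: str,
    buckets: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble a configuration dictionary; the nesting is an assumption."""
    for label, value in (("id", ds_id), ("pipeline", pipeline)):
        if not ID_PATTERN.fullmatch(value):
            raise ValueError(f"{label} must match ^[a-zA-Z0-9_-]+$: {value!r}")
    return {
        "id": ds_id,
        "pipeline": pipeline,
        "parserId": parser_id,
        "diagnosticLogging": False,
        "coreProperties": {
            "fetchSettings": {"numFetchThreads": 5, "indexingThreads": 4},
        },
        "properties": {
            "authenticationProperties": {"jsonKey": json_key},
            "applicationProperties": {
                # An empty bucket list means "crawl all available buckets".
                "buckets": list(buckets or []),
                "includedFileExtensions": ["pdf", "csv"],
                "excludedFileExtensions": [],
                "maxSizeBytes": -1,
                "minSizeBytes": 1,
                "pageSize": 1000,
            },
        },
    }


if __name__ == "__main__":
    config = build_gcs_datasource(
        "gcs-docs",
        "gcs-docs-pipeline",
        "gcs-docs-parser",
        "<paste JSON key contents here>",
        ["example-bucket"],
    )
    print(json.dumps(config, indent=2))
```

Validating the identifiers up front mirrors the "Match pattern" constraint listed for id and pipeline, so an invalid name is rejected before the configuration is sent anywhere.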