Latest version: v2.1.0
Compatible with Fusion version: 5.2.0 and later

The Fusion Google Cloud Storage (GCS) V2 connector indexes datasets from GCS buckets into Fusion 5. The connector uses the Google Cloud API for authentication and for fetching content and metadata. The GCS connector can index:
  • CSV
  • JSON
  • PDF
  • Word docs
  • Other rich text formats

Features

  1. Service account authentication
  2. Full crawl of storage buckets and objects
  3. Recrawl buckets and objects
  4. Remove deleted objects
  5. Update objects
  6. Cascade deletion of objects in deleted buckets
  7. Document parsing support
  8. Bucket and object filtering

Authentication

The GCS connector supports JSON key authentication. Paste the full content of the service account's JSON key into the Service account JSON key field.
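
Before pasting a key, it can help to confirm that the file is a complete service account key. The following is a minimal sketch and not part of the connector: it assumes the standard field names found in service account key files downloaded from Google Cloud, and the helper name `missing_key_fields` is hypothetical.

```python
import json

# Fields assumed to be present in a standard service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key_id", "private_key", "client_email")


def missing_key_fields(key_text: str) -> list[str]:
    """Return a list of problems found in the key text; empty means it looks complete."""
    try:
        key = json.loads(key_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]
    # A key of another credential type is not a service account key.
    if key.get("type") and key["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems
```

An empty result only means the key is structurally plausible; it does not prove that the service account exists or has the required permissions.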

Full crawl

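By default, the GCS connector crawls all available buckets in a project. To do so, the service account used in the configuration needs the storage.buckets.list permission, in addition to storage.objects.list and storage.objects.get for reading the objects in each bucket.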
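
The permission requirements can be summarized as a small check. This sketch is illustrative only: the permission names are the ones listed in the Configuration section of this page, while the function `missing_permissions` and its inputs are hypothetical and do not call the Google Cloud API.

```python
# Permission names as listed in the Configuration section.
LIST_BUCKETS = "storage.buckets.list"
OBJECT_PERMISSIONS = {"storage.objects.list", "storage.objects.get"}


def missing_permissions(granted: set[str], crawl_all_buckets: bool) -> set[str]:
    """Return the permissions still needed for the chosen crawl mode."""
    required = set(OBJECT_PERMISSIONS)
    if crawl_all_buckets:
        # Discovering every bucket in the project needs one extra permission.
        required.add(LIST_BUCKETS)
    return required - granted
```

For example, an account that holds only the two object permissions would report `storage.buckets.list` as missing for a full crawl, and nothing missing when the buckets are named explicitly.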

Crawl specific buckets

If the service account has limited permissions, or if you want to crawl only specific buckets, use the Specify buckets to crawl setting. Add the names of the buckets you want to crawl; the connector then downloads the objects and metadata from those buckets only.
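
The following sketch shows how an explicit bucket list and the bucket prefix setting might interact, based only on the property descriptions in the Configuration section. The function `select_buckets` is hypothetical, and the connector's actual behavior may differ.

```python
def select_buckets(available: list[str], bucket_list: list[str], bucket_prefix: str = "") -> list[str]:
    """Pick the buckets to crawl: the configured list or, if it is empty, a prefix match."""
    if bucket_list:
        # An explicit list takes precedence; the prefix is ignored in this case.
        return list(bucket_list)
    # No explicit list: keep every discovered bucket whose name starts with the prefix.
    return [name for name in available if name.startswith(bucket_prefix)]
```

With an empty prefix and an empty list, every discovered bucket is selected, which corresponds to a full crawl.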

Recrawl

The GCS connector picks up content changes (added, updated, and deleted objects) on each recrawl to keep the Solr index up to date.
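
Conceptually, change detection amounts to comparing the state seen in the previous crawl with the current state. The sketch below illustrates that idea only and is not the connector's implementation: the function `diff_snapshots` is hypothetical, and the version marker stands for any value that changes when an object changes, such as a checksum.

```python
def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {object_name: version_marker} snapshots and classify the changes."""
    added = sorted(name for name in current if name not in previous)
    deleted = sorted(name for name in previous if name not in current)
    updated = sorted(
        name for name in current
        if name in previous and current[name] != previous[name]
    )
    return {"added": added, "updated": updated, "deleted": deleted}


previous = {"b/report.pdf": "v1", "b/data.csv": "v1"}
current = {"b/report.pdf": "v2", "b/notes.json": "v1"}
print(diff_snapshots(previous, current))
# {'added': ['b/notes.json'], 'updated': ['b/report.pdf'], 'deleted': ['b/data.csv']}
```

Added and updated objects would be indexed again, and deleted objects would be removed from the index.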

Remote connectors

V2 connectors support running remotely in Fusion versions 5.7.1 and later.
Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:
additionalVolumes:
- name: fusion-data1-pvc
  persistentVolumeClaim:
    claimName: fusion-data1-pvc
- name: fusion-data2-pvc
  persistentVolumeClaim:
    claimName: fusion-data2-pvc
additionalVolumeMounts:
- name: fusion-data1-pvc
  mountPath: "/connector/data1"
- name: fusion-data2-pvc
  mountPath: "/connector/data2"
You may also need to specify the user that is authorized to access the file system, as in this example:
securityContext:
    fsGroup: 1002100000
    runAsUser: 1002100000

Configuration

| Name | Title | Description |
|---|---|---|
| authenticationProperties | Authentication settings | Connect to the bucket store using a service account. The service account requires the following permissions: storage.buckets.list to crawl all the available buckets; storage.objects.list and storage.objects.get to access the objects in the buckets. |
| applicationProperties | Limit documents | Bucket and object filtering options. |
| jsonKey | Service account JSON key | JSON key contents from an authorized service account. |
| buckets | Bucket list | Add the bucket names to crawl. Leave blank to crawl all the available buckets. |
| includedFileExtensions | Included file extensions | Set of file extensions to be fetched. If specified, all non-matching files are skipped. |
| excludedFileExtensions | Excluded file extensions | Set of file extensions to be skipped during the fetch. |
| inclusiveRegexes | Inclusive regexes | Regular expressions for bucket or object name patterns to include. This limits the datasource to items that match the regular expressions. |
| exclusiveRegexes | Exclusive regexes | Regular expressions for bucket or object name patterns to exclude. This limits the datasource to items that do not match the regular expressions. |
| maxSizeBytes | Maximum File Size | Excludes objects whose size is larger than the configured value. |
| minSizeBytes | Minimum File Size | Excludes objects whose size is smaller than the configured value. |
| bucketPrefix | Bucket prefix | Filter results to buckets whose names begin with this prefix. Useful only when the ‘Bucket List’ property is empty. |
| blobsPrefix | Object prefix | Filter results to objects whose names begin with this prefix. |
| pageSize | Buckets and Objects page size | Maximum number of buckets or objects returned per page. |
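
To make the combined effect of the object filtering properties concrete, here is a minimal sketch of how they could be evaluated for a single object. It is an illustration under stated assumptions, not the connector's code: the class `ObjectFilter` is hypothetical, extensions are assumed to be compared case-insensitively without the leading dot, and regular expressions are assumed to match anywhere in the object name.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectFilter:
    """Hypothetical container mirroring the object filtering properties."""
    included_extensions: set = field(default_factory=set)
    excluded_extensions: set = field(default_factory=set)
    inclusive_regexes: list = field(default_factory=list)
    exclusive_regexes: list = field(default_factory=list)
    min_size_bytes: Optional[int] = None
    max_size_bytes: Optional[int] = None
    blobs_prefix: str = ""

    def accepts(self, name: str, size: int) -> bool:
        """Return True if an object with this name and size passes every filter."""
        if not name.startswith(self.blobs_prefix):
            return False
        # Extension: text after the last dot of the final path segment, lowercased.
        base = name.rsplit("/", 1)[-1]
        ext = base.rsplit(".", 1)[-1].lower() if "." in base else ""
        if self.included_extensions and ext not in self.included_extensions:
            return False
        if ext in self.excluded_extensions:
            return False
        # Assumption: a pattern may match anywhere in the name (re.search).
        if self.inclusive_regexes and not any(re.search(p, name) for p in self.inclusive_regexes):
            return False
        if any(re.search(p, name) for p in self.exclusive_regexes):
            return False
        if self.min_size_bytes is not None and size < self.min_size_bytes:
            return False
        if self.max_size_bytes is not None and size > self.max_size_bytes:
            return False
        return True
```

For example, `ObjectFilter(included_extensions={"pdf"}, max_size_bytes=1000).accepts("docs/report.pdf", 500)` returns `True`, while the same filter rejects a 5000-byte PDF or any `.docx` file.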