username - string
An AWS Access Key ID that can access the content.
password - string
The AWS Secret Key associated with the Access Key.
db - Connector DB
Type and properties for a ConnectorDB implementation to use with this datasource.
type - string
Fully qualified class name of ConnectorDb implementation.
>= 1 characters
Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb
inlinks - boolean
Keep track of incoming links. This negatively impacts performance and size of DB.
Default: false
aliases - boolean
Keep track of original URIs that resolved to the current URI. This negatively impacts performance and size of DB.
Default: false
inv_aliases - boolean
Keep track of target URIs that the current URI resolves to. This negatively impacts performance and size of DB.
Default: false
url - string
A fully-qualified S3 URL, including bucket and sub-bucket paths, as required, e.g., 's3://{bucketName}/{path}'.
>= 1 characters
Match pattern: .*:.*
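The constraints on url can be checked before a datasource is saved. The following is a minimal sketch, not part of the connector: the function names are hypothetical, and the pattern is applied as an unanchored search, which is an assumption about how the schema pattern is evaluated.

```python
import re
from typing import Tuple

# Constraints copied from the reference: >= 1 character, match pattern .*:.*
URL_PATTERN = re.compile(r".*:.*")


def is_valid_url_value(url: str) -> bool:
    """Return True if the value satisfies the two documented constraints."""
    return len(url) >= 1 and URL_PATTERN.search(url) is not None


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split 's3://{bucketName}/{path}' into (bucketName, path).

    Hypothetical helper for illustration only.
    """
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError(f"not an s3:// URL: {url!r}")
    bucket, _, path = url[len(prefix):].partition("/")
    return bucket, path
```

Note that the documented pattern only requires a colon somewhere in the value, so it accepts strings that are not S3 URLs; the second helper is stricter.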
max_docs - integer
Maximum number of documents to fetch. The default (-1) means no limit.
>= -1
exclusiveMinimum: false
Default: -1
max_bytes - integer
Maximum size (in bytes) of documents to fetch or -1 for unlimited file size.
>= -1
exclusiveMinimum: false
Default: 10485760
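Both max_docs and max_bytes use -1 to mean "no limit". A small sketch of that convention follows; the helper name is hypothetical, and treating the limit as inclusive is an assumption, since the reference does not say whether a document of exactly max_bytes is fetched.

```python
UNLIMITED = -1
DEFAULT_MAX_BYTES = 10 * 1024 * 1024  # 10485760, the documented default (10 MiB)


def under_limit(value: int, limit: int) -> bool:
    """Return True if value is allowed by limit, treating -1 as 'no limit'.

    Assumption: the limit is inclusive.
    """
    if limit < UNLIMITED:
        raise ValueError("limit must be >= -1")
    return limit == UNLIMITED or value <= limit
```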
index_directories - boolean
Set to true to add directories to the index as documents. If set to false, directories will not be added to the index, but they will still be traversed for documents.
Default: false
max_threads - integer
The maximum number of threads to use for fetching data. Note: Each thread creates a new connection to the repository. This may improve overall throughput, but it also requires more system resources, including CPU and memory.
Default: 1
add_failed_docs - boolean
Set to true to add documents even if they partially fail processing. Failed documents will be added with as much metadata as available, but may not include all expected fields.
Default: false
crawl_item_timeout - integer
Maximum time, in milliseconds, allowed to fetch any individual document.
exclusiveMinimum: true
Default: 600000
maximum_connections - integer
Maximum number of concurrent connections to the filesystem. A large number of documents could cause a large number of simultaneous connections to the repository and lead to errors or degraded performance. In some cases, reducing this number may resolve performance issues.
Default: 1000
initial_mapping - Initial field mapping
Provides mapping of fields before documents are sent to an index pipeline.
skip - boolean
Set to true to skip this stage.
Default: false
label - string
A unique label for this stage.
<= 255 characters
condition - string
Define a conditional script that must evaluate to true or false. This can be used to determine whether the stage should run.
reservedFieldsMappingAllowed - boolean
Default: false
retentionMappings - array[object]
Fields that should be kept or deleted
Default:
object attributes:
field (required) - string
Display name: Field
operation - string
Display name: Operation
updateMappings - array[object]
Values that should be added to or set on a field. When a value is added, any values previously on the field will be retained. When a value is set, any values previously on the field will be overwritten.
Default:
object attributes:
field (required) - string
Display name: Field
value (required) - string
Display name: Value
operation - string
Display name: Operation
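The difference between adding and setting a value can be sketched as follows. This is an illustration of the described behavior, not the connector's implementation: the document is modeled as a dict of field name to list of values, and the operation strings "add" and "set" are assumed from the description above.

```python
def apply_update_mappings(doc, mappings):
    """Return a copy of doc with each update mapping applied in order.

    doc maps field names to lists of values (an assumed model).
    """
    result = {field: list(values) for field, values in doc.items()}
    for mapping in mappings:
        field = mapping["field"]
        value = mapping["value"]
        operation = mapping["operation"]
        if operation == "add":
            # Existing values are retained; the new value is appended.
            result.setdefault(field, []).append(value)
        elif operation == "set":
            # Existing values are overwritten.
            result[field] = [value]
        else:
            raise ValueError(f"unsupported operation: {operation!r}")
    return result
```

For example, adding "b" to a field that holds ["a"] yields ["a", "b"], while setting "b" yields ["b"].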
translationMappings - array[object]
Fields that should be moved or copied to another field. When a field is moved, the values from the source field are moved over to the target field and the source field is removed. When a field is copied, the values from the source field are copied over to the target field and the source field is retained.
Default: {"source":"fetch_time","target":"fetch_time_dt","operation":"move"}, {"source":"ds:description","target":"description","operation":"move"}
object attributes:
source (required) - string
Display name: Source Field
target (required) - string
Display name: Target Field
operation - string
Display name: Operation
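The move and copy semantics can be sketched with the two default mappings. This is an illustration only: the document is modeled as a dict of field name to list of values, "copy" as an operation string is assumed from the description (only "move" appears in the defaults), and appending to an existing target rather than replacing it is also an assumption.

```python
DEFAULT_TRANSLATIONS = [
    {"source": "fetch_time", "target": "fetch_time_dt", "operation": "move"},
    {"source": "ds:description", "target": "description", "operation": "move"},
]


def apply_translation_mappings(doc, mappings):
    """Return a copy of doc with each move/copy mapping applied in order."""
    result = {field: list(values) for field, values in doc.items()}
    for mapping in mappings:
        source = mapping["source"]
        target = mapping["target"]
        operation = mapping["operation"]
        if operation not in ("move", "copy"):
            raise ValueError(f"unsupported operation: {operation!r}")
        if source not in result:
            continue  # nothing to translate for this document
        # Assumption: values are appended to whatever the target already holds.
        result.setdefault(target, []).extend(result[source])
        if operation == "move":
            del result[source]  # a move removes the source field
    return result
```

With the defaults, a document holding fetch_time ends up with fetch_time_dt instead, and any other fields are left untouched.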
unmappedRule - Unmapped Fields
Fields not mapped by the above rules. By default, any remaining fields will be kept on the document.
keep - boolean
Keep all unmapped fields
Default: true
delete - boolean
Delete all unmapped fields
Default: false
fieldToMoveValuesTo - string
Move all unmapped field values to this field
fieldToCopyValuesTo - string
Copy all unmapped field values to this field
valueToAddToUnmappedFields - string
Add this value to all unmapped fields
valueToSetOnUnmappedFields - string
Set this value on all unmapped fields
crawl_depth - integer
Number of levels in a directory or site tree to descend for documents.
>= -1
exclusiveMinimum: false
Default: -1
bounds - string
Limits the crawl to a specific directory sub-tree, hostname or domain.
Default: tree
Allowed values: tree, host, domain, none
include_paths - array[string]
Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.
exclude_paths - array[string]
Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.
include_extensions - array[string]
List the file extensions to be fetched. Note: Files with possible matching MIME types but non-matching file extensions will be skipped. Extensions should be listed without periods, using whitespace to separate items (e.g., 'pdf zip').
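How include_paths, exclude_paths, and include_extensions combine can be sketched as below. This is an approximation under stated assumptions, not the connector's logic: the function name is hypothetical, patterns are applied as an unanchored search, the extension comparison is case-insensitive, and an empty list means "no restriction".

```python
import re


def should_fetch(uri, include_paths=(), exclude_paths=(), include_extensions=()):
    """Return True if uri passes all three filters."""
    # If include patterns are given, at least one must match.
    if include_paths and not any(re.search(p, uri) for p in include_paths):
        return False
    # Any matching exclude pattern rejects the URI.
    if any(re.search(p, uri) for p in exclude_paths):
        return False
    # Extensions are listed without periods, e.g. ["pdf", "zip"].
    if include_extensions:
        name = uri.rsplit("/", 1)[-1]
        extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if extension not in {e.lower() for e in include_extensions}:
            return False
    return True
```

For instance, with include_extensions set to ["pdf", "zip"], a file named report.txt is skipped even if its MIME type would otherwise match.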
use_instance_creds - boolean
Use a credentials provider chain that uses system properties rather than an AWS key. This can be used to provide AWS EC2 instance credentials (for Fusion hosted on an EC2 instance, whose nodes already have an EC2 instance role assigned). Detailed information can be found in the AWS SDK documentation (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html). You can specify another AWS region through the "SSE-KMS Encryption > AWS Region" option. Other parameters can be set through the Helm chart or by modifying the deployment.
Default: false
commit_on_finish - boolean
Set to true to send a commit request to Solr after the last batch has been fetched, so that the documents are committed to the index.
Default: true
verify_access - boolean
Set to true to require successful connection to the filesystem before saving this datasource.
Default: true
use_sigv4 - boolean
Sets the signature algorithm used for signing requests to "AWSS3V4SignerType". Required to retrieve objects encrypted with SSE-KMS.
Default: false
aws_region - string
Sets the region to be used by the client. This is used to determine both the service endpoint (e.g., https://sns.us-west-2.amazonaws.com) and the signing region.
Default: us-west-2
Allowed values: us-gov-west-1, us-gov-east-1, us-east-1, us-east-2, us-west-1, us-west-2, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, sa-east-1, cn-north-1, cn-northwest-1, ca-central-1, me-south-1, af-south-1
retryDelay - integer
The initial retry time delay, in milliseconds.
>= 1000
exclusiveMinimum: false
Default: 1000
stopRetry - integer
The maximum time to retry failed requests, in minutes.
>= 1
exclusiveMinimum: false
Default: 5
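To get a feel for how retryDelay and stopRetry interact, the sketch below lists the waits that fit inside the retry window. The reference only says the delay is "initial", so the doubling factor is an assumption, as is ignoring the time each request itself takes; the function name is hypothetical.

```python
def retry_delays_ms(retry_delay_ms=1000, stop_retry_minutes=5, factor=2):
    """Return the successive waits (ms) that fit within the retry window.

    Assumption: each wait is the previous one multiplied by factor.
    """
    if retry_delay_ms < 1000:
        raise ValueError("retryDelay must be >= 1000")
    if stop_retry_minutes < 1:
        raise ValueError("stopRetry must be >= 1")
    budget_ms = stop_retry_minutes * 60_000
    delays = []
    elapsed = 0
    delay = retry_delay_ms
    while elapsed + delay <= budget_ms:
        delays.append(delay)
        elapsed += delay
        delay *= factor
    return delays
```

Under these assumptions the defaults allow eight waits, from 1000 ms up to 128000 ms, totaling 255000 ms of the 300000 ms window.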
proxyHost - string
The optional proxy host the client will connect through
proxyPort - integer
The optional proxy port the client will connect through
proxyUsername - string
The optional proxy user name to use if connecting through a proxy
proxyPassword - string
The optional proxy password to use when connecting through a proxy
proxyHttps - boolean
Force the use of HTTPS when connecting to the proxy.
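The properties in this reference can be assembled into a single dictionary before being submitted. The sketch below uses only property names listed here; the helper itself is hypothetical, how the result is sent to Fusion is not shown, and the rule that an access key pair is needed unless use_instance_creds is true is an assumption drawn from the descriptions.

```python
import json


def build_properties(url, *, username=None, password=None,
                     use_instance_creds=False, **overrides):
    """Return a dict of datasource properties (hypothetical helper)."""
    # Assumption: a key pair is required unless instance credentials are used.
    if not use_instance_creds and not (username and password):
        raise ValueError("provide username and password, or set use_instance_creds")
    properties = {"url": url, "use_instance_creds": use_instance_creds}
    if not use_instance_creds:
        properties["username"] = username
        properties["password"] = password
    properties.update(overrides)
    return properties


# Example: instance credentials, SSE-KMS signing, and an HTTPS proxy.
# Bucket, host, and port values are placeholders.
example = build_properties(
    "s3://example-bucket/reports",
    use_instance_creds=True,
    use_sigv4=True,
    aws_region="eu-west-1",
    proxyHost="proxy.example.com",
    proxyPort=3128,
    proxyHttps=True,
)
print(json.dumps(example, indent=2))
```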