https://github.com/lucidworks/connectors-sdk-resources/
Develop a Custom Connector
A connector's configuration is an interface that extends ConnectorConfig. Methods annotated with @Property are considered to be configuration properties. For example, @Property() String name(); results in a String property called name. This property would then be present in the generated schema.

Here is an example of the most basic configuration, along with the required annotations:
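The sketch below is a minimal reconstruction based on the description above. The interface and property names (MyConnectorConfig, Properties, name) are illustrative, the @RootSchema attribute names (title, description, category) are assumptions, and imports of the SDK's ConnectorConfig, @RootSchema, and @Property types are omitted because their package locations depend on the SDK version.

```java
// Minimal connector configuration sketch (illustrative names; SDK imports omitted).
@RootSchema(
    title = "My Example Connector",
    description = "Fetches content from an example source",
    category = "Example"
)
public interface MyConnectorConfig extends ConnectorConfig<MyConnectorConfig.Properties> {

  // The custom-properties interface referenced by the type parameter above.
  @Property()
  Properties properties();

  interface Properties {

    // Results in a String property called "name" in the generated schema.
    @Property()
    String name();
  }
}
```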
@RootSchema is used by Fusion when showing the list of available connectors.
The ConnectorConfig base interface represents common, top-level settings required by all connectors. The type parameter of ConnectorConfig indicates the interface to use for custom properties.

Once a connector configuration has been defined, it can be associated with the ConnectorPlugin class. From that point, the framework takes care of providing configuration instances to your connector. It also generates the schema and sends it along to Fusion when the connector connects.
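As a rough sketch of that association, modeled on the SDK's sample plugins: the provider and builder names used here (ConnectorPluginProvider, ConnectorPlugin.builder, withFetcher) follow those samples and may differ between SDK versions, and MyContentFetcher stands in for a hypothetical ContentFetcher implementation. SDK and Guice imports are omitted.

```java
// Sketch: associating the configuration with the plugin and registering a fetcher.
public class MyConnectorPlugin implements ConnectorPluginProvider {

  @Override
  public ConnectorPlugin get() {
    // Optional Guice module that wires the fetcher's own dependencies.
    Module fetchModule = new AbstractModule() {
      @Override
      protected void configure() {
        // bind connector-specific services here
      }
    };

    // The framework reads MyConnectorConfig, generates its schema, and sends
    // that schema to Fusion when the plugin connects.
    return ConnectorPlugin.builder(MyConnectorConfig.class)
        .withFetcher("content", MyContentFetcher.class, fetchModule)
        .build();
  }
}
```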
Schema metadata can be applied to properties using additional annotations, for example to limit the min/max length of a string or to describe the types of items in an array. Nested schema metadata can also be applied to a single field by using "stacked" schema-based annotations:
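The following sketch assumes the SDK's @StringSchema and @ArraySchema annotations with attributes such as minLength, maxLength, and minItems; verify the exact names against your SDK version. SDK annotation imports are omitted.

```java
import java.util.List;

// Sketch of schema metadata and "stacked" annotations on configuration properties.
public interface ExampleProperties {

  // Schema metadata on a single String property: length limits.
  @Property(title = "Collection name")
  @StringSchema(minLength = 1, maxLength = 128)
  String collectionName();

  // "Stacked" annotations on one field: @ArraySchema describes the array
  // itself, while @StringSchema describes the type of each item in it.
  @Property(title = "Start links")
  @ArraySchema(minItems = 1)
  @StringSchema(minLength = 1)
  List<String> startLinks();
}
```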
$FUSION_HOME is your Fusion installation directory and <version> is your Fusion version number.

The plugin is packaged as a .zip file. This zip must contain only one connector plugin.

Here is an example of how to start up using the web connector:

Fusion release | SDK version |
--- | --- |
5.11.x - 5.12.x | 4.2.1 |
5.9.x - 5.10.x | 4.2.0 |
5.8.0 - 5.8.x | 4.1.4 |
5.6.x - 5.7.x | 4.1.3 |
5.5.1-1 - 5.5.1-x | 4.1.2 |
5.5.1 - 5.5.x | 4.1.2 |
5.5.0 | 4.1.1 |
5.4.4 - 5.4.x | 4.1.0 |
5.4.0 - 5.4.3 | 4.0.0 |
5.3.0 - 5.3.x | 3.0.0 |
5.2.1 - 5.2.x | 2.0.3 |
5.2.0 | 2.0.2 |
5.1.2 - 5.1.x | 2.0.1 |
5.1.0 - 5.1.1 | 2.0.0 |
5.0.2 | 2.0.0-pre-release |
4.2.6 | 1.5.0 |
4.2.4 - 4.2.5 | 1.4.0 |
4.2.2 - 4.2.3 | 1.3.0 |
4.2.1 | 1.2.0 |
4.2.0 | 1.1.0 |
Field Name | Field Description | Example value |
--- | --- | --- |
id | Unique candidate identifier | content:/app |
jobId_s | Unique job identifier. All items processed in the new job will have a different jobId. | KTPbmHYTqm |
blockId_s | A blockId identifies a series of one or more jobs; its lifetime spans from the start of a crawl to the crawl's completion. When a job starts and the previous job did not complete (failed or stopped), the previous job's blockId is reused. The same blockId will be reused until the crawl successfully completes. BlockIds are used to quickly identify items in the CrawlDB which may not have been fully processed (completed). | KwhuWW7wya |
state_s | State transition. Possible values: FetchInput, Document, Skip, Error, Checkpoint, ACI (AccessControlItem), Delete, FetchResult. | Document |
targetPhase_s | Name of the phase this item is emitted to. | content |
sourcePhase_s | Name of the phase an item was emitted from. | content |
isTransient_b | Flag to indicate that the item should be removed from the CrawlDB after it has been processed. | false |
isLeafNode_b | Flag used to prioritize processing of leaf nodes over nested nodes, to avoid emitting too many candidates. | false |
createdAt_l | Item created timestamp. | 1566508663611 |
createdAt_tdt | Item created ISO date. | 2019-08-22T21:17:43.611Z |
modifiedAt_l | Timestamp value that is updated when the item changes state. If the purge-stray-items feature is enabled in the connector plugin, this field is also used to determine whether the item is stray; stray items are deleted. | 1566508665709 |
modifiedAt_tdt | ISO date value that is updated when the item changes state. It serves the same purpose as modifiedAt_l. | 2019-08-22T21:17:45.709Z |
fetchInput_id_s | FetchInput Id. | /app |
An item is marked complete by setting the blockId field in the item metadata to match the blockId of the current job.
A blockId is used to quickly identify items in the CrawlDB which may not have been fully processed or completed. A completed job is one that stops naturally because the source data has been fully processed, as opposed to jobs that are manually stopped or that fail.

A blockId identifies a series of one or more jobs. The lifetime of a blockId spans from the start of the initial crawl (or from immediately after a completed one) all the way to completion. The SDK controller generates and uses a new blockId when the previous job completed, that is, when it ended in the FINISHED state.

If the previous job did not complete, for example because it failed or was STOPPED, the previous job's blockId is reused. The same blockId will be reused until the crawl successfully completes. The SDK controller will continue checking the CrawlDB for incomplete items, which are identified by having a blockId that does not match the previous job's blockId. This approach ensures all items within the job are completed before the next job begins, even if the job was stopped multiple times before completion.
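To make those rules concrete, here is a small hypothetical sketch; PreviousJob, JobState, and resolveBlockId are invented for illustration and are not SDK API. The SDK controller performs this decision internally.

```java
import java.util.UUID;

// Hypothetical types used only for this illustration.
enum JobState { FINISHED, STOPPED, FAILED }

record PreviousJob(String blockId, JobState state) {}

class BlockIdResolver {

  // A new blockId is generated only when the previous crawl completed
  // (FINISHED). Otherwise the previous blockId is reused so that incomplete
  // items from the interrupted crawl can still be identified and finished.
  static String resolveBlockId(PreviousJob previous) {
    if (previous == null || previous.state() == JobState.FINISHED) {
      return UUID.randomUUID().toString();   // start a new block
    }
    return previous.blockId();               // reuse until the crawl completes
  }
}
```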
When an item is considered a new candidate, the item’s blockId does not change. Later, when the item is fully processed by the fetcher, the blockId is added to the item metadata and stored in the SDK CrawlDB. The item is then considered complete but will only be sent to fetchers when a new blockId is generated.
When all items are complete, the SDK will check for checkpoints, as detailed in Checkpoint Design.
3. The fetcher emits a new candidate, Item A.
4. The controller receives the candidate and stores it in the SDK CrawlDB. Mandatory fields are set in the item metadata, but the blockId field is not set.
5. Later, in the same job, the candidate Item A is selected by the SDK controller, which sends it to a fetcher.
6. The fetcher receives the candidate and processes it.
7. The fetcher emits a Document from the candidate.
8. The fetcher emits a FetchResult to the SDK controller.
9. The SDK controller receives both the Document and the FetchResult.
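The sketch below mirrors steps 3 through 9 from the fetcher's side. It is modeled on the SDK's ContentFetcher samples; the method names used here (getFetchInput, newCandidate, newDocument, newResult, emit) are assumptions based on those samples and may differ by SDK version, and SDK imports (ContentFetcher, FetchContext, FetchResult) are omitted.

```java
// Sketch of a fetcher taking part in the walkthrough above.
public class ItemAFetcher implements ContentFetcher {

  @Override
  public FetchResult fetch(FetchContext fetchContext) {
    String id = fetchContext.getFetchInput().getId();

    if ("/app".equals(id)) {
      // Step 3: emit a candidate (Item A). The controller stores it in the
      // SDK CrawlDB with the mandatory fields but no blockId yet (step 4).
      fetchContext.newCandidate("content:/app/item-a").emit();
    } else {
      // Steps 6-7: when the candidate is sent back, process it and emit a
      // Document built from it.
      fetchContext.newDocument(id)
          .fields(f -> f.setString("title", "Item A"))
          .emit();
    }

    // Steps 8-9: return a FetchResult to the SDK controller.
    return fetchContext.newResult();
  }
}
```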
Some fetchers emit items with Transient=true, which indicates that the item should be removed from the CrawlDB after it has been processed. The IncrementalContentFetcher is an example.