The Plugin class describes the various subcomponents of a connector, as well as its dependencies. The current set of subcomponents includes the Fetcher, validation, security filtering, and configuration suggestions.
You can think of the Plugin class as the component that glues everything together for an implementation. It is also the place where fetch ‘phases’ are defined, by binding fetcher classes to phase names.
You can see an example here.
Notice that each subcomponent can have its own set of dependencies, or several subcomponents can share dependencies.
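
The binding of fetcher classes to phase names described above can be pictured as a small registry. The following is a minimal sketch and not the SDK's Plugin API: PhaseBindingSketch, SimpleFetcher, and the phase names "list" and "content" are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: none of these types come from the connectors SDK.
final class PhaseBindingSketch {

    /** Stand-in for a fetcher; the real Fetcher interface is richer. */
    interface SimpleFetcher {
        String describe();
    }

    // Phase name -> factory for the fetcher bound to that phase.
    // LinkedHashMap keeps phases in the order they were bound.
    private final Map<String, Supplier<SimpleFetcher>> phases = new LinkedHashMap<>();

    /** Binds a fetcher factory to a phase name; a phase can be bound only once. */
    PhaseBindingSketch bind(String phase, Supplier<SimpleFetcher> factory) {
        if (phases.containsKey(phase)) {
            throw new IllegalArgumentException("Phase already bound: " + phase);
        }
        phases.put(phase, factory);
        return this;
    }

    /** Returns the phase names in binding order. */
    List<String> phaseNames() {
        return new ArrayList<>(phases.keySet());
    }

    /** Creates the fetcher bound to the given phase. */
    SimpleFetcher fetcherFor(String phase) {
        Supplier<SimpleFetcher> factory = phases.get(phase);
        if (factory == null) {
            throw new IllegalArgumentException("No fetcher bound to phase: " + phase);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        PhaseBindingSketch plugin = new PhaseBindingSketch()
            .bind("list", () -> () -> "lists candidate resources")
            .bind("content", () -> () -> "fetches content for each candidate");
        for (String phase : plugin.phaseNames()) {
            System.out.println(phase + ": " + plugin.fetcherFor(phase).describe());
        }
    }
}
```

Binding factories rather than instances is a design choice of this sketch: it lets each phase create its fetcher, along with that fetcher's dependencies, only when the phase actually runs.
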
A connector’s configuration is described through the Model interface. By simply adding methods with @Property annotations, you can dynamically generate a type-safe configuration object from a connector’s configuration data.
Similarly, you can generate a Fusion-compatible JSON schema by implementing Model.
You can see an example of a configuration implementation here.
More detailed information on the configuration and schema capabilities can be found here.
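
One way to picture how annotated methods can yield a type-safe configuration object is a dynamic proxy over the raw configuration data. This is only a sketch of the general technique, assuming nothing about how the SDK does it internally: the @Property annotation declared here, ConfigSketch, MyConfig, and the property names are all hypothetical stand-ins.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: this @Property annotation and ConfigSketch are stand-ins,
// not the SDK's own Model and @Property types.
final class ConfigSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Property {
        String name();
    }

    /** A hypothetical connector configuration described purely by annotated methods. */
    interface MyConfig {
        @Property(name = "startUrl")
        String startUrl();

        @Property(name = "maxDepth")
        Integer maxDepth();
    }

    /** Builds a typed view over raw configuration data using a dynamic proxy. */
    static <T> T bind(Class<T> type, Map<String, Object> data) {
        Object proxy = Proxy.newProxyInstance(
            type.getClassLoader(),
            new Class<?>[] {type},
            (self, method, args) -> {
                Property property = method.getAnnotation(Property.class);
                if (property == null) {
                    // Methods without @Property (including toString) are not supported here.
                    throw new UnsupportedOperationException(method.getName());
                }
                // Cast through the declared return type so a wrongly typed value fails loudly.
                return method.getReturnType().cast(data.get(property.name()));
            });
        return type.cast(proxy);
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new HashMap<>();
        raw.put("startUrl", "https://example.com");
        raw.put("maxDepth", 3);
        MyConfig config = bind(MyConfig.class, raw);
        System.out.println(config.startUrl() + " depth=" + config.maxDepth());
    }
}
```

The sketch uses wrapper return types such as Integer because Class.cast does not accept primitive types; a fuller version would also need to handle defaults and required properties.
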
The Fetcher is where most of the work is done for any connector. This interface provides methods that define how data is fetched and indexed. For a connector to have its content indexed into Fusion, it must emit messages, which are simple objects that contain metadata and content. See the message definitions below for more details.
The Fetcher has a simple lifecycle for dealing with fetching content. The flow outline is as follows:
1. When a ConnectorJob starts, the Fetcher’s start() method is called. The main use case for a start() call is to run setup code. This method is called once per job run on every fetcher bound to a phase.
2. The fetcher emits messages such as Documents, which are indexed directly into the associated content collection. There are several other types of messages; see Message Definitions.
3. For each phase, the start, fetch, and stop methods are called on the fetcher instance bound to the phase.
See the example here.
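
The start, fetch, and stop flow outlined above can be sketched as a small driver. The interface below is a simplified, hypothetical stand-in rather than the SDK's Fetcher: SimpleFetcher, RecordingFetcher, and runPhase are illustrative names, and messages are reduced to plain strings. Calling stop() in a finally block is a choice made for this sketch, not documented SDK behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the start/fetch/stop flow; not the SDK's Fetcher interface.
final class LifecycleSketch {

    interface SimpleFetcher {
        void start();                      // setup code
        List<String> fetch(String input);  // returns the messages emitted for one input
        void stop();                       // teardown code
    }

    /** Records each lifecycle call so the order can be inspected. */
    static final class RecordingFetcher implements SimpleFetcher {
        final List<String> calls = new ArrayList<>();

        @Override
        public void start() {
            calls.add("start");
        }

        @Override
        public List<String> fetch(String input) {
            calls.add("fetch:" + input);
            return Collections.singletonList("Document(" + input + ")");
        }

        @Override
        public void stop() {
            calls.add("stop");
        }
    }

    /** Runs one phase: start once, fetch each input, then stop. */
    static List<String> runPhase(SimpleFetcher fetcher, List<String> inputs) {
        List<String> emitted = new ArrayList<>();
        fetcher.start();
        try {
            for (String input : inputs) {
                emitted.addAll(fetcher.fetch(input));
            }
        } finally {
            fetcher.stop(); // sketch-level choice: stop even if a fetch fails
        }
        return emitted;
    }

    public static void main(String[] args) {
        RecordingFetcher fetcher = new RecordingFetcher();
        List<String> emitted = runPhase(fetcher, Arrays.asList("/a.txt", "/b.txt"));
        System.out.println("calls:   " + fetcher.calls);
        System.out.println("emitted: " + emitted);
    }
}
```
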
A Candidate is metadata emitted from a Fetcher that represents a resource to eventually fetch. Once this message is received by the controller service, it is persisted and then added to a fetch queue. When the item is dequeued, a connector instance within the cluster is selected, and the message is sent as a FetchInput. The FetchInput is received in the fetch() method of the connector. At this point, the connector normally emits a Content or Document message, which is then indexed into Fusion.
This flow of Candidates is the key to enabling distributed fetching within the connectors framework.
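
The Candidate flow above (persist, enqueue, dequeue as a FetchInput, fetch, emit) can be sketched in a single process. This is an illustration only: CandidateFlowSketch, Message, and the sample paths are hypothetical, the set stands in for persistence, and the loop stands in for the controller service. In a real cluster each dequeued item could go to any connector instance.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical single-process sketch of the Candidate -> FetchInput -> Document flow.
final class CandidateFlowSketch {

    /** A message emitted by fetch(): either a new Candidate or a Document. */
    static final class Message {
        final String kind; // "Candidate" or "Document"
        final String id;

        Message(String kind, String id) {
            this.kind = kind;
            this.id = id;
        }
    }

    /** Hypothetical fetch(): the root emits candidates, anything else emits a document. */
    static List<Message> fetch(String fetchInput) {
        if (fetchInput.equals("/")) {
            return Arrays.asList(
                new Message("Candidate", "/a.txt"),
                new Message("Candidate", "/b.txt"));
        }
        return Collections.singletonList(new Message("Document", fetchInput));
    }

    /** Controller loop: persist each candidate, queue it, then dequeue it as a FetchInput. */
    static List<String> crawl(String root) {
        Set<String> persisted = new LinkedHashSet<>(); // stands in for persistence
        Queue<String> fetchQueue = new ArrayDeque<>();
        List<String> indexed = new ArrayList<>();

        persisted.add(root);
        fetchQueue.add(root);
        while (!fetchQueue.isEmpty()) {
            String fetchInput = fetchQueue.poll();
            for (Message message : fetch(fetchInput)) {
                if (message.kind.equals("Candidate")) {
                    // Persist first, then enqueue; already-known candidates are not queued twice.
                    if (persisted.add(message.id)) {
                        fetchQueue.add(message.id);
                    }
                } else {
                    indexed.add(message.id);
                }
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println("indexed: " + crawl("/"));
    }
}
```

Because every fetch is driven by a queued item rather than by in-memory state in the fetcher, any instance can pick up any item, which is what makes the work distributable.
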
A FetchInput represents an item to be fetched. Example values of a FetchInput are a file path, a URL, or a SQL statement. FetchInputs are passed to the fetch() method of a Fetcher and are derived from Candidate metadata.
A Document is a value that is emitted from a connector and represents structured content. Once the controller service receives a Document message, its metadata is persisted in the crawl-db and then sent to the associated IndexPipeline.
A Content message represents raw content that must be parsed in order to be indexed into Fusion. Content messages are analogous to InputStreams, and their bytes are streamed to Fusion.
Content types are actually composed of three different subtypes:
1. A start message, which includes the content-type and any other metadata related to the source data.
2. The messages that carry the streamed bytes.
3. An end message, which indicates that the Content is done.
The end result of sending a Content stream is a set of parsed documents within the Fusion collection associated with the connector.
Skip messages represent items that were not fetched for some reason, for example, items that fail validation rules related to path depth or availability. Each Skip can contain details on why the item was skipped.
Error messages indicate errors for a specific item. For example, when a connector’s fetch() method is called with a nonexistent FetchInput, the connector can emit an error that captures the details (“not found”, for example). Errors are recorded in the data store, but are not sent to the associated IndexPipeline.
Deletes tell the controller service to remove a specific item from the data store and the associated Fusion collection.
AccessControlItem messages represent the groups, users, roles, and so on that are used for security filtering.
Initial Item state | Valid transition state | Comment |
---|---|---|
Candidate | FetchInput | Candidates emitted are stored as FetchInput |
FetchInput | Document, Skip, Error, AccessControlItem | |
Document | Skip, Error | |
Skip | Document, Error | |
Error | Document, Skip | |
Checkpoint | Checkpoint | Checkpoint should not transition to any other state |
Delete | No transition state | |
AccessControlItem | No transition state | |
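
The transition table above can be encoded as a lookup, which makes it easy to check whether a given state change is valid. This is a sketch that simply mirrors the table rows; TransitionSketch, the State enum, and canTransition are hypothetical names and not part of the SDK.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: encodes the table of valid item state transitions.
final class TransitionSketch {

    enum State {
        CANDIDATE, FETCH_INPUT, DOCUMENT, SKIP, ERROR, CHECKPOINT, DELETE, ACCESS_CONTROL_ITEM
    }

    // Initial state -> set of valid transition states, one entry per table row.
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);

    static {
        ALLOWED.put(State.CANDIDATE, EnumSet.of(State.FETCH_INPUT));
        ALLOWED.put(State.FETCH_INPUT,
            EnumSet.of(State.DOCUMENT, State.SKIP, State.ERROR, State.ACCESS_CONTROL_ITEM));
        ALLOWED.put(State.DOCUMENT, EnumSet.of(State.SKIP, State.ERROR));
        ALLOWED.put(State.SKIP, EnumSet.of(State.DOCUMENT, State.ERROR));
        ALLOWED.put(State.ERROR, EnumSet.of(State.DOCUMENT, State.SKIP));
        // A Checkpoint may only be replaced by another Checkpoint.
        ALLOWED.put(State.CHECKPOINT, EnumSet.of(State.CHECKPOINT));
        // Delete and AccessControlItem have no transition state.
        ALLOWED.put(State.DELETE, EnumSet.noneOf(State.class));
        ALLOWED.put(State.ACCESS_CONTROL_ITEM, EnumSet.noneOf(State.class));
    }

    /** Returns true if the table lists 'to' as a valid transition from 'from'. */
    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(State.CANDIDATE, State.FETCH_INPUT));
        System.out.println(canTransition(State.DELETE, State.DOCUMENT));
    }
}
```
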
transient refers to a Candidate’s persistence with respect to the crawl-db.
When transient=true, the candidate is cleared from the crawl-db after each crawl completes. Transient candidates are not re-fetched in the next crawl, so subsequent crawls need to create new candidates using the data source properties and checkpoints. This is useful for a connector whose data source has a “delta change” feature that reports the documents created, updated, or removed since the last crawl. Because you know exactly what changed, you can avoid revisiting every candidate from previous crawls. This is much faster than revisiting each candidate in the entire crawl-db, so prefer this option whenever it is available.
When transient=false, the candidate is stored in the crawl-db and sent to be fetched again, so that it is reevaluated in each subsequent crawl. This suits a “re-crawl strategy”, where every previously crawled item must be explicitly revisited each time a subsequent crawl runs. Because revisiting each item is typically quite slow, use this only when the data source provides no “delta change” feature that reports the documents created, updated, or removed since the last crawl.
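
The difference between the two settings can be sketched with a toy crawl-db. This is a minimal illustration under stated assumptions: TransientSketch and its methods are hypothetical names, and the map stands in for the real crawl-db, which stores far more than a flag per candidate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a toy crawl-db that tracks only each candidate's transient flag.
final class TransientSketch {

    // Candidate id -> transient flag, in emission order.
    private final Map<String, Boolean> crawlDb = new LinkedHashMap<>();

    /** Records a candidate emitted during the current crawl. */
    void emitCandidate(String id, boolean isTransient) {
        crawlDb.put(id, isTransient);
    }

    /** End of a crawl: transient candidates are cleared, the rest are kept. */
    void completeCrawl() {
        crawlDb.entrySet().removeIf(entry -> entry.getValue());
    }

    /** Candidates that remain stored and would be sent to be fetched again next crawl. */
    List<String> refetchOnNextCrawl() {
        return new ArrayList<>(crawlDb.keySet());
    }

    public static void main(String[] args) {
        TransientSketch db = new TransientSketch();
        db.emitCandidate("/changed-since-checkpoint.txt", true); // delta-style candidate
        db.emitCandidate("/always-revisit.txt", false);          // re-crawl-style candidate
        db.completeCrawl();
        System.out.println("re-fetched next crawl: " + db.refetchOnNextCrawl());
    }
}
```
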