Datasources

A datasource is the configured connection between Fusion’s Connectors and your data sources.

Datasource configuration

To connect a data source to Fusion, you must create and configure a datasource in one of two ways:

  • Using the Fusion UI

    1. Click Applications > Collections.

    2. Select a Collection, or click Add a Collection to create a new one.

    3. Click Add a Datasource.

    4. Select a datasource type; these correspond to Fusion’s Connectors.

    5. Enter configuration information in the datasource panel that appears.

    6. Click Save.

  • Using the Connector Datasources API

A datasource must be configured with the following set of properties:

  • the connector type used to access the repository

  • the repository name and location

  • rules for transforming raw data into a structured JSON PipelineDocument object

  • the index pipeline to which to submit the JSON object

  • the collection where the documents will ultimately be indexed
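When using the Connector Datasources API instead of the UI, these same properties are supplied as a JSON body. The following is an illustrative sketch only; the datasource name, connector type, pipeline, and property names are assumptions and will vary by connector (see the Connectors and Datasources Reference for the exact fields):

```shell
# Create a web datasource via the Connector Datasources API.
# Illustrative only: adjust the connector, type, pipeline, and
# properties for your repository and connector type.
curl -u user:pass -X POST -H 'Content-Type: application/json' \
  http://localhost:8764/api/apollo/connectors/datasources \
  -d '{
    "id": "myDatasource",
    "connector": "lucid.anda",
    "type": "web",
    "pipeline": "conn_solr",
    "properties": {
      "startLinks": ["http://example.com"],
      "collection": "myCollection"
    }
  }'
```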

A datasource configuration may also include the following properties, depending on the application, data repository, and connector type:

  • authentication information, such as username, group names or IDs, passwords, or credential files.

  • rules for crawling a website or filesystem:

    • start location

    • which nodes to crawl

    • which files to retrieve or exclude

The Connectors and Datasources Reference provides complete details about datasource configuration.

Datasource jobs

In the Fusion UI, the datasource configuration panel is accessed via the "Datasource" link on a collection's home panel. Once a datasource is configured, this panel provides a control to run it; while a run is in progress, it provides controls to stop or abort the run. After a run has started, a "job history" control provides information on current and completed jobs. Job status information is stored in ZooKeeper. When crawling or spidering a website or filesystem, a record of the crawl is stored in the directory fusion/3.1.x/data/connectors/crawldb.

Datasources are defined via the Connector Datasources API. Once a datasource is configured, a connector job is started using the Connector Jobs API, which is also used to request job history information.
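For example, a configured datasource's job might be started and stopped with requests like the following. This is a sketch under the assumption that the Connector Jobs API accepts POST to start and DELETE to stop a job; verify the exact methods against the API reference, and note that 'myDatasource' is a placeholder name:

```shell
# Start a job for the datasource 'myDatasource' (illustrative).
curl -u user:pass -X POST \
  http://localhost:8764/api/apollo/connectors/jobs/myDatasource

# Stop the running job for the same datasource (illustrative).
curl -u user:pass -X DELETE \
  http://localhost:8764/api/apollo/connectors/jobs/myDatasource
```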

Information on all runs of a particular datasource is retrieved by sending a GET request to the Fusion API services endpoint:

api/apollo/connectors/jobs/<datasource id>

For example, to see the status of all jobs run for a datasource named 'myDatasource', send the following GET request using the command-line curl utility:

curl -u user:pass http://localhost:8764/api/apollo/connectors/jobs/myDatasource
Note
For a limited range of document formats, documents can be added to a collection by pushing them to an index pipeline directly, without using a connector and datasource. Use cases for this include loading a massive dataset, as well as application development, testing, and troubleshooting. See Pushing Documents to a Pipeline for details.
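As a sketch, a JSON document might be pushed directly to an index pipeline with a request like the one below. The endpoint path, pipeline name, collection name, and field names here are assumptions for illustration; see Pushing Documents to a Pipeline for the exact endpoint and supported formats:

```shell
# Push one JSON document straight to an index pipeline, bypassing
# connectors and datasources (illustrative; 'myPipeline' and
# 'myCollection' are placeholder names).
curl -u user:pass -X POST -H 'Content-Type: application/json' \
  'http://localhost:8764/api/apollo/index-pipelines/myPipeline/collections/myCollection/index' \
  -d '[{"id": "doc1", "title_t": "Example document"}]'
```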