JDBC Connector and Datasource Configuration

The JDBC connector fetches documents from a relational database via SQL queries. Under the hood, this connector is built on the Solr DataImportHandler (DIH) plugin.

SQL queries for document retrieval

The JDBC connector is configured with the SQL statement used to retrieve a result set from the database. Each row in the result set is treated as a Solr document. This statement is specified as the value of the required property sql_select_statement.

The column names will be used as Solr document field names; use the SQL "AS" keyword to rename columns as needed. Treat column and table names as case-insensitive, even though some databases allow mixed-case names. All Solr documents must have a unique key, which is the Fusion "id" field. Therefore, the result set must contain a column "id" that can be used as the unique key for the resulting document.

SELECT customer_id AS id, * FROM customers

Delta queries

Delta queries provide incremental updates to the contents of a collection by indexing only those records in the database which have been changed since the database was last indexed by this connector. The SQL statement is specified as the value of the property delta_sql_query.

Delta queries select only primary key values; therefore, the query must select the primary key, and it must also have a "WHERE" clause that specifies a "last_modified" condition, as follows:

SELECT customer_id AS id FROM customers WHERE last_modified > $

The dollar-sign character '$' is required; at import time it is replaced with the time of the last successful import from this database.

Nested queries

Nested queries are used to index information that is stored in the database across a series of tables where there is a one-to-many or many-to-many relationship between them. This statement is specified as the value of the property nested_queries.

A nested query is used in conjunction with the SQL query specified by sql_select_statement. Within the nested query, the dollar-sign character '$' is a placeholder for the primary key of each row in the result set retrieved by sql_select_statement.

The following example shows a main query and nested query pair used to index the list of tags assigned to each document:

SELECT id FROM document
SELECT tag FROM tag INNER JOIN document_tag ON document_tag.tag_id=tag.id WHERE document_tag.doc_id=$

Uploading a JDBC driver

Fusion stores JDBC drivers in the blob store. You can upload a driver using the Fusion UI or the Blob Store API.

To upload a JDBC driver using the Fusion UI
  1. In the Fusion UI, navigate to DevOps > Blob Management.

  2. Click Add.

  3. Select JDBC Driver.

    The "New 'JDBC Driver' Upload" panel appears.

  4. Click Choose File and select the .jar file from your file system.

  5. Click Upload.

    The new driver’s blob manifest appears.

    From this screen you can also delete or replace the driver.

To install a JDBC driver using the API
  1. Upload the .jar file to Fusion’s blob store using the /blobs/{id} endpoint.

    Specify an arbitrary blob ID, and a resourceType value of driver:jdbc, as in this example:

    curl -u user:pass -H "content-type:application/java-archive" -H "content-length:707261" -X PUT --data-binary @postgresql-42.0.0.jar http://localhost:8764/api/apollo/blobs/mydriver?resourceType=driver:jdbc

    Success response:

    {
      "name" : "mydriver",
      "contentType" : "application/java-archive",
      "size" : 707261,
      "modifiedTime" : "2017-06-09T19:00:48.919Z",
      "version" : 0,
      "md5" : "c67163ca764bfe632f28229c142131b5",
      "metadata" : {
        "subtype" : "driver:jdbc",
        "drivers" : "org.postgresql.Driver",
        "resourceType" : "driver:jdbc"
      }
    }

    Fusion automatically publishes the upload event to the cluster, and listeners on each node perform the driver installation.

    Tip
    If the blob ID is identical to an existing one, the old driver is uninstalled and the new driver is installed in its place. To get the list of existing blob IDs, run:

    curl -u user:pass localhost:8764/api/apollo/blobs
  2. To verify the uploaded driver, run:

    curl -u user:pass http://localhost:8764/api/apollo/blobs/<id>/manifest

    Where <id> is the blob ID specified during upload, such as "mydriver" above. A success response looks like this:

    {
      "name" : "mydriver",
      "contentType" : "application/java-archive",
      "size" : 707261,
      "modifiedTime" : "2017-06-09T19:05:17.897Z",
      "version" : 1569755095787110400,
      "md5" : "c67163ca764bfe632f28229c142131b5",
      "metadata" : {
        "subtype" : "driver:jdbc",
        "drivers" : "org.postgresql.Driver",
        "resourceType" : "driver:jdbc"
      }
    }

Indexing binary data

The JDBC connector in Fusion does not automatically discover and index binary data that you may have stored in your database (such as PDF files). However, you can configure Fusion to recognize and extract binary data correctly by modifying the datasource configuration file. This file is created the first time the datasource runs, in fusion/3.1.x/data/connectors/lucid.jdbc/datasources/datasourceID/conf. The name of the file includes the name of the datasource, as in dataconfig_datasourceName.xml. If you are familiar with Solr’s DIH, you will recognize this as a standard dataconfig.xml file.

Follow these steps to modify the configuration file; a sketch of the resulting file appears after the steps:

  1. Add a name attribute for the database containing your binary data to the dataSource entry.

  2. Set the convertType attribute for the dataSource to false. This prevents Fusion from treating binary data as strings.

  3. Add a FieldStreamDataSource to stream the binary data to the Tika entity processor.

  4. Specify the dataSource name in the root entity.

  5. Add an entity for your FieldStreamDataSource that uses the TikaEntityProcessor to read and parse the binary data, and specify a field for storing the extracted content.

  6. Reload the Solr core to apply your configuration changes.
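
The following is a minimal sketch of what the modified file might look like, assuming a hypothetical PostgreSQL database with a documents table whose attachment column holds the binary content; the datasource names, connection details, table, column, and Solr field names are all illustrative:

<dataConfig>
  <!-- Steps 1 and 2: named JDBC datasource with convertType="false" so binary
       columns are kept as byte streams rather than strings. -->
  <dataSource name="db" driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/mydb"
              user="user" password="pass" convertType="false"/>
  <!-- Step 3: FieldStreamDataSource streams a column's bytes to Tika. -->
  <dataSource name="fieldstream" type="FieldStreamDataSource"/>
  <document>
    <!-- Step 4: the root entity names the JDBC datasource and selects the
         binary column along with the unique key. -->
    <entity name="docs" dataSource="db"
            query="SELECT id, attachment FROM documents">
      <!-- Step 5: the nested Tika entity reads the attachment column,
           parses it, and stores the extracted text in the body_t field. -->
      <entity name="attachment" dataSource="fieldstream"
              processor="TikaEntityProcessor"
              dataField="docs.attachment" format="text">
        <field column="text" name="body_t"/>
      </entity>
    </entity>
  </document>
</dataConfig>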

Troubleshooting

When using the JDBC connector, it is recommended that you work closely with your database administrator to formulate efficient and robust queries.

One possible source of problems is the driver being used. In some cases, indexing may fail due to problems with the driver, in particular with older versions of Oracle’s JDBC driver. If you have checked that your connection information is correct and your database is allowing the connection, research whether there are any known bugs with the driver you are using.

With Oracle databases, note that column names not enclosed in double-quotes are converted to upper case, but Solr field names are case-sensitive. If your column-to-field mapping is not working as expected, check your SQL statement for any lower-case names not enclosed in double-quotes.
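
For example, in this hypothetical pair of Oracle queries, the first returns a column named LAST_NAME (folded to upper case), while the double-quoted alias in the second preserves the lower-case name:

-- Unquoted column name is folded to upper case: arrives as LAST_NAME
SELECT last_name FROM customers

-- Double-quoted alias preserves case: arrives as last_name
SELECT last_name AS "last_name" FROM customers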

Dates can also be problematic. Solr has a different date format than many relational databases. If you want date and time fields to be indexed properly, you may need to convert database dates into the proper format using date/string conversion functions. In Oracle this is the TO_CHAR function; in Microsoft SQL Server, this is the CONVERT function.
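
For example, this hypothetical Oracle query renders a DATE column in Solr’s ISO-8601 date format (it assumes the stored times are UTC; the table and column names are illustrative):

SELECT customer_id AS id,
       TO_CHAR(last_modified, 'YYYY-MM-DD"T"HH24:MI:SS"Z"') AS last_modified_dt
FROM customers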

In MySQL databases, dates are allowed to be 0-strings, such as 0000-00-00, which is not acceptable to JDBC. If you have legacy date data, you may need to add the query parameter "zeroDateTimeBehavior=convertToNull" to your JDBC request string, as in `jdbc:mysql://localhost/myDatabase?zeroDateTimeBehavior=convertToNull`. This converts the zero-string dates to null values that can be added to the index.

Finally, database timeouts are another problematic area. There are several possible solutions, from increasing the timeout in the JDBC request (with "netTimeoutForStreamingResults") to altering the SQL statement to page the results, or dumping the records to CSV and indexing them with another connector.
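
For example, with MySQL the streaming-results timeout can be raised via a connection property in the JDBC URL; the value is in seconds, and the one-hour figure shown here is illustrative:

jdbc:mysql://localhost/myDatabase?netTimeoutForStreamingResults=3600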

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.