  • Latest version: v2.7.0
  • Compatible with Fusion version: 5.9.0 and later

Add a JDBC driver to Fusion

Perform these prerequisites to ensure the connector can reliably access, crawl, and index your data. Proper setup helps avoid configuration or permission errors and keeps your content available for discovery and search in Fusion.
Before creating your datasource, ensure Fusion can load your JDBC driver. The following steps let the connector pull in the driver automatically at runtime.
Fusion UI:
  1. Navigate to System > Blobs.
  2. Click Add > JDBC Driver.
  3. Choose your JAR file and click Upload.
API: In the following steps, replace ID with the name of your JAR file.
  1. PUT your driver JAR file to /api/blobs/ID?resourceType=driver:jdbc.
  2. Verify with GET /api/blobs/ID/manifest.
See the Blob Store API for additional guidance on sending these requests.
The JDBC V2 connector is a supported connector that fetches documents from a relational database via SQL queries. Under the hood, this connector implements the Solr DataImportHandler (DIH) plugin. Fusion stores JDBC drivers in the blob store. You can upload a driver using the Fusion UI or the Blob Store API.

How to upload a JDBC driver using the Fusion UI

  1. In the Fusion UI, navigate to System > Blobs.
  2. Click Add.
  3. Select JDBC Driver. The “New ‘JDBC Driver’ Upload” panel appears.
  4. Click Choose File and select the .jar file from your file system.
  5. Click Upload. The new driver’s blob manifest appears.
From this screen you can also delete or replace the driver.

How to install a JDBC driver using the API

  1. Upload the JAR file to Fusion’s blob store using the /blobs/{id} endpoint. Specify an arbitrary blob ID, and a resourceType value of driver:jdbc, as in this example:
    curl -u USERNAME:PASSWORD -H "content-type:application/java-archive" -H "content-length:707261" -X PUT --data-binary @postgresql-42.0.0.jar http://localhost:8764/api/blobs/mydriver?resourceType=driver:jdbc
    
    Success response:
    {
      "name" : "mydriver",
      "contentType" : "application/java-archive",
      "size" : 707261,
      "modifiedTime" : "2017-06-09T19:00:48.919Z",
      "version" : 0,
      "md5" : "c67163ca764bfe632f28229c142131b5",
      "metadata" : {
        "subtype" : "driver:jdbc",
        "drivers" : "org.postgresql.Driver",
        "resourceType" : "driver:jdbc"
      }
    }
    
    Fusion automatically publishes the event to the cluster, and the listeners perform the driver installation process on each node.
    If the blob ID is identical to an existing one, the old driver will be uninstalled and the new driver will be installed in its place. To get the list of existing blob IDs, run: curl -u USERNAME:PASSWORD https://FUSION_HOST:FUSION_PORT/api/blobs
  2. To verify the uploaded driver, run:
    curl -u USERNAME:PASSWORD https://FUSION_HOST:FUSION_PORT/api/blobs/BLOB_ID/manifest
    
    Where BLOB_ID is the name specified during upload, such as “mydriver” above. A success response looks like this:
    {
      "name" : "mydriver",
      "contentType" : "application/java-archive",
      "size" : 707261,
      "modifiedTime" : "2017-06-09T19:05:17.897Z",
      "version" : 1569755095787110400,
      "md5" : "c67163ca764bfe632f28229c142131b5",
      "metadata" : {
        "subtype" : "driver:jdbc",
        "drivers" : "org.postgresql.Driver",
        "resourceType" : "driver:jdbc"
      }
    }
    

Supported JDBC drivers

The JDBC V2 connector is compatible with any JDBC driver and database that implement the SQL standard. This section lists commonly used, supported drivers.
Authentication parameters may be provided as part of the connection string; in that case, it is not necessary to include a username and password in the datasource configuration.
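For example, the PostgreSQL JDBC driver accepts credentials as URL parameters; the host, database, and credential values below are illustrative:
jdbc:postgresql://dbhost:5432/exampledb?user=db_reader&password=example-password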

MySQL

PostgreSQL

Microsoft SQL Server and Azure SQL Service

  • Download driver
  • Default driver class name: com.microsoft.sqlserver.jdbc.SQLServerDriver
  • Connection URL specification
    • Example (SQL Server): jdbc:sqlserver://mssql:1433
    • Example (cloud-based Azure SQL Service): jdbc:sqlserver://azure-test.database.windows.net:1433;database=testdb;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;

Oracle database

IBM DB2

Authentication

Setting up the correct authentication according to your organization’s data governance policies helps keep sensitive data secure while allowing authorized indexing. If authentication is required to access your data, the JDBC V2 connector uses standard database credentials to log in. In your datasource configuration, supply:
  • username using the database account’s username.
  • password using the corresponding password.
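As a minimal, illustrative sketch, the relevant fragment of a datasource configuration might look like the following. The surrounding structure and any property names other than username and password depend on your connector and Fusion version:
"properties" : {
  "username" : "db_reader",
  "password" : "example-password"
}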

Remote connectors

V2 connectors support running remotely in Fusion versions 5.7.1 and later.
If you need to index data from behind a firewall, you can configure a V2 connector to run remotely on-premises using TLS-enabled gRPC.

Prerequisites

Before you can set up an on-prem V2 connector, you must configure the egress from your network to allow HTTP/2 communication into the Fusion cloud. You can use a forward proxy server to act as an intermediary between the connector and Fusion.
The following is required to run V2 connectors remotely:
  • The plugin zip file and the connector-plugin-standalone JAR.
  • A configured connector backend gRPC endpoint.
  • Username and password of a user with a remote-connectors or admin role.
  • If the host where the remote connector is running is not configured to trust the server’s TLS certificate, you must configure the file path of the trust certificate collection.
If your version of Fusion doesn’t have the remote-connectors role by default, you can create one. No API or UI permissions are required for the role.

Connector compatibility

Only V2 connectors are able to run remotely on-premises. You also need the remote connector client JAR file that matches your Fusion version. You can download the latest files at V2 Connectors Downloads.
Whenever you upgrade Fusion, you must also update your remote connectors to match the new version of Fusion.
The gRPC connector backend is not supported in Fusion environments deployed on AWS.

System requirements

The following is required for the on-prem host of the remote connector:
  • (Fusion 5.9.0-5.9.10) JVM version 11
  • (Fusion 5.9.11) JVM version 17
  • Minimum of 2 CPUs
  • 4GB Memory
Note that memory requirements depend on the number and size of ingested documents.

Enable backend ingress

In your values.yaml file, configure this section as needed:
ingress:
  enabled: false
  pathtype: "Prefix"
  path: "/"
  #host: "ingress.example.com"
  ingressClassName: "nginx"   # Fusion 5.9.6 only
  tls:
    enabled: false
    certificateArn: ""
    # Enable the annotations field to override the default annotations
    #annotations: ""
  • Set enabled to true to enable the backend ingress.
  • Set pathtype to Prefix or Exact.
  • Set path to the path where the backend will be available.
  • Set host to the host where the backend will be available.
  • In Fusion 5.9.6 only, you can set ingressClassName to one of the following:
    • nginx for Nginx Ingress Controller
    • alb for AWS Application Load Balancer (ALB)
  • Configure TLS and certificates according to your CA’s procedures and policies.
    TLS must be enabled in order to use AWS ALB for ingress.
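For example, an ingress section with the backend enabled behind Nginx and TLS turned on might look like this (the host name is illustrative):
ingress:
  enabled: true
  pathtype: "Prefix"
  path: "/"
  host: "connectors.example.com"
  ingressClassName: "nginx"   # Fusion 5.9.6 only
  tls:
    enabled: true
    certificateArn: ""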

Connector configuration example

kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443 # mandatory
  plain-text: false # optional, false by default
  proxy-server: # optional - needed when a forward proxy server is used to provide outbound access to the standalone connector
    host: host
    port: some-port
    user: user # optional
    password: password # optional
  trust: # optional - needed when the client's system doesn't trust the server's certificate
    cert-collection-filepath: path1

proxy: # mandatory fusion-proxy
  user: admin
  password: password123
  url: https://fusiontest.com/ # needed only when the connector plugin requires blob store access

plugin: # mandatory
  path: ./fs.zip
  type: #optional - the suffix is added to the connector id
    suffix: remote

Minimal example

kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443

proxy:
  user: admin
  password: "password123"

plugin:
  path: ./testplugin.zip

Logback XML configuration file example

<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="com.lucidworks.logging.logback.classic.LucidworksPatternLayoutEncoder">
            <pattern>%d - %-5p [%t:%C{3.}@%L] - %m{nolookups}%n</pattern>
            <charset>utf8</charset>
        </encoder>
    </appender>

    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOGDIR:-.}/connector.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- rollover daily -->
            <fileNamePattern>${LOGDIR:-.}/connector-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>50MB</maxFileSize>
            <totalSizeCap>10GB</totalSizeCap>
        </rollingPolicy>
        <encoder class="com.lucidworks.logging.logback.classic.LucidworksPatternLayoutEncoder">
            <pattern>%d - %-5p [%t:%C{3.}@%L] - %m{nolookups}%n</pattern>
            <charset>utf8</charset>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="FILE"/>
    </root>
</configuration>

Run the remote connector

java [-Dlogging.config=[LOGBACK_XML_FILE]] \
  -jar connector-plugin-client-standalone.jar [YAML_CONFIG_FILE]
The logging.config property is optional. If not set, logging messages are sent to the console.

Test communication

You can run the connector in communication testing mode. This mode tests the communication with the backend without running the plugin, reports the result, and exits.
java -Dstandalone.connector.connectivity.test=true -jar connector-plugin-client-standalone.jar [YAML_CONFIG_FILE]

Encryption

In a deployment, communication to the connector’s backend server is encrypted using TLS. You should only run this configuration without TLS in a testing scenario. To disable TLS, set plain-text to true.
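For example, in a test-only configuration (the backend address is illustrative):
kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443
  plain-text: true # testing only; disables TLS to the connector backend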

Egress and proxy server configuration

One of the methods you can use to allow outbound communication from behind a firewall is a proxy server. You can configure a proxy server to allow certain communication traffic while blocking unauthorized communication. If you use a proxy server at the site where the connector is running, you must configure the following properties:
  • Host. The host where the proxy server is running.
  • Port. The port the proxy server listens on for communication requests.
  • Credentials. Optional proxy server user and password.
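In the standalone connector’s YAML configuration, these correspond to the proxy-server section under kafka-bridge; the values below are illustrative:
kafka-bridge:
  proxy-server:
    host: proxy.example.com
    port: 3128
    user: proxy-user # optional
    password: proxy-pass # optional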
When you configure egress, it is important to disable any connection or activity timeouts, because the connector uses long-running gRPC calls.

Password encryption

If you use a login name and password in your configuration, run the following utility to encrypt the password:
  1. Enter a user name and password in the connector configuration YAML.
  2. Run the standalone JAR with this property (a full command sketch follows these steps):
    -Dstandalone.connector.encrypt.password=true
    
  3. Retrieve the encrypted passwords from the log that is created.
  4. Replace the clear password in the configuration YAML with the encrypted password.
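For example, combining the property with the standalone run command shown earlier (the configuration file name is illustrative):
java -Dstandalone.connector.encrypt.password=true \
  -jar connector-plugin-client-standalone.jar connector-config.yaml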

Connector restart (5.7 and earlier)

The connector will shut down automatically whenever the connection to the server is disrupted, to prevent it from getting into a bad state. Communication disruption can happen, for example, when the server running in the connectors-backend pod shuts down and is replaced by a new pod. Once the connector shuts down, connector configuration and job execution are disabled, so restart the connector as soon as possible. You can use Linux scripts and utilities, such as Monit, to restart the connector automatically.

Recoverable bridge (5.8 and later)

If communication to the remote connector is disrupted, the connector will try to recover communication and gRPC calls. By default, six attempts will be made to recover each gRPC call. The number of attempts can be configured with the max-grpc-retries bridge parameter.
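A minimal sketch, assuming the parameter is set under the kafka-bridge section of the standalone configuration:
kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443
  max-grpc-retries: 6 # number of recovery attempts per gRPC call (default)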

Job expiration duration (5.9.5 only)

The timeout value for unresponsive backend jobs can be configured with the job-expiration-duration-seconds parameter. The default value is 120 seconds.

Use the remote connector

Once the connector is running, it is available in the Datasources dropdown. If the standalone connector terminates, it disappears from the list of available connectors. Once it is re-run, it becomes available again, and previously configured connector instances are not lost.

Enable asynchronous parsing (5.9 and later)

To separate document crawling from document parsing, enable Tika Asynchronous Parsing on remote V2 connectors.
Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:
additionalVolumes:
  - name: fusion-data1-pvc
    persistentVolumeClaim:
      claimName: fusion-data1-pvc
  - name: fusion-data2-pvc
    persistentVolumeClaim:
      claimName: fusion-data2-pvc
additionalVolumeMounts:
  - name: fusion-data1-pvc
    mountPath: "/connector/data1"
  - name: fusion-data2-pvc
    mountPath: "/connector/data2"
You may also need to specify the user that is authorized to access the file system, as in this example:
securityContext:
    fsGroup: 1002100000
    runAsUser: 1002100000

Crawl behavior

The JDBC V2 connector retrieves data based on the user-supplied SQL query. The connector supports the full range of SQL features, but for best results, structure the SQL query to use pagination.

Pagination

Automatic pagination

Automatic pagination is enabled by default. You can disable it by setting enableAutomaticPagination to false. The batchSize field sets the number of documents displayed per page of query results. When a SQL query includes pagination terms such as OFFSET, LIMIT, or ROWS ONLY, the SQL query terms override automatic pagination.
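For example, a query that already contains its own pagination terms is run as written, and automatic pagination is not applied (the values are illustrative):
SELECT * FROM example_table ORDER BY id LIMIT 500 OFFSET 0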

DB2 notes

When crawling an IBM DB2 database, the pagination method depends on your database version:
  • IBM DB2 version < 11.5 - Pagination is performed using subqueries.
  • IBM DB2 version >= 11.5 - Pagination is performed using rows and offsets.
    Appending --subquery to the SQL query statement forces the connector to paginate using subqueries regardless of the IBM DB2 database version.
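For example, appending --subquery forces subquery pagination regardless of the DB2 version:
SELECT * FROM example_table --subquery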

Special variables for pagination

The connector uses placeholders for pagination that are updated based on batchSize:
  • ${limit}: used as the LIMIT parameter in native SQL
  • ${offset}: used as the OFFSET parameter in native SQL
Here are examples of SQL statements using these parameters:
  • MySQL, PostgreSQL, and DB2:
    SELECT * FROM example_table LIMIT ${limit} OFFSET ${offset}
    
  • Microsoft SQL Server, Azure SQL, and Oracle:
    SELECT * FROM example_table ORDER BY id OFFSET ${offset} ROWS FETCH NEXT ${limit} ROWS ONLY
    

Nested queries

The connector supports nested queries. Nested queries are a set of SQL queries performed on each row returned by the main query. Because they are executed on every row returned, nested queries can significantly degrade performance. Use the ${id} variable to retrieve items associated with a specific primary key returned by the main query, for example:
SELECT * FROM example_child_table WHERE parent_id = ${id}

Incremental crawl

In addition to the main SQL query, users can also specify an optional delta SQL query that returns only new and modified items to increase performance during recrawls. The special ${limit} and ${offset} pagination variables are the same as in normal crawls.

Special incremental crawl variable

Use the ${last_index_time} placeholder variable for an incremental crawl; it contains the time the last crawl completed. This variable is used to filter results for items added or modified since the last time the datasource job was run, and is stored as a timestamp in the format yyyy-MM-dd HH:mm:ss.SSS.
The format may not be compatible with all driver date math implementations without additional conversion steps.
Example:
SELECT * FROM example_table WHERE (timestamp_column >= ${last_index_time})
Depending on the database used, it may be necessary to surround ${last_index_time} in single quotes. For example, '${last_index_time}'.
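For example, with single quotes:
SELECT * FROM example_table WHERE (timestamp_column >= '${last_index_time}')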

Stray content deletion

Stray content deletion is a plugin feature that deletes documents from a content collection that are not rediscovered on subsequent crawls after the first crawl. This process removes stale documents from the content collection and is enabled by default. Stray content deletion must be configured by the user because native SQL does not provide the ability to retrieve rows that have been deleted from a table.
Using a delta query to perform incremental crawling only returns new and modified items. If stray content deletion is enabled when the delta query is run, unmodified items that are still valid are deleted.

Binary content indexing

Binary content indexing is available in versions 2.7.0 and later of the JDBC V2 connector when using MSSQL or Oracle databases.
The JDBC V2 connector supports binary content indexing, which enables full-text search of documents, images, and other binary data stored in database BINARY, VARBINARY, IMAGE, or BLOB columns. To use binary content indexing:
  1. Navigate to the JDBC connector.
  2. In the Query field, ensure that your SQL query explicitly specifies the columns, including the primary key. A query such as SELECT * is not supported in binary content indexing. The queries are case-sensitive. A sample query follows these steps.
  3. Select Binary content settings. Two more text boxes display.
  4. In the Binary content column name field, enter the database column that contains your binary content. This field is required if you are using binary content indexing. The value is case sensitive.
  5. If your database contains a separate column with file names, enter that column name in the Binary file column name field.
    If you do not enter a column name, the JDBC connector automatically attempts to detect a column based on common naming practices. To avoid an incorrect automatic match, enter the column name manually.
  6. Click Save.
  7. Start another crawl to begin indexing the binary content.
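For reference, a query for binary content indexing explicitly lists the primary key, the binary content column, and (optionally) the file name column; the table and column names below are illustrative:
SELECT id, file_name, file_data FROM example_documents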
After the crawl is completed, verify that your binary content has been indexed properly. To verify your indexing status:
  1. Navigate to the Query Workbench.
  2. A list of documents displays. Click Show fields next to a document to view the indexing results.
  3. The binary content is stored in a text field. The format varies depending on the type of binary content that was indexed. Verify that the contents display.
The default field for binary content is body_t. You can edit your index pipeline if you want this data to be indexed into a different field.

Processor flow diagram

The following diagram represents the flow for crawls. Incremental crawls follow the same basic flow as full crawls, except that they emit a page candidate from a delta SQL query if provided. If no delta query is provided, then incremental crawls use the original query.

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.