Index Binary Data from JDBC

The JDBC connector fetches documents from a relational database via SQL queries. Under the hood, this connector implements the Solr DataImportHandler (DIH) plugin.

The JDBC connector in Fusion does not automatically discover and index binary data stored in your database (such as PDF files). However, you can configure Fusion to recognize and extract binary data by modifying the datasource configuration file. Fusion creates this file the first time the datasource is run, in VAR-FUSIONPATH/data/connectors/lucid.jdbc/datasources/datasourceID/conf. The file name includes the name of the datasource, as in dataconfig_datasourceName.xml. If you are familiar with Solr’s DIH, you will recognize this as a standard dataconfig.xml file.

Follow these steps to modify the configuration file:

  1. In the dataSource entry for the database containing your binary data, add a name attribute.

  2. Set the convertType attribute for the dataSource to false. This prevents Fusion from treating binary data as strings.

  3. Add a FieldStreamDataSource to stream the binary data to the Tika entity processor.

  4. Specify the dataSource name in the root entity.

  5. Add an entity for your FieldStreamDataSource that uses the TikaEntityProcessor to read the binary data from the FieldStreamDataSource and parse it, and specify a field for storing the processed data.

  6. Reload the Solr core to apply your configuration changes.
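
The steps above can be sketched as a single dataconfig file. This is a minimal illustration rather than the exact file Fusion generates. The dataSource names (db, fieldReader), the JDBC driver and URL, the credentials, the table and column names (documents, id, content_blob), and the target field (body) are all assumptions chosen for the example. Adapt them to your own database and schema.

```xml
<dataConfig>
  <!-- Steps 1 and 2: name the JDBC dataSource and set convertType to false
       so binary columns are not treated as strings.
       Driver, URL, and credentials are placeholder values. -->
  <dataSource name="db"
              type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/mydb"
              user="dbuser"
              password="dbpassword"
              convertType="false"/>

  <!-- Step 3: a FieldStreamDataSource that streams the binary column
       to the Tika entity processor. -->
  <dataSource name="fieldReader" type="FieldStreamDataSource"/>

  <document>
    <!-- Step 4: the root entity refers to the JDBC dataSource by name.
         Table and column names are hypothetical. -->
    <entity name="root"
            dataSource="db"
            query="SELECT id, content_blob FROM documents">
      <field column="id" name="id"/>

      <!-- Step 5: a nested entity reads root.content_blob through the
           FieldStreamDataSource, parses it with Tika, and stores the
           extracted text in a field named body (an assumed field name). -->
      <entity name="binary"
              dataSource="fieldReader"
              processor="TikaEntityProcessor"
              dataField="root.content_blob"
              format="text">
        <field column="text" name="body"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

In this sketch, the dataField value follows the entityName.columnName pattern, so it must match both the name of the root entity and the column that holds the binary data.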