V1 and V2 Connectors - Lucidworks documentation

There are two types of frameworks for Fusion connectors: V1 (also referred to as classic or built-in) and V2 (also known as plugin).

V1 (Classic) connectors

V1 connectors are developed with a general-purpose crawler framework called Anda, created by Lucidworks. Anda simplifies and streamlines crawler development, reducing the task of developing a new crawler to providing access to your data. In Fusion 5, V1 connectors are included in the Fusion image. You can install or update a connector at any time through the UI (under Datasources).

Install or update a connector - Fusion 5

When you create a new datasource that requires an uninstalled connector, Fusion 5.2 and later automatically downloads and installs the connector when you select it in the Datasources dropdown. You can also update the connector using the Blob Store UI or the Connector API.

In your Fusion app, navigate to Indexing > Datasources.
Click Add.
In the list of connectors, scroll down to the connectors marked Not Installed and select the one you want to install.
Fusion automatically downloads it and moves it to the list of installed connectors.

After you install a connector, you can Configure a New Datasource.

You can view and download all current and previous V2 connector releases at Download Connectors.

Install or update a connector using the Blob Store UI

Download the connector zip file from Download V2 connectors.
Do not expand the archive; Fusion consumes it as-is.
In your Fusion app, navigate to System > Blobs.
Click Add.
Select Connector Plugin. The “New Connector Plugin Upload” panel appears.
Click Choose File and select the downloaded zip file from your file system.
Click Upload. The new connector’s blob manifest appears. From this screen you can also delete or replace the connector.

Wait several minutes for the connector to finish uploading to the blob store before installing the connector using the Datasources dropdown.

Install or update a connector using the Connector API

Download the connector zip file from Download V2 connectors.
Do not expand the archive; Fusion consumes it as-is.
Upload the connector zip file to Fusion’s plugins. Specify a pluginId as in this example:
```
curl -H 'content-type:application/zip' -u USERNAME:PASSWORD -X PUT 'https://FUSION_HOST:FUSION_PORT/api/connectors/plugins?id=lucidworks.{pluginId}' --data-binary @{plugin_path}.zip
```
Fusion automatically publishes the event to the cluster, and the listeners perform the connector installation process on each node.
If the pluginId is identical to an existing one, the old connector will be uninstalled and the new connector will be installed in its place. To get the list of existing plugin IDs, run: curl -u USERNAME:PASSWORD https://FUSION_HOST:FUSION_PORT/api/connectors/plugins
Look in https://FUSION_HOST:FUSION_PORT/apps/connectors/plugins/ to verify the new connector is installed.
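
The upload step above can be wrapped in a small script. The following is a minimal sketch, not an official Lucidworks tool: the function names plugin_upload_url and upload_plugin and the FUSION_USER and FUSION_PASS variables are assumptions made for illustration, while the endpoint and curl flags are copied from the example above.

```shell
#!/bin/sh
# Sketch only: helper names and environment variables are illustrative assumptions.

# Build the upload URL from the curl example for a host, port, and plugin ID.
plugin_upload_url() {
  printf 'https://%s:%s/api/connectors/plugins?id=lucidworks.%s\n' "$1" "$2" "$3"
}

# Upload an unexpanded connector zip file.
# Expects FUSION_USER and FUSION_PASS to be set in the environment.
upload_plugin() {
  host=$1
  port=$2
  plugin_id=$3
  zip_path=$4
  # Fusion consumes the archive as-is, so insist on an existing .zip file.
  case $zip_path in
    *.zip) ;;
    *) echo "expected a .zip file: $zip_path" >&2; return 1 ;;
  esac
  [ -f "$zip_path" ] || { echo "file not found: $zip_path" >&2; return 1; }
  curl -H 'content-type:application/zip' -u "$FUSION_USER:$FUSION_PASS" -X PUT \
    "$(plugin_upload_url "$host" "$port" "$plugin_id")" --data-binary "@$zip_path"
}

# Example call (placeholders as in the text above):
#   upload_plugin FUSION_HOST FUSION_PORT {pluginId} {plugin_path}.zip
```

Checking the file before calling curl avoids sending a request for an expanded directory or a mistyped path.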

Reinstall a connector

To reinstall a connector for any reason, first delete the connector then use the preceding steps to install it again. This may take a few minutes to complete depending on how quickly the pods are deleted and recreated.

V2 (Plugin) connectors

Fusion 4.2 supports V2 connectors, which utilize a Java SDK framework. Fusion V2 connectors are installed via Datasources in the UI or by using the Connector Plugins Repository API. In addition to the features and benefits provided by V1 connectors, V2 connectors offer:

Updates and improvements delivered separately from Fusion releases. Update a V2 connector by installing the latest plugin version.
Security Access-control Lists (ACL) which are separate from content.
Improved scalability. Jobs can be scaled by simply adding instances of the connector. The fetching process supports distributed fetching, allowing many instances to contribute to the same job.
The ability to develop a custom connector.

Develop a Custom Connector

Java SDK configuration

To build a valid connector configuration, you must:

Define an interface.
Extend ConnectorConfig.
Apply a few annotations.
Define connector methods and annotations.

All methods that are annotated with @Property are considered to be configuration properties. For example, @Property() String name(); results in a String property called name. This property would then be present in the generated schema.

Here is an example of the most basic configuration, along with required annotations:

@RootSchema(
    title = "My Connector",
    description = "My Connector description",
    category = "My Category"
)
public interface MyConfig extends ConnectorConfig<MyConfig.Properties> {
  @Property(
      title = "Properties",
      required = true
  )
  public Properties properties();
  /**
    * Connector specific settings
    */
  interface Properties extends FetcherProperties {
    @Property(
        title = "My custom property",
        description = "My custom property description"
    )
    public Integer myCustomProperty();
  }
}

The metadata defined by @RootSchema is used by Fusion when showing the list of available connectors. The ConnectorConfig base interface represents common, top-level settings required by all connectors. The type parameter of the ConnectorConfig class indicates the interface to use for custom properties.

Once a connector configuration has been defined, it can be associated with the ConnectorPlugin class. From that point, the framework takes care of providing the configuration instances to your connector. It also generates the schema, and sends it along to Fusion when it connects to Fusion.

Schema metadata can be applied to properties using additional annotations. For example, applying limits to the min/max length of a string, or describing the types of items in an array.

Nested schema metadata can also be applied to a single field by using “stacked” schema-based annotations:

interface MySetConfig extends Model {
  @SchemaAnnotations.Property(title = "My Set")
  @SchemaAnnotations.ArraySchema(defaultValue = "[\"a\"]")
  @SchemaAnnotations.StringSchema(defaultValue = "some-set-value", minLength = 1, maxLength = 1)
  Set<String> mySet();
}

Plugin client

The Fusion connector plugin client provides a wrapper for the Fusion Java plugin-sdk so that plugins do not need to directly talk with gRPC code. Instead, they can use high-level interfaces and base classes, like Connector and Fetcher.

The plugin client also provides a standalone “runner” that can host a plugin that was built from the Fusion Java Connector SDK. It does this by loading the plugin zip file, then calling on the wrapper to provide the framework interactions.

Standalone Connector Plugin Application

The second goal of the plugin-client is to allow Java SDK plugins to run remotely. The instructions for deploying a connector using this method are provided below.

Locating the UberJar

The uberjar is located at the following path in the Fusion file system:

$FUSION_HOME/apps/connectors/connectors-rpc/client/connector-plugin-client-<version>-uberjar.jar

where $FUSION_HOME is your Fusion installation directory and <version> is your Fusion version number.

Starting the Host

To start the host app, you need a Fusion SDK-based connector, built into the standard packaging format as a .zip file. This zip must contain only one connector plugin.

Here is an example of how to start up using the web connector:

java -jar $FUSION_HOME/apps/connectors/connectors-rpc/client/connector-plugin-client-<version>-uberjar.jar fusion-connectors/build/plugins/connector-web-4.0.0-SNAPSHOT.zip

To run the client with remote debugging enabled:

java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5010 -jar $FUSION_HOME/apps/connectors/connectors-rpc/client/connector-plugin-client-<version>-uberjar.jar fusion-connectors/build/plugins/connector-web-4.0.0-SNAPSHOT.zip
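
The two invocations above differ only in the JDWP agent option, so a small wrapper can build either one. This is a sketch under stated assumptions: the run_plugin function and the DRY_RUN variable are hypothetical names introduced here, and you must still supply the uberjar path for your own Fusion version.

```shell
#!/bin/sh
# Sketch only: run_plugin and DRY_RUN are illustrative, not part of Fusion.

# Start the plugin runner.
#   $1: path to the connector-plugin-client uberjar
#   $2: path to the plugin zip (must contain exactly one connector plugin)
#   $3: optional port for remote debugging
run_plugin() {
  jar=$1
  plugin_zip=$2
  debug_port=${3:-}
  # Base arguments, as in the plain start-up example.
  set -- -jar "$jar" "$plugin_zip"
  # Prepend the JDWP agent option when a debug port is given.
  if [ -n "$debug_port" ]; then
    set -- "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=$debug_port" "$@"
  fi
  # With DRY_RUN set, print the command instead of running it.
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "java $*"
  else
    java "$@"
  fi
}
```

Setting DRY_RUN=1 before calling run_plugin prints the command line, which is a convenient way to check the paths before starting the JVM.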

Remote V2 connectors

V2 Connectors can be hosted within Fusion, or can run remotely.

Hosted connectors are part of the Fusion cluster. The same connector type on separate nodes can act as separate connectors.
Remote connectors become clients of Fusion: they run a lightweight process and communicate with Fusion using an efficient messaging format.
Remote connectors can be located wherever the data is located, which might be required for performance or security and access.

gRPC framework

V2 connectors use Google’s gRPC framework as the underlying client/server technology. This offers:

Increased flexibility in the way services and their methods are defined.
HTTP/2 based transport.
Efficient serialization format for data handling (protocol buffers).
Bi-directional/multiplexed streams.

As of Fusion 5.6.1, V2 connectors using a gRPC backend can be run remotely.

Learn more

Add Tesseract Optical Character Recognition to Fusion Connectors

Tesseract Optical Character Recognition (OCR) solution

The Tesseract OCR is an open source solution that can be added to interact with Fusion connectors in releases 5.2 and later. The example in this topic represents a classic REST service that interfaces with V1 connectors including functions such as file upload and web crawl.

To set up OCR for V2 connectors, you must repeat this process for each individual Docker image related to the connector.

Prerequisites

The following must be established before adding the Tesseract OCR solution:

A local environment for installing and managing Fusion 5 that includes Google Cloud Tools and other required components.
The Docker daemon running on macOS, and a Docker account for hub.docker.com.
Fusion 5 installed and deployed.

Add Tesseract OCR solution

Create a Dockerfile with the following contents:
```
FROM lucidworks/classic-rest-service:5.2.1
USER root
RUN apt-get update && apt-get install -y tesseract-ocr
USER 8764
```
The file:
- Directs Docker to use an existing image, specified in the <repo>/<image>:<tag> format, as the base for the new image.
- Switches to the root user to perform the Tesseract install.
- Switches back to user 8764 because the classic REST service pod in Kubernetes is not permitted to run as root.
Build the new Docker image from the directory that contains your Dockerfile, using values that reflect your own repository, image name, and tag. For example: docker build -t jdoe/classic-rest-service-ocr:1.0.0 . (the trailing dot is the build context, that is, the current directory).
In Fusion 5, any custom image deployment must also include the Fusion dependency check image. Add the dependency image to the same repository, at the same level, as the custom connector image. Sample commands:
```
docker pull lucidworks/check-fusion-dependency:v1.2.0
docker tag lucidworks/check-fusion-dependency:v1.2.0 jdoe/check-fusion-dependency:v1.2.0
docker push jdoe/check-fusion-dependency:v1.2.0
```
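
The three commands above can be parameterized for your own repository. The following is a minimal sketch: the mirror_dependency_image function and the DOCKER override variable are assumptions introduced for illustration, and the image name and default tag are taken from the sample commands.

```shell
#!/bin/sh
# Sketch only: mirror_dependency_image and DOCKER are illustrative names.

# Copy the dependency check image into your own repository.
#   $1: target repository (for example, jdoe)
#   $2: optional image tag (defaults to the tag used in the sample commands)
mirror_dependency_image() {
  target_repo=$1
  tag=${2:-v1.2.0}
  image=check-fusion-dependency
  # Stop at the first failing step so a failed pull is never tagged or pushed.
  ${DOCKER:-docker} pull "lucidworks/$image:$tag" &&
    ${DOCKER:-docker} tag "lucidworks/$image:$tag" "$target_repo/$image:$tag" &&
    ${DOCKER:-docker} push "$target_repo/$image:$tag"
}
```

Setting DOCKER=echo prints the three docker subcommands instead of running them, which makes it easy to review the image names first.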
Access the Docker hub to view the image-related information such as name, tag, digest, and operating system.

Open the fusion_values.yaml file and replace the existing connector image with the custom version. For example:

classic-rest-service:
  image:
    repository: jdoe
    name: classic-rest-service-ocr
    tag: 1.0.0
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool

Execute the standard process to upgrade (rebuild) the Fusion cluster.

Open a shell in the classic REST service pod (the pod where Tesseract was installed) and run tesseract -v to verify that Tesseract is installed and working correctly. The output is similar to the following:

<<K9s-Shell>> Pod: jdoe-poc/jdoe-classic-rest-service-0 | Container: classic-rest-service
fusion@jdoe-poc-classic-rest-service-0:/$ tesseract -v
tesseract 4.0.0
leptonica-1.76.0
    libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
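
If you want to script this check, the version can be read from the first line of the output. This is a sketch under assumptions: the tesseract_version and tesseract_ok function names are hypothetical, and the pattern assumes the first line has the form shown above (tesseract followed by a numeric version).

```shell
#!/bin/sh
# Sketch only: helper names are illustrative assumptions.

# Read `tesseract -v` output on stdin and print the version from the first line.
# Prints nothing if the first line does not look like "tesseract <number>".
tesseract_version() {
  sed -n '1s/^tesseract \([0-9][0-9.]*\).*$/\1/p'
}

# Succeed only if a version could be extracted from the output on stdin.
tesseract_ok() {
  [ -n "$(tesseract_version)" ]
}

# Example, run inside the pod:
#   tesseract -v 2>&1 | tesseract_version
```

Redirecting stderr into the pipe is a precaution, because some Tesseract releases print the version banner to stderr rather than stdout.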

Access each Fusion parser used for a datasource that performs OCR and select the following items:
- Apache Tika
- Include images
Scan one of the following files to test the OCR function:
- A .pdf file, which may contain an embedded .tiff image
- A .jpeg file
- A .gif file
Verify the parser correctly extracts the information, which includes the body_t field.
