Add Tesseract Optical Character Recognition to Fusion Connectors
Tesseract Optical Character Recognition (OCR) solution
Tesseract OCR is an open source solution that can be added to work with Fusion connectors in Fusion 5.2 and later. The example in this topic uses the classic REST service, which interfaces with V1 connectors and covers functions such as file upload and web crawl.
To set up OCR for V2 connectors, you must repeat this process for each individual Docker image related to the connector.
Prerequisites
The following must be established before adding the Tesseract OCR solution:
-
A local environment for installing and managing Fusion 5 that includes Google Cloud Tools and other required components.
-
A running Docker daemon (on macOS) and a Docker account for hub.docker.com.
-
Fusion release 5.2 or later installed and deployed.
Add Tesseract OCR solution
-
Create a Dockerfile with the following contents:
FROM lucidworks/classic-rest-service:5.2.1
USER root
RUN apt-get install -y tesseract-ocr
USER 8764
The file:
-
Uses an existing image, specified in the <repo>/<image>:<tag> format, as the basis for the new image.
-
Switches to the root user to perform the Tesseract installation.
-
Switches back to user 8764, because the classic REST service pod in Kubernetes is not permitted to run as root.
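As a minimal sketch, the Dockerfile described in the first step can also be written from the shell. The function name write_ocr_dockerfile is hypothetical. The apt-get update call is an assumption added in this sketch and is not part of the file shown in this topic; it may help if the base image does not ship a current package index.

```shell
#!/bin/sh
# Hypothetical helper: write the OCR Dockerfile into the given directory
# (defaults to the current directory).
write_ocr_dockerfile() {
  dir="${1:-.}"
  cat > "${dir}/Dockerfile" <<'EOF'
FROM lucidworks/classic-rest-service:5.2.1
USER root
# Assumption: refresh the package index before installing Tesseract
RUN apt-get update && apt-get install -y tesseract-ocr
USER 8764
EOF
}

# Example usage (commented out so that sourcing this file has no side effects):
# write_ocr_dockerfile .
```

Writing the file through a function keeps the base image tag and the final USER line in one place if the same change has to be repeated for several connector images.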
-
Build the new Docker image from the directory that contains your Dockerfile. Enter values that reflect your image and directory. For example:
docker build -t jdoe/classic-rest-service-ocr:1.0.0 .
In Fusion 5.2 and later, the dependency check in Fusion must be included in any custom operation. You must add the dependency image where the custom connector image is stored (at the same level and in the same repository). The sample commands are:
docker pull lucidworks/check-fusion-dependency:v1.2.0
docker tag lucidworks/check-fusion-dependency:v1.2.0 jdoe/check-fusion-dependency:v1.2.0
docker push jdoe/check-fusion-dependency:v1.2.0
Access Docker Hub to view image-related information such as the name, tag, digest, and operating system.
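The pull, tag, and push sequence for the dependency image can be wrapped in a small helper, sketched here under a few assumptions. The function name mirror_image and the DOCKER override variable are hypothetical and exist only for illustration; the image and repository names are the sample values used in this topic.

```shell
#!/bin/sh
# Hypothetical helper: copy an image into a custom repository.
#   $1 = source image, for example lucidworks/check-fusion-dependency:v1.2.0
#   $2 = destination repository, for example jdoe
# Set DOCKER=echo to print the commands instead of running them.
mirror_image() {
  src="$1"
  dest_repo="$2"
  name_tag="${src##*/}"              # drop the source repository prefix
  dest="${dest_repo}/${name_tag}"
  ${DOCKER:-docker} pull "$src" &&
  ${DOCKER:-docker} tag "$src" "$dest" &&
  ${DOCKER:-docker} push "$dest"
}

# Example usage (commented out; requires a Docker login with push rights):
# mirror_image lucidworks/check-fusion-dependency:v1.2.0 jdoe
```

Running the function with DOCKER=echo is a quick way to confirm that the destination name lands in the same repository as the custom connector image before anything is pushed.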
-
Open the fusion_values.yaml file and replace the existing connector image with the custom version. For example:
classic-rest-service:
  image:
    repository: jdoe
    name: classic-rest-service-ocr
    tag: 1.0.0
  nodeSelector:
    cloud.google.com/gke-nodepool: default-pool
-
Execute the standard process to upgrade (rebuild) the Fusion cluster.
-
Open a shell in the classic REST service pod and run tesseract -v to verify that Tesseract is installed and working correctly. The result is similar to the following (this example uses a K9s shell):
<<K9s-Shell>> Pod: jdoe-poc/jdoe-classic-rest-service-0 | Container: classic-rest-service
fusion@jdoe-poc-classic-rest-service-0:/$ tesseract -v
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
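The same check can be scripted rather than read by eye. The sketch below assumes the version line has the form shown in the sample output (a line beginning with "tesseract" followed by the version number). The function name tesseract_version is hypothetical, and the commented kubectl exec invocation assumes kubectl is configured for the cluster and reuses the sample namespace, pod, and container names from the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `tesseract -v` output on stdin and print the
# version number from the line whose first word is "tesseract".
# Returns non-zero if no such line is found.
tesseract_version() {
  awk '$1 == "tesseract" { print $2; found = 1; exit } END { exit !found }'
}

# Example usage (commented out; names are the sample values from this topic):
# kubectl exec -n jdoe-poc jdoe-poc-classic-rest-service-0 \
#   -c classic-rest-service -- tesseract -v 2>&1 | tesseract_version
```

The 2>&1 redirection in the usage line is a precaution in case the installed Tesseract build writes its version banner to standard error.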
-
Access each Fusion parser used for a datasource that performs OCR and select the following items:
-
Apache Tika
-
Include images
-
Scan one of the following files to test the OCR function:
-
A .pdf file, which may contain an underlying .tiff file
-
A .jpeg file
-
A .gif file
-
Verify the parser correctly extracts the information, which includes the
body_t
field.