Web V2 Connector Configuration Reference
The Web V2 connector retrieves data from a website over HTTP, starting from a specified URL.
Starting with Fusion 5.9.11, users of the Web V2 connector must upgrade to version 2.0.0 or later. Previous versions (e.g., 1.4.0) are incompatible due to changes introduced by the upgraded JDK in Fusion 5.9.11.
For Web V2 v2.1.0 and later, up to three Web V2 connectors can run simultaneously in a single cluster. This prevents reaching a max concurrency limit per Web V2 connector, which affects how much data can be sent to Selenium Grid at one time.
Fusion 5.6 and later uses the Open Graph Protocol as the default configuration for fields. Deviation from that standard configuration may exclude information from indexing during the crawl.
If crawls fail with a corrupted CrawlDB error, reinstall the connector.
Remote connectors
V2 connectors support running remotely in Fusion versions 5.7.1 and later. Refer to Configure Remote V2 Connectors.
Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:
additionalVolumes:
- name: fusion-data1-pvc
  persistentVolumeClaim:
    claimName: fusion-data1-pvc
- name: fusion-data2-pvc
  persistentVolumeClaim:
    claimName: fusion-data2-pvc
additionalVolumeMounts:
- name: fusion-data1-pvc
  mountPath: "/connector/data1"
- name: fusion-data2-pvc
  mountPath: "/connector/data2"
You may also need to specify the user that is authorized to access the file system, as in this example:
securityContext:
  fsGroup: 1002100000
  runAsUser: 1002100000
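A mount whose name does not match any declared volume is easy to miss in a long values.yaml file. The following is a minimal Python sketch, not part of Fusion or Helm, that reports such mounts. It assumes the additionalVolumes and additionalVolumeMounts entries have already been loaded into a dictionary (for example with a YAML parser); the function name unmatched_mounts is hypothetical.

```python
def unmatched_mounts(plugin_values):
    """Return the names of mounts that have no volume with the same name."""
    volume_names = {v["name"] for v in plugin_values.get("additionalVolumes", [])}
    return [
        m["name"]
        for m in plugin_values.get("additionalVolumeMounts", [])
        if m["name"] not in volume_names
    ]


# Mirrors the example volume configuration shown earlier in this section.
connector_plugins = {
    "additionalVolumes": [
        {"name": "fusion-data1-pvc",
         "persistentVolumeClaim": {"claimName": "fusion-data1-pvc"}},
        {"name": "fusion-data2-pvc",
         "persistentVolumeClaim": {"claimName": "fusion-data2-pvc"}},
    ],
    "additionalVolumeMounts": [
        {"name": "fusion-data1-pvc", "mountPath": "/connector/data1"},
        {"name": "fusion-data2-pvc", "mountPath": "/connector/data2"},
    ],
}

# An empty list means every mount refers to a declared volume.
print(unmatched_mounts(connector_plugins))
```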
Selenium Grid setup
If you are using Web V2 2.1.0 or later, you must also use Selenium Grid as part of your Web V2 connector setup.
For hosted connectors, Selenium Grid support is available through Kubernetes. For remote connectors, Selenium Grid support is available through Docker Compose. See the Web V2 remote support repository for full setup instructions and YAML files.
Before you set up Selenium Grid, install the connector standalone plugin file for the version of Fusion that you are using and the most recent version of the Web V2 connector. Verify that you have the correct files at the Lucidworks plugins site.
The Selenium services require an x86 architecture to run properly. Running the Selenium services on an ARM-based system such as Apple Silicon is not supported.
Set up Selenium Grid in Kubernetes
If you are using a hosted connector, use Kubernetes to set up Selenium Grid. These steps explain how to deploy the Selenium Hub component and the two Chrome browser nodes that connect to Selenium Hub. The referenced YAML files are available in the k8s directory of the Web V2 remote support repository.
To set up Selenium Grid:
- In a terminal, apply the Kubernetes YAML configurations:
    kubectl apply -f deployment.yaml -n NAMESPACE
    kubectl apply -f chrome-deployment.yaml -n NAMESPACE
    kubectl apply -f service.yaml -n NAMESPACE
- Verify that the deployments are successful. Replace NAMESPACE with your namespace.
    kubectl get pods -n NAMESPACE
    kubectl get services -n NAMESPACE
- Adjust the network policy to allow port 4444. Enter the following command in a terminal:
    kubectl edit networkpolicy NAMESPACE-connector-plugin -n NAMESPACE
- Add the following snippet to the file:
    - ports:
      - port: 4444
        protocol: TCP
      - port: 4444
        protocol: UDP
- Save the file.
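After the network policy is saved, it can be useful to confirm that port 4444 actually accepts connections. The sketch below is one possible check using only the Python standard library; it is not a Lucidworks tool. It tests TCP only (the policy snippet above also opens UDP), and the helper name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port=4444, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and host names that do not resolve.
        return False
```

For example, calling port_is_open("SELENIUM_HUB_HOST") from a machine or pod that should reach the hub returns False if the policy still blocks the port. SELENIUM_HUB_HOST is a placeholder; substitute the service name or address defined in your service.yaml.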
Set up Selenium Grid in Docker Compose
If you are using a remote connector, use Docker Compose to set up Selenium Grid.
The referenced YAML files are available in the Web V2 remote support repository.
Before setting up Selenium Grid in Docker Compose, you must know what version of JDK your Fusion connectors are using. If you are using Fusion 5.9.10 or earlier, you are using JDK 11. If you are using Fusion 5.9.11 or later, you are using JDK 17.
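The version rule above can be expressed as a small helper. This is an illustrative Python sketch only: the function name jdk_for_fusion is hypothetical, and it assumes the Fusion version is a plain dotted number such as 5.9.11 with no suffix.

```python
def jdk_for_fusion(fusion_version):
    """Map a Fusion version string such as "5.9.11" to its connector JDK (11 or 17)."""
    parts = tuple(int(p) for p in fusion_version.split("."))
    # Fusion 5.9.11 and later use JDK 17; earlier releases use JDK 11.
    return 17 if parts >= (5, 9, 11) else 11


for version in ("5.9.10", "5.9.11"):
    print(version, "-> JDK", jdk_for_fusion(version))
# 5.9.10 -> JDK 11
# 5.9.11 -> JDK 17
```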
To set up Selenium Grid in Docker Compose:
- Visit the Web V2 remote support repository and select the folder corresponding to your JDK version. Download the contents of that folder.
- Edit the bin/conf/connector-config.yaml file to configure the Kafka bridge settings, the proxy settings, and the plugin path. The following snippet shows an example configuration. Quotation marks are required around the password.
    kafka-bridge:
      target: EXAMPLE_CONNECTORS_BACKEND.example.com:443
      # Uncomment proxy-server section if needed
    proxy:
      user: EXAMPLE_USERNAME
      password: "EXAMPLE_PASSWORD"
      url: https://FUSION_HOST:FUSION_PORT/
    plugin:
      path: /app/connector-plugin.zip
      type:
        suffix: remote-
- Save the configuration file.
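If you generate this file from a script, the quoting rule for the password is the part most likely to go wrong. The following Python sketch shows one way to always emit a double-quoted password. It is an illustration, not a Lucidworks utility: the function name render_connector_config is hypothetical, the key nesting is an assumption based on the example settings above, and optional settings such as the proxy-server section are omitted.

```python
import json


def render_connector_config(target, user, password, url, plugin_path, suffix):
    """Build connector-config.yaml text; the password is always double-quoted."""
    lines = [
        "kafka-bridge:",
        f"  target: {target}",
        "proxy:",
        f"  user: {user}",
        # json.dumps adds the double quotes and escapes embedded quotes or backslashes.
        f"  password: {json.dumps(password)}",
        f"  url: {url}",
        "plugin:",
        f"  path: {plugin_path}",
        "  type:",
        f"    suffix: {suffix}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values taken from the example configuration.
print(render_connector_config(
    target="EXAMPLE_CONNECTORS_BACKEND.example.com:443",
    user="EXAMPLE_USERNAME",
    password="EXAMPLE_PASSWORD",
    url="https://FUSION_HOST:FUSION_PORT/",
    plugin_path="/app/connector-plugin.zip",
    suffix="remote-",
))
```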
Now you can start the Docker Compose environment, which uses standard Docker Compose commands. You can start the environment in background mode or with live logs. To start Docker Compose in background mode, navigate to the directory for your environment and enter the following command in a terminal:
docker-compose up -d
To start Docker Compose in live mode, navigate to the directory for your environment and enter the following command in a terminal:
docker-compose up
When you’ve started Docker Compose, verify that the services are running. You can access the Selenium Grid console at http://localhost:4444/ui. Verify that the Selenium Hub is running and the Chrome nodes are connected.
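The same check can be scripted instead of done in the console. The sketch below assumes the environment runs Selenium Grid 4, which serves a JSON status document at /status on the hub with a ready flag and a list of nodes; if your Grid version returns a different shape, adjust the parsing. The function names are hypothetical.

```python
import json
import urllib.request


def summarize_grid_status(payload):
    """Return (hub_ready, nodes_up) from a decoded Selenium Grid /status response."""
    value = payload.get("value", {})
    nodes_up = sum(
        1 for node in value.get("nodes", []) if node.get("availability") == "UP"
    )
    return bool(value.get("ready", False)), nodes_up


def fetch_grid_status(base_url="http://localhost:4444", timeout=5.0):
    """Fetch and decode the /status document from the Grid at base_url."""
    with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as response:
        return json.load(response)
```

With the Docker Compose environment running, summarize_grid_status(fetch_grid_status()) returns a pair such as (True, N), where N is the number of connected Chrome nodes reporting as up.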
The Lucidworks connector is available on port 8764. Run docker-compose logs lucidworks-connector in a terminal to verify that the service is up.
To check the container status, run docker-compose ps in a terminal and verify that all containers are up.
Press Ctrl-C to stop the services when viewing logs in real time.
To stop all services, run docker-compose down in a terminal. If you want to remove all volumes when stopping all services, run docker-compose down -v.
Enable JavaScript Evaluation in Fusion
JavaScript evaluation allows the Web V2 connector to extract content from a website that is only available after JavaScript has rendered the document. It is available for Web V2 v2.1.0 and later on hosted and remote connectors. To enable JavaScript evaluation in the Web V2 connector:
- Navigate to your Web V2 datasource in Fusion.
- Select Javascript Evaluation Properties. The section expands to show the settings you can use to customize JavaScript evaluation.
- Select Evaluate JavaScript. This is required for using JavaScript evaluation.
- If you specified a SmartForms or SAML element in the Crawl Authentication Properties area, select Evaluate JavaScript during SmartForms/SAML Login.
- Headless browser mode is selected by default. In this mode, the browser that performs the crawl runs in the background and is not visible. If your website renders pages on the server side, clear the Headless browser field so that the crawl works correctly and retrieves links. If your website renders pages on the client side, leave the Headless browser field selected.
- Click Apply.
For the full JavaScript evaluation settings, see javascriptEvaluationConfig under Properties in the configuration specifications.
Resources
For help with authentication, see:
Configuration
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
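The difference comes from string escaping in the request body. As a minimal illustration, assuming the API request body is JSON, the Python sketch below serializes the value as typed in the UI and shows the doubled backslash that results. The property name delimiter is invented for this example and is not a documented connector property.

```python
import json

# The value as typed in the UI: two characters, a backslash followed by "t".
ui_value = r"\t"

# "delimiter" is a made-up property name used only for this illustration.
api_body = json.dumps({"delimiter": ui_value})
print(api_body)  # {"delimiter": "\\t"}

# Decoding the body gives back exactly what was typed in the UI.
print(json.loads(api_body)["delimiter"] == ui_value)  # True
```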