Web V2 - Lucidworks documentation

Latest version: v2.2.1
Compatible with Fusion version: 5.9.0 and later

The Web V2 connector for Fusion crawls and ingests content from websites over HTTP. It works best for crawling static content by retrieving data from a website starting at one or more seed URLs (the startLinks), following links up to a configurable depth, and indexing the fetched pages. Starting with Fusion 5.9.11, users of the Web V2 connector must upgrade to version 2.0.0 or later. Previous versions are incompatible due to changes introduced by the upgraded JDK in Fusion 5.9.11.

For Web V2 v2.1.0 and later, up to three Web V2 connectors can run simultaneously in a single cluster. This prevents reaching a max concurrency limit per Web V2 connector, which affects how much data can be sent to Selenium Grid at one time.

Fusion 5.6 and later uses the Open Graph Protocol as the default configuration for fields. Deviation from that standard configuration may exclude information from indexing during the crawl.

If crawls fail with a corrupted CrawlDB error, reinstall the connector.

Prerequisites

Perform these prerequisites to ensure the connector can reliably access, crawl, and index your data. Proper setup helps avoid configuration or permission errors, so use the following guidelines to keep your content available for discovery and search in Fusion. Confirm network and security configurations:

Egress must allow HTTP/2 traffic from your on-prem host into the Fusion gRPC endpoint.
You need a Fusion user account with the remote-connectors or admin role to authenticate the connector.
If the standalone host doesn’t trust Fusion’s TLS cert, point to your truststore file path.

Authentication

Setting up the correct authentication according to your organization’s data governance policies helps keep sensitive data secure while allowing authorized indexing. The Web V2 connector supports many crawl-time authentication methods, including OAuth token retrieval. You configure these under Crawl Authentication Properties in your datasource. The supported authentication schemes include the following:

Basic HTTP Authentication: Provide host, port, realm (if any), username, and password.
Digest HTTP Authentication: Same parameters as Basic, but using the Digest challenge/response scheme.
NTLM Authentication For Windows-style authentication: Provide domain, workstation, host, port, realm (if any), username, and password.
Form Authentication: Post to a login form URL with whatever name/value pairs the site expects, plus a TTL for the session.
SAML/Smart Form Authentication: For multi-step or SAML-backed form logins, submit a sequence of forms until you’re authenticated.
OAuth 2.0: Fetch a bearer token using client-credentials before crawling. You supply the token-endpoint URL and parameter names/values (for example, client_id, client_secret, scope, or grant_type) in the Access Token section of Crawl Authentication Properties.

Once you’ve picked your scheme, Fusion will maintain cookies and session state automatically for the duration of the crawl.

Authentication resources

Configure Web Site Authentication

The Web V1 connector retrieves data from a Web site using HTTP and starting from a specified URL.The Web connector supports Basic, Digest, Form, and NTLM authentication to websites.The credentials for a crawl are stored in a credentials file in https://FUSION_HOST:FUSION_PORT/data/connectors/container/lucid.web/datasourceName, where datasourceName is the name of the datasource. After you create a datasource, Fusion creates this directory for you. The file should be a JSON formatted file, ending with the .json file extension. When defining the datasource, use the name of the file in the Authentication credentials filename field in the UI (or for the f.credentialsFile property if using the REST API).All authentication types require the credentials file to include a property called type that defines the type of authentication to use. The other required properties vary depending on the type of authentication chosen.

Basic form-based Authentication

To use basic form-based authentication, use form for the type. The other properties are:

ttl. The “time to live” for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.
action. The action to take to log in. That is, the URL for the login form.
params. The parameters for the form, likely the username and password, but could be other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl.

Here is an example using form-based authentication:

[ {
        "credential" : {
            "type" : "form",
            "ttl" : 300000,
            "action" : "http://some.server.com/login.action?os_destination=%2Fpages%2Fviewpage.action%3Ftitle%3DAcme%2B5%2BDocumentation%26spaceKey%3DAcme5",
            "params" : {
                "os_username" : "username",
                "os_password" : "password"
            }
        }
  } ]

Complex Form-based Authentication

Some websites do not manage their own authentication, but rather trust a third-party authority to authenticate the user. An example of this is websites that use SAML to log in a user via a central single-signon authority. To configure fusion to log in to a website like this, use smartForm for the type. The other properties are:

ttl. The “time to live” for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.
loginUrl. The URL on which the first page that initializes the login chain is located
params. A list of parameters to use for the form logins, likely the username and password, but could be other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl. Additionally we expect that once that login has happened, a new form is presented to the user which then posts back to where we came from. No data need to be entered in this form, which is why we include an empty { } in the params list.

Here is an example using form-based authentication:

[ {
        "credential" : {
            "type" : "smartForm",
            "ttl" : 300000,
            "loginUrl" : "http://some.example.com/login",
            "params" : [{
                "os_username" : "username",
                "os_password" : "password"
            }, {

            } ]
        }
  } ]

To figure out what parameters you need to specify, turn off JavaScript in your browser and go through the login work flow. Though you normally see only a single login form on your screen, you might find many more forms you need to submit before you get logged in when JavaScript is not available to perform those form submissions automatically. Each form in that login chain needs to be represented in the list of params. If no user input is required, simply include an empty { }.

Basic and Digest Authentication

Basic and Digest authentication are simple HTTP authentication methods still in use in some places. To use either of these types, in the credentials file, for the type property use “basic” or “digest”. The other properties are:

host. The host of the site.
port. The port, if any.
userName. The username to use for authentication.
password. The password for the userName.
realm. The security realm for the site, if any.

Example basic auth configuration:

[ {
        "credential" : {
            "type" : "basic",
            "ttl" : 300000,
            "userName" : "usr",
            "password" : "pswd",
            "host":"hostname.exampledomain.com”
            "port": 443
        }
  }
]

NTLM Authentication

To use NTLM authentication, in the credentials file, for the type property, use ntlm. The other properties available are:

host. The host of the site.
port. The port, if any.
userName. The username to use for authentication.
password. The password for the userName.
realm. The security realm for the site, if any.
domain. The domain.
workstation. The workstation, as needed.

Example NTLM credential configuration:

[ {"credential" :
   { "type" : "ntlm",
     "ttl" : 300000,
     "port" : 80,
     "host" : "someHost",
     "domain" : "someDomain",
     "userName" : "someUser",
     "password" : "XXXXXXXX"
   }
} ]

Crawl a Web site protected by Kerberos

In Fusion 4.1 and later, the Web connector can crawl Web sites protected by Kerberos using SPNEGO. This is a way to access Web sites without requiring a user’s login credentials.

Kerberos support requires Fusion 5.9.5.

The Fusion Web connector can optionally use Kerberos with SAML/Smart Form authentication.To crawl Kerberos-protected Web sites, first create the necessary configuration files, then configure Fusion to use them.

Create standard Java configuration files to connect to Kerberos

Fusion uses the JDK standard JAAS Kerberos implementation, which is based on three system properties that reference three separate files.The files are as follows:

On the Kerberos-protected server, a keytab file, named kerberuser.keytab in our examples.
On the Fusion system, a configuration file named login.conf.
On the Fusion system, an initialization file named krb5.ini.

Create a Kerberos keytab

Create and validate the keytab file for the Kerberos client principal you want to use to authenticate to the website.If you do not specify the kerberosPrincipalName and kerberosKeytabFilePath or kerberosKeytabBase64 when creating the Fusion datasource, Fusion uses the default login principal and ticket cache. You can see the default values by logging into the Fusion server as the user who runs Fusion and running klist.If you do not want to use the default account and credentials, specify these configuration properties when creating a keytab as well as in the Web datasource setup. Use the Kerberos user principal name (UPN), not the service principal name (SPN, which is used with the Kerberos security realm). In some cases the UPN can be a service.In our examples, the Fusion Web crawler authenticates to the Web sites using the user [email protected]. We create a keytab file kerbuser.keytab for the user principal [email protected].

Create a Kerberos keytab on Windows

Example:

ktpass -out kerbuser.keytab -princ [email protected] -mapUser kerbuser -mapOp set -pass YOUR_PASSWORD -crypto AES256-SHA1 -pType KRB5_NT_PRINCIPAL

The following weak encryption types are not supported by Fusion:

DES-CBC-CRC * DES_CBC_MD5
Microsoft Windows 2000 RC4-HMAC

Create a Kerberos keytab on Ubuntu Linux

Prerequisite: Install the krb5-user package: sudo apt-get install krb5-userExample:

ktutil
addent -password -p HTTP/[email protected] -k 1 -e aes128-cts-hmac-sha1-96
- it will ask you for password of kerbuser -
wkt kerbuser.keytab
q

Test the keytab

Once you create a keytab, verify that it works.Prerequisite: You need a version of curl installed that allows SPNEGO. To test whether your version of curl does this, run curl --version and make sure SPNEGO is in the output.Run the following curl command (replace the keytab path and site):

export KRB5CCNAME=FILE:/path/to/kerbuser.keytab curl -vvv --negotiate -u : http://your-site.com

Note that the first request is a 401 status code for the negotiate request followed by a second request, which is a status of 200.On the Fusion server, create login.conf and krb5.ini files as follows.

Create a login.conf on Windows

For hosted connectors, Selenium Grid support is available through Kubernetes. For remote connectors, Selenium Grid support is available through Docker Compose. See the Web V2 remote support repository for full setup instructions and YAML files.In this example, the keytab is stored at C:\\kerb\\kerbuser.keytab

KrbLogin {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="C:\\kerb\\kerbuser.keytab"
  useTicketCache=true
  principal="[email protected]"
  debug=true;
};

Create a login.conf on Linux

In this example, the keytab is stored at /home/lucidworks/kb.keytab

com.sun.security.jgss.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/home/lucidworks/kerbuser.keytab"
  useTicketCache=true
  principal="[email protected]"
  debug=true;
};

The format of the login.conf is described on the Oracle Web site.

Create a krb5.ini

When you install krb5 on Linux, you can find a Kerberos configuration file in /etc/krb5.conf. You can optionally create a custom one instead.Creating a krb5.conf is the same for Linux and Windows. On Windows the file is krb5.ini.In this example the domain is WIN.LAB.LUCIDWORKS.COM, the Kerberos kdc host is my.kdc-dns.com, and the Kerberos admin server is my-admin-server-dns.com.Example:

    default_realm = WIN.LAB.LUCIDWORKS.COM
    default_tkt_enctypes = aes128-cts-hmac-sha1-96
    default_tgs_enctypes = aes128-cts-hmac-sha1-96
    permitted_enctypes = aes128-cts-hmac-sha1-96
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    udp_preference_limit = 1

WIN.LAB.LUCIDWORKS.COM = {
   kdc = my.kdc-dns.com
   admin_server = my.admin-server-dns.com
}

**WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM**

WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM

The format of the krb5.ini file is described in the MIT Kerberos documentation. You can change the encryption algorithms by changing the properties default_tkt_enctypes, default_tgs_enctypes, and permitted_enctypes as needed. For example:

default_tkt_enctypes = RC4-HMAC
default_tgs_enctypes = RC4-HMAC
Permitted_enctypes = RC4-HMAC

Configure Fusion to use Kerberos

Once you have the keytab, login.conf, and krb5.ini files, configure Fusion to use Kerberos. You must set a property in a Fusion configuration file in addition to defining the datasource in the Fusion UI.At the command line on every machine in your Fusion cluster:

In $FUSION_HOME/conf/fusion.cors (fusion.properties in Fusion 4.x), add the following property to the connectors-classic.jvmOptions setting: -Djavax.security.auth.useSubjectCredsOnly=false
Restart the connectors-classic service using ./bin/connectors-classic restart on Linux or bin\connectors-classic.cmd restart on Windows.

In the Fusion UI:

Click Indexing > Datasources.
Click Add+, then Web.
Enter a datasource ID and a start link.
Click Crawl authorization.
At the bottom of the section, check Enable SPNEGO/Kerberos Authentication.
You can either use the default principal name or specify a principal name to use.
- If you do not specify the principal name, then Fusion uses the default login principal and ticket cache. You can see those default values by logging into the Fusion server as the user who runs Fusion and running klist.
If you specify a principal name, you must provide a keytab, either in Base64 or as a file path.
- If you specify a keytab file path, the file must be on the machine running the Fusion connector, for each connector’s node in the cluster.
- The Base64 option lets you supply the keytab in one place, in the UI.
Fill in any remaining options to configure the datasource.
Click Save.

Troubleshoot Kerberos authentication

Problem: When trying to crawl a Kerberos-authenticated Web site, you get an error like this:

crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
    at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
    at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
    at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
    at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
    at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
    at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
    at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
    at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=HTTP/[email protected], debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
    at com.lucidworks.connectors.spnego.SpnegoAuth.&lt;init>(SpnegoAuth.java:83) ~[?:?]
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
    ... 8 more
Caused by: javax.security.auth.login.LoginException: No key to store
    at com.sun.security.auth.module.Krb5LoginModule.commit(Krb5LoginModule.java:1119) ~[?:1.8.0_161]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.login(LoginContext.java:588) ~[?:1.8.0_161]
    at com.lucidworks.connectors.spnego.SpnegoAuth.&lt;init>(SpnegoAuth.java:76) ~[?:?]
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
    ... 8 more

Resolution:First test your keytab as described in test the keytab above.If your keytab passes validation, another possibility is that the /tmp/krb* cache file got corrupted or is not compatible after you went through other troubleshooting steps. To rule that out, remove the /tmp/krb* cache file on all hosts, restart your connectors-classic, and try the crawl again. That is, on each host:

rm -f /tmp/krb*
$FUSION_HOME/bin/connectors-classic restart

401 error

Problem: Crawling using the Web connector with Kerberos results in a 401 error, but curl with Kerberos works fine.Resolution:Make sure you have this system property set in connectors-classic jvmOptions on all nodes:

-Djavax.security.auth.useSubjectCredsOnly=false

You must restart connectors-classic after making that change.If that doesn’t work, make sure the user you are authenticating with from Curl matches the user you are trying to authenticate with from the Web connector. To see your Kerberos principal user name, run klist.

Error: “Pre-authentication information was invalid - Identifier doesn’t match expected value”

Problem: When crawling using the Web connector with Kerberos enabled, you get an error like this:

crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
	at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
	at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
	at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
	at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
	at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
	at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
	at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
	at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {[email protected], debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
	at com.lucidworks.connectors.spnego.SpnegoAuth.&lt;init>(SpnegoAuth.java:83) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: javax.security.auth.login.LoginException: Pre-authentication information was invalid (24)
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.&lt;init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: sun.security.krb5.KrbException: Pre-authentication information was invalid (24)
	at sun.security.krb5.KrbAsRep.&lt;init>(KrbAsRep.java:76) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.&lt;init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: sun.security.krb5.Asn1Exception: Identifier does not match expected value (906)
	at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140) ~[?:1.8.0_161]
	at sun.security.krb5.internal.ASRep.init(ASRep.java:64) ~[?:1.8.0_161]
	at sun.security.krb5.internal.ASRep.&lt;init>(ASRep.java:59) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsRep.&lt;init>(KrbAsRep.java:60) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.&lt;init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more

Resolution:Your keytab’s principal name doesn’t match the value on the ticket server. Check the principal name for your user.

Web V2 connector OAuth access token configuration

The Web V2 connector v1.4.0 added the ability to authenticate using OAuth.

Example with PingFederate

This example uses Client Credentials Grant Type and a basic configuration.

Create new app in PingIdentity.
Edit the app configuration:
1. Check Client Credentials Grant Type.
2. Set Token Endpoint Authentication Method to Client Secret Post.
Enable the app by clicking on the toggle button to the right of the app name.
Go to the Configuration tab and click Get Access Token. At this point it will probably show an error about scope. Fix it by adding a scope, such as testScope.
Click the Get Access Token button again under the Configuration tab. It should generate and display your access token.
Go to the Configuration tab and click URLs. Copy the Token Endpoint URL for further use.

You can view a sample of Client Credentials Grant Type access token request in the PingIdentity documentation. In this example the access token request is:

$ curl -X POST "https://auth.pingone.com/ENV_ID/as/token"
-d "grant_type=client_credentials&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&scope=testScope" --header "Content-Type:application/x-www-form-urlencoded"
{
  "access_token" : "ACCESS_TOKEN",
  "token_type" : "Bearer",
  "expires_in" : 3600,
  "scope" : "testScope"
}

In Fusion, go to CRAWL AUTHENTICATION PROPERTIES in the datasource configuration and fill in the ACCESS TOKEN section by adding the URL and parameters of the request. Replace CLIENT_ID with your own client ID.
- ACCESS TOKEN REQUEST URL: https://auth.pingone.com/ENV_ID/as/token
- PARAMETER NAME | PARAMETER VALUE: grant_type | client_credentials
- PARAMETER NAME | PARAMETER VALUE: client_id | CLIENT_ID
- PARAMETER NAME | PARAMETER VALUE: scope | testScope

Example with Microsoft Azure AD

In Microsoft Entra ID, go to App Registrations and register a new app.
Go to your app and click Certificates & secrets and add a client secret.
Copy the client secret Value for further use.

View the Microsoft documentation for a request to get access token. In this example, we use the client credentials grant type:

$ curl -X POST "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token"
-d "client_id=CLIENT_ID&scope=https%3A%2F%2Fgraph.microsoft.com%2F.default&client_secret=CLIENT_SECRET&grant_type=client_credentials"
-H "Content-Type: application/x-www-form-urlencoded"
{"token_type":"Bearer","expires_in":3599,"ext_expires_in":3599,"access_token":"ACCESS_TOKEN"}

Copy the endpoint URL and the parameters to the CRAWL AUTHENTICATION PROPERTIES > ACCESS TOKEN section in the datasource configuration in Fusion. Replace CLIENT_ID with your own client ID.
- ACCESS TOKEN REQUEST URL: https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token
- PARAMETER NAME | PARAMETER VALUE: grant_type | client_credentials
- PARAMETER NAME | PARAMETER VALUE: client_id | CLIENT_ID
- PARAMETER NAME | PARAMETER VALUE: scope | https://graph.microsoft.com/.default

Remote connectors

V2 connectors support running remotely in Fusion versions 5.7.1 and later.

Configure remote V2 connectors

If you need to index data from behind a firewall, you can configure a V2 connector to run remotely on-premises using TLS-enabled gRPC.

Prerequisites

Before you can set up an on-prem V2 connector, you must configure the egress from your network to allow HTTP/2 communication into the Fusion cloud. You can use a forward proxy server to act as an intermediary between the connector and Fusion.The following is required to run V2 connectors remotely:

The plugin zip file and the connector-plugin-standalone JAR.
A configured connector backend gRPC endpoint.
Username and password of a user with a remote-connectors or admin role.
If the host where the remote connector is running is not configured to trust the server’s TLS certificate, you must configure the file path of the trust certificate collection.

If your version of Fusion doesn’t have the remote-connectors role by default, you can create one. No API or UI permissions are required for the role.

Connector compatibility

Only V2 connectors are able to run remotely on-premises. You also need the remote connector client JAR file that matches your Fusion version. You can download the latest files at V2 Connectors Downloads.

Whenever you upgrade Fusion, you must also update your remote connectors to match the new version of Fusion.

The gRPC connector backend is not supported in Fusion environments deployed on AWS.

System requirements

The following is required for the on-prem host of the remote connector:

(Fusion 5.9.0-5.9.10) JVM version 11
(Fusion 5.9.11) JVM version 17
Minimum of 2 CPUs
4GB Memory

Note that memory requirements depend on the number and size of ingested documents.

Enable backend ingress

In your values.yaml file, configure this section as needed:

ingress:
  enabled: false
  pathtype: "Prefix"
  path: "/"
  #host: "ingress.example.com"
  ingressClassName: "nginx"   # Fusion 5.9.6 only
  tls:
    enabled: false
    certificateArn: ""
    # Enable the annotations field to override the default annotations
    #annotations: ""

Set enabled to true to enable the backend ingress.
Set pathtype to Prefix or Exact.
Set path to the path where the backend will be available.
Set host to the host where the backend will be available.
In Fusion 5.9.6 only, you can set ingressClassName to one of the following:
- nginx for Nginx Ingress Controller
- alb for AWS Application Load Balancer (ALB)
Configure TLS and certificates according to your CA’s procedures and policies.
TLS must be enabled in order to use AWS ALB for ingress.

Connector configuration example

kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443 # mandatory
  plain-text: false # optional, false by default.
  proxy-server: # optional - needed when a forward proxy server is used to provide outbound access to the standalone connector
    host: host
    port: some-port
    user: user # optional
    password: password # optional
  trust: # optional - needed when the client's system doesn't trust the server's certificate
    cert-collection-filepath: path1

proxy: # mandatory fusion-proxy
  user: admin
  password: password123
  url: https://fusiontest.com/ # needed only when the connector plugin requires blob store access

plugin: # mandatory
  path: ./fs.zip
  type: #optional - the suffix is added to the connector id
    suffix: remote

Minimal example

kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443

proxy:
  user: admin
  password: "password123"

plugin:
  path: ./testplugin.zip

Logback XML configuration file example

<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="com.lucidworks.logging.logback.classic.LucidworksPatternLayoutEncoder">
            <pattern>%d - %-5p [%t:%C{3.}@%L] - %m{nolookups}%n</pattern>
            <charset>utf8</charset>
        </encoder>
    </appender>

    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOGDIR:-.}/connector.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- rollover daily -->
            <fileNamePattern>${LOGDIR:-.}/connector-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>50MB</maxFileSize>
            <totalSizeCap>10GB</totalSizeCap>
        </rollingPolicy>
        <encoder class="com.lucidworks.logging.logback.classic.LucidworksPatternLayoutEncoder">
            <pattern>%d - %-5p [%t:%C{3.}@%L] - %m{nolookups}%n</pattern>
            <charset>utf8</charset>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="FILE"/>
    </root>
</configuration>

Run the remote connector

java [-Dlogging.config=[LOGBACK_XML_FILE]] \
  -jar connector-plugin-client-standalone.jar [YAML_CONFIG_FILE]

The logging.config property is optional. If not set, logging messages are sent to the console.

Test communication

You can run the connector in communication testing mode. This mode tests the communication with the backend without running the plugin, reports the result, and exits.

java -Dstandalone.connector.connectivity.test=true -jar connector-plugin-client-standalone.jar [YAML_CONFIG_FILE]

Encryption

In a deployment, communication to the connector’s backend server is encrypted using TLS. You should only run this configuration without TLS in a testing scenario. To disable TLS, set plain-text to true.

Egress and proxy server configuration

One of the methods you can use to allow outbound communication from behind a firewall is a proxy server. You can configure a proxy server to allow certain communication traffic while blocking unauthorized communication. If you use a proxy server at the site where the connector is running, you must configure the following properties:

Host. The hosts where the proxy server is running.
Port. The port the proxy server is listening to for communication requests.
Credentials. Optional proxy server user and password.

When you configure egress, it is important to disable any connection or activity timeouts because the connector uses long running gRPC calls.

Password encryption

If you use a login name and password in your configuration, run the following utility to encrypt the password:

Enter a user name and password in the connector configuration YAML.

Run the standalone JAR with this property:

-Dstandalone.connector.encrypt.password=true

Retrieve the encrypted passwords from the log that is created.
Replace the clear password in the configuration YAML with the encrypted password.

Connector restart (5.7 and earlier)

The connector will shut down automatically whenever the connection to the server is disrupted, to prevent it from getting into a bad state. Communication disruption can happen, for example, when the server running in the connectors-backend pod shuts down and is replaced by a new pod. Once the connector shuts down, connector configuration and job execution are disabled. To prevent that from happening, you should restart the connector as soon as possible.You can use Linux scripts and utilities to restart the connector automatically, such as Monit.

Recoverable bridge (5.8 and later)

If communication to the remote connector is disrupted, the connector will try to recover communication and gRPC calls. By default, six attempts will be made to recover each gRPC call. The number of attempts can be configured with the max-grpc-retries bridge parameters.

Job expiration duration (5.9.5 only)

The timeout value for irresponsive backend jobs can be configured with the job-expiration-duration-seconds parameter. The default value is 120 seconds.

Use the remote connector

Once the connector is running, it is available in the Datasources dropdown. If the standalone connector terminates, it disappears from the list of available connectors. Once it is re-run, it is available again and configured connector instances will not get lost.

Enable asynchronous parsing (5.9 and later)

To separate document crawling from document parsing, enable Tika Asynchronous Parsing on remote V2 connectors.

Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:

additionalVolumes:
- name: fusion-data1-pvc
    persistentVolumeClaim:
    claimName: fusion-data1-pvc
- name: fusion-data2-pvc
    persistentVolumeClaim:
    claimName: fusion-data2-pvc
additionalVolumeMounts:
- name: fusion-data1-pvc
    mountPath: "/connector/data1"
- name: fusion-data2-pvc
    mountPath: "/connector/data2"

You may also need to specify the user that is authorized to access the file system, as in this example:

securityContext:
    fsGroup: 1002100000
    runAsUser: 1002100000

Selenium Grid setup

If you are using Web V2 v2.1.0 or later, you must also use Selenium Grid as part of your Web V2 connector setup.

When using Web V2 v2.2.1 or later on Fusion 5.9.15 or later, you do not need to perform any Selenium setup. Selenium is installed automatically with the connector and removed automatically when the connector is removed.Manual installation of Selenium is still required in Fusion 5.9.14 or earlier, even if you use Web V2 v2.2.1.

For hosted connectors, Selenium Grid support is available through Kubernetes. For remote connectors, Selenium Grid support is avialable through Docker Compose. See the Web V2 remote support repository for full setup instructions and YAML files. Before you set up Selenium Grid, install the connector standalone plugin file for the version of Fusion that you are using and the most recent version of the Web V2 connector. Verify that you have the correct files at the Lucidworks plugins site.

The Selenium services require an x86 architecture to run properly. Running the Selenium services on an ARM-based system such as Apple Silicon is not supported.

Set up Selenium Grid in Kubernetes

If you are using a hosted connector, use Kubernetes to set up Selenium Grid. These steps explain how to deploy the Selenium Hub component and the two Chrome browser nodes that connect to Selenium Hub. The referenced YAML files are available in the k8s directory of the Web V2 remote support repository. To set up Selenium Grid:

In a terminal, apply the Kubernetes YAML configurations:

kubectl apply -f deployment.yaml -n NAMESPACE
kubectl apply -f chrome-deployment.yaml -n NAMESPACE
kubectl apply -f service.yaml -n NAMESPACE

Verify that the deployments are successful. Replace NAMESPACE with your namespace.
```
kubectl get pods -n NAMESPACE
kubectl get services -n NAMESPACE
```
Adjust the network policy to allow port 4444. Enter the following command in a terminal:
```
kubectl edit networkpolicy NAMESPACE-connector-plugin -n NAMESPACE
```

Add the following snippet to the file:

- ports:
  - port: 4444
    protocol: TCP
  - port: 4444
    protocol: UDP

Save the file.

Set up Selenium Grid in Docker Compose

If you are using a remote connector, use Docker Compose to set up Selenium Grid. The referenced YAML files are available in the Web V2 remote support repository. Before setting up Selenium Grid in Docker Compose, you must know what version of JDK your Fusion connectors are using. If you are using Fusion 5.9.10 or earlier, you are using JDK 11. If you are using Fusion 5.9.11 or later, you are using JDK 17. To set up Selenium Grid in Docker Compose:

Visit the the Web V2 remote support repository and select the folder corresponding to your JDK version. Download the contents of that folder.

Edit the bin/conf/connector-config.yaml file to configure the Kafka bridge settings, the proxy settings, and the plugin path. The following snippet shows an example configuration. Quotation marks are required around the password.

kafka-bridge:
  target: EXAMPLE_CONNECTORS_BACKEND.example.com:443
  # Uncomment proxy-server section if needed
proxy:
  user: EXAMPLE_USERNAME
  password: "EXAMPLE_PASSWORD"
  url: https://FUSION_HOST:FUSION_PORT/
plugin:
  path: /app/connector-plugin.zip
  type:
    suffix: remote-

Save the configuration file.

Now you can start the Docker Compose environment, which uses standard Docker Compose commands. You can start the environment in background mode or with live logs. To start Docker Compose in background mode, navigate to the directory for your environment and enter the following command in a terminal:

docker-compose up -d

To start Docker Compose in live mode, navigate to the directory for your environment and enter the following command in a terminal:

docker-compose up

When you’ve started Docker Compose, verify that the services are running. You can access the Selenium Grid console at http://localhost:4444/ui. Verify that the Selenium Hub is running and the Chrome nodes are connected. The Lucidworks connector is available on port 8764. Run docker-compose logs lucidworks-connector in a terminal to verify that the service is up. To check the container status, run docker-compose ps in a terminal and verify that all containers are up. Press Ctrl-C to stop the services when viewing logs in real-time. To stop all services, run docker-compose down in a terminal. If you want to remove all volumes when stopping all services, run docker-compose down -v.

Control the web connector’s crawl depth

The Max Crawling Depth setting controls how many levels the Web V2 connector follows from your starting URL. You can think of these crawl levels as link hops or navigation levels from a starting URL. Setting the crawl depth ensures you are crawling the appropriate pages for your search experience. As you increase the Max Crawling Depth value, the Web V2 connector may crawl exponentially more pages. For example, if the start link contains links to ten pages, and each of those pages links to ten different pages, then a crawl depth of 2 can grow to 100 pages. As the number of pages crawled increases, the time to complete the crawl also increases. Deeper crawl depths consume more memory, use more HTTP requests and bandwidth, and may cause your website to limit the number of pages the web connector can crawl or time out before completing the crawl. Without strict limits, it’s possible for the Web V2 connector to crawl extremely large websites indefinitely if the website contains dynamic content or circular links.

Use a start link

A start link can be a page of your website or a webpage with a list of HTML links. The Web V2 connector starts the crawl from this page and crawls the page to the specified depth. You may specify multiple start URLs in the Bulk Start Links field. A simple example is starting a crawl from a company’s home page with the following layout:

Company Home Page (Start page - Depth 0)
├── About Us (Depth 1)
│   ├── Our History (Depth 2)
│   └── Leadership Team (Depth 2)
├── Products (Depth 1)
│   ├── Product A (Depth 2)
│   │   └── Product A Details (Depth 3)
│   └── Product B (Depth 2)
└── Contact (Depth 1)

If you set the Max Crawling Depth field to a different value, the Web V2 connector crawls a different set of pages. For example:

Depth	Crawled pages	Example crawl results
0	Only start URLs	Just the home page
1	Start URLs and their direct links	Home page + About, Products, Contact
2	Previous + links from depth 1 pages	Depth 1 links plus History, Leadership, Product A, Product B
3	Previous + links from depth 2 pages	Depth 2 links plus Product A Details
-1	No limit	Everything the crawler can find

If you set the max crawling depth to 0, the web connector crawls only your start page if you are using a start URL. This setting is recommended for testing basic connectivity or troubleshooting. A max crawling depth of 2 or 3 is sufficient for most business needs, especially when combined with URL filters and page limits. Very large websites may need a max crawling depth of 4. If you need comprehensive coverage, set the max crawling depth to -1.

If you set the max crawling depth to -1, set strict document limits and URL filters so your crawls complete successfully.

Use a sitemap file

The Web V2 connector supports XML sitemap files. These files must be in XML format and are typically named sitemap.xml or similar. To use a sitemap instead of a start link, select Link Discovery, enter a sitemap URL in the Sitemap URLs field, and click Add. You may add multiple sitemap URLs. Unlike using a start link, setting the max crawling depth to 0 when using a sitemap crawls the URLs listed in the sitemap. Setting the max crawling depth to 1 crawls the URLs listed in the sitemap and the links included in those pages.

Configure the timeout

The connector timeout can be configured as infinite to handle large files. Configure the timeout by enabling Crawl performance properties and setting the Connection timeout (ms).

The Connection timeout (ms) option becomes visible only after selecting the Advanced checkbox.

Use the following guidelines when setting the timeout:

To enable an infinite timeout for large file downloads, set the timeout configuration to 0.
For standard timeout behavior, use any positive value. For example, a timeout of 30000 gives a 30-second timeout with clean connection termination.

Enable Javascript Evaluation in Fusion

JavaScript evaluation allows the Web V2 connector to extract content from a website that is only available after JavaScript has rendered the document. It is available for Web V2 v2.1.0 and later on hosted and remote connectors. To enable JavaScript evaluation in the Web V2 connector:

Navigate to your Web V2 datasource in Fusion.
Select Javascript Evaluation Properties. A variety of settings displays. In this section you can customize your JavaScript evaluation settings.
Select Evaluate JavaScript. This is required for using JavaScript evaluation.
If you specified a SmartForms or SAML element in the Crawl Authentication Properties area, select Evaluate JavaScript during SmartForms/SAML Login.
Headless browser mode is selected by default, which runs the browser performing the website crawl in the background without being visible. If your website renders pages on the server side, the Headless browser field must be unchecked for the crawl to work correctly and retrieve links. If your website renders pages on the client side, the Headless browser field should be checked.
Click Apply.

For the full JavaScript evaluation settings, see javascriptEvaluationConfig under Properties in the configuration specifications.

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.

Fusion Connectors

​Prerequisites

​Authentication

​Authentication resources

​Basic form-based Authentication

​Complex Form-based Authentication

​Basic and Digest Authentication

​NTLM Authentication

​Crawl a Web site protected by Kerberos

​Create standard Java configuration files to connect to Kerberos

​Create a Kerberos keytab

Create a Kerberos keytab on Windows

Create a Kerberos keytab on Ubuntu Linux

Test the keytab

​Create a login.conf and krb5.ini

Create a login.conf on Windows

Create a login.conf on Linux

Create a krb5.ini

​Configure Fusion to use Kerberos

​Troubleshoot Kerberos authentication

​javax.security.auth.login.LoginException: No key to store

​401 error

​Error: “Pre-authentication information was invalid - Identifier doesn’t match expected value”

​Example with PingFederate

​Example with Microsoft Azure AD

​Remote connectors

​Prerequisites

​Connector compatibility

​System requirements

​Enable backend ingress

​Connector configuration example

​Minimal example

​Logback XML configuration file example

​Run the remote connector

​Test communication

​Encryption

​Egress and proxy server configuration

​Password encryption

​Connector restart (5.7 and earlier)

​Recoverable bridge (5.8 and later)

​Job expiration duration (5.9.5 only)

​Use the remote connector

​Enable asynchronous parsing (5.9 and later)

​Selenium Grid setup

​Set up Selenium Grid in Kubernetes

​Set up Selenium Grid in Docker Compose

​Control the web connector’s crawl depth

​Use a start link

​Use a sitemap file

​Configure the timeout

​Enable Javascript Evaluation in Fusion

​Configuration

Prerequisites

Authentication

Authentication resources

Basic form-based Authentication

Complex Form-based Authentication

Basic and Digest Authentication

NTLM Authentication

Crawl a Web site protected by Kerberos

Create standard Java configuration files to connect to Kerberos

Create a Kerberos keytab

Create a login.conf and krb5.ini

Configure Fusion to use Kerberos

Troubleshoot Kerberos authentication

javax.security.auth.login.LoginException: No key to store

401 error

Error: “Pre-authentication information was invalid - Identifier doesn’t match expected value”

Example with PingFederate

Example with Microsoft Azure AD

Remote connectors

Prerequisites

Connector compatibility

System requirements

Enable backend ingress

Connector configuration example

Minimal example

Logback XML configuration file example

Run the remote connector

Test communication

Encryption

Egress and proxy server configuration

Password encryption

Connector restart (5.7 and earlier)

Recoverable bridge (5.8 and later)

Job expiration duration (5.9.5 only)

Use the remote connector

Enable asynchronous parsing (5.9 and later)

Selenium Grid setup

Set up Selenium Grid in Kubernetes

Set up Selenium Grid in Docker Compose

Control the web connector’s crawl depth

Use a start link

Use a sitemap file

Configure the timeout

Enable Javascript Evaluation in Fusion

Configuration