Configure Web Site Authentication

The Web connector retrieves data from a Web site using HTTP and starting from a specified URL.

The Web connector supports Basic, Digest, Form, and NTLM authentication to websites.

The credentials for a crawl are stored in a credentials file in VAR-FUSIONPATH/data/connectors/container/lucid.web/datasourceName, where datasourceName is the name of the datasource. After you create a datasource, Fusion creates this directory for you. The file should be a JSON formatted file, ending with the .json file extension. When defining the datasource, use the name of the file in the Authentication credentials filename field in the UI (or for the f.credentialsFile property if using the REST API).

All authentication types require the credentials file to include a property called type that defines the type of authentication to use. The other required properties vary depending on the type of authentication chosen.

Basic form-based Authentication

To use basic form-based authentication, use form for the type. The other properties are:

  • ttl - The "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.

  • action - The action to take to log in. That is, the URL for the login form.

  • params - The parameters for the form, likely the username and password, but could be other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl.

Here is an example using form-based authentication:

[ {
        "credential" : {
            "type" : "form",
            "ttl" : 300000,
            "action" : "http://some.server.com/login.action?os_destination=%2Fpages%2Fviewpage.action%3Ftitle%3DAcme%2B5%2BDocumentation%26spaceKey%3DAcme5",
            "params" : {
                "os_username" : "username",
                "os_password" : "password"
            }
        }
  } ]

Complex Form-based Authentication

Some websites do not manage their own authentication, but rather trust a third-party authority to authenticate the user. An example of this is websites that use SAML to log in a user via a central single-signon authority. To configure fusion to log in to a website like this, use smartForm for the type. The other properties are:

  • ttl - The "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.

  • loginUrl - The URL on which the first page that initializes the login chain is located

  • params - A list of parameters to use for the form logins, likely the username and password, but could be other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl. Additionally we expect that once that login has happened, a new form is presented to the user which then posts back to where we came from. No data need to be entered in this form, which is why we include an empty { } in the params list.

Here is an example using form-based authentication:

[ {
        "credential" : {
            "type" : "smartForm",
            "ttl" : 300000,
            "loginUrl" : "http://some.example.com/login",
            "params" : [{
                "os_username" : "username",
                "os_password" : "password"
            }, {

            } ]
        }
  } ]

To figure out what parameters you need to specify, turn off JavaScript in your browser and go through the login work flow. Though you normally see only a single login form on your screen, you might find many more forms you need to submit before you get logged in when JavaScript is not available to perform those form submissions automatically. Each form in that login chain needs to be represented in the list of params. If no user input is required, simply include an empty { }.

Basic and Digest Authentication

Basic and Digest authentication are simple HTTP authentication methods still in use in some places. To use either of these types, in the credentials file, for the type property use "basic" or "digest". The other properties are:

  • host - The host of the site.

  • port - The port, if any.

  • userName - The username to use for authentication.

  • password - The password for the userName.

  • realm - The security realm for the site, if any.

Example basic auth configuration:

[ {
        "credential" : {
            "type" : "basic",
            "ttl" : 300000,
            "userName" : "usr",
            "password" : "pswd",
            "host":"hostname.exampledomain.com”
            "port": 443
        }
  }
]

NTLM Authentication

To use NTLM authentication, in the credentials file, for the type property, use ntlm. The other properties available are:

  • host - The host of the site.

  • port - The port, if any.

  • userName - The username to use for authentication.

  • password - The password for the userName.

  • realm - The security realm for the site, if any.

  • domain - The domain.

  • workstation - The workstation, as needed.

Example NTLM credential configuration:

[ {"credential" :
   { "type" : "ntlm",
     "ttl" : 300000,
     "port" : 80,
     "host" : "someHost",
     "domain" : "someDomain",
     "userName" : "someUser",
     "password" : "XXXXXXXX"
   }
} ]

Crawl a Web site protected by Kerberos

In Fusion 4.1 and above, the Web connector can crawl Web sites protected by Kerberos using SPNEGO. This is a way to access Web sites without requiring a user’s login credentials.

The Fusion Web connector can optionally use Kerberos with SAML/Smart Form authentication.

To crawl Kerberos-protected Web sites, first create the necessary configuration files, then configure Fusion to use them.

Create standard Java configuration files to connect to Kerberos

Fusion uses the JDK standard JAAS Kerberos implementation, which is based on three system properties that reference three separate files.

The files are as follows:

  • On the Kerberos-protected server, a keytab file, named kerberuser.keytab in our examples.

  • On the Fusion system, a configuration file named login.conf.

  • On the Fusion system, an initialization file named krb5.ini.

Create a Kerberos keytab

Create and validate the keytab file for the Kerberos client principal you want to use to authenticate to the website.

If you do not specify the kerberosPrincipalName and kerberosKeytabFilePath or kerberosKeytabBase64 when creating the Fusion datasource, Fusion uses the default login principal and ticket cache. You can see the default values by logging into the Fusion server as the user who runs Fusion and running klist.

If you do not want to use the default account and credentials, specify these configuration properties when creating a keytab as well as in the Web datasource setup. Use the Kerberos user principal name (UPN), not the service principal name (SPN, which is used with the Kerberos security realm). In some cases the UPN can be a service.

In our examples, the Fusion Web crawler authenticates to the Web sites using the user kerbuser@win.lab.lucidworks.com. We create a keytab file kerbuser.keytab for the user principal kerbuser@WIN.LAB.LUCIDWORKS.COM.

Create a Kerberos keytab on Windows

Example:

ktpass -out kerbuser.keytab -princ kerbuser@WIN.LAB.LUCIDWORKS.COM -mapUser kerbuser -mapOp set -pass YOUR_PASSWORD -crypto ALL -pType KRB5_NT_PRINCIPAL
Create a Kerberos keytab on Ubuntu Linux

Prerequisite: Install the krb5-user package: sudo apt-get install krb5-user

Example:

ktutil
addent -password -p HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM -k 1 -e aes128-cts-hmac-sha1-96
- it will ask you for password of kerbuser -
wkt kerbuser.keytab
q
Test the keytab

Once you create a keytab, verify that it works.

Prerequisite: You need a version of curl installed that allows SPNEGO. To test whether your version of curl does this, run curl --version and make sure SPNEGO is in the output.

Run the following curl command (replace the keytab path and site):

export KRB5CCNAME=FILE:/path/to/kerbuser.keytab curl -vvv --negotiate -u : http://your-site.com

Note that the first request is a 401 status code for the negotiate request followed by a second request, which is a status of 200.

Create a login.conf and krb5.ini

On the Fusion server, create login.conf and krb5.ini files as follows.

Create a login.conf on Windows

In this example, the keytab is stored at C:\\kerb\\kerbuser.keytab

KrbLogin {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="C:\\kerb\\kerbuser.keytab"
  useTicketCache=true
  principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
  debug=true;
};
Create a login.conf on Linux

In this example, the keytab is stored at /home/lucidworks/kb.keytab

com.sun.security.jgss.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/home/lucidworks/kerbuser.keytab"
  useTicketCache=true
  principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
  debug=true;
};

The format of the login.conf is described on the Oracle Web site.

Create a krb5.ini

When you install krb5 on Linux, you can find a Kerberos configuration file in /etc/krb5.conf. You can optionally create a custom one instead.

Creating a krb5.conf is the same for Linux and Windows. On Windows the file is krb5.ini.

In this example the domain is WIN.LAB.LUCIDWORKS.COM, the Kerberos kdc host is my.kdc-dns.com, and the Kerberos admin server is my-admin-server-dns.com.

Example:

[libdefaults]
    default_realm = WIN.LAB.LUCIDWORKS.COM
    default_tkt_enctypes = aes128-cts-hmac-sha1-96
    default_tgs_enctypes = aes128-cts-hmac-sha1-96
    permitted_enctypes = aes128-cts-hmac-sha1-96
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    udp_preference_limit = 1

[realms]
WIN.LAB.LUCIDWORKS.COM = {
   kdc = my.kdc-dns.com
   admin_server = my.admin-server-dns.com
}

[domain_realm]
.WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM
WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM

The format of the krb5.ini file is described in the MIT Kerberos documentation. You can change the encryption algorithms by changing the properties default_tkt_enctypes, default_tgs_enctypes, and permitted_enctypes as needed. For example:

default_tkt_enctypes = RC4-HMAC
default_tgs_enctypes = RC4-HMAC
Permitted_enctypes = RC4-HMAC

Configure Fusion to use Kerberos

Once you have the keytab, login.conf, and krb5.ini files, configure Fusion to use Kerberos. You must set a property in a Fusion configuration file in addition to defining the datasource in the Fusion UI.

At the command line on every machine in your Fusion cluster:

  1. In $FUSION_HOME/conf/fusion.cors (fusion.properties in Fusion 4.x), add the following property to the connectors-classic.jvmOptions setting: -Djavax.security.auth.useSubjectCredsOnly=false

  2. Restart the connectors-classic service using ./bin/connectors-classic restart on Linux or bin\connectors-classic.cmd restart on Windows.

In the Fusion UI:

  1. Click Indexing > Datasources.

  2. Click Add+, then Web.

  3. Enter a datasource ID and a start link.

  4. Click Crawl authorization.

  5. At the bottom of the section, check Enable SPNEGO/Kerberos Authentication.

  6. You can either use the default principal name or specify a principal name to use.

    • If you do not specify the principal name, then Fusion uses the default login principal and ticket cache. You can see those default values by logging into the Fusion server as the user who runs Fusion and running klist.

  7. If you specify a principal name, you must provide a keytab, either in Base64 or as a file path.

    • If you specify a keytab file path, the file must be on the machine running the Fusion connector, for each connector’s node in the cluster.

    • The Base64 option lets you supply the keytab in one place, in the UI.

  8. Fill in any remaining options to configure the datasource.

  9. Click Save.

Troubleshoot Kerberos authentication

javax.security.auth.login.LoginException: No key to store

Problem: When trying to crawl a Kerberos-authenticated Web site, you get an error like this:

crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
    at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
    at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
    at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
    at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
    at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
    at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
    at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
    at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
    at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
    ... 8 more
Caused by: javax.security.auth.login.LoginException: No key to store
    at com.sun.security.auth.module.Krb5LoginModule.commit(Krb5LoginModule.java:1119) ~[?:1.8.0_161]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.login(LoginContext.java:588) ~[?:1.8.0_161]
    at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
    ... 8 more

Resolution:

First test your keytab as described in test the keytab above.

If your keytab passes validation, another possibility is that the /tmp/krb* cache file got corrupted or is not compatible after you went through other troubleshooting steps. To rule that out, remove the /tmp/krb* cache file on all hosts, restart your connectors-classic, and try the crawl again. That is, on each host:

rm -f /tmp/krb*
$FUSION_HOME/bin/connectors-classic restart

401 error

Problem: Crawling using the Web connector with Kerberos results in a 401 error, but curl with Kerberos works fine.

Resolution:

Make sure you have this system property set in connectors-classic jvmOptions on all nodes:

-Djavax.security.auth.useSubjectCredsOnly=false

You must restart connectors-classic after making that change.

If that doesn’t work, make sure the user you are authenticating with from Curl matches the user you are trying to authenticate with from the Web connector. To see your Kerberos principal user name, run klist.

Error: “Pre-authentication information was invalid - Identifier doesn’t match expected value”

Problem: When crawling using the Web connector with Kerberos enabled, you get an error like this:

crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
	at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
	at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
	at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
	at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
	at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
	at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
	at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
	at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: javax.security.auth.login.LoginException: Pre-authentication information was invalid (24)
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: sun.security.krb5.KrbException: Pre-authentication information was invalid (24)
	at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:76) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: sun.security.krb5.Asn1Exception: Identifier doesn't match expected value (906)
	at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140) ~[?:1.8.0_161]
	at sun.security.krb5.internal.ASRep.init(ASRep.java:64) ~[?:1.8.0_161]
	at sun.security.krb5.internal.ASRep.<init>(ASRep.java:59) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:60) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more

Resolution:

Your keytab’s principal name doesn’t match the value on the ticket server. Check the principal name for your user.