Configure Web Site Authentication
The Web V1 connector retrieves data from a Web site using HTTP and starting from a specified URL.
The Web connector supports Basic, Digest, Form, and NTLM authentication to websites.
The credentials for a crawl are stored in a credentials file in https://FUSION_HOST:FUSION_PORT/data/connectors/container/lucid.web/datasourceName, where datasourceName is the name of the datasource.
After you create a datasource, Fusion creates this directory for you. The file must be a JSON-formatted file with the .json file extension.
When defining the datasource, enter the name of the file in the Authentication credentials filename field in the UI (or in the f.credentialsFile property if using the REST API).
All authentication types require the credentials file to include a property called type that defines the type of authentication to use. The other required properties vary depending on the type of authentication chosen.
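Because every entry needs a type property, a quick check before starting a crawl can catch a malformed file early. The following is a minimal sketch, not part of Fusion: the function name check_credentials is hypothetical, the file layout (a JSON array of objects that each hold a "credential" object) is assumed from the examples in this document, and the set of known types is taken from the authentication types this page describes.

```python
import json

# Authentication types described on this page; treating this set as
# exhaustive is an assumption of this sketch.
KNOWN_TYPES = {"form", "smartForm", "basic", "digest", "ntlm"}


def check_credentials(text):
    """Return a list of problems found in a credentials file's JSON text."""
    problems = []
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: %s" % exc]
    if not isinstance(entries, list):
        return ["top level must be a JSON array"]
    for i, entry in enumerate(entries):
        credential = entry.get("credential") if isinstance(entry, dict) else None
        if not isinstance(credential, dict):
            problems.append("entry %d: missing 'credential' object" % i)
            continue
        auth_type = credential.get("type")
        if auth_type is None:
            problems.append("entry %d: missing required 'type'" % i)
        elif auth_type not in KNOWN_TYPES:
            problems.append("entry %d: unknown type %r" % (i, auth_type))
    return problems
```

An empty list means the file parsed and every entry declares one of the types listed above. The sketch does not validate the type-specific properties.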
Basic form-based Authentication
To use basic form-based authentication, use form for the type. The other properties are:
-
ttl. The "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.
-
action. The action to take to log in. That is, the URL for the login form.
-
params. The parameters for the form, likely the username and password, but could be other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl.
Here is an example using form-based authentication:
[ {
"credential" : {
"type" : "form",
"ttl" : 300000,
"action" : "http://some.server.com/login.action?os_destination=%2Fpages%2Fviewpage.action%3Ftitle%3DAcme%2B5%2BDocumentation%26spaceKey%3DAcme5",
"params" : {
"os_username" : "username",
"os_password" : "password"
}
}
} ]
Complex Form-based Authentication
Some websites do not manage their own authentication, but rather trust a third-party authority to authenticate the user. An example of this is websites that use SAML to log in a user via a central single-signon authority.
To configure Fusion to log in to a website like this, use smartForm for the type. The other properties are:
-
ttl. The "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.
-
loginUrl. The URL of the first page that initiates the login chain.
-
params. A list of parameters to use for the form logins, likely the username and password, but could be other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl. Additionally, we expect that once that login has happened, a new form is presented to the user, which then posts back to where we came from. No data needs to be entered in this form, which is why we include an empty { } in the params list.
Here is an example using complex form-based authentication:
[ {
"credential" : {
"type" : "smartForm",
"ttl" : 300000,
"loginUrl" : "http://some.example.com/login",
"params" : [{
"os_username" : "username",
"os_password" : "password"
}, {
} ]
}
} ]
To figure out what parameters you need to specify, turn off JavaScript in your browser and go through the login workflow.
Though you normally see only a single login form on your screen, you might find that many more forms must be submitted before you are logged in, because JavaScript is no longer available to perform those form submissions automatically.
Each form in that login chain needs to be represented in the list of params. If no user input is required, simply include an empty { }.
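The mapping from a login chain to the params list can be sketched as follows. This is an illustrative helper, not a Fusion API: the function name smart_form_credential is hypothetical, and the property names and sample values are copied from the smartForm example in this document.

```python
import json


def smart_form_credential(login_url, forms, ttl):
    """Build a smartForm credentials structure.

    `forms` holds one dict per form in the login chain, in submission order.
    Use an empty dict for a form that needs no user input.
    """
    return [{
        "credential": {
            "type": "smartForm",
            "ttl": ttl,
            "loginUrl": login_url,
            # One params entry per form in the chain.
            "params": [dict(form) for form in forms],
        }
    }]


# Two forms in the chain: a login form, then a form that posts back
# without any user input.
creds = smart_form_credential(
    "http://some.example.com/login",
    [{"os_username": "username", "os_password": "password"}, {}],
    ttl=300000,
)
print(json.dumps(creds, indent=2))
```

Writing the output of json.dumps to a .json file in the datasource directory yields a file with the same shape as the example above.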
Basic and Digest Authentication
Basic and Digest authentication are simple HTTP authentication methods still in use in some places. To use either of these types, set the type property in the credentials file to "basic" or "digest". The other properties are:
-
host. The host of the site.
-
port. The port, if any.
-
userName. The username to use for authentication.
-
password. The password for the userName.
-
realm. The security realm for the site, if any.
Example basic auth configuration:
[ {
"credential" : {
"type" : "basic",
"ttl" : 300000,
"userName" : "usr",
"password" : "pswd",
"host" : "hostname.exampledomain.com",
"port" : 443
}
}
]
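For context on what the basic type implies on the wire: under the HTTP Basic scheme, the client sends the user name and password, joined by a colon and Base64-encoded, in an Authorization header. The sketch below illustrates that encoding with the sample values from the example above. It is not Fusion code, and the function name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(user_name, password):
    """Return the Authorization header value for the HTTP Basic scheme."""
    token = base64.b64encode(("%s:%s" % (user_name, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


header = basic_auth_header("usr", "pswd")
print(header)
```

Base64 is an encoding, not encryption, which is one reason Basic authentication is normally paired with HTTPS (port 443 in the example above).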
NTLM Authentication
To use NTLM authentication, set the type property in the credentials file to ntlm. The other properties available are:
-
host. The host of the site.
-
port. The port, if any.
-
userName. The username to use for authentication.
-
password. The password for the userName.
-
realm. The security realm for the site, if any.
-
domain. The domain.
-
workstation. The workstation, as needed.
Example NTLM credential configuration:
[ {"credential" :
{ "type" : "ntlm",
"ttl" : 300000,
"port" : 80,
"host" : "someHost",
"domain" : "someDomain",
"userName" : "someUser",
"password" : "XXXXXXXX"
}
} ]
Crawl a Web site protected by Kerberos
In Fusion 4.1 and later, the Web connector can crawl Web sites protected by Kerberos using SPNEGO. This is a way to access Web sites without requiring a user’s login credentials.
Kerberos support requires Fusion 5.9.5.
The Fusion Web connector can optionally use Kerberos with SAML/Smart Form authentication.
To crawl Kerberos-protected Web sites, first create the necessary configuration files, then configure Fusion to use them.
Create standard Java configuration files to connect to Kerberos
Fusion uses the JDK standard JAAS Kerberos implementation, which is based on three system properties that reference three separate files.
The files are as follows:
-
On the Kerberos-protected server, a keytab file, named kerbuser.keytab in our examples.
-
On the Fusion system, a configuration file named login.conf.
-
On the Fusion system, an initialization file named krb5.ini.
Create a Kerberos keytab
Create and validate the keytab file for the Kerberos client principal you want to use to authenticate to the website.
If you do not specify the kerberosPrincipalName and kerberosKeytabFilePath or kerberosKeytabBase64 when creating the Fusion datasource, Fusion uses the default login principal and ticket cache.
You can see the default values by logging into the Fusion server as the user who runs Fusion and running klist.
If you do not want to use the default account and credentials, specify these configuration properties when creating a keytab as well as in the Web datasource setup. Use the Kerberos user principal name (UPN), not the service principal name (SPN, which is used with the Kerberos security realm). In some cases the UPN can be a service.
In our examples, the Fusion Web crawler authenticates to the Web sites using the user kerbuser@win.lab.lucidworks.com.
We create a keytab file kerbuser.keytab for the user principal kerbuser@WIN.LAB.LUCIDWORKS.COM.
Create a Kerberos keytab on Windows
Example:
ktpass -out kerbuser.keytab -princ kerbuser@WIN.LAB.LUCIDWORKS.COM -mapUser kerbuser -mapOp set -pass YOUR_PASSWORD -crypto AES256-SHA1 -pType KRB5_NT_PRINCIPAL
The following weak encryption types are not supported by Fusion:
Create a Kerberos keytab on Ubuntu Linux
Prerequisite: Install the krb5-user package: sudo apt-get install krb5-user
Example:
ktutil
addent -password -p HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM -k 1 -e aes128-cts-hmac-sha1-96
- it will ask you for password of kerbuser -
wkt kerbuser.keytab
q
Test the keytab
Once you create a keytab, verify that it works.
Prerequisite: You need a version of curl installed that allows SPNEGO. To test whether your version of curl does this, run curl --version and make sure SPNEGO is in the output.
Run the following curl command (replace the keytab path and site):
export KRB5CCNAME=FILE:/path/to/kerbuser.keytab
curl -vvv --negotiate -u : http://your-site.com
In the verbose output, the first request returns a 401 status code for the negotiate challenge, and the second request returns a status of 200.
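The curl prerequisite check can be automated. The sketch below assumes that curl --version prints a line beginning with "Features:" followed by space-separated feature names, which is how current curl builds report SPNEGO support. Both function names are hypothetical.

```python
import subprocess


def supports_spnego(version_output):
    """Return True if the `curl --version` text lists SPNEGO among its features."""
    for line in version_output.splitlines():
        if line.startswith("Features:"):
            # Everything after the label is a space-separated feature list.
            return "SPNEGO" in line.split(":", 1)[1].split()
    return False


def installed_curl_supports_spnego():
    """Run `curl --version` and inspect its output; requires curl on the PATH."""
    result = subprocess.run(
        ["curl", "--version"], capture_output=True, text=True, check=True
    )
    return supports_spnego(result.stdout)
```

Call installed_curl_supports_spnego() on the machine where you plan to run the keytab test. If it returns False, install a curl build with GSS-API/SPNEGO support first.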
Create a login.conf and krb5.ini
On the Fusion server, create login.conf and krb5.ini files as follows.
Create a login.conf on Windows
In this example, the keytab is stored at C:\\kerb\\kerbuser.keytab
KrbLogin {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="C:\\kerb\\kerbuser.keytab"
useTicketCache=true
principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
debug=true;
};
Create a login.conf on Linux
In this example, the keytab is stored at /home/lucidworks/kerbuser.keytab
com.sun.security.jgss.initiate {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/home/lucidworks/kerbuser.keytab"
useTicketCache=true
principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
debug=true;
};
The format of the login.conf is described on the Oracle Web site.
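If you maintain several environments, generating login.conf from a template avoids copy-and-paste mistakes in the keytab path or principal. The following is a minimal sketch that reproduces the options used in the two examples above. The function name render_login_conf is hypothetical, and the entry name (KrbLogin or com.sun.security.jgss.initiate) is passed in because the examples above use different names on Windows and Linux.

```python
def render_login_conf(entry_name, keytab_path, principal, debug=True):
    """Render a JAAS login configuration entry for Krb5LoginModule."""
    # Backslashes in Windows paths are doubled, as in the Windows example.
    escaped = keytab_path.replace("\\", "\\\\")
    lines = [
        "%s {" % entry_name,
        "  com.sun.security.auth.module.Krb5LoginModule required",
        "  useKeyTab=true",
        "  storeKey=true",
        '  keyTab="%s"' % escaped,
        "  useTicketCache=true",
        '  principal="%s"' % principal,
        "  debug=%s;" % ("true" if debug else "false"),
        "};",
    ]
    return "\n".join(lines) + "\n"


print(render_login_conf(
    "com.sun.security.jgss.initiate",
    "/home/lucidworks/kerbuser.keytab",
    "kerbuser@WIN.LAB.LUCIDWORKS.COM",
))
```

The sketch only formats text. It does not verify that the keytab exists or that the principal matches the one stored in the keytab.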
Create a krb5.ini
When you install krb5 on Linux, you can find a Kerberos configuration file in /etc/krb5.conf. You can optionally create a custom one instead.
Creating a krb5.conf is the same for Linux and Windows. On Windows the file is named krb5.ini.
In this example the domain is WIN.LAB.LUCIDWORKS.COM, the Kerberos KDC host is my.kdc-dns.com, and the Kerberos admin server is my.admin-server-dns.com.
Example:
[libdefaults]
default_realm = WIN.LAB.LUCIDWORKS.COM
default_tkt_enctypes = aes128-cts-hmac-sha1-96
default_tgs_enctypes = aes128-cts-hmac-sha1-96
permitted_enctypes = aes128-cts-hmac-sha1-96
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
udp_preference_limit = 1
[realms]
WIN.LAB.LUCIDWORKS.COM = {
kdc = my.kdc-dns.com
admin_server = my.admin-server-dns.com
}
[domain_realm]
.WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM
WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM
The format of the krb5.ini file is described in the MIT Kerberos documentation.
You can change the encryption algorithms by changing the properties default_tkt_enctypes, default_tgs_enctypes, and permitted_enctypes as needed. For example:
default_tkt_enctypes = RC4-HMAC
default_tgs_enctypes = RC4-HMAC
permitted_enctypes = RC4-HMAC
Configure Fusion to use Kerberos
Once you have the keytab, login.conf, and krb5.ini files, configure Fusion to use Kerberos. You must set a property in a Fusion configuration file in addition to defining the datasource in the Fusion UI.
At the command line on every machine in your Fusion cluster:
-
In $FUSION_HOME/conf/fusion.cors (fusion.properties in Fusion 4.x), add the following property to the connectors-classic.jvmOptions setting: -Djavax.security.auth.useSubjectCredsOnly=false
-
Restart the connectors-classic service using ./bin/connectors-classic restart on Linux or bin\connectors-classic.cmd restart on Windows.
In the Fusion UI:
-
Click Indexing > Datasources.
-
Click Add+, then Web.
-
Enter a datasource ID and a start link.
-
Click Crawl authorization.
-
At the bottom of the section, check Enable SPNEGO/Kerberos Authentication.
-
You can either use the default principal name or specify a principal name to use.
-
If you do not specify the principal name, then Fusion uses the default login principal and ticket cache. You can see those default values by logging into the Fusion server as the user who runs Fusion and running klist.
-
If you specify a principal name, you must provide a keytab, either in Base64 or as a file path.
-
If you specify a keytab file path, the file must be on the machine running the Fusion connector, for each connector’s node in the cluster.
-
The Base64 option lets you supply the keytab in one place, in the UI.
-
Fill in any remaining options to configure the datasource.
-
Click Save.
Troubleshoot Kerberos authentication
javax.security.auth.login.LoginException: No key to store
Problem: When trying to crawl a Kerberos-authenticated Web site, you get an error like this:
crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: javax.security.auth.login.LoginException: No key to store
at com.sun.security.auth.module.Krb5LoginModule.commit(Krb5LoginModule.java:1119) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:588) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Resolution:
First, test your keytab as described in Test the keytab above.
If your keytab passes validation, another possibility is that the /tmp/krb* cache file got corrupted or is not compatible after you went through other troubleshooting steps.
To rule that out, remove the /tmp/krb* cache file on all hosts, restart the connectors-classic service, and try the crawl again. That is, on each host:
rm -f /tmp/krb*
$FUSION_HOME/bin/connectors-classic restart
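Before deleting anything, it can help to see which files the /tmp/krb* pattern matches on a host. The following sketch lists the matches and only deletes them when asked. Both function names are hypothetical, and the default directory is taken from the commands above.

```python
import glob
import os


def find_krb_cache_files(tmp_dir="/tmp"):
    """Return the sorted paths in tmp_dir whose names start with 'krb'."""
    return sorted(glob.glob(os.path.join(tmp_dir, "krb*")))


def remove_krb_cache_files(tmp_dir="/tmp", dry_run=True):
    """Delete the matching regular files; with dry_run=True only report them."""
    paths = [p for p in find_krb_cache_files(tmp_dir) if os.path.isfile(p)]
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Run remove_krb_cache_files() once with the default dry_run=True to review the list, then again with dry_run=False, and restart connectors-classic afterwards as shown above.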
401 error
Problem: Crawling using the Web connector with Kerberos results in a 401 error, but curl with Kerberos works fine.
Resolution:
Make sure you have this system property set in connectors-classic jvmOptions on all nodes:
-Djavax.security.auth.useSubjectCredsOnly=false
You must restart connectors-classic after making that change.
If that doesn’t work, make sure the user you are authenticating with from curl matches the user you are trying to authenticate with from the Web connector.
To see your Kerberos principal user name, run klist.
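Comparing the klist principal with the one configured in the datasource can be scripted. The sketch below assumes the MIT Kerberos klist output format, which includes a line of the form "Default principal: user@REALM". Other klist implementations may print a different label. Both function names are hypothetical.

```python
def default_principal(klist_output):
    """Extract the principal from a 'Default principal: ...' line, if present."""
    for line in klist_output.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip().lower() == "default principal":
            return value.strip()
    return None


def same_principal(klist_output, configured):
    """Return True if klist reports exactly the configured principal."""
    found = default_principal(klist_output)
    # The comparison is exact, so differences in realm case count as a mismatch.
    return found is not None and found == configured
```

Pass the text printed by klist as the first argument and the principal name from the Web datasource as the second. A False result indicates that curl and the connector may be authenticating as different users.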
Error: “Pre-authentication information was invalid - Identifier doesn’t match expected value”
Problem: When crawling using the Web connector with Kerberos enabled, you get an error like this:
crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: javax.security.auth.login.LoginException: Pre-authentication information was invalid (24)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: sun.security.krb5.KrbException: Pre-authentication information was invalid (24)
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:76) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: sun.security.krb5.Asn1Exception: Identifier does not match expected value (906)
at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140) ~[?:1.8.0_161]
at sun.security.krb5.internal.ASRep.init(ASRep.java:64) ~[?:1.8.0_161]
at sun.security.krb5.internal.ASRep.<init>(ASRep.java:59) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:60) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Resolution:
Your keytab’s principal name doesn’t match the value on the ticket server. Check the principal name for your user.