Configure website authentication
The Web connector retrieves data from a Web site using HTTP and starting from a specified URL.
Managed Fusion releases 5.5 - 5.6.0 use the Web V1 connector. Managed Fusion releases 5.6.1 and later use the Web V2 connector.
The Web connector supports Basic, Digest, Form, and NTLM authentication to websites.
The credentials for a crawl are stored in a credentials file in https://EXAMPLE_COMPANY.b.lucidworks.cloud:6764/data/connectors/container/lucid.web/datasourceName, where datasourceName is the name of the datasource.
After you create a datasource, Managed Fusion creates this directory for you. The file should be a JSON-formatted file ending with the .json file extension.
When defining the datasource, use the name of the file in the Authentication credentials filename field in the UI (or for the f.credentialsFile property if using the REST API).
All authentication types require the credentials file to include a property called type that defines the type of authentication to use. The other required properties vary depending on the type of authentication chosen.
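As a quick sanity check before uploading a credentials file, the rule above can be sketched in Python. This is only an illustration: the function name check_credentials is hypothetical, the list of type values is taken from the sections of this page, and the assumption that the top level is a JSON array comes from the examples here rather than from a published schema.

```python
import json

# Authentication types named in this document.
KNOWN_TYPES = {"form", "smartForm", "basic", "digest", "ntlm"}


def check_credentials(text):
    """Return a list of problems found in a credentials file's JSON text."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    # Assumption: every example on this page wraps the entries in an array.
    if not isinstance(entries, list):
        return ["top level must be a JSON array"]
    problems = []
    for index, entry in enumerate(entries):
        credential = entry.get("credential") if isinstance(entry, dict) else None
        if not isinstance(credential, dict):
            problems.append(f"entry {index}: missing 'credential' object")
        elif credential.get("type") not in KNOWN_TYPES:
            problems.append(f"entry {index}: unknown or missing 'type'")
    return problems


if __name__ == "__main__":
    sample = '[{"credential": {"type": "basic", "userName": "usr", "password": "pswd"}}]'
    print(check_credentials(sample) or "looks OK")
```

The check only covers the type property; the properties each type needs are described in the sections that follow.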
Basic form-based authentication
To use basic form-based authentication, use the value form for the type property. The other properties are:
- ttl is the "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.
- action is the action to take to log in. That is, the URL for the login form.
- params are the parameters for the form, such as the username and password. The params must also include any other properties that the login form requires. In the following example, two parameters are specified: os_username and os_password, which are expected by the system being crawled.
Here is an example using form-based authentication:
[ {
  "credential" : {
    "type" : "form",
    "ttl" : 300000,
    "action" : "http://some.server.com/login.action?os_destination=%2Fpages%2Fviewpage.action%3Ftitle%3DAcme%2B5%2BDocumentation%26spaceKey%3DAcme5",
    "params" : {
      "os_username" : "username",
      "os_password" : "password"
    }
  }
} ]
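The action URL in the example carries a percent-encoded destination page. A minimal Python sketch of how such a file could be generated follows; the helper name form_credential is hypothetical, and os_destination, os_username, and os_password are specific to the example site, so your login form may expect different names.

```python
import json
from urllib.parse import quote


def form_credential(login_url, destination, username, password, ttl=300000):
    """Build a one-entry credentials list for form-based authentication."""
    # Percent-encode the destination so it survives as a single query value.
    action = f"{login_url}?os_destination={quote(destination, safe='')}"
    return [{
        "credential": {
            "type": "form",
            "ttl": ttl,
            "action": action,
            "params": {"os_username": username, "os_password": password},
        }
    }]


if __name__ == "__main__":
    creds = form_credential(
        "http://some.server.com/login.action",
        "/pages/viewpage.action?title=Acme+5+Documentation&spaceKey=Acme5",
        "username",
        "password",
    )
    print(json.dumps(creds, indent=2))
```

Run as a script, this prints JSON that you could save with a .json extension in the datasource directory.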
Complex form-based authentication
Some websites do not manage their own authentication, but rather trust a third-party authority to authenticate the user. An example of this is websites that use SAML to sign in a user via a central single sign-on authority.
To configure Managed Fusion to sign in to a website like this, use the value smartForm for the type property. The other properties are:
- ttl is the "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.
- loginUrl is the URL of the first page that initializes the login chain.
- params are the parameters for the form sign-in credentials, such as the username and password. The params must also include any other properties that the forms require. In the following example, two parameters are specified: os_username and os_password, which are expected by the system being crawled.
After the sign-in occurs, a new form displays to the user, which then posts back to the originating site. No data needs to be entered in this form, which is why the example includes an empty { } in the params list.
Here is an example using smartForm authentication:
[ {
  "credential" : {
    "type" : "smartForm",
    "ttl" : 300000,
    "loginUrl" : "http://some.example.com/login",
    "params" : [ {
      "os_username" : "username",
      "os_password" : "password"
    }, {
    } ]
  }
} ]
To figure out what parameters you need to specify, turn off JavaScript in your browser and go through the sign-in workflow.
Though you normally see only a single login form on your screen, you might find that many more forms must be submitted before you are signed in when JavaScript is not available to perform those form submissions automatically.
Each form in that sign-in chain needs to be represented in the list of params. If no user input is required, use an empty { }.
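The rule that every form in the chain gets an entry, empty or not, can be sketched as follows. The helper name smart_form_credential is hypothetical; the property names come from the example above, and passing None for a form that needs no input is just a convention of this sketch.

```python
import json


def smart_form_credential(login_url, forms, ttl=300000):
    """Build a smartForm credential; `forms` holds one item per form in the chain."""
    # A form that needs no user input is still listed, as an empty object.
    params = [dict(form) if form else {} for form in forms]
    return [{
        "credential": {
            "type": "smartForm",
            "ttl": ttl,
            "loginUrl": login_url,
            "params": params,
        }
    }]


if __name__ == "__main__":
    chain = [{"os_username": "username", "os_password": "password"}, None]
    print(json.dumps(smart_form_credential("http://some.example.com/login", chain), indent=2))
```

Keeping the forms in the order the browser submits them matters here, because the list position is the only thing that ties each set of parameters to its form.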
Basic and Digest authentication
Basic and Digest authentication are simple HTTP authentication methods still in use in some places. To use either of these types, set the type property in the credentials file to basic or digest. The other properties are:
- host is the host of the site.
- port is the port, if any.
- userName is the user identifier to use for authentication.
- password is the password associated with the userName.
- realm is the security realm for the site. For more information, see Security realms.
Example basic auth configuration:
[ {
  "credential" : {
    "type" : "basic",
    "ttl" : 300000,
    "userName" : "usr",
    "password" : "pswd",
    "host" : "hostname.exampledomain.com",
    "port" : 443
  }
} ]
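For orientation only: with HTTP Basic authentication, a client sends the user name and password joined by a colon and Base64-encoded in an Authorization header. The sketch below shows that encoding so you can compare it against what you see in a browser or proxy trace. The function name is hypothetical, and this is not a description of how the connector is implemented internally.

```python
import base64


def basic_auth_header(user_name, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode(f"{user_name}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    print(basic_auth_header("usr", "pswd"))
```

Because Base64 is an encoding and not encryption, Basic credentials are only as private as the connection that carries them.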
NTLM authentication
To use NTLM authentication, set the type property in the credentials file to ntlm. The other properties available are:
- host is the host of the site.
- port is the port, if any.
- userName is the user identifier to use for authentication.
- password is the password associated with the userName.
- realm is the security realm for the site. For more information, see Security realms.
- domain is the domain of the site.
- workstation is the identifier for the user's computer, if needed.
Example NTLM credential configuration:
[ {
  "credential" : {
    "type" : "ntlm",
    "ttl" : 300000,
    "port" : 80,
    "host" : "someHost",
    "domain" : "someDomain",
    "userName" : "someUser",
    "password" : "XXXXXXXX"
  }
} ]
Crawl a website protected by Kerberos
The Web connector can crawl Web sites protected by Kerberos using SPNEGO without requiring a user’s sign-in credentials.
Kerberos support requires Managed Fusion 5.9.5.
The Managed Fusion Web connector can also use Kerberos with SAML/Smart Form authentication.
To crawl Kerberos-protected Web sites, first create the necessary configuration files, then configure Managed Fusion to use them.
Create standard Java configuration files to connect to Kerberos
Managed Fusion uses the JDK standard JAAS Kerberos implementation, which is based on three system properties that reference three separate files.
The files are as follows:
- On the Kerberos-protected server, a keytab file, named kerbuser.keytab in our examples.
- On the Managed Fusion system, a configuration file named login.conf.
- On the Managed Fusion system, an initialization file named krb5.ini.
Create a Kerberos keytab
Create and validate the keytab file for the Kerberos client principal you want to use to authenticate to the website.
If you do not specify the kerberosPrincipalName and kerberosKeytabFilePath or kerberosKeytabBase64 when creating the Managed Fusion datasource, Managed Fusion uses the default sign-in principal and ticket cache.
You can see the default values by logging in to the Managed Fusion server as the user who runs Managed Fusion and running klist.
If you do not want to use the default account and credentials, specify these configuration properties when creating a keytab as well as in the Web datasource setup. Use the Kerberos user principal name (UPN), not the service principal name (SPN, which is used with the Kerberos security realm). In some cases the UPN can be a service.
In our examples, the Managed Fusion Web crawler authenticates to the Web sites using the user kerbuser@win.lab.lucidworks.com.
We create a keytab file kerbuser.keytab for the user principal kerbuser@WIN.LAB.LUCIDWORKS.COM.
Create a Kerberos keytab on Windows
Example:
ktpass -out kerbuser.keytab -princ kerbuser@WIN.LAB.LUCIDWORKS.COM -mapUser kerbuser -mapOp set -pass YOUR_PASSWORD -crypto AES256-SHA1 -pType KRB5_NT_PRINCIPAL
The following weak encryption types are not supported by Fusion:
Create a Kerberos keytab on Ubuntu Linux
Prerequisite: Install the krb5-user package: sudo apt-get install krb5-user.
Example:
ktutil
addent -password -p HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM -k 1 -e aes128-cts-hmac-sha1-96
(ktutil prompts you for the password of kerbuser)
wkt kerbuser.keytab
q
Test the keytab
Once you create a keytab, verify that it works.
Prerequisite: You need a version of curl installed that supports SPNEGO. To check whether your version of curl does, run curl --version and make sure SPNEGO is in the output.
Run the following commands (replace the keytab path and site):
export KRB5CCNAME=FILE:/path/to/kerbuser.keytab
curl -vvv --negotiate -u : http://your-site.com
Note that the first response is a 401 status code for the negotiate request, followed by a second response with a status of 200.
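The curl prerequisite can also be checked from a script. The following Python sketch looks for SPNEGO on the Features line of the curl --version output; narrowing the search to that line is an assumption of this sketch (the instruction above only asks that SPNEGO appear somewhere in the output), and both function names are hypothetical.

```python
import subprocess


def supports_spnego(version_output):
    """Return True if `curl --version` output lists SPNEGO among its features."""
    for line in version_output.splitlines():
        if line.startswith("Features:"):
            return "SPNEGO" in line.split()
    return False


def curl_supports_spnego():
    """Run the local curl binary and inspect its feature list."""
    result = subprocess.run(
        ["curl", "--version"], capture_output=True, text=True, check=True
    )
    return supports_spnego(result.stdout)
```

Calling curl_supports_spnego() requires curl on the PATH; supports_spnego() can be used on saved output from another machine.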
Create a login.conf and krb5.ini
On the Managed Fusion server, create login.conf and krb5.ini files as follows.
Create a login.conf on Windows
In this example, the keytab is stored at C:\kerb\kerbuser.keytab. Backslashes are doubled inside the quoted keyTab value.
KrbLogin {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="C:\\kerb\\kerbuser.keytab"
useTicketCache=true
principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
debug=true;
};
Create a login.conf on Linux
In this example, the keytab is stored at /home/lucidworks/kerbuser.keytab.
com.sun.security.jgss.initiate {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/home/lucidworks/kerbuser.keytab"
useTicketCache=true
principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
debug=true;
};
The format of the login.conf is described on the Oracle Web site.
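If you maintain several nodes, generating login.conf from a few values can reduce copy-and-paste mistakes. The Python sketch below renders an entry with the same options as the two examples above. The function name render_login_conf is hypothetical, and the entry name is left as a parameter because the Windows and Linux examples use different names (KrbLogin and com.sun.security.jgss.initiate).

```python
def render_login_conf(entry_name, keytab_path, principal, debug=True):
    """Render a JAAS login entry with the options used in the examples above."""
    # Backslashes are doubled so Windows paths survive inside the quoted value.
    escaped = keytab_path.replace("\\", "\\\\")
    options = [
        "useKeyTab=true",
        "storeKey=true",
        f'keyTab="{escaped}"',
        "useTicketCache=true",
        f'principal="{principal}"',
        f"debug={'true' if debug else 'false'};",
    ]
    body = "\n".join(f"  {option}" for option in options)
    return (
        f"{entry_name} {{\n"
        "  com.sun.security.auth.module.Krb5LoginModule required\n"
        f"{body}\n"
        "};\n"
    )


if __name__ == "__main__":
    print(render_login_conf(
        "com.sun.security.jgss.initiate",
        "/home/lucidworks/kerbuser.keytab",
        "kerbuser@WIN.LAB.LUCIDWORKS.COM",
    ))
```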
Create a krb5.ini
When you install krb5 on Linux, you can find a Kerberos configuration file in /etc/krb5.conf. You can optionally create a custom one instead.
The contents of the file are the same on Linux and Windows. On Linux the file is named krb5.conf; on Windows it is named krb5.ini.
In this example the domain is WIN.LAB.LUCIDWORKS.COM, the Kerberos kdc host is my.kdc-dns.com, and the Kerberos admin server is my.admin-server-dns.com.
Example:
[libdefaults]
default_realm = WIN.LAB.LUCIDWORKS.COM
default_tkt_enctypes = aes128-cts-hmac-sha1-96
default_tgs_enctypes = aes128-cts-hmac-sha1-96
permitted_enctypes = aes128-cts-hmac-sha1-96
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
udp_preference_limit = 1
[realms]
WIN.LAB.LUCIDWORKS.COM = {
kdc = my.kdc-dns.com
admin_server = my.admin-server-dns.com
}
[domain_realm]
.WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM
WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM
The format of the krb5.ini file is described in the MIT Kerberos documentation.
You can change the encryption algorithms by changing the properties default_tkt_enctypes, default_tgs_enctypes, and permitted_enctypes as needed. For example:
default_tkt_enctypes = RC4-HMAC
default_tgs_enctypes = RC4-HMAC
permitted_enctypes = RC4-HMAC
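When editing the three enctype settings by hand, it is easy to change one and miss another, or to mistype a key. The following Python sketch reads only the [libdefaults] section and reports settings that are missing or differ. It assumes you want all three to match, as they do in the examples here, and it compares key names exactly as written; the function names are hypothetical.

```python
ENCTYPE_KEYS = ("default_tkt_enctypes", "default_tgs_enctypes", "permitted_enctypes")


def read_libdefaults(text):
    """Collect key = value pairs from the [libdefaults] section only."""
    settings, in_section = {}, False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_section = line == "[libdefaults]"
        elif in_section and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def enctype_problems(text):
    """Report enctype settings that are missing or differ from one another."""
    settings = read_libdefaults(text)
    missing = [key for key in ENCTYPE_KEYS if key not in settings]
    if missing:
        return [f"missing: {key}" for key in missing]
    values = {settings[key] for key in ENCTYPE_KEYS}
    if len(values) == 1:
        return []
    return ["enctype settings differ: " + ", ".join(sorted(values))]
```

A hand-written parser is used because the [realms] section contains brace-delimited blocks that a generic INI parser would reject.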
Configure Managed Fusion to use Kerberos
Once you have the keytab, login.conf, and krb5.ini files, configure Managed Fusion to use Kerberos. You must set a property in a Managed Fusion configuration file in addition to defining the datasource in the Managed Fusion UI.
At the command line on every machine in your Managed Fusion cluster:
- In $FUSION_HOME/conf/fusion.cors, add the following property to the connectors-classic.jvmOptions setting: -Djavax.security.auth.useSubjectCredsOnly=false
- Restart the connectors-classic service using ./bin/connectors-classic restart on Linux or bin\connectors-classic.cmd restart on Windows.
In the Managed Fusion UI:
- Click Indexing > Datasources > Add+ > Web.
- Enter a datasource ID and a start link.
- Click Crawl authorization.
- At the bottom of the section, check Enable SPNEGO/Kerberos Authentication.
- You can either use the default principal name or specify a principal name to use.
  - If you do not specify the principal name, Managed Fusion uses the default login principal and ticket cache. You can see those default values by logging in to the Managed Fusion server as the user who runs Managed Fusion and running klist.
  - If you specify a principal name, you must provide a keytab, either in Base64 or as a file path.
    - If you specify a keytab file path, the file must be present on every node in the cluster that runs the Managed Fusion connector.
    - The Base64 option lets you supply the keytab in one place in the UI.
- Fill in any remaining options to configure the datasource.
- Click Save.
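For the Base64 option, the keytab file's bytes have to be turned into text first. A minimal Python sketch follows; the function name is hypothetical, and it assumes the UI field accepts standard, single-line Base64, which this page does not state explicitly.

```python
import base64
from pathlib import Path


def keytab_to_base64(keytab_path):
    """Read a keytab file and return its contents as one line of Base64 text."""
    return base64.b64encode(Path(keytab_path).read_bytes()).decode("ascii")


if __name__ == "__main__":
    import sys

    # Usage: python keytab_b64.py /path/to/kerbuser.keytab
    print(keytab_to_base64(sys.argv[1]))
```

Treat the resulting string like the keytab itself: anyone who has it can authenticate as the principal.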
Troubleshoot Kerberos authentication
javax.security.auth.login.LoginException: No key to store
Problem: When trying to crawl a Kerberos-authenticated Web site, you get an error like this:
crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: javax.security.auth.login.LoginException: No key to store
at com.sun.security.auth.module.Krb5LoginModule.commit(Krb5LoginModule.java:1119) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:588) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Resolution:
First test your keytab as described in Test the keytab.
If your keytab passes validation, another possibility is that the /tmp/krb* cache file got corrupted or is no longer compatible after you went through other troubleshooting steps.
To rule that out, remove the /tmp/krb* cache file on all hosts, restart connectors-classic, and try the crawl again. That is, on each host:
rm -f /tmp/krb*
$FUSION_HOME/bin/connectors-classic restart
401 error
Problem: Crawling using the Web connector with Kerberos results in a 401 error, but curl with Kerberos works fine.
Resolution:
Make sure you have this system property set in connectors-classic jvmOptions on all nodes:
-Djavax.security.auth.useSubjectCredsOnly=false
You must restart connectors-classic after making that change.
If that doesn’t work, make sure the user you are authenticating with from curl matches the user you are trying to authenticate with from the Web connector.
To see your Kerberos principal user name, run klist.
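Comparing the principal by eye is error-prone, especially with upper-case realms. The sketch below extracts the principal from klist output and compares it to the one configured for the datasource. It assumes the output contains a line of the form "Default principal: name@REALM", as MIT Kerberos klist prints; other klist implementations may format this differently. The function names are hypothetical.

```python
def default_principal(klist_output):
    """Extract the principal from a 'Default principal:' line, or None."""
    for line in klist_output.splitlines():
        label, _, value = line.partition(":")
        if label.strip().lower() == "default principal":
            return value.strip()
    return None


def same_principal(klist_output, configured):
    """Compare the klist principal with the one configured for the datasource."""
    found = default_principal(klist_output)
    # The comparison is case-sensitive, including the realm part.
    return found is not None and found == configured.strip()
```

Run klist as the same user on both sides of the comparison: the account you used for the curl test and the account that runs Managed Fusion.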
Error: “Pre-authentication information was invalid - Identifier doesn’t match expected value”
Problem: When crawling using the Web connector with Kerberos enabled, you get an error like this:
crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: javax.security.auth.login.LoginException: Pre-authentication information was invalid (24)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: sun.security.krb5.KrbException: Pre-authentication information was invalid (24)
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:76) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Caused by: sun.security.krb5.Asn1Exception: Identifier does not match expected value (906)
at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140) ~[?:1.8.0_161]
at sun.security.krb5.internal.ASRep.init(ASRep.java:64) ~[?:1.8.0_161]
at sun.security.krb5.internal.ASRep.<init>(ASRep.java:59) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:60) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
... 8 more
Resolution:
Your keytab’s principal name doesn’t match the value on the ticket server. Check the principal name for your user.