Web Connector and Datasource Configuration

The Web connector retrieves data from a Web site over HTTP, starting from a specified URL.

Crawl JavaScript Web sites

JavaScript-enabled Web sites require different crawling behavior than plain HTML Web sites.

Enable JavaScript evaluation

To enable JavaScript evaluation, set the f.crawlJS/"Evaluate JavaScript" parameter to "true". When this option is enabled, the connector crawls links rendered by JavaScript evaluation, using a headless browser by default (see below).

As of Fusion 4.0.2, JavaScript evaluation is fastest when using High Performance JavaScript Evaluation mode, which uses the Chromium browser with the Web connector. This mode is enabled with the High Performance Mode parameter (f.useHighPerfJsEval).

How is "High performance JavaScript evaluation" different than the non-high performance mode? In high performance mode, the following are true:
  • JavaScript evaluation is faster.

  • JavaScript evaluated content is more accurate.

  • You can run in "non-headless" mode to watch the browser crawl your content on a desktop, for debugging purposes.

  • You can save screenshots of pages and index them as base64 along with your other web page content.

  • The process uses more RAM.

  • The process uses more CPU.

Dependencies for Non-high Performance JavaScript Evaluation

For the default JavaScript evaluation, you need only a small number of dependencies to support the PhantomJS headless browser.

RHEL / CentOS / Amazon Linux

yum install fontconfig freetype

Ubuntu / Debian

apt-get install build-essential chrpath libssl-dev libxft-dev libfreetype6-dev libfreetype6 libfontconfig1-dev libfontconfig1

Windows

For Windows, there are no extra dependencies needed.

Dependencies for High Performance JavaScript Evaluation

To enable high-performance JavaScript evaluation, you must install some dependencies, then set f.crawlJS/"Evaluate JavaScript" and f.useHighPerfJsEval/"High Performance Mode" to "true".

Enable high-performance mode with Chromium
  1. Install the dependencies, using one of the scripts packaged with Fusion:

    • Windows: fusion/4.1.x/bin/install-high-perf-web-deps.ps1

    • Linux and OSX: fusion/4.1.x/bin/install-high-perf-web-deps.sh

    When the script finishes, a confirmation message is displayed:

    Successfully installed high-performance JS eval mode dependencies for Lucidworks Fusion web connector.
  2. In the Fusion UI, configure your Web data source:

    1. Set Evaluate JavaScript to "true".

    2. Set High Performance Mode to "true".

    An additional f.headlessBrowser parameter can be set to "false" to display the browser windows during processing. It is "true" by default. Non-headless mode is available only in High Performance Mode.

  3. Save the datasource configuration.
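
For reference, here is a sketch of the equivalent settings in a REST API datasource definition, using the parameter names described on this page (the datasource ID, connector name, and surrounding fields are illustrative placeholders):

{
  "id" : "my-web-datasource",
  "connector" : "lucid.web",
  "properties" : {
    "f.crawlJS" : true,
    "f.useHighPerfJsEval" : true,
    "f.headlessBrowser" : true
  }
}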

Note
If Fusion is running on Docker, you must either mount a /dev/shm directory using the argument -v /dev/shm:/dev/shm or use the flag --shm-size=2g to use the host’s shared memory. The default shm size of 64m results in failing crawls, with logs showing error messages like org.openqa.selenium.WebDriverException: Failed to decode response from marionette. See Geckodriver issue 1193 for more details.

Screenshots with High Performance Mode

To enable screenshots, configure these parameters:

  • f.takeScreenshot - Enable screenshots. Default: false.

  • f.screenshotFullscreen - Take full-screen screenshots. Default: false.

These additional viewport parameters exist mainly to support screenshots, but the viewport settings can also affect the overall output:

  • f.viewportWidth - View port width in pixels. Default: 800.

  • f.viewportHeight - View port height in pixels. Default: 600.

  • f.deviceScreenFactor - Device screen factor. Default: 1 (no scaling). See the Android Screen compatibility overview for a description of device displays.

  • f.simulateMobile - (advanced property) Tell the browser to "emulate" a mobile browser. Default: false.

  • f.mobileScreenWidth - (advanced property) Only applicable with f.simulateMobile. Sets the screen width in pixels.

  • f.mobileScreenHeight - (advanced property) Only applicable with f.simulateMobile. Sets the screen height in pixels.
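
Putting the parameters above together, a screenshot-enabled datasource might include properties like these (a sketch; the values are illustrative):

"f.takeScreenshot" : true,
"f.screenshotFullscreen" : false,
"f.viewportWidth" : 1280,
"f.viewportHeight" : 1024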

Authentication with JavaScript Evaluation

To use authentication when JavaScript evaluation is enabled, use the SmartForm (SAML) option because it can log in to a Web site like a typical browser user.

SmartForm login functionality is more powerful when JavaScript evaluation is enabled:

  • You can perform login on forms that are rendered by JavaScript.

  • You can use a variety of HTML selectors to find the elements to enter as login information.

    By contrast, when JavaScript evaluation is disabled, you can only provide inputs using the name attribute of <input> elements.

Configure authentication for JavaScript-enabled crawling
  1. Launch Chrome.

  2. Select File > New Incognito Window.

  3. Navigate to the site that you want to crawl with authentication.

    For example, navigate to http://some-website-with-auth.com.

  4. Identify the URL for the login page.

    For example, from http://some-website-with-auth.com, navigate to the page that displays the login form, then copy the page URL, such as http://some-website-with-auth.com/sso/login.

    Use this URL as the value of the loginUrl parameter (URL in the Fusion UI) explained in Complex form-based authentication.

  5. On the login page, identify the fields used for inputting the username and password.

    You can do this by right-clicking on the form fields and selecting Inspect element to open the developer tools, where the corresponding HTML element is highlighted.

    In most cases, the field is an <input> element with a name attribute, and you can specify the field using this name value. For example:

    <input id="resolving_input" name="login" class="signin-textfield" autocorrect="off" autocapitalize="off" type="text">
  6. Add the username field as a Property to the SmartForm login. That is, add a Property where the Property Name is login (the name value from the example above) and the Property Value is the username you need to log in as.

  7. Add the password field name as the passwordParamName (Password Parameter in the Fusion UI).

  8. On the site login page, right-click "Submit" and select Inspect element.

    • If the button is an <input type="submit"/>, then the SmartForm login picks it up automatically.

    • If the button is another element (such as <button>, <a>, <div>, and so on) then you must add a parameter whose name has the special prefix ::submitButtonXPath::, then add an XPath expression that points to the submit button. For example: ::submitButtonXPath:://button[@name='loginButton']

  9. If there is no name attribute on the <input> elements, then you must specify a parameter to tell the Web connector how to find the input element. You can use any of these special selector formats for the parameter name:

    ;;BY_XPATH;;//input[@id='someId']
    ;;BY_ID;;someid
    ;;BY_NAME;;somename
    ;;BY_CLASS_NAME;;someCssClassName
    ;;BY_CSS_SELECTOR;;.div#selector
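
For example, if the username <input> element has no name attribute but does have an id attribute (here the hypothetical id username-field), you would add a property like this:

Property Name:  ;;BY_ID;;username-field
Property Value: your-username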

Sometimes your Web page asks a random question, such as "What is the name of your first dog?"

In this case, add another special parameter with this format:

::WhenXPath::XPath of element to check against::Either @attributeToCheckAgainst or text to check against the text of the element::Value To Match::Field selector to set the value of only if the conditional check matched

Here is an example of three different parameters for a site that might ask one of three questions at random:

::WhenXPath:://div[@tag='Your question']::text::What is the name of your first dog?::;;BY_ID;;answer
::WhenXPath:://div[@tag='Your question']::text::In what city were you born?::;;BY_ID;;answer
::WhenXPath:://input[@id='Your question']::@value::In what city were you born?::;;BY_ID;;answer

Debug the JavaScript Evaluation Stage using Non-headless Chromium

When testing the Web connector with Chromium, it helps to install Fusion on a workstation with desktop capabilities, such as Windows, Mac, or Linux with a desktop environment. Then configure a Web datasource for your website, enable advanced mode, set Crawl Performance > Fetch Threads to 1, and uncheck JavaScript Evaluation > Headless Browser.

This results in the Web fetcher using a single instance of Chromium in a visible window, where you can watch documents being fetched. This is helpful if you are getting unexpected results from the Chromium evaluation stage.

Limit Crawl Scope

The connector works by going to the seed page (the "startURIs" specified in the configuration form), collecting the content for indexing, and extracting any links to other pages. It then follows those links to collect content on other pages, extracting links to those pages, and so on.

When creating a Web data source, pay attention to the Max crawl depth and Restrict To Tree parameters (c.depth and c.restrictToTree in the REST API). These properties limit the scope of your crawl to prevent an unbounded crawl that could continue for a long time, particularly if you are crawling a site with links to many pages outside the main site. An unbounded crawl can also cause memory errors in your system.
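
In a REST API datasource definition, these limits are ordinary properties. For example, this sketch (with illustrative values) keeps the crawl within the seed site's URL tree and at most two link hops from the seed pages:

"c.depth" : 2,
"c.restrictToTree" : true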

The connector keeps track of URIs it has seen, and many of the properties relate to managing the resulting database of entries. If the connector finds a standard redirect, it tracks that the redirected URI has an alias, and does not re-evaluate the URI on its next runs until the alias expiration has passed. If deduplication is enabled, documents that were found to be duplicates are also added to the alias list and are not re-evaluated until the alias expiration has passed.

Regular expressions can be used to restrict the crawl, either by defining URI patterns that should be followed or by defining URI patterns that should not be followed.
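
For example, to follow only URIs under a /docs/ path while skipping printer-friendly views, you could supply regular expressions like these (the property names c.includeRegexes and c.excludeRegexes are an assumption and may vary by connector version):

"c.includeRegexes" : [ "https://example\\.com/docs/.*" ],
"c.excludeRegexes" : [ ".*print=true.*" ]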

Extract Content from Pages

The connector supports several approaches to extracting and filtering content from pages. When analyzing the HTML of a page, the connector can specifically include or exclude elements based on the HTML tag, the tag ID, or the tag class (such as a div tag, or the #content tag ID).

Specific tags can be selected to become fields of the document if needed. For example, all content from <h1> tags can be pulled into an h1 field, and with field mapping be transformed into document titles.

For other advanced capabilities, you can use jsoup selectors to find elements in the content to include or exclude from the content.
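
jsoup selectors use CSS-style syntax. For example, selectors like these could keep the main content area and drop the navigation (the class and ID names are hypothetical):

div.main-content     (an include selector)
nav, div#sidebar     (an exclude selector)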

While field mapping is generally a function of the index pipeline, you can define some initial mappings to occur during the crawl. The "initial mappings" property for each web datasource is predefined with three mappings: to move fetchedDates to a fetchedDates_dts field, to move lastModified to a lastModified_dt field, and to move length to a length_l field.

Finally, the crawler can deduplicate crawled content. You can define a specific field to use for this deduplication (such as title, or another field), or you can use the full raw content as the default. In the Fusion UI, when you are defining your datasource, toggle Advanced to access the Dedupe settings.

Deduplicate using canonical tags

In content management and online shopping systems, it is common for the same content to be accessed through multiple URLs. Content syndication helps you distribute content to different URLs and domains, consolidate link signals for the duplicate or similar content, and track metrics for a single product or topic. But it creates some challenges when people use search engines to reach your page.

The Fusion Web connector can leverage canonical meta tags in your website’s HTML to deduplicate web pages.

To deduplicate web pages using canonical tags in the Fusion UI:

  1. When defining your Web datasource, toggle Advanced at the top of the page.

  2. Under Dedupe, click Dedupe documents.

  3. Make sure Deduplication via canonical tag is checked.

Sitemap Processing

Crawling sitemaps is supported. Simply add the URL(s) of the sitemap to the f.sitemapURLs property (Sitemap URLs in the UI) and all of the URLs found in a sitemap are added to the list of URLs to crawl. Sitemap indexes (that is, a sitemap that points to other sitemaps) are also supported. The URLs found through each sitemap are added to the list of URLs to crawl.

To configure your datasource to crawl only the sitemap file, add the sitemap URL to both the startLinks property (because that is a required property for a datasource) and the f.sitemapURLs property, so the connector treats it as a sitemap when it starts.
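
For example, a sitemap-only datasource would point both properties at the same URL (a sketch; the URL is a placeholder):

"startLinks" : [ "https://example.com/sitemap.xml" ],
"f.sitemapURLs" : [ "https://example.com/sitemap.xml" ]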

Website Authentication

The Web connector supports Basic, Digest, Form, and NTLM authentication to websites.

The credentials for a crawl are stored in a credentials file in fusion/4.1.x/data/connectors/container/lucid.web/datasourceName, where datasourceName is the name of the datasource. After you create a datasource, Fusion creates this directory for you. The file should be a JSON formatted file, ending with the .json file extension. When defining the datasource, use the name of the file in the Authentication credentials filename field in the UI (or for the f.credentialsFile property if using the REST API).

All authentication types require the credentials file to include a property called type that defines the type of authentication to use. The other required properties vary depending on the type of authentication chosen.
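
For example, for a datasource named web-docs (a hypothetical name), you might store the credentials at the path below and enter web-creds.json in the Authentication credentials filename field:

fusion/4.1.x/data/connectors/container/lucid.web/web-docs/web-creds.json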

Form-based Authentication

To use basic form-based authentication, use form for the type. The other properties are:

  • ttl - The "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.

  • action - The action to take to log in. That is, the URL for the login form.

  • params - The parameters for the form, likely the username and password, but could be other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl.

Here is an example using form-based authentication:

[ {
        "credential" : {
            "type" : "form",
            "ttl" : 300000,
            "action" : "http://some.server.com/login.action?os_destination=%2Fpages%2Fviewpage.action%3Ftitle%3DAcme%2B5%2BDocumentation%26spaceKey%3DAcme5",
            "params" : {
                "os_username" : "username",
                "os_password" : "password"
            }
        }
  } ]

Complex Form-based Authentication

Some websites do not manage their own authentication, but rather trust a third-party authority to authenticate the user. An example of this is websites that use SAML to log in a user via a central single sign-on authority. To configure Fusion to log in to a website like this, use smartForm for the type. The other properties are:

  • ttl - The "time to live" for the session created after authentication. After the specified time, the crawler logs in again to keep the crawl activity from failing due to an expired session. This value is defined in seconds.

  • loginUrl - The URL of the first page that initializes the login chain.

  • params - A list of parameters to use for the form logins, likely the username and password, but possibly other required properties. In the example below, we pass two parameters, os_username and os_password, which are expected by the system we crawl. Additionally, we expect that once the login has happened, a new form is presented to the user, which then posts back to where we came from. No data needs to be entered in this form, which is why we include an empty { } in the params list.

Here is an example using complex form-based authentication:

[ {
        "credential" : {
            "type" : "smartForm",
            "ttl" : 300000,
            "loginUrl" : "http://some.example.com/login",
            "params" : [{
                "os_username" : "username",
                "os_password" : "password"
            }, {

            } ]
        }
  } ]

To figure out which parameters you need to specify, turn off JavaScript in your browser and go through the login workflow. Though you normally see only a single login form on your screen, when JavaScript is not available to perform form submissions automatically, you might find several more forms that must be submitted before you are logged in. Each form in that login chain must be represented in the list of params. If no user input is required for a form, simply include an empty { }.

Basic and Digest Authentication

Basic and Digest authentication are simple HTTP authentication methods still in use in some places. To use either of these types, set the type property in the credentials file to "basic" or "digest". The other properties are:

  • host - The host of the site.

  • port - The port, if any.

  • userName - The username to use for authentication.

  • password - The password for the userName.

  • realm - The security realm for the site, if any.

Example basic auth configuration:

[ {
        "credential" : {
            "type" : "basic",
            "ttl" : 300000,
            "userName" : "usr",
            "password" : "pswd",
            "host":"hostname.exampledomain.com”
            "port": 443
        }
  }
]

NTLM Authentication

To use NTLM authentication, in the credentials file, for the type property, use ntlm. The other properties available are:

  • host - The host of the site.

  • port - The port, if any.

  • userName - The username to use for authentication.

  • password - The password for the userName.

  • realm - The security realm for the site, if any.

  • domain - The domain.

  • workstation - The workstation, as needed.

Example NTLM credential configuration:

[ {"credential" :
   { "type" : "ntlm",
     "ttl" : 300000,
     "port" : 80,
     "host" : "someHost",
     "domain" : "someDomain",
     "userName" : "someUser",
     "password" : "XXXXXXXX"
   }
} ]

Crawl a Web site protected by Kerberos

The Fusion Web connector can crawl Web sites protected by Kerberos using SPNEGO. This is a way to access Web sites without requiring a user’s login credentials.

The Fusion Web connector can optionally use Kerberos with SAML/Smart Form authentication.

To crawl Kerberos-protected Web sites, first create the necessary configuration files, then configure Fusion to use them.

Create standard Java configuration files to connect to Kerberos

Fusion uses the JDK standard JAAS Kerberos implementation, which is based on three system properties that reference three separate files.

The files are as follows:

  • On the Kerberos-protected server, a keytab file, named kerbuser.keytab in our examples.

  • On the Fusion system, a configuration file named login.conf.

  • On the Fusion system, an initialization file named krb5.ini.

Create a Kerberos keytab

Create and validate the keytab file for the Kerberos client principal you want to use to authenticate to the website.

If you do not specify the kerberosPrincipalName and kerberosKeytabFilePath or kerberosKeytabBase64 when creating the Fusion datasource, Fusion uses the default login principal and ticket cache. You can see the default values by logging into the Fusion server as the user who runs Fusion and running klist.

If you do not want to use the default account and credentials, specify these configuration properties when creating a keytab as well as in the Web datasource setup. Use the Kerberos user principal name (UPN), not the service principal name (SPN, which is used with the Kerberos security realm). In some cases the UPN can be a service.

In our examples, the Fusion Web crawler authenticates to the Web sites using the user kerbuser@win.lab.lucidworks.com. We create a keytab file kerbuser.keytab for the user principal kerbuser@WIN.LAB.LUCIDWORKS.COM.

Create a Kerberos keytab on Windows

Example:

ktpass -out kerbuser.keytab -princ kerbuser@WIN.LAB.LUCIDWORKS.COM -mapUser kerbuser -mapOp set -pass YOUR_PASSWORD -crypto ALL -pType KRB5_NT_PRINCIPAL
Create a Kerberos keytab on Ubuntu Linux

Prerequisite: Install the krb5-user package: sudo apt-get install krb5-user

Example:

ktutil
addent -password -p HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM -k 1 -e aes128-cts-hmac-sha1-96
(ktutil prompts for the password of kerbuser)
wkt kerbuser.keytab
q
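
To confirm that the keytab contains the expected principal, you can list its entries with the MIT Kerberos klist tool:

klist -k kerbuser.keytab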
Test the keytab

Once you create a keytab, verify that it works.

Prerequisite: You need a version of curl installed that supports SPNEGO. To test whether your version of curl supports it, run curl --version and make sure SPNEGO appears in the output.

Run the following curl command (replace the keytab path and site):

export KRB5CCNAME=FILE:/path/to/kerbuser.keytab
curl -vvv --negotiate -u : http://your-site.com

In the verbose output, the first request receives a 401 status code in response to the negotiate request, followed by a second request that receives a 200 status.

Create a login.conf and krb5.ini

On the Fusion server, create login.conf and krb5.ini files as follows.

Create a login.conf on Windows

In this example, the keytab is stored at C:\kerb\kerbuser.keytab:

KrbLogin {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="C:\kerb\kerbuser.keytab"
  useTicketCache=true
  principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
  debug=true;
};
Create a login.conf on Linux

In this example, the keytab is stored at /home/lucidworks/kerbuser.keytab:

com.sun.security.jgss.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/home/lucidworks/kerbuser.keytab"
  useTicketCache=true
  principal="kerbuser@WIN.LAB.LUCIDWORKS.COM"
  debug=true;
};

The format of the login.conf is described on the Oracle Web site.

Create a krb5.ini

When you install krb5 on Linux, a Kerberos configuration file is created at /etc/krb5.conf. You can optionally create a custom one instead.

The file format is the same for Linux (krb5.conf) and Windows (krb5.ini).

In this example, the domain is WIN.LAB.LUCIDWORKS.COM, the Kerberos KDC host is my.kdc-dns.com, and the Kerberos admin server is my.admin-server-dns.com.

Example:

[libdefaults]
    default_realm = WIN.LAB.LUCIDWORKS.COM
    default_tkt_enctypes = aes128-cts-hmac-sha1-96
    default_tgs_enctypes = aes128-cts-hmac-sha1-96
    permitted_enctypes = aes128-cts-hmac-sha1-96
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    udp_preference_limit = 1

[realms]
WIN.LAB.LUCIDWORKS.COM = {
   kdc = my.kdc-dns.com
   admin_server = my.admin-server-dns.com
}

[domain_realm]
.WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM
WIN.LAB.LUCIDWORKS.COM = WIN.LAB.LUCIDWORKS.COM

The format of the krb5.ini file is described in the MIT Kerberos documentation. You can change the encryption algorithms by changing the properties default_tkt_enctypes, default_tgs_enctypes, and permitted_enctypes as needed. For example:

default_tkt_enctypes = RC4-HMAC
default_tgs_enctypes = RC4-HMAC
permitted_enctypes = RC4-HMAC

Configure Fusion to use Kerberos

Once you have the keytab, login.conf, and krb5.ini files, configure Fusion to use Kerberos. You must set a property in a Fusion configuration file in addition to defining the datasource in the Fusion UI.

At the command line on any machine in your Fusion cluster:

  1. In $FUSION_HOME/conf/fusion.properties, add the following property to the connectors-classic jvmOptions setting: -Djavax.security.auth.useSubjectCredsOnly=false

  2. Restart the connectors-classic service using ./bin/connectors-classic restart on Linux or bin\connectors-classic.cmd restart on Windows.
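
For example, the resulting line in fusion.properties might look like this (the key follows the serviceName.jvmOptions convention used in that file; the -Xmx flag is a placeholder for whatever your installation already sets):

connectors-classic.jvmOptions = -Xmx1g -Djavax.security.auth.useSubjectCredsOnly=false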

In the Fusion UI:

  1. Click Indexing > Datasources.

  2. Click Add+, then Web.

  3. Enter a datasource ID and a start link.

  4. Click Crawl authorization.

  5. At the bottom of the section, check Enable SPNEGO/Kerberos Authentication.

  6. You can either use the default principal name or specify a principal name to use.

    • If you do not specify the principal name, then Fusion uses the default login principal and ticket cache. You can see those default values by logging into the Fusion server as the user who runs Fusion and running klist.

  7. If you specify a principal name, you must provide a keytab, either in Base64 or as a file path.

    • If you specify a keytab file path, the file must be present on every node in the cluster that runs the Fusion connector.

    • The Base64 option lets you supply the keytab in one place, in the UI.

  8. Fill in any remaining options to configure the datasource.

  9. Click Save.

Troubleshoot Kerberos authentication

javax.security.auth.login.LoginException: No key to store

Problem: When trying to crawl a Kerberos-authenticated Web site, you get an error like this:

crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
    at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
    at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
    at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
    at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
    at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
    at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
    at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
    at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=HTTP/kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
    at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
    ... 8 more
Caused by: javax.security.auth.login.LoginException: No key to store
    at com.sun.security.auth.module.Krb5LoginModule.commit(Krb5LoginModule.java:1119) ~[?:1.8.0_161]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
    at javax.security.auth.login.LoginContext.login(LoginContext.java:588) ~[?:1.8.0_161]
    at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
    at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
    ... 8 more

Resolution:

First, test your keytab as described in Test the keytab above.

If your keytab passes validation, another possibility is that the /tmp/krb* cache file is corrupted or is no longer compatible after earlier troubleshooting steps. To rule that out, remove the /tmp/krb* cache files on all hosts, restart connectors-classic, and try the crawl again. That is, on each host:

rm -f /tmp/krb*
$FUSION_HOME/bin/connectors-classic restart
401 error

Problem: Crawling using the Web connector with Kerberos results in a 401 error, but curl with Kerberos works fine.

Resolution:

Make sure you have this system property set in connectors-classic jvmOptions on all nodes:

-Djavax.security.auth.useSubjectCredsOnly=false

You must restart connectors-classic after making that change.

If that doesn’t work, make sure the user you authenticate as from curl matches the user you are trying to authenticate as from the Web connector. To see your Kerberos principal user name, run klist.

Error: “Pre-authentication information was invalid - Identifier doesn’t match expected value”

Problem: When crawling using the Web connector with Kerberos enabled, you get an error like this:

crawler.common.ComponentInitException: Could not initialize Spnego/Kerberos.
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:282) ~[lucid.web-4.0.2.jar:?]
	at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
	at crawler.Crawler.initComponents(Crawler.java:125) ~[lucid.anda-4.0.2.jar:?]
	at crawler.Crawler.init(Crawler.java:108) ~[lucid.anda-4.0.2.jar:?]
	at crawler.common.ComponentFactory.initComponent(ComponentFactory.java:37) ~[lucid.anda-4.0.2.jar:?]
	at crawler.common.config.CrawlConfig.buildCrawler(CrawlConfig.java:212) ~[lucid.anda-4.0.2.jar:?]
	at com.lucidworks.connectors.anda.AndaFetcher.start(AndaFetcher.java:139) [lucid.anda-4.0.2.jar:?]
	at com.lucidworks.connectors.ConnectorJob.start(ConnectorJob.java:200) [lucid.shared-4.0.2.jar:?]
	at com.lucidworks.connectors.Connector$RunnableJob.run(Connector.java:319) [lucid.shared-4.0.2.jar:?]
Caused by: java.lang.Exception: Could not perform spnego/kerberos login. java.security.krb5.conf = /etc/krb5.conf,, Keytab file = /home/ndipiazza/Downloads/kerbuser.keytab, login config = {principal=kerbuser@WIN.LAB.LUCIDWORKS.COM, debug=false, storeKey=true, keyTab=/home/ndipiazza/Downloads/kerbuser.keytab, useKeyTab=true, useTicketCache=true, refreshKrb5Config=true}
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:83) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: javax.security.auth.login.LoginException: Pre-authentication information was invalid (24)
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: sun.security.krb5.KrbException: Pre-authentication information was invalid (24)
	at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:76) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more
Caused by: sun.security.krb5.Asn1Exception: Identifier doesn't match expected value (906)
	at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140) ~[?:1.8.0_161]
	at sun.security.krb5.internal.ASRep.init(ASRep.java:64) ~[?:1.8.0_161]
	at sun.security.krb5.internal.ASRep.<init>(ASRep.java:59) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:60) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) ~[?:1.8.0_161]
	at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776) ~[?:1.8.0_161]
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_161]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_161]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) ~[?:1.8.0_161]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) ~[?:1.8.0_161]
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587) ~[?:1.8.0_161]
	at com.lucidworks.connectors.spnego.SpnegoAuth.<init>(SpnegoAuth.java:76) ~[?:?]
	at crawler.fetch.impl.http.WebFetcher.init(WebFetcher.java:279) ~[?:?]
	... 8 more

Resolution:

Your keytab’s principal name doesn’t match the value on the ticket server. Check the principal name for your user.

Add custom headers to HTTP requests

You can optionally add custom headers to all HTTP GET requests from the Web connector. For example, you might want to add a Connection: keep-alive header to prevent the connector from timing out while crawling your Web site.

To add a custom header, use the configuration parameter f.addedHeaders. To send multiple headers, use the following format:

Header1: Value1
Header2: Value2
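
In a REST API datasource definition, the same headers are supplied in the f.addedHeaders property, assuming the multi-line value is written as an escaped string with \n between headers:

"f.addedHeaders" : "Header1: Value1\nHeader2: Value2"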

To add a custom header in the Fusion UI on any node:

  1. Click Indexing > Datasources.

  2. Click Add+, then Web.

  3. Enter a datasource ID and a start link.

  4. Click Link discovery.

  5. Fill in the Headers to add to HTTP requests field.

    • Add each header in the format HeaderName: HeaderValue.

    • To add multiple headers to all HTTP requests, put each header on a new line.

  6. Fill in any remaining options to configure the datasource.

  7. Click Save.

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
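
For example, to set a hypothetical delimiter property to a tab character:

UI field value:   \t
API (JSON) value: "\\t"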