Web Connector and Datasource Configuration

The Web connector retrieves data from a Web site over HTTP, starting from a specified URL.

Crawling JavaScript Web sites

JavaScript-enabled Web sites require different crawling behavior than plain HTML Web sites.

Enabling JavaScript evaluation

To enable JavaScript evaluation, set the f.crawlJS/"Evaluate JavaScript" parameter to "true". When this option is enabled, the connector crawls links rendered by JavaScript evaluation, using the Firefox browser by default (see below). When the Firefox option is disabled, a Java-embedded browser called JBrowserDriver is used instead.

Note
This feature requires Oracle JDK with JavaFX, or OpenJDK with OpenJFX.
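For example, enabling JavaScript evaluation on a datasource through the REST API might look like the following sketch (the datasource ID and start link are illustrative, not taken from a real installation):

{
    "id" : "example-web",
    "connector" : "lucid.web",
    "type" : "web",
    "properties" : {
        "startLinks" : [ "http://example.com/" ],
        "f.crawlJS" : true
    }
}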

JavaScript evaluation with Firefox

JavaScript evaluation is fastest when using the headless Firefox browser bundled with the Web connector as of Fusion 3.1.3. This option is enabled by default with the f.useFirefox parameter. (You must explicitly set f.crawlJS to "true".)

Note
If f.useFirefox is set to "false", then a Java-embedded browser called JBrowserDriver is used. We recommend you use Firefox when possible because it provides more reliable JavaScript evaluation behavior.

These additional parameters configure the feature; a configuration sketch follows this list:

  • f.firefoxBinaryPath configures the path to the Firefox binary. Normally you do not need to set this; by default, the connector uses the binary that is bundled with it.

  • f.firefoxHeadlessBrowser can be set to "false" to display the Firefox browser windows during processing.
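For instance, here is a sketch of the relevant properties for a debugging run that shows the Firefox window (values illustrative):

"f.crawlJS" : true,
"f.useFirefox" : true,
"f.firefoxHeadlessBrowser" : false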

Note
On headless Linux environments, you must install GTK3 to use the Firefox web fetcher, due to Mozilla bug 1372998. Desktop versions of the operating systems below already include the required packages. But if you are using a server installation, which is typical for running Fusion, you must install the prerequisites listed below.
How to install Headless Firefox prerequisites on RHEL Server, Fedora Server, CentOS Server, and Amazon EC2 Linux:
  1. sudo yum install gtk3

  2. sudo yum install libXt

How to install Headless Firefox prerequisites on Ubuntu Server:
  1. sudo apt-get install libgtk-3-0

  2. sudo apt-get install libdbus-glib-1-2

  3. sudo apt-get install xvfb

Authentication with Firefox

In order to use authentication when JavaScript evaluation is enabled, you will typically use the SmartForm (SAML) option, because it can log in to a Web site like a typical browser user.

SmartForm login functionality is more powerful when JavaScript evaluation is enabled:

  • You can perform login on forms that may be JavaScript rendered.

  • You can use a variety of HTML selectors to find the elements to enter as login information.

    By contrast, when JavaScript evaluation is disabled, you can only provide inputs using the name attribute of <input> elements.

To configure authentication for JavaScript-enabled crawling:
  1. Launch Firefox.

  2. Select File > New Private Window.

  3. Navigate to the site that you want to crawl with authentication.

    For example, navigate to http://some-website-with-auth.com.

  4. Identify the URL for the login page.

    For example, from http://some-website-with-auth.com, navigate to the page that displays the login form, then copy the page URL, such as http://some-website-with-auth.com/sso/login.

    You’ll use this URL as the value of the loginUrl parameter (URL in the Fusion UI) explained in Complex form-based authentication.

  5. On the login page, identify the fields used for inputting the username and password.

    You can do this by right-clicking on the form fields and selecting Inspect element to open the developer tools, where the corresponding HTML element is highlighted.

    In most cases, it will be an <input> element that has a name attribute, and you can specify the field using that name value. For example:

    <input id="resolving_input" name="login" class="signin-textfield" autocorrect="off" autocapitalize="off" type="text">
  6. Add the username field as a Property to the SmartForm login: add a Property whose "Property Name" is the field name (login in the example above) and whose "Property Value" is the username you need to log in as.

  7. Add the password field name as the passwordParamName (Password Parameter in the Fusion UI).

  8. On the site login page, right-click the "Submit" button and select Inspect element.

    • If the button is an <input type="submit"/>, then the SmartForm login will pick it up automatically and we do not have to do anything extra.

    • If the button is another element (such as <button>, <a>, <div>, and so on) then you must add a parameter whose name has the special prefix ::submitButtonXPath::, then add an XPath expression that points to the submit button. For example: ::submitButtonXPath:://button[@name='loginButton']

  9. If there is no name attribute on the <input> elements, then you must specify a special parameter to tell the Web connector how to find the input element. You can use any of these special selector formats for the parameter name (a configuration sketch follows these formats):

    ;;BY_XPATH;;//input[@id='someId']
    ;;BY_ID;;someid
    ;;BY_NAME;;somename
    ;;BY_CLASS_NAME;;someCssClassName
    ;;BY_CSS_SELECTOR;;.div#selector
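For example, here is a hypothetical params entry in the smartForm credentials file (see Complex Form-based Authentication below) that locates the username field by XPath and the password field by name; the selectors and values are illustrative:

"params" : [ {
    ";;BY_XPATH;;//input[@id='resolving_input']" : "myUsername",
    ";;BY_NAME;;password" : "myPassword"
} ]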

Sometimes your Web page will ask you a random question, such as "What is the name of your first dog?"

In this case we add another special parameter:

::WhenXPath::XPath of element to check against::Either @attributeToCheckAgainst or text to check against the text of the element::Value To Match::Field selector to set the value of only if the conditional check matched

Here is an example of three different parameters where our site might ask one of three different questions randomly:

::WhenXPath:://div[@tag='Your question']::text::What is the name of your first dog?::;;BY_ID;;answer
::WhenXPath:://div[@tag='Your question']::text::In what city were you born?::;;BY_ID;;answer
::WhenXPath:://input[@id='Your question']::@value::In what city were you born?::;;BY_ID;;answer

Debugging the JavaScript Evaluation Stage Using Non-headless Firefox

When testing the Web connector with Firefox, it helps to install Fusion on a workstation with desktop capabilities, such as Windows, Mac, or Linux with a desktop environment. Then configure a Web datasource for your website, enable advanced mode, set Crawl Performance > Fetch Threads to 1, and uncheck JavaScript Evaluation > Run Firefox in Headless Mode.

This results in the Web fetcher using a single instance of Firefox in a visible window, where you can watch it fetch documents. This is helpful if you are getting unexpected results from the Firefox evaluation stage.

Limiting Crawl Scope

The connector works by going to the seed page (the "startLinks" specified in the configuration form), collecting the content for indexing, and extracting any links to other pages. It then follows those links to collect content from other pages, extracting links from those pages as well, and so on.

When creating a Web data source, pay attention to the "Max depth" and "Restrict To Tree" parameters (also known as "c.depth" and "c.restrictToTree" in the REST API). These properties will help limit the scope of your crawl to prevent an "unbounded" crawl that could continue for a long time, particularly if you are crawling a site with links to many pages outside the main site. An unbounded crawl may also cause memory errors in your system.
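For example, here is a sketch of the relevant properties that keeps the crawl within the seed site's URL tree and at most three links away from the seed page (values illustrative):

"startLinks" : [ "http://example.com/" ],
"c.depth" : 3,
"c.restrictToTree" : true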

The connector keeps track of URIs it has seen, and many of the properties relate to managing the resulting database of entries. If the connector finds a standard redirect, it will track that the redirected URI has an alias, and will not re-evaluate the URI on its next runs until the alias expiration has passed. Documents that were found to be duplicates, if de-duplication is enabled, are also added to the alias list and are not re-evaluated until the alias expiration has passed.

Regular expressions can be used to restrict the crawl, either by defining URI patterns that should be followed or URI patterns that should not be followed.
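As a sketch, such restrictions might look like the following in the datasource properties; the property names c.includeRegexes and c.excludeRegexes are illustrative assumptions, so check the configuration reference for your Fusion version:

"c.includeRegexes" : [ "http://example\\.com/docs/.*" ],
"c.excludeRegexes" : [ ".*\\.(jpg|png|pdf)$" ]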

Extracting Content from Pages

The connector supports several approaches to extracting and filtering content from pages. When analyzing the HTML of a page, the connector can specifically include or exclude elements based on the HTML tag, the tag ID, or the tag class (such as a 'div' tag, or the '#content' tag ID).

Specific tags can be selected to become fields of the document if needed. For example, all content from <h1> tags can be pulled into an 'h1' field, and then transformed into document titles with field mapping.

For even more advanced capabilities, you can use jsoup selectors to find elements in the page to include or exclude from the content.
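For example, here is a sketch using hypothetical selector properties (the names f.includeSelectors and f.excludeSelectors are illustrative assumptions) to keep the main content area and drop navigation elements:

"f.includeSelectors" : [ "div#content" ],
"f.excludeSelectors" : [ "div.navigation", "footer" ]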

While field mapping is generally a function of the index pipeline, you can define some initial mapping to occur during the crawl. The 'initial mappings' property for each Web datasource is pre-defined with three mappings: moving 'fetchedDates' to a 'fetchedDates_dts' field, 'lastModified' to a 'lastModified_dt' field, and 'length' to a 'length_l' field.
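Expressed as a sketch, those default mappings look roughly like this (the exact JSON shape of the 'initial mappings' property may differ by Fusion version):

"initial_mapping" : {
    "mappings" : [
        { "source" : "fetchedDates", "target" : "fetchedDates_dts", "operation" : "move" },
        { "source" : "lastModified", "target" : "lastModified_dt", "operation" : "move" },
        { "source" : "length", "target" : "length_l", "operation" : "move" }
    ]
}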

Finally, the crawler is able to do de-duplication of crawled content. You can define a specific field to use for this de-duplication (such as title, or another field), or you can use the full raw content as the default.
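As a sketch, de-duplication on a title field might be configured like this; the property names f.dedupe and f.dedupeField are illustrative assumptions, not confirmed parameter names:

"f.dedupe" : true,
"f.dedupeField" : "title"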

Sitemap Processing

As of Fusion 1.1.2, crawling sitemaps is supported. Simply add the URL(s) of the sitemap to the f.sitemapURLs property ("Sitemap URLs" in the UI) and all of the URLs found in a sitemap will be added to the list of URLs to crawl. If your site has a sitemap index, i.e., a sitemap that points to other sitemaps, that is also supported and the URLs found through each sitemap will be added to the list of URLs to crawl.

If you want to configure your datasource to crawl only the sitemap file, you must add the sitemap URL to both the startLinks property (because that is a required property for a datasource) and the f.sitemapURLs property, so that the connector treats it as a sitemap when the crawl starts.
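For example (URLs illustrative):

"startLinks" : [ "http://example.com/sitemap.xml" ],
"f.sitemapURLs" : [ "http://example.com/sitemap.xml" ]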

Website Authentication

The Web connector supports Basic, Digest, Form and NTLM authentication to websites.

The credentials for a crawl are stored in a credentials file that should be placed in `fusion/3.1.x/data/connectors/container/lucid.web/datasourceName`, where "datasourceName" corresponds to the name given to the datasource. After creating a datasource, this directory is created for you automatically. The file must be JSON-formatted and end with the '.json' file extension. When defining the datasource, pass the name of the file in the 'Authentication file' property in the UI (or the 'f.credentialsFile' property if using the REST API).

All types of authentication require the credentials file to include a property called "type" which defines the type of authentication to use. After that, the required properties will vary depending on the type of authentication chosen.

Form-based Authentication

To use basic form-based authentication, use "form" for the type. The other properties are:

  • ttl - The "time to live" for the session that will be created after authentication. This will have the crawler log in again after the specified time so the crawl activity doesn’t fail due to an expired session. This value is defined in seconds.

  • action - The action to take to log in, i.e., the URL for the login form.

  • params - The parameters for the form, most likely the username and password, but also any other required properties. In the example below, we are passing two parameters, 'os_username' and 'os_password', which are the properties expected by the system we would like to crawl.

Here is an example using form-based authentication:

[ {
        "credential" : {
            "type" : "form",
            "ttl" : 300000,
            "action" : "http://some.server.com/login.action?os_destination=%2Fpages%2Fviewpage.action%3Ftitle%3DAcme%2B5%2BDocumentation%26spaceKey%3DAcme5",
            "params" : {
                "os_username" : "username",
                "os_password" : "password"
            }
        }
  } ]

Complex Form-based Authentication

Some websites do not manage their own authentication, but rather trust a third-party authority to authenticate the user. An example would be a website that uses SAML to log in users via a central single sign-on authority. In order to configure Fusion to log in to a website like this, use "smartForm" for the type. The other properties are:

  • ttl - the "time to live" for the session that will be created after authentication. This will have the crawler log in again after the specified time so the crawl activity doesn’t fail due to an expired session. This value is defined in seconds.

  • loginUrl - the URL of the first page that initializes the login chain.

  • params - a list of parameters to use for the form logins, most likely the username and password, but possibly other required properties. In the example below, we are passing two parameters, 'os_username' and 'os_password', which are the properties expected by the system we would like to crawl. Additionally, we expect that once the login has happened, a new form will be presented to the user, which then posts back to where we came from. No data needs to be entered in this form, which is why we include an empty { } in the params list.

Here is an example using complex form-based authentication:

[ {
        "credential" : {
            "type" : "smartForm",
            "ttl" : 300000,
            "loginUrl" : "http://some.example.com/login",
            "params" : [{
                "os_username" : "username",
                "os_password" : "password"
            }, {

            } ]
        }
  } ]

In order to figure out which params you need to specify, turn off JavaScript in your browser and walk through the login chain. Though you normally see only a single login form on your screen, you may be surprised to find several more forms that you must submit before you get logged in when JavaScript is not available to perform those form submissions automatically. Each form in that chain needs to be represented in the list of params. If no user input is required for a form, simply include an empty { }.

Basic and Digest Authentication

Basic and Digest authentication are simple HTTP authentication methods still in use in some places. To use either of these types, use "basic" or "digest" in the credentials file for the 'type' property. The other properties are:

  • host - the host of the site.

  • port - the port, if any.

  • userName - the username to use for authentication.

  • password - the password for the userName.

  • realm - the realm for the site, if any.

Example basic auth configuration:

[ {
        "credential" : {
            "type" : "basic",
            "ttl" : 300000,
            "userName" : "usr",
            "password" : "pswd",
            "host":"hostname.exampledomain.com”
            "port": 443
        }
  }
]

NTLM Authentication

To use NTLM authentication, use "ntlm" in the credentials file for the 'type' property. The other properties available are:

  • host - the host of the site.

  • port - the port, if any.

  • userName - the username to use for authentication.

  • password - the password for the userName.

  • realm - the realm for the site, if any.

  • domain - the domain.

  • workstation - the workstation, as needed.

Example NTLM credential configuration:

[ {"credential" :
   { "type" : "ntlm",
     "ttl" : 300000,
     "port" : 80,
     "host" : "someHost",
     "domain" : "someDomain",
     "userName" : "someUser",
     "password" : "XXXXXXXX"
   }
} ]

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.