Website Connector and Datasource Configuration

The Web connector retrieves data from a website over HTTP, starting from a specified URL.

Limiting Crawl Scope

The connector works by going to the seed page (the "startURIs" specified in the configuration form), collecting the content for indexing, and extracting any links to other pages. It then follows those links to collect the content of other pages, extracting their links in turn, and so on.

When creating a Web data source, pay attention to the "Max depth" and "Restrict To Tree" parameters ("c.depth" and "c.restrictToTree" in the REST API). These properties limit the scope of your crawl and prevent an "unbounded" crawl that could run for a very long time, particularly if you are crawling a site that links to many pages outside the main site. An unbounded crawl may also cause memory errors in your system.
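
For example, the scope-related portion of a datasource definition sent through the REST API might look like the following minimal sketch (the URL is illustrative, and all other required properties are omitted):

"properties" : {
    "startLinks" : [ "http://example.com/" ],
    "c.depth" : 2,
    "c.restrictToTree" : true
}

With this configuration, the crawler follows links at most two hops from the seed page and ignores any link that leads outside the seed URL's tree.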

The connector keeps track of URIs it has seen, and many of the properties relate to managing the resulting database of entries. If the connector finds a standard redirect, it records the redirected URI as an alias and does not re-evaluate the URI on subsequent runs until the alias expiration has passed. If de-duplication is enabled, documents found to be duplicates are also added to the alias list and are not re-evaluated until the alias expiration has passed.

Regular expressions can be used to restrict the crawl, either by defining URI patterns that should be followed or by defining URI patterns that should not be followed.
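
As a sketch, both kinds of patterns might be combined in one datasource (the property names 'includeRegexes' and 'excludeRegexes' are illustrative; check the configuration form or the REST API schema for the exact names in your release):

"properties" : {
    "includeRegexes" : [ "https?://example\\.com/docs/.*" ],
    "excludeRegexes" : [ ".*\\.(pdf|jpg|png)$" ]
}

Here only URIs under example.com/docs would be followed, and links to PDF and image files would be skipped even within that tree. Note the doubled backslashes, which JSON requires (see the Tip at the end of this page).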

Extracting Content from Pages

The connector supports several approaches to extracting and filtering content from pages. When analyzing the HTML of a page, the connector can specifically include or exclude elements based on the HTML tag, the tag ID, or the tag class (such as a 'div' tag, or the '#content' tag ID).

Specific tags can be selected to become fields of the document if needed. For example, all content from <h1> tags can be pulled into an 'h1' field and then, with field mapping, transformed into document titles.

For more advanced extraction, you can use jsoup selectors to choose elements to include in or exclude from the content.
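
For example, the jsoup selector "div#content p" matches every <p> element inside the element with the ID 'content'. As a sketch, selectors might be supplied like this (the property names here are illustrative, not the connector's actual schema):

"properties" : {
    "f.includeSelectors" : [ "div#content" ],
    "f.excludeSelectors" : [ "div.sidebar", "ul.nav" ]
}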

While field mapping is generally a function of the index pipeline, you can define some initial mapping to occur during the crawl. The 'initial mappings' property for each web datasource is pre-defined with three mappings: 'fetchedDates' is moved to a 'fetchedDates_dts' field, 'lastModified' to a 'lastModified_dt' field, and 'length' to a 'length_l' field.
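
In JSON form, those pre-defined mappings might look like the following sketch (the exact shape of the mapping structure can vary between releases):

"initial_mapping" : {
    "mappings" : [
        { "source" : "fetchedDates", "target" : "fetchedDates_dts", "operation" : "move" },
        { "source" : "lastModified", "target" : "lastModified_dt", "operation" : "move" },
        { "source" : "length", "target" : "length_l", "operation" : "move" }
    ]
}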

Finally, the crawler can de-duplicate crawled content. You can define a specific field to use for de-duplication (such as title, or another field), or use the full raw content, which is the default.
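
As a sketch, de-duplication against the title field might be configured like this ('f.dedupe' and 'f.dedupeField' are illustrative property names, not necessarily the exact ones in your release):

"properties" : {
    "f.dedupe" : true,
    "f.dedupeField" : "title"
}

Leaving the field unset would fall back to comparing the full raw content.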

Sitemap Processing

As of Fusion 1.1.2, crawling sitemaps is supported. Add the URL(s) of the sitemap to the f.sitemapURLs property ("Sitemap URLs" in the UI) and all of the URLs found in the sitemap will be added to the list of URLs to crawl. Sitemap indexes, i.e., sitemaps that point to other sitemaps, are also supported; the URLs found through each referenced sitemap will be added to the list of URLs to crawl.

If you want to configure your datasource to crawl only the sitemap file, you must add the sitemap URL to both the startLinks property (because that is a required property for a datasource) and the f.sitemapURLs property, so the connector treats it as a sitemap when the crawl starts.
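
For example, a sitemap-only crawl points both properties at the same URL (a minimal sketch; all other required properties are omitted):

"properties" : {
    "startLinks" : [ "http://example.com/sitemap.xml" ],
    "f.sitemapURLs" : [ "http://example.com/sitemap.xml" ]
}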

Website Authentication

The Web connector supports Basic, Digest, Form and NTLM authentication to websites.

The credentials for a crawl are stored in a credentials file that should be placed in `$FUSION/data/connectors/container/lucid.anda/datasourceName`, where "datasourceName" corresponds to the name given to the datasource. This directory is created automatically when the datasource is created. The file must be JSON-formatted and end with the '.json' file extension. When defining the datasource, pass the name of the file in the 'Authentication file' property in the UI (or the 'f.credentialsFile' property if using the REST API).
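
For example, for a datasource named 'web-docs' with a credentials file saved as 'auth.json', the file would live at $FUSION/data/connectors/container/lucid.anda/web-docs/auth.json, and the datasource would reference it like this (the datasource and file names are illustrative):

"properties" : {
    "f.credentialsFile" : "auth.json"
}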

All types of authentication require the credentials file to include a property called "type" that defines the type of authentication to use. The remaining required properties vary depending on the type of authentication chosen.

Form-based Authentication

To use form-based authentication, use "form" for the type. The other properties are:

  • ttl - the "time to live" for the session created after authentication. The crawler will log in again after the specified time so crawl activity does not fail due to an expired session. This value is defined in seconds.

  • action - the action to take to log in, i.e., the URL of the login form.

  • params - the parameters for the form, typically the username and password, plus any other required parameters. In the example below, we pass two parameters, 'os_username' and 'os_password', which are the properties expected by the system we want to crawl.

Here is an example using form-based authentication:

[ {
        "credential" : {
            "type" : "form",
            "ttl" : 300000,
            "action" : "http://some.server.com/login.action?os_destination=%2Fpages%2Fviewpage.action%3Ftitle%3DAcme%2B5%2BDocumentation%26spaceKey%3DAcme5",
            "params" : {
                "os_username" : "username",
                "os_password" : "password"
            }
        }
  } ]

Basic and Digest Authentication

Basic and Digest authentication are simple HTTP authentication methods still in use in some places. To use either of these types, use "basic" or "digest" for the 'type' property in the credentials file. The other properties are:

  • host - the host of the site.

  • port - the port, if any.

  • userName - the username to use for authentication.

  • password - the password for the userName.

  • realm - the realm for the site, if any.

Example basic auth configuration:

[ {
        "credential" : {
            "type" : "basic",
            "ttl" : 300000,
            "userName" : "usr",
            "password" : "pswd",
            "host" : "hostname.exampledomain.com",
            "port" : 443
        }
  } ]

NTLM Authentication

To use NTLM authentication, use "ntlm" in the credentials file for the 'type' property. The other properties available are:

  • host - the host of the site.

  • port - the port, if any.

  • userName - the username to use for authentication.

  • password - the password for the userName.

  • realm - the realm for the site, if any.

  • domain - the domain.

  • workstation - the workstation, as needed.

Example NTLM credential configuration:

[ {"credential" :
   { "type" : "ntlm",
     "ttl" : 300000,
     "port" : 80,
     "host" : "someHost",
     "domain" : "someDomain",
     "userName" : "someUser",
     "password" : "XXXXXXXX"
   }
} ]

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
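
For example, to set a hypothetical 'delimiter' property to a tab character, you would type \t into the UI form, while a JSON body sent to the REST API needs the backslash itself escaped:

"properties" : {
    "delimiter" : "\\t"
}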