Crawl JavaScript Web Sites

The Web connector retrieves data from a Web site using HTTP and starting from a specified URL.

JavaScript-enabled Web sites require a different crawling configuration than plain HTML Web sites do.

Enable JavaScript evaluation

To enable JavaScript evaluation, set the f.crawlJS/"Evaluate Javascript" parameter to "true". When this option is enabled, the connector crawls links rendered from JavaScript evaluation, using a headless browser by default (see below).

As of Fusion 4.0.2, JavaScript evaluation is fastest in High Performance JavaScript Evaluation mode, which crawls Web sites using the Chromium browser with the Web connector. You turn this mode on with the High Performance Mode parameter.

How does high-performance JavaScript evaluation differ from the default (non-high-performance) mode? In high-performance mode, the following are true:
  • JavaScript evaluation is faster.

  • JavaScript evaluated content is more accurate.

  • You can run in "non-headless" mode to watch the browser crawl your content on a desktop, for debugging purposes.

  • You can save screenshots of pages and index them as base64 along with your other web page content.

  • The process uses more RAM.

  • The process uses more CPU.

Dependencies for Non-high Performance JavaScript Evaluation

For the default JavaScript evaluation, you need only a small number of dependencies to support the PhantomJS headless browser.

RHEL / CentOS / Amazon Linux

yum install fontconfig freetype

Ubuntu / Debian

apt-get install build-essential chrpath libssl-dev libxft-dev libfreetype6-dev libfreetype6 libfontconfig1-dev libfontconfig1

Windows

For Windows, there are no extra dependencies needed.

Dependencies for High Performance JavaScript Evaluation

To enable high-performance JavaScript evaluation, you must install some dependencies, then set f.crawlJS/"Evaluate JavaScript" and f.useHighPerfJsEval/"High Performance Mode" to "true".

Enable high-performance mode with Chromium
  1. Install the dependencies, using one of the scripts packaged with Fusion:

    • Windows: VAR-FUSIONPATH/bin/install-high-perf-web-deps.ps1

    • Linux and OSX: VAR-FUSIONPATH/bin/install-high-perf-web-deps.sh

    When the script finishes, a confirmation message displays:

    Successfully installed high-performance JS eval mode dependencies for Lucidworks Fusion web connector.
  2. In the Fusion UI, configure your Web data source:

    1. Set Evaluate Javascript to "true".

    2. Set High Performance Mode to "true".

    An additional f.headlessBrowser parameter can be set to "false" to display the browser windows during processing. It is "true" by default. Non-headless mode is available only in high-performance mode.

  3. Save the datasource configuration.
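The settings from the steps above can be pictured as a small property map. The following is a minimal sketch, assuming the datasource settings are handled as a flat dictionary keyed by parameter name. The helper high_perf_js_properties is hypothetical and not part of Fusion; only the three parameter names come from this page.

```python
import json


def high_perf_js_properties(headless=True):
    """Return the JavaScript evaluation settings described above.

    Hypothetical helper: only the parameter names come from this page.
    How they nest inside a full datasource definition is not shown here.
    """
    return {
        "f.crawlJS": True,              # "Evaluate Javascript" in the Fusion UI
        "f.useHighPerfJsEval": True,    # "High Performance Mode" in the Fusion UI
        "f.headlessBrowser": headless,  # True by default; False shows the browser windows
    }


if __name__ == "__main__":
    # Print the non-headless variant, which is useful for debugging on a desktop.
    print(json.dumps(high_perf_js_properties(headless=False), indent=2))
```

Keeping the three flags in one place makes it harder to set f.headlessBrowser to "false" without also enabling high-performance mode, which non-headless mode requires.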

Note
If Fusion is running on Docker, you must either mount the host's shared memory using the argument -v /dev/shm:/dev/shm or increase the container's shared memory size using the flag --shm-size=2g. With the default shm size of 64m, crawls fail and the logs show error messages like org.openqa.selenium.WebDriverException: Failed to decode response from marionette. See Geckodriver issue 1193 for more details.
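As an illustration of the two Docker options in this note, here is a minimal sketch that assembles the extra docker run arguments. The helper docker_shm_args and the image name example/fusion-image are placeholders invented for this example; only the -v /dev/shm:/dev/shm and --shm-size=2g arguments come from the note.

```python
import shlex


def docker_shm_args(share_host_shm=True, shm_size="2g"):
    """Return the `docker run` arguments that give the browser enough shared memory."""
    if share_host_shm:
        # Option 1: mount the host's shared memory into the container.
        return ["-v", "/dev/shm:/dev/shm"]
    # Option 2: enlarge the container's own /dev/shm beyond the 64m default.
    return ["--shm-size=" + shm_size]


if __name__ == "__main__":
    # "example/fusion-image" is a placeholder, not a real image name.
    command = ["docker", "run", *docker_shm_args(share_host_shm=False), "example/fusion-image"]
    print(shlex.join(command))
```

Only one of the two options is needed; the sketch returns one or the other rather than both.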

Screenshots with High Performance Mode

To enable screenshots, configure these parameters:

  • f.takeScreenshot - Enable screenshots. Default: false.

  • f.screenshotFullscreen - Take full-screen screenshots. Default: false.

These additional viewport parameters exist mainly to support screenshots, but the viewport settings can also affect the overall output:

  • f.viewportWidth - Viewport width in pixels. Default: 800.

  • f.viewportHeight - Viewport height in pixels. Default: 600.

  • f.deviceScreenFactor - Device screen factor. Default: 1 (no scaling). See the Android Screen compatibility overview for a description of device displays.

  • f.simulateMobile - (advanced property) Tell the browser to "emulate" a mobile browser. Default: false.

  • f.mobileScreenWidth - (advanced property) Only applicable with f.simulateMobile. Sets the screen width in pixels.

  • f.mobileScreenHeight - (advanced property) Only applicable with f.simulateMobile. Sets the screen height in pixels.
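The parameters and defaults listed above can be collected into one map. This is a minimal sketch, assuming the settings are handled as a flat dictionary; the helper screenshot_properties and its validation rules are illustrative only and do not describe checks that Fusion itself performs.

```python
# Defaults as listed on this page. No default is documented here for the
# two mobile screen sizes, so they are left out unless explicitly set.
SCREENSHOT_DEFAULTS = {
    "f.takeScreenshot": False,
    "f.screenshotFullscreen": False,
    "f.viewportWidth": 800,
    "f.viewportHeight": 600,
    "f.deviceScreenFactor": 1,
    "f.simulateMobile": False,
}

MOBILE_ONLY = ("f.mobileScreenWidth", "f.mobileScreenHeight")


def screenshot_properties(overrides):
    """Merge overrides onto the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(SCREENSHOT_DEFAULTS) - set(MOBILE_ONLY)
    if unknown:
        raise KeyError("Unknown parameter(s): %s" % sorted(unknown))
    props = {**SCREENSHOT_DEFAULTS, **overrides}
    # The mobile screen sizes only apply together with f.simulateMobile.
    if not props["f.simulateMobile"] and any(key in props for key in MOBILE_ONLY):
        raise ValueError("mobile screen sizes require f.simulateMobile")
    return props


if __name__ == "__main__":
    print(screenshot_properties({"f.takeScreenshot": True, "f.viewportWidth": 1280}))
```

Starting from the defaults means an override only has to name the values that change, for example a wider viewport for full-width screenshots.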

Authentication with JavaScript Evaluation

To use authentication when JavaScript evaluation is enabled, use the SmartForm (SAML) option because it can log in to a Web site like a typical browser user.

SmartForm login functionality is more powerful when JavaScript evaluation is enabled:

  • You can perform login on forms that might be JavaScript rendered.

  • You can use a variety of HTML selectors to find the elements to enter as login information.

    By contrast, when JavaScript evaluation is disabled, you can only provide inputs using the name attribute of <input> elements.

Configure authentication for JavaScript-enabled crawling
  1. Launch Chrome.

  2. Select File > New Incognito Window.

  3. Navigate to the site that you want to crawl with authentication.

    For example, navigate to http://some-website-with-auth.com.

  4. Identify the URL for the login page.

    For example, from http://some-website-with-auth.com, navigate to the page that displays the login form, then copy the page URL, such as http://some-website-with-auth.com/sso/login.

    Use this URL as the value of the loginUrl parameter (URL in the Fusion UI) explained in Complex form-based authentication.

  5. On the login page, identify the fields used for inputting the username and password.

    You can do this by right-clicking on the form fields and selecting Inspect element to open the developer tools, where the corresponding HTML element is highlighted.

    In most cases it is an <input> element with a name attribute, and you can specify the field by that name value. For example:

    <input id="resolving_input" name="login" class="signin-textfield" autocorrect="off" autocapitalize="off" type="text">
  6. Add the username field as a Property of the SmartForm login. That is, add a Property whose "Property Name" is login and whose "Property Value" is the username you need to log in as.

  7. Add the password field name as the passwordParamName (Password Parameter in the Fusion UI).

  8. On the site login page, right-click "Submit" and select Inspect element.

    • If the button is an <input type="submit"/>, then the SmartForm login picks it up automatically.

    • If the button is another element (such as <button>, <a>, <div>, and so on) then you must add a parameter whose name has the special prefix ::submitButtonXPath::, then add an XPath expression that points to the submit button. For example: ::submitButtonXPath:://button[@name='loginButton']

  9. If there is no name attribute on the <input> elements, then you must specify a parameter to tell the Web connector how to find the input element. You can use any of these special selector formats for the parameter name:

    ;;BY_XPATH;;//input[@id='someId']
    ;;BY_ID;;someid
    ;;BY_NAME;;somename
    ;;BY_CLASS_NAME;;someCssClassName
    ;;BY_CSS_SELECTOR;;.div#selector
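Because these special parameter names are plain strings, they can be composed programmatically. The following is a minimal sketch; the helpers field_selector and submit_button_xpath are hypothetical and not part of Fusion. Only the ;;BY_...;; prefixes and the ::submitButtonXPath:: prefix come from the steps above.

```python
# Selector kinds listed in step 9.
SELECTOR_KINDS = {"BY_XPATH", "BY_ID", "BY_NAME", "BY_CLASS_NAME", "BY_CSS_SELECTOR"}


def field_selector(kind, expression):
    """Build a parameter name such as ';;BY_ID;;someid'."""
    if kind not in SELECTOR_KINDS:
        raise ValueError("Unsupported selector kind: " + kind)
    return ";;" + kind + ";;" + expression


def submit_button_xpath(xpath):
    """Build the parameter name that points at a submit button that is not an <input>."""
    return "::submitButtonXPath::" + xpath


if __name__ == "__main__":
    print(field_selector("BY_XPATH", "//input[@id='someId']"))
    print(submit_button_xpath("//button[@name='loginButton']"))
```

Rejecting unknown selector kinds up front catches typos such as BY_CLASS before they reach the datasource configuration.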

Sometimes a login page asks a randomly chosen question, such as "What is the name of your first dog?"

In this case, add another special parameter that uses this format:

::WhenXPath::XPath of element to check against::Either @attributeToCheckAgainst or text to check against the text of the element::Value To Match::Field selector to set the value of only if the conditional check matched

Here is an example of three such parameters for a site that asks one of its questions at random:

::WhenXPath:://div[@tag='Your question']::text::What is the name of your first dog?::;;BY_ID;;answer
::WhenXPath:://div[@tag='Your question']::text::In what city were you born?::;;BY_ID;;answer
::WhenXPath:://input[@id='Your question']::@value::In what city were you born?::;;BY_ID;;answer
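The conditional format can also be assembled from its four parts. This is a minimal sketch, assuming the parts are simply joined with "::" as in the examples above; the helper when_xpath is hypothetical and not part of Fusion.

```python
def when_xpath(xpath, check, value, target):
    """Compose a ::WhenXPath:: parameter name from its four parts.

    xpath  - XPath of the element to check against
    check  - "text" to compare the element's text, or "@attribute"
    value  - the value to match
    target - field selector to set only if the check matched
    """
    if check != "text" and not check.startswith("@"):
        raise ValueError('check must be "text" or start with "@"')
    return "::WhenXPath::" + "::".join([xpath, check, value, target])


if __name__ == "__main__":
    # Rebuilds the first example parameter shown above.
    print(when_xpath(
        "//div[@tag='Your question']",
        "text",
        "What is the name of your first dog?",
        ";;BY_ID;;answer",
    ))
```

Building the name from separate parts keeps the "::" separators consistent when the same question has to be matched against several possible page elements.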

Debug the JavaScript Evaluation Stage using Non-headless Chromium

When testing the Web connector with Chromium, it helps to install Fusion on a workstation with a desktop environment, such as Windows, Mac, or Linux with a desktop. Then configure a Web datasource with your website, enable advanced mode, set Crawl Performance > Fetch Threads to 1, and uncheck Javascript Evaluation > Headless Browser.

With these settings, the Web fetcher uses a single instance of Chromium in a window where you can watch the documents being fetched. This is helpful if you are getting an unexpected result from the Chromium evaluation stage.