Fusion 5.9

    Crawl JavaScript Web Sites

    JavaScript-enabled web sites require a different crawling configuration than plain HTML web sites do.

    Enable JavaScript evaluation

    To enable JavaScript evaluation, set the f.crawlJS/"Evaluate Javascript" parameter to "true". When this option is enabled, the connector crawls links rendered from JavaScript evaluation, using a headless browser by default (see below).

    JavaScript evaluation is fastest in High Performance JavaScript Evaluation mode, in which the Web connector uses the Chromium browser to crawl web sites. This mode is controlled by the High Performance Mode parameter.

    How does "High performance JavaScript evaluation" differ from the non-high-performance mode? In high performance mode, the following are true:
    • JavaScript evaluation is faster.

    • JavaScript evaluated content is more accurate.

    • You can run in "non-headless" mode to watch the browser crawl your content on a desktop, for debugging purposes.

    • You can save screenshots of pages and index them as base64 along with your other web page content.

    • The process uses more RAM.

    • The process uses more CPU.

    Dependencies for High Performance JavaScript Evaluation

    This section only applies to 4.x.x releases. For 5.x.x releases, Chrome dependencies are already installed in the Docker container.

    To enable high-performance JavaScript evaluation, you must install some dependencies, then set f.crawlJS/"Evaluate JavaScript" and f.useHighPerfJsEval/"High Performance Mode" to "true".

    Enable high-performance mode with Chromium
    1. Install the dependencies, using one of the scripts packaged with Fusion:

      • Windows: https://FUSION_HOST:FUSION_PORT/bin/install-high-perf-web-deps.ps1

      • Linux and OSX: https://FUSION_HOST:FUSION_PORT/bin/install-high-perf-web-deps.sh

      When the script finishes, a confirmation message displays:

      Successfully installed high-performance JS eval mode dependencies for Lucidworks Fusion web connector.
    2. In the Fusion UI, configure your Web data source:

      1. Set Evaluate Javascript to "true".

      2. Set High Performance Mode to "true".

      An additional f.headlessBrowser parameter can be set to "false" to display the browser windows during processing. It is "true" by default. Non-headless mode is available only in high-performance mode.

    3. Save the datasource configuration.
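
    The settings from the steps above can be collected as a small set of parameters. This is a minimal sketch, assuming the data source settings can be treated as a flat mapping of parameter names to values. The helper name high_perf_js_properties is hypothetical; only the three f.* parameter names come from this page, and the full shape of a Fusion datasource configuration is not shown here.

```python
import json


def high_perf_js_properties(headless: bool = True) -> dict:
    """Return the JavaScript evaluation parameters named on this page.

    Hypothetical helper: it only builds a mapping and does not call Fusion.
    """
    return {
        "f.crawlJS": True,              # "Evaluate Javascript"
        "f.useHighPerfJsEval": True,    # "High Performance Mode"
        "f.headlessBrowser": headless,  # "false" shows the browser windows
    }


if __name__ == "__main__":
    # Non-headless variant, as used for debugging.
    print(json.dumps(high_perf_js_properties(headless=False), indent=2))
```

    Keeping the headless flag as the only argument reflects that non-headless mode is meaningful only when the other two parameters are "true".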

    If Fusion is running on Docker, you must either mount the host's shared memory directory using the argument -v /dev/shm:/dev/shm or increase the container's shared memory using the flag --shm-size=2g. The default shm size of 64m causes crawls to fail, with logs showing error messages such as org.openqa.selenium.WebDriverException: Failed to decode response from marionette. See Geckodriver issue 1193 for more details.
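
    One way to confirm that a container has enough shared memory is to check the capacity of /dev/shm from inside it. The following is a minimal sketch, not part of Fusion: the function name and the 2 GiB threshold are assumptions (the threshold simply mirrors the --shm-size=2g example above and is not a documented minimum).

```python
import shutil

# Assumption: 2 GiB mirrors the --shm-size=2g example; not a documented minimum.
MIN_SHM_BYTES = 2 * 1024**3


def shm_is_large_enough(path: str = "/dev/shm", min_bytes: int = MIN_SHM_BYTES) -> bool:
    """Return True if the filesystem at `path` has at least `min_bytes` total capacity."""
    return shutil.disk_usage(path).total >= min_bytes


if __name__ == "__main__":
    try:
        ok = shm_is_large_enough()
    except FileNotFoundError:
        print("/dev/shm not found; this check only applies on Linux")
    else:
        print("shared memory OK" if ok else "shared memory is smaller than 2 GiB")
```

    With the -v /dev/shm:/dev/shm mount, the reported size is whatever the host provides, so a lower threshold may be appropriate in that case.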

    Screenshots with High Performance Mode

    To enable screenshots, configure these parameters:

    • f.takeScreenshot - Enable screenshots. Default: false.

    • f.screenshotFullscreen - Take full-screen screenshots. Default: false.

    These additional viewport parameters exist mainly to support screenshots, but the viewport settings can also affect the overall output:

    • f.viewportWidth - Viewport width in pixels. Default: 800.

    • f.viewportHeight - Viewport height in pixels. Default: 600.

    • f.deviceScreenFactor - Device screen factor. Default: 1 (no scaling). See the Android Screen compatibility overview for a description of device displays.

    • f.simulateMobile - (advanced property) Tell the browser to "emulate" a mobile browser. Default: false.

    • f.mobileScreenWidth - (advanced property) Only applicable with f.simulateMobile. Sets the screen width in pixels.

    • f.mobileScreenHeight - (advanced property) Only applicable with f.simulateMobile. Sets the screen height in pixels.
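
    The screenshot and viewport parameters above can be assembled and sanity-checked before they are entered in a data source. This is a minimal sketch under stated assumptions: the function screenshot_properties is hypothetical, the defaults are the ones listed on this page, and the check that mobile screen sizes require f.simulateMobile is this example's reading of "Only applicable with f.simulateMobile", not connector behavior.

```python
from typing import Optional


def screenshot_properties(
    full_screen: bool = False,
    viewport_width: int = 800,
    viewport_height: int = 600,
    device_screen_factor: float = 1,
    simulate_mobile: bool = False,
    mobile_screen_width: Optional[int] = None,
    mobile_screen_height: Optional[int] = None,
) -> dict:
    """Build screenshot and viewport parameters using the defaults listed above."""
    mobile_sizes = (mobile_screen_width, mobile_screen_height)
    if not simulate_mobile and any(size is not None for size in mobile_sizes):
        raise ValueError("mobile screen sizes only apply when f.simulateMobile is true")

    props = {
        "f.takeScreenshot": True,
        "f.screenshotFullscreen": full_screen,
        "f.viewportWidth": viewport_width,
        "f.viewportHeight": viewport_height,
        "f.deviceScreenFactor": device_screen_factor,
        "f.simulateMobile": simulate_mobile,
    }
    # Only include the mobile sizes that were explicitly provided.
    if mobile_screen_width is not None:
        props["f.mobileScreenWidth"] = mobile_screen_width
    if mobile_screen_height is not None:
        props["f.mobileScreenHeight"] = mobile_screen_height
    return props
```

    For example, screenshot_properties(simulate_mobile=True, mobile_screen_width=375) includes f.mobileScreenWidth, while passing mobile_screen_width without simulate_mobile raises a ValueError.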

    Authentication with JavaScript Evaluation

    To use authentication when JavaScript evaluation is enabled, use the SmartForm (SAML) option because it can log in to a web site like a typical browser user.

    SmartForm login functionality is more powerful when JavaScript evaluation is enabled:

    • You can perform login on forms that might be JavaScript rendered.

    • You can use a variety of HTML selectors to find the elements in which to enter login information.

      By contrast, when JavaScript evaluation is disabled, you can only provide inputs using the name attribute of <input> elements.

    Configure authentication for JavaScript-enabled crawling
    1. Open a private or incognito window in your internet browser. This example uses Google Chrome.

    2. Navigate to the site that you want to crawl with authentication.

    3. Identify the URL for the login page.

      For example, from http://some-website-with-auth.com, navigate to the page that displays the login form, then copy the page URL, such as http://some-website-with-auth.com/sso/login.

      Use this URL as the value of the loginUrl parameter (URL in the Fusion UI) explained in Complex form-based authentication.

    4. On the login page, identify the fields used for inputting the username and password.

      You can do this by right-clicking on the form fields and selecting Inspect element to open the developer tools, where the corresponding HTML element is highlighted.

      In most cases, the field is an <input> element with a name attribute, and you can specify the field by that name value. For example:

      <input id="resolving_input" name="login" class="signin-textfield" autocorrect="off" autocapitalize="off" type="text">
    5. Add the username field as a parameter to the authentication section of the Web connector. In this example, the Property Name is login and the Property Value is the username the connector will use to log in.

    6. Add the password field name as the passwordParamName (Password Parameter in the Fusion UI).

    7. On the site login page, right-click Submit (or equivalent) and select Inspect element.

      • If the button is an <input type="submit"/>, then the SmartForm login picks it up automatically.

      • If the button is another element (such as <button>, <a>, <div>, and so on) then you must add a parameter with the special prefix ::submitButtonXPath::, then add an XPath expression that points to the submit button. For example: ::submitButtonXPath:://button[@name='loginButton']

    8. If there is no name attribute on the <input> elements, then you must specify a parameter to tell the Web connector how to find the input element. You can use any of these special selector formats for the parameter name:

      ;;BY_XPATH;;//input[@id='someId']
      ;;BY_ID;;someid
      ;;BY_NAME;;somename
      ;;BY_CLASS_NAME;;someCssClassName
      ;;BY_CSS_SELECTOR;;.div#selector
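
    Because these parameter names are easy to mistype, it can help to check them before saving the data source. The following is a minimal sketch that only illustrates the ;;KIND;;expression format listed above. The function split_selector is hypothetical and is not how the Web connector itself parses these names.

```python
from typing import Tuple

# The five selector prefixes listed on this page.
SELECTOR_KINDS = ("BY_XPATH", "BY_ID", "BY_NAME", "BY_CLASS_NAME", "BY_CSS_SELECTOR")


def split_selector(param_name: str) -> Tuple[str, str]:
    """Split ';;KIND;;expression' into (KIND, expression), validating the KIND."""
    if not param_name.startswith(";;"):
        raise ValueError("selector parameter names must start with ';;'")
    # Split on the first ';;' after the leading one; the rest is the expression.
    kind, sep, expression = param_name[2:].partition(";;")
    if not sep or not expression:
        raise ValueError("expected the form ';;KIND;;expression'")
    if kind not in SELECTOR_KINDS:
        raise ValueError("unknown selector kind: " + kind)
    return kind, expression


if __name__ == "__main__":
    print(split_selector(";;BY_XPATH;;//input[@id='someId']"))
```

    A misspelled prefix such as ;;BY_XPTH;; raises a ValueError instead of silently producing a parameter the login form would never match.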

    Sometimes the login page asks a randomly selected question, such as "What is the name of your first dog?"

    In this case, add another special parameter:

    ::WhenXPath::XPath of element to check against::Either @attributeToCheckAgainst or text to check against the text of the element::Value To Match::Field selector to set the value of only if the conditional check matched

    Here is an example of three different parameters where your site might ask one of three questions randomly:

    ::WhenXPath:://div[@tag='Your question']::text::What is the name of your first dog?::;;BY_ID;;answer
    ::WhenXPath:://div[@tag='Your question']::text::In what city were you born?::;;BY_ID;;answer
    ::WhenXPath:://input[@id='Your question']::@value::In what city were you born?::;;BY_ID;;answer
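
    The parameter names above follow a fixed pattern, so they can be generated rather than typed by hand. This is a minimal sketch assuming the four parts are joined with "::" exactly as in the template; the function when_xpath_param is hypothetical. The returned string is the parameter name only. The answer to submit is presumably supplied as that parameter's value, as with the other login parameters.

```python
def when_xpath_param(xpath: str, check: str, expected: str, field_selector: str) -> str:
    """Build a ::WhenXPath:: parameter name from its four parts.

    check is either the literal "text" or an attribute reference such as "@value".
    """
    if check != "text" and not check.startswith("@"):
        raise ValueError('check must be "text" or start with "@"')
    return "::WhenXPath::" + "::".join([xpath, check, expected, field_selector])


if __name__ == "__main__":
    print(
        when_xpath_param(
            "//div[@tag='Your question']",
            "text",
            "What is the name of your first dog?",
            ";;BY_ID;;answer",
        )
    )
```

    The sketch only builds names. It does not parse them back, because an XPath expression can itself contain "::" (for example, in an axis such as following-sibling::), which makes splitting on that separator unreliable.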

    Debug the JavaScript Evaluation Stage using Non-headless Chromium

    When testing the Web connector with Chromium, it helps to access Fusion through a GUI-enabled browser. Configure a Web data source with your website, enable advanced mode, set the Crawl Performance > Fetch Threads to 1, and uncheck Javascript Evaluation > Headless Browser.

    With these settings, the Web fetcher uses a single instance of Chromium in a visible window, where you can watch it fetch documents. This is helpful if you are getting an unexpected result from the Chromium evaluation stage.