Crawl JavaScript websites
The Web connector retrieves data from a web site using HTTP and starting from a specified URL.
Managed Fusion releases 5.5 - 5.6.0 use the Web V1 connector. Managed Fusion releases 5.6.1 and later use the Web V2 connector. JavaScript Evaluation is not supported in the Web V2 connector.
JavaScript-enabled web sites require a different crawling configuration than plain HTML web sites do.
Enable JavaScript evaluation
To enable JavaScript evaluation, set the f.crawlJS ("Evaluate Javascript") parameter to "true". When this option is enabled, the connector also crawls links rendered by JavaScript evaluation, running the browser headless when the f.headlessBrowser parameter is set to "true".
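As a minimal sketch, the two settings can be written as a properties fragment. The property names f.crawlJS and f.headlessBrowser come from the text above; the helper name js_evaluation_properties and the flat dictionary layout are hypothetical, so check where these keys belong in your own datasource definition.

```python
import json


def js_evaluation_properties(headless: bool = True) -> dict:
    """Return the connector properties that turn on JavaScript evaluation.

    Only the two property names are taken from the documentation; where this
    fragment sits inside a full datasource definition is an assumption.
    """
    return {
        "f.crawlJS": True,              # "Evaluate Javascript" in the UI
        "f.headlessBrowser": headless,  # False opens a visible browser window
    }


if __name__ == "__main__":
    # Print the fragment as JSON so it can be compared with a datasource config.
    print(json.dumps(js_evaluation_properties(), indent=2))
```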
JavaScript evaluation works best in High Performance JavaScript Evaluation mode, in which the Web connector uses the Chromium browser to crawl web sites. This mode is available through the High Performance Mode parameter. High performance mode has these trade-offs:
- JavaScript evaluation is faster.
- JavaScript-evaluated content is more accurate.
- You can run in "non-headless" mode to watch the browser crawl your content on a desktop for debugging purposes.
- You can save screenshots of pages and index them as base64 along with your other web page content.
- The process uses more RAM.
- The process uses more CPU.
Dependencies for high performance JavaScript evaluation
Chrome dependencies are installed in the Docker container.
Screenshots with high performance mode
To enable screenshots, configure these parameters:
- f.takeScreenshot enables screenshots. Default: false.
- f.screenshotFullscreen takes full-screen screenshots. Default: false.
These additional viewport parameters exist mainly to support screenshots, but the viewport settings can also affect the overall output:
- f.viewportWidth sets the viewport width in pixels. Default: 800.
- f.viewportHeight sets the viewport height in pixels. Default: 600.
- f.deviceScreenFactor sets the device screen factor. Default: 1 (no scaling). See the Android Screen compatibility overview for a description of device displays.
- f.simulateMobile is an advanced property that tells the browser to emulate a mobile browser. Default: false.
- f.mobileScreenWidth is an advanced property that applies only with f.simulateMobile and sets the screen width in pixels.
- f.mobileScreenHeight is an advanced property that applies only with f.simulateMobile and sets the screen height in pixels.
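The screenshot and viewport parameters above can be collected into one properties fragment. The following is a sketch only: the property names and defaults come from the lists above, while the helper screenshot_properties, its arguments, and the flat dictionary layout are hypothetical. It adds the mobile screen dimensions only when mobile simulation is requested, since those two properties apply only with f.simulateMobile.

```python
from typing import Optional, Tuple


def screenshot_properties(
    fullscreen: bool = False,
    viewport_width: int = 800,
    viewport_height: int = 600,
    device_screen_factor: float = 1,
    mobile_screen: Optional[Tuple[int, int]] = None,
) -> dict:
    """Build a properties fragment that enables screenshots.

    mobile_screen is a hypothetical (width, height) pair in pixels; passing it
    switches on f.simulateMobile and sets the two mobile screen properties.
    """
    if viewport_width <= 0 or viewport_height <= 0:
        raise ValueError("viewport dimensions must be positive pixel counts")

    props = {
        "f.takeScreenshot": True,
        "f.screenshotFullscreen": fullscreen,
        "f.viewportWidth": viewport_width,
        "f.viewportHeight": viewport_height,
        "f.deviceScreenFactor": device_screen_factor,
        "f.simulateMobile": mobile_screen is not None,
    }
    if mobile_screen is not None:
        width, height = mobile_screen
        props["f.mobileScreenWidth"] = width
        props["f.mobileScreenHeight"] = height
    return props
```

For example, screenshot_properties(mobile_screen=(375, 667)) would request screenshots while emulating a mobile browser with a 375 by 667 pixel screen.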
Authentication with JavaScript Evaluation
To use authentication when JavaScript evaluation is enabled, use the SmartForm (SAML) option because it can sign in to a web site like a typical browser user.
SmartForm sign-in functionality is more powerful when JavaScript evaluation is enabled:
- You can perform sign-in on forms that might be JavaScript rendered.
- You can use a variety of HTML selectors to find the elements in which to enter sign-in information.
By contrast, when JavaScript evaluation is disabled, you can only provide inputs using the name attribute of <input> elements.
- Launch Chrome.
- Click File > New Incognito Window.
- Navigate to the site that you want to crawl with authentication. For example, navigate to http://some-website-with-auth.com.
- Identify the URL for the sign-in page. For example, from http://some-website-with-auth.com, navigate to the page that displays the sign-in form, then copy the page URL, such as http://some-website-with-auth.com/sso/login.
  Use this URL as the value of the loginUrl parameter, which is the URL field in the Managed Fusion UI. For more information, see Complex form-based authentication.
- On the sign-in page, identify the fields used for inputting the username and password.
  You can do this by right-clicking the form fields and selecting Inspect element to open the developer tools, where the corresponding HTML element is highlighted.
  In most cases it is an <input> element that has a name attribute, and you can specify the field by this name value. For example:
  <input id="resolving_input" name="login" class="signin-textfield" autocorrect="off" autocapitalize="off" type="text">
- Add the username field as a Property to the SmartForm sign-in. That is, add a Property, where the Property Name is login and the Property Value is the username.
- Add the password field name as the passwordParamName, which is the Password Parameter in the Managed Fusion UI.
- On the site sign-in page, right-click Submit and click Inspect element.
  - If the button is an <input type="submit"/>, then the SmartForm sign-in picks it up automatically.
  - If the button is another element (such as <button>, <a>, <div>, and so on), then you must add a parameter whose name has the special prefix ::submitButtonXPath:: followed by an XPath expression that points to the submit button. For example:
    ::submitButtonXPath:://button[@name='loginButton']
- If there is no name attribute on the <input> elements, then you must specify a parameter to tell the Web connector how to find the input element. You can use any of these special selector formats for the parameter name:
  ;;BY_XPATH;;//input[@id='someId']
  ;;BY_ID;;someid
  ;;BY_NAME;;somename
  ;;BY_CLASS_NAME;;someCssClassName
  ;;BY_CSS_SELECTOR;;.div#selector
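Because the special parameter names in the steps above are plain strings with fixed prefixes, they are easy to mistype. The sketch below assembles them programmatically. The prefixes are copied from the formats listed above; the helper names field_selector and submit_button_param, and the short keys used to pick a selector type, are hypothetical and not part of the connector.

```python
# Prefixes copied from the selector formats in the documentation.
SELECTOR_PREFIXES = {
    "xpath": ";;BY_XPATH;;",
    "id": ";;BY_ID;;",
    "name": ";;BY_NAME;;",
    "class_name": ";;BY_CLASS_NAME;;",
    "css_selector": ";;BY_CSS_SELECTOR;;",
}


def field_selector(kind: str, expression: str) -> str:
    """Return a SmartForm parameter name that locates an input element."""
    try:
        prefix = SELECTOR_PREFIXES[kind]
    except KeyError:
        raise ValueError(f"unknown selector kind: {kind!r}") from None
    return prefix + expression


def submit_button_param(xpath: str) -> str:
    """Return the parameter name that points SmartForm at a custom submit button."""
    return "::submitButtonXPath::" + xpath


if __name__ == "__main__":
    print(field_selector("id", "someid"))
    print(submit_button_param("//button[@name='loginButton']"))
```

Keeping the prefixes in one table means a typo is made, and fixed, in a single place.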
Sometimes the sign-in page asks a randomly chosen security question, such as "What is the name of your first dog?"
In this case, add another special parameter in this format:
::WhenXPath::XPath of element to check against::Either @attributeToCheckAgainst or text to check against the text of the element::Value To Match::Field selector to set the value of only if the conditional check matched
Here is an example of three different parameters where our site might ask one of three questions randomly:
::WhenXPath:://div[@tag='Your question']::text::What is the name of your first dog?::;;BY_ID;;answer
::WhenXPath:://div[@tag='Your question']::text::In what city were you born?::;;BY_ID;;answer
::WhenXPath:://input[@id='Your question']::@value::In what city were you born?::;;BY_ID;;answer
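The conditional parameter name is four fields joined by :: after the WhenXPath marker, which the following sketch makes explicit. The field order follows the format line above; the helper name when_xpath_param and its argument names are hypothetical. The sketch assumes that the answer to the question is supplied as the value of the parameter, which the format description implies but does not state outright.

```python
def when_xpath_param(xpath: str, check: str, expected: str, field_selector: str) -> str:
    """Build a ::WhenXPath:: parameter name.

    xpath          -- XPath of the element to check
    check          -- "text" to compare the element text, or "@attr" for an attribute
    expected       -- the value that must match for the field to be filled in
    field_selector -- selector of the input to fill, such as ";;BY_ID;;answer"
    """
    if check != "text" and not check.startswith("@"):
        raise ValueError("check must be 'text' or an attribute such as '@value'")
    # The leading empty string produces the "::" that opens the parameter name.
    return "::".join(["", "WhenXPath", xpath, check, expected, field_selector])


if __name__ == "__main__":
    print(when_xpath_param(
        "//div[@tag='Your question']",
        "text",
        "What is the name of your first dog?",
        ";;BY_ID;;answer",
    ))
```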
Debug the JavaScript Evaluation Stage using Non-headless Chromium
When testing the Web connector with Chromium, it helps to install Managed Fusion on a workstation with a desktop environment, such as Windows, Mac, or Linux with a desktop.
Then configure a Web datasource with your website, enable advanced mode, set Crawl Performance > Fetch Threads to 1, and uncheck Javascript Evaluation > Headless Browser.
The Web fetcher then uses a single instance of Chromium in a window where you can watch it fetch documents. This is helpful if you are getting an unexpected result from the Chromium evaluation stage.
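The same debugging setup can be sketched as a set of property overrides applied to an existing configuration. The f.crawlJS and f.headlessBrowser names come from this page; the key fetchThreads for the Fetch Threads field is an assumption and should be confirmed against your connector's configuration reference, and the helper debug_overrides is hypothetical.

```python
def debug_overrides(properties: dict) -> dict:
    """Return a copy of the given properties adjusted for watching a crawl."""
    debug = dict(properties)            # leave the caller's dictionary untouched
    debug["f.crawlJS"] = True           # JavaScript evaluation must be on
    debug["f.headlessBrowser"] = False  # show the Chromium window
    debug["fetchThreads"] = 1           # ASSUMED key name for "Fetch Threads"
    return debug


if __name__ == "__main__":
    print(debug_overrides({"f.crawlJS": True, "f.headlessBrowser": True}))
```

Returning a copy keeps the original settings available, so the datasource can be switched back to headless crawling after debugging.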