The Web connector retrieves data from a Web site using HTTP and starting from a specified URL.
You can run in "non-headless" mode to watch the browser crawl your content on a desktop, for debugging purposes.
You can save screenshots of pages and index them as base64 along with your other web page content.
The process uses more RAM.
The process uses more CPU.
RHEL / CentOS / Amazon Linux
yum install fontconfig freetype
Ubuntu / Debian
apt-get install build-essential chrpath libssl-dev libxft-dev libfreetype6-dev libfreetype6 libfontconfig1-dev libfontconfig1
For Windows, there are no extra dependencies needed.
To enable high-performance mode, set f.useHighPerfJsEval ("High Performance Mode" in the Fusion UI) to "true".
Install the dependencies, using one of the scripts packaged with Fusion:
Linux and OSX:
When the script finishes, a confirmation message displays:
Successfully installed high-performance JS eval mode dependencies for Lucidworks Fusion web connector.
In the Fusion UI, configure your Web data source:
Set High Performance Mode to "true".
The f.headlessBrowser parameter can be set to "false" to display the browser windows during processing. It is "true" by default. Non-headless mode is available only in high-performance mode.
Save the datasource configuration.
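The dependency between these two settings can be checked before the configuration is saved. The following is a minimal sketch, not Fusion code: it assumes the datasource properties are held in a plain Python dict keyed by the parameter names above, and check_browser_settings is a hypothetical helper. It also assumes f.useHighPerfJsEval is off when unset, which this page does not state.

```python
def check_browser_settings(props):
    """Return a list of problems with the browser-related properties.

    `props` is assumed to be a plain dict of datasource properties;
    this is an illustration, not part of the Fusion API.
    """
    problems = []
    # Assumption: high-performance mode is treated as off when unset.
    high_perf = props.get("f.useHighPerfJsEval", False)
    # f.headlessBrowser is "true" by default, per the text above.
    headless = props.get("f.headlessBrowser", True)
    if not headless and not high_perf:
        problems.append(
            "f.headlessBrowser=false requires f.useHighPerfJsEval=true"
        )
    return problems


debug_props = {"f.useHighPerfJsEval": True, "f.headlessBrowser": False}
print(check_browser_settings(debug_props))  # prints []
```

An empty list means the combination is consistent with the rule that non-headless mode needs high-performance mode.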
If Fusion is running on Docker, you must either mount an
Screenshots with High Performance Mode
To enable screenshots, configure these parameters:
f.takeScreenshot - Enable screenshots. Default: false.
f.screenshotFullscreen - Take full-screen screenshots. Default: false.
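Because screenshots are indexed as base64 text, a client has to decode that text before it can save or display the image. A minimal sketch, assuming the base64 string has already been read from an indexed document; decode_screenshot is a hypothetical helper, and the sample value below is a stand-in rather than a real screenshot.

```python
import base64
import binascii


def decode_screenshot(b64_text):
    """Decode a base64 screenshot string into raw image bytes.

    Raises ValueError if the text is not valid base64.
    """
    try:
        return base64.b64decode(b64_text, validate=True)
    except binascii.Error as exc:
        raise ValueError("screenshot field is not valid base64") from exc


# Stand-in value: a real one would come from the indexed document.
sample_field = base64.b64encode(b"fake image bytes").decode("ascii")
image_bytes = decode_screenshot(sample_field)
print(len(image_bytes))  # prints 16
```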
These additional viewport parameters exist mainly to support screenshots, but the viewport settings can also affect the overall output:
f.viewportWidth - Viewport width in pixels. Default: 800.
f.viewportHeight - Viewport height in pixels. Default: 600.
f.deviceScreenFactor - Device screen factor. Default: 1 (no scaling). See the Android Screen compatibility overview for a description of device displays.
f.simulateMobile - (advanced property) Tell the browser to "emulate" a mobile browser. Default: false.
f.mobileScreenWidth - (advanced property) Only applicable with f.simulateMobile. Sets the screen width in pixels.
f.mobileScreenHeight - (advanced property) Only applicable with f.simulateMobile. Sets the screen height in pixels.
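The screenshot and viewport parameters can be collected into one set of properties. This is a sketch under stated assumptions: the property names and defaults come from the list above, but the dict layout and the screenshot_properties helper are hypothetical, and how the properties reach Fusion (UI or API) is not covered here.

```python
import json

# Defaults as listed in the parameter descriptions above.
DEFAULTS = {
    "f.takeScreenshot": False,
    "f.screenshotFullscreen": False,
    "f.viewportWidth": 800,
    "f.viewportHeight": 600,
    "f.deviceScreenFactor": 1,
    "f.simulateMobile": False,
}
MOBILE_ONLY = ("f.mobileScreenWidth", "f.mobileScreenHeight")


def screenshot_properties(overrides):
    """Merge overrides onto the documented defaults (hypothetical helper)."""
    known = set(DEFAULTS) | set(MOBILE_ONLY)
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise KeyError("unknown properties: " + ", ".join(unknown))
    props = dict(DEFAULTS)
    props.update(overrides)
    # The mobile screen sizes apply only together with f.simulateMobile.
    if not props["f.simulateMobile"] and any(k in props for k in MOBILE_ONLY):
        raise ValueError("mobile screen size requires f.simulateMobile=true")
    return props


props = screenshot_properties({
    "f.takeScreenshot": True,
    "f.viewportWidth": 1280,
    "f.viewportHeight": 720,
})
print(json.dumps(props, indent=2))
```

Rejecting unknown keys up front catches misspelled property names, which would otherwise be ignored silently in a sketch like this.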
You can use a variety of HTML selectors to find the elements to enter as login information.
Select File > New Private Window.
Navigate to the site that you want to crawl with authentication.
For example, navigate to
Identify the URL for the login page.
For example, from
http://some-website-with-auth.com, navigate to the page that displays the login form, then copy the page URL, such as
Use this URL as the value of the loginUrl parameter (URL in the Fusion UI) explained in Complex form-based authentication.
On the login page, identify the fields used for inputting the username and password.
You can do this by right-clicking on the form fields and selecting Inspect element to open the developer tools, where the corresponding HTML element is highlighted.
In most cases it is an <input> element that has a name attribute, and you can specify the field by this name value. For example:
<input id="resolving_input" name="login" class="signin-textfield" autocorrect="off" autocapitalize="off" type="text">
Add the username field as a Property to the SmartForm login. That is, add a Property where the "Property Name" is login and the "Property Value" is the username you need to log in as.
Add the password field name as the passwordParamName (Password Parameter in the Fusion UI).
On the site login page, right-click "Submit" and select Inspect element.
If the button is an <input type="submit"/>, then the SmartForm login picks it up automatically.
If the button is another element (such as <div>, and so on), then you must add a parameter whose name has the special prefix ::submitButtonXPath::, then add an XPath expression that points to the submit button. For example:
If there is no name attribute on the <input> elements, then you must specify a parameter to tell the Web connector how to find the input element. You can use any of these special selector formats for the parameter name:
;;BY_XPATH;;//input[@id='someId']
;;BY_ID;;someid
;;BY_NAME;;somename
;;BY_CLASS_NAME;;someCssClassName
;;BY_CSS_SELECTOR;;.div#selector
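The choice between a plain name and a selector format can be sketched in code. This is an illustration only, assuming you have copied the field's HTML from the browser's developer tools; field_parameter_name is a hypothetical helper, and the fallback order (name, then id, then class) is a choice made for this sketch rather than something the Web connector prescribes.

```python
from html.parser import HTMLParser


class _FirstInput(HTMLParser):
    """Collect the attributes of the first <input> element seen."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "input" and self.attrs is None:
            self.attrs = dict(attrs)


def field_parameter_name(input_html):
    """Pick a parameter name for a form field (hypothetical helper).

    Uses the plain name attribute when present, otherwise falls back
    to the ;;BY_ID;; or ;;BY_CLASS_NAME;; selector formats.
    """
    parser = _FirstInput()
    parser.feed(input_html)
    attrs = parser.attrs or {}
    if attrs.get("name"):
        return attrs["name"]
    if attrs.get("id"):
        return ";;BY_ID;;" + attrs["id"]
    if attrs.get("class"):
        # Assumption: BY_CLASS_NAME expects one class, so use the first.
        return ";;BY_CLASS_NAME;;" + attrs["class"].split()[0]
    raise ValueError("no name, id, or class attribute to select on")


with_name = '<input id="resolving_input" name="login" type="text">'
without_name = '<input id="resolving_input" class="signin-textfield" type="text">'
print(field_parameter_name(with_name))     # prints login
print(field_parameter_name(without_name))  # prints ;;BY_ID;;resolving_input
```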
Sometimes your Web page asks you a random question, such as
What is the name of your first dog?
In this case we add another special parameter:
::WhenXPath::XPath of element to check against::Either @attributeToCheckAgainst or text to check against the text of the element::Value To Match::Field selector to set the value of only if the conditional check matched
Here is an example of three different parameters where our site might ask one of three questions randomly:
::WhenXPath:://div[@tag='Your question']::text::What is the name of your first dog?::;;BY_ID;;answer
::WhenXPath:://div[@tag='Your question']::text::In what city were you born?::;;BY_ID;;answer
::WhenXPath:://input[@id='Your question']::@value::In what city were you born?::;;BY_ID;;answer
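Since these parameter names are long and easy to mistype, it can help to assemble them from their four parts. A minimal sketch of the format described above; when_xpath_parameter is a hypothetical helper, and the answers in the example dict are placeholders.

```python
def when_xpath_parameter(xpath, check, expected, field_selector):
    """Build a ::WhenXPath:: parameter name from its four parts.

    check is either "text" or an attribute reference such as "@value".
    """
    if check != "text" and not check.startswith("@"):
        raise ValueError('check must be "text" or start with "@"')
    return "::WhenXPath::" + "::".join([xpath, check, expected, field_selector])


# Placeholder answers, keyed by the question the site might ask.
answers = {
    "What is the name of your first dog?": "placeholder-answer-1",
    "In what city were you born?": "placeholder-answer-2",
}

# One parameter per possible question, each targeting the same answer field.
parameters = {
    when_xpath_parameter(
        "//div[@tag='Your question']", "text", question, ";;BY_ID;;answer"
    ): answer
    for question, answer in answers.items()
}

for name in parameters:
    print(name)
```

The first name printed matches the first example parameter above.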
When testing the Web connector with Chromium, it helps to install Fusion on a workstation with a desktop environment, such as Windows, Mac, or Linux with a desktop.
Then configure a Web datasource with your website, enable advanced mode, set the Crawl Performance > Fetch Threads to
This results in the Web fetcher using a single instance of Chromium in a window where you can watch it fetch documents. This is helpful if you are getting an unexpected result from the Chromium evaluation stage.