Connector Configuration Reference
Crawl an Authenticated Website with the Web Connector
In your browser, go to the website you want to crawl, such as http://some-website-with-auth.com, navigate to the page that displays the login form, then copy the page URL, such as http://some-website-with-auth.com/sso/login. Use this URL as the value of the loginUrl parameter (URL in the Fusion UI).
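For example, in a REST API datasource configuration the value might appear as in this minimal sketch (the surrounding JSON layout is an assumption here; the exact property path depends on the connector version):

{
  "loginUrl": "http://some-website-with-auth.com/sso/login"
}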
Next, find the username field in the login form. Usually this is an <input> element that has a name attribute, and you can specify the field as this name value. For example, if the username field is <input type="text" name="login"/>, then the Property Name is login and the Property Value is the username the connector will use to log in.
Then find the password field, and use its name attribute as the value of the passwordParamName parameter (Password Parameter in the Fusion UI).
If the form's submit button is an <input type="submit"/> element, then the SmartForm login picks it up automatically. If the button is any other element (<button>, <a>, <div>, and so on), then you must add a parameter with the special prefix ::submitButtonXPath::, followed by an XPath expression that points to the submit button. For example: ::submitButtonXPath:://button[@name='loginButton']
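Putting these pieces together, suppose the login form contains <input type="text" name="login"/>, <input type="password" name="pass"/>, and <button name="loginButton">Sign in</button> (all field names here are hypothetical). The resulting configuration might look like this sketch; loginUrl, passwordParamName, and ::submitButtonXPath:: come from this page, while the params wrapper, the field names, and the crawl-user value are illustrative assumptions:

{
  "loginUrl": "http://some-website-with-auth.com/sso/login",
  "passwordParamName": "pass",
  "params": {
    "login": "crawl-user",
    "::submitButtonXPath::": "//button[@name='loginButton']"
  }
}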
If there is no name attribute on the <input> elements, then you must specify a parameter that tells the Web connector how to find the input element, using one of the connector's special selector formats as the parameter name.
Some login forms also ask additional questions, such as: What is the name of your first dog? In this case, add another special parameter for that field, and set f.crawlJS ("Evaluate JavaScript" in the Fusion UI) and f.useHighPerfJsEval ("High Performance Mode" in the Fusion UI) to "true".

Enable high-performance mode with Chromium

To install the dependencies that high-performance mode requires, run the script for your platform:

https://FUSION_HOST:FUSION_PORT/bin/install-high-perf-web-deps.ps1 (Windows)
https://FUSION_HOST:FUSION_PORT/bin/install-high-perf-web-deps.sh (Linux and macOS)
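For example, on a Linux node the script might be run from the Fusion bin directory like this (the /opt/fusion installation path is an assumption):

cd /opt/fusion/bin
./install-high-perf-web-deps.sh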
The f.headlessBrowser parameter can be set to "false" to display the browser windows during processing; it is "true" by default. Non-headless mode is available only when using high-performance mode.
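A sketch of these JavaScript-related settings together in a REST API payload (the flat JSON layout is an assumption, and whether the schema expects JSON booleans or the strings "true"/"false" depends on the connector version; the parameter names are from this page):

{
  "f.crawlJS": true,
  "f.useHighPerfJsEval": true,
  "f.headlessBrowser": false
}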
When running Fusion in a Docker container, mount the host's shm directory using the argument -v /dev/shm:/dev/shm to use the host's shared memory, or use the flag --shm-size=2g to increase the container's shared memory. The default shm size of 64m will result in failing crawls, with logs showing error messages like org.openqa.selenium.WebDriverException: Failed to decode response from marionette. See Geckodriver issue 1193 for more details.
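For example, a Docker invocation with either option might look like one of these sketches (the image name fusion-image is a placeholder):

docker run -v /dev/shm:/dev/shm fusion-image
docker run --shm-size=2g fusion-image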
Tip: set f.headlessBrowser to "false" to test with an actual web browser.
Deduplicate Web Content using Canonical Tags
Limit the Crawl Scope for Web Sites
The connector works by fetching the start pages (the startURIs specified in the configuration form), collecting the content for indexing, and extracting any links to other pages. It then follows those links to collect content on other pages, extracting links to those pages, and so on.

When creating a Web data source, pay attention to the Max crawl depth and Restrict To Tree parameters (c.depth and c.restrictToTree in the REST API). These properties limit the scope of your crawl to prevent an unbounded crawl that could continue for a long time, particularly if you are crawling a site with links to many pages outside the main site. An unbounded crawl can also cause memory errors in your system.
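For example, a REST API configuration that keeps the crawl within the start site and at most three links deep might look like this sketch (the JSON layout is an assumption; the parameter names are from this page):

{
  "c.depth": 3,
  "c.restrictToTree": true
}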
The connector keeps track of URIs it has seen, and many of the properties relate to managing the resulting database of entries. If the connector finds a standard redirect, it tracks that the redirected URI has an alias, and does not re-evaluate the URI on its next runs until the alias expiration has passed.
If deduplication is enabled, documents that were found to be duplicates are also added to the alias list and are not re-evaluated until the alias expiration has passed.

Regular expressions can be used to restrict the crawl, either by defining URI patterns that should be followed or URI patterns that should not be followed.

The sitemap_incremental_crawling configuration parameter applies to URLs found in the sitemap. Set it to true to remove documents from the index when they can no longer be accessed as unique documents, for example when a page is removed from the sitemap and cannot be accessed. In addition, if a page is not in the sitemap, the connector classifies the missing page as unbounded and removes it from the index.
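A minimal sketch of this setting in a REST API payload (the flat layout is an assumption):

{
  "sitemap_incremental_crawling": true
}

With this enabled, a page that disappears from the sitemap and can no longer be accessed is removed from the index on a later crawl.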
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
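For example, a hypothetical delimiter parameter (the name f.delimiter is an assumption used only to illustrate the escaping rule) would be entered as \t in the UI, but escaped in a JSON API payload:

{
  "f.delimiter": "\\t"
}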