Compatible with Fusion version: 4.0.0 through 5.12.0
The Web V1 connector retrieves data from a website using HTTP and starting from a specified URL. The connector discovers pages through links such as href attributes and sitemaps up to your configured depth, then indexes those pages.
Important: V1 deprecation and removal notice. Starting in Fusion 5.12.0, all V1 connectors are deprecated. They are no longer actively developed and will be removed in Fusion 5.13.0. The replacement for this connector is the Web V2 connector. If you are using this connector, you must migrate to the replacement connector or a supported alternative before upgrading to Fusion 5.13.0. We recommend migrating as soon as possible to avoid disruption to your workflows.
Migrating to the Web V2 connector gives you better Java SDK-based performance, distributed fetching, plugin upgrades as new versions are released, and built-in OAuth token support. Fusion 5.x uses Open Graph Protocol metadata, such as og:title and og:description, as the default field configuration. Pages that deviate from that standard configuration may have information excluded from indexing during the crawl.
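To illustrate what the default field configuration looks for, the sketch below extracts Open Graph metadata from a page's HTML using only the Python standard library. This is a simplified illustration of the mapping, not Fusion's actual parsing code.

```python
from html.parser import HTMLParser

# Illustrative sketch (not Fusion's implementation): collect Open Graph
# <meta property="og:..."> tags, the same properties Fusion 5.x maps to
# fields by default.
class OpenGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            self.og[prop] = attrs["content"]

html = """<html><head>
<meta property="og:title" content="Example Product" />
<meta property="og:description" content="A short description." />
</head><body></body></html>"""

parser = OpenGraphParser()
parser.feed(html)
print(parser.og["og:title"])        # Example Product
print(parser.og["og:description"])  # A short description.
```

Pages that omit these tags fall outside the default configuration, which is one way content can be excluded from indexing.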

Prerequisites

Complete these prerequisites to ensure the connector can reliably access, crawl, and index your data. Proper setup helps you avoid configuration and permission errors and keeps your content available for discovery and search in Fusion. For network connectivity, your Fusion server or remote connector host must be able to reach the target site over HTTP/HTTPS on the required ports.
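A quick way to verify the connectivity prerequisite before configuring the datasource is a simple TCP reachability check. The helper below is a hypothetical convenience script, not part of Fusion.

```python
import socket

# Hypothetical helper (not part of Fusion): confirm the host running the
# connector can open a TCP connection to the target site before crawling.
def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check HTTPS reachability of the site you plan to crawl.
# can_reach("www.example.com", 443)
```

Run this from the Fusion server or remote connector host itself, since firewall rules may differ from your workstation's.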

Authentication

Setting up the correct authentication according to your organization’s data governance policies helps keep sensitive data secure while allowing authorized indexing. The Web V1 connector supports many HTTP-level authentication methods. You configure these under Crawl Authentication Properties in your datasource. The supported authentication schemes include the following:
  • Basic HTTP Authentication: Provide host, port, realm (if any), username, and password.
  • Digest HTTP Authentication: Same parameters as Basic, but using the Digest challenge/response scheme.
  • Form Authentication: Post to a login form URL with whatever name/value pairs the site expects, plus a TTL for the session.
  • SAML/Smart Form Authentication: For multi-step or SAML-backed form logins, submit a sequence of forms until you’re authenticated.
  • NTLM Authentication for Windows-style authentication: Provide domain, workstation, host, port, realm (if any), username, and password.
After you choose a scheme, Fusion automatically maintains cookies and session state for the duration of the crawl.

Crawl options

  • If you’re crawling a website protected by a login or SmartForm, see Crawl an Authenticated Website with the Web Connector.
  • If you’re crawling a CMS or ecommerce site, you may want to Deduplicate Web Content using Canonical Tags.
  • To prevent unbounded crawls or to only crawl a portion of your website, see Limit the Crawl Scope for Web Sites.
The sitemap_incremental_crawling configuration parameter controls how URLs found in the sitemap are processed and crawled. Set it to true to remove documents from the index when they can no longer be accessed as unique documents, for example, when a page has been removed from the sitemap and can no longer be reached. In addition, if a page is not in the sitemap, the connector classifies the missing page as unbounded and removes it from the index.
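A minimal sketch of enabling this behavior on a datasource is shown below. The sitemap_incremental_crawling name comes from the text above; the surrounding property names ("startLinks", "sitemapURLs") are illustrative assumptions.

```python
# Sketch: datasource properties enabling sitemap-driven incremental crawling.
# "startLinks" and "sitemapURLs" are assumed names for illustration.
properties = {
    "startLinks": ["https://www.example.com/"],
    "sitemapURLs": ["https://www.example.com/sitemap.xml"],
    # Remove indexed pages that are dropped from the sitemap and unreachable.
    "sitemap_incremental_crawling": True,
}
```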

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
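The difference comes from JSON string escaping in API request bodies: the backslash itself must be escaped, so the unescaped sequence you type in the UI becomes a double-backslash form over the API. This can be demonstrated with the standard json module:

```python
import json

# The UI accepts the unescaped sequence \t (backslash, then t).
ui_value = r"\t"

# In a JSON API payload the backslash must itself be escaped, so the
# same value is written \\t on the wire.
api_value = json.dumps(ui_value)

print(api_value)  # "\\t"

# Decoding the API form recovers exactly what was typed in the UI.
assert json.loads(api_value) == ui_value
```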
If you encounter CrawlDB errors such as “File is already opened and is locked”, increase the Alias Expiration setting.