Extract Content from Web Pages
The Web V1 connector retrieves content from a website over HTTP, starting from a specified URL.
The connector supports several approaches to extracting and filtering content from pages. When analyzing the HTML of a page, the connector can include or exclude elements based on the HTML tag name, the tag ID, or the tag class (such as a div tag, or the #content tag ID).
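The idea behind tag, ID, and class filtering can be sketched with Python's standard-library HTML parser. This is not the connector's implementation; the rule names (exclude_tags, exclude_ids, exclude_classes) are hypothetical and only illustrate how excluded subtrees are skipped during text extraction:

```python
from html.parser import HTMLParser

class FilteringExtractor(HTMLParser):
    """Collects page text, skipping subtrees whose tag, ID, or class is excluded."""
    def __init__(self, exclude_tags=(), exclude_ids=(), exclude_classes=()):
        super().__init__()
        self.exclude_tags = set(exclude_tags)
        self.exclude_ids = set(exclude_ids)
        self.exclude_classes = set(exclude_classes)
        self.skip_depth = 0   # > 0 while inside an excluded subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        excluded = (tag in self.exclude_tags
                    or attrs.get("id") in self.exclude_ids
                    or classes & self.exclude_classes)
        if self.skip_depth or excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

page = '<div id="content"><p>Keep me.</p></div><div class="nav">Skip me.</div>'
parser = FilteringExtractor(exclude_classes={"nav"})
parser.feed(page)
print(" ".join(parser.chunks))  # -> Keep me.
```

The same pattern applies to include rules: instead of skipping excluded subtrees, you would collect text only while inside a matching element.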
Specific tags can be selected to become fields of the document if needed. For example, all content from <h1> tags can be pulled into an h1 field and, with field mapping, transformed into document titles.
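A short sketch of that flow, again using only the standard library: capture the text of selected tags into per-tag fields, then apply a mapping step that renames h1 to title. The class and field names here are illustrative, not the connector's API:

```python
from html.parser import HTMLParser

class TagFieldExtractor(HTMLParser):
    """Captures the text of selected tags into per-tag document fields."""
    def __init__(self, field_tags=("h1",)):
        super().__init__()
        self.field_tags = set(field_tags)
        self.current = None    # tag we are currently capturing, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in self.field_tags:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields.setdefault(self.current, []).append(data.strip())

ex = TagFieldExtractor()
ex.feed("<h1>Page Title</h1><p>Body text.</p>")
# A later field-mapping step could rename h1 -> title:
mapped = {("title" if k == "h1" else k): v for k, v in ex.fields.items()}
print(mapped)  # -> {'title': ['Page Title']}
```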
For more advanced cases, you can use jsoup selectors to find elements to include in or exclude from the extracted content.
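jsoup selector syntax resembles CSS selectors. A few illustrative examples (the element names and paths below are hypothetical, not tied to any particular site):

```
div#content         the div element with ID "content"
p.summary           p elements with class "summary"
a[href^=/docs/]     links whose href attribute starts with /docs/
```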
While field mapping is generally a function of the index pipeline, you can define some initial mappings to occur during the crawl.
The "initial mappings" property for each web datasource is predefined with three mappings: move fetchedDates to a fetchedDates_dts field, move lastModified to a lastModified_dt field, and move length to a length_l field.
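Expressed as a list of mapping rules, the three defaults look roughly like the following. The field names come from the defaults above; the exact JSON shape of the property is illustrative and may differ in your Fusion version:

```
[
  {"source": "fetchedDates", "target": "fetchedDates_dts", "operation": "move"},
  {"source": "lastModified", "target": "lastModified_dt",  "operation": "move"},
  {"source": "length",       "target": "length_l",         "operation": "move"}
]
```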
Finally, the crawler can deduplicate crawled content. You can define a specific field to use for deduplication (such as title, or another field), or use the full raw content, which is the default. In the Fusion UI, when defining your datasource, toggle Advanced to access the Dedupe settings.
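The field-based versus raw-content choice can be sketched as follows: compute a signature from either the chosen field or the full raw content, and drop documents whose signature has already been seen. This is a conceptual sketch, not the crawler's actual algorithm, and the document keys (title, raw_content) are hypothetical:

```python
import hashlib

def dedupe_signature(doc, field=None):
    """Hash a specific field if given, otherwise the full raw content."""
    basis = doc[field] if field else doc["raw_content"]
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()

docs = [
    {"title": "Home", "raw_content": "<html>A</html>"},
    {"title": "Home", "raw_content": "<html>B</html>"},  # same title, new body
]

seen, unique = set(), []
for d in docs:
    sig = dedupe_signature(d, field="title")   # dedupe on the title field
    if sig not in seen:
        seen.add(sig)
        unique.append(d)
print(len(unique))  # -> 1 (second doc dropped as a duplicate title)
```

Note the trade-off this illustrates: deduplicating on title drops the second document even though its body differs, whereas hashing the raw content (field=None) would keep both.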