Extract Content from Web Pages
The Web V1 connector retrieves content from a website over HTTP, starting from a specified URL.
The connector supports several approaches to extracting and filtering content from pages. When analyzing the HTML of a page, the connector can include or exclude elements based on the HTML tag name, the tag ID, or the tag class (such as a div tag, or the #content tag ID).
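The idea behind tag, ID, and class filtering can be sketched with Python's standard-library HTML parser. This is not the connector's implementation; the rule names (exclude_tags, exclude_ids, exclude_classes) are hypothetical and only illustrate how excluded subtrees are skipped during text extraction:

```python
from html.parser import HTMLParser

class FilteringExtractor(HTMLParser):
    """Collects page text, skipping subtrees whose tag, ID, or class is excluded."""
    def __init__(self, exclude_tags=(), exclude_ids=(), exclude_classes=()):
        super().__init__()
        self.exclude_tags = set(exclude_tags)
        self.exclude_ids = set(exclude_ids)
        self.exclude_classes = set(exclude_classes)
        self.skip_depth = 0   # > 0 while inside an excluded subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        excluded = (tag in self.exclude_tags
                    or attrs.get("id") in self.exclude_ids
                    or classes & self.exclude_classes)
        if self.skip_depth or excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

page = '<div id="content"><p>Keep me.</p></div><div class="nav">Skip me.</div>'
parser = FilteringExtractor(exclude_classes={"nav"})
parser.feed(page)
print(" ".join(parser.chunks))  # -> Keep me.
```

The same pattern applies to include rules: instead of skipping excluded subtrees, you would collect text only while inside a matching element.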
Specific tags can be selected to become fields of the document if needed. For example, all content from <h1> tags can be pulled into an h1 field and, with field mapping, transformed into document titles.
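A short sketch of that flow, again using only the standard library: capture the text of selected tags into per-tag fields, then apply a mapping step that renames h1 to title. The class and field names here are illustrative, not the connector's API:

```python
from html.parser import HTMLParser

class TagFieldExtractor(HTMLParser):
    """Captures the text of selected tags into per-tag document fields."""
    def __init__(self, field_tags=("h1",)):
        super().__init__()
        self.field_tags = set(field_tags)
        self.current = None    # tag we are currently capturing, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in self.field_tags:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields.setdefault(self.current, []).append(data.strip())

ex = TagFieldExtractor()
ex.feed("<h1>Page Title</h1><p>Body text.</p>")
# A later field-mapping step could rename h1 -> title:
mapped = {("title" if k == "h1" else k): v for k, v in ex.fields.items()}
print(mapped)  # -> {'title': ['Page Title']}
```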
For more advanced cases, you can use jsoup selectors to find elements to include in or exclude from the extracted content.
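jsoup selector syntax resembles CSS selectors. A few illustrative examples (the element names and paths below are hypothetical, not tied to any particular site):

```
div#content         the div element with ID "content"
p.summary           p elements with class "summary"
a[href^=/docs/]     links whose href attribute starts with /docs/
```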
While field mapping is generally a function of the index pipeline, you can define some initial mappings to occur during the crawl.
The "initial mappings" property for each web datasource is predefined with three mappings: move fetchedDates to a fetchedDates_dts field, move lastModified to a lastModified_dt field, and move length to a length_l field.
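Expressed as a list of mapping rules, the three defaults look roughly like the following. The field names come from the defaults above; the exact JSON shape of the property is illustrative and may differ in your Fusion version:

```
[
  {"source": "fetchedDates", "target": "fetchedDates_dts", "operation": "move"},
  {"source": "lastModified", "target": "lastModified_dt",  "operation": "move"},
  {"source": "length",       "target": "length_l",         "operation": "move"}
]
```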
Finally, the crawler can deduplicate crawled content. You can define a specific field to use for deduplication (such as title, or another field), or use the full raw content, which is the default. In the Fusion UI, when defining your datasource, toggle Advanced to access the Dedupe settings.
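The field-based versus raw-content choice can be sketched as follows: compute a signature from either the chosen field or the full raw content, and drop documents whose signature has already been seen. This is a conceptual sketch, not the crawler's actual algorithm, and the document keys (title, raw_content) are hypothetical:

```python
import hashlib

def dedupe_signature(doc, field=None):
    """Hash a specific field if given, otherwise the full raw content."""
    basis = doc[field] if field else doc["raw_content"]
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()

docs = [
    {"title": "Home", "raw_content": "<html>A</html>"},
    {"title": "Home", "raw_content": "<html>B</html>"},  # same title, new body
]

seen, unique = set(), []
for d in docs:
    sig = dedupe_signature(d, field="title")   # dedupe on the title field
    if sig not in seen:
        seen.add(sig)
        unique.append(d)
print(len(unique))  # -> 1 (second doc dropped as a duplicate title)
```

Note the trade-off this illustrates: deduplicating on title drops the second document even though its body differs, whereas hashing the raw content (field=None) would keep both.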