Deduplicate Web Content Using Canonical Tags

The Web connector retrieves data from a website over HTTP, starting from a specified URL.

In content management and online shopping systems, the same content is often accessible through multiple URLs. Content syndication helps you distribute content to different URLs and domains, consolidate link signals for duplicate or similar content, and track metrics for a single product or topic. However, it also creates challenges when people use search engines to reach your page, because the same content can be indexed under more than one URL.

The Fusion Web connector can use the canonical tags in your website’s HTML (link elements with rel="canonical") to deduplicate web pages.
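To make the idea concrete, here is a minimal Python sketch of canonical-tag deduplication in general. It is not Fusion's implementation: the names CanonicalLinkParser, canonical_url, and group_by_canonical, and the example.com URLs, are hypothetical and exist only for illustration. It assumes that each page declares its preferred URL in a link element with rel="canonical", and that pages without such a tag are treated as their own canonical version.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; values may be None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(page_url, html):
    """Return the page's declared canonical URL, or its own URL if none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return page_url
    # Resolve relative hrefs such as "/shoes" against the page URL.
    return urljoin(page_url, parser.canonical)


def group_by_canonical(pages):
    """Map each canonical URL to the list of fetched URLs that point to it."""
    groups = {}
    for url, html in pages.items():
        groups.setdefault(canonical_url(url, html), []).append(url)
    return groups


# Hypothetical crawl results: two URLs serve the same product page.
pages = {
    "https://example.com/shoes?color=red":
        '<head><link rel="canonical" href="https://example.com/shoes"></head>',
    "https://example.com/sale/shoes":
        '<head><link rel="canonical" href="/shoes"></head>',
    "https://example.com/about":
        "<head><title>About</title></head>",
}
groups = group_by_canonical(pages)
```

In this sketch, both shoe URLs end up grouped under https://example.com/shoes, so an indexer could keep a single document for them, while the About page, which has no canonical tag, remains its own entry.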

To deduplicate web pages using canonical tags in the Fusion UI:
  1. When configuring your Web datasource, toggle Advanced at the top of the page.

  2. Under Dedupe, click Dedupe documents.

  3. Make sure Deduplication via canonical tag is checked.