Limit the Crawl Scope for Web Sites
The Web V1 connector retrieves data from a Web site over HTTP, starting from a specified URL.
The connector starts at the seed page or pages (the startURIs specified in the configuration form), collects their content for indexing, and extracts any links to other pages. It then follows those links to collect content from other pages, extracts the links found on those pages, and so on.
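Conceptually, the crawl behaves like the sketch below. The crawl function and the fetch_and_extract callable are illustrative only, not part of the connector's API; the sketch assumes fetch_and_extract returns a page's content together with the links found on it.

from collections import deque
from urllib.parse import urljoin

def crawl(start_uris, fetch_and_extract, max_depth):
    """Breadth-first sketch: visit each page, yield its content, then follow its links."""
    queue = deque((uri, 0) for uri in start_uris)   # (URI, number of hops from a seed)
    seen = set(start_uris)
    while queue:
        uri, depth = queue.popleft()
        content, links = fetch_and_extract(uri)     # fetch the page and collect its outgoing links
        yield uri, content                          # the content is handed off for indexing
        if depth >= max_depth:
            continue                                # do not follow links past the depth limit
        for link in links:
            link = urljoin(uri, link)               # resolve relative links against the page URI
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))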
When creating a Web data source, pay attention to the Max crawl depth and Restrict To Tree parameters (c.depth and c.restrictToTree in the REST API).
These properties limit the scope of your crawl to prevent an unbounded crawl that could continue for a long time,
particularly if you are crawling a site with links to many pages outside the main site. An unbounded crawl can also cause memory errors in your system.
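For example, the relevant portion of a data source configuration might look roughly like the following sketch. Only startURIs, c.depth, and c.restrictToTree are taken from this section; the values, the surrounding structure, and the request used to submit the configuration depend on your deployment.

# Illustrative values only; submit them however your deployment's REST API expects.
datasource_properties = {
    "startURIs": ["https://example.com/docs/"],  # seed page(s) for the crawl
    "c.depth": 2,                # follow links no more than 2 hops from a seed page
    "c.restrictToTree": True,    # stay within the tree under the start URIs
}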
The connector keeps track of URIs it has seen, and many of the properties relate to managing the resulting database of entries. If the connector finds a standard redirect, it records that the redirected URI has an alias and does not re-evaluate that URI on subsequent runs until the alias expiration has passed. If deduplication is enabled, documents found to be duplicates are also added to the alias list and are not re-evaluated until the alias expiration has passed.
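The alias bookkeeping can be pictured roughly as follows; the expiration window, the function names, and the in-memory dictionary are illustrative and do not reflect the connector's actual storage.

import time

ALIAS_EXPIRATION_SECONDS = 7 * 24 * 3600   # assumed expiration window, for illustration only
aliases = {}                               # alias URI -> time after which it may be re-evaluated

def record_alias(uri):
    """Mark a URI as an alias: a redirect target already known, or a duplicate document."""
    aliases[uri] = time.time() + ALIAS_EXPIRATION_SECONDS

def should_evaluate(uri):
    """Aliased URIs are skipped on later runs until their expiration has passed."""
    expires_at = aliases.get(uri)
    return expires_at is None or time.time() >= expires_at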
Regular expressions can be used to restrict the crawl, either by defining URI patterns that should be followed or URI patterns that should not be followed. Additionally, specific URI patterns can be defined to exclude individual URIs from the crawl.
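A filter based on such patterns might look like the sketch below. The example patterns and the should_follow helper are illustrative only; the configuration properties that accept these expressions are not covered here.

import re

include_patterns = [re.compile(r"^https://example\.com/docs/")]             # follow only URIs under this tree
exclude_patterns = [re.compile(r"\.(pdf|zip)$"), re.compile(r"/archive/")]  # skip downloads and archived pages

def should_follow(uri):
    """Follow a URI only if it matches an include pattern and no exclude pattern."""
    if include_patterns and not any(p.search(uri) for p in include_patterns):
        return False
    return not any(p.search(uri) for p in exclude_patterns)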