Web Crawler

For a Web crawler data source, Site Search indexes a website (possibly following links outside of the website).

Site Search indexes the website periodically. You can also reindex the website whenever you need to.

Supported websites

Site Search can crawl these websites:

  • Public Web – Websites must be on the public Web (Internet). They can’t be behind firewalls.

  • Server-side rendering – Site Search can index websites whose page content is rendered on the server. Client-side rendering (using JavaScript in browsers) isn’t supported. Crawling a website that uses client-side rendering indexes a single document, possibly without content.
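
One rough way to judge whether a site meets the server-side rendering requirement is to look at the raw HTML (before any JavaScript runs) and see how much readable text it contains. The following Python sketch is not part of Site Search. The names visible_text and looks_client_rendered and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a crawler that does not run JavaScript would see."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_client_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little text in the raw HTML suggests client-side rendering.

    The threshold is an arbitrary assumption, not a Site Search setting.
    """
    return len(visible_text(html)) < min_chars
```

To try it on a real page, download the HTML of the intended Start URL (for example, with urllib.request.urlopen) and pass the decoded text to looks_client_rendered. A result of True is only a hint that the crawl might index a single, nearly empty document.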

Add a Web crawler data source

To add a data source to index a website
  1. In the Site Search menu, click Add new data source, and then click Web crawler. If you are adding the first data source, just click Web crawler.

  2. On the Configuration tab, specify which website to crawl and which documents to index. Expand Show more options to see all of the options:

    Setting Description

    Start URL

    URL from which Site Search starts crawling (indexing) a website. Specify the URI scheme (http or https), ://, and the path, for example, https://www.mycompany.com/. The path must include the fully qualified domain name (for example, www.mycompany.com) and can extend to a lower level, for example, www.mycompany.com/news/. The trailing slash is optional. The path might include a file name, for example, www.mycompany.com/index.html.

    Note
    There are two things to note about the URL. First, Site Search must be able to connect to it. If you enter the URL in a browser and get the HTTP error 404 Not Found, then Site Search won’t be able to connect to that URL. This is frequently the case if the directory is an intermediate one in a longer path, and if it doesn’t contain an index.html page. To work around this, choose a directory higher up in the path or the domain level, and use exclusion criteria to omit other directories and files. Second, if a website URL contains a file name as the final part of the path, then don’t choose "Pages beginning with the Start URL" as the value of "Pages to include".

    Pages to include

    Site Search can use one of these strategies regarding links on a website:

    • All pages linked to by this site – Follow all links on the website. This can result in indexing web pages that are off-site.

    • Pages on this site and subdomains (default) – Follow links that lead to other pages on the website or on subdomains. Don’t follow links that lead offsite.

    • Pages beginning with the Start URL – Follow links that lead to other pages on the website. Don’t follow links across subdomains or that lead offsite. Page URLs must start with the Start URL that you entered at the top of the Configuration tab.

    Respect refresh redirection

    Allow Site Search to follow meta redirects.

    Limit crawl to # levels deep

    Maximum number of levels to crawl in linked web pages, or to follow in the site map hierarchy.

    Limit crawl to # documents

    Maximum number of documents to index.

    Site map URL

    (Optional) If specified, Site Search uses the hierarchy in the site map to determine which web pages to crawl, instead of following links. Site maps must be XML files that adhere to the Sitemap protocol.

    Maximum file size

    Files above this maximum size aren’t indexed.

    The largest value you can specify is 50 MB. The Web crawler data source doesn’t support indexing files that exceed 50 MB.

    Documents to exclude from the index

    Specify exclusion criteria, that is, a series of strings to match against the parts of document URLs after the domain name (not after the full path specified in Start URL). You can use the wildcard * (asterisk) to match any number of characters, so a single exclusion criterion can exclude multiple directories or files. How matching is done and what you should specify here depend on your selection for "Pages to include". For more information and examples, see Exclusion criteria.

    Data Source Topics

    Meaningful characterizations of the data from this data source, including the source of the data. A topic can be applied to one or more data sources, and a data source can have one or more topics. In a Topic Tabs module, search results are grouped by topic. Searches in Search Box modules and with the Search API can be limited by topic or topics.

  3. Click Save and Index.

    While Site Search is crawling a data source, it displays its activity (the number of documents it has indexed) in the lower left corner of the page.

    Tip
    You might need to refresh the page in the browser to see all of the documents that were found in the crawl.

  4. After the crawl has begun, specify information on the Display tab (how to index data and display results):

    1. (Possibly required) Specify whether to map fields, and how to map them:

      Setting Description

      Map fields

      Select this to map field names from the data source to other field names. For more information about mapping fields, see Mapping fields.

      Source Field Name

      Field name from the data source

      Target Field Name

      A different field name that you want to use in the search app

      Result Template (menu to the left of Edit Template)

      Choose a result template to use.

    2. (Optional) Choose a result template. Site Search preselects a result template based on your data source and shows it in the dropdown next to Edit Template. You can choose a different one.

    3. (Optional) Click Edit Template to edit the result template.

  5. If you have modified anything on the Display tab, click Save to save your changes.
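
The three "Pages to include" strategies from step 2 can be sketched as a link filter. This Python example is an illustration only, not the Site Search implementation. The function name should_follow and the strategy constants are hypothetical, and the sketch assumes that a subdomain is any host ending with the Start URL host and that "beginning with" means a literal string prefix. The product might define these differently.

```python
from urllib.parse import urlsplit

# Hypothetical identifiers for the three "Pages to include" options.
ALL_LINKED = "all_linked"
SITE_AND_SUBDOMAINS = "site_and_subdomains"
STARTS_WITH_START_URL = "starts_with_start_url"


def should_follow(start_url: str, link: str, strategy: str) -> bool:
    """Decide whether a crawler would follow `link` under the given strategy."""
    start_host = (urlsplit(start_url).hostname or "").lower()
    link_host = (urlsplit(link).hostname or "").lower()

    if strategy == ALL_LINKED:
        # Every link is followed, including links that lead offsite.
        return True
    if strategy == SITE_AND_SUBDOMAINS:
        # Same host, or a host nested under it (assumption about "subdomain").
        return link_host == start_host or link_host.endswith("." + start_host)
    if strategy == STARTS_WITH_START_URL:
        # The page URL must begin with the Start URL itself.
        return link.startswith(start_url)
    raise ValueError(f"unknown strategy: {strategy}")
```

Under the prefix assumption, a Start URL that ends with a file name, such as https://mycompany.com/index.html, matches almost no other page, which is consistent with the advice in the Start URL note.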

If Web crawler data sources overlap

We recommend that you not create Web crawler data sources that overlap. A specific document appears in the index only one time. If two (or more) Web crawler data sources crawl the same document, then the document is re-associated with whichever data source crawled it most recently.

Examples of overlapping data sources are:

  • Creating separate data sources for https://my.company.com/products (topic products) and https://my.company.com/products/zap (topic zap).

    Zap products would alternately be found under the tab Products or the tab Zap.

  • Creating a data source that brings in a set of documents directly and another that brings in the same documents through redirects. For example, the data source https://my.company.com/store/ (topic store) might bring in documents in https://my.company.com/used by meta redirects. A second data source https://my.company.com/used (topic used) would overlap with the first data source.

    Used products would alternately be found under the tab Store or the tab Used. The correct approach to having some documents appear in multiple categories is to use facets.
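
Before creating several Web crawler data sources, it can help to check their Start URLs for the first kind of overlap described above, where one Start URL sits underneath another. The following Python sketch is an assumption-laden illustration: the functions overlaps and find_overlaps are hypothetical, they compare only host and path, and they cannot detect overlap that arises through redirects or through links followed offsite.

```python
from itertools import combinations
from urllib.parse import urlsplit


def _host_and_path(url: str):
    """Return the lowercased host and the path without a trailing slash."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def overlaps(url_a: str, url_b: str) -> bool:
    """True if one Start URL is the same as, or a parent directory of, the other."""
    host_a, path_a = _host_and_path(url_a)
    host_b, path_b = _host_and_path(url_b)
    if host_a != host_b:
        return False
    shorter, longer = sorted((path_a, path_b), key=len)
    # Compare whole path segments so /products does not match /productszap.
    return longer == shorter or longer.startswith(shorter + "/")


def find_overlaps(start_urls):
    """Return every pair of Start URLs that overlap by path."""
    return [(a, b) for a, b in combinations(start_urls, 2) if overlaps(a, b)]
```

With the URLs from the examples above, find_overlaps reports the products and products/zap pair but not the store and used pair, because that overlap exists only through meta redirects.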

Troubleshooting Web crawler data sources

Following are difficulties that you might encounter with Web crawler data sources.

Connection errors

Problem – When you click Save and Index for a Web crawler data source, the status at the top of the page should change to Connecting and then Connected, at which point Site Search indexes the data source. If Site Search has difficulty connecting to the data source, the status changes to Connection error.

Possible causes of connection errors and what to try:

Possible cause What to try

Start URL is incorrect

Double-check the URL. Try accessing the website in a browser.

Website is down

Try accessing the website in a browser. If the website is down, index the data source when the website is back up.

Chosen values of Pages to include and Respect refresh redirection are problematic for this website

Choose different values of these settings to see whether different choices let Site Search connect to the website.
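
When you experiment with Respect refresh redirection, it can be useful to know whether the page at the Start URL contains a meta refresh redirect at all. The following Python sketch looks for one in a page's HTML. It is an illustration, not how Site Search detects redirects. The names MetaRefreshFinder and meta_refresh_target are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/new/location".
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                          attr.get("content", ""), re.IGNORECASE)
        if match:
            self.target = match.group(1).strip()


def meta_refresh_target(html: str, base_url: str):
    """Return the absolute redirect URL, or None if the page has no meta redirect."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return urljoin(base_url, finder.target) if finder.target else None
```

If meta_refresh_target returns a URL for the Start URL page, the crawl probably needs Respect refresh redirection selected to get past that page.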

Indexing speed

Problem – Indexing of a Web crawler data source is slow to start, or indexing progresses slowly, or both.

Several factors affect how long indexing takes:

  • It takes a little while for Site Search to start indexing pages.

  • Larger websites take longer to index.

  • Indexing multiple data sources at the same time causes indexing to take longer.

Regarding the status messages about the number of documents indexed:

  • Initial indexing – Site Search reports all documents indexed in batches, for example, 1, 640, 1413, 2069, 2739, etc.

  • Reindexing – Site Search crawls all of the documents, but only reports the documents that it updates in the index. So, you might see the status Updating… 0 docs for some time. If documents have changed on the website, Site Search increments the status for those. When Site Search finishes indexing the website, the Index Now button turns green again.

Content is not what was expected

Crawling the right documents and getting the ones you want in the index can take some research. If the indexed documents don’t match your expectations, then here are some things to try:

  • Specify different settings for "Pages to include" – The different "Pages to include" strategies produce different results. You might try them all, to see which works best for a specific website.

  • Determine whether "Respect refresh redirection" is needed – If the website Site Search is crawling uses meta redirects, then select Respect refresh redirection.

  • Use a site map – To restrict the documents crawled, you can remove items from the site map used by Site Search.
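
If you maintain a separate site map for Site Search, removing entries can be scripted. The following Python sketch lists the URLs in a Sitemap protocol file and drops the entries under an unwanted prefix. It is a minimal example under assumptions: the function names sitemap_urls and prune_sitemap are hypothetical, and the sketch handles a plain urlset file rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_urls(xml_text: str):
    """Return the text of every <loc> element in the site map."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text]


def prune_sitemap(xml_text: str, unwanted_prefix: str) -> str:
    """Return the site map XML without <url> entries that start with the prefix."""
    ET.register_namespace("", SITEMAP_NS)  # keep the default namespace on output
    root = ET.fromstring(xml_text)
    for url in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url.find(f"{{{SITEMAP_NS}}}loc")
        if loc is not None and loc.text and loc.text.strip().startswith(unwanted_prefix):
            root.remove(url)
    return ET.tostring(root, encoding="unicode")
```

The pruned XML would then be published at the address that you enter as the Site map URL for the data source.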

Little or no content is indexed

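  • Determine where the website is rendered – Site Search can crawl only content that is rendered on the server side. If the website renders its content in the browser with JavaScript, the crawl indexes a single document, possibly without content.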

  • Specify the correct Start URL – If you select "Pages beginning with the Start URL" for "Pages to include", specify a Start URL for a domain or directory (the path shouldn’t end with a file name).
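
The Start URL rules described on this page (http or https scheme, fully qualified domain name, and no file name when "Pages beginning with the Start URL" is selected) can be checked before saving a data source. The following Python sketch is illustrative only. The function check_start_url is hypothetical, and it treats a final path segment that contains a dot as a file name, which is a heuristic rather than a rule stated by Site Search.

```python
import posixpath
from urllib.parse import urlsplit


def check_start_url(start_url: str):
    """Return a list of likely problems with a Start URL (empty if none found)."""
    problems = []
    parts = urlsplit(start_url)

    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")

    if not parts.hostname or "." not in parts.hostname:
        problems.append("host should be a fully qualified domain name")
    else:
        # Heuristic: a dot in the last path segment suggests a file name.
        last_segment = posixpath.basename(parts.path)
        if "." in last_segment:
            problems.append(
                "path ends with a file name; avoid 'Pages beginning with the Start URL'"
            )

    return problems
```

For example, check_start_url("https://www.mycompany.com/index.html") returns a single warning about the file name, while a directory URL such as https://www.mycompany.com/news/ returns an empty list. The check doesn't confirm that the URL is reachable, so still open it in a browser to rule out a 404 Not Found error.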