- Supported websites
- Add a Web crawler data source
- If Web crawler data sources overlap
- Troubleshooting Web crawler data sources
For a Web crawler data source, Site Search indexes a website (possibly following links outside of the website).
Site Search indexes the website periodically. You can also reindex the website as needed.
Site Search can crawl these websites:
Public Web – Websites must be on the public Web (Internet). They can’t be behind firewalls.
Add a Web crawler data source
In the Site Search menu, click Add new data source, and then click Web crawler. If you are adding the first data source, just click Web crawler.
On the Configuration tab, specify which website to crawl and which documents to index. Expand Show more options to see all of the options:
Start URL
URL from which Site Search starts crawling (indexing) a website. Specify the URI scheme (the part that ends with ://) and the path, for example, https://www.mycompany.com/. The path must include the fully qualified domain name (for example, www.mycompany.com) and can extend to a lower level, for example, www.mycompany.com/news/. The trailing slash is optional. The path might include a file name.
There are two things to note about the URL. First, Site Search must be able to connect to it. If you enter the URL in a browser and get the HTTP error 404 Not Found, then Site Search won’t be able to connect to that URL. This is frequently the case if the directory is an intermediate one in a longer path and doesn’t contain an index.html page. To work around this, choose a directory higher up in the path or the domain level, and use exclusion criteria to omit other directories and files. Second, if the Start URL contains a file name as the final part of the path, then don’t choose "Pages beginning with the Start URL" as the value of "Pages to include".
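The two notes above can be checked before you enter a Start URL. The following is a minimal sketch, not part of Site Search: the function name `describe_start_url` and the rule "a final path segment that contains a dot is a file name" are assumptions made for illustration, and the sketch doesn't test whether the URL is reachable.

```python
import posixpath
from urllib.parse import urlsplit


def describe_start_url(url: str) -> dict:
    """Summarize the parts of a candidate Start URL (hypothetical helper)."""
    parts = urlsplit(url)
    # basename() returns "" when the path is empty or ends with "/".
    last_segment = posixpath.basename(parts.path)
    return {
        # A usable Start URL needs both a scheme and a host name.
        "has_scheme_and_host": bool(parts.scheme) and bool(parts.hostname),
        "host": parts.hostname or "",
        # Assumption: a dot in the final segment means it is a file name.
        "ends_with_file_name": "." in last_segment,
    }


if __name__ == "__main__":
    for candidate in (
        "https://www.mycompany.com/news/",
        "https://www.mycompany.com/news/index.html",
        "www.mycompany.com",
    ):
        print(candidate, describe_start_url(candidate))
```

If `ends_with_file_name` is true, the second note applies: avoid "Pages beginning with the Start URL" for that data source.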
Pages to include
Site Search can use one of these strategies regarding links on a website:
All pages linked to by this site – Follow all links on the website. This can result in indexing web pages that are off-site.
Pages on this site and subdomains (default) – Follow links that lead to other pages on the website or on subdomains. Don’t follow links that lead offsite.
Pages beginning with the Start URL – Follow links that lead to other pages on the website. Don’t follow links across subdomains or that lead offsite. Page URLs must start with the website URL that you entered at the top of the Configuration tab.
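To compare the three strategies, it can help to think of each one as a yes/no decision per link. The sketch below is one plausible reading of the descriptions above, not Site Search's actual logic. The strategy identifiers (`all_linked`, `site_and_subdomains`, `begins_with_start_url`) are hypothetical, and treating "subdomain" as "host name that ends with the Start URL's host name" is an assumption.

```python
from urllib.parse import urlsplit


def should_follow(start_url: str, link: str, strategy: str) -> bool:
    """Decide whether a crawler would follow `link` (illustrative only)."""
    start_host = (urlsplit(start_url).hostname or "").lower()
    link_host = (urlsplit(link).hostname or "").lower()

    if strategy == "all_linked":
        # Follow everything, including off-site links.
        return True
    if strategy == "site_and_subdomains":
        # Same host, or a host nested under it (assumed meaning of subdomain).
        return link_host == start_host or link_host.endswith("." + start_host)
    if strategy == "begins_with_start_url":
        # The page URL must literally start with the Start URL.
        return link.startswith(start_url)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    start = "https://www.mycompany.com/news/"
    for link in (
        "https://www.mycompany.com/news/2024/",
        "https://www.mycompany.com/about/",
        "https://partner.example.org/",
    ):
        for strategy in ("all_linked", "site_and_subdomains", "begins_with_start_url"):
            print(link, strategy, should_follow(start, link, strategy))
```

Under this reading, a Start URL that ends with a file name (for example, a hypothetical https://www.mycompany.com/news/index.html) matches almost no other page with the prefix strategy, which is consistent with the advice not to combine the two.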
Respect refresh redirection
Allow Site Search to follow meta redirects.
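A meta redirect is an HTML `<meta http-equiv="refresh">` tag whose `content` attribute names a target URL. To check whether a page on your website uses one, you can look for that tag. The following sketch uses only the Python standard library. The class and function names are hypothetical, and the sample page is invented for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaRefreshFinder(HTMLParser):
    """Records the target of the first meta refresh tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.target: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=https://example.com/new".
        _, _, rest = attributes.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value.strip():
            self.target = value.strip().strip("'\"")


def find_meta_redirect(html: str) -> Optional[str]:
    """Return the meta refresh target URL, or None if the page has none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target


if __name__ == "__main__":
    page = (
        '<html><head><meta http-equiv="refresh" '
        'content="0; url=https://my.company.com/used"></head></html>'
    )
    print(find_meta_redirect(page))
```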
Limit crawl to # levels deep
Maximum number of levels to crawl in linked web pages, or to follow in the site map hierarchy
Limit crawl to # documents
Maximum number of documents to index
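The depth limit and the document limit interact: whichever limit is reached first stops the crawl. The sketch below illustrates this with a breadth-first traversal over a small in-memory link graph. It is not Site Search's implementation. The function name, the sample site, and the choice to count the start page as level 0 are all assumptions.

```python
from collections import deque


def crawl(links: dict, start: str, max_depth: int, max_documents: int) -> list:
    """Return pages in the order a breadth-first crawl would index them."""
    indexed = []
    seen = {start}
    queue = deque([(start, 0)])  # (page, depth); the start page is depth 0
    while queue and len(indexed) < max_documents:
        page, depth = queue.popleft()
        indexed.append(page)
        if depth == max_depth:
            continue  # at the depth limit: index the page, follow no links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return indexed


if __name__ == "__main__":
    # Hypothetical site: each key links to the pages in its list.
    site = {
        "/": ["/a", "/b"],
        "/a": ["/a/1"],
        "/b": ["/b/1"],
        "/a/1": ["/a/1/x"],
    }
    print(crawl(site, "/", max_depth=1, max_documents=100))
    print(crawl(site, "/", max_depth=5, max_documents=2))
```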
Site map URL
(Optional) If specified, Site Search uses the hierarchy in the site map to determine which web pages to crawl, instead of following links. Site maps must be XML files that adhere to the Sitemap protocol.
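Before you enter a Site map URL, you might want to list the page URLs that the site map contains. The sketch below reads a Sitemap-protocol `urlset` document with the Python standard library. The function name and the sample XML are assumptions for illustration, and the sketch handles only `urlset` files, not sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list:
    """Return the <loc> values of every <url> entry in a urlset document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mycompany.com/</loc></url>
  <url><loc>https://www.mycompany.com/news/</loc></url>
</urlset>"""
    for url in sitemap_urls(sample):
        print(url)
```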
Maximum file size
Files above this maximum size aren’t indexed.
The largest maximum file size you can specify is 50 MB. The Web crawler data source doesn’t support indexing files that exceed 50 MB.
Exclusion criteria
Documents to exclude from the index. Specify exclusion criteria, that is, a series of strings to match against the part of each document URL after the domain name (not after the full path specified in Start URL). You can use the wildcard * (asterisk) to match any number of characters, so a single exclusion criterion can exclude multiple directories or files. How matching is done and what you should specify here depend on your selection for "Pages to include". For more information and examples, see Exclusion criteria.
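To reason about which documents a set of criteria would exclude, you can model the wildcard matching yourself. The sketch below is an approximation only: it assumes each criterion must match the whole part of the URL after the domain name, leading slash included, and it doesn't reproduce the way matching varies with "Pages to include". The function name and the sample criteria are hypothetical.

```python
import re
from urllib.parse import urlsplit


def is_excluded(url: str, criteria: list) -> bool:
    """Return True if any criterion matches the URL part after the domain."""
    parts = urlsplit(url)
    after_domain = parts.path + ("?" + parts.query if parts.query else "")
    for criterion in criteria:
        # Escape everything, then turn each literal * into "any characters".
        pattern = re.escape(criterion).replace(r"\*", ".*")
        if re.fullmatch(pattern, after_domain):
            return True
    return False


if __name__ == "__main__":
    criteria = ["/news/archive/*", "*.pdf"]  # hypothetical criteria
    for url in (
        "https://www.mycompany.com/news/archive/2019/a.html",
        "https://www.mycompany.com/docs/manual.pdf",
        "https://www.mycompany.com/news/today.html",
    ):
        print(url, is_excluded(url, criteria))
```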
Data Source Topics
Meaningful characterizations of the data from this data source, including the source of the data. A topic can be applied to one or more data sources, and a data source can have one or more topics. In a Topic Tabs module, search results are grouped by topic. Searches in Search Box modules and with the Search API can be limited by topic or topics.
Click Save and Index.
While Site Search is crawling a data source, it displays its activity (the number of documents it has indexed) in the lower left corner of the page.
Tip
You might need to refresh the page in the browser to see all of the documents that were found in the crawl.
After the crawl has begun, specify information on the Display tab (how to index data and display results):
(Possibly required) Specify whether to map fields, and how to map them:
Select this to map field names from the data source to other field names. For more information about mapping fields, see Mapping fields.
Source Field Name
Field name from the data source
Target Field Name
A different field name that you want to use in the search app
Result Template (menu to the left of Edit Template)
Choose a result template to use.
(Optional) Choose a result template. Site Search has already chosen the best result template based on your data source, which is shown in the dropdown next to Edit Template. If you want to choose a different result template, you can.
(Optional) Click Edit Template to edit the result template.
If you have modified anything on the Display tab, click Save to save your changes.
If Web crawler data sources overlap
We recommend that you not create Web crawler data sources that overlap. A specific document appears in the index only one time. If two (or more) Web crawler data sources crawl the same document, then the document is associated with whichever data source crawled it most recently, and that association changes each time a different data source recrawls it.
Examples of overlapping data sources are:
Creating separate data sources for
Zap products would alternately be found under the tab Products or the tab
Creating a data source that brings in a set of documents directly and another that brings in the same documents through redirects. For example, the data source (store) might bring in documents in https://my.company.com/used by meta redirects. A second data source (used) would overlap with the first data source.
Used products would alternately be found under the tab Store or the tab Used. The correct approach to having some documents appear in multiple categories is to use facets.
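The effect of overlapping data sources can be modeled as an index keyed by document URL, where each crawl overwrites the owner of every document it touches. The sketch below is a simplified illustration under that assumption, not a description of Site Search internals. The function name and the document URLs are hypothetical.

```python
def crawl_into(index: dict, data_source: str, urls: list) -> None:
    """Record `data_source` as the owner of every URL it crawls."""
    for url in urls:
        # A document appears in the index only once, so the latest crawl wins.
        index[url] = data_source


if __name__ == "__main__":
    index = {}
    shared = "https://my.company.com/used/item-1"  # hypothetical document

    crawl_into(index, "store", [shared, "https://my.company.com/store/home"])
    crawl_into(index, "used", [shared])
    print(index[shared])  # owner after the "used" crawl

    crawl_into(index, "store", [shared])
    print(index[shared])  # owner flips back after the next "store" crawl
```

Because ownership depends on crawl order, the tab under which a shared document appears changes over time, which is why facets are the recommended approach.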
Troubleshooting Web crawler data sources
You might encounter the following difficulties with Web crawler data sources.
Problem – When you click Save and Index for a Web crawler data source, the status at the top of the page should change to Connecting and then Connected, at which point Site Search indexes the data source. If Site Search has difficulty connecting to the data source, the status reports a connection error instead.
Possible causes of connection errors and what to try:
|Possible cause|What to try|
|---|---|
|Start URL is incorrect|Double-check the URL. Try accessing the website in a browser.|
|Website is down|Try accessing the website in a browser. If the website is down, index the data source when the website is back up.|
|Chosen values of|Choose different values of these settings to see whether different choices let Site Search connect to the website.|
Problem – Indexing of a Web crawler data source is slow to start, or indexing progresses slowly.
Indexing time depends on several factors:
It takes a little while for Site Search to start indexing pages.
Larger websites take longer to index.
Indexing multiple data sources at the same time causes indexing to take longer.
Regarding the status messages about the number of documents indexed:
Initial indexing – Site Search reports all documents indexed in batches, for example, 1, 640, 1413, 2069, 2739, etc.
Reindexing – Site Search crawls all of the documents, but only reports the documents that it updates in the index. So, you might see the status Updating… 0 docs for some time. If documents have changed on the website, Site Search increments the status for those. When Site Search finishes indexing the website, the Index Now button turns green again.
Content is not what was expected
Crawling the right documents and getting the ones you want in the index can take some research. If the indexed documents don’t match your expectations, then here are some things to try:
Specify different settings for "Pages to include" – The different "Pages to include" strategies produce different results. You might try them all, to see which works best for a specific website.
Determine whether "Respect refresh redirection" is needed – If the website Site Search is crawling uses meta redirects, then select Respect refresh redirection.
Use a site map – To restrict the documents crawled, you can remove items from the site map used by Site Search.
Little or no content is indexed
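Determine where the website is rendered – Site Search supports crawling of content that is rendered on the server side. If the website builds its pages in the browser instead, the crawler might find little or no content to index.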
Specify the correct Start URL – If you select "Pages beginning with the Start URL" for "Pages to include", specify a Start URL for a domain or directory (the path shouldn’t end with a file name).