Exclude Documents

Not every page on a website is useful as a search result. Using exclusion criteria to constrain search results can be useful when:

  • Websites have file types that contain content you don’t want to index, for example, *.xml, *.rss, *.css, *.jsp, *.php, and *.json files.

  • Websites contain landing pages, support pages, contact pages, archives, and so forth. When a user searches for a product, these pages can appear among the search results.

  • Websites have major sections, for example, Products, Used Products, Technology, and so forth, some of which are relevant for a specific search module and some of which aren’t. Depending on specific search objectives, you can either "build down" (index an entire website, excluding what isn’t of interest) or "build up" (index parts of a website as separate data sources).

  • Pages on some subdomains of the website aren’t of interest.

Comparison of exclusion and blocking

Site Search provides two ways to omit unnecessary or undesired documents from search results:

Approach Description

Exclude documents
(Web Crawler)

Omit documents from the index when a web data source is crawled. Site Search omits documents by comparing a series of exclusion criteria (which are regular expressions) with parts of the uniform resource locators (URLs) for the web pages (the path component and the query component). A web page (or other document, such as a CSS file) is excluded if any exclusion criterion matches the compared part of the resource’s URL.

Exclusion criteria can contain characters that are valid in URLs as well as * (asterisk), which matches zero or more characters. Matching is case sensitive.

An excluded document is not in the index, so users won’t be able to find the document by searching in embedded Site Search modules or in search apps that use the Site Search APIs.

Including previously excluded documents or excluding previously included ones requires reindexing the data source.

Block documents
(All data sources)

Omit specific documents one-by-one from search results for all queries.

Blocked documents are in the index, so blocking a document or unblocking it doesn’t necessitate reindexing.

Important
Excluding and blocking documents are not intended to provide data security or privacy. Site Search is intended for use with the public Web. Excluded documents still exist on the indexed website. Users can find the documents by searching on the source websites or by having document URLs. Similarly, blocked documents still exist in the data sources from which they were indexed. Also, changes in document names and locations on a website can undo exclusion or blocking of documents.

For information about blocking documents, see Block documents.

Exclude documents

Websites consist of files organized in directories. Some of the files and directories might be of interest to search users and others not. You can exclude files and directories that aren’t of interest to search users by specifying exclusion criteria (which are simple regular expressions) when adding or configuring a data source.

Site Search determines which documents to exclude from the index by comparing the exclusion criteria with parts of the uniform resource locators (URLs) (the path and query components) of all of the web pages and other files (for example, CSS files) that it crawls on the website. Documents for which there are matches are excluded from the index.

Workflow

  1. Add or configure a data source without specifying exclusion criteria. Even if you don’t intend to keep the facets, it’s helpful to add facets for file_extension, domain, and path segments, for example, path_1 and path_2. These facets give you insight into which files are present and how they are organized.

  2. Save and index the data source.

  3. See what you got. Browse the results and perform searches. Click facets. Make notes about what to exclude, and consider the expressions needed to do so.

  4. Add exclusion criteria.

  5. Iterate. Repeat steps 4, 2, and 3 until you’ve excluded what you want to.
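
If you keep a list of the URLs you notice in step 3, a short script can tally them in the same spirit as the file_extension and path_1 facets. This is only a sketch: the URL list is hypothetical, the `summarize` helper is invented for illustration, and nothing here calls Site Search itself.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

def summarize(urls):
    """Count file extensions and first path segments of the given URLs."""
    extensions, top_segments = Counter(), Counter()
    for url in urls:
        path = urlsplit(url).path
        # Extension of the last path segment, or a marker when there is none.
        extensions[splitext(path)[1] or "(none)"] += 1
        # First path segment, or a marker for the site root.
        top_segments[path.strip("/").split("/")[0] or "(root)"] += 1
    return extensions, top_segments

# Hypothetical URLs noted while browsing the unfiltered index.
crawled = [
    "http://mycorp.com/",
    "http://mycorp.com/products/widget.html",
    "http://mycorp.com/documentation/navigation.html?page=2",
    "http://mycorp.com/documentation/styles/site.css",
    "http://mycorp.com/support/feed.rss",
]
extensions, top_segments = summarize(crawled)
print(extensions.most_common())
print(top_segments.most_common())
```

Extensions or top-level segments that turn up often but aren’t useful as search results are natural candidates for exclusion criteria in step 4.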

Part of URLs matched

In the examples here, matched parts are shown in green and underlined. Unmatched parts are in red.

The part of the URL that is matched is:

  • If there is a path component – The path component, omitting the leading slash that precedes it, and the query component (if present), but not a fragment component (if present):

    [Figure: Parts of URL with path matched]

    The leading slash that precedes the path component is indicated here in red:

    [Figure: The first slash]

  • If there is no path component – The query component but not a fragment component (if present):

    [Figure: Parts of URL without path matched]

Here are some example URLs, indicating the part of each that is matched:

[Figure: URL match-part examples]
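
The rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not Site Search code: `match_part` is a hypothetical helper, and keeping the question mark in front of the query component is an assumption based on the `*?*` example later on this page.

```python
from urllib.parse import urlsplit

def match_part(url: str) -> str:
    """Return the part of a URL that exclusion criteria are compared with."""
    parts = urlsplit(url)
    # Drop only the single slash that separates the authority from the path.
    path = parts.path[1:] if parts.path.startswith("/") else parts.path
    # Keep the query component (assumed to include its "?"); drop the fragment.
    return path + ("?" + parts.query if parts.query else "")

for url in (
    "http://mycorp.com/documentation/admin/server.html",
    "http://mycorp.com/documentation/navigation.html?page=2#top",
    "http://mycorp.com?page=2",
):
    print(f"{url} -> {match_part(url)!r}")
```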

Exclusion criteria

Exclusion criteria are a series of strings to match against document URLs. For each exclusion criterion, Site Search matches the string you provide against a specific part of each document URL that the web crawler finds.

Exclusion criteria can contain characters that are valid in URLs as well as * (asterisk), which indicates zero or more characters.

Note
Always specify exclusion criteria to match the entire path component (and possibly query component), irrespective of the choice for the Start URL or whether the Start URL contains a path component. The / (slash) in the Admin UI before what you enter represents the slash you omit at the beginning of the path.
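
To reason about a criterion before you save it, it can help to model the matching locally. The sketch below assumes that every character except * is literal, that the comparison is case sensitive, and that a criterion must cover the entire compared text; `criterion_to_regex` and `is_excluded` are hypothetical names, not part of any Site Search API.

```python
import re

def criterion_to_regex(criterion: str) -> re.Pattern:
    # Escape each literal piece, then join the pieces with "zero or more characters".
    pieces = (re.escape(piece) for piece in criterion.split("*"))
    return re.compile(".*".join(pieces), re.DOTALL)

def is_excluded(match_part: str, criteria: list[str]) -> bool:
    # A document is excluded if any criterion covers the whole compared text.
    return any(criterion_to_regex(c).fullmatch(match_part) for c in criteria)

criteria = ["documentation/*", "*.css"]
print(is_excluded("documentation/admin/server.html", criteria))          # True
print(is_excluded("support/documentation/admin/server.html", criteria))  # False
print(is_excluded("styles/site.css", criteria))                          # True
print(is_excluded("Documentation/index.html", criteria))                 # False (case sensitive)
```

The first argument is the compared text, that is, the path without its leading slash, which is why the criteria don’t begin with a slash either.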

Matching the path component

When present (most of the time), the path component follows the authority component, separated by a slash (the one Site Search doesn’t include in comparisons). The path component consists of one or more path segments, typically referencing directories and files.

This table provides examples of exclusion criteria designed to match the path components of document URLs.

Exclusion goal Approach and examples

Exclude all files in matching top-level directories, including all files in directories below the top-level directories.

Specify an expression to match the desired top-level path segments. End the expression with a terminal /* to match all directories and files below that.

Notice that the expression doesn’t begin with an *.

Example:
documentation/*

Excludes (match): http://mycorp.com/documentation/admin/server.html
Excludes (match): http://mycorp.com/documentation/navigation.html?page=2
Excludes (match): http://mycorp.com/documentation/
Includes (not a match): http://mycorp.com/support/documentation/admin/server.html
Includes (not a match): http://mycorp.com/support/documentation

Exclude specific files in matching top-level directories.

Specify an expression to match the desired top-level path segments. End the expression with a file name and extension, possibly using the wildcard * in the file name and/or extension.

Notice that the expression doesn’t begin with an *.

Example:
documentation/navigation.html

Excludes (match): http://mycorp.com/documentation/navigation.html
Includes (not a match): http://mycorp.com/documentation/navigation.html?page=2
Includes (not a match): http://mycorp.com/documentation/navigation
Includes (not a match): http://mycorp.com/support/documentation/navigation.html

Exclude all files in matching sub-top-level directories.

Specify the full path to the directory. End the expression with a terminal /* to match all directories and files below that, and query components if present.

Example:
*/documentation/*

Excludes (match): http://mycorp.com/support/documentation/admin/server.html
Includes (not a match): http://mycorp.com/documentation/admin/server.html

Exclude specific files in matching sub-top-level directories.

Specify the full path to the directory. End the expression with a file name and extension, possibly using the wildcard * in the file name and/or extension.

If used, a terminal * matches the entire remainder of the path component, and a query component if present.

Example:
*/documentation/navigation.html

Excludes (match): http://mycorp.com/support/documentation/navigation.html
Includes (not a match): http://mycorp.com/documentation/navigation.html
Includes (not a match): http://mycorp.com/support/documentation/navigation.html?page=2
Includes (not a match): http://mycorp.com/support/documentation/navigation

The table focuses on how to begin and end an exclusion criterion. In the middle of a criterion, an * also matches zero or more characters, just as it does at the beginning and end. Here are some examples of exclusion criteria that contain asterisks in the middle of the expressions:

doc*/*
doc*/nav*
*/doc*/*
*/doc*/nav*.html
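
To see how these four criteria differ, the following sketch checks each one against a few hypothetical path components (leading slash already removed). It assumes the same simplified model as the rest of this page: everything except * is literal, and the criterion must match the entire compared text. The `matches` helper is invented for illustration.

```python
import re

def matches(criterion: str, text: str) -> bool:
    # Literal pieces joined by "zero or more characters"; whole-text match.
    regex = ".*".join(re.escape(piece) for piece in criterion.split("*"))
    return re.fullmatch(regex, text, re.DOTALL) is not None

# Hypothetical path components.
samples = [
    "docs/index.html",
    "documentation/navigation.html",
    "support/docs/nav-left.html",
    "support/docs/index.html",
]
for criterion in ("doc*/*", "doc*/nav*", "*/doc*/*", "*/doc*/nav*.html"):
    hits = [s for s in samples if matches(criterion, s)]
    print(f"{criterion:18} -> {hits}")
```

In this model, the criteria that begin with doc match only top-level directories, while those that begin with */ match only directories below the top level.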

Matching the query component

When present (uncommonly), the query component immediately follows the authority or path with no intervening slash.

You might want to exclude every document whose URL has a query component. Alternatively, you might want to exclude only documents that have specific query components.

This table provides examples of exclusion criteria designed to match the query components of document URLs.

Exclusion goal Approach and examples

Exclude all documents referenced by URLs that have query components

Specify an * to match all path segments. Specify the question mark that starts the query component, and then the wildcard * to match all query components.

Example:
*?*

Excludes (match): http://mycorp.com/documentation/navigation.html?page=2
Includes (not a match): http://mycorp.com/documentation/navigation.html

Exclude specific documents referenced by URLs that have specific query components.

Specify the desired string for matching the path component. Follow it with the specific query component, including the question mark.

Example:
shop/used?category=145

Excludes (match): http://mycorp.com/shop/used?category=145
Includes (not a match): http://mycorp.com/shop/used?category=173
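
The two query examples can be checked with a small sketch. It assumes that the question mark in a criterion is a literal character (unlike in full regular-expression syntax) and that the compared text is the path without its leading slash followed by ?query. The `is_excluded` helper is hypothetical.

```python
import re
from urllib.parse import urlsplit

def is_excluded(url: str, criterion: str) -> bool:
    parts = urlsplit(url)
    # Compared text: path without its leading slash, plus "?query" if present.
    text = parts.path[1:] + ("?" + parts.query if parts.query else "")
    # Everything except * is literal, so ? means a real question mark here.
    regex = ".*".join(re.escape(piece) for piece in criterion.split("*"))
    return re.fullmatch(regex, text, re.DOTALL) is not None

print(is_excluded("http://mycorp.com/documentation/navigation.html?page=2", "*?*"))  # True
print(is_excluded("http://mycorp.com/documentation/navigation.html", "*?*"))         # False
print(is_excluded("http://mycorp.com/shop/used?category=145", "shop/used?category=145"))  # True
print(is_excluded("http://mycorp.com/shop/used?category=173", "shop/used?category=145"))  # False
```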

Change exclusion criteria

After some experience searching, you might decide that you want to exclude additional documents or re-include excluded ones.

Note
When you change exclusion criteria, Site Search discards the current index and reindexes using the new exclusion criteria.

To change exclusion criteria for documents

  1. In the Site Search menu, click the data source for which you want to change exclusion criteria.

  2. Under Exclude documents, add and remove exclusion criteria:

    • To add an exclusion criterion, click Add or scroll to the bottom of the list of criteria, and then enter an expression that matches the documents to exclude.

    • To remove an exclusion criterion, hover over the criterion, and then click Delete.

  3. Click Save and Index to save the changed exclusion criteria and reindex the data source. When reindexing completes, search results no longer include documents that match the new exclusion criteria.