Web data source
Understanding the web data source configuration helps you see how the data source crawls and indexes documents. You can then adjust the settings as needed to get the results you want.
Web data source settings
These settings control how a Springboard app crawls and indexes a website.
- For conceptual information, see Data sources.
- For information about the user interface screen, see Data sources screen.
- For information about adding and managing data sources, see Manage Springboard data sources.
Data source name
In the Data source name field, enter a unique name that lets you easily identify the data source configuration from the list in that Springboard application.
Region
In the Region field, select a region from the drop-down menu. Your data is ingested in this region. Choosing a region allows you to place your Springboard applications in the geographic region closest to your end users. Regions also allow you to adhere to any regulatory requirements.
You cannot change the region of an existing data source.
Start URL
For Start URL, enter the URL where the crawl begins. This can be a site, sitemap, or sitemap index. When crawling the same website using more than one data source configuration, it is possible to index the same page more than once. A page indexed by two separate data source configurations is assigned two different document IDs because the data source ID is used as part of the document ID.
The Start URL cannot be changed in an existing data source. You must delete the data source and add a new one with the correct URL.
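The reason duplicates can occur is easiest to see with a small sketch. The hashing scheme below is purely hypothetical and is not Springboard's actual ID format; it only illustrates that an ID derived from both the data source ID and the page URL differs across data source configurations, even for the same page.

```python
import hashlib

def document_id(data_source_id: str, page_url: str) -> str:
    # Hypothetical scheme: the data source ID is folded into the document ID,
    # so the same page crawled by two data sources gets two different IDs.
    return hashlib.sha256(f"{data_source_id}:{page_url}".encode()).hexdigest()

same_page = "https://example.com/catalog/books"
print(document_id("ds-001", same_page))  # one document ID
print(document_id("ds-002", same_page))  # a different ID for the same page
```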
Labels
Enter optional labels to identify your data source. You can create and arrange multiple labels. On the Data Sources screen, only the first label is used when sorting the data sources by the Labels column.
Include pages
Include pages has two options for pages to crawl:

- Pages under the start URL. Only pages under the root URL entered in the Start URL field are indexed and included in the crawl. Links to subdomains are ignored. For example, if the start URL is https://example.com/catalog, valid links include:
  - https://www.example.com/catalog/books
  - https://example.com/catalog/books

  The crawl does not include these links:
  - https://example.com/books
  - https://www.example.com/books
  - https://books.example.com

- Pages on this site and its subdomains. Pages on this site and its subdomains are indexed and included in the crawl. A subdomain’s name ends with the domain’s name; for example, A.B is a subdomain of B. As another example, if the start URL is https://example.com, a valid subdomain is https://doc.example.com. Links that contain the root URL are indexed; all other links are ignored.

  If your start URL is a sitemap or sitemap index, select this option. Otherwise, the web crawl does not return any results.
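The two scoping options can be pictured with standard URL parsing. The following Python sketch is an assumption-based approximation of the rules described above, not the product's implementation; it treats a leading www. as equivalent to the bare host, checks the path prefix for the first option, and checks the domain suffix for the second.

```python
from urllib.parse import urlparse

def _host(url: str) -> str:
    # Treat "www.example.com" and "example.com" as the same host.
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def in_scope(start_url: str, link: str, include_subdomains: bool) -> bool:
    start, candidate = urlparse(start_url), urlparse(link)
    start_host, link_host = _host(start_url), _host(link)
    if include_subdomains:
        # "Pages on this site and its subdomains": doc.example.com matches example.com.
        return link_host == start_host or link_host.endswith("." + start_host)
    # "Pages under the start URL": same host and the path sits under the start path.
    return link_host == start_host and candidate.path.startswith(start.path)

start = "https://example.com/catalog"
print(in_scope(start, "https://www.example.com/catalog/books", False))   # True
print(in_scope(start, "https://example.com/books", False))               # False
print(in_scope(start, "https://books.example.com", False))               # False
print(in_scope("https://example.com", "https://doc.example.com", True))  # True
```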
Include file types
Select the file types to include in the crawl. The web crawl returns these document types when the extension suffix is in the URL or when the content-type is specified in an HTTP header. HTML content is always indexed. For detailed information, see File extension processing.
The web crawl can index these types of files with the specified extensions:
- Slide
  - .pptx
  - .odp
- PDF
  - .pdf
- Spreadsheet
  - .xls
  - .xlsx
  - .ods
- Word
  - .doc
  - .docx
Include external domains for selected file types
Enter external domains that contain the selected file types you want to include in the crawl. The format for the entry is example.com, not https://example.com. Do not use https: in the entry for the external domain. Subdomains are automatically included, unless they are added to the list of exclude links.
Requirements for files to be included in the crawl
For a file to be included in the crawl, it must meet all of the following requirements:

- The file exists in an external domain entered in the Include external domains for selected file types field, or in a subdomain of that domain.
- The file is not excluded by the values in the Exclude links field.
- The file's type matches a value selected in the Include file types field.
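One way to picture how the three requirements combine is the following sketch. The extension set, domain list, substring-style exclude matching, and function name are illustrative assumptions only; the values mirror the example settings used in the next section.

```python
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {".pdf"}         # from "Include file types"
EXTERNAL_DOMAINS = {"example.com"}    # from "Include external domains for selected file types"
EXCLUDE_LINKS = ["dev.example.com"]   # from "Exclude links"

def file_is_ingested(url: str) -> bool:
    parsed = urlparse(url)
    host, path = (parsed.hostname or "").lower(), parsed.path.lower()
    # 1. Hosted on an allowed external domain or one of its subdomains.
    on_allowed_domain = any(host == d or host.endswith("." + d) for d in EXTERNAL_DOMAINS)
    # 2. Not matched by any exclude-links value (assumed substring match).
    excluded = any(term in url for term in EXCLUDE_LINKS)
    # 3. File type selected in "Include file types".
    allowed_type = any(path.endswith(ext) for ext in ALLOWED_EXTENSIONS)
    return on_allowed_domain and not excluded and allowed_type

print(file_is_ingested("https://docs.example.com/guide.pdf"))   # True
print(file_is_ingested("https://dev.example.com/guide.pdf"))    # False (excluded subdomain)
print(file_is_ingested("https://docs.example.com/guide.docx"))  # False (type not selected)
```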
Examples
Example settings and values
The examples in the table use the following settings and values:
| Setting | Value |
|---|---|
| Include file types | PDF |
| Include external domains for selected file types | example.com |
| Exclude links | dev.example.com |

With these settings, the following files are included because they meet these conditions:

- The file type is PDF.
- The domain is example.com or a subdomain.
- The subdomain is not in the list of excluded links.
| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✅ | ✘ |

Subdomains are automatically included. However, the subdomain dev.example.com is listed in the Exclude links field, so files located there are not included. Although example-dev.com is not in the list of Exclude links, it is also not a domain listed in the Include external domains for selected file types field.
| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |

The file type DOCX is not a selected file type in the Include file types field. Even if a DOCX file is located at an allowed external domain, it is not ingested.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
Only files of the selected types are ingested from external domains. An external web page is not included, even though it is in an allowed external domain.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
Include metatags
Enter up to 10 alphanumeric metadata tag names that are ingested from the <head> of the HTML files and used to facet and filter search results.
If the metadata tags you enter exist and contain values, they are ingested in the crawl and the values for those metadata tags and the number of occurrences for each value display in the Experience Optimizer Facets panel. To populate the metadata tags facet information in Experience Optimizer, a metatag API request is performed during the crawl. The results information is cached for four hours. If another crawl is performed within that four-hour period, the API request is not performed again during that period because the request can be time-intensive and impact system performance.
For detailed information about metadata tags, see Custom metadata tags.
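As a rough illustration of what is collected, the following sketch uses Python's standard html.parser to pull name/content pairs from meta tags in an HTML document and keep only the configured tag names. The tag names author and category are hypothetical examples, and this is not Springboard's ingestion code.

```python
from html.parser import HTMLParser

INCLUDE_METATAGS = {"author", "category"}  # hypothetical names entered in the field

class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> values from an HTML document."""
    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name in INCLUDE_METATAGS and content:
            self.values.setdefault(name, []).append(content)

html = ('<html><head><meta name="author" content="J. Smith">'
        '<meta name="category" content="News"></head><body></body></html>')
parser = MetaTagCollector()
parser.feed(html)
print(parser.values)  # {'author': ['J. Smith'], 'category': ['News']}
```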
Include query parameters
Some websites use query components to organize and serve content. In the URL, these key=value pairs appear after the first ? character that follows the host name and are separated by the & character when there are multiple key/value pairs.
For example, if the Start URL is https://example.com/news, then two unique web pages might be https://example.com/news?year=2022&month=1 and https://example.com/news?year=2022&month=2.
You can enter relevant query parameters for your website in the Include query parameters field in the application’s user interface when adding or editing a data source. These values are case-sensitive, so enter them in the Include query parameters field exactly as they are used on the website.
The order of the query parameters entered does not matter. When the data source is actively crawled, the query parameters are parsed and sorted alphabetically. This ensures multiple query parameters are always compared in the same order for each URL path.
The parameters after the question mark in website URLs are identified, parsed, and then compared to the list of parameters specified in this field so different values in those parameters are treated as different pages. Only the query parameters entered in the Include query parameters field are used to identify unique web pages.
Only the parameters entered in the Include query parameters field are included for the crawl, even though the Start URL may contain other parameters. Start URL website parameters not specified are ignored and are not used when creating an index.
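The examples that follow can be summarized as a canonicalization step: only the configured parameters are kept, and they are compared in alphabetical order. The sketch below is an assumed approximation of that behavior, not the product's implementation.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

INCLUDE_QUERY_PARAMETERS = {"month", "year"}  # values entered in the field (case-sensitive)

def canonical_url(url: str) -> str:
    """Keep only the configured query parameters and sort them alphabetically,
    so URLs that differ only in ignored parameters map to the same page."""
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in INCLUDE_QUERY_PARAMETERS)
    return urlunparse(parts._replace(query=urlencode(kept)))

# These differ only in the ignored "city" parameter, so they canonicalize identically.
print(canonical_url("https://example.com/news?month=1&year=2022"))
print(canonical_url("https://example.com/news?month=1&year=2022&city=Union"))
```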
Example 1: No query parameters entered
If no query parameters are specified in the Include query parameters field, then no parameters (not even existing URL website parameters) are parsed during the crawl.
If no query parameters are entered, the following URLs are treated the same. Only the first of these links encountered during a crawl is added to the index:

- https://example.com/news?recipes=cake
- https://example.com/news?recipes=pie
Example 2: Query parameters entered do not exist on URL website
If a query parameter that does not exist on the URL website is entered in the Include query parameters field, the parameter is included in the crawl, but no results are returned. No URL website parameters are parsed during the crawl.
For example, suppose a recipes query parameter is the only one entered in the Include query parameters field, but it does not exist on the https://example.com/news site. The recipes parameter is included in the crawl, but no results are returned, and no URL website parameters are parsed during the crawl. The query parameters for these links are ignored:

- https://example.com/news?Recipes=cake
- https://example.com/news?recipe=cake
- https://example.com/news?year=2022&month=1
Example 3: One query parameter entered exists on the URL website
If only one query parameter is entered in the Include query parameters field, it is included in the crawl. If it exists on the URL website, relevant results are returned. No other URL website parameters are parsed during the crawl.
For example, the month query parameter is entered in the Include query parameters field and exists on the URL website. It is included in the crawl, and relevant results are returned, but any other URL website parameters, such as year, are ignored. The following URLs are treated as the same page in the index:

- https://example.com/news?month=1
- https://example.com/news?year=2022&month=1
Example 4: More than one query parameter entered
All of the query parameters entered in the Include query parameters field are included in the crawl, even if they do not exist on the URL website. The parameters that exist on the website return relevant results. No other URL website parameters are parsed during the crawl.
If the month and year query parameters are both entered in the Include query parameters field, they are included in the crawl. If they exist on the URL website, they return relevant results. Any other website URL parameters, such as city, are ignored.
The following URLs are treated as the same page in the index:

- https://example.com/news?month=1&year=2022
- https://example.com/news?month=1&year=2022&city=Union
Include links
Add to the Include links field any full or partial URLs that you wish to include in the crawl. Only URLs matching these values are included.
If you do not add links to your list of include links, the crawl follows the parameters set in other fields, such as your start URL and crawl levels.
Exclude links take priority over include links. For more information, see Priority over include links.
For example, you can choose to ingest your product articles while ignoring everything else by adding https://example.com/products to your list of include links.

| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✅ | ✘ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
You can also add products to your list of include links, but then any URL containing that string is included. For best results, use the exact URL with the protocol (http:// or https://) and with or without www., as applicable.

| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✅ | ✘ |
| | ✘ | ✅ |
| | ✘ | ✅ |
Exclude links
Add to the Exclude links field any full or partial URLs that you wish to exclude from the crawl. URLs matching the values set in this field are excluded. For best results, use the exact URL with the protocol (http:// or https://) and with or without www., as applicable.
Excluding links is similar to using a disallow rule in a robots.txt file and prevents ingestion of the documents. For more information, see Robots. Pages removed from a website still show up in query results unless you explicitly exclude them.
This is different from a block, which ingests the document but prevents it from showing up in query results.
Exclude a domain by adding it to the list of exclude links. For example, exclude the entire domain example.com.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
Exclude unencrypted URLs by excluding http:. Encrypted URLs, which begin with https:, are still crawled.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✅ | ✘ |
Note that both partial and full string matches exclude the URL. Excluding dir excludes all URLs that contain the string.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✅ | ✘ |
Exclude specific paths by excluding a target string, with or without /. For example, excluding /subfolder/ excludes any URL containing that string.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✅ | ✘ |
| | ✅ | ✘ |
Exclude certain file type formats by excluding the file extension. For example, if you want to ingest spreadsheet files but exclude open source formats, exclude .ods. Note that using ods instead of .ods excludes all paths that contain ods.
Your data source configuration must specify which file types to ingest. For more information, see Include file types.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✅ | ✘ |
| | ✘ | ✅ |
Priority over include links
Exclude links have priority over include links. If your include links list contains https://example.com/dir, and your exclude links list contains private, any URL that matches the exclude links term is excluded from the crawl.

| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
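Taken together, the include links, exclude links, and priority rules behave like a simple substring filter in which exclusions win. The following sketch is an illustrative approximation under that assumption, using the example values from the paragraph above; it is not the product's implementation.

```python
INCLUDE_LINKS = ["https://example.com/dir"]
EXCLUDE_LINKS = ["private"]

def url_is_crawled(url: str) -> bool:
    # Exclude links take priority over include links (substring matching assumed).
    if any(term in url for term in EXCLUDE_LINKS):
        return False
    return any(term in url for term in INCLUDE_LINKS)

print(url_is_crawled("https://example.com/dir/page"))          # True
print(url_is_crawled("https://example.com/dir/private/page"))  # False: exclude wins
print(url_is_crawled("https://example.com/other"))             # False: not in include links
```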
Data ingest run scheduling
For Data ingest run scheduling, create a schedule to automatically run your data source ingestion. For example, if the value is Monthly and the data source was created on January 5, the data ingestion runs on the fifth day of each month after that.
By default, the data source schedule is set to Monthly with the date and time based on your browser time when the data source was added. If you set up the data source after the twenty-eighth day of the month, the date is set to 28.
If a scheduled data ingestion is in progress, Springboard ignores manual data ingestions for the same data source created by the Save & Run button. If a manual recrawl is in progress and a future data ingestion for the same data source is scheduled, the scheduled data ingestion begins after the manual recrawl is complete.
To initiate an on-demand data ingestion outside the regular schedule, see run on-demand data ingestion.
The following options are available. These options cannot be combined.
- Hourly. Run a data ingestion every eight hours or every twelve hours, starting at the hour of your choice.
  - Time of crawl. The hour to run the first daily data source ingestion. The data source ingestion begins during the selected hour and runs at the selected interval.
  - Interval. The frequency of the crawls: every eight hours or every twelve hours. For example, if you select 2:00 AM and an interval of every eight hours, your data ingestion is scheduled to run at 2:00 AM, 10:00 AM, and 6:00 PM.
- Daily. Run a data ingestion once every day, starting at the hour of your choice.
  - Time of crawl. The hour to run the data source ingestion. The data source ingestion begins during the selected hour and runs every 24 hours afterward.
- Weekly. Run a data ingestion every week on the days and hour of your choice.
  - Time of crawl. The hour to run the data source ingestion. The data source ingestion begins during the selected hour on each day selected.
  - Days of week. The days of the week to run a data source ingestion. Selecting different times for different days is not supported.
- Monthly. Run a data ingestion monthly, starting at the day and hour of your choice.
  - Time of crawl. The hour to run the data source ingestion. The data source ingestion begins during the selected hour and runs at that time on the selected day of each month.
  - Day of month. The day of the month to run a data source ingestion.
Springboard uses 1-hour UTC offsets for dates and times in the UI. All dates and times display in your browser's local time zone. If you are located in a country or territory with a 30-minute or 45-minute UTC offset, the uneven UTC offset may appear in the UI.
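For example, the hourly option can be thought of as a start hour plus a fixed interval. The sketch below simply computes the resulting run times for an assumed 8-hour interval starting at 2:00 AM; it is not the scheduler's actual code.

```python
from datetime import datetime, timedelta

def hourly_run_times(first_run: datetime, interval_hours: int, count: int = 3):
    """Hourly schedule sketch: runs start at the chosen hour and repeat at the
    chosen interval (eight or twelve hours)."""
    return [first_run + timedelta(hours=interval_hours * i) for i in range(count)]

start = datetime(2024, 1, 5, 2, 0)  # 2:00 AM on the day the schedule starts (example date)
for run in hourly_run_times(start, interval_hours=8):
    print(run.strftime("%I:%M %p"))  # 02:00 AM, 10:00 AM, 06:00 PM
```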
Limit crawl levels
Limit crawl levels has a draggable scale to set the maximum number of levels to crawl. The maximum crawl depth defines the maximum number of jumps between the start URL and any found links. Only links containing the root URL are considered when adding pages to a crawl.
Sitemap and sitemap index URLs do not appear in your search results, but the crawl levels function as follows:
- If the start URL is a sitemap index:
  - You may not see documents if the crawl level is set to 1, because a sitemap index typically contains sitemaps.
  - A crawl level of 2 indexes the URLs specified in the sitemaps.
  - For higher crawl levels, a crawl level of n indexes the links on the pages found during crawl level n-1.
- If the start URL is a sitemap:
  - A crawl level of 1 indexes the URLs linked in the sitemap.
  - A crawl level of 2 indexes the URLs linked on the pages found during crawl level 1.
  - For higher crawl levels, a crawl level of n indexes the links on the pages found during crawl level n-1.
- If the start URL is a regular URL:
  - A crawl level of 1 indexes the start URL and the pages linked on the start URL.
  - A crawl level of 2 indexes the URLs linked on the pages found during crawl level 1.
  - For higher crawl levels, a crawl level of n indexes the links on the pages found during crawl level n-1.
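Conceptually, the crawl level limit bounds a breadth-first traversal from the start URL. The sketch below models the regular-URL case described above (sitemap handling is not modeled), and extract_links is a hypothetical placeholder for fetching a page and returning its in-scope links.

```python
from collections import deque

def crawl(start_url: str, max_levels: int, extract_links):
    """Level-limited breadth-first sketch. With a regular start URL, level 1 covers
    the start URL plus the pages it links to; each higher level n adds the links
    found on pages from level n-1."""
    indexed = set()
    queue = deque([(start_url, 0)])   # the start URL itself counts as depth 0
    while queue:
        url, depth = queue.popleft()
        if url in indexed:
            continue
        indexed.add(url)
        if depth < max_levels:        # only expand links while under the level limit
            for link in extract_links(url):
                queue.append((link, depth + 1))
    return indexed

# Tiny in-memory "site": with max_levels=1, the start URL and its direct links are indexed.
site = {"https://example.com": ["https://example.com/a"],
        "https://example.com/a": ["https://example.com/a/b"]}
print(crawl("https://example.com", max_levels=1, extract_links=lambda u: site.get(u, [])))
```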
Crawl logic
New URL
When the data source is initially created and you click Save & Run, Springboard attempts to crawl the URL based on the values entered in the data source fields.
If the URL can be reached, Springboard ingests the URL content based on data source settings.
Existing URL
If the URL exists and Springboard has ingested the content before, subsequent crawls occur in two ways:

- On the schedule set by the value in the Data ingest run scheduling field.
- On demand, when you access the Edit a data source screen and click Save & Run. You do not have to make changes to the data source to click Save & Run and invoke a recrawl.
Springboard detects changes to the web data source during each crawl. If an existing URL:

- Can be reached, Springboard ingests the latest version of the content.
- Cannot be reached, Springboard deletes the existing content.
To safeguard your data against deletion, the crawl stops and returns an error if it detects that changes to a website impact 30% or more of existing URLs. If these major changes to the data source are intentional, delete the existing data source and recreate it to ingest and reindex the documents.
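The safeguard can be thought of as a simple ratio check before deletions are applied. The sketch below is an assumed illustration of that threshold logic, not the product's implementation.

```python
def crawl_should_abort(previous_urls, reachable_urls, threshold=0.30) -> bool:
    """Stop the crawl with an error when the fraction of previously indexed URLs
    that are no longer reachable meets or exceeds the threshold."""
    if not previous_urls:
        return False
    missing = previous_urls - reachable_urls
    return len(missing) / len(previous_urls) >= threshold

previous = {f"https://example.com/page{i}" for i in range(10)}
reachable = {f"https://example.com/page{i}" for i in range(6)}  # 4 of 10 now missing
print(crawl_should_abort(previous, reachable))  # True: 40% of existing URLs would be deleted
```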
Errors
For HTTP 4xx family errors, an error message is sent to the Springboard job service and the page is not crawled. For error and resolution information, see Client error codes 4xx.
For HTTP 5xx error information, see Server error codes 5xx.
Some errors activate retry attempts. For more information, see Retry logic for 5xx and client-side timeout errors.
For more information about web data source errors and how to fix them, see the Web data source troubleshooting guide.