Web data source
Understanding the web data source configuration helps you see how the data source crawls and indexes documents. You can then adjust the settings as needed to get the results you want.
Web data source settings
These settings control how a Springboard app crawls and indexes a website.
- For conceptual information, see Data sources.
- For information about the user interface screen, see Data sources screen.
- For information about adding and managing data sources, see Manage Springboard data sources.
Data source name
In the Data source name field, enter a unique name that lets you easily identify the data source configuration from the list in that Springboard application.
Region
In the Region field, select a region from the drop-down menu. Your data is ingested in this region. Choosing a region allows you to place your Springboard applications in the geographic region closest to your end users. Regions also allow you to adhere to any regulatory requirements.
You cannot change the region of an existing data source.
Start URL
For Start URL, enter the URL where the crawl begins. This can be a site, sitemap, or sitemap index. When crawling the same website using more than one data source configuration, it is possible to index the same page more than once. A page indexed by two separate data source configurations is assigned two different document IDs because the data source ID is used as part of the document ID.
The Start URL cannot be changed in an existing data source. You must delete the data source and add a new one with the correct URL.
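The reason duplicates can occur is easiest to see with a small sketch. The hashing scheme below is purely hypothetical and is not Springboard's actual ID format; it only illustrates that an ID derived from both the data source ID and the page URL differs across data source configurations, even for the same page.

```python
import hashlib

def document_id(data_source_id: str, page_url: str) -> str:
    # Hypothetical scheme: the data source ID is folded into the document ID,
    # so the same page crawled by two data sources gets two different IDs.
    return hashlib.sha256(f"{data_source_id}:{page_url}".encode()).hexdigest()

same_page = "https://example.com/catalog/books"
print(document_id("ds-001", same_page))  # one document ID
print(document_id("ds-002", same_page))  # a different ID for the same page
```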
Labels
Enter optional labels to identify your data source. You can create and arrange multiple labels. On the Data Sources screen, only the first label is used when sorting the data sources by the Labels column.
Include pages
Include pages has two options for pages to crawl:

- Pages under the start URL. Only pages under the root URL entered in the Start URL field are indexed and included in the crawl. Links to subdomains are ignored. For example, if the start URL is https://example.com/catalog, valid links include:
  - https://www.example.com/catalog/books
  - https://example.com/catalog/books

  The crawl does not include these links:
  - https://example.com/books
  - https://www.example.com/books
  - https://books.example.com

- Pages on this site and its subdomains. Pages on this site and its subdomains are indexed and included in the crawl. A subdomain’s name ends with the domain’s name; for example, A.B is a subdomain of B. As another example, if the start URL is https://example.com, a valid subdomain is https://doc.example.com. Links that contain the root URL are indexed; all other links are ignored.

  If your start URL is a sitemap or sitemap index, select this option. Otherwise, the web crawl does not return any results.
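The two scoping options can be pictured with standard URL parsing. The following Python sketch is an assumption-based approximation of the rules described above, not the product's implementation; it treats a leading www. as equivalent to the bare host, checks the path prefix for the first option, and checks the domain suffix for the second.

```python
from urllib.parse import urlparse

def _host(url: str) -> str:
    # Treat "www.example.com" and "example.com" as the same host.
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def in_scope(start_url: str, link: str, include_subdomains: bool) -> bool:
    start, candidate = urlparse(start_url), urlparse(link)
    start_host, link_host = _host(start_url), _host(link)
    if include_subdomains:
        # "Pages on this site and its subdomains": doc.example.com matches example.com.
        return link_host == start_host or link_host.endswith("." + start_host)
    # "Pages under the start URL": same host and the path sits under the start path.
    return link_host == start_host and candidate.path.startswith(start.path)

start = "https://example.com/catalog"
print(in_scope(start, "https://www.example.com/catalog/books", False))   # True
print(in_scope(start, "https://example.com/books", False))               # False
print(in_scope(start, "https://books.example.com", False))               # False
print(in_scope("https://example.com", "https://doc.example.com", True))  # True
```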
Include file types
Select the file types to include in the crawl. The web crawl returns these document types when the extension suffix is in the URL or when the content-type is specified in an HTTP header. HTML content is always indexed. For detailed information, see File extension processing.
The web crawl can index these types of files with the specified extensions:
- Slide
  - .pptx
  - .odp
- PDF
  - .pdf
- Spreadsheet
  - .xls
  - .xlsx
  - .ods
- Word
  - .doc
  - .docx
Include external domains for selected file types
Enter external domains that contain the selected file types you want to include in the crawl. The format for the entry is example.com, not https://example.com. Do not use https: in the entry for the external domain. Subdomains are automatically included, unless they are added to the list of exclude links.
Requirements for files to be included in the crawl
For a file to be included in the crawl, it must meet all of the following requirements:

- The file exists in an external domain entered in the Include external domains for selected file types field, or in a subdomain of that domain.
- The file is not excluded by the values in the Exclude links field.
- The file's type matches a value selected in the Include file types field.
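One way to picture how the three requirements combine is the following sketch. The extension set, domain list, substring-style exclude matching, and function name are illustrative assumptions only; the values mirror the example settings used in the next section.

```python
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {".pdf"}         # from "Include file types"
EXTERNAL_DOMAINS = {"example.com"}    # from "Include external domains for selected file types"
EXCLUDE_LINKS = ["dev.example.com"]   # from "Exclude links"

def file_is_ingested(url: str) -> bool:
    parsed = urlparse(url)
    host, path = (parsed.hostname or "").lower(), parsed.path.lower()
    # 1. Hosted on an allowed external domain or one of its subdomains.
    on_allowed_domain = any(host == d or host.endswith("." + d) for d in EXTERNAL_DOMAINS)
    # 2. Not matched by any exclude-links value (assumed substring match).
    excluded = any(term in url for term in EXCLUDE_LINKS)
    # 3. File type selected in "Include file types".
    allowed_type = any(path.endswith(ext) for ext in ALLOWED_EXTENSIONS)
    return on_allowed_domain and not excluded and allowed_type

print(file_is_ingested("https://docs.example.com/guide.pdf"))   # True
print(file_is_ingested("https://dev.example.com/guide.pdf"))    # False (excluded subdomain)
print(file_is_ingested("https://docs.example.com/guide.docx"))  # False (type not selected)
```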
Examples
Example settings and values
The examples in the table use the following settings and values:
| Setting | Value |
|---|---|
| Include file types | PDF |
| Include external domains for selected file types | example.com |
| Exclude links | dev.example.com |

With these settings, the following files are included because they meet these conditions:

- The file type is PDF.
- The domain is example.com or a subdomain.
- The subdomain is not in the list of excluded links.
| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✅ | ✘ |

Subdomains are automatically included. However, the subdomain dev.example.com is listed in the Exclude links field, so files located there are not included. Although example-dev.com is not in the list of Exclude links, it is also not a domain listed in the Include external domains for selected file types field.
| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |

The file type DOCX is not a selected file type in the Include file types field. Even if a DOCX file is located at an allowed external domain, it is not ingested.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
Only files of the selected types are ingested from external domains. An external web page is not included, even though it is in an allowed external domain.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
Include metatags
Enter up to 10 alphanumeric metadata tag names that are ingested from the <head> of the HTML files and used to facet and filter search results.
If the metadata tags you enter exist and contain values, they are ingested in the crawl and the values for those metadata tags and the number of occurrences for each value display in the Experience Optimizer Facets panel. To populate the metadata tags facet information in Experience Optimizer, a metatag API request is performed during the crawl. The results information is cached for four hours. If another crawl is performed within that four-hour period, the API request is not performed again during that period because the request can be time-intensive and impact system performance.
For detailed information about metadata tags, see Custom metadata tags.
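As a rough illustration of what is collected, the following sketch uses Python's standard html.parser to pull name/content pairs from meta tags in an HTML document and keep only the configured tag names. The tag names author and category are hypothetical examples, and this is not Springboard's ingestion code.

```python
from html.parser import HTMLParser

INCLUDE_METATAGS = {"author", "category"}  # hypothetical names entered in the field

class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> values from an HTML document."""
    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name in INCLUDE_METATAGS and content:
            self.values.setdefault(name, []).append(content)

html = ('<html><head><meta name="author" content="J. Smith">'
        '<meta name="category" content="News"></head><body></body></html>')
parser = MetaTagCollector()
parser.feed(html)
print(parser.values)  # {'author': ['J. Smith'], 'category': ['News']}
```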
Include query parameters
Some websites use query components to organize and serve content. In the URL, these key=value pairs appear after the first ? character that follows the host name and are separated by the & character when there are multiple key/value pairs.
For example, if the Start URL is https://example.com/news, then two unique web pages might be https://example.com/news?year=2022&month=1 and https://example.com/news?year=2022&month=2.
You can enter relevant query parameters for your website in the Include query parameters field in the application’s user interface when adding or editing a data source. These values are case-sensitive, so enter them in the Include query parameters field exactly as they are used on the website.
The order of the query parameters entered does not matter. When the data source is actively crawled, the query parameters are parsed and sorted alphabetically. This ensures multiple query parameters are always compared in the same order for each URL path.
The parameters after the question mark in website URLs are identified, parsed, and then compared to the list of parameters specified in this field so different values in those parameters are treated as different pages. Only the query parameters entered in the Include query parameters field are used to identify unique web pages.
Only the parameters entered in the Include query parameters field are included for the crawl, even though the Start URL may contain other parameters. Start URL website parameters not specified are ignored and are not used when creating an index.
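The examples that follow can be summarized as a canonicalization step: only the configured parameters are kept, and they are compared in alphabetical order. The sketch below is an assumed approximation of that behavior, not the product's implementation.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

INCLUDE_QUERY_PARAMETERS = {"month", "year"}  # values entered in the field (case-sensitive)

def canonical_url(url: str) -> str:
    """Keep only the configured query parameters and sort them alphabetically,
    so URLs that differ only in ignored parameters map to the same page."""
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in INCLUDE_QUERY_PARAMETERS)
    return urlunparse(parts._replace(query=urlencode(kept)))

# These differ only in the ignored "city" parameter, so they canonicalize identically.
print(canonical_url("https://example.com/news?month=1&year=2022"))
print(canonical_url("https://example.com/news?month=1&year=2022&city=Union"))
```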
Example 1: No query parameters entered
If no query parameters are specified in the Include query parameters field, then no parameters (not even existing URL website parameters) are parsed during the crawl.
If no query parameters are entered, the following URLs are treated the same. Only the first of these links encountered during a crawl is added to the index:

- https://example.com/news?recipes=cake
- https://example.com/news?recipes=pie
Example 2: Query parameters entered do not exist on URL website
If a query parameter that does not exist on the URL website is entered in the Include query parameters field, the parameter is included in the crawl, but no results are returned. No URL website parameters are parsed during the crawl.
For example, suppose a recipes query parameter is the only one entered in the Include query parameters field, but it does not exist on the https://example.com/news site. The recipes parameter is included in the crawl, but no results are returned, and no URL website parameters are parsed during the crawl. The query parameters for these links are ignored:

- https://example.com/news?Recipes=cake
- https://example.com/news?recipe=cake
- https://example.com/news?year=2022&month=1
Example 3: One query parameter entered exists on the URL website
If only one query parameter is entered in the Include query parameters field, it is included in the crawl. If it exists on the URL website, relevant results are returned. No other URL website parameters are parsed during the crawl.
For example, the month query parameter is entered in the Include query parameters field and exists on the URL website. It is included in the crawl, and relevant results are returned, but any other URL website parameters, such as year, are ignored. The following URLs are treated as the same page in the index:

- https://example.com/news?month=1
- https://example.com/news?year=2022&month=1
Example 4: More than one query parameter entered
All of the query parameters entered in the Include query parameters field are included in the crawl, even if they do not exist on the URL website. The parameters that exist on the website return relevant results. No other URL website parameters are parsed during the crawl.
If the month and year query parameters are both entered in the Include query parameters field, they are included in the crawl. If they exist on the URL website, they return relevant results. Any other website URL parameters, such as city, are ignored.
The following URLs are treated as the same page in the index:

- https://example.com/news?month=1&year=2022
- https://example.com/news?month=1&year=2022&city=Union
Include links
Add to the Include links field any full or partial URLs that you wish to include in the crawl. Only URLs matching these values are included.
If you do not add links to your list of include links, the crawl follows the parameters set in other fields, such as your start URL and crawl levels.
Exclude links take priority over include links. For more information, see Priority over include links.
For example, you can choose to ingest your product articles while ignoring everything else by adding https://example.com/products to your list of include links.

| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✅ | ✘ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
You can also add products to your list of include links, but then any URL containing that string is included. For best results, use the exact URL with the protocol (http:// or https://) and with or without www., as applicable.

| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✅ | ✘ |
| | ✘ | ✅ |
| | ✘ | ✅ |
Exclude links
Add to the Exclude links field any full or partial URLs that you wish to exclude from the crawl. URLs matching the values set in this field are excluded. For best results, use the exact URL with the protocol (http:// or https://) and with or without www., as applicable.
Excluding links is similar to using a disallow rule in a robots.txt file and prevents ingestion of the documents. For more information, see Robots. Pages removed from a website still show up in query results unless you explicitly exclude them.
This is different from a block, which ingests the document but prevents it from showing up in query results.
Exclude a domain by adding it to the list of exclude links. For example, exclude the entire domain example.com.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
Exclude unencrypted URLs by excluding http:. Encrypted URLs, which begin with https:, are still crawled.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✅ | ✘ |
Note that both partial and full string matches exclude the URL. Excluding dir excludes all URLs that contain the string.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✅ | ✘ |
Exclude specific paths by excluding a target string, with or without /. For example, excluding /subfolder/ excludes any URL containing that string.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✅ | ✘ |
| | ✅ | ✘ |
Exclude certain file type formats by excluding the file extension. For example, if you want to ingest spreadsheet files but exclude open source formats, exclude .ods. Note that using ods instead of .ods excludes all paths that contain ods.
Your data source configuration must specify which file types to ingest. For more information, see Include file types.

| URL | Included | Excluded |
|---|---|---|
| | ✘ | ✅ |
| | ✅ | ✘ |
| | ✘ | ✅ |
Priority over include links
Exclude links have priority over include links. If your include links list contains https://example.com/dir, and your exclude links list contains private, any URL that matches the exclude links term is excluded from the crawl.

| URL | Included | Excluded |
|---|---|---|
| | ✅ | ✘ |
| | ✘ | ✅ |
| | ✘ | ✅ |
| | ✘ | ✅ |
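Taken together, the include links, exclude links, and priority rules behave like a simple substring filter in which exclusions win. The following sketch is an illustrative approximation under that assumption, using the example values from the paragraph above; it is not the product's implementation.

```python
INCLUDE_LINKS = ["https://example.com/dir"]
EXCLUDE_LINKS = ["private"]

def url_is_crawled(url: str) -> bool:
    # Exclude links take priority over include links (substring matching assumed).
    if any(term in url for term in EXCLUDE_LINKS):
        return False
    return any(term in url for term in INCLUDE_LINKS)

print(url_is_crawled("https://example.com/dir/page"))          # True
print(url_is_crawled("https://example.com/dir/private/page"))  # False: exclude wins
print(url_is_crawled("https://example.com/other"))             # False: not in include links
```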
Data ingest run scheduling
For Data ingest run scheduling, create a schedule to automatically run your data source ingestion. For example, if the value is Monthly and the data source was created on January 5, the data ingestion runs on the fifth day of each month after that.
By default, the data source schedule is set to Monthly with the date and time based on your browser time when the data source was added. If you set up the data source after the twenty-eighth day of the month, the date is set to 28.
If a scheduled data ingestion is in progress, Springboard ignores manual data ingestions for the same data source created by the Save & Run button. If a manual recrawl is in progress and a future data ingestion for the same data source is scheduled, the scheduled data ingestion begins after the manual recrawl is complete.
To initiate an on-demand data ingestion outside the regular schedule, see run on-demand data ingestion.
The following options are available. These options cannot be combined.
- Hourly. Run a data ingestion every eight hours or every twelve hours, starting at the hour of your choice.
  - Time of crawl. The hour to run the first daily data source ingestion. The data source ingestion begins during the selected hour and runs at the selected interval.
  - Interval. The frequency of the crawls: every eight hours or every twelve hours. For example, if you select 2:00 AM and an interval of every eight hours, your data ingestion is scheduled to run at 2:00 AM, 10:00 AM, and 6:00 PM.
- Daily. Run a data ingestion once every day, starting at the hour of your choice.
  - Time of crawl. The hour to run the data source ingestion. The data source ingestion begins during the selected hour and runs every 24 hours afterward.
- Weekly. Run a data ingestion every week on the days and hour of your choice.
  - Time of crawl. The hour to run the data source ingestion. The data source ingestion begins during the selected hour on each day selected.
  - Days of week. The days of the week to run a data source ingestion. Selecting different times for different days is not supported.
- Monthly. Run a data ingestion monthly, starting at the day and hour of your choice.
  - Time of crawl. The hour to run the data source ingestion. The data source ingestion begins during the selected hour and runs at that time on the selected day of each month.
  - Day of month. The day of the month to run a data source ingestion.
Springboard uses 1-hour UTC offsets for dates and times in the UI. All dates and times display in your browser's local time zone. If you are located in a country or territory with a 30-minute or 45-minute UTC offset, the uneven UTC offset may appear in the UI.
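For example, the hourly option can be thought of as a start hour plus a fixed interval. The sketch below simply computes the resulting run times for an assumed 8-hour interval starting at 2:00 AM; it is not the scheduler's actual code.

```python
from datetime import datetime, timedelta

def hourly_run_times(first_run: datetime, interval_hours: int, count: int = 3):
    """Hourly schedule sketch: runs start at the chosen hour and repeat at the
    chosen interval (eight or twelve hours)."""
    return [first_run + timedelta(hours=interval_hours * i) for i in range(count)]

start = datetime(2024, 1, 5, 2, 0)  # 2:00 AM on the day the schedule starts (example date)
for run in hourly_run_times(start, interval_hours=8):
    print(run.strftime("%I:%M %p"))  # 02:00 AM, 10:00 AM, 06:00 PM
```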
Limit crawl levels
Limit crawl levels has a draggable scale to set the maximum number of levels to crawl. The maximum crawl depth defines the maximum number of jumps between the start URL and any found links. Only links containing the root URL are considered when adding pages to a crawl.
Sitemap and sitemap index URLs do not appear in your search results, but the crawl levels function as follows:
- If the start URL is a sitemap index:
  - You may not see documents if the crawl level is set to 1, because a sitemap index typically contains sitemaps.
  - A crawl level of 2 indexes the URLs specified in the sitemaps.
  - For higher crawl levels, a crawl level of n indexes the links on the pages found during crawl level n-1.
- If the start URL is a sitemap:
  - A crawl level of 1 indexes the URLs linked in the sitemap.
  - A crawl level of 2 indexes the URLs linked on the pages found during crawl level 1.
  - For higher crawl levels, a crawl level of n indexes the links on the pages found during crawl level n-1.
- If the start URL is a regular URL:
  - A crawl level of 1 indexes the start URL and the pages linked on the start URL.
  - A crawl level of 2 indexes the URLs linked on the pages found during crawl level 1.
  - For higher crawl levels, a crawl level of n indexes the links on the pages found during crawl level n-1.
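Conceptually, the crawl level limit bounds a breadth-first traversal from the start URL. The sketch below models the regular-URL case described above (sitemap handling is not modeled), and extract_links is a hypothetical placeholder for fetching a page and returning its in-scope links.

```python
from collections import deque

def crawl(start_url: str, max_levels: int, extract_links):
    """Level-limited breadth-first sketch. With a regular start URL, level 1 covers
    the start URL plus the pages it links to; each higher level n adds the links
    found on pages from level n-1."""
    indexed = set()
    queue = deque([(start_url, 0)])   # the start URL itself counts as depth 0
    while queue:
        url, depth = queue.popleft()
        if url in indexed:
            continue
        indexed.add(url)
        if depth < max_levels:        # only expand links while under the level limit
            for link in extract_links(url):
                queue.append((link, depth + 1))
    return indexed

# Tiny in-memory "site": with max_levels=1, the start URL and its direct links are indexed.
site = {"https://example.com": ["https://example.com/a"],
        "https://example.com/a": ["https://example.com/a/b"]}
print(crawl("https://example.com", max_levels=1, extract_links=lambda u: site.get(u, [])))
```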
Crawl logic
New URL
When the data source is initially created and you click Save & Run, Springboard attempts to crawl the URL based on the values entered in the data source fields.
If the URL can be reached, Springboard ingests the URL content based on data source settings.
Existing URL
If the URL exists and Springboard has ingested the content before, subsequent crawls occur in two ways:

- On the schedule set by the value in the Data ingest run scheduling field.
- On demand, when you access the Edit a data source screen and click Save & Run. You do not have to make changes to the data source to click Save & Run and invoke a recrawl.
Springboard detects changes to the web data source during each crawl. If an existing URL:

- Can be reached, Springboard ingests the latest version of the content.
- Cannot be reached, Springboard deletes the existing content.
To safeguard your data against deletion, the crawl stops and returns an error if it detects that changes to a website impact 30% or more of existing URLs. If these major changes to the data source are intentional, delete the existing data source and recreate it to ingest and reindex the documents.
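The safeguard can be thought of as a simple ratio check before deletions are applied. The sketch below is an assumed illustration of that threshold logic, not the product's implementation.

```python
def crawl_should_abort(previous_urls, reachable_urls, threshold=0.30) -> bool:
    """Stop the crawl with an error when the fraction of previously indexed URLs
    that are no longer reachable meets or exceeds the threshold."""
    if not previous_urls:
        return False
    missing = previous_urls - reachable_urls
    return len(missing) / len(previous_urls) >= threshold

previous = {f"https://example.com/page{i}" for i in range(10)}
reachable = {f"https://example.com/page{i}" for i in range(6)}  # 4 of 10 now missing
print(crawl_should_abort(previous, reachable))  # True: 40% of existing URLs would be deleted
```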
Errors
For HTTP 4xx family errors, an error message is sent to the Springboard job service and the page is not crawled. For error and resolution information, see Client error codes 4xx.
For HTTP 5xx error information, see Server error codes 5xx.
Some errors activate retry attempts. For more information, see Retry logic for 5xx and client-side timeout errors.
For more information about web data source errors and how to fix them, see the Web data source troubleshooting guide.