Fusion 5.9

    Troubleshooting guide: Web data source

    The Springboard web data source starts at a base URL and retrieves data from a website using HTTP.

    This guide describes some common errors in using the web data source and how to fix them.

    Client-side request timeout and client-side connection timeout errors trigger retry attempts. For more information, see Retry logic for 5xx and client-side timeout errors.

    Incorrect language detection

    The web data source crawl uses HTML properties to detect a document’s language. The following list shows the HTML properties used to detect a page’s language, ranked from highest priority (1) to lowest (4).

    • If a higher priority property’s value is present, that value is used even if lower priority properties contradict the higher priority property’s value.

    • If a higher priority property is absent, the web crawl checks for the presence of the next attribute and its value.

    1. lang attribute as set in the html tag: the language the document is written in.

       Example: <html lang="es-ES">

    2. content-language header field: describes the language of the document. The header field can contain multiple languages. This method is non-conforming in HTML5.

       Example: <meta http-equiv="content-language" content="en">

    3. Content-Language HTTP header: describes the language of the intended audience as stated in RFC 7231. This header may contain multiple languages.

       Example: Content-Language: fr, en

    4. meta element’s name value: describes the language of the document. This method does not comply with any HTML standards.

       Example: <meta name="language" content="english">

    If the web crawl detects the wrong language for your web pages, check the HTML content for your pages and search for the HTML attributes in the preceding table.
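
    For example, in the following hypothetical page, the lang attribute (priority 1) declares Spanish while the meta content-language tag (priority 2) declares English. Because the higher priority property is present, the crawl detects the page as Spanish:

      <html lang="es-ES">
        <head>
          <meta http-equiv="content-language" content="en">
          <title>Página de ejemplo</title>
        </head>
        <body>Contenido en español</body>
      </html>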

    The web crawl does not index any documents

    If the Start URL is:

    • An XML sitemap index and you choose to index only pages under the start URL, the crawl may not return any pages. Edit your data source to index pages on the site and its subdomains.

    • A sitemap index, set the crawl level to 2 or higher to see results.
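
    A sitemap index points to other sitemap files rather than directly to pages, which is why the crawl needs at least two levels: one to reach the listed sitemaps and a second to reach the pages they contain. A minimal sitemap index looks like the following sketch (the URLs are placeholders):

      <?xml version="1.0" encoding="UTF-8"?>
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <sitemap>
          <loc>https://example.com/sitemap-products.xml</loc>
        </sitemap>
        <sitemap>
          <loc>https://example.com/sitemap-blog.xml</loc>
        </sitemap>
      </sitemapindex>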

    The web crawl does not index all the expected pages

    If the web crawl is missing pages, or indexes some pages but not others you expect:

    • The crawl level may not be set high enough. For more information about crawl levels, see Limit crawl levels.

    • Non-unique canonical URLs may exist. To correct this, ensure canonical URLs included in pages are unique. For example:

      • If a web page’s <head> includes a canonical URL, such as <link rel="canonical" href="https://example.com/index.html">, then the web crawl uses this value as the document.url.

      • If two pages share the same canonical URL, then the web crawl treats them as duplicates and only indexes the first one encountered.
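
      For example, if page1.html and page2.html both declared https://example.com/listing/ as their canonical URL, only the first page crawled would be indexed. A hypothetical fix gives each page its own canonical URL:

        <!-- In page1.html -->
        <link rel="canonical" href="https://example.com/listing/page1.html">

        <!-- In page2.html -->
        <link rel="canonical" href="https://example.com/listing/page2.html">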

    The web crawl indexes duplicate pages

    If the web crawl indexes pages containing identical content in multiple URLs, take the following steps to deduplicate your content in the data source:

    • Use the robots.txt file to block indexing of known duplicate pages, such as /backup/ or /archive/ (see the sketch after this list).

    • Include <meta name="robots" content="noindex" /> at the top of the duplicated web pages.

    • Use HTTP 301 permanent redirects when restructuring moves content to new URLs.

    • If you are indexing different sections of your website with different data sources, edit each data source to exclude the other section.
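
    As a sketch of the first option in this list, the following robots.txt rules block all crawlers from the duplicate folders mentioned above (the paths are placeholders):

      User-agent: *
      Disallow: /backup/
      Disallow: /archive/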

    New pages are not indexed

    New pages are not automatically indexed. To manually reindex the content of a data source, open the Edit data source screen and click Save & Run; you do not need to change any settings first. See Edit a web data source for step-by-step instructions.

    If you do not manually reindex a data source, data ingestion runs on the schedule specified in the Data ingest run scheduling field. Any pages you’ve added recently are not indexed until the next scheduled run.

    Some file types are not indexed

    The web crawl indexes certain file types in addition to HTML content. See Web data source functionality to view all supported file types. For detailed information, see File extension processing.

    The meta tag blocks individual pages

    If a page’s meta tag data and the data source settings conflict, the web crawl gives precedence to the meta tag data.

    If a page includes a meta tag similar to:

    • <meta name="robots" content="noindex" />, the page is not indexed. Remove this tag to index the page.

    • <meta name="robots" content="nofollow" />, the web crawl does not follow any links on that page and does not index the linked pages. Remove this meta tag to index the links on the page.

      Individual links can also use the nofollow value such as <a href="page-name.html" rel="nofollow">. Remove rel=nofollow to index the link.

    Crawlers are blocked in the robots.txt file

    If the robots.txt file and the data source settings conflict, Springboard gives precedence to the robots.txt file.

    The robots.txt file may block crawlers in the following ways:

    • If your robots.txt file blocks all crawlers, add an exception for the Springboard crawler:

      User-agent: Lucidworks-Web/1.1
      Disallow:
    • If pages in a subfolder are not indexed and you have not excluded that page from your data source, the robots.txt file may be blocking the subfolder. Edit the robots.txt file to unblock the subfolder.
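
    Combining the two cases above, a hypothetical robots.txt might block a subfolder for all other crawlers while leaving the Springboard crawler unrestricted (/private/ is a placeholder):

      User-agent: *
      Disallow: /private/

      User-agent: Lucidworks-Web/1.1
      Disallow: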

    The Exclude Links section does not support wildcards. You can use this section to exclude full or partial URLs.

    The same page has different document IDs in different data sources

    Every document has a unique ID based on the document, data source, and customer.

    If the same page is indexed in two data sources, it has a different ID in each data source. This behavior is intentional and maintains document uniqueness.

    Some pages are not crawled or indexed

    If a page is not linked anywhere else on the website, that page is not crawled. Create a link to the page from elsewhere on the website, and the linked page will be crawled on the next scheduled site crawl.

    A page is not indexed if it is located too many levels from the home page.

    Springboard’s web data source supports crawl levels up to 10 levels deep. If you are optimizing your website for SEO, accessing any page from the home page should take four or fewer clicks.

    The web data source can overlook links contained in outdated web technologies such as Adobe Flash or HTML frames.

    Slow site crawls

    Broken links can slow the web data source crawler. Perform regular site audits to fix or remove broken links.

    UTF-8 characters do not parse properly

    Springboard ingests content using UTF-8 character encoding to allow searches in multiple languages. On a web page with the content-type character encoding set to UTF-8, the web crawl collects and indexes all properly encoded UTF-8 characters into Page fields. If a web page does not provide a charset in the content-type response header, Springboard defaults to charset=UTF-8.

    For anything other than alphanumeric English characters, searches are performed as exact matches. For other character encodings such as ISO-8859-1, Windows-1251, or EUC-KR, Springboard processes the characters as well as it can, but some characters might still appear as errors in the search results.
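
    To avoid relying on the default, declare the encoding explicitly. A minimal sketch of both declarations:

      HTTP response header:  Content-Type: text/html; charset=UTF-8
      HTML head element:     <meta charset="UTF-8">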

    Invalid JSON error for queries with special characters

    Some characters used in search and searchahead queries must be escaped for proper interpretation. Character escaping in JSON requests is handled using a backslash, \. If a query includes a backslash or double quotes, you must escape these or the JSON request is considered invalid.

    This table lists which special characters can be escaped in a request.

    Escape sequence    Resulting character

    \\                 \ (backslash)
    \"                 " (double quote)
    \b                 backspace
    \t                 tab
    \n                 newline
    \f                 form feed
    \r                 carriage return

    A backslash followed by any character not listed in this table results in an invalid JSON error message. Unicode character sequences are not accepted; instead they are passed as literal values. Wildcards are not supported.
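
    For example, a query containing a double quote or backslashes must be escaped as follows to remain valid JSON (the query field name is illustrative):

      Valid:   { "query": "12\" socket wrench" }
      Valid:   { "query": "C:\\temp\\report.pdf" }
      Invalid: { "query": "12" socket wrench" }   (the unescaped quote ends the string early)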