Robots.txt and the web data source
What is a robots.txt file?
A robots.txt file tells a search engine which pages to crawl and which to ignore on a website. A website can give crawling and indexing instructions to specific search engines by naming them as user agents. The instructions for crawling behavior primarily use allow and disallow rules. Additionally, there are page-level instructions for use in meta tags or HTTP headers.
Springboard’s web data source is designed with two obedience behaviors:
- Obey Allow, Disallow, and other rules found in a robots.txt file.
- Obey robots meta tags and HTTP response headers for noindex, nofollow, and other rules.
Springboard does not use the crawl-delay rule because it is a non-standard instruction and Google does not use the rule in its own crawls.
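To make these two behaviors concrete, the following sketch checks both kinds of rules using only Python’s standard library. It is an illustration under assumed inputs, not Springboard’s implementation; the user-agent string, sample rules, headers, and HTML are invented for the example:
# Illustrative sketch of both obedience behaviors; not Springboard's code.
# The user agent, rules, headers, and HTML below are invented examples.
import urllib.robotparser
from html.parser import HTMLParser

USER_AGENT = "example-crawler"  # hypothetical crawler name

# Behavior 1: obey Allow and Disallow rules from robots.txt.
robots = urllib.robotparser.RobotFileParser()
robots.parse([
    "User-agent: *",
    "Allow: /private/press/",
    "Disallow: /private/",
])
print(robots.can_fetch(USER_AGENT, "https://www.example.com/private/notes.html"))    # False
print(robots.can_fetch(USER_AGENT, "https://www.example.com/private/press/1.html"))  # True

# Behavior 2: obey robots meta tags and X-Robots-Tag response headers.
class MetaRobots(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

def page_directives(headers, html):
    """Merge directives from the X-Robots-Tag header and robots meta tags."""
    header_value = headers.get("X-Robots-Tag", "")
    directives = {d.strip().lower() for d in header_value.split(",") if d.strip()}
    meta = MetaRobots()
    meta.feed(html)
    return directives | meta.directives

directives = page_directives(
    {"X-Robots-Tag": "noindex"},
    '<html><head><meta name="robots" content="nofollow"></head></html>',
)
print("noindex" in directives, "nofollow" in directives)  # True True
In this sketch, a page whose merged directives include noindex is kept out of the index, and its links are not followed when the directives include nofollow.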
File format
A robots.txt file is placed in the website’s top-level directory and is named robots.txt, for example https://www.example.com/robots.txt. The robots.txt file requires only two lines, User-agent and Disallow, but can contain other instructions.
The robots.txt file requires the following instructions:
- User-agent. The name of the crawler. Use * to indicate all crawlers.
- Disallow. The URL path to block. Use / to indicate the entire site.
The following example blocks search engines from crawling the entire site:
User-agent: *
Disallow: /
The following example blocks a specific directory:
User-agent: *
Disallow: /archive/
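To double-check how rules like these are interpreted, Python’s standard urllib.robotparser module can evaluate them directly; the test URLs below are made up for illustration:
# Evaluate the directory example above with the Python standard library.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /archive/",
])
# Anything under /archive/ is blocked; everything else is allowed.
print(parser.can_fetch("*", "https://www.example.com/archive/2019/post.html"))  # False
print(parser.can_fetch("*", "https://www.example.com/blog/post.html"))          # True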
What happens if the robots.txt file is not found?
If a robots.txt file is not found and the request returns an HTTP 4xx family error, Springboard assumes the site does not define a robots.txt file, defaults to allow all, and crawls everything.
For an HTTP 5xx family error, the site might be under maintenance or have some other problem. Springboard does not proceed with crawling until the server responds successfully again.