Robots.txt and the web data source
What is a robots.txt file?
A robots.txt file tells a search engine which pages to crawl and which to ignore on a website. A website can give crawling and indexing instructions to specific search engines by naming them as user agents. The instructions for crawling behavior primarily use allow and disallow rules. Additionally, there are page-level instructions for use in meta tags or HTTP headers.
Springboard’s web data source is designed with two obedience behaviors:
- Obey Allow, Disallow, and other rules found in a robots.txt file.
- Obey robots meta tags and HTTP response headers for noindex, nofollow, and other rules.
Springboard does not use the crawl-delay rule because it is a non-standard instruction and Google does not use the rule in its own crawls.
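To make these two behaviors concrete, the following sketch checks both kinds of rules using only Python’s standard library. It is an illustration under assumed inputs, not Springboard’s implementation; the user-agent string, sample rules, headers, and HTML are invented for the example:
# Illustrative sketch of both obedience behaviors; not Springboard's code.
# The user agent, rules, headers, and HTML below are invented examples.
import urllib.robotparser
from html.parser import HTMLParser

USER_AGENT = "example-crawler"  # hypothetical crawler name

# Behavior 1: obey Allow and Disallow rules from robots.txt.
robots = urllib.robotparser.RobotFileParser()
robots.parse([
    "User-agent: *",
    "Allow: /private/press/",
    "Disallow: /private/",
])
print(robots.can_fetch(USER_AGENT, "https://www.example.com/private/notes.html"))    # False
print(robots.can_fetch(USER_AGENT, "https://www.example.com/private/press/1.html"))  # True

# Behavior 2: obey robots meta tags and X-Robots-Tag response headers.
class MetaRobots(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

def page_directives(headers, html):
    """Merge directives from the X-Robots-Tag header and robots meta tags."""
    header_value = headers.get("X-Robots-Tag", "")
    directives = {d.strip().lower() for d in header_value.split(",") if d.strip()}
    meta = MetaRobots()
    meta.feed(html)
    return directives | meta.directives

directives = page_directives(
    {"X-Robots-Tag": "noindex"},
    '<html><head><meta name="robots" content="nofollow"></head></html>',
)
print("noindex" in directives, "nofollow" in directives)  # True True
In this sketch, a page whose merged directives include noindex is kept out of the index, and its links are not followed when the directives include nofollow.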
File format
A robots.txt file is placed in the website’s top-level directory and is named robots.txt, for example https://www.example.com/robots.txt. The robots.txt file requires only two lines, User-agent and Disallow, but can contain other instructions.
The robots.txt file requires the following instructions:
- User-agent. The name of the crawler. Use * to indicate all crawlers.
- Disallow. The URL path to block. Use / to indicate the entire site.
The following example blocks search engines from crawling the entire site:
User-agent: *
Disallow: /
The following example blocks a specific directory:
User-agent: *
Disallow: /archive/
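To double-check how rules like these are interpreted, Python’s standard urllib.robotparser module can evaluate them directly; the test URLs below are made up for illustration:
# Evaluate the directory example above with the Python standard library.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /archive/",
])
# Anything under /archive/ is blocked; everything else is allowed.
print(parser.can_fetch("*", "https://www.example.com/archive/2019/post.html"))  # False
print(parser.can_fetch("*", "https://www.example.com/blog/post.html"))          # True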
What happens if the robots.txt file is not found?
If a robots.txt file is not found and the request returns an HTTP 4xx family error, Springboard assumes the site does not define a robots.txt file, defaults to allow all, and crawls everything.
For an HTTP 5xx family error, the site might be under maintenance or have some other problem. Springboard does not proceed with crawling until the server responds successfully again.