    Robots.txt files and the Web data source

    What is a robots.txt file?

    A robots.txt file tells a search engine which pages on a website to crawl and which to ignore. A website can give crawling and indexing instructions to specific search engines by naming them as user agents. Instructions for crawling behavior primarily use Allow and Disallow rules. In addition, page-level instructions can be given in robots meta tags or HTTP response headers.

    Springboard’s Web data source is designed to obey two kinds of robots rules:

    • Obey Allow, Disallow, and other rules found in a robots.txt file

    • Obey robots meta tags and HTTP response headers for noindex, nofollow, and other rules (see the example below)
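
    For example, a page can opt out of indexing and link-following with a robots meta tag in its HTML, or with an equivalent X-Robots-Tag HTTP response header. The directive values shown here are illustrative:

    <meta name="robots" content="noindex, nofollow">

    X-Robots-Tag: noindex, nofollow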

    Springboard does not use the crawl-delay rule because it is a non-standard instruction and Google does not use the rule in its own crawls.

    File format

    A robots.txt file is placed in the website’s top-level directory and is named robots.txt, for example https://www.example.com/robots.txt. The file requires only two lines, User-agent and Disallow, but can include other instructions.

    The robots.txt file requires the following instructions:

    • User-agent. The name of the crawler. Use * to indicate all crawlers.

    • Disallow. The URL path to block. Use / to indicate the entire site.

    The following example blocks all crawlers from the entire site:

    User-agent: *
    Disallow: /

    The following example blocks a specific directory:

    User-agent: *
    Disallow: /archive/
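
    Rules can also target a specific crawler by naming it as the user agent, and an Allow rule can re-open a path inside a blocked directory. The user agent name and paths below are illustrative:

    User-agent: Googlebot
    Disallow: /archive/
    Allow: /archive/public/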

    What happens if the robots.txt file is not found?

    If the request for the robots.txt file returns an HTTP 4xx error, Springboard assumes the site does not define a robots.txt file, defaults to allowing everything, and crawls the entire site.

    If the request returns an HTTP 5xx error, the site might be undergoing maintenance or have some other problem. Springboard does not crawl the site until the server returns a healthy response.
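
    This behavior can be summarized in a short sketch. The code below is illustrative only, not Springboard’s implementation; it uses Python’s standard urllib modules, and fetch_robots_policy is a hypothetical helper name:

    import urllib.error
    import urllib.request
    import urllib.robotparser

    def fetch_robots_policy(site_root):
        """Fetch robots.txt and decide how crawling should proceed.

        Returns a RobotFileParser when crawling may proceed, or None when
        crawling should be deferred because the server returned a 5xx error.
        """
        robots_url = site_root.rstrip("/") + "/robots.txt"
        parser = urllib.robotparser.RobotFileParser(robots_url)
        try:
            with urllib.request.urlopen(robots_url) as response:
                body = response.read().decode("utf-8", errors="replace")
            parser.parse(body.splitlines())  # obey the rules that were found
            return parser
        except urllib.error.HTTPError as err:
            if 400 <= err.code < 500:
                parser.parse([])  # 4xx: treat the file as absent and allow everything
                return parser
            return None  # 5xx: server problem, so defer crawling

    # Example use: crawl a URL only when the rules permit it.
    policy = fetch_robots_policy("https://www.example.com")
    if policy is not None and policy.can_fetch("*", "https://www.example.com/archive/"):
        pass  # safe to crawl this URL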