Robots.txt: what it is, what it is for, and how to write and test it for Google

As small as it is powerful, this file can bar search engine crawlers from a site or open its doors to them, totally or partially. Let's look at its features, why it is so important and how to make the most of it.

Robots.txt: what is it and what is it for?

It is a text file, usually just a few bytes in size, but with enormous power: it can allow crawlers to scan the site, or prevent them from doing so, totally or partially. Its function is therefore to establish which parts of the site can be analysed, and potentially indexed, by the spiders of Google and other search engines.

The robots.txt file is based on the robots exclusion standard, a protocol proposed back in 1994 (it first circulated on a mailing list for web robot developers) to regulate communication between websites and crawlers. Thanks to this file it is possible, for example, to block crawler access to a specific directory or page, so that compliant crawlers will (in theory) ignore it and not index it.

Source: Google Search Central

How does the robots.txt file work?

First of all, to work at all, the robots.txt file must be placed in the root of the site. Once it has been filled in with the instructions you choose, crawlers that respect the standard – not all of them do – will read its contents before they start crawling the site.

The principle behind its instructions is very simple: each field is assigned a value. A classic example is the pair of directives that determines which user agents can access the site and which areas of the site are off-limits:

User-agent: *

Disallow: 

The asterisk is a wildcard that matches all crawlers, so the instruction above indicates that the site is open to every one of them. To see whether any parts of the site are excluded, however, we must also read the second instruction.

In the example shown there is no value after Disallow, so no area of the site is off-limits.

Adding a slash after the Disallow instruction, by contrast, blocks access to the entire site:

User-agent: *

Disallow: /

Source: https://nicholasmarmonti.com/

In practice, the most common use of this instruction is to enter a directory, or one or more specific pages, in this field, limiting the "no access" rule to those paths.
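For example, to block a single directory and one specific page while leaving the rest of the site open (the paths below are purely illustrative), you could write:

User-agent: *

Disallow: /private/

Disallow: /old-page.html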

When can the robots.txt file be useful?

There are several cases where it can be useful to make use of this file:

  • Block crawling of a page that you do not want to appear in search results, for a variety of reasons (privacy, irrelevance for indexing purposes, etc.)
  • Block a directory and all the pages it contains
  • Block a specific crawler
  • Block images or videos of any format

In this way it is possible, for example, to keep restricted areas, or other sections that you do not want to share publicly, off-limits to crawlers.
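To illustrate the last two cases: the rules below lock out a hypothetical unwanted crawler entirely and keep Google's image crawler away from a media directory (the "BadBot" name and the /media/ path are illustrative; Googlebot-Image is a real Google user agent):

User-agent: BadBot

Disallow: /

User-agent: Googlebot-Image

Disallow: /media/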

Main features of the robots.txt file

As we have seen, the first function of this file is to regulate spiders' access to the site's data. Now let's look at the main instructions that webmasters can add with a few simple commands.

  • User-agent: one of the mandatory parameters; it indicates which robots the rules apply to
  • Disallow: tells spiders not to crawl the indicated pages, which in most cases keeps them out of the SERP
  • Allow: used as an exception to the Disallow directive for specific pages or sections (not every crawler supports it, but the major ones, including Googlebot, do)
  • Sitemap: allows you to indicate the site's sitemap
  • Crawl-delay: defines the wait between two requests that bots should respect (Google ignores it, while engines such as Bing and Yandex honor it)
  • Host: allows you to declare whether the site should be reached with or without www (historically supported only by Yandex)

But the robots.txt file also has some other important features, such as the ability to tell crawlers where the sitemap is:

Sitemap: https://www.mysite.com/sitemap.xml

In addition to this, which together with the access directives is probably its most popular feature, the robots.txt file also offers options for more experienced users. Here is a short list:

  • Delay in crawling: through the Crawl-delay directive it is possible to set the minimum interval a crawler should leave between two requests.
  • Host: for sites with multiple mirrors, the Host directive made it possible to declare which is the main domain (Yandex, the only engine that supported it, has since deprecated it).
  • Noindex: in the past some webmasters also placed noindex rules in robots.txt, but Google never officially supported them and stopped honoring them entirely in 2019; to keep a page out of the index, use the robots meta tag or the X-Robots-Tag HTTP header instead.
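As a practical note, since Google does not honor noindex rules placed inside robots.txt, the supported alternatives are a robots meta tag in the page's HTML head:

<meta name="robots" content="noindex">

or an equivalent HTTP response header:

X-Robots-Tag: noindex

Remember that for either to work, the page must remain crawlable: if robots.txt blocks the page, crawlers never see the noindex signal.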

When do you need a robots.txt file?

If there are no specific needs, having a robots.txt file is not essential in itself. However, given how easily it can be created and deployed, it is worth having anyway, if for no other reason than being able to control which areas of the site are theoretically accessible to search engines and which are not, and being able to change these settings at any moment.

On the other hand, those who find themselves in one of these situations cannot do without it:

  • Those who want to exclude certain areas of the site from the search results, such as the private area or a staging environment
  • Those who need to keep pictures, videos and other files out of the search results
  • Those who have a large site and want to avoid bots wasting crawl budget on secondary areas of the site
  • Those who have a site with a lot of duplicate content

Robots and Crawl budget

Implementing a robots.txt file and configuring it carefully therefore becomes essential for the optimization of large sites, especially those that make extensive use of parameterized URLs, for example to manage page filters. Such sites easily reach thousands of pages, a large number of which are not particularly significant for search engines.

Since crawlers assign each site a variable but still limited crawl budget (a sort of credit to spend on crawling the pages of the site), it becomes essential to exclude all pages that are not significant for indexing, leaving the budget available for pages that have SEO value.

With a few simple strings it is possible to achieve this result.
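For example, to keep crawlers away from filter and sort parameters (the parameter names below are illustrative), major crawlers such as Googlebot support the * wildcard in rule paths, although not every crawler does:

User-agent: *

Disallow: /*?filter=

Disallow: /*&sort=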

How to test the robots.txt file for Google

Once the robots.txt file has been written by hand, with a software program or with a plugin, it must be uploaded to the site's hosting server and be readable by crawlers; otherwise all the work was for nothing.

There are several tools on the web for testing robots.txt files, but why not rely on Google itself? Using Search Console, you can test your robots.txt file and verify that it is correctly configured.

The platform not only shows the instructions contained in the file, but also allows you to test the access of Google bots to the site.
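For those who prefer to check their rules programmatically before uploading, Python's standard library ships a robots.txt parser; a minimal sketch (the rules and paths below are illustrative):

```python
from urllib.robotparser import RobotFileParser

# A small, illustrative robots.txt body, parsed in place.
# (RobotFileParser can also fetch a live file via set_url() + read().)
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, path) applies the matching rule group
print(parser.can_fetch("*", "/blog/post"))     # True: not disallowed
print(parser.can_fetch("*", "/private/data"))  # False: under /private/
```

Note that this parser follows the original exclusion standard and does not understand every engine-specific extension (for example Google's * and $ wildcards), so it is best suited to checking plain prefix rules.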

If you do not want to limit any crawler, it is suggested to set the file as follows:

User-agent: *

Disallow:

The interface returns the test result in a few seconds, indicating whether the Google bots can access the site or not. We can then make any appropriate changes, upload the file and test it again.

The tool also allows you to test specific Google bots, such as the image and news bots.

In this example, the tester tells us that Googlebot – Google's main crawler – cannot access pages with the .html extension.