Robots.txt Generator

Generate Robots.txt Files


The robots.txt file is a text file used to instruct web robots (also known as web crawlers or spiders) how to behave when crawling a website. It is a standard method websites use to communicate with web crawlers, providing guidance on which pages or sections of the site should not be crawled or indexed.

About

The "Robots.txt Generator" is a tool designed to streamline the process of creating robots.txt files for websites. It simplifies the task of managing how search engine robots interact with a site by allowing users to define access rules for specific search engine bots. This tool is particularly useful for webmasters and developers who want fine-grained control over the crawling and indexing of their websites by search engines.

The generator offers an intuitive interface, allowing users to establish a default policy that determines whether all search engine robots should be granted or denied access to the site. Users can override this default on a per-robot basis by setting their preference to "Allowed" or "Refused" for each individual search engine robot included in the tool.

In addition, the generator enables users to add restrictive rules for directories or path patterns, specify a crawl delay parameter, and provide the website's sitemap location.
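As an illustration, a generated file that combines these settings can be read back with Python's standard-library robots.txt parser. The values and the example.com URL below are hypothetical, not output of this specific tool:

```python
from urllib import robotparser

# A hypothetical generated file with a crawl delay and a sitemap location.
generated = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(generated.splitlines())

# crawl_delay() requires Python 3.6+, site_maps() requires Python 3.8+.
print(rp.crawl_delay("*"))  # 10
print(rp.site_maps())       # ['https://example.com/sitemap.xml']
```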

Supported robots

The Robots.txt Generator is preconfigured with the following bots and crawlers:

Google
Googlebot - Googlebot Smartphone & Googlebot Desktop.
Googlebot-Image - Googlebot Image & Google Favicon.
Googlebot-News - Googlebot News.
Googlebot-Video - Googlebot Video.
Storebot-Google - Google StoreBot - crawls through certain types of pages, including, but not limited to, product details pages, cart pages, and checkout pages.

Bing
Bingbot - The standard Bing crawler; it handles most of Bing's daily crawling needs.

Yahoo
Slurp - The Yahoo Search robot for crawling and indexing web page information.

Ahrefs
AhrefsBot - The crawler that gathers link data for the Ahrefs search engine optimization tools.

Amazon
Amazonbot - Amazon's web crawler, used to improve Amazon's services, such as enabling Alexa to answer even more questions for customers. Amazonbot respects standard robots.txt rules.

Baidu
Baiduspider - The Baidu search engine program that visits pages on the internet and builds their information into the Baidu index. There are also a number of other special-purpose spiders with names starting with "Baiduspider".

DuckDuckGo
DuckDuckBot - DuckDuckBot is the Web crawler for DuckDuckGo.

Moz
DotBot - Moz's web crawler; it gathers web data for the Moz Link Index.

Naver
Yeti - Naver's search robot for crawling and indexing web pages.

OpenAI ChatGPT
GPTBot - OpenAI's web crawler.

Yandex
Yandex - The Yandex search engine has many special-purpose bots with names starting with “Yandex”.

Robots.txt configuration examples

Let’s illustrate the functionality of this tool by creating a basic robots.txt file that may be sufficient for most websites.

First, let’s define the “Default for All Robots” as “Allowed”, set all individual robots’ settings to “Default”, and click “Generate Robots.txt”. The tool will produce a file with a single configuration group that looks like this:

User-agent: * 
Disallow:

User-agent: This line specifies the web crawler or user agent to which the rules apply. In this case, the asterisk (*) symbol states that the rules apply to all web crawlers.
Disallow: This line specifies the URL path that should not be crawled by the specified user agent. In our example, the empty path allows unrestricted access to the entire site.
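This behavior can be sanity-checked with Python's standard-library robots.txt parser; the example.com URL is just a placeholder:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow:"])

# An empty Disallow value places no restrictions at all.
print(rp.can_fetch("AnyBot", "https://example.com/any/page.html"))  # True
```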

If required, all web crawlers can be instructed not to crawl specific directories on the website. To illustrate, enter a comma-separated list of directories, for example “/private/,/restricted/”, into the “Disallowed Directories” field:

User-agent: *
Disallow: /private/
Disallow: /restricted/

In this example, all web crawlers are instructed not to crawl the "/private/" and "/restricted/" directories on the website but otherwise have unrestricted access to the rest of the URLs on the site.

As you can see, multiple Disallow lines can be added to the same group to restrict access to multiple path prefixes. 

Path prefixes may include the following special symbols:
$ - Designates the end of the match pattern.
* - Designates zero or more instances of any character.
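The effect of these two symbols can be sketched with a small, simplified Python function. Note this is an illustration of the pattern semantics only; Python's stdlib robots.txt parser does not implement these wildcards, and real crawlers apply the full RFC 9309 matching rules:

```python
import re

def rule_matches(rule_path: str, url_path: str) -> bool:
    """Simplified robots.txt path matching: '*' matches any run of
    characters, '$' anchors the end of the pattern; otherwise a rule
    matches any URL path it is a prefix of."""
    pattern = "".join(
        ".*" if ch == "*" else ("$" if ch == "$" else re.escape(ch))
        for ch in rule_path
    )
    return re.match(pattern, url_path) is not None

print(rule_matches("/*.pdf$", "/docs/report.pdf"))   # True
print(rule_matches("/*.pdf$", "/docs/report.pdfx"))  # False ('$' anchors the end)
print(rule_matches("/private/", "/private/a.html"))  # True (prefix match)
```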

Furthermore, Allow rules can be used in combination with Disallow rules to fine-tune robots’ access.
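For example, a single public page inside an otherwise blocked directory could be exposed like this (the paths are hypothetical). Python's stdlib parser applies rules in file order, so the more specific Allow line is listed first; under RFC 9309 longest-match precedence the result is the same:

```python
from urllib import robotparser

rules = """\
User-agent: *
Allow: /private/status.html
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "https://example.com/private/status.html"))  # True
print(rp.can_fetch("AnyBot", "https://example.com/private/other.html"))   # False
```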

As the next step, let's demonstrate how restrictions can be applied to a single crawler. If, for example, we want to prevent GPTBot from indexing our site, we can change the corresponding setting to “Refused”. The resulting robots.txt file will now include an additional group of configuration lines:

User-agent: GPTBot # OpenAI ChatGPT
Disallow: /

User-agent: GPTBot - Explicitly specifies the search engine robot to which the rule applies, in this case GPTBot.
Disallow: / - Indicates that the bot in question is not allowed to access any part of the site.
# - Designates the remainder of the line as a comment.
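The combined file can again be checked with Python's stdlib parser, which also demonstrates that the trailing comment is ignored (example.com is a placeholder):

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow:

User-agent: GPTBot # OpenAI ChatGPT
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot is refused; every other bot falls through to the allow-all group.
print(rp.can_fetch("GPTBot", "https://example.com/"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/"))  # True
```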

If required, similar restrictions can be applied to other bots by following the procedure described above. If the bot you wish to restrict is not included in this tool, copy the group from the previous example into the robots.txt file and replace “GPTBot” with the name (product token) of the bot you wish to block. One way to find the name is to check the bot’s user-agent string recorded in web server logs, for example:

[03/Oct/2023:03:03:13 -0400] "GET / HTTP/1.1" 301 580 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
[03/Oct/2023:14:21:15 -0400] "GET / HTTP/1.1" 200 4170 "-" "Mozilla/5.0 (compatible; Qwantify-prod/1.0; +https://help.qwant.com/bot/)"
[03/Oct/2023:15:47:15 -0400] "GET / HTTP/1.1" 200 4170 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36"
[03/Oct/2023:17:31:49 -0400] "GET / HTTP/1.1" 403 439 "-" "MaxPointCrawler/Nutch-1.17 (valassis.crawler at valassis dot com)"


The robots.txt we’ve just reviewed started with an “Allow All” configuration. The alternative approach is to start with “Disallow All” and then permit access to robots on an as-needed basis.

For example:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

This configuration denies access to all robots with the exception of Googlebot.
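A quick check of this allow-list configuration with Python's stdlib parser (example.com and the paths are placeholders):

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own group; all other bots hit the catch-all deny.
print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # True
print(rp.can_fetch("Bingbot", "https://example.com/page.html"))    # False
```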

For more details about robots.txt configuration and format, see RFC 9309, the Robots Exclusion Protocol.

It's important to note that while robots.txt is a widely recognized standard, not all web crawlers obey its directives. Many well-behaved crawlers follow the rules, but malicious or poorly programmed crawlers may ignore them. Additionally, robots.txt is a public file, and it doesn't provide security for sensitive information - it's more of a convention for communication with cooperative web crawlers. If you need to protect sensitive information, other measures such as authentication and access controls should be implemented.

Contact

Missing something?

Feel free to request missing tools or give some feedback using our contact form.
