Robots.txt
robots.txt is a text file that search engines like Google look for in order to decide which pages and routes should be crawled and indexed. It is generally placed at the root of the domain, for example: example.com/robots.txt. It is an important file that can also affect your site's performance.
Cloudflare Workers
Cloudflare Workers are serverless functions that run on the edge locations of Cloudflare's vast network, and they are well known for their zero cold-start promise. To set up a custom response for robots.txt, you first need to make sure that your domain is set up on Cloudflare (the free tier works).
export default {
  async fetch() {
    // Static robots.txt body: allow only /content/images and its
    // subdirectories, disallow everything else.
    const responseText = `User-agent: *\nAllow: /content/images\nAllow: /content/images/\nAllow: /content/images/*\nDisallow: /`;
    return new Response(responseText, {
      headers: { "Content-Type": "text/plain" },
    });
  },
};
The above snippet returns a text/plain
response with the following body:
User-agent: *
Allow: /content/images
Allow: /content/images/
Allow: /content/images/*
Disallow: /
This tells search engines to crawl routes that begin with /content/images and all of its subdirectories; every other route is explicitly disallowed. Bots might still crawl the disallowed routes, but we are asking them to honor these preferences.
The next step is to set up a route that triggers the worker whenever it is matched. For the above example, the route URL has to be set to https://example.com/robots.txt.
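The route can be added from the Cloudflare dashboard, or, if you deploy with Wrangler, declared in wrangler.toml. A minimal sketch, assuming a module worker in src/index.js; the worker name, date, and domain are illustrative:

# wrangler.toml — illustrative values, adjust to your worker and zone
name = "robots-txt-worker"
main = "src/index.js"
compatibility_date = "2024-01-01"

# Trigger the worker only for the robots.txt path on this zone
routes = [
  { pattern = "example.com/robots.txt", zone_name = "example.com" }
]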
Use cases
It is sometimes hard to make changes to a website's content and files directly. In the case above, I had to stop Ghost CMS content from being indexed, which is required if you plan to use the CMS in headless mode. Instead of editing the theme to set a custom robots.txt, this approach effectively bypasses the robots.txt file served by Ghost and serves the response from the worker directly.
The same approach can be used to serve any custom response/page/file, including files like sitemap.xml.
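For example, a single worker could be routed to both paths and branch on the request URL. A minimal sketch; the sitemap contents here are placeholders, not a real sitemap:

export default {
  async fetch(request) {
    const { pathname } = new URL(request.url);

    // Serve a hand-rolled sitemap.xml (placeholder URL entries)
    if (pathname === "/sitemap.xml") {
      const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
</urlset>`;
      return new Response(sitemap, {
        headers: { "Content-Type": "application/xml" },
      });
    }

    // Fall back to the robots.txt response for every other matched route
    const robots = `User-agent: *\nAllow: /content/images\nAllow: /content/images/\nAllow: /content/images/*\nDisallow: /`;
    return new Response(robots, {
      headers: { "Content-Type": "text/plain" },
    });
  },
};

With this variant you would add a second route, e.g. https://example.com/sitemap.xml, pointing at the same worker.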
Bonus
If you’d like to restrict AI bots from scraping content on your website, add their respective disallow configurations.
For instance:
User-agent: Amazonbot
Disallow: /
User-agent: Arquivo-web-crawler
Disallow: /
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
Disallow: /
User-agent: Bingbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Magpie-crawler
Disallow: /
User-agent: Mojeek
Disallow: /
User-agent: MoodleBot
Disallow: /
User-agent: NewsNow
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: Peer39_crawler
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Scrapy
Disallow: /
User-agent: Seekr
Disallow: /
User-agent: Turnitin
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: YouBot
Disallow: /
While this list is not exhaustive, it should cover most of the popular AI bots. As new AI services are introduced, each model may use its own crawler, so the robots.txt file needs to be updated manually and fairly often.
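Since the list keeps changing, it can be convenient to keep the user-agents in an array inside the worker and generate the Disallow rules from it. A minimal sketch with a shortened, illustrative list:

// Shortened, illustrative list — extend it as new crawlers appear
const AI_BOTS = ["GPTBot", "ChatGPT-User", "CCBot", "ClaudeBot", "PerplexityBot"];

export default {
  async fetch() {
    // One "User-agent: <bot>\nDisallow: /" block per listed crawler
    const aiRules = AI_BOTS.map((bot) => `User-agent: ${bot}\nDisallow: /`).join("\n\n");

    // Default rules for every other crawler
    const defaultRules = `User-agent: *\nAllow: /content/images\nAllow: /content/images/\nAllow: /content/images/*\nDisallow: /`;

    return new Response(`${aiRules}\n\n${defaultRules}`, {
      headers: { "Content-Type": "text/plain" },
    });
  },
};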
It is up to the bots to honor the configuration; the mere existence of a valid robots.txt file does not guarantee that your website's content won't be scraped.