robots.txt is a text file at https://example.com/robots.txt that tells search engine crawlers which paths they're allowed to crawl. It's an old, simple, surprisingly powerful tool — and the source of more accidental traffic disasters than almost any other SEO file.
What it does (and doesn't do)
robots.txt controls crawling, not indexing. A URL blocked in robots.txt can still appear in search results if Google learns about it from elsewhere (links from other sites, sitemaps, etc.). Because Google can't crawl the page, the SERP entry will be sparse: typically a bare URL with a note such as "No information is available for this page."
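To keep a URL out of the index entirely, put a noindex directive on the page itself, either as a robots meta tag (<meta name="robots" content="noindex">) or as an X-Robots-Tag: noindex HTTP response header, and leave the URL crawlable. If robots.txt blocks the page, Google never fetches it and never sees the noindex.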
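As a rough illustration of the two noindex signals, here is a minimal Python sketch that checks a response for either one. The function name has_noindex and the helper class are made up for this example, and the logic is deliberately simplified: it ignores bot-specific forms such as "googlebot: noindex" and the "none" shorthand.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            tokens = [t.strip().lower() for t in attr.get("content", "").split(",")]
            if "noindex" in tokens:
                self.noindex = True


def has_noindex(headers, html):
    """Return True if the headers or the HTML carry a generic noindex directive.

    `headers` is a plain dict of response headers; `html` is the page body.
    (Hypothetical helper for illustration, not a complete implementation.)
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            tokens = [t.strip().lower() for t in value.split(",")]
            if "noindex" in tokens:
                return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))
print(has_noindex({}, '<head><meta name="robots" content="noindex"></head>'))
```

A check like this only helps if the URL is crawlable in the first place, which is why it pairs naturally with a robots.txt check.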
Minimum useful structure
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
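It may not be obvious why Allow: / doesn't cancel out the Disallow lines. Google and RFC 9309 pick the most specific (longest) matching rule, with Allow winning a tie. Here is a minimal Python sketch of that idea, assuming plain path prefixes only (no * or $ wildcards); the names is_allowed and RULES are invented for this example.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled under one user-agent group.

    `rules` is a list of (directive, prefix) pairs, e.g. ("disallow", "/admin/").
    The longest matching prefix wins; on equal length, allow wins.
    Simplified sketch: wildcards and percent-encoding are not handled.
    """
    best_len = -1
    allowed = True  # with no matching rule, crawling is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_longer = len(prefix) > best_len
        is_tie_for_allow = len(prefix) == best_len and directive == "allow"
        if is_longer or is_tie_for_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


# The rules from the example above, for User-agent: *
RULES = [("allow", "/"), ("disallow", "/admin/"), ("disallow", "/api/")]

print(is_allowed(RULES, "/blog/post"))    # True
print(is_allowed(RULES, "/admin/users"))  # False
```

Note that /administrator would still be crawlable under these rules, because it does not start with /admin/ (trailing slash included).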
Common mistakes
- Blocking the entire site by accident: Disallow: / applied to all user-agents is a real cause of "we shipped a redesign and our traffic dropped to zero." Check robots.txt on every deploy.
- Blocking CSS and JS: Google needs to render your page to evaluate it. Don't Disallow: /static/ or /_next/.
- Using robots.txt for sensitive paths: robots.txt is public. Listing /admin/secret-page advertises its existence to anyone curious. Use authentication, not robots.txt.
- Forgetting to declare the sitemap: a Sitemap: https://... line helps crawlers find it.
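The first mistake is the easiest one to guard against automatically. Here is a minimal sketch of a deploy-time check in Python, assuming you already have the robots.txt body as a string. The function name blocks_entire_site is hypothetical, and the parsing is intentionally naive: it only looks for a bare Disallow: / in a group that applies to *.

```python
def blocks_entire_site(robots_txt):
    """Return True if a group for User-agent: * contains `Disallow: /`.

    Naive line-based sketch: it does not weigh Allow overrides or wildcards.
    """
    agents = []       # user-agents of the group currently being read
    in_rules = False  # True once the current group has at least one rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


bad = "User-agent: *\nDisallow: /\n"
good = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n"
print(blocks_entire_site(bad))   # True
print(blocks_entire_site(good))  # False
```

In a CI pipeline, a check like this could fail the build when it returns True for the production robots.txt.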
Per-bot directives
You can target specific crawlers:
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /
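This is how many sites have blocked AI training crawlers since 2023. The operators of GPTBot, ClaudeBot, and CCBot say their crawlers honor robots.txt; Google-Extended is a control token that Google's existing crawlers honor rather than a separate bot. Compliance is voluntary in every case. Whether you should block them is a separate strategic question.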
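To sanity-check per-bot rules like the ones above, you can feed them to Python's standard-library urllib.robotparser. This is a small sketch rather than a full audit; "SomeOtherBot" is a made-up crawler name used to show what happens when no group matches.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from a string, no network request

url = "https://example.com/blog/post"
print(parser.can_fetch("Googlebot", url))     # True
print(parser.can_fetch("GPTBot", url))        # False
print(parser.can_fetch("SomeOtherBot", url))  # True: no group applies to it
```

The last line is the one to keep in mind: a crawler that matches no group, and for which there is no User-agent: * group, is allowed everywhere by default.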