robots.txt is the smallest file on most sites and the largest single cause of self-inflicted SEO disasters I see in audits. It's a plain-text file with a handful of directives — and yet I've seen Fortune-500 sites accidentally block themselves from Google for weeks because of one misplaced character.
This post covers six failure modes I've watched destroy real traffic, with the exact pattern that caused each and the one-line fix.
1. The "we shipped the dev robots.txt" classic
The single most common disaster I see, every year, across every kind of site:
User-agent: *
Disallow: /
That's the robots.txt most dev and staging environments use. It blocks crawlers from the whole site: appropriate for staging, catastrophic for production.
It ships in three ways:
- CI promoted staging files: a deploy pipeline copies staging/robots.txt to production by mistake.
- CMS template override: someone added the disallow during a redesign, never reverted.
- CDN cache poisoning: production serves the staging robots.txt from a misconfigured cache rule.
Fix: add a robots.txt check to your post-deploy smoke tests. Two lines:
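curl -fsS https://example.com/robots.txt | tr -d '\r' | grep -qiE '^disallow:[[:space:]]*/[[:space:]]*$' && \
  { echo "DISASTER: site disallowed" >&2; exit 1; }
Run it on every deploy. If it ever fires, you've saved yourself two weeks of recovery. One caveat: the pattern matches a bare Disallow: / in any group, so it will also fire on deliberate bot-specific blocks like the AI-crawler rules in section 6.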
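To scope that check to the User-agent: * group only, a group-aware variant can help. The sketch below is one possible approach, not a full robots.txt parser: blocks_whole_site is a hypothetical helper name, it reads the file from stdin, and it ignores Allow overrides and wildcard patterns.

```shell
# Hypothetical helper: exit 0 if a group that applies to "User-agent: *"
# contains a bare "Disallow: /", exit 1 otherwise. Reads robots.txt on stdin.
blocks_whole_site() {
  tr -d '\r' | awk '
    {
      sub(/#.*/, "")                          # strip comments
      i = index($0, ":")
      if (i == 0) next                        # not a "key: value" line
      key = tolower(substr($0, 1, i - 1))
      val = substr($0, i + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key == "user-agent") {
        # A User-agent line that follows rules starts a new group.
        if (in_rules) { star = 0; in_rules = 0 }
        if (val == "*") star = 1
      } else if (key == "allow" || key == "disallow") {
        in_rules = 1
        if (key == "disallow" && star && val == "/") found = 1
      }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Assumed usage in a deploy script (example.com is a placeholder):
#   if curl -fsS https://example.com/robots.txt | blocks_whole_site; then
#     echo "DISASTER: site disallowed" >&2; exit 1
#   fi
```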
2. Blocking /static/ or /_next/ (or the equivalent)
Modern frameworks (Next.js, Nuxt, Remix, Astro) serve JavaScript and CSS from versioned paths under /_next/static/, /assets/, or similar. I've seen this disaster pattern more than once:
User-agent: *
Disallow: /_next/
The intent is usually "keep crawlers off the noisy build artefacts." The result is: Google can't fetch your JavaScript, can't render the page, sees a blank skeleton, and your rankings drop within 4-6 weeks.
Google needs to render your page to evaluate it. This has been explicit since the 2019 evergreen Googlebot rollout. Blocking your JS/CSS means Google evaluates an empty shell.
Fix: never block paths that serve render-required assets. The minimum-correct robots.txt for a Next.js or similar app:
User-agent: *
Allow: /
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
/api/ blocking is fine as long as those are dynamic endpoints rather than pages, and your pages don't need to call them while rendering. Everything else: let Googlebot in.
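One way to catch an accidental asset block before it ships is to compare every Disallow value against the paths your CSS and JS are served from. This is a rough sketch under stated assumptions: covering_disallows is a hypothetical helper, the asset paths are whatever you pass in, and matching is plain prefix matching across all groups (no * or $ wildcards, no Allow overrides).

```shell
# Hypothetical helper: print each Disallow value that is a prefix of one of
# the asset paths given as arguments. Reads robots.txt on stdin.
covering_disallows() {
  tr -d '\r' | awk -v paths="$*" '
    BEGIN { n = split(paths, p, " ") }
    {
      sub(/#.*/, "")                          # strip comments
      i = index($0, ":")
      if (i == 0) next
      key = tolower(substr($0, 1, i - 1))
      val = substr($0, i + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key != "disallow" || val == "") next
      for (j = 1; j <= n; j++)
        if (index(p[j], val) == 1)            # asset path starts with the rule
          print val " covers " p[j]
    }
  '
}

# Assumed usage; the asset paths are examples, adjust to your framework:
#   covering_disallows /_next/static/ /assets/ < robots.txt
```

Any output means a rule reaches into a path the renderer needs, which is worth a manual look.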
3. Using robots.txt instead of noindex
A URL blocked in robots.txt can still appear in search results. If Google learns about the URL from elsewhere — a backlink, a sitemap, an internal link from a page Google did crawl — it can show the URL in the SERP with a placeholder ("No information available for this page").
Example: you block /admin/ to keep Google out. Months later, you Google your brand and find example.com/admin/login in the results, with no description, looking broken and unprofessional.
Fix: to keep a URL out of the index entirely:
- Make it crawlable (don't block in robots.txt)
- Add <meta name="robots" content="noindex"> on the page itself, OR
- Return an X-Robots-Tag: noindex HTTP header
Google has to crawl the page to see the noindex directive. Blocking it in robots.txt prevents the crawl, which prevents the noindex from being seen, which leaves the URL eligible for indexing. The two directives need to be coordinated.
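To confirm that a page really sends the header, you can inspect its response headers. A minimal sketch, assuming you pipe in the output of curl -sI; has_noindex_header is a hypothetical helper name, and it only looks at the HTTP header, not at a meta tag in the HTML.

```shell
# Hypothetical helper: exit 0 if the response headers on stdin include an
# X-Robots-Tag header whose value contains "noindex", exit 1 otherwise.
has_noindex_header() {
  tr -d '\r' | grep -iqE '^x-robots-tag:.*noindex'
}

# Assumed usage (the URL is a placeholder):
#   curl -sI https://example.com/admin/login | has_noindex_header \
#     || echo "no noindex header on this URL"
```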
See the robots.txt glossary entry for the full crawling-vs-indexing distinction.
4. Trusting robots.txt with sensitive paths
robots.txt is public. Anyone can fetch https://example.com/robots.txt and read every path you've blocked. Including the paths that look interesting precisely because you blocked them.
I once audited a site whose robots.txt listed every admin URL, internal API endpoint, and staging subdomain. The intent was to keep them out of Google. The effect was to publish a directory of "places worth attacking" to anyone with curl.
Fix: don't put security-sensitive paths in robots.txt at all. Use authentication, IP allowlisting, or non-guessable URLs. If you don't want Google to surface the path, make it noindex; if you don't want anyone unauthorised to reach it, gate it behind auth.
A reasonable rule: anything in robots.txt should be content you'd be happy to discuss in a public conference talk. If it's not, it shouldn't be there.
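A crude automated version of that rule is to flag rules whose paths contain words that tend to mark sensitive areas. The sketch below is only a heuristic: flag_sensitive_rules is a hypothetical helper, and the keyword list is my assumption, not an established standard, so tune it to your own site.

```shell
# Hypothetical helper: print Allow/Disallow lines whose path contains a
# keyword that often marks a sensitive area. Reads robots.txt on stdin.
# The keyword list is an assumption; extend or trim it for your site.
flag_sensitive_rules() {
  tr -d '\r' | grep -iE '^(dis)?allow:.*(admin|staging|internal|backup|private)'
}

# Assumed usage:
#   flag_sensitive_rules < robots.txt && echo "review these before shipping"
```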
5. Forgetting the sitemap declaration
Robots.txt is the recommended place to point crawlers at your sitemap:
Sitemap: https://example.com/sitemap.xml
Without this, Google can still find your sitemap via Search Console submission, but other crawlers (Bing, AI training bots that respect robots.txt) may not. The declaration is one line and helps everyone.
Fix: every site should have a Sitemap: line in robots.txt pointing at the canonical sitemap URL. Multiple Sitemap: lines are valid if you have several sitemaps; if you have a sitemap index, a single line pointing at the index is enough.
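The declaration is easy to verify mechanically. A minimal sketch, assuming the file is piped in on stdin: check_sitemaps is a hypothetical helper that lists every declared sitemap and fails when there is none or when a value is not an absolute http(s) URL, since a sitemap location needs to be fully qualified.

```shell
# Hypothetical helper: print each Sitemap URL declared in robots.txt (stdin).
# Exit 1 if none is declared or any value is not an absolute http(s) URL.
check_sitemaps() {
  tr -d '\r' | awk '
    {
      sub(/#.*/, "")                          # strip comments
      i = index($0, ":")
      if (i == 0) next
      key = tolower(substr($0, 1, i - 1))
      val = substr($0, i + 1)                 # keeps the colons inside the URL
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key != "sitemap") next
      count++
      print val
      if (val !~ /^https?:\/\//) bad = 1
    }
    END { exit ((count == 0 || bad) ? 1 : 0) }
  '
}

# Assumed usage:
#   check_sitemaps < robots.txt || echo "sitemap declaration missing or relative"
```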
6. AI-training-bot blocking without thinking it through
Since 2023, many sites have added blanket blocks for AI training crawlers:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
This is a legitimate strategic choice — your content, your rules. But understand the trade-off:
- GPTBot/ClaudeBot/CCBot blocking prevents your content from training future models. It does not prevent ChatGPT/Claude from citing you in real-time search via web browsing (different bots).
- Google-Extended blocking opts you out of Gemini training. It does not affect Google Search or AI Overview ranking; those are governed by Googlebot, which follows its own User-agent group.
The mistake is to block these in panic without distinguishing training crawlers from search crawlers. Many sites blocked all bots, then noticed their AI Overview citation rate dropped, because the blanket rule had also blocked Googlebot, which they never intended.
Fix: if you want to block AI training but stay searchable:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: *
Allow: /
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
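The position of the groups in the file doesn't matter: a crawler follows the most specific User-agent group that names it and ignores User-agent: * entirely when it finds one, falling back to * only if no group matches. Each blocked bot needs its own User-agent line, though several User-agent lines can share a single set of rules.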
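A crawler obeys the group that names it and only falls back to User-agent: * when no group does. To see which group a given bot would land in, something like the following could work. It is a simplified sketch: group_for is a hypothetical helper, and it uses exact, case-insensitive name matching, whereas real crawlers may also match on a shorter product token.

```shell
# Hypothetical helper: print which group a bot would follow in the robots.txt
# on stdin: its own name, "*" for the catch-all group, or "none".
group_for() {
  tr -d '\r' | awk -v bot="$1" '
    BEGIN { name = bot; bot = tolower(bot) }
    {
      sub(/#.*/, "")                          # strip comments
      i = index($0, ":")
      if (i == 0) next
      key = tolower(substr($0, 1, i - 1))
      val = tolower(substr($0, i + 1))
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key != "user-agent") next
      if (val == bot) own = 1
      if (val == "*") star = 1
    }
    END { print (own ? name : (star ? "*" : "none")) }
  '
}

# Assumed usage:
#   group_for GPTBot    < robots.txt
#   group_for Googlebot < robots.txt
```

Running it for both your search crawlers and your training crawlers before a deploy is a cheap way to confirm that Googlebot still lands in the permissive group.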
A pre-deploy robots.txt checklist
Five checks for every robots.txt change:
- Does it block the whole site? grep "Disallow: /$" robots.txt and, if it matches inside the User-agent: * group, abort.
- Are static asset paths allowed? Make sure no Disallow: covers /_next/, /static/, /assets/, or wherever your CSS and JS live.
- Is the sitemap declared? grep -i "^Sitemap:" robots.txt should return at least one line.
- Does it expose sensitive paths? Read the file out loud. Would you put any of those URLs on a public slide?
- Are AI-bot blocks targeting the right bots? Don't lump search and training bots into the same block.
Pair this checklist with a CI test that fetches the live robots.txt after every deploy and asserts that the User-agent: * group still allows the site root. Five lines of bash; a year of disaster prevention.
Tools and references
- robots.txt glossary entry — quick reference on the crawling-vs-indexing distinction.
- sitemap.xml glossary entry — the file robots.txt usually references.
- Search Console exports — how to check which URLs Google is actually indexing vs blocking.
A correctly-configured robots.txt is invisible — nobody notices. A broken one can destroy a quarter of traffic in a single deploy. It's the highest stakes-per-byte file on your site. Treat it accordingly.