Robots.txt mistakes that cratered real traffic — and how to avoid them

robots.txt is the smallest file on most sites and the largest single cause of self-inflicted SEO disasters I see in audits. It's a plain-text file with a handful of directives — and yet I've seen Fortune-500 sites accidentally block themselves from Google for weeks because of one misplaced character.

This post covers six failure modes I've watched destroy real traffic, with the exact pattern that caused each and the fix.

1. The "we shipped the dev robots.txt" classic

The single most common disaster I see, every year, across every kind of site:

User-agent: *
Disallow: /

That's the robots.txt most dev and staging environments use. It blocks every compliant crawler from the whole site, which is appropriate for staging and catastrophic for production.

It ships in three ways:

CI promoted staging files: a deploy pipeline copies staging/robots.txt to production by mistake.
CMS template override: someone added the disallow during a redesign, never reverted.
CDN cache poisoning: production serves the staging robots.txt from a misconfigured cache rule.

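Fix: add a robots.txt check to your post-deploy smoke tests. A few lines of shell:

if curl -fsS https://example.com/robots.txt | tr -d '\r' | grep -qiE '^[[:space:]]*disallow:[[:space:]]*/[[:space:]]*$'; then
  echo "DISASTER: site disallowed" >&2
  exit 1
fi

Run it on every deploy. If it ever fires, you've saved yourself two weeks of recovery. One caveat: the pattern matches a blanket Disallow under any User-agent, so it will also fire on the deliberate per-bot blocks in section 6. If you use those, the check has to look at which User-agent group the rule belongs to.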
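A plain grep cannot tell which crawler a blanket rule applies to. Below is a minimal Python sketch of a group-aware variant, assuming the file text has already been fetched. The function name blanket_blocked_agents and the two sample files are illustrative, and Allow overrides and wildcards are not considered.

```python
def blanket_blocked_agents(robots_txt):
    """Return lowercased User-agent tokens whose group contains 'Disallow: /'."""
    blocked = set()
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:   # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow" and value == "/":
                blocked.update(agents)
    return blocked


# Hypothetical sample files, mirroring the patterns discussed in this post.
STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: *\nAllow: /\nDisallow: /api/\n"
)

for name, text in (("staging", STAGING), ("production", PRODUCTION)):
    # Fail only when the blanket rule hits every crawler or Googlebot itself.
    fatal = blanket_blocked_agents(text) & {"*", "googlebot"}
    print(name, "BLOCKED" if fatal else "ok")
```

Failing on either "*" or "googlebot" is a conservative simplification: Googlebot ignores the * group when another group names it directly, so a stricter check would resolve the applicable group first.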

2. Blocking `/static/` or `/_next/` (or the equivalent)

Modern frameworks (Next.js, Nuxt, Remix, Astro) serve JavaScript and CSS from versioned paths under /_next/static/, /assets/, or similar. I've seen this disaster pattern more than once:

User-agent: *
Disallow: /_next/

The intent is usually "keep crawlers off the noisy build artefacts." The result is: Google can't fetch your JavaScript, can't render the page, sees a blank skeleton, and your rankings drop within 4-6 weeks.

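Google needs to render your page to evaluate it. Google has advised against blocking CSS and JavaScript since its 2014 webmaster guidelines update, and the 2019 evergreen Googlebot rollout made rendering with a current Chromium the norm. Blocking your JS/CSS means Google evaluates an empty shell.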

Fix: never block paths that serve render-required assets. The minimum-correct robots.txt for a Next.js or similar app:

User-agent: *
Allow: /
Disallow: /api/

Sitemap: https://example.com/sitemap.xml

/api/ blocking is fine — these are dynamic endpoints, not pages. Everything else: let Googlebot in.
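Allow: / and Disallow: /api/ can coexist because Google resolves conflicts by specificity rather than file order: the longest matching path wins, and Allow wins a tie. Here is a minimal sketch of that precedence rule, assuming the rules for one User-agent group are already parsed into (directive, path) pairs. The name is_allowed, the rule lists, and the asset paths are hypothetical, and the * and $ wildcards are deliberately left out.

```python
def is_allowed(rules, path):
    """Longest matching rule wins; Allow wins a tie; no match means allowed."""
    matches = [
        (len(rule_path), directive.lower() == "allow")
        for directive, rule_path in rules
        if rule_path and path.startswith(rule_path)   # empty rules match nothing
    ]
    # Tuples compare by length first, then True (allow) beats False (disallow).
    return max(matches)[1] if matches else True


recommended = [("Allow", "/"), ("Disallow", "/api/")]
broken = [("Disallow", "/_next/")]

# Hypothetical render-critical assets for a Next.js-style build.
assets = ["/_next/static/chunks/main.js", "/assets/app.css"]

for label, rules in (("recommended", recommended), ("broken", broken)):
    blocked = [a for a in assets if not is_allowed(rules, a)]
    print(label, "blocks:", blocked)
```

With the recommended rules nothing render-critical is blocked while /api/ paths still are; with the broken rules the main.js chunk is reported as blocked.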

3. Using robots.txt instead of noindex

A URL blocked in robots.txt can still appear in search results. If Google learns about the URL from elsewhere (a backlink, a sitemap, an internal link from a page Google did crawl), it can show the URL in the SERP with a placeholder ("No information is available for this page").

Example: you block /admin/ to keep Google out. Months later, you Google your brand and find example.com/admin/login in the results, with no description, looking broken and unprofessional.

Fix: to keep a URL out of the index entirely:

Make it crawlable (don't block in robots.txt)
Add <meta name="robots" content="noindex"> on the page itself, OR
Return an X-Robots-Tag: noindex HTTP header

Google has to crawl the page to see the noindex directive. Blocking it in robots.txt prevents the crawl, which prevents the noindex from being seen, which leaves the URL eligible for indexing. The two directives need to be coordinated.
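For the header route, here is a minimal sketch of a WSGI wrapper that adds X-Robots-Tag: noindex to responses under chosen path prefixes. The names add_noindex and site and the /internal-search/ prefix are assumptions for illustration; most frameworks and web servers have their own way to set response headers.

```python
def add_noindex(app, prefixes):
    """Wrap a WSGI app so responses under `prefixes` carry X-Robots-Tag: noindex."""
    prefixes = tuple(prefixes)

    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "").startswith(prefixes):
                # Copy the list so the wrapped app's own headers stay untouched.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, start)

    return wrapped


def site(environ, start_response):
    """Stand-in for the real application (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<!doctype html><title>demo</title>"]


# The prefix is an assumption; it must stay crawlable in robots.txt,
# otherwise crawlers never see the header.
application = add_noindex(site, ["/internal-search/"])
```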

See the robots.txt glossary entry for the full crawling-vs-indexing distinction.

4. Trusting robots.txt with sensitive paths

robots.txt is public. Anyone can fetch https://example.com/robots.txt and read every path you've blocked. Including the paths that look interesting precisely because you blocked them.

I once audited a site whose robots.txt listed every admin URL, internal API endpoint, and staging subdomain. The intent was to keep them out of Google. The effect was to publish a directory of "places worth attacking" to anyone with curl.

Fix: don't put security-sensitive paths in robots.txt at all. Use authentication, IP allowlisting, or non-guessable URLs. If you don't want Google to surface the path, make it noindex; if you don't want anyone unauthorised to reach it, gate it behind auth.

A reasonable rule: anything in robots.txt should be content you'd be happy to discuss in a public conference talk. If it's not, it shouldn't be there.

5. Forgetting the sitemap declaration

Robots.txt is the recommended place to point crawlers at your sitemap:

Sitemap: https://example.com/sitemap.xml

Without this, Google can still find your sitemap via Search Console submission, but other crawlers (Bing, AI training bots that respect robots.txt) may not. The declaration is one line and helps everyone.

Fix: every site should have a Sitemap: line in robots.txt pointing at the canonical sitemap URL. Multiple Sitemap: lines are valid if you have several sitemaps; if you use a sitemap index, a single line pointing at the index is enough.
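This check is easy to automate. Here is a small sketch using Python's standard urllib.robotparser (its site_maps() method needs Python 3.8 or later). The helper names declared_sitemaps and sitemap_problems are illustrative. The absolute-URL test reflects the sitemaps.org requirement that the location be a full URL.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return every URL declared on a Sitemap: line, in file order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []   # site_maps() is None when nothing is declared


def sitemap_problems(robots_txt):
    """Return human-readable problems; an empty list means the check passed."""
    sitemaps = declared_sitemaps(robots_txt)
    if not sitemaps:
        return ["no Sitemap: line found"]
    return [
        f"not an absolute URL: {url}"
        for url in sitemaps
        if urlparse(url).scheme not in ("http", "https")
    ]


good = "User-agent: *\nAllow: /\n\nSitemap: https://example.com/sitemap.xml\n"
print(declared_sitemaps(good))
print(sitemap_problems("User-agent: *\nAllow: /\n"))
```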

6. AI-training-bot blocking without thinking it through

Since 2023, many sites have added blanket blocks for AI training crawlers:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

This is a legitimate strategic choice — your content, your rules. But understand the trade-off:

GPTBot / ClaudeBot / CCBot blocking prevents your content from training future models. It does not prevent ChatGPT/Claude from citing you in real-time search via web browsing (different bots).
Google-Extended blocking opts you out of Gemini training. It does not affect Google Search or AI Overview ranking; those are governed by Googlebot, which follows its own User-agent: Googlebot group, or User-agent: * if no group names it.

The mistake is to block these in panic without distinguishing training crawlers from search crawlers. Many sites blocked all bots, then noticed their AI Overview citation rate dropped, because the blanket rule had caught Googlebot as well as the training crawlers.

Fix: if you want to block AI training but stay searchable:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
Disallow: /api/

Sitemap: https://example.com/sitemap.xml

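Order within the file does not matter here: a crawler follows the most specific User-agent group that names it and falls back to User-agent: * only when none does. Each blocked bot needs its own User-agent line, although several User-agent lines can share one set of rules.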
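To confirm that a file separates training crawlers from search crawlers as intended, the sketch below asks Python's standard urllib.robotparser what each user-agent may fetch. The bot lists and the crawl_access helper are assumptions for illustration; extend the lists to match the crawlers you care about.

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]
SEARCH_BOTS = ["Googlebot", "Bingbot"]   # illustrative, not exhaustive

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
Disallow: /api/

Sitemap: https://example.com/sitemap.xml
"""


def crawl_access(robots_txt, agents, url="/"):
    """Map each user-agent to whether it may fetch `url` under this file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


training = crawl_access(ROBOTS_TXT, TRAINING_BOTS)
search = crawl_access(ROBOTS_TXT, SEARCH_BOTS)

leaking = [bot for bot, allowed in training.items() if allowed]
locked_out = [bot for bot, allowed in search.items() if not allowed]
print("training bots still allowed:", leaking)
print("search bots blocked:", locked_out)
```

Treat the result as an approximation: the standard-library parser applies rules in file order and does not understand wildcards, whereas Google uses longest-match precedence. For a homepage check like this one the two agree.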

A pre-deploy robots.txt checklist

Five checks for every robots.txt change:

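Does it block the whole site? grep -iE '^disallow:[[:space:]]*/[[:space:]]*$' robots.txt; if it matches under User-agent: * or a search crawler, abort. A match under a training bot you meant to block is expected.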
Are static asset paths allowed? Make sure no Disallow: covers /_next/, /static/, /assets/, or wherever your CSS and JS live.
Is the sitemap declared? grep -i "^Sitemap:" robots.txt should return at least one line.
Does it expose sensitive paths? Read the file out loud. Would you put any of those URLs on a public slide?
Are AI-bot blocks targeting the right bots? Don't lump search and training bots into the same block.

Pair this checklist with a CI test that fetches the live robots.txt after every deploy and asserts that Googlebot can still fetch the homepage and a few key pages. Five lines of bash; a year of disaster prevention.

Tools and references

robots.txt glossary entry — quick reference on the crawling-vs-indexing distinction.
sitemap.xml glossary entry — the file robots.txt usually references.
Search Console exports — how to check which URLs Google is actually indexing vs blocking.

A correctly-configured robots.txt is invisible — nobody notices. A broken one can destroy a quarter of traffic in a single deploy. It's the highest stakes-per-byte file on your site. Treat it accordingly.
