robots.txt for Developers: Beyond the Basics
Advanced robots.txt patterns for developers. Crawl budget optimization, AI bot directives, sitemap references, and common mistakes that block indexation.
You deploy a massive Next.js refactor on a Friday. The build passes. Lighthouse scores hit 98/100. By Tuesday morning, organic traffic plummets 80%. The culprit: a stray Disallow: / leaked from your staging environment variables into the production robots.txt file. You just told Google to drop your entire domain from its index.
For developers, robots.txt is often an afterthought — a static file dropped in the public directory to satisfy a marketing audit. But this single text file acts as the bouncer for your application infrastructure. Misconfigure it, and you expose expensive API routes to brute-force crawlers, block rendering assets, or feed your proprietary data to AI scrapers for free.
Here is how to structure, generate, and validate robots.txt directives that guard your crawl budget, block LLM training bots, and keep your indexing intact.
How does Googlebot parse wildcard patterns and rule precedence?
Googlebot processes robots.txt by finding the most specific matching User-agent group, then applying the longest matching path rule within that group, entirely ignoring rule order.
Standard regular expressions do not work in robots.txt. The protocol relies strictly on URL prefix matching and two specific wildcards: * (zero or more valid characters) and $ (designates the end of the URL).
If you define a rule for User-agent: Googlebot and another for User-agent: *, Googlebot strictly follows the Googlebot block and ignores the wildcard block entirely. There is no inheritance or merging of rules between groups.
Within a matched block, the longest path wins.
User-agent: Googlebot
Disallow: /api/
Allow: /api/public/
Disallow: /api/public/internal-metrics.json
In the example above, a request to /api/public/internal-metrics.json matches all three rules. Because the third rule has the longest character count, it takes precedence: the file is blocked. A request to /api/public/users matches the first and second rules; the second rule is longer, so the path is allowed.
Never rely on the top-to-bottom order of your rules. A 40-character Allow directive placed at the very top of the file will always override a 20-character Disallow directive placed at the bottom.
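The precedence logic above can be sketched in a few lines of TypeScript. This is an illustrative model, not Google's actual parser: it converts each path pattern to a regex (treating * as "any run of characters" and $ as an end anchor), then picks the longest matching rule, preferring Allow on ties.

```typescript
type Rule = { type: "allow" | "disallow"; path: string };

// Convert a robots.txt path pattern to a RegExp. "*" matches any run
// of characters and "$" anchors the end of the URL; everything else
// is a literal prefix match.
function patternToRegex(pattern: string): RegExp {
  const escaped = pattern
    .split("")
    .map((ch) => {
      if (ch === "*") return ".*";
      if (ch === "$") return "$";
      return ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    })
    .join("");
  return new RegExp("^" + escaped);
}

// Longest matching pattern wins; on an exact tie, Allow wins.
function isAllowed(rules: Rule[], url: string): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!patternToRegex(rule.path).test(url)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === "allow")
    ) {
      best = rule;
    }
  }
  // No matching rule at all means the URL is crawlable by default.
  return best === null || best.type === "allow";
}

const rules: Rule[] = [
  { type: "disallow", path: "/api/" },
  { type: "allow", path: "/api/public/" },
  { type: "disallow", path: "/api/public/internal-metrics.json" },
];

console.log(isAllowed(rules, "/api/public/users")); // true: longest match is Allow /api/public/
console.log(isAllowed(rules, "/api/public/internal-metrics.json")); // false: longest match is the Disallow
```

Note that rule order never enters the comparison, which is exactly why top-to-bottom reasoning about robots.txt is a trap.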
How do you dynamically generate robots.txt in Next.js?
You generate a dynamic robots.txt in the Next.js App Router by creating an app/robots.ts file that returns a MetadataRoute.Robots object, enabling environment-aware directives.
Hardcoding a public/robots.txt file fails the moment you deploy the same repository to preview, staging, and production environments. Vercel preview URLs get crawled and indexed if you don't explicitly block them, leading to duplicate content penalties before your code even reaches production.
Using app/robots.ts, you can read the deployment environment and output strict blocking rules for anything that isn't production.
// app/robots.ts
import { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  const isProduction = process.env.VERCEL_ENV === 'production';
  const baseUrl = process.env.NEXT_PUBLIC_SITE_URL || 'http://localhost:3000';

  if (!isProduction) {
    return {
      rules: {
        userAgent: '*',
        disallow: '/',
      },
    };
  }

  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/private/', '/api/', '/_next/data/'],
      },
      {
        userAgent: 'GPTBot',
        disallow: '/',
      },
    ],
    sitemap: `${baseUrl}/sitemap.xml`,
  };
}
Next.js automatically caches this route during the build step. It evaluates the environment variables at build time (or at request time if you use dynamic functions) and serves a static .txt file with the correct text/plain MIME type.
What is crawl budget and how do you optimize it?
Crawl budget is the strict limit of URLs a search engine crawler will fetch from your site in a given timeframe, which you optimize by disallowing infinite URL spaces like faceted search parameters.
Google does not have infinite resources. If your Next.js e-commerce app has 10,000 products, Googlebot allocates a specific time allowance to crawl them. If your category pages use query parameters for filtering (/shoes?color=red&size=10&sort=price_asc), you instantly generate millions of unique URL permutations.
When Googlebot encounters these permutations, it wastes your crawl budget fetching identical content with different sorting rules. Once the budget is exhausted, the crawler leaves. Your new, highly profitable product pages remain undiscovered and unindexed for weeks.
To fix this, aggressively block query parameters that do not change the core content.
User-agent: *
# Block sorting and filtering parameters
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*sessionid=
The crawl-delay directive is a myth for modern SEO. Google officially deprecated support for crawl-delay in 2019. If you add Crawl-delay: 10 to your file, Googlebot ignores it completely. To slow down aggressive crawling, your server must return HTTP 429 (Too Many Requests) status codes.
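Since Crawl-delay is ignored, throttling has to happen server-side. Here is a minimal fixed-window rate limiter sketch in TypeScript; the names, window size, and quota are illustrative, and a production setup would use a shared store (e.g. Redis) rather than in-process memory.

```typescript
// Fixed-window rate limiter keyed by User-Agent: once a crawler
// exceeds the window's quota, signal HTTP 429 (Too Many Requests).
const WINDOW_MS = 60_000; // 1-minute window (illustrative)
const MAX_REQUESTS = 60;  // quota per crawler per window (illustrative)

const windows = new Map<string, { start: number; count: number }>();

function crawlStatus(userAgent: string, now: number): 200 | 429 {
  const win = windows.get(userAgent);
  if (!win || now - win.start >= WINDOW_MS) {
    // First request in a fresh window: reset the counter.
    windows.set(userAgent, { start: now, count: 1 });
    return 200;
  }
  win.count += 1;
  return win.count > MAX_REQUESTS ? 429 : 200;
}
```

Googlebot treats sustained 429 responses as a signal to slow down, which is the supported replacement for Crawl-delay.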
Which User-agents block AI scrapers and LLM training?
To block AI scrapers from crawling your data for LLM training, you must explicitly disallow specific User-agents like GPTBot, ClaudeBot, PerplexityBot, and Google-Extended in your robots.txt.
AI companies run aggressive scraping infrastructure. Unlike Googlebot, which drives organic traffic to your site, AI bots consume your server bandwidth, parse your proprietary content, and regurgitate it directly in their chat interfaces. You pay for the egress bandwidth; they get the data for free.
Blocking User-agent: * does not stop them if you rely on a separate User-agent: Googlebot block, because AI bots fall back to the wildcard rules. You must target them directly.
# Block OpenAI
User-agent: GPTBot
Disallow: /
# Block Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Block Perplexity
User-agent: PerplexityBot
Disallow: /
# Block Google's AI training (keeps standard Google Search intact)
User-agent: Google-Extended
Disallow: /
# Block Common Crawl (used by many open-source LLMs)
User-agent: CCBot
Disallow: /
AI Scraper vs. Search Engine User-Agents
| Bot Target | User-Agent String | Purpose | Should you block it? |
|---|---|---|---|
| OpenAI ChatGPT | GPTBot | LLM Training & Retrieval | Yes |
| Anthropic Claude | ClaudeBot | LLM Training | Yes |
| Google Search | Googlebot | Search Indexing | No |
| Google AI | Google-Extended | Gemini Training | Yes |
| Perplexity AI | PerplexityBot | RAG Search | Yes (unless you want citations) |
| Bing Search | Bingbot | Search Indexing | No |
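Keep in mind that robots.txt is purely advisory: a scraper can simply ignore it. As a backstop, you can reject known AI crawlers by User-Agent at the server or edge. A minimal sketch (the function names and the 403 choice are illustrative):

```typescript
// Tokens taken from the table above. Google-Extended is deliberately
// omitted: it is a robots.txt-only token, and Google still fetches
// pages with the regular Googlebot User-Agent.
const AI_BOT_TOKENS = [
  "GPTBot",
  "ClaudeBot",
  "anthropic-ai",
  "PerplexityBot",
  "CCBot",
];

// Case-insensitive substring match against the User-Agent header.
function isAiScraper(userAgentHeader: string): boolean {
  const ua = userAgentHeader.toLowerCase();
  return AI_BOT_TOKENS.some((token) => ua.includes(token.toLowerCase()));
}

// Decide on a status code for an incoming request.
function botResponseStatus(userAgentHeader: string): 200 | 403 {
  return isAiScraper(userAgentHeader) ? 403 : 200;
}

console.log(botResponseStatus("Mozilla/5.0 (compatible; GPTBot/1.2)")); // 403
console.log(botResponseStatus("Mozilla/5.0 (compatible; Googlebot/2.1)")); // 200
```

User-Agent strings can be spoofed, so this only stops honest-but-ignoring crawlers; stricter enforcement requires IP-range verification.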
What are the 3 most common robots.txt mistakes?
The three most common developer mistakes in robots.txt are blocking CSS/JS rendering assets, partially blocking API routes while exposing public endpoints, and accidentally overriding specific rules with broad wildcards.
1. Blocking rendering assets
In the era of Single Page Applications, Googlebot renders JavaScript to understand the DOM. If you block the directories containing your compiled assets, Googlebot sees a blank white page.
The Mistake:
User-agent: *
Disallow: /_next/
Disallow: /static/
The Fix:
Never block the build output directories. Next.js serves its compiled JavaScript and CSS from /_next/static/. If you must block data files, target them precisely.
User-agent: *
Disallow: /_next/data/
Allow: /_next/static/
2. Broad API blocking that kills dynamic sitemaps
Developers often block the entire /api/ directory to protect backend routes. But in Next.js, you might serve dynamic sitemaps or public RSS feeds from an API route.
The Mistake:
User-agent: *
Disallow: /api/
If your sitemap lives at /api/public/sitemap-generator, Googlebot will refuse to crawl it because the /api/ disallow rule intercepts the request.
The Fix: Use the longest-match rule to carve out an exception.
User-agent: *
Disallow: /api/
Allow: /api/public/
3. Trailing wildcard collisions
Adding a wildcard at the end of a path changes how the crawler interprets the URL boundary.
The Mistake:
User-agent: *
Disallow: /blog/*?
The developer intended to block query parameters on the blog (like /blog/post?theme=dark). Instead, this rule blocks every blog URL with a query string, including /blog/post?page=2, destroying standard pagination indexation.
The Fix: Target the specific query keys that cause duplicate content, rather than using a blanket wildcard for all parameters.
User-agent: *
Disallow: /blog/*?theme=
How do you validate robots.txt changes in CI?
Validate robots.txt changes in CI by running npx indxel check --ci to catch staging URL leaks, syntax errors, and disallowed critical paths before they merge to production.
Relying on Google Search Console's robots.txt tester is a fundamentally broken workflow for developers. GSC requires you to manually copy-paste text into a web UI, click a "Test" button, and manually input URLs to check. You cannot automate it. You cannot run it on a pull request.
Indxel replaces this manual chore with an automated CLI that runs in 1.2 seconds. It parses your robots.txt exactly how Googlebot does, checks it against your sitemap, and fails the CI build if it detects a critical error.
# Run validation locally
npx indxel check --target http://localhost:3000
# Output:
# ✖ Error: Sitemap URL is relative (/sitemap.xml). Must be absolute.
# ✖ Error: 'Allow: /' is overridden by 'Disallow: /' for User-agent: *
# ⚠ Warning: Disallow path '/_next/' blocks critical rendering assets.
Integrate this directly into your GitHub Actions workflow. Unlike manual testing, Indxel guards your repository at the pull request level: a staging environment leak (Disallow: /) fails the build immediately, preventing the merge.
name: SEO Infrastructure Guard
on: [pull_request]
jobs:
  validate-seo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Build Next.js app
        run: npm run build
      - name: Start production server
        run: npm run start &
      - name: Validate robots.txt and metadata
        run: npx indxel check --ci --target http://localhost:3000
When this action runs, Indxel applies 15 strict rules covering wildcard validity, rule precedence, asset blocking, and sitemap resolution. If 44/45 pages pass but one critical API route is exposed, the CLI outputs an ESLint-style error with the exact line number in your robots.ts file.
FAQ: Advanced robots.txt directives
Does the sitemap directive need a full URL?
Yes, the Sitemap directive in robots.txt requires an absolute, fully qualified URL, including the protocol and domain name, not a relative path.
Writing Sitemap: /sitemap.xml is invalid. Search engine crawlers require Sitemap: https://yourdomain.com/sitemap.xml. If you use relative paths, Googlebot will ignore the directive entirely, forcing you to manually submit the sitemap in Search Console.
Does Disallow prevent a page from being indexed?
No, Disallow only stops crawling. If a blocked page has external backlinks, Google can still index the URL without a meta description, displaying it in search results.
To completely remove a page from Google's index, you must allow crawling in robots.txt and serve an HTTP header or meta tag with noindex. If you block the page in robots.txt, Googlebot never sees the noindex tag, resulting in a ghost URL appearing in search results.
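In Next.js, one way to serve that header is via the headers() option in your config. A minimal sketch (the /drafts/ path is hypothetical; use whatever section you actually want de-indexed):

```typescript
// next.config.ts (sketch): attach a noindex header to a route group.
// The path must remain crawlable in robots.txt, or Googlebot will
// never fetch the page and never see this header.
const nextConfig = {
  async headers() {
    return [
      {
        source: "/drafts/:path*", // hypothetical section to de-index
        headers: [{ key: "X-Robots-Tag", value: "noindex, nofollow" }],
      },
    ];
  },
};

export default nextConfig;
```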
Are robots.txt rules case-sensitive?
Yes, path values in robots.txt are strictly case-sensitive, while the User-agent and directive names (Allow, Disallow) are case-insensitive.
A rule stating Disallow: /API/ will not block requests to /api/. When defining your paths, ensure they match the exact casing of your application's routing structure.
Stop merging broken directives that tank your organic traffic. Validate your SEO infrastructure the same way you validate your TypeScript code.
Run this in your terminal to catch errors before your next deployment:
npx indxel init
npx indxel check