
SEO for Documentation Sites: The Missing Guide

How to optimize documentation sites for search. Metadata templates, structured data, breadcrumbs, and auto-indexation for docs frameworks.

March 15, 2026 · 8 min read

You ship v2 of your API docs on a Friday. Monday morning, organic traffic to your reference pages drops 40%. The culprits: the documentation framework generated duplicate content for v1 and v2, diluting search rankings, and 45 pages lost their meta descriptions during the Markdown migration.

Documentation is your product's front door. If developers can't search for "Auth endpoint [Your Product]" and land on the exact code snippet they need, they use something else.

SEO for documentation sites requires a different architecture than marketing pages. You deal with thousands of auto-generated pages, deep nesting, versioning conflicts, and Markdown-driven metadata.

Why do documentation sites fail at SEO?

Documentation sites fail at SEO because they rely heavily on automated page generation, duplicate content accumulates across versions, and deep API references lack granular metadata overrides.

When you feed a folder of Markdown files into a static site generator, you lose the manual control you have in a standard Next.js page.tsx file. Common failure modes include:

  • Title tag collision: 50 pages titled "Overview" because the framework uses the first # H1 as the meta title.
  • Version duplication: /docs/v1/auth and /docs/v2/auth serve 95% identical content. Google treats this as duplicate content and arbitrarily picks one to drop from the index.
  • Missing Open Graph images: Links shared in Slack unfurl into blank grey boxes because dynamic og:image generation wasn't configured for Markdown routes.
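The first failure mode is easy to catch mechanically. A minimal sketch (the pages array shape is an assumption; populate it from your build manifest):

```typescript
// Detect title-tag collisions across a built page map.
function findTitleCollisions(
  pages: { path: string; title: string }[],
): Map<string, string[]> {
  const byTitle = new Map<string, string[]>();
  for (const { path, title } of pages) {
    const existing = byTitle.get(title) ?? [];
    byTitle.set(title, existing.concat(path));
  }
  // Keep only titles shared by more than one page
  return new Map(Array.from(byTitle).filter(([, paths]) => paths.length > 1));
}
```

Run it over your build output in CI and fail the build when the returned map is non-empty.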

How do Docusaurus, Nextra, and Starlight compare for SEO?

Starlight provides the most robust built-in SEO defaults, Docusaurus requires manual plugin configuration for versioning, and Nextra relies heavily on custom Next.js configurations to achieve feature parity.

If you are evaluating frameworks based strictly on how much SEO infrastructure you have to build yourself, here is the breakdown:

| Feature | Docusaurus (React) | Nextra (Next.js) | Starlight (Astro) |
| --- | --- | --- | --- |
| Canonical Versioning | Manual via @docusaurus/plugin-content-docs | Custom Next.js middleware required | Built-in frontmatter config |
| JSON-LD Breadcrumbs | Built-in | Requires custom <Head> component | Built-in |
| Meta Override Strategy | Frontmatter array | useConfig() hook | Frontmatter object |
| Sitemap Generation | @docusaurus/plugin-sitemap | Next.js sitemap.ts | @astrojs/sitemap |

Starlight wins on developer experience for SEO. Nextra offers the highest ceiling for customization because you have access to the underlying Next.js App Router.

How to configure metadata templates across docs frameworks?

Configure metadata templates by defining a base configuration in your framework's global config file and overriding specific fields using Markdown frontmatter for individual pages.

Hardcoding <title> tags in every Markdown file is brittle. You need a cascading metadata system. The global config sets the template (%s | Acme Docs), the folder sets the section (API Reference), and the file sets the specific page (Authentication).
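The cascade itself is just a precedence merge. A sketch with illustrative layer names, not a real framework API:

```typescript
// Illustrative layer shape; every framework spells this differently.
interface MetaLayer {
  template?: string; // e.g. '%s | Acme Docs', from the global config
  title?: string;    // section name at folder level, page name at file level
}

function resolveTitle(global: MetaLayer, folder: MetaLayer, page: MetaLayer): string {
  // More specific layers override less specific ones
  const merged = { ...global, ...folder, ...page };
  return (merged.template ?? '%s').replace('%s', merged.title ?? 'Docs');
}
```

The page frontmatter wins when present; otherwise the folder's section title fills the template.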

Nextra Metadata Configuration

Nextra uses the theme.config.tsx file to inject global <head> tags. To create dynamic titles based on Markdown frontmatter, use the useConfig hook.

// theme.config.tsx
import { useConfig } from 'nextra-theme-docs'
 
export default {
  logo: <span>Acme Docs</span>,
  project: {
    link: 'https://github.com/acme/docs',
  },
  head: () => {
    const { frontMatter, title } = useConfig()
    const metaTitle = title ? `${title} | Acme Docs` : 'Acme Documentation'
    const metaDescription = frontMatter.description || 'Official documentation for Acme API.'
 
    return (
      <>
        <title>{metaTitle}</title>
        <meta name="description" content={metaDescription} />
        <meta property="og:title" content={metaTitle} />
        <meta property="og:description" content={metaDescription} />
        <meta property="og:image" content={`https://acme.com/api/og?title=${encodeURIComponent(title)}`} />
      </>
    )
  }
}

Keep documentation meta titles between 50 and 60 characters. Google truncates titles at roughly 600 pixels of rendered width. If your template produces Authentication Overview | Acme Enterprise API Documentation, the title will be cut off before the brand name appears.
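A character count is only a rough proxy for the pixel cutoff, but it is cheap to enforce. A sketch, with the template string as an assumed default:

```typescript
// A 60-character budget approximates Google's ~600px title truncation.
function fitsTitleBudget(pageTitle: string, template = '%s | Acme Docs', max = 60): boolean {
  return template.replace('%s', pageTitle).length <= max;
}
```

Wire this into the same head function that builds metaTitle and log a warning during the build when it returns false.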

Starlight Metadata Configuration

Starlight maps standard Astro frontmatter directly to SEO tags. You define base attributes in astro.config.mjs and override them in the .mdx files.

// astro.config.mjs
import { defineConfig } from 'astro/config';
import starlight from '@astrojs/starlight';
 
export default defineConfig({
  integrations: [
    starlight({
      title: 'Acme Docs',
      customCss: ['./src/tailwind.css'],
      head: [
        {
          tag: 'meta',
          attrs: { property: 'og:site_name', content: 'Acme Developer Hub' },
        },
      ],
    }),
  ],
});

In your Markdown files, the frontmatter dictates the specific tags:

---
title: Authentication
description: Learn how to authenticate with the Acme API using OAuth 2.0.
head:
  - tag: meta
    attrs:
      name: robots
      content: index, follow
---

How to handle canonical URLs for versioned documentation?

Handle versioned documentation by setting the canonical URL of older versions to point to the corresponding page in the latest active version of your documentation.

When you release v2 of your API, you maintain v1 docs for backwards compatibility. If both /v1/endpoints/users and /v2/endpoints/users exist, search engines see duplicate content. You must tell the crawler that v2 is the authoritative source.
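The mapping from an old-version path to its canonical URL is mechanical. A sketch assuming the docs.acme.com layout used throughout this post; a real implementation should verify the v2 twin actually exists before emitting the tag:

```typescript
// Rewrite any /docs/vN/ prefix to the latest version's prefix.
function canonicalFor(path: string, latest = 'v2'): string {
  return `https://docs.acme.com${path.replace(/^\/docs\/v\d+\//, `/docs/${latest}/`)}`;
}
```

Pages already on the latest version pass through unchanged, so the same helper can emit self-referencing canonicals.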

Docusaurus handles the versioning side natively once you configure the docs plugin; you still want older versions to explicitly point their canonical URLs at the current version.

// docusaurus.config.js
module.exports = {
  title: 'Acme Docs',
  url: 'https://docs.acme.com',
  baseUrl: '/',
  presets: [
    [
      '@docusaurus/preset-classic',
      {
        docs: {
          sidebarPath: require.resolve('./sidebars.js'),
          // editCurrentVersion points "Edit this page" links at the
          // current version's source files.
          editCurrentVersion: true,
          versions: {
            current: {
              label: 'v2.0.0 (Latest)',
              path: 'v2',
            },
            '1.0.0': {
              label: 'v1.0.0',
              path: 'v1',
              banner: 'unmaintained',
            },
          },
        },
      },
    ],
  ],
};

If a page existed in v1 but has no counterpart in v2, the canonical tag should self-reference the v1 page, and you should add a <meta name="robots" content="noindex"> tag to remove it from search results entirely. Deprecated endpoints shouldn't rank on Google.
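That decision tree fits in a small helper. A sketch with hypothetical names; existsInLatest would come from your build manifest:

```typescript
interface Policy {
  robots: string;
  canonicalVersion?: string; // set when the canonical should point at another version
}

// Decide robots/canonical behavior for a versioned page.
function robotsPolicy(version: string, latest: string, existsInLatest: boolean): Policy {
  if (version === latest) return { robots: 'index, follow' };
  if (existsInLatest) return { robots: 'index, follow', canonicalVersion: latest };
  // Deprecated page with no twin in the latest version: self-canonical + noindex
  return { robots: 'noindex, follow' };
}
```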

How to implement Breadcrumb structured data for nested docs?

Implement Breadcrumb structured data by injecting a BreadcrumbList JSON-LD script tag into the <head> of your documentation layout component to reflect the folder structure.

Documentation is deeply nested. A path like /docs/api/v2/authentication/oauth is hard for crawlers to contextualize without explicit structured data. Breadcrumbs tell Google exactly where a page sits in the hierarchy, transforming the search result URL from a raw string into a formatted, clickable breadcrumb trail.

Here is the exact JSON-LD payload you need to generate per page:

{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "name": "Documentation",
      "item": "https://docs.acme.com/"
    },
    {
      "@type": "ListItem",
      "position": 2,
      "name": "API Reference",
      "item": "https://docs.acme.com/api"
    },
    {
      "@type": "ListItem",
      "position": 3,
      "name": "Authentication",
      "item": "https://docs.acme.com/api/authentication"
    }
  ]
}

In Next.js (Nextra), you inject this by extending the head function in your theme config, mapping over the current route path to generate the itemListElement array dynamically.
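A sketch of that mapping, deriving segment names naively from slugs (a real implementation would read titles from the page map):

```typescript
// Build a BreadcrumbList payload from a route path.
function breadcrumbJsonLd(path: string, origin = 'https://docs.acme.com') {
  const segments = path.split('/').filter(Boolean);
  return {
    '@context': 'https://schema.org',
    '@type': 'BreadcrumbList',
    itemListElement: segments.map((seg, i) => ({
      '@type': 'ListItem',
      position: i + 1,
      // Naive label: capitalize the slug
      name: seg.charAt(0).toUpperCase() + seg.slice(1),
      item: `${origin}/${segments.slice(0, i + 1).join('/')}`,
    })),
  };
}
```

Serialize the result with JSON.stringify into a `<script type="application/ld+json">` tag inside the head function.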

Do not use empty URLs in your JSON-LD. If a parent folder like /docs/api does not have an index.md file and returns a 404, omit it from the breadcrumb list. Google Search Console will flag empty item fields as critical errors and invalidate the entire breadcrumb trail.

How to control LLM crawlers with robots.txt?

Control LLM crawlers by explicitly allowing or disallowing user agents like GPTBot, ClaudeBot, and CCBot in your robots.txt file at the root of your documentation site.

Documentation is high-signal training data for Large Language Models. If you build a public API, you want OpenAI and Anthropic to ingest your docs so that developers using ChatGPT get accurate code snippets for your SDK.

If you are hosting internal docs or proprietary algorithms, you must block them.

To explicitly allow LLMs to train on your documentation (recommended for public DevTools):

# /public/robots.txt
User-agent: *
Allow: /
 
# Explicitly allow AI crawlers
User-agent: GPTBot
Allow: /
 
User-agent: ClaudeBot
Allow: /
 
User-agent: CCBot
Allow: /

To block LLMs from ingesting your documentation:

# /public/robots.txt
User-agent: *
Allow: /
 
# Block AI crawlers from scraping proprietary docs
User-agent: GPTBot
Disallow: /
 
User-agent: ClaudeBot
Disallow: /
 
User-agent: CCBot
Disallow: /
 
User-agent: anthropic-ai
Disallow: /
 
User-agent: Google-Extended
Disallow: /

How to validate documentation SEO in CI/CD?

Validate documentation SEO by running npx indxel check --ci in your GitHub Actions workflow to block pull requests that break canonicals, remove descriptions, or output invalid JSON-LD.

When dealing with hundreds of Markdown files, manual QA is impossible. A developer updates a frontmatter block, accidentally deletes the description field, and ships it. You won't notice until traffic drops.

Indxel runs 15 specific rules covering title length, description presence, og:image HTTP status, canonical URL resolution, and JSON-LD validity directly against your build output.

Checking a 450-page Nextra site takes 2.4 seconds. That adds almost zero overhead to your build while guarding against critical regressions.

Add the validation step to your CI pipeline after the build step.

# .github/workflows/docs-seo.yml
name: Docs SEO Validation
on: [pull_request]
 
jobs:
  validate-seo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          
      - name: Install dependencies
        run: npm ci
        
      - name: Build docs
        run: npm run build
        
      # Run Indxel against the static output directory
      - name: Validate SEO
        run: npx indxel check ./out --ci --diff

The CLI outputs warnings in the same format as ESLint — one line per issue, with the file path and rule ID. If an issue violates a critical rule, the process exits with code 1, failing the build.

$ npx indxel check ./out --ci --diff
 
Validating 450 pages...
 
✖ 3 critical errors found:
 
/out/docs/v2/auth.html
  Error: Missing meta description (rule: meta-description-presence)
 
/out/docs/v1/endpoints.html
  Error: Canonical URL points to 404 page /docs/v2/endpoints (rule: valid-canonical)
 
/out/docs/guides/quickstart.html
  Error: JSON-LD missing required 'item' field in BreadcrumbList (rule: valid-jsonld)
 
Score: 98/100
447/450 pages pass.
Build failed due to critical errors.

Frequently Asked Questions

Should I use subdomains or subdirectories for documentation?

Use subdirectories (/docs) instead of subdomains (docs.domain.com) because subdirectories share domain authority directly with your main marketing site. Search engines treat subdomains as separate entities, requiring you to build authority from scratch for your documentation.

Does client-side routing hurt documentation SEO?

Client-side routing hurts SEO only if the framework fails to pre-render static HTML for the initial load. Docusaurus, Nextra, and Starlight all generate static HTML at build time, meaning search engine crawlers see the full content immediately without executing JavaScript.
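You can verify this without a crawler by checking the built HTML directly. A sketch; the regex checks are a rough heuristic, not a full HTML parser:

```typescript
// True when the static HTML already contains the tags crawlers need,
// i.e. no JavaScript execution is required to see them.
function hasPrerenderedMeta(html: string): boolean {
  return /<meta\s+name="description"/i.test(html) && /<title>[^<]+<\/title>/i.test(html);
}
```

Read each file in your build output directory and flag pages where this returns false.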

How do I fix "Crawled - currently not indexed" for documentation pages?

Fix "Crawled - currently not indexed" errors by improving internal linking and ensuring the page has unique content. Auto-generated API reference pages often trigger this error because they lack sufficient text density or are orphaned without links from higher-level guide pages.


Catch frontmatter typos and broken canonicals before they merge to main.

npx indxel init