The most attentive reader of your documentation right now is probably not a human. It's a bot. Several bots, actually, hitting your pages on a steady cadence, fetching markdown mirrors, sometimes following links into corners of the site nobody on your team has touched in months. Most teams have no idea this is happening. Their analytics show a polite trickle of human sessions and a slow climb in organic traffic, and the AI side of the audience is invisible because the tools they use to measure docs were built for an era when readers had eyeballs.
That blind spot has a cost. AI agents read your docs to answer the questions developers used to ask in your support inbox. Their answers ship straight into a developer's editor as code that calls your API. When the bot can't find the right page, can't parse the chunk it pulled, or hits a 404 on the link you renamed last quarter, you don't get a complaint email. You get a bad integration written in someone else's repo, and you find out about it weeks later when the support ticket finally lands.
This post is a working guide to closing that blind spot: which bots to expect, how to spot them in your logs, what good looks like, and what to fix once you start measuring.
The bots that actually matter
Most lists you'll find on the web mix three different things into one bucket: training crawlers, retrieval-time fetchers, and search indexers. Treat them as the same and you'll make the wrong calls about access, caching, and which traffic is worth optimising for.
The honest taxonomy has three buckets.
Training crawlers are the bots that scrape the open web to assemble training corpora. Hits from them don't translate to a citation tomorrow; they translate to whatever the next model release happens to know about your product. The list to know: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's AI-training opt-out signal, which respects your robots.txt separately from regular Googlebot), Applebot-Extended (Apple's equivalent), CCBot (Common Crawl, which feeds a huge number of open and closed models), Bytespider (ByteDance), cohere-ai (Cohere), and Meta-ExternalAgent (Meta's training fetcher).
Retrieval-time fetchers only fire when a real human prompts an assistant and the assistant decides your page is relevant to the answer. The hit is a strong signal: someone, somewhere, just asked a question that your documentation is plausibly going to answer. Watch for ChatGPT-User (OpenAI's live-retrieval fetcher), Claude-User (Anthropic's user-initiated fetcher), and Perplexity-User (Perplexity's equivalent). These are the bots whose graphs you should care most about.
Sitting between the two are the search-flavoured crawlers that maintain an index used to answer queries in near-real-time. OAI-SearchBot (OpenAI's search index), Claude-SearchBot (Anthropic's), PerplexityBot (Perplexity's index crawler), and Bingbot (which powers a non-trivial share of ChatGPT's web answers) all sit in this bucket. Their behaviour is closer to a traditional search engine — periodic, broad, link-following — but the destination of the index is an answer, not a results page.
Two more categories deserve a mention even though they're harder to track. Agent-IDE traffic — a developer running Claude Code, Cursor, Windsurf, or a self-hosted agent that fetches your page through a browser-style request — usually arrives with a generic user-agent or one that's a thin wrapper around a popular HTTP library. You can sometimes spot it by the access pattern (one user, dozens of pages, all in a minute, all related to a single feature), but you can't always isolate it cleanly. MCP traffic is the cleanest of all: if your documentation exposes a Model Context Protocol server, every agent talking to it announces itself through the protocol, with the user identifying themselves on the way in.
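You can still get a rough read from the raw access logs covered in the next section. A hedged heuristic, assuming a combined-format log: flag any client address that pulls an unusual number of distinct docs pages inside a single minute. The /docs/ prefix and the threshold of ten are placeholders to tune for your own site.

# Flag IPs that fetched 10+ distinct /docs/ pages within a single minute --
# rarely a human reader, often an agent or IDE working through a feature.
awk '$7 ~ /^\/docs\// {
  minute = substr($4, 2, 17)            # e.g. 09/May/2026:08:14
  key = $1 " " minute
  if (!seen[key, $7]++) count[key]++
}
END { for (k in count) if (count[k] >= 10) print count[k], k }' access.log | sort -rn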
Where the data actually lives
If you only look at one source, look at your raw HTTP access logs. Every other tool is a derivative.
The user-agent header is the field you care about. A line like this in an access log is what an OpenAI training crawl looks like:
20.171.207.42 - - [09/May/2026:08:14:22 +0000] "GET /docs/webhooks HTTP/2.0" 200 14823 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"

A handful of grep commands will get you most of the way to a working dashboard:
# Daily count per AI bot, sorted by frequency
grep -oE '(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|Applebot-Extended|CCBot|Bytespider|cohere-ai|Meta-ExternalAgent)' access.log \
| sort | uniq -c | sort -rn
# 404s the bots are hitting (these are your broken citations)
grep -E 'GPTBot|ClaudeBot|PerplexityBot|ChatGPT-User|Claude-User|Perplexity-User' access.log \
| awk '$9 == 404 {print $7}' | sort | uniq -c | sort -rn
# Top pages each retrieval-time bot is fetching today
grep 'ChatGPT-User\|Claude-User\|Perplexity-User' access.log \
| awk '{print $7}' | sort | uniq -c | sort -rn | head -20

If your docs run on a managed host, the same data is usually a couple of clicks away. Vercel surfaces every request through its log drains and HTTP analytics, with the user-agent string preserved verbatim — pipe those drains into Datadog, BetterStack, Logflare, or a Grafana stack of your choice and you have a real dashboard within an afternoon. Cloudflare exposes the same telemetry through Logpush, and its native bot-management view already groups the major AI bots into a verified-bot category. Netlify, Render, and Fly all expose the user-agent in their request logs.
Two practical notes. First, verify the bots that claim to be GPTBot or ClaudeBot. Each major operator publishes the IP ranges its bots fetch from — OpenAI at openai.com/gptbot.json, Anthropic at anthropic.com/claudebot.json, Perplexity at perplexity.com/perplexitybot.json — and a user-agent string on its own is trivial to spoof. Second, your logs will not give you citations, only fetches. We'll come back to citations at the end of the post.
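As a sketch of that verification step, assuming the published file exposes a prefixes array of ipv4Prefix entries (inspect the real file before relying on the field names) and that jq and grepcidr are installed; the process substitution needs bash or zsh:

# Pull OpenAI's published ranges and test a suspect address against them.
curl -s https://openai.com/gptbot.json \
  | jq -r '.prefixes[] | .ipv4Prefix // empty' > gptbot-ranges.txt
grepcidr -f gptbot-ranges.txt <(echo "20.171.207.42") \
  && echo "inside published GPTBot ranges" \
  || echo "not in any published range -- treat as spoofed"

The same pattern works against the other operators' files; only the URL and, possibly, the field names change.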
What good looks like
Once the data is in front of you, the obvious next question is whether the picture you're seeing is healthy. There's no universal answer, but the shape of a healthy AI-bot footprint on a developer documentation site has a few consistent features:
Crawl coverage matches your sitemap. Training crawlers and search indexers should be hitting close to every URL listed in your sitemap over a four-to-eight-week window. If they're stopping after the homepage and a couple of top-level pages, the rest of your docs are functionally invisible to anything those crawlers feed. A rough log-based check for this and two of the signals below is sketched at the end of this section.
Retrieval fetches concentrate on a small head of pages. Live retrieval traffic from ChatGPT-User, Claude-User, and Perplexity-User should look like a power-law curve: a small number of high-value pages getting hit dozens of times a day, a long tail getting hit occasionally. If everything is flat, you don't have enough useful pages — or the useful ones aren't structured so retrieval can find them.
404 rate per bot is near zero. Every 404 a bot hits is a citation that won't happen, or a citation that will happen with a stale or hallucinated URL. The retrieval-time bots are particularly punishing here, because the broken link they hit was already chosen by a model that decided your site was the right answer.
Markdown-mirror traffic exceeds rendered-HTML traffic for retrieval-time fetchers. If your docs ship per-page .md mirrors (/docs/webhooks.md next to /docs/webhooks), you should see retrieval-time bots preferring the markdown URL. They get cleaner content, you get cleaner attribution, and the conversion of a fetch into a useful chunk is far higher.
llms.txt and llms-full.txt see steady, measurable traffic. A site that publishes them and never gets a hit on them is a site whose AI optimisation is theatre. You should see a low but consistent stream of fetches from the major AI user-agents to both files. If you don't, either the files aren't linked from where bots can find them or the bots haven't picked them up yet — and the fix is usually as simple as referencing them from robots.txt and your sitemap.
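The first, fourth, and fifth of those signals can be sanity-checked straight from the sitemap and the same access log. A rough sketch, with example.com standing in for your docs host; expect some noise from trailing slashes and query strings:

# 1) Crawl coverage: sitemap URLs that no training crawler or indexer has fetched
curl -s https://example.com/sitemap.xml \
  | grep -oE '<loc>[^<]+</loc>' \
  | sed -e 's|<loc>||' -e 's|</loc>||' -e 's|https://example.com||' \
  | sort -u > sitemap-paths.txt
grep -E 'GPTBot|ClaudeBot|OAI-SearchBot|Claude-SearchBot|PerplexityBot|Bingbot' access.log \
  | awk '{print $7}' | sort -u > crawled-paths.txt
comm -23 sitemap-paths.txt crawled-paths.txt   # listed in the sitemap, never crawled

# 2) Markdown mirrors versus rendered HTML for the retrieval-time fetchers
grep -E 'ChatGPT-User|Claude-User|Perplexity-User' access.log \
  | awk 'BEGIN { md = 0; html = 0 }
         { if ($7 ~ /\.md$/) md++; else html++ }
         END { printf "markdown %d   rendered html %d\n", md, html }'

# 3) Is anything actually fetching llms.txt and llms-full.txt?
grep -E '/llms(-full)?\.txt' access.log \
  | grep -E 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User' \
  | awk '{print $7}' | sort | uniq -c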
What to do when the picture is wrong
Most measurement projects collapse if there's no obvious next move. Five recurring failure patterns and the fixes for each:
The bots are hitting your docs but the content they pull is empty. This usually means your docs render heavily on the client. A bot that doesn't run JavaScript fetches the HTML shell, sees nothing, and walks away. The fix is server-side rendering or static generation for documentation pages — and a quick curl from the command line to confirm the rendered HTML actually contains the prose, not a <div id="root"></div>.
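A minimal version of that check, with the URL and the phrase standing in for a page and a sentence you know should appear on it:

# If this prints 0, a bot without a JavaScript runtime is seeing an empty shell.
curl -s https://example.com/docs/webhooks | grep -c 'signing secret'
# Crude but useful: strip the tags and eyeball how much real prose survives.
curl -s https://example.com/docs/webhooks | sed 's/<[^>]*>//g' | head -40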
Retrieval-time bots are hitting 404s on links your sitemap still claims exist. A page got renamed without a redirect, a slug changed during a content reorg, or a draft URL leaked into a sitemap and isn't published. Ship a permanent redirect for everything that moved, prune broken entries from the sitemap, and accept that LLMs cache URLs for longer than feels reasonable. A 301 today is still useful in three months when a model trained nine months ago tries to cite the old URL.
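One way to keep yourself honest about the redirects, assuming you keep a list of the old paths (moved-urls.txt here is hypothetical):

# Confirm every moved URL answers with a permanent redirect to its new home.
while read -r path; do
  printf '%s -> ' "$path"
  curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' "https://example.com$path"
done < moved-urls.txt
# Anything printing 404, or a 302 where you meant a 301, still needs a redirect rule.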
Training crawlers are crawling, but no retrieval-time bots ever show up. This is the quiet failure mode of a docs site that's technically correct but nobody is asking questions about. The fix isn't an SEO fix; it's a positioning fix. The page needs to be the answer to a question developers actually type into an assistant, and the title and opening paragraph need to make that obvious. A page titled "Authentication" is invisible; one titled "How to authenticate API requests with a personal access token" is the chunk a model picks.
One specific bot is missing entirely. ClaudeBot fetches every page and PerplexityBot never does, or vice versa. Check robots.txt for an entry that's blocking the bot you're missing. Many CMS templates and CDN defaults still ship blanket Disallow rules under User-agent: * that Anthropic, Perplexity, and OpenAI dutifully respect. If you want the traffic, you have to allow it.
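A quick way to see what your robots.txt is actually telling these bots, assuming the usual blank-line-separated rule groups:

# Print only the groups that mention an AI bot or the wildcard agent.
curl -s https://example.com/robots.txt \
  | awk 'BEGIN { RS = ""; ORS = "\n\n" }
         /GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|User-agent: \*/'

Any Disallow: / sitting under one of those groups is the block you're looking for.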
Bots are scraping the rendered HTML when you've published markdown mirrors. This is a discoverability problem, not a bot problem. Make sure each rendered page links to its markdown counterpart in a way a crawler can follow, list the mirrors in llms.txt, and reference llms.txt itself from robots.txt. Once the markdown surface is reachable, the better-behaved bots will prefer it.
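To confirm the mirrors are reachable at all, a loop over the sitemap-paths.txt list from the earlier sketch works as a rough check, assuming the .md-next-to-the-page convention described above:

# Flag rendered pages whose markdown mirror doesn't answer with a 200.
while read -r path; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://example.com$path.md")
  [ "$code" = "200" ] || echo "missing mirror: $path.md ($code)"
done < sitemap-paths.txt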
The harder, more important question: citations
Crawls are the easy half. The harder question is whether all that traffic actually leads to your product showing up in a citation when a developer asks an assistant about a problem your docs solve.
This part is genuinely difficult to measure, and the honest answer is that the existing tooling is young. A few practical approaches that hold up:
Spot-check by hand on a fixed cadence. Pick ten developer-shaped questions your docs are the right answer for. Ask them in ChatGPT with web search on, in Claude with web search enabled, and in Perplexity. Note who cites you, who cites a competitor, and who hallucinates. Repeat weekly. The trend over time matters more than any single result.
Watch your inbound referrers from AI products. ChatGPT, Claude, and Perplexity all send a recognisable referer when a user clicks a citation. Those clicks are a small fraction of traffic, but they're a clean signal: every one is a developer who already trusts you enough to leave the assistant's answer.
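One line against the same access log pulls those clicks out, grouped by the page they land on. Field $11 is the quoted referer in a combined-format log, and the hostnames below are the ones these products commonly send; verify them against your own traffic before trusting the numbers:

# Citation clicks arriving from AI assistants, by landing page.
awk '$11 ~ /chatgpt\.com|claude\.ai|perplexity\.ai/ { print $7 }' access.log \
  | sort | uniq -c | sort -rn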
Listen for citations indirectly. A new wave of tools — Profound, Peec, Am I Cited, AthenaHQ — index AI-assistant answers across a corpus of queries and tell you when your domain shows up. They're imperfect, they're early, and they're worth the modest subscription fee if your docs are commercially load-bearing.
Read your own docs through the AI's eyes. Open the page in a fresh chat with no context, ask the assistant to summarise it, then ask the question your customer would have asked. Notice what the model gets wrong, where it hedges, where it invents detail you never wrote. Each of those is a chunk you should rewrite.
A workable rhythm
The teams who get this right don't run a one-off audit. They run a tight, boring rhythm: a weekly digest of bot traffic, a monthly review of 404s the bots are hitting, a quarterly spot-check of citations against a fixed prompt list. Each pass takes an hour. Each pass surfaces one or two specific things to fix. The compounding effect over a year is the difference between a docs site that quietly disappears from the AI-assistant ecosystem and one that becomes the default citation for its corner of the developer world.
If your docs run on Doccupine, the structural pieces — sitemap, robots.txt, llms.txt, llms-full.txt, per-page markdown mirrors, MCP server, server-side rendering — are wired up by default. The instrumentation work is yours, but the substrate the bots prefer is already in place. If you're on a different stack, the rhythm above still works; you'll spend a little more time on the substrate and the rest plays out the same way.
If you're already running this kind of measurement and seeing something I haven't covered, send the surprising stuff to [email protected]. The unexpected traffic shapes are usually where the next post starts.