Crawler Accessibility · June 12, 2026

llms.txt at 18 Months: 7.4% of Fortune 500 Have It, AI Crawlers Rarely Fetch It

After 18 months of industry conversation, only 7.4% of Fortune 500 companies have published llms.txt — and the AI crawlers it targets rarely fetch the file at all.

More than nine in ten Fortune 500 companies have no llms.txt file. That statistic, from ProGEO.ai's March 2026 audit of all 500 companies, shows the gap between the volume of content written about AI-crawler optimization and the actual state of implementation: after eighteen months of industry conversation, the file designed to help AI systems navigate a site is present at just 7.4% of the largest US companies, and where it is present, AI crawlers rarely fetch it.

Method

Three datasets inform the adoption analysis. ProGEO.ai published a March 2026 audit measuring three content-discovery protocols across all 500 Fortune 500 domains. SE Ranking surveyed 300,042 domains in early 2026 for broad-web adoption rates. Similarweb studied the 50 domains most frequently cited in AI assistant responses. Crawler-behavior data comes separately from published bot-traffic reports covering hundreds of millions of observed HTTP requests.

Adoption Is Thin—and Enterprise Lags the Broader Web

The ProGEO dataset measured three signals: robots.txt, JSON-LD schema markup, and llms.txt. The gaps are stark.

Fortune 500 Content Discovery Protocol Adoption, 2026

Percentage of Fortune 500 companies with each protocol deployed; robots.txt is near-universal, llms.txt at 7.4%

Source: ProGEO.ai Fortune 500 Audit, March 2026

Robots.txt reached near-universal deployment decades ago; 92.8% of the Fortune 500 have one. JSON-LD, far younger and requiring active maintenance, sits at 53.8%. llms.txt, designed specifically for the AI-search era, comes in at 7.4%—37 companies out of 500.

The broader web tells a counterintuitive story. SE Ranking's survey of 300,042 domains found llms.txt adoption at 10.13%, nearly three points above the Fortune 500's 7.4%. Technically oriented teams building and shipping their own tooling adopted the format faster than enterprise marketing and IT departments could build consensus around it. Sectors with legal or compliance constraints (financial services, healthcare, law) show publication rates under 10% even among their top-100 domains. Developer tooling and SaaS companies lead by a wide margin, with adoption described as routine in those segments by Q1 2026.

The Fortune 500 trails because the file's early adopters skewed heavily toward small, technical teams. The largest brands, which move through slower approval cycles, have not caught up.

robots.txt Is Present but Mostly AI-Silent

Having robots.txt is table stakes. The actionable question is whether it says anything about AI crawlers. Among the 92.8% of Fortune 500 companies with robots.txt, only 11% explicitly name an AI user agent. That means roughly 82% of the Fortune 500 have robots.txt files that give AI crawlers no explicit instruction—neither permission nor restriction. The crawler operates on whatever defaults the operator chose, which in most cases means crawling is unrestricted.
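The "roughly 82%" figure can be checked with a few lines of arithmetic. The sketch below uses only the two percentages quoted above; the two readings of the 11% figure (a share of companies that have robots.txt, or a share of all 500 companies) are assumptions, since the audit's exact base is not restated here. Both readings land close to 82%.

```python
# Shares quoted in the article; the two readings of "11%" are assumptions.
with_robots = 0.928     # Fortune 500 companies that have a robots.txt
names_ai_agent = 0.11   # share that explicitly names an AI user agent

# Reading A: 11% of the companies that have robots.txt
silent_a = with_robots * (1 - names_ai_agent)

# Reading B: 11% of all 500 companies
silent_b = with_robots - names_ai_agent

print(f"Reading A: {silent_a:.1%} have an AI-silent robots.txt")  # 82.6%
print(f"Reading B: {silent_b:.1%} have an AI-silent robots.txt")  # 81.8%
```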

Most-Blocked AI Crawlers in robots.txt DISALLOW Rules, Q1 2026

Share of DISALLOW entries in a large Q1 2026 robots.txt corpus naming each AI crawler

Source: Q1 2026 robots.txt Corpus Analysis

The crawlers that do appear in DISALLOW rules follow a predictable pattern. GPTBot leads at 5.52% of DISALLOW entries in a large-scale Q1 2026 robots.txt corpus, followed by CCBot, ClaudeBot, Google-Extended, and Bytespider. These are block rates, not crawl rates: a crawler that appears frequently in DISALLOW rules is being actively excluded by many sites, not blocked everywhere.

This matters because robots.txt is currently the only crawler-control mechanism with consistent enforcement across the major AI crawler operators. The leading bot operators all publish explicit user-agent strings, document their crawler behavior, and follow robots.txt directives in practice. llms.txt has no comparable enforcement commitment from any major AI provider as of mid-2026.

AI Crawlers Do Not Consistently Fetch llms.txt

The most consequential finding is not adoption—it is bot behavior. Across 515 million analyzed LLM-bot traffic events, the share of requests to /llms.txt is statistically negligible. The dominant AI crawlers overwhelmingly skip the file and crawl HTML directly.

The reason is architectural. llms.txt was designed as an inference-time resource: when a user is already in a conversation with an AI assistant and the assistant retrieves your site for in-session context, a well-structured llms.txt can help the model navigate your content hierarchy. That use case is real but narrow. It is not a pre-crawl signal that shapes which pages get indexed or prioritized during training-data or search-indexing crawls—the processes that determine AI-visibility at scale.

The Similarweb study of the 50 domains most frequently cited by leading AI assistants found exactly one, Target.com, with a published /llms.txt file. The other 49 either have no llms.txt or were not identified as having one. Their prominence in AI-generated answers derives from HTML structure, schema markup, inbound links, and content authority: signals AI crawlers already consume without any additional protocol.

No Standards Body, No Enforcement, No Consistent Format

There is no standards body behind llms.txt. The specification—a community proposal from late 2024—has not been adopted by W3C, IETF, or any major AI provider. No major AI crawler operator has publicly committed to using llms.txt as a signal in their production indexing or answer-generation surfaces. The file is a voluntary community convention, not a protocol.

The result is format inconsistency in practice. Files in the wild range from a single Markdown-formatted link index to extensive per-section summaries with external references. The specification recommends a summary section, an optional full version at /llms-full.txt, and curated content links—but there is no validation mechanism and no enforcement. Among the 37 Fortune 500 sites with llms.txt, content ranges from detailed technical documentation inventories to minimal marketing-page listings. No published research has demonstrated that a specific llms.txt format produces measurably different crawl or citation outcomes.
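Because no validator exists, teams that want some consistency have to write their own checks. The following is a minimal sketch of such a check, assuming the loosely shared shape seen in published files: a Markdown H1 title on the first line, a blockquote summary, H2 sections, and Markdown links. The function name `lint_llms_txt` and the four rules are illustrative assumptions, not an official validator.

```python
import re

# Matches a Markdown inline link such as [Docs](https://example.com/docs)
LINK = re.compile(r"\[[^\]]+\]\([^)\s]+\)")

def lint_llms_txt(text):
    """Return a list of structural problems found in an llms.txt body.

    The rules are assumptions for illustration, not a published standard.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    problems = []
    if not lines or not lines[0].startswith("# "):
        problems.append("missing H1 title on first line")
    if not any(ln.startswith("> ") for ln in lines):
        problems.append("missing blockquote summary")
    if not any(ln.startswith("## ") for ln in lines):
        problems.append("missing H2 section")
    if not any(LINK.search(ln) for ln in lines):
        problems.append("no Markdown links")
    return problems

SAMPLE = """\
# Example Co
> Developer documentation for the Example Co API.

## Docs
- [Quickstart](https://example.com/docs/quickstart): first API call
"""

print(lint_llms_txt(SAMPLE))         # []
print(lint_llms_txt("Hello world"))  # four problems reported
```

A check like this only enforces internal consistency. It says nothing about whether any crawler reads the file.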

What This Means for Site Owners

The finding that should drive immediate action is not llms.txt adoption—it is the robots.txt AI-user-agent gap. If your robots.txt does not explicitly name current AI crawlers, you have no documented policy for how those crawlers treat your content. Adding explicit AI user-agent directives to robots.txt is lower implementation effort than publishing a new file type, and it has higher certainty of being respected because the major crawler operators already follow robots.txt directives.
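One way to confirm that a robots.txt policy says what you intend is to parse it the way a compliant crawler would. The sketch below uses Python's standard-library `urllib.robotparser`; the robots.txt body, the paths, and the helper name `can_fetch_map` are hypothetical, and the crawler names are the ones listed earlier in this article. The specific allow and block choices are placeholders, not recommendations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one AI crawler entirely, restrict another
# to public paths, and leave every other agent on the default rule.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private/

User-agent: *
Disallow: /admin/
"""

def can_fetch_map(robots_txt, agents, url):
    """Return {agent: bool} for whether each agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

agents = ["GPTBot", "ClaudeBot", "Bytespider"]
print(can_fetch_map(ROBOTS_TXT, agents, "https://example.com/docs/page"))
print(can_fetch_map(ROBOTS_TXT, agents, "https://example.com/private/report"))
```

In this example Bytespider is not named, so it falls through to the `*` rule. That is the "AI-silent" situation described above: the crawler is governed by a default that was not written with it in mind.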

For AI-visibility work, JSON-LD schema markup is the most reliably consumed structured-data signal available today. It is deployed at 53.8% of the Fortune 500 and consistently read by both training-data crawlers and AI search engines, so it offers a concrete, parseable signal that does not depend on whether crawlers choose to check a new endpoint.
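For readers who have not shipped JSON-LD before, the markup is a JSON object embedded in a script tag. The following is a minimal sketch that builds a schema.org `Organization` block; the helper name `organization_jsonld`, the company name, and the URLs are placeholders, and a real deployment would choose types and properties to match its own pages.

```python
import json

def organization_jsonld(name, url, same_as):
    """Build a <script> tag holding schema.org Organization markup."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
    }
    # Escape "<" so a value can never close the script tag early.
    body = json.dumps(data, indent=2).replace("<", "\\u003c")
    return '<script type="application/ld+json">\n' + body + "\n</script>"

print(organization_jsonld(
    "Example Co",                       # placeholder values
    "https://example.com",
    ["https://www.linkedin.com/company/example-co"],
))
```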

llms.txt has a legitimate use case, but narrower than its coverage suggests. If you publish documentation, developer tooling, or product content that AI assistants are likely to retrieve during active user sessions—not just index in bulk—a well-structured llms.txt can help models navigate your content once they have already reached your domain. Treat it as a session-time navigation aid rather than a crawler-acquisition mechanism.

Before committing engineering resources to llms.txt, instrument it first. Log every GET /llms.txt request and match user-agent strings against known AI crawler lists. After four weeks, review how many verified AI crawler requests hit the file. If that count is in single or low double digits, the file is not being consumed at a scale that justifies significant optimization effort.
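The instrumentation step can be as simple as a script run over existing access logs. The sketch below assumes logs in the common "combined" format and matches user-agent strings against a short, hypothetical token list drawn from the crawler names mentioned earlier; the function name, sample log lines, and user-agent strings are all illustrative. User-agent strings can be spoofed, so treating a hit as verified would need an additional check, such as comparing the source IP against ranges the operator publishes, which is not shown here.

```python
import re
from collections import Counter

# Hypothetical token list; extend it from the crawler lists you trust.
AI_AGENT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

# Combined log format: ... "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def count_llms_txt_hits(lines):
    """Count GET /llms.txt requests per AI crawler token."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or m["method"] != "GET":
            continue
        path = m["path"].split("?", 1)[0]  # ignore any query string
        if path != "/llms.txt":
            continue
        ua = m["ua"].lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in ua:
                hits[token] += 1
                break
    return hits

SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleFetcher (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jun/2026:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 9000 "-" "ExampleFetcher (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [02/Jun/2026:08:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(count_llms_txt_hits(SAMPLE_LOG))
```

Only the first sample line counts: the second is a request for a different path, and the third is an ordinary browser. Summing the counter after four weeks gives the number to compare against the threshold described above.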