Block the Wrong Bot and Your Products Vanish From AI Search

You can have the cleanest product catalog on the open web and still be invisible inside ChatGPT and Perplexity. Not because your data is wrong, but because a single line in a file most merchants never open is quietly telling the shopping crawlers to stay out.

TL;DR: AI shopping engines use two different kinds of crawlers: training bots and retrieval bots. Blocking the retrieval bot removes your products from AI search answers entirely, even if your feed and schema are perfect. Most robots.txt rules were written before this distinction existed, so a lot of stores are blocking the wrong bot by accident.

The robots.txt file sits at the root of your domain and tells automated crawlers which paths they can read. For most of its three decades it was a search-engine housekeeping detail. In 2026 it is a discovery gate that decides whether an AI shopping assistant can see your products at all.

Why are there two AI crawlers instead of one?

OpenAI runs separate crawlers for separate jobs, and each one reads your robots.txt independently. GPTBot collects content to train foundation models. OAI-SearchBot reads pages to answer live questions inside ChatGPT search. They are governed by different rules, and you can allow one while blocking the other.
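To make the "allow one, block the other" point concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content and the example.com product URL are hypothetical, and Python's parser is only an approximation of how each engine's own crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of training, stay open to retrieval.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

# Hypothetical product URL used only for illustration.
PRODUCT_URL = "https://example.com/products/blue-widget"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "OAI-SearchBot"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, PRODUCT_URL) else "blocked"
    print(f"{agent}: {verdict}")
```

Each named group is evaluated on its own, which is why the training opt-out here leaves the retrieval bot untouched.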

OpenAI's crawler reference is explicit about the consequence: "Sites that are opted out of OAI-SearchBot will not be shown in ChatGPT search answers." That single sentence is the whole problem. A merchant who blocked GPTBot in 2023 to keep their content out of training data may have copied a broader rule, such as a catch-all User-agent: * block, that also caught the retrieval bot. Now they are missing from ChatGPT shopping results without ever knowing it.

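Google draws a similar line. Google-Extended, which is a robots.txt control token rather than a separate crawler, governs training use of your content, while standard Googlebot does the crawling that feeds AI Mode and AI Overviews. Perplexity also runs more than one agent: PerplexityBot for indexing and Perplexity-User for fetches triggered by a user's question. The pattern holds across the major engines: the control that governs model training and the bot that answers the shopper are not the same thing.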

How a training-era block became a discovery problem

Most ecommerce robots.txt files were last touched during the 2023 backlash against AI training. The instinct then was defensive: keep our content out of the machines. That instinct is now expensive.

The blocking data shows how widespread that defensive habit became. An analysis of top news sites found that 71% block at least one AI retrieval bot, and 67% block PerplexityBot specifically, which is used for indexing rather than training. Publishers led this wave, but the same copy-paste robots.txt rules circulated through ecommerce templates and agency boilerplate.

GPTBot has been the most-blocked AI crawler since it launched. Within months, 15% of the top 100 websites and 7% of the top 1,000 were blocking it, and the blocking habit has only spread since. When a store blocks GPTBot with a broad rule and does not separately allow the retrieval bot, it can disappear from the exact surfaces where AI-referred shopping traffic is growing fastest.

Blocking a training bot protects your content from being used to train a model. Blocking a retrieval bot removes you from the answer. Merchants keep doing the second when they only meant to do the first.
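One way the accident can happen is an allowlist-style file: a few classic search bots are named, and everything else falls into a catch-all block. The template below is a hypothetical example of that pattern, not taken from any specific store, and the check uses Python's standard-library parser as a rough stand-in for the real crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist-style template: anything not named is shut out.
TEMPLATE = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

AGENTS = ["Googlebot", "GPTBot", "OAI-SearchBot", "PerplexityBot"]


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to whether it may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical product URL.
result = allowed_agents(TEMPLATE, AGENTS, "https://example.com/products/blue-widget")
for agent, ok in result.items():
    print(f"{agent:15} {'allowed' if ok else 'BLOCKED'}")
```

The file never mentions OAI-SearchBot or PerplexityBot, yet both end up blocked alongside GPTBot because they fall through to the catch-all group.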

What this looks like on a real store

The failure is silent. Nothing breaks, no error fires, and your site analytics look normal because human traffic is unaffected. The only symptom is absence: a shopper asks ChatGPT or Perplexity for a product in your category, and a competitor's name comes back instead of yours.

This is why crawl access sits underneath everything else in AI discovery. Schema markup, enriched attributes, and a clean product feed all assume the crawler can reach the page in the first place. If the retrieval bot is blocked, none of that work ever gets read. Proper AI crawler management is the foundation the rest of your AI shopping search strategy stands on.

It also interacts with how AI search actually retrieves. When an engine runs query fan-out, it can decompose one shopping question into roughly 8 to 12 sub-queries and fetch passages to answer each one. Passage-level retrieval only works if the retrieval bot can fetch the passage. Block it, and you are not ranked low for those sub-queries. You are not in the candidate set at all.

Which bots matter for AI shopping discovery

The list of crawlers that decide your AI shopping visibility is short. These are the retrieval-side agents to allow, not block, if you want products surfaced in AI answers.

Crawler	Engine	Job	Block it and...
OAI-SearchBot	ChatGPT search	Retrieval for live answers	You drop out of ChatGPT search answers
Googlebot	Google AI Mode, AI Overviews	Indexing that feeds AI surfaces	You lose AI Mode and Overview placement
PerplexityBot	Perplexity	Indexing for answers	You drop out of Perplexity results
GPTBot	OpenAI training	Foundation model training	Your content is not used for training (no direct search impact)
Google-Extended	Google training	Gemini model training	Your content is not used for Gemini training (no direct search impact)

The distinction in the last two rows is the one that trips merchants up. Blocking GPTBot or Google-Extended is a defensible choice about training data. Blocking OAI-SearchBot, Googlebot, or PerplexityBot is a choice to be invisible in AI shopping, and almost nobody makes it on purpose.
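The table can be turned into a small lookup that sorts whatever a robots.txt blocks into "training" and "retrieval" buckets. This is a sketch: the role labels come from the table above, while the sample file and URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Roles taken from the table above.
CRAWLER_ROLES = {
    "OAI-SearchBot": "retrieval",
    "Googlebot": "retrieval",
    "PerplexityBot": "retrieval",
    "GPTBot": "training",
    "Google-Extended": "training",
}


def blocked_by_role(robots_txt: str, url: str) -> dict[str, list[str]]:
    """Group the user agents that cannot fetch the URL by their role."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked: dict[str, list[str]] = {"retrieval": [], "training": []}
    for agent, role in CRAWLER_ROLES.items():
        if not parser.can_fetch(agent, url):
            blocked[role].append(agent)
    return blocked


# Hypothetical file: a deliberate training opt-out plus one stray retrieval block.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""

report = blocked_by_role(SAMPLE, "https://example.com/products/blue-widget")
print("Training opt-outs (a choice):", report["training"])
print("Retrieval blocks (visibility loss):", report["retrieval"])
```

Anything that lands in the retrieval bucket is worth a second look, since that is the list that costs visibility.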

What to do this week

These steps take a few hours and need no new tooling. They are the cheapest AI visibility fix most stores have available.

Open yourdomain.com/robots.txt in a browser today. Read it line by line. Note every User-agent block and every Disallow rule that could affect a crawler.
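Confirm OAI-SearchBot, Googlebot, and PerplexityBot are not disallowed from your product and category paths. A catch-all User-agent: * group applies to any bot that has no group of its own, so if a broad Disallow: / sits there and the retrieval bots are not named elsewhere, that is your problem.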
Separate training from retrieval deliberately. If you want to block training, block GPTBot and Google-Extended by name, and explicitly allow the retrieval bots so the rule does not bleed over.
Check your CDN and bot-protection layer too. Cloudflare and similar services can block AI crawlers at the network edge even when robots.txt allows them. The robots.txt fix is wasted if the firewall still returns a block.
Consider an llms.txt file to guide AI agents to your highest-value product and category pages once access is open. It complements crawl access; it does not replace it.
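The first few steps above can be scripted as a quick audit. This is a rough sketch under stated assumptions: the paths and sample file are hypothetical, fetch_robots_txt is an illustrative helper that is defined but not called here, and Python's parser may differ from each engine's own matching in edge cases, so treat the output as a prompt to read the file rather than a verdict.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

RETRIEVAL_BOTS = ("OAI-SearchBot", "Googlebot", "PerplexityBot")


def fetch_robots_txt(base_url: str, timeout: float = 10.0) -> str:
    """Download robots.txt from the site root (illustrative helper)."""
    url = base_url.rstrip("/") + "/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def audit_retrieval_access(robots_txt: str, base_url: str, paths: list[str]) -> list[tuple[str, str]]:
    """Return (bot, path) pairs where a retrieval bot is disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    findings = []
    for bot in RETRIEVAL_BOTS:
        for path in paths:
            if not parser.can_fetch(bot, base_url.rstrip("/") + path):
                findings.append((bot, path))
    return findings


# Hypothetical file and paths; swap in fetch_robots_txt("https://yourdomain.com").
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /cart
Disallow: /checkout
"""
PATHS = ["/products/blue-widget", "/collections/widgets"]

findings = audit_retrieval_access(SAMPLE, "https://example.com", PATHS)
for bot, path in findings:
    print(f"{bot} cannot fetch {path}")
```

An empty result only covers the robots.txt layer. The CDN and firewall check in the list still applies.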

After the fix, verify rather than assume. Run the same category queries in ChatGPT, Perplexity, and Google AI Mode that a shopper would, and track whether your products start appearing. Monitoring your found rate across engines, meaning the share of those queries where your products show up, is how you confirm the gate actually opened. Give it time first: OpenAI notes it can take roughly 24 hours after a robots.txt change for its search systems to adjust.
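Tracking can be as simple as a tally. The sketch below assumes "found rate" means the share of test queries on an engine where at least one of your products appears. The observations are made-up placeholders that you would replace with results recorded by hand.

```python
from collections import defaultdict


def found_rates(observations):
    """Compute per-engine found rate from (engine, query, found) records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for engine, _query, found in observations:
        totals[engine] += 1
        hits[engine] += int(found)
    return {engine: hits[engine] / totals[engine] for engine in totals}


# Placeholder observations, not real measurements.
OBSERVATIONS = [
    ("ChatGPT", "best waterproof hiking boots", True),
    ("ChatGPT", "lightweight trail boots under $150", False),
    ("Perplexity", "best waterproof hiking boots", False),
    ("Perplexity", "lightweight trail boots under $150", False),
    ("Google AI Mode", "best waterproof hiking boots", True),
    ("Google AI Mode", "lightweight trail boots under $150", True),
]

rates = found_rates(OBSERVATIONS)
for engine, rate in rates.items():
    print(f"{engine}: {rate:.0%} of test queries surfaced a product")
```

Running the same query set before and after the robots.txt change gives a like-for-like comparison.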

Frequently Asked Questions

Does blocking GPTBot hurt my ChatGPT shopping visibility?
Not by itself. GPTBot handles training, and blocking it keeps your content out of model training. The risk is a broad rule that also catches OAI-SearchBot, the retrieval crawler. If only GPTBot is blocked and OAI-SearchBot is allowed, your ChatGPT search visibility is intact.

How do I know if my store is blocking the wrong bot?
Read your robots.txt directly at yourdomain.com/robots.txt. Look for any Disallow rule that applies to OAI-SearchBot, Googlebot, or PerplexityBot on your product paths. A catch-all block or an over-broad rule written during the 2023 training backlash is the most common culprit.

Will fixing robots.txt put my products in AI search immediately?
No. Allowing a crawler grants access; it does not guarantee placement. OpenAI says it can take around 24 hours for a robots.txt change to register, and the engine still has to crawl, index, and judge your pages relevant. Crawl access is the precondition, not the finish line.

Is robots.txt the only place crawlers get blocked?
No. CDN and bot-management tools like Cloudflare can block AI crawlers at the network edge regardless of robots.txt. Check both layers. A permissive robots.txt does nothing if your firewall is still returning a block to the retrieval bot.
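A rough way to probe the edge layer is to request the same page with a browser-like user agent and a bot-like one, then compare status codes. This is a sketch with real limits: both user-agent strings are illustrative rather than the exact published values, and a bot-management service may treat a spoofed string from your own IP differently from the genuine crawler. Treat a mismatch as a hint to check your firewall settings, not as proof.

```python
import urllib.error
import urllib.request

# Illustrative strings only; check each vendor's docs for exact values.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
BOT_UA = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; the code is what we want.
        return err.code


def compare_edge_access(url: str, bot_user_agent: str = BOT_UA, fetch=fetch_status) -> dict:
    """Compare the browser and bot status codes for the same URL."""
    browser = fetch(url, BROWSER_UA)
    bot = fetch(url, bot_user_agent)
    return {
        "browser": browser,
        "bot": bot,
        # Browser succeeds while the bot gets an error status.
        "suspect_edge_block": browser < 400 <= bot,
    }
```

Calling compare_edge_access("https://yourdomain.com/products/some-product") performs two live requests. The fetch parameter exists so the comparison logic can be exercised without touching the network.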

What about smaller AI shopping engines and agents?
The same logic applies. Each agent that fetches live web content uses an identifiable user agent. The safe default for a store that wants AI discovery is to allow retrieval and indexing bots broadly while restricting training bots by name if you choose to restrict at all.
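That default can be sketched as a tiny generator: every bot is allowed unless it is a training bot you name. The function name and the agent called SomeNewShoppingAgent are hypothetical. A real store would also keep its existing rules for carts, checkout, and admin paths, which are omitted here for brevity.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_training_bots: list[str]) -> str:
    """Build a robots.txt that blocks named training bots and allows everyone else."""
    lines: list[str] = []
    for bot in blocked_training_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Catch-all group: any bot not named above, including retrieval bots
    # that launch later, stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["GPTBot", "Google-Extended"])
print(robots_txt)

# Sanity-check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
url = "https://example.com/products/blue-widget"  # hypothetical URL
for agent in ("GPTBot", "OAI-SearchBot", "SomeNewShoppingAgent"):
    print(agent, "allowed" if parser.can_fetch(agent, url) else "blocked")
```

Blocking by name and allowing by default means a new retrieval agent does not need a rule added before it can see your products.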

Crawl access is the least glamorous part of AI shopping readiness and the most consequential when it is wrong. Every dollar spent on structured data and enriched feeds assumes a crawler can read the page. Before optimizing what the bots see, make sure the right ones can see it at all.