Robots.txt

Robots.txt is a plain text file at the root of your site that tells search engine crawlers which URLs they may or may not request. It controls crawling, not indexing.

Robots.txt is a small text file that lives at the root of your domain, at yoursite.com/robots.txt, and it is the first thing a well-behaved crawler checks before it fetches anything else. It uses a simple set of rules to say which paths a given bot is allowed to request and which it should leave alone. It is one of the oldest tools in technical SEO and also one of the easiest to get wrong.

Robots.txt controls whether a crawler fetches a URL. It does not control whether that URL ends up in the index.

That distinction trips people up constantly, so let me be blunt about it. If you Disallow a page in robots.txt, you stop Google from fetching it. But if other sites link to that page, Google can still list the URL in search results, just with no description because it was never allowed to read the content. Robots.txt is a crawling instruction, full stop.

Basic syntax

# Apply to all crawlers
User-agent: *
# Block a private folder
Disallow: /admin/
# Block a specific file
Disallow: /private-page.html
# Allow everything else (implicit, but explicit is fine)
Allow: /

# Point crawlers to your sitemap
Sitemap: https://www.yoursite.com/sitemap.xml

The pieces are simple. User-agent names the bot the rules apply to, with the asterisk meaning all bots. Disallow lists a path the bot should not request. Allow can carve out an exception inside a blocked directory. And the Sitemap line points crawlers to your sitemap, which is the one directive that helps discovery rather than restricting it.
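To see those pieces in action, here is a minimal sketch that feeds the same rules to Python's standard-library parser, urllib.robotparser, and asks whether a few URLs may be fetched. The bot name ExampleBot and the test paths are made up for illustration. The rules are written without blank lines inside the group because Python's parser treats a blank line as the end of a group. It also applies the first matching rule and does not understand wildcards, so treat it as an approximation of how a given search engine will read the file, not a replica.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the basic syntax example, with comments removed.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /private-page.html
Allow: /

Sitemap: https://www.yoursite.com/sitemap.xml
"""


def build_parser(rules_text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(RULES)
site = "https://www.yoursite.com"

# "ExampleBot" is a hypothetical crawler name; it falls under the * group.
for path in ("/admin/users", "/private-page.html", "/blog/post"):
    print(path, parser.can_fetch("ExampleBot", site + path))
# /admin/users False
# /private-page.html False
# /blog/post True

# site_maps() requires Python 3.8 or later.
print(parser.site_maps())
# ['https://www.yoursite.com/sitemap.xml']
```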

Mistakes that wreck a site

Shipping a Disallow: / line from a staging environment, which blocks the entire site.
Blocking CSS or JavaScript files the engine needs to render the page correctly.
Using robots.txt to hide a page from search, when noindex is the right tool for that.
Putting the file anywhere other than the root of the host, where crawlers will not look for it. Each subdomain needs its own file.
Assuming all bots obey it. The major search engines respect it, but malicious scrapers ignore it entirely.

WATCH OUT

The single most expensive robots.txt mistake is leaving Disallow: / on a live site after launch. It tells every crawler to stay out of the whole domain, and traffic can crater within days. Check this file immediately after any deployment.
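One way to make that check routine is a small script that runs after each deployment. The sketch below is an assumption about how you might do it, not a standard tool: the function name blocked_paths and the list of critical paths are hypothetical, and it relies on Python's urllib.robotparser, which matches simple path prefixes only.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(rules_text, must_be_crawlable, user_agent="*"):
    """Return the paths from must_be_crawlable that the rules disallow."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return [p for p in must_be_crawlable if not parser.can_fetch(user_agent, p)]


# Hypothetical paths that should always stay crawlable on the live site,
# including the CSS and JavaScript needed to render pages.
CRITICAL_PATHS = ["/", "/assets/site.css", "/assets/app.js"]

staging_rules = "User-agent: *\nDisallow: /\n"
live_rules = "User-agent: *\nDisallow: /admin/\n"

print(blocked_paths(staging_rules, CRITICAL_PATHS))
# ['/', '/assets/site.css', '/assets/app.js']
print(blocked_paths(live_rules, CRITICAL_PATHS))
# []
```

In a real pipeline you would pass in the text of the deployed robots.txt and fail the build if the returned list is not empty.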

How to test it

Google Search Console includes a robots.txt report that shows the file Google last fetched and flags any warnings or errors it found. To confirm whether a specific URL is allowed or blocked, run it through the URL Inspection tool, which reports whether crawling is permitted. Test before you trust, especially after any change.

Block to save crawl, noindex to hide

Use robots.txt to keep crawlers away from low-value sections so they spend their budget on pages that matter. Use a noindex tag when your real goal is keeping a page out of search results, and leave that page crawlable, because a crawler that cannot fetch the page never sees the tag.

It helps to think about what robots.txt is really for. On a small site you may not need to block anything at all, and a file that contains nothing but a Sitemap line is perfectly fine. The file earns its keep on larger sites, where you want to steer crawlers away from low-value sections like internal search results, filtered URL variations, and admin areas so they spend their time on pages that matter. That is the productive use of the file, and it is very different from trying to hide pages from search.
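On a larger site it can help to generate the file from a short list of low-value sections instead of editing it by hand. The following is a rough sketch under stated assumptions: the function render_robots_txt and the paths /search and /admin/ are hypothetical, so substitute whatever your own site uses. It handles plain path prefixes only. Filtered URL variations usually need wildcard patterns, which Python's parser cannot check.

```python
from urllib.robotparser import RobotFileParser


def render_robots_txt(disallowed_prefixes, sitemap_url):
    """Build robots.txt text: one group for all crawlers, then a Sitemap line."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {prefix}" for prefix in disallowed_prefixes]
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Hypothetical low-value sections: internal search results and the admin area.
LOW_VALUE = ["/search", "/admin/"]
text = render_robots_txt(LOW_VALUE, "https://www.yoursite.com/sitemap.xml")
print(text)

# Sanity-check the generated rules before shipping them.
parser = RobotFileParser()
parser.parse(text.splitlines())
print(parser.can_fetch("*", "/search?q=shoes"))   # False
print(parser.can_fetch("*", "/products/shoes"))   # True
```

Passing an empty list produces the small-site case: no Disallow rules, just the Sitemap line.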

Robots.txt is a blunt instrument, and that is exactly why it deserves care. One bad line affects your entire crawl, so treat every edit as a deployment that needs testing rather than a quick tweak. For how it fits with sitemaps, canonicals, and the rest of your crawl strategy, see my technical SEO guide.

RELATED TERMS

Crawling, XML Sitemap, Crawl Budget

GO DEEPER

Technical SEO

Crawl, render, index.
