How to Write a robots.txt File That Actually Works

robots.txt is one of the most misunderstood files on the web. Learn the correct syntax, common mistakes, what search engines actually respect, and what they ignore.

What robots.txt Does (and Does Not Do)

robots.txt is a text file placed at the root of your website (https://example.com/robots.txt) that tells search engine crawlers which pages or directories they should or should not visit.

It is a courtesy protocol, not a security measure. Key limitations:

  • robots.txt only binds crawlers that choose to respect it (all major search engines do; malicious scrapers do not)
  • Blocking a URL in robots.txt does not prevent Google from indexing it if other sites link to it – Google can index a URL it has never crawled, based on signals from other pages
  • It cannot protect private content – use authentication for that
  • Syntax errors silently break rules; one misplaced character can block your entire site

The Basic Syntax

A robots.txt file consists of records, each specifying a user agent and directives:

User-agent: *
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml

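User-agent: – which crawler the following directives apply to. * means all crawlers. A crawler obeys only the most specific group that matches its name, so in the example above Googlebot follows its own group and ignores the * group (it is not blocked from /private/).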

Disallow: – paths the crawler should not visit. An empty Disallow: means "disallow nothing" (allow everything).

Allow: – explicitly allows a path that would otherwise be blocked by a broader Disallow: rule.

Sitemap: – optional, tells crawlers where to find your XML sitemap.
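To see how these directives interact, the example file can be fed to Python's standard-library parser. This is a minimal sketch, not a model of any particular search engine: urllib.robotparser does not implement * or $ inside paths and applies rules in file order, so it can disagree with Google on more complex files. The crawler name SomeOtherBot and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, held in a string (hypothetical site).
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler without its own group falls back to the * group.
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/public/index.html"))    # True

# Googlebot has its own group, so only /staging/ is off limits for it.
print(parser.can_fetch("Googlebot", "https://example.com/staging/new.html"))        # False
print(parser.can_fetch("Googlebot", "https://example.com/private/report.html"))     # True

# Sitemap lines are independent of any group (site_maps() needs Python 3.8+).
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

The last can_fetch call is the surprising one: because Googlebot matched a dedicated group, the /private/ rule in the * group no longer applies to it.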

Important Rules

Path matching is prefix-based

Disallow: /private/ blocks any URL starting with /private/ – including /private/page.html, /private/images/, etc.

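Disallow: /page.html blocks every URL that starts with /page.html, which includes /page.html?ref=1 and /page.html.bak. To match only that exact path, anchor the rule with $: Disallow: /page.html$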

Disallow: / blocks the entire site, because every URL path starts with /. This is the correct syntax for shutting crawlers out completely, so use it only when that is what you intend.
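Prefix matching is simple enough to sketch in a few lines. The function below is a hypothetical helper, not part of any library; it ignores wildcards and Allow: rules and only shows why a missing trailing slash widens a rule.

```python
def is_disallowed(path, disallow_prefixes):
    """Return True if path starts with any non-empty Disallow prefix."""
    # An empty prefix (a bare "Disallow:") blocks nothing, so it is skipped.
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)


# With the trailing slash, only the directory is blocked.
print(is_disallowed("/private/page.html", ["/private/"]))   # True
print(is_disallowed("/private-notes.html", ["/private/"]))  # False

# Without it, anything that begins with "/private" is caught.
print(is_disallowed("/private-notes.html", ["/private"]))   # True

# "/" is a prefix of every path; an empty value is a prefix of none.
print(is_disallowed("/anything", ["/"]))  # True
print(is_disallowed("/anything", [""]))   # False
```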

Multiple user agents, one block

You can group multiple user agents before a directive block:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /staging/

The rules that follow apply to every user agent listed in the group. The group ends at the next User-agent: line that comes after a rule; that line starts a new group.
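The grouping logic can be sketched as a small parser: consecutive User-agent: lines accumulate into one group, and the first User-agent: line after a rule starts the next group. The function name parse_groups and the sample file are assumptions made for this illustration, and the sketch ignores every directive other than User-agent:, Allow:, and Disallow:.

```python
def parse_groups(text):
    """Split robots.txt text into (agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:
                # A rule has already been seen, so this line opens a new group.
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


SAMPLE = """
User-agent: Googlebot
User-agent: Bingbot
Disallow: /staging/

User-agent: *
Disallow:
"""

for group_agents, group_rules in parse_groups(SAMPLE):
    print(group_agents, group_rules)
```

Run on the sample, this yields two groups: one shared by Googlebot and Bingbot with the /staging/ rule, and one for * with an empty Disallow:.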

Wildcards

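* in a Disallow: or Allow: path matches any sequence of characters, including none:

  • Disallow: /*.pdf$ – blocks every URL that ends in .pdf ($ anchors the end of the URL, so /file.pdf?download=1 is not matched)
  • Disallow: /search? – blocks every URL that starts with /search? (no wildcard needed, because matching is prefix-based)
  • Disallow: /*? – blocks every URL containing a query string (be careful: this also catches paginated and filtered URLs you may want crawled)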

Wildcards in User-agent only support * for "all crawlers" – you cannot use partial matches like Google*.
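One way to reason about path wildcards is to translate a rule into a regular expression: * becomes "any characters", a trailing $ becomes an end anchor, and everything else is literal. The sketch below assumes that interpretation; the function names are hypothetical, and real crawlers may differ in edge cases such as percent-encoding.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(rule_path, url_path):
    """True if the rule applies to url_path (match is anchored at the start)."""
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/*.pdf$", "/files/report.pdf"))             # True
print(rule_matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
print(rule_matches("/search?", "/search?q=shoes"))              # True
print(rule_matches("/search?", "/search/help"))                 # False
print(rule_matches("/*?", "/products?page=2"))                  # True
print(rule_matches("/*?", "/products"))                         # False
```

Note that ? is treated as a literal character here, not as a single-character wildcard: robots.txt patterns are not shell globs.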

Common Mistakes

Blocking CSS and JavaScript

Disallow: /assets/

If your CSS and JavaScript files are in /assets/, blocking them prevents Googlebot from rendering your page correctly. Google renders pages like a browser and needs access to your assets to see your page as users do.

Confusing Disallow: with noindex

Disallow: stops crawling. noindex (in an HTTP header or meta tag) stops indexing. They are different:

  • A page blocked by robots.txt can still be indexed based on signals from other pages
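  • To keep a page out of search results: use noindex and leave the page crawlable. If robots.txt blocks the page, Google never fetches it and cannot see the noindex directive
  • To apply noindex: add <meta name="robots" content="noindex"> to the page, or send an X-Robots-Tag: noindex HTTP header (useful for PDFs and other non-HTML files)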
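To check which pages actually carry a noindex signal, both places it can live need to be inspected: the robots meta tag and the X-Robots-Tag response header. The following is a rough sketch using only the standard library; has_noindex is a hypothetical helper, it takes the HTML and headers as arguments rather than fetching anything, and it ignores crawler-specific variants such as a googlebot meta tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html, headers=None):
    """True if the meta tag or the X-Robots-Tag header contains noindex."""
    header_directives = []
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            header_directives += [d.strip().lower() for d in value.split(",")]
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_directives or "noindex" in meta.directives


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))                                             # True
print(has_noindex("<p>hello</p>", {"X-Robots-Tag": "noindex, nofollow"}))  # True
print(has_noindex("<p>hello</p>"))                                   # False
```

A page that returns True here and is also disallowed in robots.txt is the problem case described above: the directive exists but a compliant crawler will never read it.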

Empty Disallow = Allow All

User-agent: *
Disallow:

This means "disallow nothing" – it allows all crawlers to access everything. This is valid and commonly used as a placeholder to tell crawlers they are welcome.

Case sensitivity

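Paths in robots.txt rules are case-sensitive: Disallow: /Private/ does not block /private/. Directive names (Disallow: versus disallow:) and User-agent values, by contrast, are matched case-insensitively.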

A Minimal Correct robots.txt

For a standard website with no pages to hide:

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

This explicitly allows all crawlers and points them to your sitemap.

Validating Your robots.txt

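Google has retired the standalone robots.txt Tester. In Search Console, use the robots.txt report (under Settings) to see which robots.txt files Google found, when they were last fetched, and any warnings or errors, and use the URL Inspection tool to check whether a specific URL is blocked. The robots.txt Generator on this site builds a valid robots.txt from a form interface, handling correct syntax for you.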
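A local check before deploying can catch the worst case, an accidental site-wide block, without waiting for a crawler to notice. The sketch below assumes you keep a short list of URLs that must stay crawlable; find_blocked and the example.com URLs are hypothetical. It relies on Python's urllib.robotparser, which does not understand * or $ in paths, so treat it as a smoke test rather than a full validator.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(robots_txt, urls, agent="*"):
    """Return the URLs from urls that the given agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


MUST_BE_CRAWLABLE = [
    "https://example.com/",
    "https://example.com/assets/site.css",
]

minimal = "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n"
broken = "User-agent: *\nDisallow: /\n"

print(find_blocked(minimal, MUST_BE_CRAWLABLE))  # []
print(find_blocked(broken, MUST_BE_CRAWLABLE))   # both URLs are reported
```

Wired into a deployment script, a non-empty result would be a reason to stop the release and inspect the file.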

Summary

robots.txt uses User-agent:, Disallow:, and Allow: directives to guide (not enforce) crawler behavior. Common mistakes include blocking assets, confusing disallow with noindex, and using wildcard paths incorrectly. Always validate before deploying – a single syntax error can accidentally block your entire site from being crawled.