The syntax of this file is obscure to most of us: it tells robots not to look at pages which have certain paths in their URLs. Each section includes the name of the user agent (robot) and the paths it may not follow. There is no way to allow a specific directory, or to specify a kind of file. You should remember that robots may access any directory path in a URL which is not explicitly disallowed in this file: everything not forbidden is OK.
You can usually read this file by requesting it from the server in a browser (for example, www.0email.net/robots.txt). It is served as a plain text page, so it's easy to read.
This is all documented in the Standard for Robot Exclusion, and all robots should recognize and honor the rules in the robots.txt file.
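The "everything not forbidden is OK" rule can be sketched in a few lines of Python. This is only an illustration of the prefix matching described above, not code from the standard; the function name is_allowed and the sample paths are made up for the example.

```python
def is_allowed(path, disallowed_prefixes):
    """Return True unless `path` starts with one of the disallowed prefixes."""
    for prefix in disallowed_prefixes:
        # An empty Disallow value forbids nothing, so it is skipped.
        if prefix and path.startswith(prefix):
            return False
    # Anything that is not explicitly disallowed is allowed.
    return True


# Hypothetical rules and paths, for illustration only.
rules = ["/cgi-bin/", "/tmp/", "/private/"]
print(is_allowed("/index.html", rules))        # True
print(is_allowed("/tmp/scratch.html", rules))  # False
```

Because the test is a plain prefix comparison, "/tmp/" blocks everything below that directory but does not block a path such as "/tmpfile.html".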
Entry Meaning
User-agent: *
Disallow:
The asterisk (*) in the User-agent field is shorthand for "all robots". Because nothing is disallowed, everything is allowed.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
In this example, all robots can visit every directory except the three mentioned.
User-agent: BadBot
Disallow: /
In this case, the BadBot robot is not allowed to see anything. The slash is shorthand for "all directories".
The User Agent can be any unique substring, and robots are not supposed to care about capitalization.
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
The blank line indicates a new "record" - a new user agent command.
All other robots can see everything except the "private" folder.
User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html

User-agent: *
Disallow: /tmp/
Disallow: /private/
This keeps the WeirdBot from visiting the listing page in the links directory, the tmp directory and the private directory.
All other robots can see everything except the tmp and private directories.
If you think this is inefficient, you're right!
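To see how a robot might apply records like these, here is a minimal sketch using the urllib.robotparser module from Python's standard library, fed with the two-record BadBot example from the table. The robot name OtherBot and the page paths are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-record example from the table; the blank line separates the records.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# BadBot is shut out of everything.
print(parser.can_fetch("BadBot", "/index.html"))
# Any other robot falls through to the "*" record.
print(parser.can_fetch("OtherBot", "/index.html"))
print(parser.can_fetch("OtherBot", "/private/notes.html"))
```

On a live site, the same parser can load the file itself with set_url() followed by read() instead of parse().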
Bad Examples - Common Wrong Entries
Use one of the robots.txt checkers listed below to see if your file is malformed.
User-agent: *
Disallow /
NO! This entry is missing the colon after "Disallow".
User-agent: *
Disallow: *
NO! If you want to disallow everything, use a slash (indicating the root directory).
User-agent: sidewiner
Disallow: /tmp/
NO! Robots will ignore misspelled User Agent names. Check your server logs and the listings of User Agent names.
Thanks to BM at Inktomi for suggesting a "bad examples" section.
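Two of the mistakes above, the missing colon and "Disallow: *", can be caught mechanically. The following is a rough sketch of such a check, not a replacement for the checkers listed below; the function name lint_robots_txt and its messages are invented for this example. A misspelled User Agent name cannot be detected this way, because that requires a list of real robot names.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for a few common mistakes."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop comments (everything after "#") and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank lines only separate records
        if ":" not in line:
            problems.append((number, "missing colon after field name"))
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "disallow" and value == "*":
            problems.append((number, "use 'Disallow: /' to block everything"))
    return problems


# The first two bad examples from the table, as one sample file.
sample = "User-agent: *\nDisallow /\n\nUser-agent: *\nDisallow: *\n"
for number, message in lint_robots_txt(sample):
    print(number, message)
```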
--------------------------------------------------------------------------------
The official guidelines were written up in 1996 or so:
Standard for Robot Exclusion
Guidelines for Robot Writers
Web Server Administrator's Guide to the Robots Exclusion Protocol.
Robots.txt Checkers
SearchEngineWorld robots.txt syntax checker, Tutorial, and Most Frequent Problems pages
BotWatch robots.txt syntax checker
UK Office for Library and Information Networking - WebWatch Robots.txt checker.
RoboGen visual editor for Robots Exclusion files, allowing users to choose folders and files interactively, manage multiple domains and recognize large numbers of user agents (robot self-identifiers).
Robotcop
This free server module watches for spiders which read pages disallowed in robots.txt, and blocks all further requests from that IP address. It is particularly useful for blocking email address harvesters, while still allowing legitimate search engine spiders. Before implementing it, be sure to double-check your robots.txt file (use one or more of the checkers above), and watch your server logs carefully afterwards. The August 2002 version (0.6) works with Apache 1.3 on FreeBSD and Linux.
Listings of Robot "User Agent" Names
Note that your robots.txt file does not have to include complete names or version numbers -- the standard says "A case insensitive substring match of the name without version information is recommended."
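One way to read the recommendation quoted above is sketched here: strip the version from the robot's own name, then look for the record's name inside it, ignoring case. The function name agent_matches and the sample agent strings are assumptions for illustration, and real robots may interpret the rule differently.

```python
def agent_matches(record_name, robot_user_agent):
    """Case-insensitive substring match of a User-agent record against a robot name."""
    if record_name == "*":
        return True  # the asterisk stands for all robots
    # Drop version information such as "/2.1" before comparing.
    name_only = robot_user_agent.split("/", 1)[0]
    return record_name.lower() in name_only.lower()


print(agent_matches("badbot", "BadBot/1.0"))         # True
print(agent_matches("Bad", "BadBot/1.0"))            # True
print(agent_matches("sidewiner", "Sidewinder/2.0"))  # False
```

The last call shows why a misspelled name in robots.txt is silently ignored: it never matches the robot it was meant for.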
Web Robots Database
List at robotstxt.org, may not be current.
SearchEngineWatch SpiderSpotting Chart
Displays User Agent and host names for webwide search engine robot spiders.
Agents and Robots List - WebReference.com
Lightly annotated listing of agent and robot software.
Search Engine Robots
Lists of search engines, agent names and their information links, updated fairly frequently.
There are a few proposed extensions of the Robots.txt standard, but they have been pretty quiet lately:
Martijn Koster's 1996 RFC Draft Memo on Web Robots Control
Sean Connor's proposal for An Extended Standard for Robot Exclusion (version 2.0)
Charles Koller's Robot Exclusion Standard Revisited (1996)
For more information on robots on the SearchTools Site:
Robots Information Page
Summary of the most important things about web crawling robots.
META Robots Tag Page
Describes the META Robots tag contents and implications for search indexing robots.
Indexing Robot Checklist
A list of important items for those creating robots for search indexing.
List of Robot Source Code
Links to free and commercial source code for robot indexing spiders.
List of Robot Development Consultants
Consultants who can provide services in this area.
Articles and Books on Robots and Spiders
Overview articles and technical discussions of robot crawlers, mainly for search engines.
SearchTools Robots Testing
Test cases for common robot problems.
