sofiaweb

"robots.txt" Search Bots


By creating a robots.txt file, we can influence the search bots' behaviour and tell them which parts of our site to crawl and index and which parts to leave alone. Imagine we have private folders on our website, for example a folder or file containing e-mail addresses that we don't want published; with a few simple directives in the robots.txt file we can ask the search robots to stay away from them. Here we go.
We use the /robots.txt file to give instructions about our site to web robots; this is called the Robots Exclusion Protocol.

Simply put, robots.txt is a very simple text file placed in our root directory, for example http://forums.xisto.com/robots.txt. This file tells search engines and other robots which areas of our site they are allowed to visit and index.
The rule is: we can have ONLY one robots.txt per site, and ONLY in the root directory (where our home page is):

TRUE: http://forums.xisto.com/robots.txt (Works)

FALSE: http://forums.xisto.com/subdirectory/robots.txt (Does not work)

All the big search engine spiders respect this file, but unfortunately most spambots (e-mail collectors and harvesters) do not. If you need real security, or you have files or content to hide, you have to put them in a protected directory; you can't rely on the robots.txt file alone.

So what programs do we need to create it? Good old Notepad or any text editor is enough. All we need to do is create a new text file and name it. Attention: the name has to be "robots.txt"; it cannot be "robot.txt", "Robot.txt" or "robots.TXT". Simple: no capitals, and "robots" with an s!

Now let's start writing. A simple robots.txt looks like this:

User-agent: *
Disallow:

The "User-agent: *" means this section applies to all robots, the wildcard "*" means all bots. The "Disallow: " tells the robots that they can go anywhere they want.

User-agent: *
Disallow: /

wildcard "*" used in this one too, so all bots must read this. But in this one, there is a little difference, a slash "/" in the Disallow line, which means dont allow anything to be crwaled, so the bots don't crwal you website, the good ones of course.

If we want all bots to obey a section, we put the wildcard "*" in the User-agent line. Leaving the Disallow: line blank means "come and crawl my site, bots!", and a single slash means "keep out!". Simple. That is the simplest setup; now let's learn how to let some bots crawl while keeping others out.

The User-agent line is where we identify a specific bot and control its behaviour. Say we want the Google bot to crawl the site but not the Yahoo bot. What would our text file look like?
Simple: all we need to know are the bots' names. I will list the bot names further down, but first let's make a sample file.
User-agent: googlebot
Disallow:

User-agent: yahoo-slurp
Disallow: /
In this sample, we addressed googlebot and left its Disallow line blank, so we told it to crawl the website. In the second section we addressed the Yahoo bot, but its Disallow line has a slash, so we told it to stay away.

Now we will learn how to keep certain folders of our site away from the search spiders while letting other folders be crawled. To do this, we change the value in the Disallow line. Say we have two folders in our domain, /images and /emails, and we want /images to be crawled but not /emails. The text file would look like this:
User-agent: *
Disallow: /emails/
As we can see, we addressed all the robots and excluded the /emails folder; the rest of the website can still be crawled by the robots.
Here are a few samples to make it clearer.
To exclude all folders from all the bots
User-agent: *
Disallow: /
To exclude a single folder from all the bots
User-agent: *
Disallow: /emails/
To exclude all folders from one specific bot
User-agent: googlebot
Disallow: /
User-agent: *
Disallow:
To allow just one bot to crawl the site
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /
To allow all the bots to see all the folders:
User-agent: *
Disallow:
After these examples, I believe you've got it. Now there are a few rules we should know. In the basic standard we can't use a wildcard "*" in the Disallow line; most bots will not understand it (Googlebot and MSNBot do), so a line like "Disallow: /emails/*.htm" is not valid for all bots. Another rule: you have to write a separate User-agent section for each specific bot, and a separate Disallow line for each directory you want to exclude. "User-agent: googlebot, yahoobot" and "Disallow: /emails, /images" are not valid.
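To make the last rule concrete, the invalid combined lines above would have to be written out as separate sections with one Disallow line per directory. A valid equivalent, using the example directories from above and the real bot names, would look something like this:

User-agent: googlebot
Disallow: /emails/
Disallow: /images/

User-agent: yahoo-slurp
Disallow: /emails/
Disallow: /images/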

Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and the e-mail address harvesters used by spammers, will pay no attention to it.
The /robots.txt file is publicly available: anyone can see which sections of your server you don't want robots to use. So don't try to use /robots.txt to hide information.

Is it possible to allow just one file or folder to be crawled and block the rest? There is no Allow line in the basic robots.txt standard, but in practice the same effect can be achieved. How? Move all the files you don't want seen into one folder and disallow that folder, for example "Disallow: /filesthatIdontwanttoshare/".
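Put together, a file using that trick might look like this (the folder name is just the example from above):

User-agent: *
Disallow: /filesthatIdontwanttoshare/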

Major Known Spiders
Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ)

Google
Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name. To block all files of a specific file type (for example, to keep .jpg images indexed but not .gif images), you would use the following robots.txt entry:

User-agent: Googlebot-Image
Disallow: /*.gif$

Yahoo
Yahoo also supports a few specific commands, including:

the Crawl-delay: xx instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawl rate is a problem for your server, you can raise the delay to 5, 20, or whatever value is comfortable for your server.

Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like:

User-agent: Yahoo-Blogs/v3.9
Crawl-delay: 20

Ask / Teoma
Supports the crawl-delay command.
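Assuming Teoma accepts the same syntax as the Yahoo example above, a section for it might look something like this (the 10-second value is only an illustration):

User-agent: teoma
Crawl-delay: 10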

MSN Search
Supports the crawl-delay command, and also allows wildcard behaviour:

User-agent: msnbot
Disallow: /*.[file extension]$
(the "$" is required, in order to declare the end of the file)

Examples:

User-agent: msnbot
Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.exe$

Why do I want a Robots.txt?
There are several reasons why you would want to control robots' visits to your site:

*It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc.).

*It gives you a very basic level of protection - although it's not real security, it will keep people from easily finding stuff you don't want turning up in search engines. They actually have to visit your site and browse to the directory instead of finding it on Google, MSN, Yahoo or Teoma.

*It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day. If you don't have one it generates a "404 Not Found" error each time. It's hard to wade through all of these to find genuine errors at the end of the month.

*It can prevent spam and penalties associated with duplicate content. Let's say you have a high-speed and a low-speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site, you can find yourself in ill favor with some search engines. You can use the robots.txt file to prevent the duplicate content from being indexed and therefore avoid issues (see the example after this list). Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet.

*It's good programming policy. Pros have a robots.txt; amateurs don't. Which group do you want your site to be in? This is more of an ego/image thing than a "real" reason, but in competitive areas, or when applying for a job, it can make a difference. Some employers may consider not hiring a webmaster who didn't know how to use one, on the assumption that they may not know other, more critical things either. Many feel it's sloppy and unprofessional not to use one.
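For the duplicate-content and test-area cases above, a minimal sketch might simply block those areas from every bot; the /print/ and /test/ directory names here are only placeholders:

User-agent: *
Disallow: /print/
Disallow: /test/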


So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
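For example, on many shared hosts the document root is a folder named something like public_html or htdocs (the exact name varies by host), so you would upload the file as public_html/robots.txt, right next to index.html.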

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

MAJOR SEARCH BOTS - SPIDERS NAMES
Google = googlebot
MSN Search = msnbot
Yahoo = yahoo-slurp
Ask/Teoma = teoma
GigaBlast = gigabot
Scrub The Web = scrubby
DMOZ Checker = robozilla
Nutch = nutch
Alexa/Wayback = ia_archiver
Baidu = baiduspider

Specific Special Bots:
Google Image = googlebot-image
Yahoo MM = yahoo-mmcrawler
MSN PicSearch = psbot
SingingFish = asterias
Yahoo Blogs = yahoo-blogs/v3.9

Feel free to share your information.
Edited by yordan: quoted the text copied from http://urbanoalvarez.es/blog/2008/04/18/writing-a-good-robots-file/


Nice guide. Another key to a good site is blocking off unwanted areas, and duplicate entries. robots.txt is a great way of doing this, and takes little to no time to set up. Another great file that doesn't take long to setup is .htaccess, but that's another topic :D

Confused about the "$" symbol

This line was a little confusing to me: "(the "$" is required, in order to declare the end of the file)". I am going to assume you were talking only about file types when you said that, and that the convention for a typical folder block would not require the "$". I hope I am correct.

-reply by Steve
