With a robots.txt file we can influence the search bots' behaviour: we can tell them which parts of our site to crawl and index and which parts to leave alone. Imagine we have private folders on our website, for example a folder or file containing e-mail addresses we don't want published; a few simple lines in robots.txt will keep the search robots away from them. Here we go.

We use the /robots.txt file to give instructions about our site to web robots; this is called the Robots Exclusion Protocol. robots.txt is simply a plain text file placed in the root directory of the site, for example http://forums.xisto.com/robots.txt. This file tells search engines and other robots which areas of our site they are allowed to visit and index.

The rule is: we can have ONLY one robots.txt per site, and ONLY in the root directory (where the home page lives):

TRUE: http://forums.xisto.com/robots.txt (works)
FALSE: http://forums.xisto.com/no_longer_exists/robots.txt (a robots.txt inside a subdirectory does not work)

All the big search engine spiders respect this file, but unfortunately most spambots (e-mail collectors and harvesters) do not. If you want real security, or you have files that genuinely must stay hidden, put them in a password-protected directory; you cannot trust robots.txt alone.

So what programs do we need to create it? Good old Notepad or any other text editor is enough: create a new text file and name it. Attention: the name has to be exactly "robots.txt", not "robot.txt", "Robot.txt" or "robots.TXT". All lower case, and plural "robots".

Now we can start writing in it. The simplest robots.txt looks like this:

User-agent: *
Disallow:

"User-agent: *" means this section applies to all robots; the wildcard "*" stands for every bot. The empty "Disallow:" line tells the robots they can go anywhere they want.

User-agent: *
Disallow: /

The wildcard "*" is used here too, so every bot must obey this section. The difference is the slash "/" on the Disallow line, which means nothing may be crawled, so the bots (the well-behaved ones, of course) stay off the whole site. To sum up: a wildcard "*" in the User-agent line addresses all bots; an empty Disallow line means "come crawl my site"; a slash means "keep out". Simple.
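If you want to double-check how a compliant crawler reads these rules, Python's standard urllib.robotparser module implements the same protocol. A minimal sketch (the domain is only a placeholder):

from urllib.robotparser import RobotFileParser

# The "keep everyone out" file from above.
rules = """\
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # feed the rules directly instead of fetching a URL

# "Disallow: /" blocks every path; with an empty Disallow line this would print True.
print(rp.can_fetch("*", "http://example.com/page.html"))  # False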
This is the simplest setup; now we can learn how to let some bots crawl while keeping others out. The User-agent line is where we name the bot we are addressing. Say we want the Google bot to crawl the site but not the Yahoo bot. All we need to know is the bots' names (a full list is at the end of this post). The file would look like this:

User-agent: googlebot
Disallow:

User-agent: yahoo-slurp
Disallow: /

In this sample we addressed googlebot and left its Disallow line blank, so we told it to crawl the whole website; then we addressed the Yahoo bot with a slash on its Disallow line, so we told it to go away.

Next: how do we keep the spiders out of some folders while letting them search the others? We change the value on the Disallow line. Say we have two folders on our domain, /images and /emails, and we want /images to be searched but /emails not. Then the file would look like:

User-agent: *
Disallow: /emails/

We addressed all the robots and excluded only the /emails folder; the rest of the website can still be searched. Here are a few recipes to make it clearer.

To exclude the whole site from all bots:

User-agent: *
Disallow: /

To exclude one folder from all bots:

User-agent: *
Disallow: /emails/

To exclude the whole site from one bot only:

User-agent: googlebot
Disallow: /

User-agent: *
Disallow:

To allow just one bot to crawl the site:

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /

To allow all bots into all folders:

User-agent: *
Disallow:
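If you maintain rules for several bots, it can be less error-prone to generate the file than to edit it by hand. A small sketch in Python; the rule set and output path are made up for illustration:

from pathlib import Path

# One entry per bot: user-agent name -> list of disallowed paths.
# An empty list means "allow everything".
rules = {
    "googlebot": [],          # full access
    "yahoo-slurp": ["/"],     # blocked from the whole site
    "*": ["/emails/"],        # everyone else: stay out of /emails/ only
}

lines = []
for agent, disallows in rules.items():
    lines.append(f"User-agent: {agent}")
    # The protocol wants one Disallow line per path, so write each on its own line.
    for path in (disallows or [""]):
        lines.append(f"Disallow: {path}")
    lines.append("")  # blank line between sections

Path("robots.txt").write_text("\n".join(lines))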
After these examples I believe you've got it. Now there are a few rules that we should know. In the classic standard we can't use a wildcard "*" on the Disallow line; most bots simply won't read it (Google's and MSN's bots are exceptions, covered below). So a line like "Disallow: /emails/*.htm" is not valid for all bots. Another rule: each specific bot needs its own User-agent and Disallow lines, and each directory you want to exclude needs its own Disallow line. "User-agent: googlebot, yahoobot" and "Disallow: /emails, /images" are not valid.

Also, robots can simply ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and the e-mail address harvesters used by spammers, will pay no attention. And /robots.txt is a publicly available file: anyone can see which sections of your server you don't want robots to use (fetching one is sketched after the spider list below). So don't try to use /robots.txt to hide information.

Is it possible to allow just one file or folder to be crawled and exclude the rest? There is no Allow line in the classic robots.txt standard, but in practice you can get the same effect: put all the files you don't want seen into one folder and disallow that folder, for example "Disallow: /filesthatIdontwanttoshare/".

MAJOR KNOWN SPIDERS
Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ)
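Because the file is public, you can read anyone's. A quick sketch with Python's standard library, assuming the site is reachable (the URL is the example from above):

from urllib.request import urlopen

# robots.txt always sits at the root, so the URL is predictable.
with urlopen("http://forums.xisto.com/robots.txt") as resp:
    print(resp.read().decode("utf-8", errors="replace"))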
GOOGLE
Google allows the use of asterisks: Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name. For example, to keep Google Image Search away from all .gif files (while still allowing .jpg images), you would use the following robots.txt entry:

User-agent: Googlebot-Image
Disallow: /*.gif$

YAHOO
Yahoo also has a few specific commands, including the Crawl-delay: xx instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawl rate is a problem for your server, you can raise the delay to 5, 20, or whatever value your server is comfortable with. Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 looks like this:

User-agent: Yahoo-Blogs/v3.9
Crawl-delay: 20

ASK / TEOMA
Supports the Crawl-delay command.

MSN SEARCH
Supports the Crawl-delay command, and also allows wildcard behaviour:

User-agent: msnbot
Disallow: /*.[file extension]$

(the "$" is required to declare the end of the filename). Examples:

User-agent: msnbot
Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.exe$
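Python's urllib.robotparser can read a Crawl-delay value back out as well (crawl_delay() exists from Python 3.6 on); a sketch using the Yahoo example above:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Yahoo-Blogs/v3.9
Crawl-delay: 20
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Returns the delay for a matching agent, or None when no entry applies.
print(rp.crawl_delay("Yahoo-Blogs/v3.9"))  # 20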
WHY DO I WANT A ROBOTS.TXT?
There are several reasons you would want to control a robot's visits to your site:

* It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc.).

* It gives you a very basic level of protection - although it's not real security, it keeps content you don't want easily accessible from turning up in search engines. People actually have to visit your site and find the directory instead of finding it on Google, MSN, Yahoo or Teoma.

* It cleans up your logs - every time a search engine visits your site it requests robots.txt, which can happen several times a day. If you don't have one, each request generates a "404 Not Found" error, and it's hard to wade through all of those to find the genuine errors at the end of the month (a quick way to check your own site is sketched after this list).

* It can prevent penalties associated with duplicate content. Let's say you have a high-speed and a low-speed version of your site, or a landing page intended only for advertising campaigns. If this content duplicates other content on your site, you can find yourself in ill favour with some search engines. You can use the robots.txt file to keep the duplicate from being indexed and so avoid the issue. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet.

* It's good practice. Pros have a robots.txt; amateurs don't. Which group do you want your site to be in? This is more of an image thing than a "real" reason, but in competitive areas, or when applying for a job, it can make a difference: some employers may pass over a webmaster who didn't know how to use one, on the assumption that they may not know other, more critical things either. Many feel it's simply sloppy and unprofessional not to use one.
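To see which case your site falls into, you can request the file and look at the status code. A sketch with Python's standard library; swap in your own domain for the placeholder:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    with urlopen("http://example.com/robots.txt") as resp:
        print("Found, status", resp.status)   # crawlers get your rules
except HTTPError as err:
    print("Missing, status", err.code)        # every crawler visit logs an error like this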
So, as a website owner, you need to put the file in the right place on your web server for the resulting URL to work. Usually that is the same place where you put your site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software. And remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

MAJOR SEARCH BOTS - SPIDERS NAMES
Google = googlebot
MSN Search = msnbot
Yahoo = yahoo-slurp
Ask/Teoma = teoma
GigaBlast = gigabot
Scrub The Web = scrubby
DMOZ Checker = robozilla
Nutch = nutch
Alexa/Wayback = ia_archiver
Baidu = baiduspider

Specific special bots:
Google Image = googlebot-image
Yahoo MM = yahoo-mmcrawler
MSN PicSearch = psbot
SingingFish = asterias
Yahoo Blogs = yahoo-blogs/v3.9
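As a closing sketch, here is how you could test one policy against several of those bot names at once with Python's urllib.robotparser (the rules and URL are just the examples from this post):

from urllib.robotparser import RobotFileParser

# googlebot may crawl everything; everyone else stays out of /emails/.
rules = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /emails/
"""

bots = ["googlebot", "msnbot", "yahoo-slurp", "teoma", "baiduspider"]

rp = RobotFileParser()
rp.parse(rules.splitlines())

for bot in bots:
    ok = rp.can_fetch(bot, "http://example.com/emails/list.html")
    print(bot, "->", "allowed" if ok else "blocked")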
User-agent: * Disallow: / wildcard "*" used in this one too, so all bots must read this. But in this one, there is a little difference, a slash "/" in the Disallow line, which means dont allow anything to be crwaled, so the bots don't crwal you website, the good ones of course. If we want all the bots read this text file, we should insert a "wildcard (*)" in the User-agent line. And when we leave the Disallow: line blank, it means come crawl my site you bots!, and when there is a slash it means keep out! Simple. This is the simplest way, now we can learn keep some bots crawling and some not. User-agent line is the part we are gonna work on to define the bot's identity and behaviour. For example we want the google bot to crawl the site but yahoo bot not. Then how will our text file look ? Simple, all we need to know is the names of the bots, that's all. I will give their bot names but first let's make a sample file. User-agent: googlebot Disallow: User-agent: yahoo-slurp Disallow: / In this sample, we called the googlebot and left the disallow line blank so we said crawl my website. And in the second line we called the yahoo bot but in the disallow line we have a slash so we wanted it to go away. Now we are going to learn how to avoid some folders of our site get searched by the search spiders and, how to get some folders be searched at same time. For this, we will change the values in the disallow line. For example we have two folders in our domain, /images, and /emails. We want /images to be searched but /emails not. Then the text file would look like: User-agent: * Disallow: /emails/ As we can see, we called all the robots to read this, and we dont want the /emails folder to be seen, we excluded it but the rest of the website can be searched by the robots. Here are few samples to make it clearer. To exclude all folders from all the bots User-agent: * Disallow: / To exclude any folder from all the bots User-agent: * Disallow: /emails/ To exclude all folders from a bot User-agent: googlebot Disallow: / User-agent: * Disallow: To allow just one bot to crawl the site User-agent: googlebot Disallow: User-agent: * Disallow: / To allow all the bots to see the all folders: User-agent: * Disallow: After learning these, I believe you guys got it. Now there are a few rules that we should know. We can't use a wildcrad"*" in the Disallow line, bots don't read it then ( Google and MSNbot can). so a line like "Disallow: /emails/*.htm" is not a valid line for all the bots. Another rule is, you have to make new user-agent and disallow lines for each spesific bots, and you have to make a new disallow line for each directory that you want to exclude. "user-agent: googlebot, yahoobot"and "disallow: /emails, /images" are not valid. Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention. the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use. So don't try to use /robots.txt to hide information. Is it possible to allow just one file or folder or directory to be crawled and the rest not? Simply there is no allow line in robots.txt, but mentally yea that can be done. How? You can insert all the files that you don't want to be seen in a folder and disallow it. 
For example, "Disallow: /filesthatIdontwanttoshare/ " Major Known Spiders Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ) Google Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "{:content:}quot; to indicate the end of a name. To remove all files of a specific file type (for example, to include .jpg but not .gif images), you'd use the following robots.txt entry: User-agent: Googlebot-Image Disallow: /*.gif$ Yahoo Yahoo also has a few specific commands, including the: Crawl-delay: xx instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawler rate is a problem for your server, you can set the delay up to up to 5 or 20 or a comfortable value for your server. Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like: User-agent: Yahoo-Blogs/v3.9 Crawl-delay: 20 Ask / Teoma Supports the crawl-delay command. MSN Search Supports the crawl-delay command. Also allows wildcard behavior User-agent: msnbot Disallow: /*.[file extension]$ (the "{:content:}quot; is required, in order to declare the end of the file) Examples: User-agent: msnbot Disallow: /*.PDF$ Disallow: /*.jpeg$ Disallow: /*.exe$ Why do I want a Robots.txt? There are several reasons you would want to control a robots visit to your site: *It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc) *It gives you a very basic level of protection - although it's not very good security, it will keep people from easily finding stuff you don't want easily accessible via search engines. They actually have to visit your site and go to the directory instead of finding it on Google, MSN, Yahoo or Teoma. *It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day. If you don't have one it generates a "404 Not Found" error each time. It's hard to wade through all of these to find genuine errors at the end of the month. *It can prevent spam and penalties associated with duplicate content. Lets say you have a high speed and low speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site you can find yourself in ill-favor with some search engines. You can use the robots.txt file to prevent the content from being indexed, and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet. *It's good programming policy. Pros have a robots.txt. Amateurs don't. What group do you want your site to be in? This is more of an ego/image thing than a "real" reason but in competitive areas or when applying for a job can make a difference. Some employers may consider not hiring a webmaster who didn't know how to use one, on the assumption that they may not to know other, more critical things, as well. Many feel it's sloppy and unprofessional not to use one. So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. 
Where exactly that is, and how to put the file there, depends on your web server software. Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT. MAJOR SEARCH BOTS - SPIDERS NAMES Google = googlebot MSN Search = msnbot Yahoo = yahoo-slurp Ask/Teoma = teoma GigaBlast = gigabot Scrub The Web = scrubby DMOZ Checker = robozilla Nutch = nutch Alexa/Wayback = ia_archiver Baidu = baiduspider Specific Special Bots: Google Image = googlebot-image Yahoo MM = yahoo-mmcrawler MSN PicSearch = psbot SingingFish = asterias Yahoo Blogs = yahoo-blogs/v3.9If we make our robots.txt file, we can change the Search Bots' behaviours, and we can tell them where to search and publish and where to not. Imagine we have privacy folders in our website, for example a folder or a file which includes e-mail addresses so we don't want them get published, then we can avoid them get seen by the Search robots by a few simple commands on the robots.txt file. Here we go. We use the /robots.txt file to give instructions about our site to web robots; this is called The Robots Exclusion Protocol. Simply, the robots.txt is a very simple text file that is placed on our root directory. For example http://forums.xisto.com/robots.txt.'>http://forums.xisto.com/robots.txt. This file tells search engine and other robots which areas of our site they are allowed to visit and index. The rules is, we can ONLY have one robots.txt on our site and ONLY in the root directory (where our home page is): TRUE: http://forums.xisto.com/robots.txt (Works) FALSE: http://forums.xisto.com/no_longer_exists/ (Does not work) All the big search engine spiders respect this file, but unfortunately most spambots (email collectors, harvesters) do not. If you want security on your site or if you got files or contents to hide, you have to actually put the files in a protected directory, you can't trust the robots.txt file only. So what programs we need to create it. Just the good ol notebook or any text editor program is enough, all we need to do is to create a new text file, and name it! Attention, the name has to be "robots.txt", cannot be "robot.txt" or "Robot.txt" or "robots.TXT". Simple, no Caps and robots! Then now we are starting to write in it, a simple robots.txt looks like this. User-agent: * Disallow: The "User-agent: *" means this section applies to all robots, the wildcard "*" means all bots. The "Disallow: " tells the robots that they can go anywhere they want. User-agent: * Disallow: / wildcard "*" used in this one too, so all bots must read this. But in this one, there is a little difference, a slash "/" in the Disallow line, which means dont allow anything to be crwaled, so the bots don't crwal you website, the good ones of course. If we want all the bots read this text file, we should insert a "wildcard (*)" in the User-agent line. And when we leave the Disallow: line blank, it means come crawl my site you bots!, and when there is a slash it means keep out! Simple. This is the simplest way, now we can learn keep some bots crawling and some not. User-agent line is the part we are gonna work on to define the bot's identity and behaviour. For example we want the google bot to crawl the site but yahoo bot not. Then how will our text file look ? Simple, all we need to know is the names of the bots, that's all. I will give their bot names but first let's make a sample file. 
User-agent: googlebot Disallow: User-agent: yahoo-slurp Disallow: / In this sample, we called the googlebot and left the disallow line blank so we said crawl my website. And in the second line we called the yahoo bot but in the disallow line we have a slash so we wanted it to go away. Now we are going to learn how to avoid some folders of our site get searched by the search spiders and, how to get some folders be searched at same time. For this, we will change the values in the disallow line. For example we have two folders in our domain, /images, and /emails. We want /images to be searched but /emails not. Then the text file would look like: User-agent: * Disallow: /emails/ As we can see, we called all the robots to read this, and we dont want the /emails folder to be seen, we excluded it but the rest of the website can be searched by the robots. Here are few samples to make it clearer. To exclude all folders from all the bots User-agent: * Disallow: / To exclude any folder from all the bots User-agent: * Disallow: /emails/ To exclude all folders from a bot User-agent: googlebot Disallow: / User-agent: * Disallow: To allow just one bot to crawl the site User-agent: googlebot Disallow: User-agent: * Disallow: / To allow all the bots to see the all folders: User-agent: * Disallow: After learning these, I believe you guys got it. Now there are a few rules that we should know. We can't use a wildcrad"*" in the Disallow line, bots don't read it then ( Google and MSNbot can). so a line like "Disallow: /emails/*.htm" is not a valid line for all the bots. Another rule is, you have to make new user-agent and disallow lines for each spesific bots, and you have to make a new disallow line for each directory that you want to exclude. "user-agent: googlebot, yahoobot"and "disallow: /emails, /images" are not valid. Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention. the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use. So don't try to use /robots.txt to hide information. Is it possible to allow just one file or folder or directory to be crawled and the rest not? Simply there is no allow line in robots.txt, but mentally yea that can be done. How? You can insert all the files that you don't want to be seen in a folder and disallow it. For example, "Disallow: /filesthatIdontwanttoshare/ " Major Known Spiders Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ) Google Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "{:content:}quot; to indicate the end of a name. To remove all files of a specific file type (for example, to include .jpg but not .gif images), you'd use the following robots.txt entry: User-agent: Googlebot-Image Disallow: /*.gif$ Yahoo Yahoo also has a few specific commands, including the: Crawl-delay: xx instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawler rate is a problem for your server, you can set the delay up to up to 5 or 20 or a comfortable value for your server. 
Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like: User-agent: Yahoo-Blogs/v3.9 Crawl-delay: 20 Ask / Teoma Supports the crawl-delay command. MSN Search Supports the crawl-delay command. Also allows wildcard behavior User-agent: msnbot Disallow: /*.[file extension]$ (the "{:content:}quot; is required, in order to declare the end of the file) Examples: User-agent: msnbot Disallow: /*.PDF$ Disallow: /*.jpeg$ Disallow: /*.exe$ Why do I want a Robots.txt? There are several reasons you would want to control a robots visit to your site: *It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc) *It gives you a very basic level of protection - although it's not very good security, it will keep people from easily finding stuff you don't want easily accessible via search engines. They actually have to visit your site and go to the directory instead of finding it on Google, MSN, Yahoo or Teoma. *It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day. If you don't have one it generates a "404 Not Found" error each time. It's hard to wade through all of these to find genuine errors at the end of the month. *It can prevent spam and penalties associated with duplicate content. Lets say you have a high speed and low speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site you can find yourself in ill-favor with some search engines. You can use the robots.txt file to prevent the content from being indexed, and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet. *It's good programming policy. Pros have a robots.txt. Amateurs don't. What group do you want your site to be in? This is more of an ego/image thing than a "real" reason but in competitive areas or when applying for a job can make a difference. Some employers may consider not hiring a webmaster who didn't know how to use one, on the assumption that they may not to know other, more critical things, as well. Many feel it's sloppy and unprofessional not to use one. So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software. Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT. MAJOR SEARCH BOTS - SPIDERS NAMES Google = googlebot MSN Search = msnbot Yahoo = yahoo-slurp Ask/Teoma = teoma GigaBlast = gigabot Scrub The Web = scrubby DMOZ Checker = robozilla Nutch = nutch Alexa/Wayback = ia_archiver Baidu = baiduspider Specific Special Bots: Google Image = googlebot-image Yahoo MM = yahoo-mmcrawler MSN PicSearch = psbot SingingFish = asterias Yahoo Blogs = yahoo-blogs/v3.9If we make our robots.txt file, we can change the Search Bots' behaviours, and we can tell them where to search and publish and where to not. Imagine we have privacy folders in our website, for example a folder or a file which includes e-mail addresses so we don't want them get published, then we can avoid them get seen by the Search robots by a few simple commands on the robots.txt file. Here we go. 
We use the /robots.txt file to give instructions about our site to web robots; this is called The Robots Exclusion Protocol. Simply, the robots.txt is a very simple text file that is placed on our root directory. For example http://forums.xisto.com/robots.txt.'>http://forums.xisto.com/robots.txt. This file tells search engine and other robots which areas of our site they are allowed to visit and index. The rules is, we can ONLY have one robots.txt on our site and ONLY in the root directory (where our home page is): TRUE: http://forums.xisto.com/robots.txt (Works) FALSE: http://forums.xisto.com/no_longer_exists/ (Does not work) All the big search engine spiders respect this file, but unfortunately most spambots (email collectors, harvesters) do not. If you want security on your site or if you got files or contents to hide, you have to actually put the files in a protected directory, you can't trust the robots.txt file only. So what programs we need to create it. Just the good ol notebook or any text editor program is enough, all we need to do is to create a new text file, and name it! Attention, the name has to be "robots.txt", cannot be "robot.txt" or "Robot.txt" or "robots.TXT". Simple, no Caps and robots! Then now we are starting to write in it, a simple robots.txt looks like this. User-agent: * Disallow: The "User-agent: *" means this section applies to all robots, the wildcard "*" means all bots. The "Disallow: " tells the robots that they can go anywhere they want. User-agent: * Disallow: / wildcard "*" used in this one too, so all bots must read this. But in this one, there is a little difference, a slash "/" in the Disallow line, which means dont allow anything to be crwaled, so the bots don't crwal you website, the good ones of course. If we want all the bots read this text file, we should insert a "wildcard (*)" in the User-agent line. And when we leave the Disallow: line blank, it means come crawl my site you bots!, and when there is a slash it means keep out! Simple. This is the simplest way, now we can learn keep some bots crawling and some not. User-agent line is the part we are gonna work on to define the bot's identity and behaviour. For example we want the google bot to crawl the site but yahoo bot not. Then how will our text file look ? Simple, all we need to know is the names of the bots, that's all. I will give their bot names but first let's make a sample file. User-agent: googlebot Disallow: User-agent: yahoo-slurp Disallow: / In this sample, we called the googlebot and left the disallow line blank so we said crawl my website. And in the second line we called the yahoo bot but in the disallow line we have a slash so we wanted it to go away. Now we are going to learn how to avoid some folders of our site get searched by the search spiders and, how to get some folders be searched at same time. For this, we will change the values in the disallow line. For example we have two folders in our domain, /images, and /emails. We want /images to be searched but /emails not. Then the text file would look like: User-agent: * Disallow: /emails/ As we can see, we called all the robots to read this, and we dont want the /emails folder to be seen, we excluded it but the rest of the website can be searched by the robots. Here are few samples to make it clearer. 
To exclude all folders from all the bots User-agent: * Disallow: / To exclude any folder from all the bots User-agent: * Disallow: /emails/ To exclude all folders from a bot User-agent: googlebot Disallow: / User-agent: * Disallow: To allow just one bot to crawl the site User-agent: googlebot Disallow: User-agent: * Disallow: / To allow all the bots to see the all folders: User-agent: * Disallow: After learning these, I believe you guys got it. Now there are a few rules that we should know. We can't use a wildcrad"*" in the Disallow line, bots don't read it then ( Google and MSNbot can). so a line like "Disallow: /emails/*.htm" is not a valid line for all the bots. Another rule is, you have to make new user-agent and disallow lines for each spesific bots, and you have to make a new disallow line for each directory that you want to exclude. "user-agent: googlebot, yahoobot"and "disallow: /emails, /images" are not valid. Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention. the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use. So don't try to use /robots.txt to hide information. Is it possible to allow just one file or folder or directory to be crawled and the rest not? Simply there is no allow line in robots.txt, but mentally yea that can be done. How? You can insert all the files that you don't want to be seen in a folder and disallow it. For example, "Disallow: /filesthatIdontwanttoshare/ " Major Known Spiders Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ) Google Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "{:content:}quot; to indicate the end of a name. To remove all files of a specific file type (for example, to include .jpg but not .gif images), you'd use the following robots.txt entry: User-agent: Googlebot-Image Disallow: /*.gif$ Yahoo Yahoo also has a few specific commands, including the: Crawl-delay: xx instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawler rate is a problem for your server, you can set the delay up to up to 5 or 20 or a comfortable value for your server. Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like: User-agent: Yahoo-Blogs/v3.9 Crawl-delay: 20 Ask / Teoma Supports the crawl-delay command. MSN Search Supports the crawl-delay command. Also allows wildcard behavior User-agent: msnbot Disallow: /*.[file extension]$ (the "{:content:}quot; is required, in order to declare the end of the file) Examples: User-agent: msnbot Disallow: /*.PDF$ Disallow: /*.jpeg$ Disallow: /*.exe$ Why do I want a Robots.txt? There are several reasons you would want to control a robots visit to your site: *It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc) *It gives you a very basic level of protection - although it's not very good security, it will keep people from easily finding stuff you don't want easily accessible via search engines. They actually have to visit your site and go to the directory instead of finding it on Google, MSN, Yahoo or Teoma. 
*It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day. If you don't have one it generates a "404 Not Found" error each time. It's hard to wade through all of these to find genuine errors at the end of the month. *It can prevent spam and penalties associated with duplicate content. Lets say you have a high speed and low speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site you can find yourself in ill-favor with some search engines. You can use the robots.txt file to prevent the content from being indexed, and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet. *It's good programming policy. Pros have a robots.txt. Amateurs don't. What group do you want your site to be in? This is more of an ego/image thing than a "real" reason but in competitive areas or when applying for a job can make a difference. Some employers may consider not hiring a webmaster who didn't know how to use one, on the assumption that they may not to know other, more critical things, as well. Many feel it's sloppy and unprofessional not to use one. So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software. Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT. MAJOR SEARCH BOTS - SPIDERS NAMES Google = googlebot MSN Search = msnbot Yahoo = yahoo-slurp Ask/Teoma = teoma GigaBlast = gigabot Scrub The Web = scrubby DMOZ Checker = robozilla Nutch = nutch Alexa/Wayback = ia_archiver Baidu = baiduspider Specific Special Bots: Google Image = googlebot-image Yahoo MM = yahoo-mmcrawler MSN PicSearch = psbot SingingFish = asterias Yahoo Blogs = yahoo-blogs/v3.9If we make our robots.txt file, we can change the Search Bots' behaviours, and we can tell them where to search and publish and where to not. Imagine we have privacy folders in our website, for example a folder or a file which includes e-mail addresses so we don't want them get published, then we can avoid them get seen by the Search robots by a few simple commands on the robots.txt file. Here we go. We use the /robots.txt file to give instructions about our site to web robots; this is called The Robots Exclusion Protocol. Simply, the robots.txt is a very simple text file that is placed on our root directory. For example http://forums.xisto.com/robots.txt.'>http://forums.xisto.com/robots.txt. This file tells search engine and other robots which areas of our site they are allowed to visit and index. The rules is, we can ONLY have one robots.txt on our site and ONLY in the root directory (where our home page is): TRUE: http://forums.xisto.com/robots.txt (Works) FALSE: http://forums.xisto.com/no_longer_exists/ (Does not work) All the big search engine spiders respect this file, but unfortunately most spambots (email collectors, harvesters) do not. If you want security on your site or if you got files or contents to hide, you have to actually put the files in a protected directory, you can't trust the robots.txt file only. So what programs we need to create it. 
Just the good ol notebook or any text editor program is enough, all we need to do is to create a new text file, and name it! Attention, the name has to be "robots.txt", cannot be "robot.txt" or "Robot.txt" or "robots.TXT". Simple, no Caps and robots! Then now we are starting to write in it, a simple robots.txt looks like this. User-agent: * Disallow: The "User-agent: *" means this section applies to all robots, the wildcard "*" means all bots. The "Disallow: " tells the robots that they can go anywhere they want. User-agent: * Disallow: / wildcard "*" used in this one too, so all bots must read this. But in this one, there is a little difference, a slash "/" in the Disallow line, which means dont allow anything to be crwaled, so the bots don't crwal you website, the good ones of course. If we want all the bots read this text file, we should insert a "wildcard (*)" in the User-agent line. And when we leave the Disallow: line blank, it means come crawl my site you bots!, and when there is a slash it means keep out! Simple. This is the simplest way, now we can learn keep some bots crawling and some not. User-agent line is the part we are gonna work on to define the bot's identity and behaviour. For example we want the google bot to crawl the site but yahoo bot not. Then how will our text file look ? Simple, all we need to know is the names of the bots, that's all. I will give their bot names but first let's make a sample file. User-agent: googlebot Disallow: User-agent: yahoo-slurp Disallow: / In this sample, we called the googlebot and left the disallow line blank so we said crawl my website. And in the second line we called the yahoo bot but in the disallow line we have a slash so we wanted it to go away. Now we are going to learn how to avoid some folders of our site get searched by the search spiders and, how to get some folders be searched at same time. For this, we will change the values in the disallow line. For example we have two folders in our domain, /images, and /emails. We want /images to be searched but /emails not. Then the text file would look like: User-agent: * Disallow: /emails/ As we can see, we called all the robots to read this, and we dont want the /emails folder to be seen, we excluded it but the rest of the website can be searched by the robots. Here are few samples to make it clearer. To exclude all folders from all the bots User-agent: * Disallow: / To exclude any folder from all the bots User-agent: * Disallow: /emails/ To exclude all folders from a bot User-agent: googlebot Disallow: / User-agent: * Disallow: To allow just one bot to crawl the site User-agent: googlebot Disallow: User-agent: * Disallow: / To allow all the bots to see the all folders: User-agent: * Disallow: After learning these, I believe you guys got it. Now there are a few rules that we should know. We can't use a wildcrad"*" in the Disallow line, bots don't read it then ( Google and MSNbot can). so a line like "Disallow: /emails/*.htm" is not a valid line for all the bots. Another rule is, you have to make new user-agent and disallow lines for each spesific bots, and you have to make a new disallow line for each directory that you want to exclude. "user-agent: googlebot, yahoobot"and "disallow: /emails, /images" are not valid. Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention. the /robots.txt file is a publicly available file. 
Anyone can see what sections of your server you don't want robots to use. So don't try to use /robots.txt to hide information. Is it possible to allow just one file or folder or directory to be crawled and the rest not? Simply there is no allow line in robots.txt, but mentally yea that can be done. How? You can insert all the files that you don't want to be seen in a folder and disallow it. For example, "Disallow: /filesthatIdontwanttoshare/ " Major Known Spiders Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ) Google Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "{:content:}quot; to indicate the end of a name. To remove all files of a specific file type (for example, to include .jpg but not .gif images), you'd use the following robots.txt entry: User-agent: Googlebot-Image Disallow: /*.gif$ Yahoo Yahoo also has a few specific commands, including the: Crawl-delay: xx instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawler rate is a problem for your server, you can set the delay up to up to 5 or 20 or a comfortable value for your server. Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like: User-agent: Yahoo-Blogs/v3.9 Crawl-delay: 20 Ask / Teoma Supports the crawl-delay command. MSN Search Supports the crawl-delay command. Also allows wildcard behavior User-agent: msnbot Disallow: /*.[file extension]$ (the "{:content:}quot; is required, in order to declare the end of the file) Examples: User-agent: msnbot Disallow: /*.PDF$ Disallow: /*.jpeg$ Disallow: /*.exe$ Why do I want a Robots.txt? There are several reasons you would want to control a robots visit to your site: *It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc) *It gives you a very basic level of protection - although it's not very good security, it will keep people from easily finding stuff you don't want easily accessible via search engines. They actually have to visit your site and go to the directory instead of finding it on Google, MSN, Yahoo or Teoma. *It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day. If you don't have one it generates a "404 Not Found" error each time. It's hard to wade through all of these to find genuine errors at the end of the month. *It can prevent spam and penalties associated with duplicate content. Lets say you have a high speed and low speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site you can find yourself in ill-favor with some search engines. You can use the robots.txt file to prevent the content from being indexed, and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet. *It's good programming policy. Pros have a robots.txt. Amateurs don't. What group do you want your site to be in? This is more of an ego/image thing than a "real" reason but in competitive areas or when applying for a job can make a difference. 
Some employers may consider not hiring a webmaster who didn't know how to use one, on the assumption that they may not to know other, more critical things, as well. Many feel it's sloppy and unprofessional not to use one. So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software. Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT. MAJOR SEARCH BOTS - SPIDERS NAMES Google = googlebot MSN Search = msnbot Yahoo = yahoo-slurp Ask/Teoma = teoma GigaBlast = gigabot Scrub The Web = scrubby DMOZ Checker = robozilla Nutch = nutch Alexa/Wayback = ia_archiver Baidu = baiduspider Specific Special Bots: Google Image = googlebot-image Yahoo MM = yahoo-mmcrawler MSN PicSearch = psbot SingingFish = asterias Yahoo Blogs = yahoo-blogs/v3.9If we make our robots.txt file, we can change the Search Bots' behaviours, and we can tell them where to search and publish and where to not. Imagine we have privacy folders in our website, for example a folder or a file which includes e-mail addresses so we don't want them get published, then we can avoid them get seen by the Search robots by a few simple commands on the robots.txt file. Here we go. We use the /robots.txt file to give instructions about our site to web robots; this is called The Robots Exclusion Protocol. Simply, the robots.txt is a very simple text file that is placed on our root directory. For example http://forums.xisto.com/robots.txt.'>http://forums.xisto.com/robots.txt. This file tells search engine and other robots which areas of our site they are allowed to visit and index. The rules is, we can ONLY have one robots.txt on our site and ONLY in the root directory (where our home page is): TRUE: http://forums.xisto.com/robots.txt (Works) FALSE: http://forums.xisto.com/no_longer_exists/ (Does not work) All the big search engine spiders respect this file, but unfortunately most spambots (email collectors, harvesters) do not. If you want security on your site or if you got files or contents to hide, you have to actually put the files in a protected directory, you can't trust the robots.txt file only. So what programs we need to create it. Just the good ol notebook or any text editor program is enough, all we need to do is to create a new text file, and name it! Attention, the name has to be "robots.txt", cannot be "robot.txt" or "Robot.txt" or "robots.TXT". Simple, no Caps and robots! Then now we are starting to write in it, a simple robots.txt looks like this. User-agent: * Disallow: The "User-agent: *" means this section applies to all robots, the wildcard "*" means all bots. The "Disallow: " tells the robots that they can go anywhere they want. User-agent: * Disallow: / wildcard "*" used in this one too, so all bots must read this. But in this one, there is a little difference, a slash "/" in the Disallow line, which means dont allow anything to be crwaled, so the bots don't crwal you website, the good ones of course. If we want all the bots read this text file, we should insert a "wildcard (*)" in the User-agent line. And when we leave the Disallow: line blank, it means come crawl my site you bots!, and when there is a slash it means keep out! Simple. This is the simplest way, now we can learn keep some bots crawling and some not. 
-
Free Unlimited File Storage Unlimited File Uploads
lory12 replied to muazamali's topic in Websites and Web Designing
The file limit is 25 MB? That's not good. -
Google Chrome theme isn't a very good theme
-
Yep, same here. I like Google Chrome because it doesn't have many options.