jlhaslip

Htaccess To Block Bots: How do you block bots and spiders?


I am curious about how to block bots and/or spiders from accessing my subdomain. I believe one of them may be using a terrible amount of bandwidth. Is this possible, and how do I stop it?
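For reference, here is a minimal .htaccess sketch for blocking a crawler by its User-Agent string (this assumes an Apache server with mod_rewrite enabled, and "BadBot" is a hypothetical name standing in for whatever agent shows up in your logs):

# Return 403 Forbidden to any request whose User-Agent contains "BadBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]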


using a terrible amount of Bandwidth

That's the first time I've heard of that. Is it really? How did you determine that bots are wasting so much of your bandwidth? And by the way, banning the bots is not a clever thing to do, I think. If Google's search results don't contain your site, then no new guests will come to view it.


The site I have uploaded is not yet public. I also host a forum, which is private. So far this month, about one third of my bandwidth is showing as being used by enquiries to/from the USA. Nothing against the Americans, of course, but the membership of the forum using my hosting account here at the Trap consists entirely of Canadian members. And since the only place I have announced the site is here at Xisto, it really is not public. The 'wasted/lost' bandwidth this month alone is over 180 megs out of my allowed 512 megs, so I am trying to put a stop to it. I suspect someone is hijacking my bandwidth. I don't know how or why, but it would be nice to put a stop to it. Any ideas? (I've placed IP bans on a couple of them. I'll check again tomorrow to see if they continue.)
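For reference, IP bans like the ones mentioned can also be done from .htaccess. A minimal sketch, assuming Apache 2.2-style access control; the two addresses below are placeholders, not real offenders:

# Allow everyone except the two suspect addresses (placeholder IPs)
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 198.51.100.25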


If you want my advice, don't use htaccess; instead, use the well-known "robots.txt".

If you have a problem building robots.txt, no problem; there's a great tool for that, named:

RoboGen (free edition)
http://forums.xisto.com/no_longer_exists/

This is a fantastic little tool that lists all the most popular spiders, bots, etc., and lets you allow or deny their access to the pages of your website. It's pretty easy to learn and work with, but if you encounter any problems, just send me a private message.
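For example, a robots.txt that turns away a single crawler while leaving all others alone could look like the sketch below ("HungryBot" is a hypothetical user-agent token; keep in mind that robots.txt is only honoured by well-behaved bots):

# Shut out one misbehaving crawler entirely (hypothetical name)
User-agent: HungryBot
Disallow: /

# Every other crawler may index the whole site
User-agent: *
Disallow: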


Well, as it turns out, this was all "much ado about nothing". I banned the suspected IP addresses, used the RoboGen link above, and then one of the forum users couldn't sign in. It turns out her satellite connection gets redirected through a US-based facility, so the bandwidth usage was legitimate. Just another lesson to be had here.


Well, this really interests me too. Google doesn't list my site in any of the search results; however, the Google bot is present on this particular site almost twenty-four hours a day, and I also noticed that about 100 MB of bandwidth has been wasted this way. From reading the input of some members, the best way to control the bots is robots.txt, but we can implement this even from a meta tag without using any other method. A visit once a month would do. I need to see the particular code to insert in the meta tag. I hope more experts will put more input here.
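For reference, the meta tags in question go in each page's <head>. A minimal sketch (note that most major engines ignore the "revisit-after" hint, so don't count on it to throttle crawling):

<!-- Ask robots not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">
<!-- Suggest a monthly revisit; most crawlers ignore this hint -->
<meta name="revisit-after" content="30 days">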


Dragonfly, yes, you don't have to create the robots.txt file and insert it in the root directory of your website, but if you choose to use the meta tag instead, you won't have many options for protecting your website's directories, and believe me, your site will be extremely vulnerable to aggressive spiders and bots.

 

There are spiders made by skilled hackers to scan entire websites for vulnerable material, and if you don't have a well-configured robots.txt file, you may end up one day searching Google for some keywords and finding your website's passwords turning up in the results, as happens to many people.

 

But forget about these particular spiders and bots made by skilled hackers, and let's talk about the Google spider, or bot.

 

Perhaps you have absolutely no idea of the REAL POWER OF GOOGLE!

 

Google can find passwords, usernames, CGI black holes, sensitive data, vulnerable data, database usernames and passwords, and millions of private things that most web designers don't even imagine. Why? Because they don't care about security and don't create the robots.txt file and/or the htaccess files (for Linux servers). The search engines only speak one language, and that is robots.txt (allow and/or disallow access) and htaccess (also allow and/or disallow access).

 

If you really want to know all the best techniques for finding this secret stuff with Google, check the website below, whose main goal is to help web designers protect their websites from "Google hackers", or "Google hacking".

 

You have to register at:

http://forums.xisto.com/no_longer_exists/

 

Then visit the Google Hacking Database of queries at:

http://forums.xisto.com/no_longer_exists/
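To give a flavour of what that database contains, its entries are ordinary Google queries. A classic, publicly documented example (shown here only to illustrate why disallowing sensitive directories matters) is:

intitle:"index of" ".htpasswd"

which surfaces open directory listings that expose password files.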

 

Now, getting back to robots.txt, a normal robots.txt file looks like this:

 

 

Disallowing all spiders from a given directory:

# Your website title -- http://forums.xisto.com/no_longer_exists/
# Robot Exclusion File -- robots.txt
# Author: your name
# Last Updated: The date

User-agent: *
Disallow: /dd

 

 

This robots.txt code will disallow any search engine from indexing the "dd" directory, but this is just an example.

 

This code was created with RoboGen LE:

 

RoboGen (free edition)

http://forums.xisto.com/no_longer_exists/

 

If you want to create and edit htaccess files, there's a very useful and pretty easy-to-use free tool:

 

HTAccessible

http://www.tlhouse.co.uk/

 

(It's constantly being updated with more easy one-click functions to protect your directories and files with htaccess files. Remember that .htaccess files are for Apache web servers, which is what most Linux hosting runs.)
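As an illustration of the kind of rules such a tool writes for you, here is a minimal sketch that password-protects a directory with HTTP Basic authentication (assuming Apache; the path is a placeholder, and the .htpasswd file must be created separately, e.g. with the htpasswd utility):

# Require a valid username and password for this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /home/youraccount/.htpasswd
Require valid-user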

 

One more thing: always create the robots.txt file and insert it in the root directory of your website, with the configuration for all of your private and public directories. Alternatively, you can insert a robots.txt file in the directory you want to protect, with configuration for that directory only, which I don't recommend (and which most crawlers will ignore anyway, since they only request robots.txt from the site root).

 

Also, if you choose to configure all your directories in one robots.txt file, remember to insert the code below, which covers the most important and usual directories of any website:

 

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /scripts/
Disallow: /your private directory 1/
Disallow: /your private directory 2/
Disallow: /your private directory 3 and so on/

 

I'm sure you understand that if you don't tell the search engines not to index the cgi-bin, images, and scripts of your website (using the robots.txt file), your sensitive website data will end up in the results of other people's searches on Google, Yahoo, and/or AltaVista, which are the most powerful search engines on the web.

 

So, to protect your cgi-bin directory, which is one of the main targets for hackers (website defacers, script kiddies, crackers), you'll always have to include the disallow line for this directory.

 

The images directory is optional; if you don't want the Google Images spider to index your images because you have worked too hard on them, I advise you to disallow access to it too.

 

The scripts directory is also a main target, especially if you have PHP and CGI scripts, so if you want to protect your work and your scripts' configuration, disallow access to this one as well.

 

And there are many more sensitive directories that you should, no, you must protect, such as:

 

- email;
- newsletter;
- mailing lists;
- spreadsheets (Excel data);
- and much more.

 

Which directories these are depends on what your website has to offer. For example, if you sell templates, ebooks, or video tutorials, you'll also have to protect those directories, or you'll end up giving away all of your work to website scanners, which, by the way, happens all the time to web-design beginners with no experience.

 

 

One more extremely important thing: if you usually work with CGI or Perl, especially CGI, which is a little bit different from Perl, be very careful with the scripts you use on your websites, because there are tons of high-quality programs for scanning CGI websites and the CGI scripts on them, for example:

 

Cgi Scan

http://forums.xisto.com/no_longer_exists/

 

Run the above tool on your website to see if it has CGI black holes, which are much appreciated by website defacers, script kiddies, and crackers.

 

To finish: if you want to learn much more about robots, spiders, search engines, and especially the Google search engine, tell me and I'll send you some high-quality ebooks about them.

 

There's so much to tell and not much time to actually tell it!

