xisto Community
NilsC

Robots Meta Tag: Introduction to bots and spiders crawling the web, Part 2

Robots meta tags: do we need them? What good do they do, and what do they control?

 

The “Robots Meta Tag” is used to control whether a spider or bot may index an HTML page. You can give permission to index your whole site, and the spider will crawl all your pages.

 

This is a great way to control bots and spiders if you don’t have access to the root directory and its robots.txt file.

 

Some search engines (not all) fully obey the “Robots Meta Tag”.

 

What is the format of “Robots Meta Tag” and where do I put it on my site?

 

The “Robots Meta Tag” is placed in the HEAD section of your HTML document, and it’s not case sensitive. The format is simple and easy to understand.

Here is an example of how to set the statement up (case does not matter):

<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<TITLE>...</TITLE>
</HEAD>
<BODY>
...

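The previous statement lets spiders and bots know that this web page is off limits. If the statement is in your index.html file, a compliant spider will neither index that page nor follow its links, so the rest of the site is effectively off limits too unless other pages link into it.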

 

The four options are set in the CONTENT part of the statement. They are:

index

noindex

follow

nofollow

What do the options mean, and how do we use them?

First is index.

This tells the spider/bot that it’s OK to index this page.

 

Second is noindex

A spider/bot that sees this does not index any of the content on this page.

 

Third is follow

This lets the spider/bot know that it’s OK to follow the links found on this page.

 

Last is nofollow

It tells the spider/bot not to follow any of the links on this page.

 

So the combination of the statements tells the spider what it can do. If you use this to control the spider/bot, make sure you don’t ask it to follow links when the only page they lead to has a noindex statement, because then nothing ends up indexed.

What are the combinations, and what do they control?

<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">

This statement tells the spider/bot that it’s OK to index the page and follow all links.

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

Here you tell the spider/bot that it’s not OK to index the page, but it is OK to follow any links on it.

 

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

Here it’s OK to index this page, but it is NOT permitted to follow the links.

 

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

Here you tell the spider/bot to stay away from all content and all links.
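
To make the four combinations concrete, here is a rough Python sketch of how a well-behaved spider might read the tag. This is only an illustration, not how any particular search engine does it: the names `RobotsMetaParser` and `robots_permissions` are made up for this example, and the defaults are left as parameters because a page without the tag is handled however the engine chooses.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the CONTENT options of the first <META NAME="ROBOTS"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = None  # stays None if the page has no robots meta tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so case does not matter.
        if tag != "meta" or self.directives is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives = {
                part.strip().lower() for part in content.split(",") if part.strip()
            }


def robots_permissions(html, default_index=True, default_follow=True):
    """Return (may_index, may_follow) for one page.

    The defaults are assumptions of this sketch; they apply when the tag
    is missing or does not mention the option.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives or set()

    may_index = default_index
    if "noindex" in directives:
        may_index = False
    elif "index" in directives:
        may_index = True

    may_follow = default_follow
    if "nofollow" in directives:
        may_follow = False
    elif "follow" in directives:
        may_follow = True

    return may_index, may_follow


page = '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></HEAD><BODY>'
print(robots_permissions(page))  # (False, True)
```

Passing `default_follow=False` models an engine that does not follow links unless told to.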

 

There are two global statements, ALL and NONE. They go in the CONTENT part like the other options: ALL means the same as INDEX,FOLLOW and NONE means the same as NOINDEX,NOFOLLOW. (Please check this before you use it, because I have never used this setup, so I can’t vouch for how well spiders support it.)

 

<META NAME="ROBOTS" CONTENT="ALL">

and

<META NAME="ROBOTS" CONTENT="NONE">

 

So, depending on how you use the statements in each page, you can control spiders and bots and the access they have. If a spider or bot hits your HTML page and there is no robots.txt or “Robots Meta Tag”, what are the defaults? There are articles that say the predefined default is INDEX and FOLLOW. One search engine that fully obeys the “Robots Meta Tag” is Inktomi, and Inktomi’s default is INDEX and NOFOLLOW. So it’s important to set up the “Robots Meta Tag” correctly to get your web site indexed by all the spiders/bots that you want crawling your web site.

 

This is all for now,

If anyone has more information on the global statements, please add it so this will be complete.

 

Nils

Hey, have you got any suggestions for my forum pages?

 

I have the following variables in hand, so that I can include them in every forum page dynamically using PHP, inside meta tags.

 

1> Title

2> Description (which is usually empty)

 

I had developed a program which did this to my forums:

1> Separated all the words and loaded them into an array.

2> Filtered out the ones like "and, for, the, as ... "

3> Removed special characters and stuff like that.

4> Finally calculated the word density of each word. (Suppose the word "computer" is repeated many times; then it would be given the highest rank.)

5> Then the generated list was sorted in descending order (by density) and the top 20 keywords were included in the meta tags.
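
The five steps above could be sketched roughly like this. It is written in Python rather than the PHP the forum runs on, and it is only a guess at the approach, not the original program: `top_keywords` and `STOP_WORDS` are made-up names, and the stop-word list holds only the four words named in step 2. The sketch counts every word in a single pass instead of rescanning the page once per word.

```python
import re
from collections import Counter

# Hypothetical stop-word list: only the words named in step 2.
STOP_WORDS = {"and", "for", "the", "as"}


def top_keywords(text, limit=20, stop_words=STOP_WORDS):
    """Return up to `limit` (word, density) pairs, highest density first."""
    # Steps 1 and 3: lowercase the text and keep only runs of letters and
    # digits, which splits it into words and drops special characters.
    words = re.findall(r"[a-z0-9]+", text.lower())

    # Steps 2 and 4: skip stop words and count the rest in one pass.
    counts = Counter(word for word in words if word not in stop_words)
    total = sum(counts.values())
    if total == 0:
        return []

    # Step 5: most_common() sorts by count, descending; density is the
    # word's share of all counted words.
    return [(word, count / total) for word, count in counts.most_common(limit)]


print(top_keywords("The computer and the Computer, for a computer: tea!", limit=1))
```

Because the counting is one pass over the page, the cost grows with the page length rather than with the number of distinct words. It does nothing about irrelevant words ranking high; that would need a longer stop-word list or some other relevance measure.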

 

DISADVANTAGES :

 

[+] It put a lot of load on the server, as the entire page had to be scanned several times, once for each word.

[+] For some pages, many useless words got included which did not have any relevance. For example, on a page which gives information about tea, words like hot, drink, like, morning, etc. were ranking higher than the others.

 

Got any other solutions?

It can also be useful to create a robots.txt file to keep out or encourage robots. Simply place a file called robots.txt in your top directory. The format for each entry is as follows:

User-agent: robot
Disallow: file
Disallow: folder

Every time you specify a User-agent, all following rules apply to that User-agent (read: robot) until a new one is specified. You can specify a User-agent of "*" to make the rules apply to all robots. Disallow specifies either a file or a folder that the robot should not visit. In this way, all of your robot data can be centralized.
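
If you want to check how a set of rules will be read, Python's standard `urllib.robotparser` module can parse them for you. The following is just a quick sketch: the robot name `ExampleBot` and the paths are invented for illustration, and real robots may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: one entry for a robot called ExampleBot, one for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private.html
Disallow: /drafts/

User-agent: *
Disallow: /cgi-bin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleBot is kept out of the file and the folder listed in its own entry.
print(parser.can_fetch("ExampleBot", "/private.html"))      # False
print(parser.can_fetch("ExampleBot", "/drafts/post.html"))  # False
print(parser.can_fetch("ExampleBot", "/index.html"))        # True

# Any other robot falls under the "*" entry.
print(parser.can_fetch("OtherBot", "/cgi-bin/form"))        # False
print(parser.can_fetch("OtherBot", "/private.html"))        # True
```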

It is still useful to use meta tags, however, because robots only look for robots.txt in the top directory of a site; a robots.txt file placed in a subdirectory is ignored.

~Viz

wow thx, great tut, i'll be sure to use meta tags in my index page, but it's pretty much useless if you don't have a domain, it's highly unlikely spiders visit subdomains.

It's actually not that unlikely. Some well-known hosts, such as GeoCities and Tripod, use subdomains, so spiders know to search subdomains. If you do a search, you will frequently see some subdomains listed in the results. As a result, if you want your pages protected from spiders, use meta tags. They only add a few bytes of data, so your users will never notice unless they view the source and look for them.
~Viz
