
A New Probable Algorithm For A Search Engine


After reading about Google's PageRank method, I realised that it relies more on the "credibility" of a page than on the information the page actually provides. Sometimes I make a search and get thousands of results. I'm looking for one exact thing which I simply can't find in the top 50 results on Google. I move to the next page... and voila! There it is! Why does the exact thing I searched for have to sit beyond the 50th listing? The PageRank algorithm probably explains it.

OK, suppose the search engine gives us a choice between two methods: in one, credibility is given priority, and in the other, information is given priority. We rank information on the basis of HTML tags. If what I searched for appears in a web page's <title> tag, I think it should get the highest priority. If it appears next in an <h1> or <h2> tag, it gets the next highest priority. Likewise, we weight the information in a page by combining all of these factors together.

Of course, the <title> tag containing something wouldn't by itself mean that the page is the best result. We would also look at how much plain content the page has and how it is structured with subheadings. A subheading could be identified as anything that is emphasised more strongly than the plain text... anyway, I will think on this more.
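A minimal sketch of the tag-weighting idea in this post, assuming made-up weights and Python's standard html.parser; the real tag set and numbers would need tuning:

```
# Minimal sketch of tag-weighted scoring (hypothetical weights, not a real engine).
from html.parser import HTMLParser

# Assumed weights: title outranks h1/h2, which outrank plain body text.
TAG_WEIGHTS = {"title": 10.0, "h1": 5.0, "h2": 3.0, "h3": 2.0}
DEFAULT_WEIGHT = 1.0  # plain content

class TagWeightScorer(HTMLParser):
    """Scores a page for a query term according to where the term appears."""
    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.stack = []      # currently open tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)

    def handle_data(self, data):
        hits = data.lower().count(self.term)
        if hits:
            # Use the strongest weight among the enclosing tags.
            weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                         default=DEFAULT_WEIGHT)
            self.score += hits * weight

def score_page(html, term):
    scorer = TagWeightScorer(term)
    scorer.feed(html)
    return scorer.score

# Example: the term in <title> dominates the same term in body text.
page = "<html><title>free hosting</title><body><p>free hosting tips</p></body></html>"
print(score_page(page, "free hosting"))  # title hit (10.0) + body hit (1.0)
```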

Interesting. But could you explain the variables behind credibility? I'm not trying to doubt you in any way, I'm just curious what mechanics the indexing system would use to rank pages by credibility. I suppose it could be based on the number of other resources that share similar content, but wouldn't that lower the uniqueness factor, which could work against your PageRank and keyword search?

Hi,
I suppose Google already has the credibility technique mastered with its PageRank method. What I want is for my search to reveal the exact information I am looking for, and for that the method described above should hold good.
The search engine database would store the pages (cache them) in a hierarchical format. That is, not just the raw HTML, but a plain structure in which the 'content' data is arranged by tag, so we can quickly search the different tags to find the relatively important data. I don't know whether other search engines already use this method or not.
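A rough illustration of that hierarchical cache; the field names ('title', 'sections', 'heading', 'text') are invented here for the example, and a lookup simply checks the important fields first:

```
# Sketch of the hierarchical cache idea: pages stored as a plain outline rather than raw HTML,
# so a lookup can check the important fields first. Field names are assumptions for illustration.

def find_in_cached_page(cached_page, term):
    """Return where the term appears first, checking fields in order of assumed importance."""
    term = term.lower()
    if term in cached_page["title"].lower():
        return "title"
    for section in cached_page["sections"]:
        if term in section["heading"].lower():
            return "heading"
    for section in cached_page["sections"]:
        if term in section["text"].lower():
            return "body"
    return None

# A page as it might sit in the cache after crawling (hand-written here, not parsed).
cached = {
    "title": "A new ranking idea",
    "sections": [
        {"heading": "Tag-based weighting", "text": "Score terms by the tag they appear in."},
        {"heading": "Spam resistance", "text": "Penalise repeated keywords."},
    ],
}

print(find_in_cached_page(cached, "weighting"))  # -> "heading"
```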

I think the main reason they use the PageRank system instead of one like you are suggesting is that it would be easy for sites to jump to the top of the ranking even if they have no real information about the given search criteria.

For example, you say the title is checked as the most important, H1 as the second most important, and so on. Someone could simply stuff their title bar with a large number of commonly searched phrases and then show up in a search for "free web hosting" even though it is actually a porn site or something. It's similar to sites putting "invisible" text (the same colour as the background) full of common terms just to draw in search engines.

Ranking pages by credibility lets the pages people actually find useful come up as the first results, followed by those the crawler "thinks" have the correct information.

I may have misinterpreted what you meant, but assuming I didn't, this should all be relevant. Your ideas are solid, though; I just think it'd be easier to trick the search engine unless you came up with a plan to stop such activities.

Most search engines, such as Google and Yahoo!, refer more to <meta> tags for information about a site than to the body content. The reason is that meta tags provide a quick summary of what the page contains. Once again, however, the spamming issue exists: I can fill the keyword meta tag with umpteen zillion keywords. This is why good search engines attempt to match the meta data exactly, with no overflow or underflow. After the meta tags are searched, the content may be used to refine the search and validate the meta data.

As far as credibility goes, it cannot be determined from the site itself without overgeneralizing (.govs and .orgs are okay, .coms are not, with the exception of foo.com, bar.com, etc.), so credibility is determined by some insanely complex algorithm, usually relying on user feedback in some way, and then stored in a database alongside the website so that it need not be recalculated with every search.

~Viz
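One possible reading of the "validate meta data against the content" step is sketched below; the scoring rule and the requirement that a meta keyword actually occur in the body are assumptions for illustration, not how Google or Yahoo! actually work:

```
# A rough reading of the "validate meta data against content" idea; the scoring rule
# and the field handling are assumptions made for illustration only.
import re

def validated_meta_keywords(meta_keywords, body_text):
    """Keep only meta keywords that actually occur in the page body."""
    body = body_text.lower()
    return [kw.strip().lower() for kw in meta_keywords.split(",")
            if kw.strip() and kw.strip().lower() in body]

def meta_match_score(query, meta_keywords, body_text):
    """Score = overlap between the query terms and the *validated* meta keywords."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    valid = set()
    for kw in validated_meta_keywords(meta_keywords, body_text):
        valid.update(kw.split())
    return len(query_terms & valid)

# A keyword-stuffed meta tag gains nothing unless the body actually covers those terms.
meta = "free hosting, ranking, cars, lottery, celebrities"
body = "We discuss free hosting and how ranking works."
print(meta_match_score("free hosting ranking", meta, body))  # -> 3
```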

I believe Google won't change its way of ranking web sites, at least not in any big way. If you aren't able to find what you are looking for, then you should change the way you are searching.

For example, my friend complained about searching for something on Google for half an hour and not finding it in the end. Then I tried, and found it in about five minutes. So you see, it's all about choosing the right words to search for.

As for examining the content of the page more closely, we all know why that is not good. Google might be smart, but I don't think it's smart enough to recognise false keywords. By false keywords I mean words at the bottom of a page, in the same colour as the background, put there only to increase the rank at some search engine. That's why Google doesn't pay much attention to page content.

On the other hand, the link counting is perfect, since it is not really possible for a single person to own hundreds of good web sites that all link to his other site.
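For reference, the link-counting idea boils down to something like a PageRank power iteration; the damping factor and the toy link graph below are illustrative assumptions:

```
# Minimal sketch of link counting as a simplified PageRank power iteration.

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns a score per page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:          # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: several independent pages link to "good", one self-promoter links to "spam".
graph = {
    "a": ["good"], "b": ["good"], "c": ["good"],
    "me": ["spam"],
    "good": [], "spam": [],
}
print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))  # "good" outranks "spam"
```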

See, what you are overlooking is the fact that the "content", the data itself, is being ignored here. I only gave a brief scrap of what came to my mind; if I got serious with this technique, I would make a more complex implementation. To give a small explanation:

I would create a tree structure for each single page. When I say I will give more importance to the title tag, it means the title will be the root of the tree. The H1 tags (or, to be precise, any bold heading markup that shows up before plain text) become nodes, and the content they discuss becomes children of those nodes.

To simplify lookup, the content is broken up into keywords that follow a proper construct (the way MS Word's grammar check works). These keywords are entered into an index table (just for that particular page, and within the specific subnode) together with their occurrence frequencies. Since only keywords which follow a proper construct are indexed, this stops spammers from writing the same keyword over and over again.

After that I introduce a diversity factor. In the previous case a spammer could still rewrite a sentence with the same keywords many times over. To counter that, the diversity factor is calculated as a function of the words in a sentence construct. It also counts non-keywords (is, that, the, them, their, etc.), so a unique paragraph with meaningful text gets properly credited. This, along with the frequency table, makes up the index table.

The index table is then generated for the whole page and attached to the tree structure. Such a tree structure is generated for each and every page that is submitted, and in the end these trees become part of a giant tree called the webspace. A page-tree enters the webspace by being stored categorically. Categories are created on the basis of keywords, and a page-tree can of course belong to several keywords, but it is linked with weighted nodes, where the weight of a node tells how prominent that keyword is in the page-tree.

Remember that the keyword weight is a function of where the keyword appears in the page, plus its frequency, plus the diversity factor. It could all become a complex mathematical equation if I sat down to work on it seriously. But the point is... in a world dominated by Google, it's impossible to outperform it. Look at Accoona: a really fine search engine with little future.
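A small sketch of the page-tree and diversity-factor scheme described above; the exact weighting function, the stop-word list, and the node layout are assumptions made for illustration:

```
# Sketch of the page-tree with per-node index tables and a diversity factor.
# The weighting formula and stop-word list are illustrative assumptions.
import re
from collections import Counter

STOPWORDS = {"is", "that", "the", "them", "their", "a", "an", "and", "of", "to"}

def diversity_factor(sentences):
    """Fraction of distinct sentences: repeating one sentence over and over scores low."""
    sentences = [s.strip().lower() for s in sentences if s.strip()]
    if not sentences:
        return 0.0
    return len(set(sentences)) / len(sentences)

def index_section(heading, text, location_weight):
    """Build the per-node index table: keyword -> weight(location, frequency, diversity)."""
    sentences = re.split(r"[.!?]", text)
    diversity = diversity_factor(sentences)
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    return {
        "heading": heading,
        "index": {w: location_weight * f * diversity for w, f in freq.items()},
    }

def build_page_tree(title, sections):
    """Root is the <title>; each heading becomes a child node with its own index table."""
    return {
        "root": title,
        "children": [index_section(h, t, location_weight=2.0) for h, t in sections],
    }

tree = build_page_tree(
    "Search engine ideas",
    [
        ("Ranking", "Weight keywords by the tag they appear in. Combine frequency and diversity."),
        ("Spam", "free hosting. free hosting. free hosting. free hosting."),  # repeated sentence
    ],
)
for child in tree["children"]:
    print(child["heading"], child["index"])  # the repeated-sentence node gets a low diversity factor
```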
