xisto Community

jglw22


  1. The name PageRank, which is the name given to a relevancy score used by Google, is often assumed to refer directly to Web pages. In fact, the name comes from its inventor, Larry Page, who developed the scoring system as a research project in 1995 at Stanford University. PageRank is now renowned for giving unprecedented clarity to search on the Web and is closely associated with Google's monumental success.

The PageRank algorithm recursively calculates a score for a given web page in terms of popularity, indicated by the number of inbound links that direct users to that page. An inbound link is treated as a vote, weighted by the PageRank of its source (hence the recursive nature of the computation) and offset by the density of outbound links from that source. The scores are then scaled logarithmically and mapped onto a range of 0-10. To how many decimal places Google stores the results of a PageRank calculation is unknown. In the main Web search algorithm, PageRank is simply factored into the overall score that is used to generate the ordering. A more transparent use of PageRank is found in the Google Directory. With that system, once a category has been found that matches a given search query, a collection of web pages is brought into the runtime index, and PageRank then acts as an ordering over that corpus.

Here is the formal definition of the PageRank formula: PR = Whatever anyone claiming to be an SEO expert tells you it is before banning you from his forum for disagreeing. In the formula, PR(pi) are the PageRanks to be calculated, M maps a page to the set of pages that have inbound links to it, L maps a page to the total number of outbound links from it, N stands for the size of the corpus and d stands for a damping factor, which is usually set to 0.85. PageRank is a particularly elegant method, as good values can be approximated in only a few iterations.

If PageRank is considered under the paradigm of Markov theory, whereby probability distributions are independent of previous states, it would be true to say that PageRank gives the probability of arriving at a given page after starting at a random page on the Web and clicking on links at random for a long period of time. This, as well as being a useful property for indicating relevance, is difficult for a Web master to manipulate without discrediting their own site. This is because Google is believed to penalize what are known as link farms, which are Web sites that scheme to artificially inflate PageRank for their customers. Google has further safeguarded the integrity of the algorithm by introducing, in 2005, the 'nofollow' value for the 'rel' attribute of anchor tags. Since the advent of Web 2.0, where the Web is described as a collection of interlinked applications rather than documents, user-generated content has really come into its own. The majority of this content is found in forums and web logs, in which users can post links to other sites. This made the Web incredibly prone to what is known as 'link spamming', where automated bots spider their way through the Web looking for forums and posting many links to their owner's Web page. Web masters can now selectively determine, by use of the 'nofollow' attribute, whether links on their page are to be counted by Google.
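For reference, the recurrence usually published for PageRank, using the variable names defined above (this is the textbook formulation, not anything Google has confirmed about its current implementation), is:

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

And here is a minimal Haskell sketch of the iteration on a made-up four-page Web, just to illustrate how quickly the scores settle; the graph, the iteration count and all the names are my own illustration:

-- outbound links for each page (index = page id); a made-up four-page Web
outLinks :: [[Int]]
outLinks = [[1, 2], [2], [0], [2]]

n :: Int
n = length outLinks

d :: Double
d = 0.85  -- the damping factor mentioned above

-- the set M(p): pages that link *to* page p
inbound :: Int -> [Int]
inbound p = [ q | (q, links) <- zip [0 ..] outLinks, p `elem` links ]

-- one application of the recurrence to the current score vector
step :: [Double] -> [Double]
step pr = [ (1 - d) / fromIntegral n
              + d * sum [ pr !! q / fromIntegral (length (outLinks !! q))
                        | q <- inbound p ]
          | p <- [0 .. n - 1] ]

-- start from a uniform distribution and iterate; a handful of rounds
-- already lands very close to the fixed point
main :: IO ()
main = print (iterate step (replicate n (1 / fromIntegral n)) !! 20)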
  2. Tell me if you think it makes good reading.

The History of Searching the Web

"The ultimate search engine would basically understand everything in the world, and it would always give you the right thing. And we're a long, long way from that." - Larry Page

In this section I will be discussing the history of searching the Web up until the release of Google. This will include the preceding applicable technologies, the reasons why the Web revolutionized search as a whole, and which search engines became popular and for what reason.

Document Retrieval

Gerard Salton is often thought of as the father of modern search technology. His teams at Harvard and Cornell developed Salton's Magic Automatic Retriever of Text (SMART), an information retrieval system, in the 1960s. SMART included important concepts like the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values, and relevancy feedback mechanisms. In this context a term can mean a single word or a phrase.

The vector space model is an algebraic model for representing text documents as vectors of terms. In this manner a document is represented by a vector in a number of dimensions, where each dimension corresponds to a separate term that appears in the document. From this, relevancy ranking can be calculated by seeing how close the cosine of the angle between two document vectors is to one or zero. If it is zero, the vectors are orthogonal and there is no match; if it is one, the documents contain the same terms in the same proportions. (A small sketch at the end of this post illustrates the idea.) Inverse Document Frequency and Term Frequency are weightings used to decide the size of each component of the vectors. These weights are a statistical measure of how important a word is to a document in a corpus (a collection of documents). The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Term discrimination values are a similar idea to TF and IDF, but tailored toward retrieval: this method weights relevance by looking at how dense an area of the vector space is for a certain term. Salton proposed that the more dense a given space is, the less efficient the retrieval process will be. It follows that finding specific and relevant data is difficult, but retrieval is improved by searching in spaces where the term is relevant but its occurrence is sparse. Finally, relevancy feedback is the simple idea that once a query has returned a result the user may give explicit feedback on the result, and this can be stored as a factor to improve searches for the same term.

Gopher Protocol

The first attempts to make the data on the Internet more accessible by providing information on its location were known as Gophers. Gopher is a search and retrieval protocol used over the Internet to find distributed documents. It allowed server-based text files to be hierarchically organized and easily viewed using Gopher applications on remote client computers. These systems were popularized by the development of three tools known as Archie, Veronica and Jughead, which allowed users to search across resources stored in Gopher file hierarchies on a global basis using the simple technique of matching file names with regular expressions.
However, the growth in popularity began to lose steam with the rate at which the World Wide Web grew in the mid 1990s. The Hypertext Transfer Protocol (HTTP) and the Mosaic browser, the first widely popular browser, released in 1993, could deliver functionality above and beyond Gopher and its applications.

As We May Think

"Our ineptitude in getting at the required data is largely caused by the artificiality of the systems of indexing. Having found one item, moreover, one has to emerge from the system and re-enter on a new path. The human mind does not work this way. It operates by association." This was a point made in 1945 by Vannevar Bush, a scientist urging his fellow scientists to work together to help build a body of knowledge for all mankind. This was perhaps the first time the underlying principle of Hypertext was discussed in the limelight of a scientific community. The first time the term Hypertext itself was used was, surprisingly, as early as 1963, by Ted Nelson during his work on the eventually failed Xanadu project.

HTTP and WWW

The World Wide Web was created in 1989 by Sir Tim Berners-Lee, whilst working at CERN in Geneva, Switzerland. The resources that are available on the Web are retrieved by making an HTTP request from a client browser to a server. The HTTP request is made by typing http:// followed by the address of the server. However, as it is impractical for users to remember specific IP addresses, they usually type in a domain name as part of a URL, which is then sent to the Domain Name System (DNS), a distributed Internet database that resolves the domain name to an IP address. The object that is returned is a file containing text written in a markup language that may be parsed by the browser and constructed into a viewable page. Most pages contain Hyperlinks, which can be clicked to instantiate a new HTTP request to any server on the Web. With this interlinking came an implicit indication of relevance that eventually proved to be the driving force behind the development of new search methodologies for finding information on the Internet.

Primitive Search Engines

With the innovation of Hyperlinked documents came a new way to traverse the search space that is the World Wide Web. In June 1993, Matthew Gray introduced the World Wide Web Wanderer. This was a robot that made an HTTP request to a given start point and then performed a breadth-first traversal of domains mined from that point. The process continued indefinitely in this manner, logging which servers were active. By the end of that year, three full-fledged robot-driven search engines had surfaced: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. The first two used a simple linear search to gather URLs and associated header information but did not implement any ranking scheme, whereas the RBSE spider did.

Directories

A directory is essentially a collection of human-organized favorites made available online for people to use. The first of these directories began to crop up in 1994, the most noteworthy of which is Yahoo!. What set Yahoo! apart from other directories is that each entry came with a human-compiled description. Once the popularity of Yahoo! exploded, they began to charge companies for inclusion. Four years later this concept was undermined by the launch of the Open Directory Project (also known as DMOZ).
Its principle was that it was free to add a website and the directory itself could be downloaded for anyone to use.

WebCrawler

Brian Pinkerton of the University of Washington released WebCrawler on April 20th, 1994. It was the first crawler to index entire pages, as well as the first engine to provide full text search. After a couple of corporate takeovers in the last two decades it has become a meta search engine. The principle of a meta search engine is that the quality of results is proportional to the size of the index an engine maintains; therefore, rather than keeping its own index, it queries all the other popular engines and collates the results.

Lycos

Lycos was the next major search development. It was designed at Carnegie Mellon University around July of 1994 and was developed by Michael Mauldin. Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. Lycos' main difference was the sheer size of its catalog. By August 1994, Lycos had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents, more than any other Web search engine at the time. Today Lycos has become a Web portal, which is basically a web page designed to be used as a home page and to serve as an access point to information in a variety of ways.

AltaVista

AltaVista, created by researchers at the Digital Equipment Corporation, debuted in December 1995 and brought many important features to the Web scene. At launch, the service had two innovations which set it ahead of the other search engines: a fast, multi-threaded crawler called Scooter and an efficient back end running on advanced hardware. It also had nearly unlimited bandwidth (for that time) and was the first to allow inbound link checking and natural language queries, and to let users add or delete their own URL within 24 hours. AltaVista also provided numerous search tips and advanced search features.

Overture

Overture was originally released under the name GoTo in 1998 by Bill Gross. Overture is thought to be the pioneer of paid search. Gross is quoted as saying, 'I realized that the true value of the Internet was in its accountability. Performance guarantees had to be the model for paying for media.' The main innovation, later mirrored by the now giant Google, was pay-per-click advertising and sponsored results. While Overture was very successful, two major reasons prevented it from taking Google's market position. Firstly, Bill Gross decided not to grow the Overture brand as a destination in its own right because he feared that would cost him distribution partnerships. When AOL selected Google as an ad partner, in spite of Google's massive brand strength, it was the nail in the coffin for Overture becoming a premiere search ad platform. Secondly, the advertising was nowhere near as well targeted as Google's is today.
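As a footnote to the Document Retrieval section at the top of this post, here is a minimal Haskell sketch of the vector space idea. The toy documents and the raw term-frequency weighting are my own illustration; SMART itself used the more sophisticated TF and IDF style weights described above.

import Data.List (nub)

-- a term-frequency vector for a document over a shared vocabulary
tfVector :: [String] -> [String] -> [Double]
tfVector vocab doc = [ fromIntegral (length (filter (== t) doc)) | t <- vocab ]

-- cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
cosineSim :: [Double] -> [Double] -> Double
cosineSim xs ys = dot xs ys / (norm xs * norm ys)
  where
    dot as bs = sum (zipWith (*) as bs)
    norm as   = sqrt (dot as as)

main :: IO ()
main = do
  let doc1  = words "web search engines rank web pages"
      doc2  = words "search engines crawl the web"
      vocab = nub (doc1 ++ doc2)
  print (cosineSim (tfVector vocab doc1) (tfVector vocab doc2))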
  3. Pmmod

    While learning about infinite sets in functional programming, I remembered a program I'd written some time ago. I was interested in finding an arithmetic function that naturally produced a sudoku pattern. I was reading about Roman irrigation and came across something called the Latin square. It was a four by four grid filled with four different symbols such that no row or column contained the same symbol twice. The formula that generated this pattern was MMOD(5,[0,1,2,3,4]), which is basically modular times tables. Unfortunately, the sudoku pattern is only observed when 5 is the first argument. The program demonstrates the odd patterns found by extracting primes and Hamming numbers from MMOD(n,[0..n-1]). Source is here --> http://forums.xisto.com/no_longer_exists/ Jason
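For anyone curious, here is a minimal Haskell sketch of what I take MMOD to be: a modular multiplication table (the function name and the choice of arguments are my own illustration, not the original program). Dropping 0 from the input list avoids the all-zero row; for a prime modulus p, the table over [1..p-1] is exactly a Latin square, because it is the Cayley table of the multiplicative group mod p.

mmod :: Int -> [Int] -> [[Int]]
mmod n xs = [ [ (i * j) `mod` n | j <- xs ] | i <- xs ]

-- mmod 5 [1 .. 4] gives [[1,2,3,4],[2,4,1,3],[3,1,4,2],[4,3,2,1]],
-- a 4x4 Latin square. Composite moduli break the pattern, because any
-- row whose index shares a factor with n repeats values.
main :: IO ()
main = mapM_ print (mmod 5 [1 .. 4])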
  4. After working in KM for a year and dating a Communication undergraduate I have come to understand the intractably complex problem of understanding the true nature of knowledge transfer (and women). Also, I have recently got into Haskell, which is a remarkably expressive functional programming language in which I have demonstrated this problem. The first part is a fix for the strictness problem with logical conjunction and disjunction. In certain cases you want these to short-circuit on the first argument, to prevent program errors in statements such as if ((p!=NULL)&&(p->data==1)){...}. However, behaviour that depends on argument order like that is unwanted in my program. Contains is defined at the function level. How slick is Haskell! Following that is the assignment to Unknown of two variables in our model: the ability to communicate perfectly and the ability to understand ourselves fully. Finally, we have the function allKnow, which, when passed a list of people and a string of information, will tell you if all these people can know the information in the same way. A fairly useless piece of code, but interesting nonetheless.

data Modal p = Possible | Impossible | Unknown deriving Show

(&&&) :: Modal Bool -> Modal Bool -> Modal Bool
Possible   &&& Possible   = Possible
Impossible &&& _          = Impossible
_          &&& Impossible = Impossible
_          &&& _          = Unknown

(|||) :: Modal Bool -> Modal Bool -> Modal Bool
Impossible ||| Impossible = Impossible
Possible   ||| _          = Possible
_          ||| Possible   = Possible
_          ||| _          = Unknown

not' :: Modal Bool -> Modal Bool
not' Possible   = Impossible
not' Impossible = Possible
not' Unknown    = Unknown

contains :: Eq a => [a] -> a -> Bool
contains = flip elem

perfectcomm :: Modal Bool
perfectcomm = Unknown

knowself :: Modal Bool
knowself = Unknown

allKnow :: Eq a => [a] -> String -> Modal Bool
allKnow _      "" = Possible
allKnow []     k  = Impossible
allKnow (x:[]) k  = knowself
allKnow (x:xs) k  = comm x xs k &&& allKnow xs k
  where comm p ps k = if contains ps p then knowself else perfectcomm
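A quick sanity check in GHCi, with made-up names, assuming the definitions above are loaded:

ghci> allKnow ["alice", "bob"] "the plan"
Unknown
ghci> allKnow ([] :: [String]) "the plan"
Impossible
ghci> allKnow ["alice"] ""
Possible

With a non-empty piece of information and two or more people the answer is always Unknown, since perfect communication and perfect self-knowledge are both Unknown in the model.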
  5. Hi everyone! This is my first post, so be kind! Basically, I'm trying to get a free host together so I am writing some posts. Here's a little summin' summin' about malicious code injection in PHP applications. Basically, this security exploit is one of the oldest tricks in the book and all comes down to the fact that PHP allows execution of both local and remote scripts with the SAME function... dur. Anyway, this is how it works. Imagine you've just employed a young go-getter, straight outta uni, who has found becoming a Jack of all trades a cinch. You place him on web site design duty and after flicking through a PHP manual he is on his way. Thinking it a good idea to keep separate database connection scripts, headers and whatnot, he may have something along the lines of this... include($_GET['page'] . ".php"); This line of PHP code is then used with URLs that pass the page name in the query string. Because the $page variable is not specifically defined, an attacker can insert the location of a malicious file into the URL and have it executed on the target server. The include function then fetches and executes a remote script from the attacker's nosey_bastard domain, which could do all sorts of nasty things, even delete the entire content of the website. You have been warned! JGLW