The History of Searching the Web

"The ultimate search engine would basically understand everything in the world, and it would always give you the right thing. And we're a long, long way from that." - Larry Page

In this section I will discuss the history of searching the Web up until the release of Google. This includes the preceding applicable technologies, the reasons why the Web revolutionized search as a whole, and which search engines became popular and for what reasons.

Document Retrieval

Gerard Salton is often thought of as the father of modern search technology. His teams at Harvard and Cornell developed Salton's Magic Automatic Retriever of Text (SMART), an information retrieval system, in the 1960s. SMART included important concepts such as the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values and relevance feedback mechanisms. In this context a term can mean a single word or a phrase.

The vector space model is an algebraic model for representing text documents as vectors of terms. A document is represented by a vector in a number of dimensions, where each dimension corresponds to a separate term that appears in the document. Relevance ranking can then be calculated from how close the cosine of the angle between two document vectors is to one or zero. If the cosine is zero, the vectors are orthogonal and there is no match; if it is one, every term that appears in one document also appears in the other.

Inverse Document Frequency and Term Frequency are weightings used to decide the lengths of the vectors in the model. These weights are a statistical measure of how important a word is to a document in a corpus (a collection of documents). The importance increases proportionally with the number of times the word appears in the document, but is offset by the frequency of the word across the corpus.

Term discrimination values are a similar method to TF and IDF but are tailored towards retrieval. This method weights relevance by looking at how dense a region of the vector space is for a given term. Salton proposed that the denser a given region is, the less efficient the retrieval process will be. It follows that finding specific, relevant data is difficult in dense regions, and retrieval is improved by searching in regions where the term is relevant but its occurrence is sparse.

Finally, relevance feedback is the simple idea that once a query has returned a result the user may give explicit feedback on that result, and this feedback can be stored as a factor to improve later searches for the same term.
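To make these ideas concrete, the sketch below shows how a small corpus can be turned into TF-IDF weighted term vectors and compared by cosine similarity. It is only an illustration, not the SMART system itself: the tokenization, the particular TF-IDF formulation and the function names are my own assumptions, and SMART experimented with more elaborate weighting schemes.

```python
import math
from collections import Counter

def tf_idf_vectors(corpus):
    """Build a TF-IDF weighted term vector (a dict) for each document.

    Uses a common textbook formulation: tf = count / document length,
    idf = log(N / document frequency). Other weightings are possible.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # treat an empty vector as orthogonal to everything
    return dot / (norm1 * norm2)

if __name__ == "__main__":
    docs = [
        "the quick brown fox",
        "the lazy brown dog",
        "information retrieval with the vector space model",
    ]
    vecs = tf_idf_vectors(docs)
    # Near 1: the documents share most of their weighted terms.
    # 0: the vectors are orthogonal and there is no match.
    print(cosine_similarity(vecs[0], vecs[1]))
```

In this setting, answering a query amounts to building a vector for the query text in the same way and returning documents in decreasing order of their cosine similarity to it.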
Gopher Protocol

One of the first attempts to make the data on the Internet more accessible, by providing information on its location, was the Gopher protocol. Gopher is a search and retrieval protocol used over the Internet to find distributed documents. It allowed server-based text files to be hierarchically organized and easily viewed using Gopher applications on remote client computers. These systems were popularized by the development of three services known as Archie, Veronica and Jughead, which allowed users to search across resources stored in Gopher file hierarchies on a global basis, using the simple technique of matching file names with regular expressions.

However, Gopher's growth in popularity began to lose steam with the rate at which the World Wide Web grew in the mid 1990s. The Hypertext Transfer Protocol (HTTP) and the Mosaic browser, released in 1993 and the first browser to gain widespread popularity, could deliver functionality above and beyond Gopher and its applications.

As We May Think

"Our ineptitude in getting at the required data is largely caused by the artificiality of the systems of indexing. Having found one item, moreover, one has to emerge from the system and re-enter on a new path. The human mind does not work this way. It operates by association."

This was a point made in 1945 by Vannevar Bush, a scientist urging his fellow scientists to work together to help build a body of knowledge for all mankind. It was perhaps the first time the underlying principle of hypertext was discussed in the limelight of a scientific community. The term hypertext itself was first used surprisingly early, in 1963, by Ted Nelson during his work on the eventually failed Xanadu project.

HTTP and WWW

The World Wide Web was created in 1989 by Sir Tim Berners-Lee whilst working at CERN in Geneva, Switzerland. The resources available on the Web are retrieved by making an HTTP request from a client browser to a server. An HTTP request is addressed with a URL beginning http:// followed by the address of the server. As it is impractical for users to remember specific IP addresses, they usually type in a domain name, which is sent to the Domain Name System (DNS), a distributed Internet database that resolves the domain name to an IP address. The object returned is a file containing text written in a markup language that is parsed by the browser and constructed into a viewable page. Most pages contain hyperlinks, which can be clicked to initiate a new HTTP request to any server on the Web. With this interlinking came an implicit indication of relevance that eventually proved to be the driving force behind the development of new search methodologies for finding information on the Internet.

Primitive Search Engines

With the innovation of hyperlinked documents came a new way to traverse the search space that is the World Wide Web. In June 1993, Matthew Gray introduced the World Wide Web Wanderer, a robot that made an HTTP request to a given start point and then performed a breadth first traversal across the domains mined from that page, continuing indefinitely in this manner and logging which servers were active. By the end of that year, three fully fledged robot-driven search engines had surfaced: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. The first two used a simple linear search to gather URLs and associated header information and did not implement any ranking scheme, whereas the RBSE spider did.
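A Wanderer-style robot can be sketched in a few lines of modern code. The example below is an illustrative reconstruction rather than Matthew Gray's original program: it issues an HTTP request to a seed URL, extracts the links from the returned page, and continues breadth first, logging which hosts responded. The seed URL, the page limit and the use of Python's standard library are assumptions made to keep the sketch self-contained.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def wander(seed, max_pages=20):
    """Breadth first traversal from a seed URL, logging hosts that answer."""
    queue = deque([seed])
    seen = {seed}
    active_hosts = set()
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=5) as response:  # the HTTP request
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # the server did not answer; move on
        fetched += 1
        active_hosts.add(urlparse(url).netloc)  # log the active server
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # breadth first: append to the back
    return active_hosts

if __name__ == "__main__":
    print(wander("http://example.com/"))
```

A real crawler would also respect robots.txt and rate limits and persist what it fetches; this sketch only illustrates the traversal and logging described above.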
Directories

A directory is essentially a collection of human-organized favorites made available online for people to use. The first of these directories began to crop up in 1994, the most noteworthy of which is Yahoo!. What set Yahoo! apart from other directories is that each entry came with a human-compiled description. Once the popularity of Yahoo! exploded, they began to charge companies for inclusion. Four years later this concept was undermined by the launch of the Open Directory Project (also known as DMOZ), whose principle was that adding a website was free and that the directory itself could be downloaded for anyone to use.

WebCrawler

Brian Pinkerton of the University of Washington released WebCrawler on April 20th, 1994. It was the first crawler to index entire pages and the first engine to provide full text search. Following a couple of corporate takeovers over the last two decades it has since become a meta search engine. The principle behind a meta search engine is that the quality of results is proportional to the size of the index an engine maintains; therefore, rather than keeping its own index, it queries all the other popular engines and collates their results.

Lycos

Lycos was the next major search development. It was designed at Carnegie Mellon University around July 1994 and was developed by Michael Mauldin. Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. Lycos' main difference was the sheer size of its catalog. By August 1994, Lycos had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents, more than any other Web search engine at the time. Today Lycos has become a Web portal, which is essentially a web page designed to be used as a home page and to serve as an access point to information in a variety of ways.

AltaVista

AltaVista, created by researchers at Digital Equipment Corporation, debuted in December 1995 and brought many important features to the Web scene. At launch the service had two innovations that set it ahead of the other search engines: a fast, multi-threaded crawler called Scooter, and an efficient back-end search system running on advanced hardware. They also had nearly unlimited bandwidth (for that time) and were the first to allow inbound link checking, natural language queries, and users to add or delete their own URLs within 24 hours. AltaVista also provided numerous search tips and advanced search features.

Overture

Overture was originally released under the name GoTo in 1998 by Bill Gross, and is thought to be the pioneer of paid search. Gross is quoted as saying, 'I realized that the true value of the Internet was in its accountability. Performance guarantees had to be the model for paying for media.' The main innovation, later mirrored by the now giant Google, was pay-per-click advertising and sponsored results. While Overture was very successful, two major factors prevented it from taking Google's market position. Firstly, Bill Gross decided not to grow the Overture brand aggressively because he feared doing so would cost him distribution partnerships; when AOL nevertheless selected Google, with its growing brand strength, as its advertising partner, this became the nail in the coffin for Overture becoming a premier search ad platform. Secondly, Overture's advertising was nowhere near as well targeted as Google's is today.