Although classic information retrieval has provided extremely valuable core technology for Web searching, the combined challenges of abundance, redundancy, and misrepresentation have been unprecedented in the history of IR. By 1996, it was clear that relevance-ranking techniques from classic IR were not sufficient for Web searching. Web queries were very short (two to three terms) compared with IR benchmarks (dozens of terms). Short queries, unless they include highly selective keywords, tend to be broad because they do not embed enough information to pinpoint responses. Such broad queries matched thousands to millions of pages, yet sometimes missed the best responses because there was no direct keyword match. The entry pages of Toyota and Honda do not explicitly say that they are Japanese car companies. At one time, the query “Web browser” failed to match Netscape Corporation’s entry page or Microsoft’s Internet Explorer page, yet thousands of pages contained hyperlinks to these sites with the term browser somewhere close to the link.
It was becoming clear that the flat-corpus assumption common in IR failed to exploit the structure of the Web graph. In particular, relevance to a query is not sufficient if responses are abundant. In the arena of academic publications, the number of citations to a paper is an indicator of its prestige. In the fall of 1996, Larry Page and Sergey Brin, Ph.D. students at Stanford University, applied a variant of this idea to a crawl of 60 million pages to assign a prestige score called PageRank. They then built a search system called Backrub. In
1997, Backrub went online as Google. Around the same time, Jon Kleinberg, then a researcher at IBM Research and later a professor of computer science at Cornell University, invented a similar system called
HITS (for Hyperlink-Induced Topic Search). HITS assigned two scores to each node in a hypertext graph: one was a measure of authority, similar to Google’s prestige; the other was a measure of how well the node served as a comprehensive catalog of links to good authorities (a hub).
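The two scoring ideas just described can be sketched as short power iterations over a link graph. The following is a minimal illustrative sketch, not the production algorithms: the toy graph, the damping factor of 0.85, and the fixed iteration counts are all assumptions chosen for demonstration.

```python
def pagerank(graph, d=0.85, iters=50):
    """Power iteration for PageRank; graph maps node -> list of out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u, outs in graph.items():
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

def hits(graph, iters=50):
    """HITS: authorities are cited by good hubs; hubs cite good authorities."""
    nodes = list(graph)
    hub = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # Authority score: sum of hub scores of nodes linking in.
        auth = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5
        auth = {v: a / norm for v, a in auth.items()}
        # Hub score: sum of authority scores of nodes linked to.
        hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5
        hub = {u: h / norm for u, h in hub.items()}
    return hub, auth

links = {  # hypothetical four-page Web graph
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pr = pagerank(links)
hub, auth = hits(links)
```

On this toy graph, page "c" (cited by three pages) emerges as both the top PageRank node and the top HITS authority, while "a" (which links to two good pages) is the top hub, illustrating how the two systems reward different roles in the link structure.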
We will discuss algorithms for analyzing the link structure of hypertext graphs in the later SEO tutorials and courses. The analysis of social networks is quite mature, and so is one special case of social network analysis, called bibliometry, which is concerned with the bibliographic citation graph of academic papers. The initial formulations of these pioneering hyperlink-assisted ranking systems have close cousins in social network analysis and bibliometry, and elegant underpinnings in the linear algebra and graph-clustering literature. PageRank and HITS have led to a flurry of research activity in this area (by now generally known as topic distillation) that continues to this day. The SEO tutorials and courses follow this literature in some detail and show how topic-distillation algorithms are adapting to the idioms of Web authorship and linking styles. Apart from algorithmic research, the tutorials and courses also cover techniques for Web measurement and notable results obtained with them.