Clustering and Classification
Topic directories built with human effort (e.g., Yahoo! or the Open Directory Project) immediately raise a question: Can they be constructed automatically out of an amorphous corpus of Web pages, such as one collected by a crawler? We study one aspect of this problem, called clustering, or unsupervised learning, in the later search engine tutorials and courses. Roughly speaking, a clustering algorithm discovers groups in the set of documents such that documents within a group are more similar than documents across groups.
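To make this concrete, here is a minimal sketch of one simple clustering scheme: documents are represented as term-frequency vectors, and a greedy single-pass algorithm groups together documents whose cosine similarity to an existing cluster exceeds a threshold. The toy documents, terms, and threshold are illustrative assumptions, not data from any real crawl.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering: assign each document to the first
    cluster whose first member is similar enough, else start a new one."""
    clusters = []  # each cluster is a list of (name, vector) pairs
    for name, vec in docs:
        for c in clusters:
            if cosine(vec, c[0][1]) >= threshold:
                c.append((name, vec))
                break
        else:
            clusters.append([(name, vec)])
    return clusters

# Toy corpus: two pages about crawling, one about biology.
docs = [
    ("d1", {"web": 3, "crawler": 2}),
    ("d2", {"web": 2, "crawler": 1, "page": 1}),
    ("d3", {"gene": 4, "protein": 2}),
]
for c in cluster(docs):
    print([name for name, _ in c])  # prints ['d1', 'd2'] then ['d3']
```

Real systems use more robust algorithms (k-means, agglomerative clustering), but the essential behavior is the same: similar documents end up in the same group without any topic labels being supplied.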
Clustering is a classic area of machine learning and pattern recognition. However, a few complications arise in the hypertext domain. A basic problem is that different people do not agree about what they expect a clustering algorithm to output for a given data set. This is partly because they are implicitly using different similarity measures, and it is difficult to guess what their similarity measures are because the number of attributes is so large.
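The dependence on the choice of similarity measure can be seen even on toy data. In the following sketch, two common measures disagree about which document is the nearest neighbor of a query document: cosine similarity, which is sensitive to term frequencies, prefers one document, while Jaccard overlap of term sets prefers another. The vectors are deliberately contrived to make the two measures diverge.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def jaccard(a, b):
    """Jaccard overlap of the term *sets*, ignoring frequencies."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

q  = {"a": 10, "b": 1}   # query document, dominated by term "a"
d1 = {"a": 10}           # shares the dominant term only
d2 = {"a": 1, "b": 1}    # shares both terms, but with low counts

print(cosine(q, d1) > cosine(q, d2))    # True: cosine prefers d1
print(jaccard(q, d2) > jaccard(q, d1))  # True: Jaccard prefers d2
```

With thousands of attributes, such disagreements are pervasive, which is why two people inspecting the same clustering output can reasonably disagree about its quality.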
Hypertext is also rich in features: textual tokens, markup tags, URLs, host names in URLs, substrings in the URLs that could be meaningful words, and host IP addresses, to name a few. How should they contribute to the similarity measure so that we can get good clusterings? We study these and other related problems in the later search engine tutorials and courses.
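As a small illustration of non-textual features, the sketch below pulls the host name and word-like path substrings out of a URL using only the standard library. The URL is a made-up example; a real system would add markup tags, anchor text, and IP-derived features to the same feature vector.

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """Extract simple hypertext features from a URL: the host name and
    lowercase word-like tokens found in the path."""
    parsed = urlparse(url)
    return {
        "host": parsed.hostname,
        "tokens": re.findall(r"[a-z]+", parsed.path.lower()),
    }

features = url_features("http://www.example.com/machine-learning/intro.html")
print(features["host"])    # prints www.example.com
print(features["tokens"])  # prints ['machine', 'learning', 'intro', 'html']
```

Tokens such as "machine" and "learning" recovered from the path can be strong topic signals even before the page body is fetched.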
Once a taxonomy is created, it is necessary to maintain it with example URLs for each topic as the Web changes and grows. Human effort to this end may be greatly assisted by supervised learning, or classification. A classifier is first trained with a corpus of documents that are labeled with topics. At this stage, the classifier analyzes correlations between the labels and other document attributes to form models. Later, the classifier is presented with unlabeled instances and is required to estimate their topics reliably.
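The train-then-predict cycle above can be sketched with a multinomial naive Bayes classifier, one of the standard text classifiers studied later. During training it counts word occurrences per topic; at prediction time it picks the topic maximizing the (log) posterior, with add-one smoothing for unseen words. The three-document corpus and its labels are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Build per-topic word counts and document counts from labeled docs."""
    word_counts = defaultdict(Counter)   # topic -> word -> count
    topic_docs = Counter()               # topic -> number of documents
    vocab = set()
    for words, topic in labeled_docs:
        word_counts[topic].update(words)
        topic_docs[topic] += 1
        vocab.update(words)
    return word_counts, topic_docs, vocab

def classify(words, model):
    """Pick the topic maximizing log P(topic) + sum log P(word | topic),
    with add-one (Laplace) smoothing for words unseen in a topic."""
    word_counts, topic_docs, vocab = model
    total_docs = sum(topic_docs.values())
    best_topic, best_score = None, float("-inf")
    for topic in topic_docs:
        total = sum(word_counts[topic].values())
        score = math.log(topic_docs[topic] / total_docs)
        for w in words:
            score += math.log((word_counts[topic][w] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Toy labeled corpus standing in for the training phase.
corpus = [
    (["stock", "market", "trade"], "finance"),
    (["bank", "stock", "fund"], "finance"),
    (["match", "goal", "team"], "sports"),
]
model = train(corpus)
print(classify(["stock", "fund"], model))  # prints finance
```

Once trained, the model is applied to unlabeled pages exactly as described: the page's words go in, an estimated topic label comes out.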
Like clustering, classification is also a classic operation in machine learning and data mining. Again, the number, variety, and nonuniformity of features make the classification problem interesting in the hypertext domain. We shall study many flavors of classifiers and discuss their strengths and weaknesses.
Although research prototypes abound, clustering and classification software is not as widely used as basic keyword search services. IBM’s Lotus Notes text-processing system and its Intelligent Miner for Text include some state-of-the-art clustering and classification packages. Smartlogic’s Semaphore also provides classification and text mining solutions.
Clustering and classification are at two opposite extremes with regard to the extent of human supervision they need. Real-life applications are somewhere in between, because unlabeled data is easy to collect but labeling data is onerous. In our preliminary discussion above, a classifier trains on labeled instances and is presented with unlabeled test instances only after the training phase is completed. Might it help to have the test instances available while training? In a different setting specific to hypertext, if the labels of documents in the link neighborhood of a test document are known, can that help determine the label of the test document with higher accuracy? We study such issues in the later search engine tutorials and courses.
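The simplest version of the link-neighborhood idea is a majority vote: if some pages linked from a test page already carry labels, guess the most common one. The sketch below uses a made-up link graph and labels; real hypertext classifiers combine such neighborhood evidence with the page's own text, often iteratively, rather than voting alone.

```python
from collections import Counter

def neighbor_vote(page, links, labels, default=None):
    """Return the most common label among the page's labeled neighbors,
    or `default` if no neighbor carries a label."""
    votes = Counter(labels[n] for n in links.get(page, []) if n in labels)
    return votes.most_common(1)[0][0] if votes else default

# Toy link graph: p0 links to three pages, two labeled "sports".
links = {"p0": ["p1", "p2", "p3"]}
labels = {"p1": "sports", "p2": "sports", "p3": "finance"}
print(neighbor_vote("p0", links, labels))  # prints sports
```

Even this crude rule hints at why link structure helps: topical pages tend to link to pages on the same topic, so neighbors supply evidence the page text alone may lack.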