Engineering Large-Scale Crawlers
Engineering Large-Scale Crawlers: Overview
In the previous Crawling Basics: Tutorials and Courses we discussed a basic crawler. Large-scale crawlers that send requests to millions of Web sites and collect hundreds of millions of pages need a great deal of care to achieve high performance. In this Web Crawling Tutorial we will discuss the important performance and reliability considerations for a large-scale crawler. Before we dive into the details, it will help to list the main concerns:
- Because a single page fetch may involve several seconds of network latency, it is essential to fetch many pages (typically hundreds to thousands) at the same time to utilize the network bandwidth available.
- Many simultaneous fetches are possible only if the DNS lookup is streamlined to be highly concurrent, possibly replicated on a few DNS servers.
- Multiprocessing or multithreading provided by the operating system is not the best way to manage the multiple fetches owing to high overheads. The best bet is to explicitly encode the state of a fetch context in a data structure and use asynchronous sockets, which do not block the process/thread using it, but can be polled to check for completion of network transfers.
- Care is needed to extract URLs and eliminate duplicates to reduce redundant fetches and to avoid “spider traps” — hyperlink graphs constructed carelessly or malevolently to keep a crawler trapped in that graph, fetching what can potentially be an infinite set of “fake” URLs.
DNS Caching, Prefetching, and Resolution
Address resolution is a significant bottleneck that needs to be overlapped with other activities of the crawler to maintain throughput. In an ordinary local area network, a DNS server running on a modest PC can perform name mappings for hundreds of workstations. A crawler is much more demanding as it may generate dozens of mapping requests per second. Moreover, many crawlers avoid fetching too many pages from one server, which might overload it; rather, they spread their access over many servers at a time. This lowers the locality of access to the DNS cache. For all these reasons, large-scale crawlers usually include a customized DNS component for better performance. This comprises a custom client for address resolution and possibly a caching server and a prefetching client.
First, the DNS caching server should have a large cache that should be persistent across DNS restarts, but residing largely in memory if possible. A desktop PC with 256 MB of RAM and a disk cache of a few GB will be adequate for a caching DNS, but it may help to have a few (say, two to three) of these. Normally, a DNS cache has to honor an expiration date set on mappings provided by its upstream DNS server or peer. For a crawler, strict adherence to expiration dates is not too important. (However, the DNS server should try to keep its mapping as up to date as possible by remapping the entries in cache during relatively idle time intervals.) Second, many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API—application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator Crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library1 is ideal for use in crawlers.
In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for HTTP data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP (user datagram protocol, a connectionless, packet-based communication protocol that does not guarantee packet delivery) instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.
Multiple Concurrent Fetches
Research-scale crawlers fetch up to hundreds of pages per second. Web-scale crawlers fetch hundreds to thousands of pages per second. Because a single download may take several seconds, crawlers need to open many socket connections to different HTTP servers at the same time. There are two approaches to managing multiple concurrent connections: using multithreading and using nonblocking sockets with event handlers. Since crawling performance is usually limited by network and disk, multi-CPU machines generally do not help much.
After name resolution, each logical thread creates a client socket, connects the socket to the HTTP service on a server, sends the HTTP request header, then reads the socket (by calling recv) until no more characters are available, and finally closes the socket. The simplest programming paradigm is to use blocking system calls, which suspend the client process until the call completes and data is available in user-specified buffers.
This programming paradigm remains unchanged when each logical thread is assigned to a physical thread of control provided by the operating system, for example, through the pthreads multithreading library available on most UNIX systems. When one thread is suspended waiting for a connect, send, or recv to complete, other threads can execute. Threads are not generated dynamically for each request; rather, a fixed number of threads is allocated in advance. These threads use a shared concurrent work-queue to find pages to fetch. Each thread manages its own control state and stack, but shares data areas. Therefore, some implementers prefer to use processes rather than threads so that a disastrous crash of one process does not corrupt the state of other processes.
There are two problems with the concurrent thread/process approach. First, mutual exclusion and concurrent access to data structures exact some performance penalty. Second, as threads/processes complete page fetches and start modifying the document repository and index concurrently, they may lead to a great deal of interleaved, random input-output on disk, which results in slow disk seeks.
The second performance problem may be severe. To choreograph disk access and to transfer URLs and page buffers between the work pool, threads, and the repository writer, the numerous fetching threads/processes must use one of shared memory buffers, interprocess communication, semaphores, locks, or short files. The exclusion and serialization overheads can become serious bottlenecks.
Nonblocking Sockets and Event Handlers
Another approach is to use nonblocking sockets. With nonblocking sockets, a connect, send, or recv call returns immediately without waiting for the network operation to complete. The status of the network operation may be polled separately. In particular, a nonblocking socket provides the select system call, which lets the application suspend and wait until more data can be read from or written to the socket, timing out after a prespecified deadline. select can in fact monitor several sockets at the same time, suspending the calling process until any one of the sockets can be read or written.
Each active socket can be associated with a data structure that maintains the state of the logical thread waiting for some operation to complete on that socket, and callback routines that complete the processing once the fetch is completed. When a select call returns with a socket identifier, the corresponding state record is used to continue processing. The data structure also contains the page in memory as it is being fetched from the network. This is not very expensive in terms of RAM. One thousand concurrent fetches on 10 KB pages would still use only 10 MB of RAM.
Why is using select more efficient? The completion of page fetching threads is serialized, and the code that completes processing the page (scanning for outlinks, saving to disk) is not interrupted by other completions (which may happen but are not detected until we explicitly select again). Consider the pool of freshly discovered URLs. If we used threads or processes, we would need to protect this pool against simultaneous access with some sort of mutual exclusion device. With selects, there is no need for locks and semaphores on this pool. With processes or threads writing to a sequential dump of pages, we need to make sure disk writes are not interleaved. With select, we only append complete pages to the log, again without the fear of interruption.
Link Extraction and Normalization
It is straightforward to search an HTML page for hyperlinks, but URLs extracted from crawled pages must be processed and filtered in a number of ways before throwing them back into the work pool. It is important to clean up and canonicalize URLs so that pages known by different URLs are not fetched multiple times. However, such duplication cannot be eliminated altogether, because the mapping between hostnames and IP addresses is many-to-many, and a “site” is not necessarily the same as a “host.”
A computer can have many IP addresses and many hostnames. The reply to a DNS request includes an IP address and a canonical hostname. For large sites, many IP addresses may be used for load balancing. Content on these hosts will be mirrors, or may even come from the same file system or database. On the other hand, for organizations with few IP addresses and a need to publish many logical sites, virtual hosting or proxy pass may be used2 to map many different sites (hostnames) to a single IP address (but a browser will show different content for the different sites). The best bet is to avoid IP mapping for canonicalization and stick to the canonical hostname provided by the DNS response.
Extracted URLs may be absolute or relative. An example of an absolute URL is http://seouniv.com/p/seo-tutorials-and-courses.html, whereas a relative URL may look like photo.jpg or /~tutorials/. Relative URLs need to be interpreted with reference to an absolute base URL.
Thus, a canonical URL is formed by the following steps:
1. A standard string is used for the protocol (most browsers tolerate HTTP, which should be converted to lowercase).
2. The hostname is canonicalized as mentioned above.
3. An explicit port number is added if necessary.
4. The path is normalized and cleaned up.
Another necessary step is to check whether the server prohibits crawling a normalized URL using the robots.txt mechanism. This file is usually found in the HTTP root directory of the server (such as http://seouniv.com/robots.txt). This file specifies a list of path prefixes that crawlers should not attempt to fetch. The robots.txt file is meant for crawlers only and does not apply to ordinary browsers. This distinction is made based on the User-agent specification that clients send to the HTTP server (but this can be easily spoofed).
Eliminating Already-Visited URLs
Before adding a new URL to the work pool, we must check if it has already been fetched at least once, by invoking the "Is Url Visited?" module. Many sites are quite densely and redundantly linked, and a page is reached via many paths; hence, the "Is Url Visited?" check needs to be very quick. This is usually achieved by computing a hash function on the URL.
For compactness and uniform size, canonical URLs are usually hashed using a hash function such as MD5. (The MD5 algorithm takes as input a message of arbitrary length and produces as output a 128-bit fingerprint or message digest of the input. It is conjectured that it is computationally hard to produce two messages having the same message digest, or to produce any message having a prespecified message digest value.) Depending on the number of distinct URLs that must be supported, the MD5 may be collapsed into anything between 32 and 128 bits, and a database of these hash values is maintained. Assuming each URL costs just 8 bytes of hash value (ignoring search structure costs), a billion URLs will still cost 8 GB, a substantial amount of storage that usually cannot fit in main memory.
Storing the set of hash values on disk unfortunately makes the "Is Url Visited?" check slower, but luckily, there is some locality of access on URLs. Some URLs (such as www.netscape.com/) seem to be repeatedly encountered no matter which part of the Web the crawler is traversing. Thanks to relative URLs within sites, there is also some spatiotemporal locality of access: once the crawler starts exploring a site, URLs within the site are frequently checked for a while.
To exploit locality, we cannot hash the whole URL to a single hash value, because a good hash function will map the domain strings uniformly over the range. This will jeopardize the second kind of locality mentioned above, because paths on the same host will be hashed over the range uniformly. This calls for a two-block or two-level hash function. The most significant bits (say, 24 bits) are derived by hashing the hostname plus port only, whereas the lower-order bits (say, 40 bits) are derived by hashing the path. The hash values of URLs on the same host will therefore match in the 24 most significant bits. Therefore, if the concatenated bits are used as a key in a B-tree that is cached at page level, spatiotemporal locality is exploited.
Finally, the qualifying URLs (i.e., those whose hash values are not found in the B-tree) are added to the pending work set on disk, also called the frontier of the crawl. The hash values are also added to the B-tree.
Because there is no editorial control on Web content, careful attention to coding details is needed to render crawlers immune to inadvertent or malicious quirks in sites and pages. Classic lexical scanning and parsing tools are almost useless. I have encountered a page with 68 KB of null characters in the middle of a URL that crashed a lexical analyzer generated by flex.3 Hardly any page follows the HTML standard to a level where a context-free parser like yacc or bison can parse it well. Commercial crawlers need to protect themselves from crashing on ill-formed HTML or misleading sites. HTML scanners have to be custom-built to handle errors in a robust manner, discarding the page summarily if necessary.
Using soft directory links and path remapping features in an HTTP server, it is possible to create an infinitely “deep” Web site, in the sense that there are paths of arbitrary depth (in terms of the number of slashes in the path or the number of characters). CGI (common gateway interface) scripts can be used to generate an infinite number of pages dynamically (e.g., by embedding the current time or a random number). A simple check for URL length (or the number of slashes in the URL) prevents many “infinite site” problems, but even at finite depth, HTTP servers can generate a large number of dummy pages dynamically. The following are real URLs encountered in a recent crawl:
Certain classes of traps can be detected (see the following section), but no automatic technique can be foolproof. The best policy is to prepare regular statistics about the crawl. If a site starts dominating the collection, it can be added to the guard module, which will remove from consideration any URL from that site. Guards may also be used to disable crawling active content such as CGI form queries, or to eliminate URLs whose data types are clearly not textual (e.g., not one of HTML, plain text, PostScript, PDF, or Microsoft Word).
Avoiding Repeated Expansion of Links on Duplicate Pages
It is desirable to avoid fetching a page multiple times under different names (e.g. u1 and u2), not only to reduce redundant storage and processing costs but also to avoid adding a relative outlink ν multiple times to the work pool as u1/ν and u2/ν. Even if u1 and u2 have been fetched already, we should control the damage at least at this point. Otherwise there could be quite a bit of redundancy in the crawl, or worse, the crawler could succumb to the kind of spider traps illustrated in the previous Crawling Basics: Tutorials and Courses.
Duplicate detection is essential for Web crawlers owing to the practice of mirroring Web pages and sites—that is, copying them to a different host to speed up access to a remote user community. If u1 and u2 are exact duplicates, this can be detected easily. When the page contents are stored, a digest (e.g., MD5) is also stored in an index. When a page is crawled, its digest is checked against the index (shown as "Is Page Known?" in the figure of Typical Anatomy of a Large-Scale Crawler at Crawling Basics: Tutorials and Courses.). This can be implemented to cost one seek per test. Another way to catch such duplicates is to take the contents of pages u1 and u2, hash them to h(u1) and h(u2), and represent the relative link ν as tuples (h(u1), ν) and (h(u2), ν). If u1 and u2 are aliases, the two outlink representations will be the same, and we can avoid the "Is Page Known?" implementation.
Detecting exact duplicates this way is not always enough, because mirrors may have minor syntactic differences, for example, the date of update, or the name and email of the site administrator may be embedded in the page. Unfortunately, even a single altered character will completely change the digest. Shingling, a more complex and robust way to detect near duplicates, is also useful for eliminating annoying duplicates from search engine responses.
Load Monitor and Manager
Network requests are orchestrated by the load monitor and thread manager shown
in the figure of Typical Anatomy of a Large-Scale Crawler at Crawling Basics: Tutorials and Courses. The load monitor keeps track of various system statistics:
- Recent performance of the wide area network (WAN) connection, say, latency and bandwidth estimates. Large crawlers may need WAN connections from multiple Internet service providers (ISPs); in such cases their performance parameters are individually monitored.
- An operator-provided or estimated maximum number of open sockets that the crawler should not exceed.
- The current number of active sockets.
The load manager uses these statistics to choose units of work from the pending work pool or frontier, schedule the issue of network resources, and distribute these requests over multiple ISPs if appropriate.
Many commercial HTTP servers safeguard against denial of service (DoS) attacks. DoS attackers swamp the target server with frequent requests that prevent it from serving requests from bona fide clients. A common first line of defense is to limit the speed or frequency of responses to any fixed client IP address (to, say, at most three pages per second). Servers that have to execute code in response to requests (e.g. , search engines) are even more sensitive; frequent requests from one IP address are in fact actively penalized.
As an HTTP client, a crawler needs to avoid such situations, not only for high performance but also to avoid legal action. Well-written crawlers limit the number of active requests to a given server IP address at any time. This is done by maintaining a queue of requests for each server (see the figure of Typical Anatomy of a Large-Scale Crawler at Crawling Basics: Tutorials and Courses). Requests are removed from the head of the queue, and network activity is initiated at a specified maximum rate. This technique also reduces the exposure to spider traps: no matter how large or deep a site is made to appear, the crawler fetches pages from it at some maximum rate and distributes its attention relatively evenly between a large number of sites.
From version 1.1 onward, HTTP has defined a mechanism for opening one connection to a server and keeping it open for several requests and responses in succession. Per-server host queues are usually equipped with HTTP version 1.1 persistent socket capability. This reduces overheads of DNS access and HTTP connection setup. On the other hand, to be polite to servers (and also because servers protect themselves by closing the connection after some maximum number of transfers), the crawler must move from server to server often. This tension between access locality and politeness (or protection against traps) is inherent in designing crawling policies.
The crawler’s role usually ends with dumping the contents of the pages it fetches into a repository. The repository can then be used by a variety of systems and services which may, for instance, build a keyword index on the documents, classify the documents into a topic directory like Yahoo! topic directory, or construct a hyperlink graph to perform link-based ranking and social network analysis. Some of these functions can be initiated within the crawler itself without the need for preserving the page contents, but implementers often prefer to decouple the crawler from these other functions for efficiency and reliability, provided there is enough storage space for the pages. Sometimes page contents need to be stored to be able to provide, along with responses, short blurbs from the matched pages that contain the query terms.
Page-related information is stored in two parts: metadata and page contents. The metadata includes fields like content type, last modified date, content length, HTTP status code, and so on. The metadata is relational in nature but is usually managed by custom software rather than a relational database. Conventional relational databases pay some overheads to support concurrent updates, transactions, and recovery. These features are not needed for a text index, which is usually managed by bulk updates with permissible downtime.
HTML page contents are usually stored compressed using, for example, the popular compression library zlib. Since the typical text or HTML Web page is 10 KB long 4 and compresses down to 2 to 4 KB, using one file per crawled page is ruled out by file block fragmentation (most file systems have a 4 to 8 KB file block size). Consequently, page storage is usually relegated to a custom storage manager that provides simple access methods for the crawler to add pages and for programs that run subsequently (e.g., the indexer) to retrieve documents.
For small-scale systems where the repository is expected to fit within the disks of a single machine, one may use the popular public domain storage manager Berkeley DB, which manages disk-based databases within a single file. Berkeley DB provides several access methods. If pages need to be accessed using a key such as their URLs, the database can be configured as a hash table or a B-tree, but updates will involve expensive disk seeks, and a fragmentation loss between 15% and 25% will accrue. If subsequent page processors can handle pages in any order, which is the case with search engine indexing, the database can be configured as a sequential log of page records. The crawler only appends to this log, which involves no seek and negligible space overhead. It is also possible to first concatenate several pages and then compress them for a better compression factor.
For larger systems, the repository may be distributed over a number of storage servers connected to the crawler through a fast local network (such as gigabit Ethernet). The crawler may hash each URL to a specific storage server and send it the URL and the page contents. The storage server simply appends it to its own sequential repository, which may even be a tape drive, for archival. High-end tapes can transfer over 40 GB per hour,5 which is about 10 million pages per hour, or about 200 hours for the whole Web (about 2 billion pages) at the time of writing. This is comparable to the time it takes today for the large Web search companies to crawl a substantial portion of the Web. Obviously, to complete the crawl in as much time requires the aggregate network bandwidth to the crawler to match the 40 GB per hour number, which is about 100 Mb per second, which amounts to about two T3-grade leased lines.
Refreshing Crawled Pages
Ideally, a search engine’s index should be fresh — that is, it should reflect the most recent version of all documents crawled. Because there is no general mechanism of updates and notifications, the ideal cannot be attained in practice. In fact, a Web-scale crawler never “completes” its job; it is simply stopped when it has collected “enough” pages. Most large search engines then index the collected pages and start a fresh crawl. Depending on the bandwidth available, a round of crawling may run up to a few weeks. Many crawled pages do not change during a round — or ever, for that matter — but some sites may change many times.
How to use the HTTP protocol to check if a page changed since a specified time and, if so, to fetch the page contents. Otherwise the server sends a “not modified” response code and does not send the page. For a browser this may be useful, but for a crawler it is not as helpful, because, as I have noted, resolving the server address and connecting a TCP socket to the server already take a large chunk of crawling time.
When a new crawling round starts, it would clearly be ideal to know which pages have changed since the last crawl and refetch only those pages. This is possible in a very small number of cases, using the Expires HTTP response header. For each page that did not come with an expiration date, we have to guess if revisiting that page will yield a modified version. If the crawler had access to some sort of score reflecting the probability that each page has been modified, it could simply fetch URLs in decreasing order of that score. Even a crawler that runs continuously would benefit from an estimate of the expiration date of each page that has been crawled.
We can build such an estimate by assuming that the recent update rate will remain valid for the next crawling round—that is, that the recent past predicts the future. If the average interval at which the crawler checks for changes is smaller than the intermodification times of a page, we can build a reasonable estimate of the time to the next modification. The estimate could be way off, however, if the page is changed more frequently than the poll rate: we might have no idea how many versions successive crawls have missed. Another issue is that in an expanding Web, more pages appear young as time proceeds. These issues are discussed by Brewington and Cybenko, who also provide algorithms for maintaining a crawl in which most pages are fresher than a specified epoch. Cho has also designed incremental crawlers based on the same basic principle.
Most search engines cannot afford to wait for a full new round of crawling to update their indices. Between every two complete crawling rounds, they run a crawler at a smaller scale to monitor fast-changing sites, especially related to current news, weather, and the like, so that results from this index can be patched into the master index.