HTML and HTTP Basics
Web pages are written in a tag-based markup language called the Hypertext Markup Language (HTML). HTML lets the author specify layout and typeface, embed images and diagrams, and create hyperlinks. A hyperlink is expressed as an anchor tag with an href attribute, which names another page using a uniform resource locator (URL), like this:
<a href="http://seouniv.com/">SEO University</a>
In its simplest form, the target URL contains a protocol field (HTTP), a server hostname (seouniv.com), and a file path (/, the “root” of the published file system).
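The three parts of the example URL can be pulled apart with Python's standard urllib.parse module; this is only an illustration of the URL anatomy just described, not how any particular browser does it.

```python
from urllib.parse import urlparse

# Split the example URL into the three parts named above:
# the protocol field (scheme), the server hostname, and the file path.
parts = urlparse("http://seouniv.com/")
print(parts.scheme)   # protocol field: "http"
print(parts.netloc)   # server hostname: "seouniv.com"
print(parts.path)     # file path: "/", the root of the published file system
```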
A Web browser such as Chrome, Firefox, Internet Explorer, Opera, or Safari lets the reader click on the hyperlink. The browser transparently translates the click into a network request that fetches the target page using HTTP.
The primary purpose of a web browser is to bring information resources to the user ("retrieval" or "fetching"), allow the user to view the information ("display", "rendering"), and then access other information ("navigation", "following links").
This process begins when the user inputs a URL, for example http://seouniv.com/, into the browser. The prefix of the URL, called the URI scheme, determines how the URL will be interpreted. The most commonly used scheme is http:, which identifies a resource to be retrieved over the Hypertext Transfer Protocol (HTTP). Many browsers also support a variety of other prefixes, such as https: for the Hypertext Transfer Protocol Secure (HTTPS), ftp: for the File Transfer Protocol (FTP), and file: for local files. Prefixes that the web browser cannot directly handle are often handed off to another application entirely. For example, mailto: URIs are usually passed to the user's default e-mail application, and news: URIs are passed to the user's default newsgroup reader.
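The dispatch on the URL prefix can be sketched as a small function; the mapping below is a hypothetical simplification of the behavior just described, keyed on the scheme that urlparse extracts.

```python
from urllib.parse import urlparse

def handler_for(url):
    """Pick how a URL would be treated, from its scheme prefix (a sketch)."""
    scheme = urlparse(url).scheme
    if scheme in ("http", "https", "ftp", "file"):
        return "browser"           # fetched and rendered by the browser itself
    if scheme == "mailto":
        return "mail client"       # handed off to the default e-mail application
    if scheme == "news":
        return "newsgroup reader"  # handed off to the default newsgroup reader
    return "unknown"
```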
In the case of http, https, file, and others, once the resource has been retrieved the web browser will display it. HTML and associated content (image files, formatting information such as CSS, etc.) are passed to the browser's layout engine to be transformed from markup into an interactive document, a process known as "rendering". Aside from HTML, web browsers can generally display any kind of content that can be part of a web page. Most browsers can display images, audio, video, and XML files, and often have plug-ins to support Flash applications and Java applets. Upon encountering a file of an unsupported type, or a file that is set up to be downloaded rather than displayed, the browser prompts the user to save the file to disk.
Information resources may contain hyperlinks to other information resources. Each link contains the URI of a resource to go to. When a link is clicked, the browser navigates to the resource indicated by the link's target URI, and the process of bringing content to the user begins again.
A browser will fetch and display a Web page given a complete URL like the one above. To reveal the underlying network protocol, we can use the telnet command available on UNIX and Linux machines. First, the Web browser has to resolve the server hostname seouniv.com to an Internet address of the form 188.8.131.52 (called an IP address, IP standing for Internet Protocol) so that it can contact the server using TCP. The mapping from name to address is done using the Domain Name System (DNS), a distributed database of name-to-IP mappings maintained at known servers. Next, the client connects to port 80, the default HTTP port, on the server. The ends of the request and response headers are indicated by the sequence CR-LF-CR-LF (a double line break, written in C/C++ code as "\r\n\r\n" and appearing as the blank lines in the transcript).
$ telnet seouniv.com 80
Connected to seouniv.com.
Escape character is '^]'.
Connection closed by foreign host.
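The same steps the telnet session performs by hand, DNS lookup, TCP connection to port 80, and an HTTP request terminated by CR-LF-CR-LF, can be sketched with Python's socket module. This is a simplification: real browsers send many more headers and handle redirects, caching, and persistent connections.

```python
import socket

def build_request(host, path="/"):
    """Form a minimal HTTP/1.0 request; the headers end with CR-LF-CR-LF."""
    return "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)

def fetch(host, path="/", port=80):
    """Fetch a page the way the telnet session does (a sketch)."""
    ip = socket.gethostbyname(host)                 # DNS: name -> IP address
    with socket.create_connection((ip, port), timeout=10) as sock:  # TCP, port 80
        sock.sendall(build_request(host, path).encode("ascii"))
        response = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:                           # server closed the connection
                break
            response += chunk
    # The response headers also end with a blank line (CR-LF-CR-LF).
    headers, _, body = response.partition(b"\r\n\r\n")
    return headers, body
```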
Browsing is a useful but restrictive means of finding information. Given a page with many links to follow, it would be tedious and painstaking to explore them all in search of a specific piece of information. A better option is to index all the text so that information needs may be satisfied by keyword searches (as in library catalogs). To perform indexing, we need to fetch all the pages to be indexed using a crawler.
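A crawler is essentially a breadth-first traversal of the link graph: start from a seed URL, fetch the page, extract its links, and enqueue any URL not yet seen. A minimal sketch, with the page-fetching function passed in as a parameter (a simplifying assumption that lets the crawler run against any source of HTML, not just the live Web):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class _Links(HTMLParser):
    """Collect the href targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs += [v for k, v in attrs if k == "href"]

def crawl(seed, fetch, limit=100):
    """Breadth-first crawl from a seed URL; `fetch` maps a URL to its HTML.
    Returns {url: html}, the pages to hand to the indexer."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)      # resolve relative links against url
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A real crawler would additionally respect robots.txt, rate-limit requests per server, and persist its frontier; those concerns are omitted here.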