How Does a Search Engine Work?
Search engine technology relies on three different elements:
• Crawling – search engines deploy automated agents, known as spiders or crawlers, which travel from page to page, scanning content and cataloging the images, text, links and metadata from each.
• Indexes – the spiders organize the information into an index that represents the entire universe of pages that can appear in a search engine's result set. The index for a typical general search engine, such as Google or Yahoo, contains information on billions of documents.
• Algorithms – each engine deploys a complex formula, or algorithm, used to determine the relevance of each individual page. When a user conducts a search, the search engine retrieves all pages that match the query and then uses the algorithm to rank the pages in order of relevance. Search engines continually update their algorithms in an attempt to further refine the relevance of their results. A toy end-to-end sketch of these three stages follows this list.
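To make the crawl–index–rank pipeline concrete, here is a minimal sketch of the three stages. It is not any real engine's pipeline: the hard-coded pages dictionary stands in for crawled content, the example.com URLs are placeholders, and the scoring rule (sum of term frequencies) is a deliberately simple stand-in for the far richer ranking algorithms described above.

```python
from collections import defaultdict

# Stage 1: "crawled" pages, here just hard-coded text instead of fetched HTML.
pages = {
    "https://example.com/a": "search engines crawl the web and index pages",
    "https://example.com/b": "an index maps words to the pages that contain them",
    "https://example.com/c": "ranking algorithms order matching pages by relevance",
}

# Stage 2: build an inverted index mapping word -> {url: term frequency}.
index = defaultdict(dict)
for url, text in pages.items():
    for word in text.lower().split():
        index[word][url] = index[word].get(url, 0) + 1

# Stage 3: retrieve pages matching the query and rank them with a toy
# relevance score (sum of term frequencies across query words).
def search(query):
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, tf in index.get(word, {}).items():
            scores[url] += tf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("index pages"))
```

Running the sketch prints the matching URLs ordered by score, which is the same retrieve-then-rank pattern a production engine follows, just with vastly more documents and signals.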
• Internet search engines are web search engines that search for and retrieve information on the web. Most of them use a crawler-indexer architecture and depend on their crawler modules. Crawlers, also referred to as spiders, are small programs that browse the web.
• Crawlers are given an initial set of URLs whose pages they retrieve. They extract the URLs that appear on the crawled pages and give this information to the crawl control module. The crawl control module decides which pages to visit next and gives their URLs back to the crawlers (a minimal crawler loop is sketched after this list).
• The topics covered by different search engines vary according to the algorithms they use. Some search engines are programmed to search sites on a particular topic, while the crawlers in others may visit as many sites as possible.
• The crawl control module may use the link graph of a previous crawl, or may use usage patterns, to guide its crawling strategy.
• The indexer module extracts the words from each page it visits and records the URLs where they occur. It resu
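The following is a minimal sketch of the crawler/crawl-control loop described above. The seed URL, the page limit, the FIFO (breadth-first) scheduling, and the names LinkExtractor and crawl are all assumptions made for illustration; real crawlers add robots.txt handling, politeness delays, deduplication, and much more sophisticated crawl-control strategies.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    frontier = deque([seed_url])      # URLs scheduled by the crawl control step
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()      # crawl control: simple FIFO strategy
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                  # skip pages that fail to download
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        # Hand the extracted URLs back to the frontier for future visits.
        for link in parser.links:
            frontier.append(urljoin(url, link))
    return visited

print(crawl("https://example.com"))
```

In this sketch the frontier plays the role of the crawl control module: it receives the URLs extracted from each crawled page and decides (here, in simple first-in-first-out order) which page the crawler fetches next.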