If you know anything about the internet, you’re probably familiar with what a search engine is and what it’s designed to do. Millions of new websites and webpages are uploaded to the internet every day, and a search engine is essentially a vast index, or library, of these pages. What many people don’t understand, however, is how search engines find these pages in the first place.
Search engines use a process called ‘crawling’, carried out by a program called a ‘search engine crawler’, to discover new pages to add to their index and to update existing ones. But what exactly is a search engine crawler and how does it work?
What Are Search Engine Crawlers?
Search engine crawlers, also commonly referred to as ‘web crawlers’, are programs designed by search engines to browse websites in their entirety and gather all the information needed for the search engine’s index.
Crawlers, often known as ‘bots’ or ‘spiders’, use your website’s robots.txt file, sitemaps and internal links to read all the information on each of your webpages, including content, links, HTML and metadata.
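A sitemap is typically just an XML file listing the URLs you want crawled. As a rough illustration, a minimal sitemap following the standard sitemaps.org protocol (the URLs and dates here are placeholders) might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```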
All of the major search engines use web crawlers to discover new pages and to update existing pages in their index. Google, for instance, uses ‘Googlebot’, and Yahoo! uses ‘Slurp’ to crawl websites.
How Do They Work?
The crawling process begins with the crawler downloading the robots.txt file from your website. This file contains a set of ‘rules’ for the crawler to follow, specifying which pages may be crawled and which may not; pages that robots.txt asks crawlers to stay away from will generally not be indexed by search engines. In addition to telling crawlers which pages they shouldn’t crawl, robots.txt can also point crawlers to the website’s sitemap, which lists the pages of the site that should be crawled and indexed.
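To make this concrete, here is a minimal example of what a robots.txt file might contain (the directory path and sitemap URL are placeholders, not recommendations): it lets every crawler fetch the whole site except one directory, and points crawlers at the sitemap.

```
# Rules that apply to all crawlers
User-agent: *
# Ask crawlers not to crawl anything under /private/
Disallow: /private/

# Tell crawlers where to find the sitemap
Sitemap: https://www.example.com/sitemap.xml
```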
How Do They Index Pages?
Crawlers index your pages by first visiting a page and storing a copy of it, along with its URL, in the search engine’s index. The crawler then follows each of the links on that page to discover other pages on the site, copying each of those pages and their URLs into the index in turn. However, if the site’s robots.txt file has told the crawler not to crawl a particular page, the crawler won’t visit or index it.
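To illustrate the crawl-and-index loop just described, here is a small sketch in Python using only the standard library. It’s a toy, not a description of how Googlebot or any real crawler is implemented: it fetches robots.txt, skips disallowed pages, stores a copy of each page and its URL in a dictionary standing in for the index, and follows links to discover more pages on the same site. The start URL is a placeholder.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=10):
    """Breadth-first crawl of a single site, respecting robots.txt."""
    site = urlparse(start_url)
    # Step 1: download and parse the site's robots.txt rules.
    robots = RobotFileParser(f"{site.scheme}://{site.netloc}/robots.txt")
    robots.read()

    index = {}                     # toy "search engine index": URL -> page copy
    queue = deque([start_url])
    seen = {start_url}

    while queue and len(index) < max_pages:
        url = queue.popleft()
        # Step 2: skip any page robots.txt tells us not to crawl.
        if not robots.can_fetch("*", url):
            continue
        try:
            with urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        # Step 3: store a copy of the page and its URL in the index.
        index[url] = html
        # Step 4: follow the page's links to discover more pages on the site.
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == site.netloc and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


if __name__ == "__main__":
    pages = crawl("https://www.example.com/")
    print(f"Indexed {len(pages)} pages")
```

A real crawler would add politeness delays between requests, deduplication across hosts and far more robust error handling, but the skeleton is the same: fetch, store, follow links, repeat.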
What Are They Crawling For?
Websites are crawled for two purposes: to discover new pages for the index, and to update the information the search engine already holds about existing pages. Crawlers visit pages and extract everything on them, including links, in order to discover new pages, while pages already known to the search engine continue to be crawled systematically to check for any updates.
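One common way a crawler can check a known page for updates without re-downloading it is an HTTP conditional request: the crawler sends the Last-Modified date it saved on a previous visit, and the server replies with 304 Not Modified if nothing has changed. The sketch below shows that general HTTP mechanism in Python; it is not a claim about any particular search engine’s internals.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def recrawl_if_modified(url, last_modified):
    """Re-fetch a known page only if it changed since the last crawl.

    `last_modified` is the Last-Modified header value saved previously,
    e.g. "Mon, 15 Jan 2024 10:00:00 GMT".
    """
    request = Request(url, headers={"If-Modified-Since": last_modified})
    try:
        with urlopen(request) as response:
            # 200 OK: the page has changed, so update the stored copy.
            return response.read()
    except HTTPError as err:
        if err.code == 304:
            # 304 Not Modified: the stored copy is still current.
            return None
        raise
```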
Webpages which are updated regularly tend to be crawled more often by search engines, whereas pages which rarely change will be crawled less frequently.