A web crawler is the part of a search engine that crawls web pages on the internet and gathers the necessary information about each page. It is a type of bot that builds a list of URLs to visit and then visits those pages in turn (a site should provide a robots.txt file to allow or disallow crawling of its pages). On each visit, the crawler stores the necessary information about the site, such as the meta description, title, and keywords.
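As a rough illustration of the "stores necessary information" step, here is a minimal sketch that pulls the title, meta description, and keywords out of an HTML string using Python's standard html.parser module. The names MetadataParser, extract_metadata, and SAMPLE_HTML are made up for this example, and a real crawler would likely store far more than these three fields.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> values."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.metadata = {"title": "", "description": "", "keywords": ""}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.metadata[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.metadata["title"] += data


def extract_metadata(html):
    """Return a dict with the page title, meta description, and keywords."""
    parser = MetadataParser()
    parser.feed(html)
    parser.metadata["title"] = parser.metadata["title"].strip()
    return parser.metadata


# Hypothetical page content, used only for illustration.
SAMPLE_HTML = """
<html><head>
<title>Example Page</title>
<meta name="description" content="A short example page.">
<meta name="keywords" content="crawler, example">
</head><body></body></html>
"""

print(extract_metadata(SAMPLE_HTML))
```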
When the crawler reaches a particular web page, it makes a list of all the links on that page and sorts them into follow links and nofollow links (nofollow links carry a rel attribute, e.g. rel="nofollow"). It then visits the pages that are linked from your page.
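The link-sorting step can be sketched as follows. This is only an illustration built on Python's standard html.parser and urllib.parse modules; LinkParser, classify_links, and the sample URLs are hypothetical names chosen for the example, not part of any particular crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Splits the <a href> links of a page into follow and nofollow lists."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.follow = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        url = urljoin(self.base_url, href)
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(url)
        else:
            self.follow.append(url)


def classify_links(html, base_url):
    """Return (follow_links, nofollow_links) found in the given HTML."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.follow, parser.nofollow


# Hypothetical page content, used only for illustration.
SAMPLE_PAGE = (
    '<a href="/about">About</a> '
    '<a rel="nofollow" href="https://other.example/ad">Ad</a>'
)

follow, nofollow = classify_links(SAMPLE_PAGE, "https://example.com/index.html")
print("follow:", follow)
print("nofollow:", nofollow)
```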
Always remember to avoid circular links that can send a crawler into an infinite loop.
e.g. if Page A contains a link to Page B and Page B also links back to Page A, a crawler that does not keep track of the pages it has already visited will loop between them forever.
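On the crawler side, a common safeguard against this kind of loop is to remember which pages have already been visited. The sketch below assumes a tiny in-memory link graph (LINKS) instead of real web pages, and crawl is a hypothetical helper written for this example.

```python
from collections import deque

# Hypothetical link graph: Page A and Page B link to each other.
LINKS = {"A": ["B"], "B": ["A", "C"], "C": []}


def crawl(start, links):
    """Breadth-first crawl that visits every reachable page exactly once."""
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        if page in visited:
            continue  # already crawled, so the A <-> B cycle stops here
        visited.add(page)
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:
                queue.append(target)
    return order


print(crawl("A", LINKS))
```

Without the visited set, the queue would keep receiving A and B alternately and the loop would never end.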
In focused crawling, the crawler estimates how relevant a page's content is to a query and downloads only the pages that are relevant. Because it concentrates on relevant content only, it is called focused crawling. A practical difficulty is that relevance can often be judged only after a page has been crawled, so a predictor may be used to estimate in advance which pages are likely to be relevant to the query.
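To make the idea concrete, here is a toy sketch of focused crawling. It assumes a very simple relevance measure (the fraction of query words that appear in a text) and uses the anchor text of a link as the predictor for a page that has not been downloaded yet. PAGES, relevance, focused_crawl, and the threshold value are all assumptions made for this example; real focused crawlers typically use trained classifiers instead.

```python
import heapq

# Hypothetical in-memory "web": page text plus anchor text -> target page.
PAGES = {
    "home": {
        "text": "welcome to our site",
        "links": {"python crawler guide": "guide", "cooking recipes": "recipes"},
    },
    "guide": {"text": "a python web crawler tutorial", "links": {}},
    "recipes": {"text": "pasta and soup recipes", "links": {}},
}


def relevance(text, query_terms):
    """Fraction of query terms that occur as words in the text."""
    if not query_terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for term in query_terms if term in words) / len(query_terms)


def focused_crawl(start, query, pages, threshold=0.5):
    """Visit pages in order of predicted relevance, skipping unpromising links."""
    query_terms = query.lower().split()
    frontier = [(-1.0, start)]  # negated score turns heapq into a max-heap
    seen = {start}
    relevant = []
    while frontier:
        _, page = heapq.heappop(frontier)
        if relevance(pages[page]["text"], query_terms) >= threshold:
            relevant.append(page)
        for anchor_text, target in pages[page]["links"].items():
            # Predict relevance from the anchor text before downloading.
            predicted = relevance(anchor_text, query_terms)
            if target not in seen and predicted >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-predicted, target))
    return relevant


print(focused_crawl("home", "python crawler", PAGES))
```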
Crawlers mostly want HTML pages and try to avoid other internet media types. To check, a crawler can make an HTTP HEAD request to determine a resource's media type before fetching the whole resource. To avoid making numerous HEAD requests, a crawler may instead examine the URL and only request a resource if the URL ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp or a slash.
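Both checks can be sketched with Python's standard library. The suffix list below is the one given in the text; looks_like_html and media_type_via_head are hypothetical helper names, and the HEAD helper is defined but not called here because it needs network access.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# URL endings that suggest an HTML page (taken from the list in the text).
HTML_SUFFIXES = (".html", ".htm", ".asp", ".aspx", ".php", ".jsp", "/")


def looks_like_html(url):
    """Cheap check: does the URL path end like an HTML page?"""
    # An empty path (e.g. "https://example.com") is treated as "/".
    path = urlparse(url).path or "/"
    return path.lower().endswith(HTML_SUFFIXES)


def media_type_via_head(url, timeout=10):
    """Ask the server for the media type without downloading the body."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()


for candidate in (
    "https://example.com/",
    "https://example.com/page.php?id=3",
    "https://example.com/img/logo.png",
):
    print(candidate, looks_like_html(candidate))
```

The suffix check is only a heuristic: pages without a recognizable ending (for example /about) would be skipped unless the crawler falls back to a HEAD request for them.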
URL Normalization:
Crawlers usually normalize URLs to avoid visiting the same page more than once. This is also called canonicalization.
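A minimal sketch of normalization, assuming only a few common rules: lowercase the scheme and host, drop the default port, use "/" for an empty path, and remove the fragment. normalize_url is a hypothetical helper, and real crawlers usually apply more rules than these.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of the URL so duplicates compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: any user:password part is dropped
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is removed.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


seen = set()
for raw in (
    "HTTP://Example.COM:80/index.html#top",
    "http://example.com/index.html",
):
    seen.add(normalize_url(raw))
print(seen)
```

Both sample URLs collapse to the same entry, so the page would be fetched only once.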
The following figure gives the overall process of web crawling.
Tips to improve a website for web crawlers:
- Use an XML sitemap so that crawlers know which pages exist and need less time to crawl the site
- Use a robots.txt file to control which parts of the site web crawlers are allowed or disallowed to crawl (it is advisory for well-behaved crawlers, not a security mechanism)
- Link the sitemap in the robots.txt file so that crawlers reading it can move on to your sitemap
- Use a clear internal linking structure
- Update the sitemap regularly
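To show how the robots.txt and sitemap tips fit together from the crawler's point of view, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the "ExampleBot" user agent are made up for illustration; site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one directory and links the sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""


def read_rules(robots_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = read_rules(ROBOTS_TXT)
print(rules.can_fetch("ExampleBot", "https://example.com/blog/"))
print(rules.can_fetch("ExampleBot", "https://example.com/admin/users"))
print(rules.site_maps())
```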