Fabrice Canel of the Live Search Crawling Team announced on Tuesday, February 12, 2008, significant updates to MSNBot, the Live Search crawler. The updates improve the efficiency with which it crawls and indexes websites. The two biggest changes involve HTTP compression and conditional GET.
We support conditional get as defined by RFC 2616 (Section 14.25), generally we will not download the page unless it has changed since the last time we crawled it. As per the standard, our crawler will include the “If-Modified-Since” header & time of last download in the GET request and when available, our crawler will include the “If-None-Match” header and the ETag value in the GET request. If the content hasn’t changed the web server will respond with a 304 HTTP response.
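To make the exchange described above concrete, here is a minimal sketch of both sides of a conditional GET. It is not MSNBot's code. The function names `build_conditional_headers` and `respond` are hypothetical, the ETag check is a plain string comparison, and letting the ETag take precedence over the date follows the later RFC 7232 wording rather than anything stated in the announcement.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime


def build_conditional_headers(last_download, etag=None):
    """Client side: headers a crawler could send when revisiting a page.

    last_download must be a timezone-aware datetime (hypothetical helper).
    """
    headers = {
        # HTTP dates are always expressed in GMT.
        "If-Modified-Since": formatdate(last_download.timestamp(), usegmt=True)
    }
    if etag is not None:
        headers["If-None-Match"] = etag
    return headers


def respond(headers, current_etag, last_modified):
    """Server side: return 304 if the page is unchanged, otherwise 200.

    Simplification: when an ETag is supplied it decides the outcome alone.
    """
    supplied_etag = headers.get("If-None-Match")
    if supplied_etag is not None:
        return 304 if supplied_etag == current_etag else 200

    since = headers.get("If-Modified-Since")
    if since is not None:
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= parsedate_to_datetime(since):
            return 304
    return 200


if __name__ == "__main__":
    last_crawl = datetime(2008, 2, 12, tzinfo=timezone.utc)
    request_headers = build_conditional_headers(last_crawl, etag='"v1"')
    print(request_headers)
    print(respond(request_headers, '"v1"', last_crawl))
```

A 304 response carries no body, which is where the bandwidth saving for both the crawler and the site comes from.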
They also updated their user agent.
In addition to these two features there are many more improvements in performance that should help further optimize our crawling. As a result, we’ve also upgraded our user agent to reflect the changes, it is now “msnbot/1.1”. If you think you are experiencing any issues with MSNbot, or have any questions about the updates, please use our Crawler Feedback & Discussion form.
When Nathan Buggia, Lead Program Manager at Live Search Webmaster Center, first brought this news to my attention, one of the first questions I asked him was about the respect (or disrespect) of robots.txt. Some people have the impression that MSNBot doesn’t respect robots.txt, because they often see their own content in the Live Search index even though they’ve specifically requested that it not be crawled. Nathan replied with:
We do read and respect the robots.txt file, however, if there is a link on a 3rd party site, that points to a page blocked by the REP on your site, we may still put that link (& associated anchor text) into our index. And we may surface that link (and anchor text) in our search results if it appears to be relevant, but we still won’t go and crawl/index the actual page.
This is something that we spend a lot of time debating about internally, I would love to hear your thoughts on this.
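The crawl decision Nathan describes can be sketched with Python's standard `urllib.robotparser`. This is only an illustration: the robots.txt rules, the `example.com` URLs, and the `may_crawl` helper are hypothetical, and the sketch models just the "may I fetch this page?" check. It says nothing about whether a link found on a third-party site ends up in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block msnbot from /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow:
"""


def may_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("msnbot/1.1", "http://example.com/private/page.html"),
        ("msnbot/1.1", "http://example.com/index.html"),
    ]:
        print(agent, url, may_crawl(agent, url))
```

Even when the check fails for a page, the URL itself can still be discovered through links on other sites, which is the case Nathan's reply is about.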
I thought that was interesting. The fact that Live Search occasionally includes links in its SERPs to pages it doesn’t actually crawl might be the reason for all of the confusion. For me, it’s a foreign concept that a search result would contain links to pages the engine hasn’t even crawled. However, I can see the case for it if many trusted websites keep referencing a resource and the destination URL is judged to be an appropriate search result, regardless of whether the destination content has been crawled or analyzed.