Who Crawls Best, and Why PageRank Matters

Spotted a very interesting crawler experiment on the fantastically-named DrunkMenWorkHere.org:

“a large scale experiment on search engine behaviour was staged with more than two billion different web pages. This experiment lasted exactly one year, until April 13th. In this period the three major search engines requested more than one million pages of the tree, from more than hundred thousand different URLs. The home page of drunkmenworkhere.org grew from 1.6 kB to over 4 MB due to the visit log and the comment spam displayed there.”

Revelations include:

  • the frequency of visiting a page seems to be related to the PageRank of a page
  • Google visited nodes at deeper levels less frequently than their parent nodes
  • Yahoo! Slurp was the first search engine to discover Binary Search Tree 2, and crawled most vigorously early on
  • Over the last six months Googlebot requested pages at a fixed rate
  • msnbot virtually ceased to crawl Binary Search Tree 2 after five months
  • most spam was related to pharmaceutical products – many comment nodes weren’t crawled, possibly as a result.

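Now, I recall remarking in a few forum discussions about crawl issues that many posters were jumping to some pretty wild conclusions. If your deeper pages aren't getting crawled, site structure might well be the cause: in this experiment, crawl frequency tracked PageRank, and Google visited deeper nodes less often than their parents.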

Worth testing ;)
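
If you want to try this on your own site, here is a minimal Python sketch of one way to do it, not the method used in the experiment. It assumes your server writes Apache-style "combined" access logs, that each crawler can be spotted by a substring of its user agent (the substrings in BOTS are my guesses), and that the number of path segments in a URL is a fair stand-in for how deep the page sits in your site. That last assumption is the shakiest: real link depth is the number of clicks from the home page, which may not match the URL.

```python
import re
from collections import Counter, defaultdict

# Assumed user-agent substrings for the three crawlers named above.
# Check them against the agents that actually appear in your own logs.
BOTS = {
    "Googlebot": "googlebot",
    "Yahoo! Slurp": "slurp",
    "msnbot": "msnbot",
}

# Matches the request, status, size, referrer and user-agent fields of a
# "combined" format log line. Other log formats need a different pattern.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def url_depth(path):
    """Count the non-empty path segments, ignoring any query string."""
    path = path.split("?", 1)[0]
    return len([segment for segment in path.split("/") if segment])


def hits_by_depth(lines):
    """Return {crawler name: Counter({depth: number of requests})}."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a request line we understand
        agent = match.group("agent").lower()
        for name, needle in BOTS.items():
            if needle in agent:
                counts[name][url_depth(match.group("path"))] += 1
                break
    return counts
```

Feed it the lines of a log file, for example `hits_by_depth(open("access.log"))` (the file name is just a placeholder), and compare the counts per depth for each crawler. If requests fall away sharply a few levels down, site structure is a likelier suspect than anything more exotic.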
