How to crawl a quarter billion webpages in 40 hours

More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances. I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I…
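To put those headline figures in perspective, here is a rough back-of-the-envelope sketch of the average rates they imply. It uses only the numbers quoted above; the function name `crawl_rates` is made up for illustration, the cost is rounded up to 580 dollars (the post says "just under"), and it assumes the load was spread evenly across the 20 instances, which the real crawl may not have done.

```python
# Figures quoted in the post. The cost is "just under" 580 dollars,
# so 580 is used here as an upper bound.
PAGES = 250_113_669
COST_USD = 580
ELAPSED_SECONDS = 39 * 3600 + 25 * 60  # 39 hours and 25 minutes
INSTANCES = 20


def crawl_rates(pages, seconds, instances, cost_usd):
    """Return average crawl rates, assuming an even split across instances."""
    total_rate = pages / seconds
    return {
        "pages_per_second": total_rate,
        "pages_per_second_per_instance": total_rate / instances,
        "usd_per_million_pages": cost_usd / (pages / 1_000_000),
    }


if __name__ == "__main__":
    rates = crawl_rates(PAGES, ELAPSED_SECONDS, INSTANCES, COST_USD)
    for name, value in rates.items():
        print(f"{name}: {value:.2f}")
```

These are averages over the whole run; they say nothing about peak throughput or how the rate varied over time.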