In this post I describe a simple, single-machine web crawler that I’ve written, and do some simple profiling and benchmarking. In the next post I intend to benchmark it against two popular open source crawlers, Scrapy and Nutch. I’m doing this as part of an attempt to answer a big, broad question: if… Continue reading Benchmarking a simple crawler (working notes)
Month: July 2011
A problem with the standard importance function? Trading off query terms against one another
Working notes ahead! This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you’re interested in data-driven intelligence) – the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the…
Documents as geometric objects: how to rank documents for full-text search
When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine…