July 2011 – DDI

Benchmarking a simple crawler (working notes)

In this post I describe a simple, single-machine web crawler that I’ve written, and report some basic profiling and benchmarking. In the next post I intend to benchmark it against two popular open source crawlers, Scrapy and Nutch. I’m doing this as part of an attempt to answer a big, broad question: if…

A problem with the standard importance function? Trading off query terms against one another

Working notes ahead! This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you’re interested in data-driven intelligence) – the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the…

Documents as geometric objects: how to rank documents for full-text search

When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine…