In a recent post I discussed Judea Pearl’s work on causal inference, and in particular on the causal calculus. The post contained a number of “problems for the author” (i.e., me, Michael Nielsen). Judea Pearl has been kind enough to reply with some enlightening comments on four of those problems, as well as contributing a… Continue reading Guest post: Judea Pearl on correlation, causation and the psychology of Simpson’s paradox
In the last post I described how to use support vector machines (SVMs) to combine multiple notions of search relevance. After posting I realized I could greatly simplify my discussion of an important subject in the final section of that post. What I can simplify is this: once you’ve found the SVM parameters, how should… Continue reading Addendum on search and support vector machines
In earlier posts I’ve described two different ways we can assess how relevant a given webpage is to a search query: (1) the cosine similarity measure; and (2) the PageRank, which is a query-independent measure of the importance of a page. While it’s good that we have multiple insights into what makes a webpage relevant,… Continue reading How to combine multiple notions of relevance in search?
It is a commonplace of scientific discussion that correlation does not imply causation. Business Week recently ran a spoof article pointing out some amusing examples of the dangers of inferring causation from correlation. For example, the article points out that Facebook’s growth has been strongly correlated with the yield on Greek government bonds. Despite… Continue reading If correlation doesn’t imply causation, then what does?
In this post I describe a simple, single-machine web crawler that I’ve written, and do some simple profiling and benchmarking. In the next post I intend to benchmark it against two popular open source crawlers, Scrapy and Nutch. I’m doing this as part of an attempt to answer a big, broad question: if… Continue reading Benchmarking a simple crawler (working notes)
Working notes ahead! This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you’re interested in data-driven intelligence) – the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the… Continue reading A problem with the standard importance function? Trading off query terms against one another
When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine… Continue reading Documents as geometric objects: how to rank documents for full-text search
In this post, I describe a simple but powerful framework for distributed computing called Pregel. Pregel was developed by Google, and is described in a 2010 paper written by seven Googlers. In 2009, the Google Research blog announced that the Pregel system was being used in dozens of applications within Google. Pregel is a framework… Continue reading Pregel
I did a little experiment to see what people discuss on the most popular blogs. I downloaded Technorati’s list of the top 1,000 blogs, and crawled 50,000 pages from those blogs. Then I determined the percentage of pages containing words such as “sex”, “Einstein” and “Gaga”. As a baseline for comparison and sanity check, I… Continue reading Sex, Einstein, and Lady Gaga: what’s discussed on the most popular blogs