Biweekly links for 12/22/2008

Click here for all of my del.icio.us bookmarks.

Published

Biweekly links for 12/19/2008

  • Do Tank and the Democracy Design Workshop
    • “The goal of the Virtual Company Project is to build online tools to help groups create and implement governance rules necessary for successful collaboration. The project is premised on the belief that the right graphical interfaces can translate the structures of the group into clear and intelligible procedures that will enable teams to make decisions, control assets and enter into contractual relationships with third parties. The Virtual Company project is creating the interfaces and designing the back-end functionality that is needed to enable group participants to see themselves in relation to the group as a whole, allocate roles, establish accountability to the group, make collective decisions, and administer group assets, expenditures and distributions.”
  • OnTheCommons.org » Online Collaboration Goes Legit
    • Good discussion of the recently passed Vermont law which allows online-only organizations to file for status as corporations.
  • Eddie A. Tejeda: Google might be doing more than making us stupid
    • Nice framing of the question, even if he ultimately goes off on a tangent: “While Socrates was right that writing made us lose our ability to memorize, he was unable to foresee (or maybe foresaw, but still did not approve) the rise of engaged and literate societies, which have been responsible for many of the great creations of humanity.

      So what type of world are we unable to yet see by high level and data extracting reading that the Internet so easily provides? Well, it’s hard to say, but there is another angle we can approach this to give us an idea.”

  • Benjamin Zander
    • Remarkable. A music lesson from the conductor of the Boston Philharmonic, live on stage.
  • Native Client: Google’s craziest idea yet | Fatal Exception | Neil McAllister | InfoWorld
    • Google’s Native Client uses sandboxing / virtualization to allow desktop(ish) software to run inside the browser. They apparently have a demo with Quake running, which is pretty impressive. Try doing that with Javascript!
  • Paul Allen: Piece of mind | The Economist
    • “The scientists used state-of-the-art technology to dissect a mouse brain, photographed it sliver section by section, then reassembled it in a computer database that would allow easy access. But it was the speed at which the project was accomplished and what they did with it afterwards that changed the game.

      They released it to the public. Over the internet. Free.

      When we first put the mouse-brain atlas online free, it was met by the research world with suspicion. People wondered what the catch was. Scientific research has long been a solitary endeavour—one researcher, one microscope. Findings are protected so that discovery credit can be clearly defined and awarded…

      … Today we have many scientists using the atlas for their research into Alzheimer’s, bipolar disorders, Down’s syndrome, Parkinson’s, fragile x mental retardation and epilepsy. The atlas is also giving scientists insight into alcoholism, obesity, sleep, hearing and memory.”

  • Synthese Recommender Service
    • Recommendation engine for scientific journal articles. The areas covered are too far outside my scientific expertise for me to judge how good the recommendations are.
  • Publish in Wikipedia or perish : Nature News
    • “Wikipedia, meet RNA. Anyone submitting to a section of the journal RNA Biology will, in the future, be required to also submit a Wikipedia page that summarizes the work. The journal will then peer review the page before publishing it in Wikipedia.”
  • Caveat Lector » Proto-librarians and computers
  • OpenNet Initiative
    • Tracks which governments are currently blocking what.

Click here for all of my del.icio.us bookmarks.

Published

Building intuition about PageRank (video)

This is a video accompaniment to my post on building intuition about PageRank:

The video is a re-recording. The first recording, which had a live audience, failed, and so I decided to re-record. I was surprised at how difficult it is to speak to an empty room – I prefer a good discussion, with people asking clarifying questions and so on. In any case, the video for the next post should be with an audience.

The post is part of a series about the Google Technology Stack. The posts are designed to be self-contained, even as they build toward the overall goal of understanding many of the fundamental technologies used at Google, and so this video will hopefully make sense even if you haven’t watched the video for the earlier post. There’s also a FriendFeed room for discussion here.

Published
Categorized as GTS, Video

Biweekly links for 12/15/2008

  • Mark Newman: The structure of scientific collaboration networks
    • “We investigate the structure of scientific collaboration networks. We consider two scientists to be connected if they have authored a paper together, and construct explicit networks of such connections using data drawn from a number of databases, including MEDLINE (biomedical research), the Los Alamos e-Print Archive (physics), and NCSTRL (computer science). We show that these collaboration networks form “small worlds” in which randomly chosen pairs of scientists are typically separated by only a short path of intermediate acquaintances. We further give results for mean and distribution of numbers of collaborators of authors, demonstrate the presence of clustering in the networks, and highlight a number of apparent differences in the patterns of collaboration between the fields studied.”

Click here for all of my del.icio.us bookmarks.

Published

The PageRank distribution for the web

The PageRank for a webpage is a probability between 0 and 1. The general idea is that PageRank quantifies the importance of the page: the bigger the probability the more important the page. I was curious about what the PageRank distribution is across the web as a whole, and so I built a very simple model to investigate the question.

Here are the three assumptions that went into the model; a short code sketch illustrating the model follows the list.

1. I use just 200 webpages. The qualitative results don’t seem to vary much as the number of webpages increases further, and the results get both harder to interpret and harder to generate, so 200 seems like a good size.

2. I assume that the number [tex]L[/tex] of inbound links to any given webpage is governed by a Pareto probability distribution [tex]p(L) \propto 1/L^a[/tex], with [tex]a = 2[/tex]. This assumption is based on a paper by Adamic and Huberman. Note that the data in that paper is a decade old, and more recent data should really be used. (An independently interesting question is how that exponent is changing over time, and why it changes.)

3. I assume that the number of inbound links to each webpage is an independent random variable.
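
For concreteness, here is a minimal sketch of how such a model can be set up. This is not the code used to generate the figures below: it uses numpy’s Zipf sampler as a stand-in for the discrete Pareto form above, and the PageRank formula [tex]q = t(I-sG)^{-1}P[/tex] used elsewhere in this series.

# A sketch of the model described above: 200 pages, with the number of
# inbound links to each page drawn independently from a discrete power law
# p(L) ~ 1/L^2 (approximated here by numpy's Zipf sampler), and the linking
# pages chosen at random.
import random
import numpy as np

def pagerank(G, s=0.85):
  # q = t (I - sG)^{-1} P, with t = 1-s and uniform teleportation vector P
  n = G.shape[0]
  P = np.ones((n, 1)) / n
  return (1 - s) * np.linalg.solve(np.eye(n) - s * G, P)

def random_pareto_web(n=200, a=2.0):
  # number of inbound links for each page, capped at n-1
  inbound = np.minimum(np.random.zipf(a, size=n), n - 1)
  A = np.zeros((n, n))
  for j in range(n):
    # choose which pages link to page j
    linkers = random.sample([k for k in range(n) if k != j], int(inbound[j]))
    A[j, linkers] = 1.0
  # normalize each column by its outdegree; pages with no outgoing links
  # are simply left as zero columns in this sketch
  outdegree = A.sum(axis=0)
  outdegree[outdegree == 0] = 1.0
  return A / outdegree

pr = pagerank(random_pareto_web())
# pr can now be histogrammed, e.g. with matplotlib's plt.hist(pr, 50)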

With these assumptions, the histogram of PageRanks for a typical random web looks like this:

Aggregating over multiple runs gives:

There are a few notable things about these histograms.

First, most pages have PageRank near the minimal possible value of [tex]0.15 * 1/200 = 0.00075[/tex] (see here for an explanation of why that’s the minimal possible value).

Second, the distribution of PageRanks drops off very fast. Because there are 200 webpages, and the PageRanks must sum to one (being probabilities), the “average” PageRank must be [tex]0.005[/tex]. You can see from the histograms that the distribution has already dropped off quite a bit by the time you get to this PageRank: most pages have a PageRank quite a bit below the average. It’s a few very high PageRank pages that restore the average.

Third, the page with the highest PageRank had a PageRank approximately 20 times higher than the average PageRank.

I haven’t done the analysis, but it looks pretty likely that the distribution of PageRanks for this model is itself approximated by a power law distribution. Curious.

This post is part of an ongoing series about the Google Technology Stack, covering technologies such as PageRank, MapReduce, the Google File System, and Bigtable. Posts appear on this blog once a week; there is an associated FriendFeed room for discussion.

Published
Categorized as GTS

Biweekly links for 12/12/2008

Click here for all of my del.icio.us bookmarks.

Published

Lectures on the Google Technology Stack 2: Building our PageRank Intuition

This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on – based on a lecture series I’m giving in 2008 and 2009. If you’d like to know more about the series, please see the course syllabus. There is a FriendFeed room for the series here.

In today’s lecture, we’ll build our intuition about PageRank. We’ll figure out the minimal and maximal possible values for PageRank; study how PageRank works for a web with a random link structure, seeing a sort of “central limit theorem” for PageRank emerge; and find out how spammers can spam PageRank, and how to defeat them.

Building our PageRank Intuition

In the last lecture we defined PageRank, but we haven’t yet developed a deep understanding. In this lecture we’ll build our intuition, working through many basic examples of PageRank in action, and answering many fundamental questions.

The minimal and maximal values for PageRank

Recall from last lecture that the PageRank vector [tex]q[/tex] is an [tex]N[/tex]-dimensional probability distribution, with the probability [tex]q_j[/tex] measuring the importance of webpage [tex]j[/tex]. Because [tex]q[/tex] is a probability distribution, the total of all the probabilities must add up to one, and so the average PageRank is [tex]1/N[/tex].

It’s interesting that we can get this precise value for the average PageRank, but what we’d really like is more detailed information about the way PageRank behaves for typical pages. Let’s start by studying the minimal and maximal possible values for PageRank.

Intuitively, we expect that the minimal value for PageRank occurs when a page isn’t linked to by any other page. This intuition can be made more rigorous as follows. Recall our definition of PageRank using a websurfer randomly surfing through webpages. At any given step of their random walk, the crazy websurfer has a probability at least [tex]t P_j[/tex] of teleporting to vertex [tex]j[/tex], and thus we must have

[tex] q_j \geq t P_j [/tex]

for all pages [tex]j[/tex]. Furthermore, it’s possible for the PageRank to be equal to this value if, for example, no other pages link to page [tex]j[/tex]. Thus [tex]tP_j[/tex] is the minimal value for the PageRank of page [tex]j[/tex]. A more algebraic way of putting this argument is to recall from the last lecture that [tex]tP = (I-sG)q[/tex], and thus [tex]t P_j = q_j - s(Gq)_j \leq q_j[/tex], since [tex](Gq)_j \geq 0[/tex]. For the original PageRank algorithm, [tex]P_j = 1/N[/tex], and so the minimal PageRank is [tex]t/N[/tex] for all webpages, a factor [tex]t[/tex] smaller than the average PageRank.
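
As a quick numerical sanity check (a sketch, not part of the original lecture notes), here is a tiny three-page web in which the last page has no incoming links: its PageRank comes out at exactly the minimal value [tex]t/N[/tex], while every page satisfies [tex]q_j \geq tP_j[/tex].

# Check the bound q_j >= t P_j on a three-page web.  Columns of G describe
# outgoing links: page 0 links to page 1, page 1 links to page 0, and page 2
# links to both pages 0 and 1.  Nothing links to page 2.
import numpy as np

def pagerank(G, s=0.85):
  # q = t (I - sG)^{-1} P, with uniform teleportation vector P
  n = G.shape[0]
  P = np.ones((n, 1)) / n
  return (1 - s) * np.linalg.solve(np.eye(n) - s * G, P)

G = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
s = 0.85
q = pagerank(G, s)
print(q.ravel())     # [0.475, 0.475, 0.05]
print((1 - s) / 3)   # minimal possible value t/N = 0.05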

Exercises

  • We will say that webpage [tex]k[/tex] eventually leads to webpage [tex]j[/tex] if there is a link path leading from [tex]k[/tex] to [tex]j[/tex]. Show that [tex]q_j = tP_j[/tex] if and only if [tex]P_k = 0[/tex] for all pages [tex]k[/tex] that eventually lead to page [tex]j[/tex]. Obviously, this can’t happen for the original PageRank algorithm, where [tex]P_j = 1/N[/tex], unless there are no pages at all linking to page [tex]j[/tex]. Generically, therefore, most pages have PageRank greater than this minimum.

We can use the value for the minimal PageRank to deduce the maximal possible PageRank. Because the sum of all PageRanks is [tex]1[/tex], the maximal possible PageRank for page [tex]j[/tex] occurs if all other PageRanks have their minimal value, [tex]tP_k[/tex]. In this case [tex]q_j = 1-\sum_{k\neq j} t P_k = 1-t(1-P_j) = s+tP_j[/tex], and so we have

[tex] q_j \leq s + t P_j. [/tex]

An example where maximal PageRank occurs is a web with a “star” topology of links, with the central webpage, which we’ll label [tex]1[/tex], having only a single outgoing link, to itself, and every other webpage [tex]2,\ldots,N[/tex] having a single link going to page [tex]1[/tex]. Pages [tex]2,\ldots,N[/tex] have no incoming links, and thus have minimal PageRank [tex]tP_j[/tex], and so page [tex]1[/tex] has maximal PageRank, [tex]s+tP_1[/tex].
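
Here is a small numerical illustration of the star topology (a sketch, assuming the uniform teleportation vector [tex]P_j = 1/N[/tex]): the central page comes out at the maximal value [tex]s + t/N[/tex], and every other page at the minimal value [tex]t/N[/tex].

# Star topology with N pages: page 0 links only to itself, and every other
# page links only to page 0.
import numpy as np

def pagerank(G, s=0.85):
  # q = t (I - sG)^{-1} P, with uniform teleportation vector P
  n = G.shape[0]
  P = np.ones((n, 1)) / n
  return (1 - s) * np.linalg.solve(np.eye(n) - s * G, P)

N, s = 10, 0.85
G = np.zeros((N, N))
G[0, :] = 1.0                     # every column's single link points to page 0
q = pagerank(G, s)
print(q[0, 0], s + (1 - s) / N)   # maximal value: both 0.865
print(q[1, 0], (1 - s) / N)       # other pages at the minimum: both 0.015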

Exercises

  • In the previous example show explicitly that [tex]q = (s+tP_1,tP_2,\ldots,tP_N)[/tex] satisfies the equation [tex]Mq = q[/tex].
  • In the star topology example, we required that webpage [tex]1[/tex] link to itself. The reason was so that page [tex]1[/tex] was not a dangling page, and thus treated by PageRank as effectively linked to every page. What happens to the PageRank of page [tex]1[/tex] if we make it dangling?

Typical values of PageRank

It’s all very well to know the maximal and minimal values of PageRank, but in practice those situations are rather unusual. What can we say about typical values of PageRank? To answer this question, let’s build a model of the web, and study its properties. We’ll assume a very simple model of a web with just 5,000 pages, with each page randomly linking to 10 others. The following code plots a histogram of the PageRanks, for the case [tex]s = 0.85[/tex].

 
# Generates a 5,000 page web, with each page randomly 
# linked to 10 others, computes the PageRank vector, and 
# plots a histogram of PageRanks.
# 
# Runs under python 2.5 with the numpy and matplotlib 
# libraries.  If these are not already installed, you'll 
# need to install them.

from numpy import * 
import random 
import matplotlib.pyplot as plt

# Returns an n-dimensional column vector with m random 
# entries set to 1.  There must be a more elegant way to 
# do this! 
def randomcolumn(n,m):
  values = random.sample(range(0,n),m)
  rc = mat(zeros((n,1)))
  for value in values:
    rc[value,0] = 1
  return rc

# Returns the G matrix for an n-webpage web where every 
# page is linked to m other pages. 
def randomG(n,m):   
  G = mat(zeros((n,n)))
  for j in range(0,n):
    G[:,j] = randomcolumn(n,m)
  return (1.0/m)*G

# Returns the PageRank vector for a web with parameter s 
# and whose link structure is described by the matrix G. 
def pagerank(G,s):
  n = G.shape[0]
  id = mat(eye(n))
  teleport = mat(ones((n,1)))/n
  return (1-s)*((id-s*G).I)*teleport

G = randomG(5000,10)  # 5,000 pages, 10 outgoing links per page, as described above
pr = pagerank(G,0.85)
print std(pr)
plt.hist(pr,50) 
plt.show() 

Problems

  • The randomcolumn function in the code above isn’t very elegant. Can you find a better way of implementing it?

The result of running the code is the following histogram of PageRank values – the horizontal axis is PageRank, while the vertical axis is the frequency with which it occurs:

We see from this graph that PageRank has a mean of [tex]1/5000[/tex], as we expect, and an empirical standard deviation of [tex]0.000055[/tex]. Roughly speaking, it looks Gaussian, although, of course, PageRank is bounded above and below, and so can’t literally be Gaussian.

Suppose now that we change the number of links per page to [tex]100[/tex], but keep everything else the same. Then the resulting histogram is:

This also looks Gaussian, but has a smaller standard deviation of [tex]0.000017[/tex]. Notice that while we approach the minimal possible value of PageRank ([tex]t/N = 0.00003[/tex]) in the first graph, in neither graph do we come anywhere near the maximal possible value of PageRank, which is just a tad over [tex]s = 0.85[/tex].

Looking at these results, and running more numerical experiments in a similar vein, we see that as the number of outgoing links is increased, the standard deviation in PageRank seems to decrease. A bit of playing around suggests that the standard deviation decreases as one over the square root of the number of outgoing links, at least when the number of outgoing links is small compared with the total number of pages. More numerical experimentation suggests that the standard deviation is also proportional to [tex]s/N[/tex]. As a result, I conjecture:

Conjecture: Consider a random web with [tex]n[/tex] pages and [tex]m[/tex] outgoing links per page. Then for [tex]1 \ll m \ll n[/tex] the PageRank distribution approaches a Gaussian with mean [tex]1/n[/tex] and standard deviation [tex]s/(n\sqrt{m})[/tex].

For the case [tex]n = 5000, m = 10, s = 0.85[/tex] this suggests a standard deviation of [tex]0.000054[/tex], while for the case [tex]n = 5000, m = 100, s = 0.85[/tex] it suggests a standard deviation of [tex]0.000017[/tex]. Both these results are in close accord with the earlier empirical findings. The further numerical investigations I’ve done (not a lot, it must be said) also confirm it.
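
Here is a sketch of the kind of numerical experiment involved (not the exact code used above; the web is smaller, 2,000 pages, so it runs quickly, but the scaling is the same): for several values of [tex]m[/tex] it compares the empirical standard deviation of PageRank against the conjectured [tex]s/(n\sqrt{m})[/tex].

# Compare the empirical standard deviation of PageRank with s/(n*sqrt(m))
# for a few values of m, using the same fixed-outdegree random web model
# as the code earlier in this lecture.
import random
import numpy as np

def pagerank(G, s=0.85):
  n = G.shape[0]
  P = np.ones((n, 1)) / n
  return (1 - s) * np.linalg.solve(np.eye(n) - s * G, P)

def random_web(n, m):
  # each page links to m distinct pages chosen uniformly at random
  G = np.zeros((n, n))
  for j in range(n):
    targets = random.sample(range(n), m)
    G[targets, j] = 1.0 / m
  return G

n, s = 2000, 0.85
for m in (10, 25, 100):
  pr = pagerank(random_web(n, m), s)
  print(m, pr.std(), s / (n * np.sqrt(m)))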

Incidentally, you might wonder why I chose to use a web containing [tex]5,000[/tex] pages. The way I’ve computed PageRank in the program above, we compute the matrix inverse of a [tex]5,000[/tex] by [tex]5,000[/tex] matrix. Assuming we’re using [tex]64[/tex]-bit floating point arithmetic, that means just storing the matrix requires [tex]200[/tex] megabytes. Not surprisingly, it turns out that my small laptop can’t cope with [tex]10,000[/tex] webpages, so I stuck with [tex]5,000[/tex]. Still, it’s notable that at that size the program still ran quite quickly. In the next lecture we’ll see how to cope with much larger sets of webpages.

Exercises

  • Modify the program above to run with [tex]m=1[/tex]. You’ll find that the PageRank distribution does not look Gaussian. What conjecture do you think describes the PageRank distribution in this case? What’s a good reason we might not expect Gaussian behaviour for small values of [tex]m[/tex]?
  • Do you trust the Python code to produce the correct outcomes? One of the problems in using numeric libraries as black boxes is that it can be hard to distinguish between real results and numerical artifacts. How would you determine which is the case for the code above? (For the record, I do believe the results, but have plans to do more checks in later lectures!)

Problems

  • Can you prove or disprove the above conjecture? If it’s wrong, is there another similar type of result you can prove for PageRank, a sort of “central limit theorem” for PageRank? Certainly, the empirical results strongly suggest that some type of result in this vein is true! I expect that a good strategy for attacking this problem is to first think about the problem from first principles, and, if you’re getting stuck, to consult some of the literature about random walks on random graphs.
  • What happens if instead of linking to a fixed number of webpages, the random link structure is governed by a probability distribution? You might like to do some computational experiments, and perhaps formulate (and, ideally, prove) a conjecture for this case.
  • Following up on the last problem, I believe that a much better model of the web than the one we’ve used is to assume a power-law distribution of incoming (not outgoing) links to every page. Can you formulate (and, ideally, prove) a type of central limit theorem for PageRank in this case? I expect to see a much broader distribution of PageRanks than the Gaussian we saw in the case of a fixed number of outgoing links.

A simple technique to spam PageRank

One of the difficulties faced by search engines is spammers who try to artificially inflate the prominence of webpages by pumping up their PageRank. To defeat this problem, we need to understand how such inflation can be done, and find ways of avoiding it. Here’s one simple technique for getting a high PageRank.

Suppose the web contains [tex]N[/tex] pages, and we build a link farm containing [tex]M[/tex] additional pages, in the star topology described earlier. That is, we number our link farm pages [tex]1,\ldots,M[/tex], make page [tex]1[/tex] link only to itself, and all the other pages link only to page [tex]1[/tex]. Then if we use the original PageRank, with uniform teleportation probabilities [tex]1/(N+M)[/tex], a similar argument to earlier shows that the PageRank for page [tex]1[/tex] is

[tex] (s + t/M) \frac{M}{M+N}. [/tex]

As we increase the size, [tex]M[/tex], of the link farm it eventually becomes comparable to the size of the rest of the web, [tex]N[/tex], and this results in an enormous PageRank for page [tex]1[/tex] of the link farm. This is not just a theoretical scenario. In the past, spammers have built link farms containing billions of pages, using a technique similar to that described above to inflate their PageRank. They can then sell links from page 1 to other pages on the web, essentially renting out some of their PageRank.
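
Here’s a quick numerical check of that formula (a sketch, not part of the original lecture): [tex]N[/tex] ordinary pages arranged in a simple ring, plus an [tex]M[/tex]-page link farm in the star topology, with nothing outside the farm linking into it.

# Numerical check of the link-farm PageRank formula (s + t/M) * M/(M+N).
# Ordinary pages 0..N-1 form a ring; the farm's hub is page N, linking only
# to itself, and the remaining M-1 farm pages link only to the hub.
import numpy as np

def pagerank(G, s=0.85):
  n = G.shape[0]
  P = np.ones((n, 1)) / n
  return (1 - s) * np.linalg.solve(np.eye(n) - s * G, P)

N, M, s = 200, 50, 0.85
t = 1 - s
n = N + M
G = np.zeros((n, n))
for j in range(N):
  G[(j + 1) % N, j] = 1.0          # ordinary page j links to the next ring page
G[N, N:] = 1.0                     # every farm page (hub included) links to the hub
q = pagerank(G, s)
print(q[N, 0])                     # PageRank of the hub
print((s + t / M) * M / (M + N))   # formula: 0.1706 for these parameters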

How could we defeat this strategy? If we assume that we can detect the link farm being built, then a straightforward way of dealing with it is to modify the teleportation vector so that all pages in the link farm are assigned teleportation probabilities of zero, eliminating all the spurious PageRank accumulated due to the large number of pages in the link farm. Unfortunately, this still leaves the problem of detecting the link farm to begin with. That’s not so easy, and I won’t discuss it in this lecture, although I hope to come back to it in a later lecture.

Exercises

  • Prove that the PageRank for page [tex]1[/tex] of the link farm is, as claimed above, [tex](s+t/M) M/(M+N)[/tex].
  • The link topology described for the link farm is not very realistic. To get all the pages into the Google index the link farm would need to be crawled by Google’s webcrawler. But the crawler would never discover most of the pages, since they aren’t linked to by any other pages. Can you think of a more easily discoverable link farm topology that still gains at least one of the pages a large PageRank?

That concludes the second lecture. Next lecture, we’ll look at how to build a more serious implementation of PageRank, one that can be applied to a web containing millions of pages, not just thousands of pages.

Published
Categorized as GTS

Biweekly links for 12/08/2008

  • The basic dichotomy of quantum gravity « Shores of the Dirac Sea
    • Moshe Rozali explaining some of the problems of quantum gravity.
  • Eric Drexler: Condensed-Matter Physics Condensed
    • “In reading up on metal oxides, spin systems, and computation, I found a wonderful 19-page “Perspective of Frontiers in Modern Condensed Matter Physics” [pdf], published in the AAPPS Bulletin last April by Caltech physicist Nai-Chang Yeh. Its scope ranges from symmetry, Landau, and intellectual history, to topological orders and spin liquids, even touching on solid-phase chemistry and nanofabrication. I think it will reward physics-oriented readers with a taste for the broad knowledge that puts deep knowledge into context.”
  • John Walker: Mortgage Bailout, Roman Style (Fourmilog: None Dare Call It Reason)
    • “In Book VI of The Annals, Tacitus describes how runaway mortgage lending, combined with government meddling with interest rates and loan terms resulted in a credit crunch and an eventual collapse in real estate prices. All of this happened in A.D. 32 during the reign of the Roman Emperor Tiberius. “
  • NORAD Tracks Santa in 2007
    • See also http://noradsanta.org, the official site for NORAD’s 2008 tracking.
  • American Physical Society News: The Future of Science
    • An abridgement of my essay “The Future of Science” appeared as the back page of the American Physical Society’s November newsletter. The abridgement is by Jennifer Ouellette.

Click here for all of my del.icio.us bookmarks.

Published