Expander graphs V: random walks on expanders

Back to expanders again. Today’s post looks at a simple application of expanders, showing that a random walk on an expander graph is likely to quickly escape from any sufficiently small subset of vertices. Intuitively, of course, this result is not surprising, but the exact quantitative form of the result turns out to be extremely useful in the next post, which is about decreasing the number of random bits used by a randomized algorithm.

Note: This post is one in a series introducing one of the deepest ideas in modern computer science, expander graphs. Expanders are one of those powerful ideas that crop up in many apparently unrelated contexts, and that have a phenomenally wide range of uses. The goal of the posts is to explain what an expander is, and to learn just enough about them that we can start to understand some of their amazing uses. The posts require a little background in graph theory, computer science, linear algebra and Markov chains (all at about the level of a first course) to be comprehensible. I am not an expert on expanders, and the posts are just an introduction. They are are mostly based on some very nice 2003 lecture notes by Nati Linial and Avi Wigderson, available on the web at http://www.math.ias.edu/~boaz/ExpanderCourse/.

Random walks on expanders

Many applications of expanders involve doing a random walk on the expander. We start at some chosen vertex, and then repeatedly move to any one of the [tex]d[/tex] neighbours, each time choosing a neighbour uniformly at random, and independently of prior choices.

To describe this random walk, suppose at some given time we have a probability distribution [tex]p[/tex] describing the probability of being at any given vertex in the graph [tex]G[/tex]. We then apply one step of the random walk procedure described above, i.e., selecting a neighbour of the current vertex uniformly at random. The updated probability distribution is easily verified to be:

[tex] \tilde p = \frac{A(G)}{d} p. [/tex]

That is, the Markov transition matrix describing this random walk is just [tex]\hat A(G) \equiv A(G)/d[/tex], i.e., up to a constant of proportionality the transition matrix is just the adjacency matrix. This relationship between the adjacency matrix and random walks opens up a whole new world of connections between graphs and Markov chains.

One of the most important connections is between the eigenvalues of Markov transition matrices and the rate at which the Markov chain converges to its stationary distribution. In particular, the following beautiful theorem tells us that when the uniform distribution is a stationary distribution for the chain, then the Markov chain converges to the uniform distribution exponentially quickly, at a rate determined by the second largest eigenvalue of [tex]M[/tex].

Exercise: Show that if [tex]M[/tex] is a normal transition matrix for a Markov chain then [tex]1 = \lambda_1(M) \geq \lambda_2(M) \geq \ldots[/tex].

Theorem: Suppose [tex]M[/tex] is a normal transition matrix for a Markov chain on [tex]n[/tex] states, with the uniform distribution [tex]u = \vec 1/n[/tex] as a stationary point, [tex]M u = u[/tex]. Then for any starting distribution [tex]p[/tex],

[tex] \| M^t p – u \|_1 \leq \sqrt{n} \lambda_2(M)^t, [/tex]

where [tex]\| \cdot \|_1[/tex] denotes the [tex]l_1[/tex] norm.

The normality condition in this theorem may appear a little surprising. The reason it’s there is to ensure that [tex]M[/tex] can be diagonalized. The theorem can be made to work for general [tex]M[/tex], with the second largest eigenvalue replaced by the second largest singular value. However, in our situation [tex]M[/tex] is symmetric, and thus automatically normal, and we prefer the statement in terms of eigenvalues, since it allows us to make a connection to the expansion parameter of a graph. In particular, when [tex]M = \hat A(G)[/tex] we obtain:

[tex] \| \hat A(G)^t p-u\|_1 \leq \sqrt{n} \left(\frac{\lambda_2(G)}{d}\right)^t. [/tex]

Combining this with our earlier results connecting the gap to the expansion parameter, we deduce that

[tex] \| \hat A(G)^t p-u\|_1 \leq \sqrt{n} \left(1-\frac{h(G)^2}{2d^2}\right)^t. [/tex]

Thus, for a family of expander graphs, the rate of convergence of the Markov chain is exponentially fast in the number of time steps [tex]t[/tex].

Exercise: Suppose [tex]M[/tex] is a transition matrix for a Markov chain. Show that the uniform distribution [tex]u[/tex] is a stationary point point for the chain, i.e., [tex]Mu = u[/tex], if and only if [tex]M[/tex] is doubly stochastic, i.e., has non-zero entries, and all rows and columns of the matrix sum to [tex]1[/tex].

Proof: We start by working with the [tex]l_2[/tex] norm [tex]\| \cdot \|_2[/tex]. Since [tex]Mu = u[/tex] we have [tex]M^t u = u[/tex], and so:

[tex] \|M^t p – u \|_2 = \|M^t(p-u) \|_2. [/tex]

A computation shows that [tex]p-u[/tex] is orthogonal to [tex]u[/tex]. But [tex]u[/tex] is an eigenvector of [tex]M[/tex] with the maximum eigenvalue, [tex]1[/tex], and thus [tex]p-u[/tex] must lie in the span of the eigenspaces with eigenvalues [tex]\lambda_2(M),\lambda_3(M),\ldots[/tex]. It follows that

[tex] \|M^t(p-u)\|_2 \leq \lambda_2(M)^t \|p-u\|_2 \leq \lambda_2(M)^t, [/tex]

where we used the fact that [tex]\| p-u\|_2 \leq 1[/tex], easily established by observing that [tex]\|p-u\|_2[/tex] is convex in [tex]p[/tex], and thus must be maximized at an extreme point in the space of probability distributions; the symmetry of [tex]u[/tex] ensures that without loss of generality we may take [tex]p = (1,0,\ldots,0)[/tex]. To convert this into a result about the [tex]l_1[/tex] norm, we use the fact that in [tex]n[/tex] dimensions [tex]\|v\|_1 \leq \sqrt{n} \|v\|_2[/tex], and thus we obtain

[tex] \|M^t(p-u)\|_1 \leq \sqrt{n} \lambda_2(M)^t, [/tex]

which was the desired result. QED

What other properties do random walks on expanders have? We now prove another beautiful theorem which tells us that they “move around quickly”, in the sense that they are exponentially unlikely to stay for long within a given subset of vertices, [tex]B[/tex], unless [tex]B[/tex] is very large.

More precisely, suppose [tex]B[/tex] is a subset of vertices, and we choose some vertex [tex]X_0[/tex] uniformly at random from the graph. Suppose we use [tex]X_0[/tex] as the starting point for a random walk, [tex]X_0,\ldots,X_t[/tex], where [tex]X_t[/tex] is the vertex after the [tex]t[/tex]th step. Let [tex]B(t)[/tex] be the event that [tex]X_j \in B[/tex] for all [tex]j[/tex] in the range [tex]0,\ldots,t[/tex]. Then we will prove that:

[tex] \mbox{Pr}(B(t)) \leq \left( \frac{\lambda_2(G)}{d} + \frac{|B|}{n} \right)^t [/tex]

Provided [tex]\lambda_2(G)/d + |B|/n < 1[/tex], we get the desired exponential decrease in probability. For a family of expander graphs it follows that there is some constant [tex]\epsilon > 0[/tex] such that we get an exponential decrease for any [tex]B[/tex] such that [tex]|B|/n < \epsilon[/tex]. These results are special cases of the following more general theorem about Markov chains. Theorem: Let [tex]X_0[/tex] be uniformly distributed on [tex]n[/tex] states, and let [tex]X_0,\ldots,X_t[/tex] be a time-homogeneous Markov chain with transition matrix [tex]M[/tex]. Suppose the uniform distribution [tex]u[/tex] is a stationary point of [tex]M[/tex], i.e., [tex]Mu = u[/tex]. Let [tex]B[/tex] be a subset of the states, and let [tex]B(t)[/tex] be the event that [tex]X_j \in B[/tex] for all [tex]j \in 0,\ldots,t[/tex]. Then [tex] \mbox{Pr}(B(t)) \leq \left( \lambda_2(M) + \frac{|B|}{n} \right)^t. [/tex] Proof: The first step in the proof is to observe that: [tex] \mbox{Pr}(B(t)) = \|(PMP)^t P u \|_1, [/tex] where the operator [tex]P[/tex] projects onto the vector space spanned by those basis vectors corresponding to elements of [tex]B[/tex]. This equation is not entirely obvious, and proving it is a good exercise for the reader. The next step is to prove that [tex]\| PMP \| \leq \lambda_2(M)+|B|/n[/tex], where the norm here is the operator norm. We will do this below, but note first that once this is done, the result follows, for we have [tex] \mbox{Pr}(B(t)) = \| (PMP)^t P u \|_1 \leq \sqrt{n} \| (PMP)^t P u \|_2 [/tex] by the standard inequality relating [tex]l_1[/tex] and [tex]l_2[/tex] norms, and thus [tex] \mbox{Pr}(B(t)) \leq \sqrt{n} \| PMP \|^t \| P u \|_2, [/tex] by definition of the operator norm, and finally [tex] \mbox{Pr}(B(t)) \leq \left( \lambda_2(M)+\frac{|B|}{n} \right)^t, [/tex] where we used the assumed inequality for the operator norm, and the observation that [tex]\| P u \|_2 = \sqrt{|B|}/n \leq 1/\sqrt{n}[/tex]. To prove the desired operator norm inequality, [tex]\| PMP \| \leq \lambda_2(M)+|B|/n[/tex], suppose [tex]v[/tex] is a normalized state such that [tex]\| PMP \| = |v^T PMP v|[/tex]. Decompose [tex]Pv = \alpha u + \beta u_\perp[/tex], where [tex]u_\perp[/tex] is a normalized state orthogonal to [tex]u[/tex]. Since [tex]\|P v \|_2 \leq \|v \|_2 = 1[/tex] we must have [tex]|\beta| \leq 1[/tex]. Furthermore, multiplying [tex]Pv = \alpha u + \beta u_\perp[/tex] on the left by [tex]nu^T[/tex] shows that [tex]\alpha = n u^T P v[/tex]. It follows that [tex]|\alpha|[/tex] is maximized by choosing [tex]v[/tex] to be uniformly distributed over [tex]B[/tex], from which it follows that [tex]|\alpha| \leq \sqrt{|B|}[/tex]. A little algebra shows that [tex] v^T PMP v = \alpha^2 u^T M u + \beta^2 u_\perp^T M u_\perp. [/tex] Applying [tex]|\alpha| \leq \sqrt{|B|}[/tex], [tex]u^T M u = u^Tu = 1/n[/tex], [tex]|\beta| \leq 1[/tex], and [tex]u_\perp^T M u_\perp \leq \lambda_2(M)[/tex] gives [tex] v^T P M P v \leq \frac{|B|}{n} + \lambda_2(M), [/tex] which completes the proof. QED The results in today's post are elegant, but qualitatively unsurprising. (Of course, having elegant quantitative statements of results that are qualitatively clear is worthwhile in its own right!) In the next post we'll use these ideas to develop a genuinely surprising application of expanders, to reducing the number of random bits required by a probabilistic algorithm in order to achieve a desired success probability.