Random k-SAT


Below are some notes of Dimitris Achlioptas, with additions from Lenka Zdeborova.

This is an attempt to clarify some issues regarding "d1RSB", "clustering", "freezing", "condensation" in random k-SAT. An effort is made to keep the discussion rather generic and, less successfully, short. This picture originates from statistical physics theories, but we stress results that are known rigorously. References to original literature are mostly not given and this should be fixed at some later point.

The setting:

  • A set of n variables all with the same finite domain of size d, e.g., d=2 in k-SAT.
  • A collection of constraints on the variables, each constraint binding k of the n variables, e.g., k=2 in graph coloring.

Typically, the collection consists of "all possible constraints of a certain kind on the n variables", e.g., all [math]\displaystyle{ 2^k \binom{n}{k} }[/math] clauses of length k in k-SAT, or all [math]\displaystyle{ \binom{n}{2} }[/math] edges, i.e., "not equal" constraints, in graph coloring.

  • It will be easier to think of the formation of random instances as a "process", in which constraints are added one by one, by selecting one uniformly random constraint from the collection, among those that have not yet been selected (for technical reasons, it often helps to allow selection with replacement, but this turns out to be inconsequential.)
  • Finally, it also helps to think of what is called in physics the "energy landscape", i.e., the function counting the number of violated constraints for each value assignment. So, in particular, solutions correspond to the value assignments (points) where the energy is zero.
  • For any given problem, let [math]\displaystyle{ S_n(t) }[/math] denote the set of solutions of a random instance after t constraints have been added. (Clearly, [math]\displaystyle{ S_n(t) }[/math] is a random variable, so everything below refers to the "typical" behavior of [math]\displaystyle{ S_n }[/math].)
  • Say that two value assignments are adjacent if they have Hamming distance 1. Under this notion of adjacency, we will look at the "connected components" of [math]\displaystyle{ S_n }[/math], also called "clusters". (Sometimes "Hamming distance 1" is replaced by "subextensive Hamming distance"; this does not appear to make an essential difference.) A small brute-force sketch of these definitions follows this list.
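
To make these definitions concrete, here is a minimal brute-force sketch in Python (the clause encoding and the function names are illustrative choices, not part of the notes): it computes the energy of an assignment, enumerates [math]\displaystyle{ S_n }[/math], and splits it into clusters under Hamming-distance-1 adjacency.

<pre>
from itertools import product

def satisfies(clause, a):
    # A clause is a tuple of non-zero ints: literal v means "variable v is True",
    # -v means "variable v is False"; the assignment a is a 0/1 tuple indexed from 0.
    return any((a[abs(lit) - 1] == 1) if lit > 0 else (a[abs(lit) - 1] == 0)
               for lit in clause)

def energy(clauses, a):
    # The "energy" of an assignment: the number of violated constraints.
    return sum(0 if satisfies(c, a) else 1 for c in clauses)

def solutions(clauses, n):
    # S_n: all zero-energy assignments (brute force, so only for very small n).
    return [a for a in product((0, 1), repeat=n) if energy(clauses, a) == 0]

def clusters(sols):
    # Connected components of the solution set under Hamming-distance-1 adjacency.
    sols = set(sols)
    comps = []
    while sols:
        comp, frontier = set(), {sols.pop()}
        while frontier:
            a = frontier.pop()
            comp.add(a)
            for i in range(len(a)):
                b = a[:i] + (1 - a[i],) + a[i + 1:]
                if b in sols:
                    sols.remove(b)
                    frontier.add(b)
        comps.append(comp)
    return comps
</pre>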

With the above in mind, below is a "voiceover" for the video [math]\displaystyle{ S_n(0), S_n(1), S_n(2),\ldots }[/math].

  • Clearly, at t=0, every value assignment is a solution, i.e., [math]\displaystyle{ S_n(0) }[/math] is everything, and there is only one cluster.
  • Now, fix [math]\displaystyle{ \delta \gt 0 }[/math] and let [math]\displaystyle{ X=X(\delta) }[/math] be the random variable equal to the smallest number of clusters whose union contains all but a [math]\displaystyle{ \delta }[/math]-fraction of [math]\displaystyle{ S_n }[/math].

Clustering

  • It is predicted that, for any fixed [math]\displaystyle{ \delta \gt 0 }[/math], X undergoes a "phase transition": it jumps from 1 to "exponentially many" during the addition of [math]\displaystyle{ o(n) }[/math] constraints, around a critical, problem-specific constraints-to-variables ratio. The latter phase is called clustered. Clustering is a very generic phenomenon for random CSPs.
  • The existence of the clustered phase has been rigorously established for random k-SAT, random graph k-coloring, random NAE k-SAT, and a number of other problems: in each case, for all values of [math]\displaystyle{ k \gt k_0 }[/math] (where [math]\displaystyle{ k_0 }[/math] is finite and problem dependent), there is a constraint density t/n beyond which [math]\displaystyle{ S_n(t) }[/math] is clustered.
  • If one thinks in terms of the energy landscape, clustering means that we go from a world with a single big energy-zero valley, to a world with exponentially many smaller energy-zero valleys (and no big energy zero valley).
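
As a toy illustration of the "video" [math]\displaystyle{ S_n(0), S_n(1), S_n(2),\ldots }[/math] (building on the sketch in the previous section; purely illustrative, and only feasible for very small n, where no sharp transition is visible), one can add uniformly random k-clauses one at a time and track [math]\displaystyle{ |S_n(t)| }[/math] and the number of clusters:

<pre>
import random

def clustering_process(n, k, t_max, seed=0):
    # Add t_max uniformly random k-clauses (with replacement) over n variables and,
    # after each addition, report |S_n(t)| and the number of clusters.
    # Uses solutions() and clusters() from the sketch above.
    rng = random.Random(seed)
    clauses = []
    for t in range(1, t_max + 1):
        variables = rng.sample(range(1, n + 1), k)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
        sols = solutions(clauses, n)
        print(t, len(sols), len(clusters(sols)))

clustering_process(n=10, k=3, t_max=50)
</pre>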

Of course, to have valleys you need mountains. In particular, another thing that has been rigorously established is that the valleys are "landlocked". Namely:

  • For some region of parameters, every pair of valleys has Hamming distance [math]\displaystyle{ \Omega(n) }[/math], and every path connecting solutions in distinct valleys must go through a point at which the energy function has value [math]\displaystyle{ \Omega(n) }[/math]. [By the Lipschitzness of the energy function, in fact, the path must go through [math]\displaystyle{ \Omega(n) }[/math] such points.]

1RSB

  • 1RSB stands for "1-step Replica Symmetry Breaking". Consider a single-variable-flip Markov process satisfying the detailed balance condition with respect to the uniform distribution over [math]\displaystyle{ S_n }[/math] (e.g., Glauber dynamics). Whereas in the "Replica Symmetric" (RS) phase this process is able to sample [math]\displaystyle{ S_n }[/math] uniformly in time [math]\displaystyle{ O(n) }[/math], the 1RSB phase is defined as the phase where uniform sampling with this process requires time exponential in n. (A toy sketch of such a chain is given after this list.)
  • The uniform measure over [math]\displaystyle{ S_n }[/math] ceases to be an extremal Gibbs measure at the 1RSB transition. In the 1RSB phase [math]\displaystyle{ S_n }[/math] is a union of exponentially many extremal Gibbs measures, sometimes called states in physics. The reconstruction problem on graphs is solvable if and only if we are in the d1RSB phase. The equivalence of the three definitions of d1RSB given above is predicted by statistical physics theories for random k-SAT, but so far has been proven only partially. Statistical physics theories also predict a sharp transition from the RS to the d1RSB phase, as well as its location.
  • d1RSB is a statistical physics term for a 1RSB phase that is not condensed. For the definition of condensation see below.
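
Glauber dynamics proper resamples a variable from its conditional distribution given the others; the following simpler zero-temperature, single-variable-flip walk (an illustrative stand-in, not the notes' definition, reusing energy() from the earlier sketch) also satisfies detailed balance with respect to the uniform distribution on [math]\displaystyle{ S_n }[/math], and it makes the obstruction visible: it can never leave the cluster of its starting point.

<pre>
import random

def solution_walk(clauses, start, steps, seed=0):
    # Single-variable-flip walk restricted to solutions: pick a uniformly random
    # variable and flip it only if the result is still a solution.  Transitions
    # between neighbouring solutions have probability 1/n in both directions, so
    # detailed balance w.r.t. the uniform measure on S_n holds -- but the walk is
    # confined to the cluster containing `start`.
    rng = random.Random(seed)
    a = list(start)
    for _ in range(steps):
        i = rng.randrange(len(a))
        a[i] ^= 1
        if energy(clauses, a) != 0:  # reject: the move would leave S_n
            a[i] ^= 1
    return tuple(a)
</pre>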

Note that clustering implies 1RSB, but 1RSB does not imply clustering as defined above.

A word of caution: in the statistical physics literature, clustering is almost always defined via 1RSB, not as above.

Freezing

  • The next relevant phenomenon is the emergence of "frozen variables". Concretely, a variable is frozen in a cluster if it takes the same value in all assignments of that cluster. Note that the same variable can be frozen to different values in different clusters and can be frozen in some clusters but not in others. Freezing implies clustering, but clustering can (and does) occur without freezing.
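
A minimal sketch of this definition, using the cluster representation from the earlier sketches (the function name is my own):

<pre>
def frozen_variables(cluster):
    # Indices of the variables that take the same value in every assignment
    # of the given cluster (a set of 0/1 tuples).
    assignments = list(cluster)
    n = len(assignments[0])
    return [i for i in range(n) if len({a[i] for a in assignments}) == 1]
</pre>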

Regarding frozen variables we rigorously know for instance the following:

  1. For [math]\displaystyle{ k\gt k_1 }[/math] there is a constraint density in random k-SAT beyond which a uniformly random solution, with high probability, has [math]\displaystyle{ \Omega(n) }[/math] frozen variables.
  2. As k is increased (in k-SAT this means the length of the clauses, while in coloring the number of colors), the constraint density beyond which the emergence of frozen variables is proven agrees, to leading order in k, with the constraint density beyond which clustering appears.
  3. (Perhaps most importantly) It was proven for significantly larger constraint densities that while [math]\displaystyle{ S_n }[/math] still consists of "exponentially many clusters of roughly equal size", with high probability, EVERY SINGLE cluster contains [math]\displaystyle{ \Omega(n) }[/math] frozen variables. (And indeed, the fraction of variables that are frozen tends to 1 with k.)

To put this a bit more starkly: For a number of random CSP problems, for a sizable fraction of the satisfiable regime of each problem, the space of solutions looks like an "error-correcting code with fuzz". The "fuzz" is the little cloud that forms by jiggling the non-frozen variables. Again, it's important to remember that the set of frozen variables differs from cluster to cluster and that, moreover, a variable can be frozen to different values in different clusters.

  • After [math]\displaystyle{ S_n }[/math] has reached the above "error-correcting code with fuzz" state, it is clear that every single constraint added from now on will "kill", in expectation, a constant fraction of all clusters (since each cluster has a constant fraction of frozen variables and, therefore, a constant probability of being killed).
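
A rough back-of-the-envelope estimate (not from the notes, and ignoring all dependencies): if a fraction [math]\displaystyle{ \delta }[/math] of the variables is frozen in a given cluster, then a uniformly random k-clause falls entirely on frozen variables with probability about [math]\displaystyle{ \delta^k }[/math] and, given that, each of its k literals is falsified by the frozen values with probability 1/2; so the clause kills the cluster with probability at least roughly [math]\displaystyle{ (\delta/2)^k }[/math] -- a constant independent of n.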

Condensation

  • Consider two variables that are far away from each other (in terms of the shortest path between them in the graph of the instance). The correlation between these two variables is defined as [math]\displaystyle{ E(x_1 x_2) - E(x_1) E(x_2) }[/math], where the average is over the uniform measure on [math]\displaystyle{ S_n }[/math]. In the phase called "condensed", the correlation between a typical pair of variables does not go to zero as their distance goes to infinity. (A small sketch computing this quantity is given after this list.)
  • In the condensed phase only a handful (finitely many in expectation) of Gibbs states capture all but a vanishing fraction of [math]\displaystyle{ S_n }[/math], while exponentially many smaller Gibbs states still exist. In the regime where the rigorous results about clustering and freezing hold, the term "Gibbs state" can be interchanged with "cluster" in this statement.
  • d1RSB is a statistical physics term for a 1RSB phase that is not condensed; d1RSB stands for "dynamic one-step replica symmetry breaking".
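
The following small sketch (illustrative only; on instances small enough to enumerate there is of course no long-distance behaviour to observe) just pins down the quantity in the first bullet, computed under the uniform measure on a brute-force enumerated solution set, with the 0/1 tuple representation of the earlier sketches:

<pre>
def correlation(sols, i, j):
    # Connected correlation E(x_i x_j) - E(x_i) E(x_j) under the uniform measure
    # on the enumerated solution set `sols` (a list of 0/1 tuples).
    m = len(sols)
    e_i = sum(a[i] for a in sols) / m
    e_j = sum(a[j] for a in sols) / m
    e_ij = sum(a[i] * a[j] for a in sols) / m
    return e_ij - e_i * e_j
</pre>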

Random 3-SAT versus large k-SAT

  • When k is large, it is rigorously known that 1RSB, clustering, and freezing (of most clusters) happen at the same constraint density (to leading order in k), whereas condensation and the satisfiability threshold occur close to each other and far beyond clustering. No algorithm is known to work in the clustered phase -- an open problem for more than 30 years.
  • The situation is crucially different when k is small. In 3-SAT (and also 3-COL), for instance, statistical physics predicts that the 1RSB and condensation transitions coincide; hence there is no d1RSB phase. Freezing happens relatively close to the satisfiability threshold. In 3-coloring the 1RSB threshold is at average degree 4, whereas polynomial-time algorithms are rigorously known to work up to average degree 4.03.

Locked constraint satisfaction problems

  • Fortunately, there are NP-complete CSPs where the description of the space of solutions is much simpler than in random k-SAT. In random locked CSPs the 1RSB, clustering, and freezing transitions coincide, and no condensed phase exists.
  • Consider, for instance, k-in-2k SAT: each clause contains 2k variables and is satisfied if and only if exactly k of them are TRUE and k are FALSE (no negations present), and every variable is included in at least 3 clauses. Then, throughout the whole satisfiable region, the space of solutions consists of single configurations separated by extensive Hamming distance.
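
A short satisfaction check for this problem (a sketch; the encoding of clauses as tuples of 1-based variable indices, and the assignment as a 0/1 tuple as in the earlier sketches, are my own choices):

<pre>
def satisfies_k_in_2k(clauses, a):
    # k-in-2k SAT: each clause is a tuple of 2k variable indices (no negations);
    # it is satisfied iff exactly half of its variables are set to True.
    return all(sum(a[v - 1] for v in clause) == len(clause) // 2 for clause in clauses)
</pre>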

Where are the hard instances?

  • The empirical finding that the hardest random k-SAT instances are found around the satisfiability threshold is well known, but it is not precise. The example of 3-coloring testifies that neither 1RSB nor condensation in the space of solutions poses a general algorithmic barrier. On the other hand, no polynomial-time algorithm is known to work (not even empirically) on any of the NP-complete problems in the phase where ALL clusters are frozen.

Random k-XOR-SAT

Finally, what about random k-XOR-SAT?

  • First, a technical point. If one does indeed generate instances by selecting the variables in each constraint uniformly at random (without insisting that each variable has degree at least 2, as one does in LDPC codes), then there will be variables which appear in 0 or 1 constraints. Clearly, such variables can be "repeatedly removed", leaving a "core" structure (a small sketch of this peeling step is given after this list). It is clear that if one finds a solution to the core, then this solution can be trivially extended to the rest. OK, what about solutions to this core?
  • Well, this core does indeed behave like -- you guessed it -- an error-correcting code, indeed one with absolutely no fuzz: each connected component has size 1. Intuitively, the reason is that if one is at a "codeword" and changes the value of a single variable V, since we are dealing with parity constraints, this will require flipping some other variable in each of the constraints that contain V, and since the underlying (hyper-/factor)graph is an expander, such "forced changes" will proliferate. Note that such a proliferation does not necessarily occur in random k-SAT or graph coloring, as it's possible to change a single variable and still have every constraint be satisfied. This "redundancy" in the solutions is precisely what gives rise to the aforementioned "fuzz" for those problems.
  • Finally, the heart of the matter, and what I believe is meant by Deolalikar when he speaks of "independent parameters". The cores/codewords in random k-XOR-SAT are linearly related. In contrast, in random k-SAT we have no reason to think that they should be, and one can, indeed, think of them as (more or less) randomly placed points inside the set of all value assignments, thus requiring "many" parameters to be specified. Alternatively, one can think of the random CSP instances as taking the randomness used to define the factor graph of the instance and "lifting it" to the high-dimensional space of solutions, by creating "randomly placed" pockets of solutions (clusters).
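
A minimal sketch of the peeling step mentioned in the first bullet of this list (the representation and function name are my own): repeatedly discard any constraint containing a variable of degree at most one, since such a constraint can always be satisfied last by choosing that variable's value; what remains is the core.

<pre>
def xorsat_core(constraints):
    # Each XOR constraint is given as a set of variable indices (the parities do
    # not matter for the peeling).  Repeatedly remove every constraint containing
    # a variable that appears in at most one remaining constraint; the fixed
    # point is the "core" of the instance.
    remaining = [set(c) for c in constraints]
    while True:
        degree = {}
        for c in remaining:
            for v in c:
                degree[v] = degree.get(v, 0) + 1
        keep = [c for c in remaining if all(degree[v] >= 2 for v in c)]
        if len(keep) == len(remaining):
            return remaining
        remaining = keep
</pre>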