In my last post I described consistent hashing, a way of distributing a dictionary (i.e., a key-value store) across a cluster of computers so that the distribution of keys changes only slowly as machines are added or removed from the cluster. In this post I’ll describe a different scheme for consistent hashing, a scheme that seems to have some advantages over the scheme described last time. In particular, the new scheme is based on some very well-understood properties of the prime numbers, a fact which makes some properties of the new scheme easier to analyse, and gives us a great deal of fine control over the properties of this approach to consistent hashing. At the same time, the new scheme also has some disadvantages, notably, it will be slower for very large clusters.

So far as I know, the scheme for consistent hashing described in this post is novel. But I’m new to the subject, my reading in the area is limited, and it’s entirely possible, maybe likely, that the scheme is old news, has deep flaws, or is clearly superseded by a better approach. So all I know for sure is that the approach is new to me, and seems at least interesting enough to write up in a blog post. Comments and pointers to related work gladly accepted!

The consistent hashing scheme I’ll describe is easy to implement, but before describing it I’m going to start with some slightly simpler approaches which don’t quite work. Strictly speaking, you don’t need to understand these, and could skip ahead to the final scheme. But I think you’ll get more insight into why the final scheme works if you work through these earlier ideas first.

Let’s start by recalling the naive method for distributing keys across a cluster that I described in my last post: the key $$k$$ is sent to machine number $$\mbox{hash}(k) \mod n$$, where $$\mbox{hash}(\cdot)$$ is some hash function, and $$n$$ is the number of machines in the cluster. This method certainly distributes data evenly across the cluster, but has the problem that when machines are added or removed from the cluster, huge amounts of data get redistributed amongst the machines in the cluster. Consistent hashing is an approach to distributing data which also distributes data evenly, but has the additional property that when a machine is added or removed from the cluster the distribution of data changes only slightly.
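To make the naive scheme concrete, here’s a minimal Python sketch. SHA-256 stands in for the unspecified hash function, and the key names and cluster sizes are purely illustrative:

```python
import hashlib

def h(key):
    """Hash a string key to a large integer (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

def naive_machine(key, n):
    """Naive scheme: key goes to machine hash(key) mod n."""
    return h(key) % n

keys = [f"key-{i}" for i in range(10000)]

# Growing the cluster from 10 to 11 machines relocates almost every key.
moved = sum(1 for k in keys if naive_machine(k, 10) != naive_machine(k, 11))
print(f"{moved / len(keys):.0%} of keys move")
```

A key stays put only if its hash has the same residue mod 10 and mod 11, which happens for only about one key in eleven; the other roughly ten elevenths all move.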

Let’s modify the naive hashing scheme to get a scheme which requires less redistribution of data. Imagine we have a dictionary distributed across two machines using naive hashing, i.e., we compute $$\mbox{hash}(k) \mod 2$$, and distribute it to machine $$0$$ if the result is $$0$$, and to machine $$1$$ if the result is $$1$$. Now we add a third machine to the cluster. We’ll redistribute data by computing $$\mbox{hash}(k) \mod 3$$, and moving any data for which this value is $$0$$ to the new machine. It should be easy to convince yourself (and is true, as we’ll prove later!) that one third of the data on both machines $$0$$ and $$1$$ will get moved to machine $$2$$. So we end up with the data distributed evenly across the three machines, and the redistribution step moves far less data around than the naive approach would.
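Here’s that two-to-three-machine move as a Python sketch (again with SHA-256 standing in for the hash function, and illustrative key names):

```python
import hashlib

def h(key):
    """Hash a string key to a large integer (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

keys = [f"key-{i}" for i in range(30000)]

# Two machines: key k lives on machine hash(k) mod 2.
placement = {k: h(k) % 2 for k in keys}

# Add a third machine: move every key with hash(k) mod 3 == 0 to it,
# recording which machine each moved key came from.
sources = [placement[k] for k in keys if h(k) % 3 == 0]

print(len(sources) / len(keys))            # close to 1/3
print(sources.count(0), sources.count(1))  # roughly equal
```

Roughly a third of the keys move, and because the residues mod 2 and mod 3 behave independently, the moved keys are drawn about equally from machines 0 and 1.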

The success of this approach suggests a general scheme for distributing keys across an $$n$$-machine cluster. The scheme is to allocate the key $$k$$ to machine $$j$$, where $$j$$ is the largest value in the range $$0,\ldots,n-1$$ for which $$\mbox{hash}(k) \mod (j+1) = 0$$. This scheme is certainly easy to implement. Unfortunately, while it works just fine for clusters of size $$n = 2$$ and $$n = 3$$, as we’ve seen, it breaks down when we add a fourth machine. To see why, let’s imagine that we’re expanding from three to four machines. So we imagine the keys are distributed across three machines, and then compute $$\mbox{hash}(k) \mod 4$$. If the result is $$0$$, we move that key to machine $$3$$. The problem is that any key for which $$\mbox{hash}(k) \mod 4 = 0$$ must also have $$\mbox{hash}(k) \mod 2 = 0$$. That means every key moved to machine $$3$$ comes from machine $$1$$ or machine $$2$$, never from machine $$0$$; indeed, machine $$1$$ loses half its keys in the move, while machine $$0$$ loses none. So we end up with an uneven distribution of keys across the cluster, with the moved keys drawn very unevenly from the existing machines.
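A quick simulation (a sketch, with SHA-256 as a stand-in hash) makes the imbalance visible: with four machines, machine 1 ends up with only about a sixth of the keys while machine 0 holds about a third:

```python
import hashlib
from collections import Counter

def h(key):
    """Hash a string key to a large integer (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

def machine(key, n):
    """Key goes to the largest j in 0..n-1 with hash(key) mod (j+1) == 0."""
    x = h(key)
    return max(j for j in range(n) if x % (j + 1) == 0)

keys = [f"key-{i}" for i in range(60000)]
counts = Counter(machine(k, 4) for k in keys)
print(sorted(counts.items()))
# The expected fractions work out to roughly 1/3, 1/6, 1/4, 1/4 for
# machines 0, 1, 2, 3 -- far from even.
```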

With a little thought it should be clear that the underlying problem here is that $$4$$ and $$2$$ have a common factor. This suggests a somewhat impractical way of resolving the problem: imagine you only allow clusters whose size is a prime number. That is, you allow clusters of size $$2, 3, 5, 7, 11$$, and so on, but not any of the sizes in between. You could then apply a similar scheme to that described above, but restricted to values of $$n$$ which are prime. More explicitly, suppose $$p_1 < p_2 < p_3 < \ldots < p_n$$ is an ascending sequence of primes. Let $$p_j$$ be the largest prime in this series for which $$\mbox{hash}(k) \mod p_j \geq p_{j-1}$$. Then key $$k$$ is stored on machine $$\mbox{hash}(k) \mod p_j$$. Note that we use the convention $$p_0 = 0$$ to decide which keys to store on machines $$0,\ldots,p_1-1$$.

Another way of understanding how this modified scheme works is to imagine that we have a cluster of size $$p$$ (a prime), and then add some more machines to expand the cluster to size $$q$$ (another prime). We redistribute the keys by computing $$\mbox{hash}(k) \mod q$$. If this is in the range $$p,\ldots,q-1$$, then we move the key to machine number $$\mbox{hash}(k) \mod q$$. Otherwise, the key stays where it is. It should be plausible (and we'll argue this in more detail below) that this results in the keys being redistributed evenly across the machines that have been added to the cluster, with each of the machines already in the cluster contributing equally to the redistribution. The main difference from our earlier approach is that instead of looking for $$\mbox{hash}(k) \mod q = 0$$, we accept a range of values other than $$0$$, from $$p$$ through $$q-1$$. But that difference is only a superficial matter of labelling; the underlying principle is the same.

The reason this scheme works is that the values of $$\mbox{hash}(k) \mod p_j$$ behave as independent random variables for different primes $$p_j$$.
To state that a little more formally, we have:

Theorem: Suppose $$X$$ is an integer-valued random variable uniformly distributed on the range $$0,\ldots,N$$. Suppose $$p_1,\ldots,p_m$$ are distinct primes, all much smaller than $$N$$, and define $$X_j \equiv X \mod p_j$$. Then in the limit as $$N$$ approaches $$\infty$$, the $$X_j$$ become independent random variables, with $$X_j$$ uniformly distributed on the range $$0,\ldots,p_j-1$$.

The proof of this theorem is an immediate consequence of the Chinese remainder theorem, and I’ll omit it. The theorem guarantees that when we extend the cluster to size $$q$$, a fraction $$1/q$$ of the keys on each of the existing machines is moved to each machine being added to the cluster. The result is both an even distribution of keys across the cluster and the minimal possible redistribution of keys, which is exactly what we desired.
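A sketch of the prime-sized-cluster scheme in Python (SHA-256 as a stand-in hash; the primes and key counts are illustrative). With the primes 2, 3, 5, 7 a seven-machine cluster should hold close to a seventh of the keys on each machine:

```python
import hashlib
from collections import Counter

def h(key):
    """Hash a string key to a large integer (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

def machine(key, primes):
    """Prime-sized scheme: find the largest prime p_j in the ascending list
    with hash(key) mod p_j >= p_{j-1} (with the convention p_0 = 0), and
    store the key on machine hash(key) mod p_j."""
    x = h(key)
    prev = 0
    dest = None
    for p in primes:
        if x % p >= prev:
            dest = x % p
        prev = p
    return dest

keys = [f"key-{i}" for i in range(70000)]
counts = Counter(machine(k, [2, 3, 5, 7]) for k in keys)
print(sorted(counts.items()))  # machines 0..6, each with roughly 1/7 of the keys
```

Notice that machine numbers $$5$$ and $$6$$ are served by residues mod $$7$$, machines $$3$$ and $$4$$ by residues mod $$5$$, and so on down the chain, with the independence of the residues keeping the shares equal.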

Obviously this hashing scheme for prime-sized clusters is too restrictive to be practical. Although primes occur quite often (roughly one in every $$\ln(n)$$ numbers near $$n$$ is prime), it’s still quite a restriction. And it’s going to make life more complicated on large clusters, where we’d like to make the scheme tolerant to machines dropping off the cluster.

Fortunately, there’s an extension of prime-sized hashing which does provide a consistent hashing scheme with all the properties we desire. Here it is. We choose primes $$p_0,p_1, \ldots$$, all greater than some large number $$M$$, say $$M = 10^9$$. What matters about $$M$$ is that it be much larger than the largest number of computers we might ever want in the cluster. For each $$j$$ choose an integer $$t_j$$ so that:

$$\frac{t_j}{p_j} \approx \frac{1}{j+1}.$$

Note that it is convenient to set $$t_0 = p_0$$, so that machine $$0$$ always qualifies. Our consistent hashing procedure for $$n$$ machines is to send key $$k$$ to machine $$j$$, where $$j$$ is the largest value such that (a) $$j < n$$, and (b) $$\mbox{hash}(k) \mod p_j$$ is in the range $$0$$ through $$t_j-1$$. Put another way, if we add another machine to an $$n$$-machine cluster, then for each key $$k$$ we compute $$\mbox{hash}(k) \mod p_n$$, and redistribute any keys for which this value is in the range $$0$$ through $$t_n-1$$. By the theorem above this is a fraction roughly $$1/(n+1)$$ of the keys. As a result, the keys are distributed evenly across the cluster, and the redistributed keys are drawn evenly from each machine in the existing cluster.

Performance-wise, for large clusters this scheme is slower than the consistent hashing scheme described in my last post. The slowdown comes because the current scheme requires the computation of $$\mbox{hash}(k) \mod p_j$$ for $$n$$ different primes. By contrast, the earlier consistent hashing scheme used only a search of a sorted list of (typically) at most $$n^2$$ elements, which can be done in logarithmic time. On the other hand, modular arithmetic can be done very quickly, so I don't expect this to be a serious bottleneck, except for very large clusters.

Analytically, the scheme in the current post seems to me likely to be preferable: results like the Chinese remainder theorem give us a lot of control over the solution of congruences, and this makes it easy to understand the behaviour of this scheme and many natural variants. For instance, if some machines are bigger than others, it's easy to balance the load in proportion to machine capacity by changing the threshold numbers $$t_j$$.
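Putting the full scheme together as a Python sketch: SHA-256 stands in for the hash function, the primes listed are a handful just above M = 10^9, and the thresholds are rounded so that t_j / p_j is close to 1/(j+1). All names and sizes here are illustrative choices, not prescribed by the scheme:

```python
import hashlib
from collections import Counter

def h(key):
    """Hash a string key to a large integer (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

# A handful of primes just above M = 10^9 (supports up to 8 machines here).
PRIMES = [1000000007, 1000000009, 1000000021, 1000000033,
          1000000087, 1000000093, 1000000097, 1000000103]

# Thresholds t_j with t_j / p_j close to 1/(j+1); t_0 = p_0, so machine 0
# always qualifies and every key has somewhere to go.
THRESHOLDS = [p // (j + 1) if j else p for j, p in enumerate(PRIMES)]

def machine(key, n):
    """Send the key to the largest j < n with hash(key) mod p_j < t_j."""
    x = h(key)
    return max(j for j in range(n) if x % PRIMES[j] < THRESHOLDS[j])

keys = [f"key-{i}" for i in range(40000)]

counts = Counter(machine(k, 4) for k in keys)
print(sorted(counts.items()))  # roughly a quarter of the keys on each machine

# Growing from 4 to 5 machines moves only the keys whose hash mod p_4 falls
# below t_4 -- roughly a fifth of them, drawn evenly from machines 0..3.
moved = sum(1 for k in keys if machine(k, 5) == 4)
print(moved / len(keys))  # close to 1/5
```

Capacity-aware balancing falls out naturally here: give a bigger machine $$j$$ a larger threshold $$t_j$$ and it attracts proportionally more keys, exactly as described above.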
This type of balancing can also be achieved using our earlier approach to consistent hashing, by changing the number of replica points, but the effect on things like the evenness of the distribution and redistribution of keys requires more work to analyse in that case.

I'll finish this post as I finished the earlier post, by noting that there are many natural followup questions: what's the best way to cope with servers of different sizes; how to add and remove more than one machine at a time; how to cope with replication and fault-tolerance; how to migrate data while jobs are running (including backups); and how best to back up a distributed dictionary, anyway? Hopefully it's easy to at least get started on answering these questions at this point.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. You can subscribe to my blog to follow future posts in the series. If you’re in Waterloo, and would like to attend fortnightly meetups of a group that discusses the posts and other topics related to distributed computing and data mining, drop me an email at mn@michaelnielsen.org.

