Biweekly links for 06/05/2009

Click here for all of my del.icio.us bookmarks.

Published

A number-theoretic approach to consistent hashing

In my last post I described consistent hashing, a way of distributing a dictionary (i.e., a key-value store) across a cluster of computers so that the distribution of keys changes only slowly as machines are added or removed from the cluster. In this post I’ll describe a different scheme for consistent hashing, a scheme that seems to have some advantages over the scheme described last time. In particular, the new scheme is based on some very well-understood properties of the prime numbers, a fact which makes the new scheme easier to analyse, and gives us a great deal of fine control over its behaviour. At the same time, the new scheme also has some disadvantages, notably that it will be slower for very large clusters.

So far as I know, the scheme for consistent hashing described in this post is novel. But I’m new to the subject, my reading in the area is limited, and it’s entirely possible, maybe likely, that the scheme is old news, has deep flaws, or is clearly superseded by a better approach. So all I know for sure is that the approach is new to me, and seems at least interesting enough to write up in a blog post. Comments and pointers to related work gladly accepted!

The consistent hashing scheme I’ll describe is easy to implement, but before describing it I’m going to start with some slightly simpler approaches which don’t quite work. Strictly speaking, you don’t need to understand these, and could skip ahead to the final scheme. But I think you’ll get more insight into why the final scheme works if you work through these earlier ideas first.

Let’s start by recalling the naive method for distributing keys across a cluster that I described in my last post: the key [tex]k[/tex] is sent to machine number [tex]\mbox{hash}(k) \mod n[/tex], where [tex]\mbox{hash}(\cdot)[/tex] is some hash function, and [tex]n[/tex] is the number of machines in the cluster. This method certainly distributes data evenly across the cluster, but has the problem that when machines are added or removed from the cluster, huge amounts of data get redistributed amongst the machines in the cluster. Consistent hashing is an approach to distributing data which also distributes data evenly, but has the additional property that when a machine is added or removed from the cluster the distribution of data changes only slightly.

Let’s modify the naive hashing scheme to get a scheme which requires less redistribution of data. Imagine we have a dictionary distributed across two machines using naive hashing, i.e., we compute [tex]\mbox{hash}(k) \mod 2[/tex], and send the key to machine [tex]0[/tex] if the result is [tex]0[/tex], and to machine [tex]1[/tex] if the result is [tex]1[/tex]. Now we add a third machine to the cluster. We’ll redistribute data by computing [tex]\mbox{hash}(k) \mod 3[/tex], and moving any data for which this value is [tex]0[/tex] to the new machine. It should be easy to convince yourself (and is true, as we’ll prove later!) that one third of the data on each of machines [tex]0[/tex] and [tex]1[/tex] will get moved to machine [tex]2[/tex], the new machine. So we end up with the data distributed evenly across the three machines, and the redistribution step moves far less data around than in the naive approach.

The success of this approach suggests a general scheme for distributing keys across an [tex]n[/tex]-machine cluster. The scheme is to allocate the key [tex]k[/tex] to machine [tex]j[/tex], where [tex]j[/tex] is the largest value in the range [tex]0,\ldots,n-1[/tex] for which [tex]\mbox{hash}(k) \mod (j+1) = 0[/tex]. This scheme is certainly easy to implement. Unfortunately, while it works just fine for clusters of size [tex]n = 2[/tex] and [tex]n = 3[/tex], as we’ve seen, it breaks down when we add a fourth machine. To see why, let’s imagine that we’re expanding from three to four machines. So we imagine the keys are distributed across three machines, and then compute [tex]\mbox{hash}(k) \mod 4[/tex]. If the result is [tex]0[/tex], we move that key to machine [tex]3[/tex]. The problem is that any key for which [tex]\mbox{hash}(k) \mod 4 = 0[/tex] must also have [tex]\mbox{hash}(k) \mod 2 = 0[/tex]. That means every key moved to machine [tex]3[/tex] comes from machine [tex]1[/tex] or machine [tex]2[/tex], never from machine [tex]0[/tex], and machine [tex]1[/tex] loses roughly half of its keys. So we end up with an uneven distribution of keys across the cluster, and the keys that do move are drawn very unevenly from the existing machines.
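
To see the breakdown concretely, here’s a quick simulation of my own (not part of the scheme itself): it uses the integers [tex]0,\ldots,N-1[/tex] as stand-ins for hash values, and applies the allocation rule above for clusters of size three and four.

from collections import Counter

def machine(h, n):
  '''Return the largest j in 0,...,n-1 with h mod (j+1) == 0.'''
  return max(j for j in range(n) if h % (j + 1) == 0)

N = 600000   # a multiple of 12, so the fractions below come out exactly
three = [machine(h, 3) for h in range(N)]
four = [machine(h, 4) for h in range(N)]

print "n=3 distribution:", Counter(three)  # 1/3, 1/3, 1/3 -- even
print "n=4 distribution:", Counter(four)   # 1/3, 1/6, 1/4, 1/4 -- uneven
moved = sum(1 for old, new in zip(three, four) if old != new)
print "fraction moved:", moved / float(N)  # 1/4, drawn only from machines 1 and 2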

With a little thought it should be clear that the underlying problem here is that [tex]4[/tex] and [tex]2[/tex] have a common factor. This suggests a somewhat impractical way of resolving the problem: imagine you only allow clusters whose size is a prime number. That is, you allow clusters of size [tex]2, 3, 5, 7, 11[/tex], and so on, but not any of the sizes in between. You could then apply a similar scheme to that described above, but restricted to values of [tex]n[/tex] which are prime.

More explicitly, suppose [tex]p_1 < p_2 < p_3 < \ldots < p_n[/tex] is an ascending sequence of primes. Let [tex]p_j[/tex] be the largest prime in this sequence for which [tex]\mbox{hash}(k) \mod p_j \geq p_{j-1}[/tex]. Then key [tex]k[/tex] is stored on machine [tex]\mbox{hash}(k) \mod p_j[/tex]. Note that we use the convention [tex]p_0 = 0[/tex] to decide which keys to store on machines [tex]0,\ldots,p_1-1[/tex].

Another way of understanding how this modified scheme works is to imagine that we have a cluster of size [tex]p[/tex] (a prime), and then add some more machines to expand the cluster to size [tex]q[/tex] (another prime). We redistribute the keys by computing [tex]\mbox{hash}(k) \mod q[/tex]. If this is in the range [tex]p,\ldots,q-1[/tex], then we move the key to machine number [tex]\mbox{hash}(k) \mod q[/tex]. Otherwise, the key stays where it is. It should be plausible (and we'll argue this in more detail below) that this results in the keys being redistributed evenly across the machines that have been added to the cluster. Furthermore, each of the machines that starts in the cluster contributes equally to the redistribution. The main difference from our earlier approach is that instead of looking for [tex]\mbox{hash}(k) \mod q = 0[/tex], we consider a range of values other than [tex]0[/tex], from [tex]p,\ldots,q-1[/tex]. But that difference is only a superficial matter of labelling; the underlying principle is the same.

The reason this scheme works is that the values of [tex]\mbox{hash}(k) \mod p_j[/tex] behave as independent random variables for different primes [tex]p_j[/tex]. To state that a little more formally, we have:

Theorem: Suppose [tex]X[/tex] is an integer-valued random variable uniformly distributed on the range [tex]0,\ldots,N[/tex]. Suppose [tex]p_1,\ldots,p_m[/tex] are distinct primes, all much smaller than [tex]N[/tex], and define [tex]X_j \equiv X \mod p_j[/tex]. Then in the limit as [tex]N[/tex] approaches [tex]\infty[/tex], the [tex]X_j[/tex] become independent random variables, with [tex]X_j[/tex] uniformly distributed on the range [tex]0,\ldots,p_j-1[/tex].

The proof of this theorem is an immediate consequence of the Chinese remainder theorem, and I’ll omit it. The theorem guarantees that when we extend the cluster to size [tex]q[/tex], a fraction [tex]1/q[/tex] of the keys on each of the existing machines is moved to each machine being added to the cluster. This results in both an even distribution of keys across the cluster, and the minimal possible redistribution of keys, which is exactly what we desired.
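
To make the prime-sized scheme concrete, here’s a short Python sketch. The particular primes, the key names, and the function prime_machine are my own illustrative choices; the hash is taken from hashlib so that it behaves like the uniform hash assumed in the theorem.

import hashlib

PRIMES = [2, 3, 5, 7, 11, 13]   # the allowed cluster sizes, in ascending order

def prime_machine(key, num_primes):
  '''Machine for key in a cluster of size PRIMES[num_primes-1]: computes
  hash(key) mod p_j, where p_j is the largest of the first num_primes
  primes with hash(key) mod p_j >= p_{j-1} (using the convention p_0 = 0).'''
  h = int(hashlib.md5(key).hexdigest(), 16)
  for j in reversed(range(num_primes)):
    lower = PRIMES[j-1] if j > 0 else 0
    if h % PRIMES[j] >= lower:
      return h % PRIMES[j]

# Growing the cluster from 5 machines (p = 5) to 7 machines (q = 7) should
# move only the keys whose new machine number lands in 5,...,6.
keys = ["key%d" % i for i in range(100000)]
moved = sum(1 for k in keys if prime_machine(k, 4) != prime_machine(k, 3))
print "fraction moved going from 5 to 7 machines:", moved / 100000.0   # roughly 2/7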

Obviously this hashing scheme for prime-sized clusters is too restrictive to be practical. Although primes occur quite often (roughly one in every [tex]\ln(n)[/tex] numbers near [tex]n[/tex] is prime), it’s still quite a restriction. And it’s going to make life more complicated on large clusters, where we’d like to make the scheme tolerant to machines dropping off the cluster.

Fortunately, there’s an extension of prime-sized hashing which does provide a consistent hashing scheme with all the properties we desire. Here it is. We choose primes [tex]p_0,p_1, \ldots[/tex], all greater than some large number [tex]M[/tex], say [tex]M = 10^9[/tex]. What matters about [tex]M[/tex] is that it be much larger than the largest number of computers we might ever want in the cluster. For each [tex]j[/tex] choose an integer [tex]t_j[/tex] so that:

[tex] \frac{t_j}{p_j} \approx \frac{1}{j+1}. [/tex]

Note that it is convenient to set [tex]t_0 = p_0[/tex]. Our consistent hashing procedure for [tex]n[/tex] machines is to send key [tex]k[/tex] to machine [tex]j[/tex], where [tex]j[/tex] is the largest value such that (a) [tex]j < n[/tex], and (b) [tex]\mbox{hash}(k) \mod p_j[/tex] is in the range [tex]0[/tex] through [tex]t_j-1[/tex]. Put another way, if we add another machine to an [tex]n[/tex]-machine cluster, then for each key [tex]k[/tex] we compute [tex]\mbox{hash}(k) \mod p_n[/tex], and redistribute any keys for which this value is in the range [tex]0[/tex] through [tex]t_n-1[/tex]. By the theorem above this is a fraction roughly [tex]1/(n+1)[/tex] of the keys. As a result, the keys are distributed evenly across the cluster, and the redistributed keys are drawn evenly from each machine in the existing cluster. (A short Python sketch of this procedure appears at the end of the post.)

Performance-wise, for large clusters this scheme is slower than the consistent hashing scheme described in my last post. The slowdown comes because the current scheme requires the computation of [tex]\mbox{hash}(k) \mod p_j[/tex] for [tex]n[/tex] different primes. By contrast, the earlier consistent hashing scheme used only a search of a sorted list of (typically) at most [tex]n^2[/tex] elements, which can be done in logarithmic time. On the other hand, modular arithmetic can be done very quickly, so I don't expect this to be a serious bottleneck, except for very large clusters.

Analytically, the scheme in the current post seems to me likely to be preferable: results like the Chinese remainder theorem give us a lot of control over the solution of congruences, and this makes it very easy to understand the behaviour of this scheme and many natural variants. For instance, if some machines are bigger than others, it's easy to balance the load in proportion to machine capacity by changing the threshold numbers [tex]t_j[/tex]. This type of balancing can also be achieved using our earlier approach to consistent hashing, by changing the number of replica points, but the effect on things like the evenness of the distribution and redistribution of keys requires more work to analyse in that case.

I'll finish this post as I finished the earlier post, by noting that there are many natural followup questions: what's the best way to cope with servers of different sizes; how to add and remove more than one machine at a time; how to cope with replication and fault-tolerance; how to migrate data when jobs are going on (including backups); and how best to back up a distributed dictionary, anyway? Hopefully it's easy to at least get started on answering these questions at this point.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. You can subscribe to my blog to follow future posts in the series. If you're in Waterloo, and would like to attend fortnightly meetups of a group that discusses the posts and other topics related to distributed computing and data mining, drop me an email at mn@michaelnielsen.org.
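
Here, as promised, is a minimal Python sketch of the scheme. It's an illustration under my own choices, not a production implementation: the primes above [tex]M = 10^9[/tex] are found by simple trial division rather than fixed in advance, the thresholds are taken to be [tex]t_j = \lfloor p_j/(j+1) \rfloor[/tex], and the function names are mine.

import hashlib

def is_prime(m):
  '''Trial division is plenty fast for the handful of primes we need.'''
  if m % 2 == 0: return m == 2
  d = 3
  while d * d <= m:
    if m % d == 0: return False
    d += 2
  return True

def primes_above(M, count):
  '''Return the first count primes greater than M.'''
  primes, m = [], M + 1
  while len(primes) < count:
    if is_prime(m): primes.append(m)
    m += 1
  return primes

M = 10**9                      # much larger than any cluster we'll ever want
MAX_MACHINES = 10
PRIMES = primes_above(M, MAX_MACHINES)
THRESHOLDS = [PRIMES[j] // (j + 1) for j in range(MAX_MACHINES)]   # t_j ~ p_j/(j+1)

def machine_for(key, n):
  '''Machine (0,...,n-1) for key in an n-machine cluster: the largest j < n
  with hash(key) mod p_j in the range 0,...,t_j - 1.  Since t_0 = p_0, the
  test always succeeds at j = 0.'''
  h = int(hashlib.md5(key).hexdigest(), 16)
  for j in reversed(range(n)):
    if h % PRIMES[j] < THRESHOLDS[j]:
      return j

keys = ["key%d" % i for i in range(100000)]
moved = sum(1 for k in keys if machine_for(k, 5) != machine_for(k, 4))
print "fraction moved adding a 5th machine:", moved / 100000.0   # roughly 1/5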

Published
Categorized as GTS

Consistent hashing

Today I get back into my post series about the Google Technology Stack, with a more detailed look at distributed dictionaries, AKA distributed key-value stores, AKA distributed hash tables. What we’d like to do is store a dictionary of key-value pairs [tex](k_1,v_1),(k_2,v_2),\ldots[/tex] across a cluster of computers, preferably in a way that makes it easy to manipulate the dictionary without having to think about the details of the cluster.

The reason we’re interested in distributed dictionaries is that they’re used as input and output to the MapReduce framework for distributed computing. Of course, that’s not the only reason distributed dictionaries are interesting – they’re useful for many other purposes (e.g., distributed caching). But for the purposes of this post, we’ll imagine our distributed dictionaries are being used as the input and output from a MapReduce job.

I’ll describe two ways of implementing distributed dictionaries. The first is a naive method that’s simple and works pretty well. Then I’ll introduce an improved method called “consistent hashing”. Consistent hashing isn’t the final word in distributed dictionaries, but it’s a good start. All the usual caveats apply to the post: I’m still learning this stuff, and corrections, improvements, etcetera are welcome!

Let’s start with the naive method. Suppose we have a cluster of [tex]n[/tex] computers. We number the computers [tex]0,1,2,\ldots,n-1[/tex], and then store the key-value pair [tex](k,v)[/tex] on computer number [tex]\mbox{hash}(k) \mod n[/tex], where [tex]\mbox{hash}(\cdot)[/tex] is a hash function. If we’re using a good hash function, then [tex]\mbox{hash}(k) \mod n[/tex] is uniform across [tex]0,\ldots,n-1[/tex] for any reasonable distribution of keys. This ensures our distributed dictionary is spread evenly across the computers in our cluster, and doesn’t build up too much on any individual computer.

As a brief Pythonic aside, Python’s builtin hash function doesn’t have this property of spreading keys out uniformly. It often maps similar key strings to nearly identical hash values. Here’s a typical sample (results may vary on your machine):

hash("answer0") -> 1291065813204512909

hash("answer1") -> 1291065813204512908

hash("answer2") -> 1291065813204512911

What this means is that you can’t depend on Python’s builtin hash function to spread keys and values uniformly across the cluster. Far better hash functions are available through the hashlib library; those functions do a much better job of spreading keys out uniformly, but they’re quite a lot slower – in a little benchmark comparing hash to hashlib.md5, I found hash to be about ten times faster. We’ll use the hashlib library here; in practice I imagine you’d want something faster, but without the clustering behaviour of hash. End of Pythonic aside.
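
For concreteness, here’s a minimal sketch of the naive scheme using an md5-based hash, in line with the aside above; the function name naive_machine is my own, just for illustration.

import hashlib

def naive_machine(key, n):
  '''Number of the machine (0,...,n-1) that key is stored on: hash(key) mod n,
  with the hash taken from hashlib.md5 so that keys spread out uniformly.'''
  return int(hashlib.md5(key).hexdigest(), 16) % n

print naive_machine("answer0", 10)   # some machine number in 0,...,9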

Naive hash-based distributed dictionaries are simple, but they have serious limitations. Imagine you’re using a cluster of computers to crawl the web. You store the results of your crawl in a distributed dictionary. But as the size of the crawl grows, you’ll want to add machines to your cluster. Suppose you add even just a single machine. Instead of computing [tex]\mbox{hash}(k) \mod n[/tex] we’re now computing [tex]\mbox{hash}(k) \mod (n+1)[/tex]. The result is that each key-value pair will get reallocated completely at random across the cluster. You’ll end up moving a fraction [tex]n/(n+1)[/tex] of your data to new machines – i.e., nearly all of it. This will be slow, and might potentially be expensive. It’s also potentially inconvenient if jobs are being carried out. Similar problems arise if you add a larger block of machines to the cluster, or if you lose some machines (e.g., if some of the machines fail). You get the idea.
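
Reusing the naive_machine sketch from the aside above, a quick count shows how much data moves when a single machine is added to a ten-machine cluster (the keys here are arbitrary made-up strings):

# How many keys change machines when a 10-machine cluster grows to 11?
keys = ["url%d" % i for i in range(100000)]
moved = sum(1 for k in keys if naive_machine(k, 10) != naive_machine(k, 11))
print "fraction of keys moved:", moved / 100000.0   # close to 10/11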

Consistent hashing solves these problems. Like naive hashing, consistent hashing spreads the distributed dictionary evenly across the cluster. But unlike naive hashing, consistent hashing requires only a relatively small amount of data to be moved: if you add a machine to the cluster, only the data that needs to live on that machine is moved there; all the other data stays where it is.

To understand how consistent hashing works, imagine wrapping the unit interval [tex][0,1)[/tex] onto a circle:


[Image: consistent_hashing_1.PNG]

Suppose we number the machines [tex]0,\ldots,n-1[/tex]. If the hash function has range [tex][0,R)[/tex] then we rescale the hash function via [tex]x \rightarrow \mbox{hash}(x)/R[/tex], so that the hash function maps into the range [tex][0,1)[/tex], i.e., effectively onto the circle. Then we can hash machine number [tex]j[/tex] to a point [tex]\mbox{hash}(j)[/tex] on the circle, for each machine in the range [tex]j = 0,1,\ldots,n-1[/tex]. Here’s what it might look like for an [tex]n=3[/tex] machine cluster:


[Image: consistent_hashing_2.PNG]

The points will be randomly distributed around the circle. Now suppose we have a key-value pair we want to store in the distributed dictionary. We simply hash the key onto the circle, and then store the key-value pair on the first machine that appears clockwise of the key’s hash point. E.g., for the key shown here, the key-value pair is stored on machine number [tex]1[/tex]:


[Image: consistent_hashing_3.PNG]

Because of the uniformity of the hash function, a fraction roughly [tex]1/n[/tex] of the key-value pairs will get stored on any single machine.

Now imagine we add an extra machine into the cluster. It goes to the point [tex]\mbox{hash}(n)[/tex]:


[Image: consistent_hashing_4.PNG]

Most of the key-value pairs are completely unaffected by this change. But we can see that some of the key-value pairs that were formerly stored on machine [tex]1[/tex] (including our example key-value pair) will need to be moved to the new machine. The fraction that needs to be moved will typically be [tex]1/(n+1)[/tex] of the total, a much smaller fraction than was the case for naive hashing.

The procedure I’ve described isn’t quite ideal, for a couple of reasons. First, the distribution of keys can be pretty irregular. If, for example, two of the machines – say [tex]j[/tex] and [tex]j'[/tex] – get mapped to very nearby points on the circle, then one of those machines may end up with very few of the keys, and some other machines with correspondingly more. It’s not that serious a problem, but why waste even a single machine? Second, when a machine is added to the cluster, all the keys redistributed to that machine come from just one other machine. Ideally, the keys would come in a more balanced way from several other machines.

Both these problems are easily resolved. Instead of mapping machine number [tex]j[/tex] to a single point on the circle, we’ll map it to multiple points (“replicas”). In particular, let’s imagine that for each machine we pick out [tex]r[/tex] points, by hashing [tex](j,0), (j,1),\ldots,(j,r-1)[/tex] onto the circle. Otherwise, everything works exactly as before. By adding these replicas, we both increase the uniformity with which key-value pairs are mapped to machines, and ensure that when a machine is added to the cluster, it receives only a small fraction ([tex]\Theta(1/(rn))[/tex]) of the keys from each of [tex]\Theta(r)[/tex] other machines in the cluster.

Here’s a simple Python program that implements consistent hashing. It’s largely self-explanatory; if you’d like to play around with it, I recommend reading the doc string for the ConsistentHash class.

'''consistent_hashing.py is a simple demonstration of consistent
hashing.'''

import bisect
import hashlib

class ConsistentHash:
  '''ConsistentHash(n,r) creates a consistent hash object for a 
  cluster of size n, using r replicas.

  It has three attributes. num_machines and num_replicas are
  self-explanatory.  hash_tuples is a list of tuples (j,k,hash), 
  where j ranges over machine numbers (0...n-1), k ranges over 
  replicas (0...r-1), and hash is the corresponding hash value, 
  in the range [0,1).  The tuples are sorted by increasing hash 
  value.

  The class has a single instance method, get_machine(key), which
  returns the number of the machine to which key should be 
  mapped.'''

  def __init__(self,num_machines=1,num_replicas=1):
    self.num_machines = num_machines
    self.num_replicas = num_replicas
    hash_tuples = [(j,k,my_hash(str(j)+"_"+str(k))) \
                   for j in range(self.num_machines) \
                   for k in range(self.num_replicas)]
    # Sort the hash tuples by hash value
    hash_tuples.sort(key=lambda x: x[2])
    self.hash_tuples = hash_tuples

  def get_machine(self,key):
    '''Returns the number of the machine which key gets sent to.'''
    h = my_hash(key)
    # edge case where we cycle past hash value of 1 and back to 0.
    if h > self.hash_tuples[-1][2]: return self.hash_tuples[0][0]
    hash_values = map(lambda x: x[2],self.hash_tuples)
    index = bisect.bisect_left(hash_values,h)
    return self.hash_tuples[index][0]

def my_hash(key):
  '''my_hash(key) returns a hash in the range [0,1).'''
  return (int(hashlib.md5(key).hexdigest(),16) % 1000000)/1000000.0

def main():
  ch = ConsistentHash(7,3)
  print "Format:"
  print "(machine,replica,hash value):"
  for (j,k,h) in ch.hash_tuples: print "(%s,%s,%s)" % (j,k,h)
  while True:
    print "\nPlease enter a key:"
    key = raw_input()
    print "\nKey %s maps to hash %s, and so to machine %s" \
        % (key,my_hash(key),ch.get_machine(key))

if __name__ == "__main__": main()

Having understood the basics of consistent hashing, we can ask many natural followup questions: what’s the best way to cope with servers of different sizes; how to add and remove more than one machine at a time; how to cope with replication and fault-tolerance; how to migrate data when jobs are going on (including backups); and how best to back up a distributed dictionary, anyway? Hopefully it’s easy to get started on answering these questions at this point.

Let me finish up with a little background history. Consistent hashing was introduced pretty recently, in 1997, in a pair of papers, one describing the theory, the other describing implementation. Note that the original papers were focused on applications to caching on the world wide web, not to distributed computing applications like MapReduce. A good (and funny!) basic introduction to consistent hashing is here. It’s now widely used inside services like the popular memcached caching system and Amazon’s Dynamo key-value store.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. The posts are based on fortnightly meetups of a small group in Waterloo – email me (mn@michaelnielsen.org) if you’re local and interested in attending. Subscribe to my blog to follow future posts in the series.

Published
Categorized as GTS

Subtle Technologies

If you’re interested in art and science, and live in or near Toronto, then you might enjoy the Subtle Technologies Festival. It’s being held June 10-14 at Innis Town Hall in Toronto, and there are still a few tickets available. In particular, you might want to attend the Symposium (June 12-14). I expect to attend for most of the Symposium, and will be giving a talk on mass collaboration. Here’s the blurb from the Festival organizers – there’s much more at the website:

For 5 days in June, an impressive line up of scientists, designers and new media artists will journey from around the world to gather and share innovations at the 12th annual Subtle Technologies Festival.

This year’s festival is inspired by Networks of all kinds: neural, biological, genetic, virtual, social and dynamic. What will our ubiquitously networked, computer-driven future look like? How can we tell compelling stories through social networks? Take a closer look at the internal networks that drive our every action.

True to this year’s theme, the Subtle Technologies Festival will simultaneously take place at The University of Toronto’s Innis Town Hall, and in the seductive and animated virtual network, “Second Life”, where people from around the globe will engage with the festival via a live feed.

Join artists, designers and scientists who will share their work through 5 days and evenings of workshops, symposiums, exhibitions and performances. Learn how artists make art from bacteria, tell stories through social networks, use open source software to distribute multimedia, attend network-based performances from around the globe, and attend a networking evening of music and mingling.

[…]

Published

Biweekly links for 06/01/2009

  • Condensed concepts
    • Ross McKenzie’s excellent blog on condensed matter physics.
  • Square root of x divided by zero: The speed, size and dependability of programming languages
    • Terrific visualizations comparing the performance and expressivity of different programming languages. Python (with Psyco) does very well: extremely expressive, with decent performance. OCaml is faster, and nearly as expressive.
  • Reed Elsevier Annual Report 2008
    • Revenue for their science-technology-medicine publishing division, Elsevier, was 1,700 million (up 4% at constant currencies), with a profit of 568 million (up 11% at constant currencies). Elsevier makes up 32% of Reed-Elsevier’s revenue, and 41% of their profit. Amounts appear to be in pounds sterling(?)
  • Kalenjin people – Wikipedia, the free encyclopedia
    • “The Kalenjin have been called by some “the running tribe.”… From 1980 on, about 40% of the top honors available to men in international athletics at these distances (Olympic medals, World Championships medals, and World Cross Country Championships honors) have been earned by Kalenjin.”
  • Google Wave API – Google Code

Click here for all of my del.icio.us bookmarks.

Published