Write your first MapReduce program in 20 minutes

The slow revolution

Some revolutions are marked by a single, spectacular event: the storming of the Bastille during the French Revolution, or the destruction of the towers of the World Trade Center on September 11, 2001, which so changed the US’s relationship with the rest of the world. But often the most important revolutions aren’t announced with the blare of trumpets. They occur softly, too slow for a single news cycle, but fast enough that if you aren’t alert, the revolution is over before you’re aware it’s happening.

Such a revolution is happening right now in computing. Microprocessor clock speeds have stagnated since about 2000. Major chipmakers such as Intel and AMD continue to wring out improvements in speed by improving on-chip caching and other clever techniques, but they are gradually hitting the point of diminishing returns. Instead, as transistors continue to shrink in size, the chipmakers are packing multiple processing units onto a single chip. Most computers shipped today use multi-core microprocessors, i.e., chips with 2 (or 4, or 8, or more) separate processing units on the main microprocessor.

The result is a revolution in software development. We’re gradually moving from the old world in which multiple processor computing was a special case, used only for boutique applications, to a world in which it is widespread. As this movement happens, software development, so long tailored to single-processor models, is seeing a major shift in some its basic paradigms, to make the use of multiple processors natural and simple for programmers.

This movement to multiple processors began decades ago. Projects such as the Connection Machine demonstrated the potential of massively parallel computing in the 1980s. In the 1990s, scientists became large-scale users of parallel computing, using parallel computing to simulate things like nuclear explosions and the dynamics of the Universe. Those scientific applications were a bit like the early scientific computing of the late 1940s and 1950s: specialized, boutique applications, built with heroic effort using relatively primitive tools. As computing with multiple processors becomes widespread, though, we’re seeing a flowering of general-purpose software development tools tailored to multiple processor environments.

One of the organizations driving this shift is Google. Google is one of the largest users of multiple processor computing in the world, with its entire computing cluster containing hundreds of thousands of commodity machines, located in data centers around the world, and linked using commodity networking components. This approach to multiple processor computing is known as distributed computing; the characteristic feature of distributed computing is that the processors in the cluster don’t necessarily share any memory or disk space, and so information sharing must be mediated by the relatively slow network.

In this post, I’ll describe a framework for distributed computing called MapReduce. MapReduce was introduced in a paper written in 2004 by Jeffrey Dean and Sanjay Ghemawat from Google. What’s beautiful about MapReduce is that it makes parallelization almost entirely invisible to the programmer who is using MapReduce to develop applications. If, for example, you allocate a large number of machines in the cluster to a given MapReduce job, the job runs in a highly parallelized way. If, on the other hand, you allocate only a small number of machines, it will run in a much more serial way, and it’s even possible to run jobs on just a single machine.

What exactly is MapReduce? From the programmer’s point of view, it’s just a library that’s imported at the start of your program, like any other library. It provides a single library call that you can make, passing in a description of some input data and two ordinary serial functions (the “mapper” and “reducer”) that you, the programmer, specify in your favorite programming language. MapReduce then takes over, ensuring that the input data is distributed through the cluster, and computing those two functions across the entire cluster of machines, in a way we’ll make precise shortly. All the details – parallelization, distribution of data, tolerance of machine failures – are hidden away from the programmer, inside the library.

What we’re going to do in this post is learn how to use the MapReduce library. To do this, we don’t need a big sophisticated version of the MapReduce library. Instead, we can get away with a toy implementation (just a few lines of Python!) that runs on a single machine. By using this single-machine toy library we can learn how to develop for MapReduce. The programs we develop will run essentially unchanged when, in later posts, we improve the MapReduce library so that it can run on a cluster of machines.

Our first MapReduce program

Okay, so how do we use MapReduce? I’ll describe it with a simple example, which is a program to count the number of occurrences of different words in a set of files. The example is simple, but you’ll find it rather strange if you’re not already familiar with MapReduce: the program we’ll describe is certainly not the way most programmers would solve the word-counting problem! What it is, however, is an excellent illustration of the basic ideas of MapReduce. Furthermore, what we’ll eventually see is that by using this approach we can easily scale our wordcount program up to run on millions or even billions of documents, spread out over a large cluster of computers, and that’s not something a conventional approach could do easily.

The input to a MapReduce job is just a set of (input_key,input_value) pairs, which we’ll implement as a Python dictionary. In the wordcount example, the input keys will be the filenames of the files we’re interested in counting words in, and the corresponding input values will be the contents of those files:

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()

After this code is run the Python dictionary i will contain the input to our MapReduce job, namely, i has three keys containing the filenames, and three corresponding values containing the contents of those files. Note that I’ve used Windows’ filenaming conventions above; if you’re running a Mac or Linux you may need to tinker with the filenames. Also, to run the code you will of course need text files text\1.txt, text\2.txt, text\3.txt. You can create some simple examples texts by cutting and pasting from the following:

text\a.txt:

The quick brown fox jumped over the lazy grey dogs.

text\b.txt:

That's one small step for a man, one giant leap for mankind.

text\c.txt:

Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.

The MapReduce job will process this input dictionary in two phases: the map phase, which produces output which (after a little intermediate processing) is then processed by the reduce phase. In the map phase what happens is that for each (input_key,input_value) pair in the input dictionary i, a function mapper(input_key,input_value) is computed, whose output is a list of intermediate keys and values. This function mapper is supplied by the programmer – we’ll show how it works for wordcount below. The output of the map phase is just the list formed by concatenating the list of intermediate keys and values for all of the different input keys and values.

I said above that the function mapper is supplied by the programmer. In the wordcount example, what mapper does is takes the input key and input value – a filename, and a string containing the contents of the file – and then moves through the words in the file. For each word it encounters, it returns the intermediate key and value (word,1), indicating that it found one occurrence of word. So, for example, for the input key text\a.txt, a call to mapper("text\\a.txt",i["text\\a.txt"]) returns:

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), 
 ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

Notice that everything has been lowercased, so we don’t count words with different cases as distinct. Furthermore, the same key gets repeated multiple times, because words like the appear more than once in the text. This, incidentally, is the reason we use a Python list for the output, and not a Python dictionary, for in a dictionary the same key can only be used once.

Here’s the Python code for the mapper function, together with a helper function used to remove punctuation:

def mapper(input_key,input_value):
  return [(word,1) for word in 
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  return s.translate(string.maketrans("",""), string.punctuation)

mapper works by lowercasing the input file, removing the punctuation, splitting the resulting string around whitespace, and finally emitting the pair (word,1) for each resulting word. Note, incidentally, that I’m ignoring apostrophes, to keep the code simple, but you can easily extend the code to deal with apostrophes and other special cases.

With this specification of mapper, the output of the map phase for wordcount is simply the result of combining the lists for mapper("text\\a.txt"), mapper("text\\b.txt"), and mapper("text\\c.txt"):

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), 
 ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), 
 ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), 
 ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), 
 ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), 
 ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), 
 ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), 
 ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), 
 ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), 
 ('mankind', 1)]

The map phase of MapReduce is logically trivial, but when the input dictionary has, say 10 billion keys, and those keys point to files held on thousands of different machines, implementing the map phase is actually quite non-trivial. What the MapReduce library handles is details like knowing which files are stored on what machines, making sure that machine failures don’t affect the computation, making efficient use of the network, and storing the output in a useable form. We won’t worry about these issues for now, but we will come back to them in future posts.

What the MapReduce library now does in preparation for the reduce phase is to group together all the intermediate values which have the same key. In our example the result of doing this is the following intermediate dictionary:

{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 
 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 
 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 
 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 
 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 
 'that': [1], 'little': [1], 'small': [1], 'step': [1], 
 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 
 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 
 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}

We see, for example, that the word ‘and’, which appears only once in the three files, has as its associated value a list containing just a single 1, [1]. By contrast, the word ‘one’, which appears twice, has [1,1] as its value.

The reduce phase now commences. A programmer-defined function reducer(intermediate_key,intermediate_value_list) is applied to each entry in the intermediate dictionary. For wordcount, reducer simply sums up the list of intermediate values, and return both the intermediate_key and the sum as the output. This is done by the following code:

def reducer(intermediate_key,intermediate_value_list):
  return (intermediate_key,sum(intermediate_value_list))

The output from the reduce phase, and from the total MapReduce computation, is thus:

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), 
 ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), 
 ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), 
 ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), 
 ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), 
 ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), 
 ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), 
 ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]

You can easily check that this is just a list of the words in the three files we started with, and the associated wordcounts, as desired.

We’ve looked at code defining the input dictionary i, the mapper and reducer functions. Collecting it all up, and adding a call to the MapReduce library, here’s the complete wordcount.py program:

#word_count.py

import string
import map_reduce

def mapper(input_key,input_value):
  return [(word,1) for word in 
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  return s.translate(string.maketrans("",""), string.punctuation)

def reducer(intermediate_key,intermediate_value_list):
  return (intermediate_key,sum(intermediate_value_list))

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()

print map_reduce.map_reduce(i,mapper,reducer)

The map_reduce module imported by this program implements MapReduce in pretty much the simplest possible way, using some useful functions from the itertools library:

# map_reduce.py
"'Defines a single function, map_reduce, which takes an input
dictionary i and applies the user-defined function mapper to each
(input_key,input_value) pair, producing a list of intermediate 
keys and intermediate values.  Repeated intermediate keys then 
have their values grouped into a list, and the user-defined 
function reducer is applied to the intermediate key and list of 
intermediate values.  The results are returned as a list."'

import itertools

def map_reduce(i,mapper,reducer):
  intermediate = []
  for (key,value) in i.items():
    intermediate.extend(mapper(key,value))
  groups = {}
  for key, group in itertools.groupby(sorted(intermediate), 
                                      lambda x: x[0]):
    groups[key] = list([y for x, y in group])
  return [reducer(intermediate_key,groups[intermediate_key])
          for intermediate_key in groups]

(Credit to a nice blog post from Dave Spencer for the use of itertools.groupby to simplify the reduce phase.)

Obviously, on a single machine an implementation of the MapReduce library is pretty trivial! In later posts we’ll extend this library so that it can distribute the execution of the mapper and reducer functions across multiple machines on a network. The payoff is that with enough improvement to the library we can with essentially no change use our wordcount.py program to count the words not just in 3 files, but rather the words in billions of files, spread over thousands of computers in a cluster. What the MapReduce library does, then, is provide an approach to developing in a distributed environment where many simple tasks (like wordcount) remain simple for the programmer. Important (but boring) tasks like parallelization, getting the right data into the right places, dealing with the failure of computers and networking components, and even coping with racks of computers being taken offline for maintenance, are all taken care of under the hood of the library.

In the posts that follow, we’re thus going to do two things. First, we’re going to learn how to develop MapReduce applications. That means taking familiar tasks – things like computing PageRank – and figuring out how they can be done within the MapReduce framework. We’ll do that in the next post in this series. In later posts, we’ll also take a look at Hadoop, an open source platform that can be used to develop MapReduce applications.

Second, we’ll go under the hood of MapReduce, and look at how it works. We’ll scale up our toy implementation so that it can be used over small clusters of computers. This is not only fun in its own right, it will also make us better MapReduce programmers, in the same way as understanding the innards of an operating system (for example) can make you a better application programmer.

To finish off this post, though, we’ll do just two things. First, we’ll sum up what MapReduce does, stripping out the wordcount-specific material. It’s not any kind of a formal specification, just a brief informal summary, together with a few remarks. We’ll refine this summary a little in some future posts, but this is the basic MapReduce model. Second, we’ll give a overview of how MapReduce takes advantage of a distributed environment to parallelize jobs.

MapReduce in general

Summing up our earlier description of MapReduce, and with the details about wordcount removed, the input to a MapReduce job is a set of (input_key,input_value) pairs. Each pair is used as input to a function mapper(input_key,input_value) which produces as output a list of intermediate keys and intermediate values:

[(intermediate_key,intermediate_value),
 (intermediate_key',intermediate_value'),
 ...]

The output from all the different input pairs is then sorted, so that values associated with the same intermediate_key are grouped together in a list of intermediate values. The reducer(intermediate_key,intermediate_value_list) function is then applied to each intermediate key and list of intermediate values, to produce the output from the MapReduce job.

A natural question is whether the order of values in intermediate_value_list matters. I must admit I’m not sure of the answer to this question – if it’s discussed in the original paper, then I missed it. In most of the examples I’m familiar with, the order doesn’t matter, because the reducer works by applying a commutative, associate operation across all intermediate values in the list. As we’ll see in a minute, because the mapper computations are potentially done in parallel, on machines which may be of varying speed, it’d be hard to guarantee the ordering, and this suggests that the ordering doesn’t matter. It’d be nice to know for sure – if anyone reading this does know the answer, I’d appreciate hearing it, and will update the post!

One of the most striking things about MapReduce is how restrictive it is. A priori it’s by no means clear that MapReduce should be all that useful in practical applications. It turns out, though, that many interesting computations can be expressed either directly in MapReduce, or as a sequence of a few MapReduce computations. We’ve seen wordcount implemented in this post, and we’ll see how to compute PageRank using MapReduce in the next post. Many other important computations can also be implemented using MapReduce, including doing things like finding shortest paths in a graph, grepping a large document collection, or many data mining algorithms. For many such problems, though, the standard approach doesn’t obviously translate into MapReduce. Instead, you need to think through the problem again from scratch, and find a way of doing it using MapReduce.

Exercises

How would you implement grep in MapReduce?

Problems

Take a well-known algorithms book, e.g., Cormen-Leiserson-Rivest-Stein, or a list of well-known algorithms. Which algorithms lend themselves to being expressed in MapReduce? Which do not? This isn’t so much a problem as it is a suggestion for a research program.
The early years of serial computing saw the introduction of powerful general-purpose ideas like those that went into the Lisp and Smalltalk programming languages. Arguably, those ideas are still the most powerful in today’s modern programming languages. MapReduce isn’t a programming language as we conventionally think of them, of course, but it’s similar in that it introduces a way of thinking about and approaching programming problems. I don’t think MapReduce has quite the same power as the ideas that went into Lisp and Smalltalk; it’s more like the Fortran of distributed computing. Can we find similarly powerful ideas to Lisp or Smalltalk in distributed computing? What’s the hundred-year framework going to look like for distributed computing?

How MapReduce takes advantage of the distributed setting

You can probably already see how MapReduce takes advantage of a large cluster of computers, but let’s spell out some of the details. There are two key points. First, the mapper functions can be run in parallel, on different processors, because they don’t share any data. Provided the right data is in the local memory of the right processor – a task MapReduce manages for you – the different computations can be done in parallel. The more machines are in the cluster, the more mapper computations can be simultaneously executed. Second, the reducer functions can also be run in parallel, for the same reason, and, again, the more machines are available, the more computations can be done in parallel.

The difficulty in this story arises in the grouping step that takes place between the map phase and the reduce phase. For the reducer functions to work in parallel, we need to ensure that all the intermediate values corresponding to the same key get sent to the same machine. Obviously, this requires communcation between machines, over the relatively slow network connections.

This looks tough to arrange, and you might think (as I did at first) that a lot of communication would be required to get the right data to the right machine. Fortunately, a simple and beautiful idea is used to make sure that the right data ends up in the right location, without there being too much communication overhead.

Imagine you’ve got 1000 machines that you’re going to use to run reducers on. As the mappers compute the intermediate keys and value lists, they compute hash(intermediate_key) mod 1000 for some hash function. This number is used to identify the machine in the cluster that the corresponding reducer will be run on, and the resulting intermediate key and value list is then sent to that machine. Because every machine running mappers uses the same hash function, this ensures that value lists corresponding to the same intermediate key all end up at the same machine. Furthermore, by using a hash we ensure that the intermediate keys end up pretty evenly spread over machines in the cluster. Now, I should warn you that in practice this hashing method isn’t literally quite what’s done (we’ll describe that in later lectures), but that’s the main idea.

Needless to say, there’s a lot more to the story of how MapReduce works in practice, especially the way it handles data distribution and fault-tolerance. In future posts we’ll explore many more of the details. Nonetheless, hopefully this post gives you a basic understanding of how MapReduce works. In the next post we’ll apply MapReduce to the computation of PageRank.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is FriendFeed room for the series here. You can subscribe to my blog to follow future posts in the series.

33 comments

Richard Jones says:

January 2, 2009 at 3:44 pm

Good article.

One minor point about MapReduce – it wasn’t “introduced” by any 2004 paper. I learned about it on my university course at Imperial College in around 1994/5 (and it was called MapReduce at that time), and I’m pretty certain that if they were teaching it to undergrads in 1994 it wasn’t a new concept even then.
Michael Nielsen says:

January 2, 2009 at 3:50 pm

Richard – Thanks.

On your second point, are Jeff Dean and Sanjay Ghemawat wrong to claim originality in their paper? They pretty explicitly claim this is new in the introduction to their 2004 paper:

“As a reaction to this complexity [of parallelization, data distribution etc], we designed a new abstraction [MapReduce] that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.”

I’ve seen many other sources, including people at Microsoft Research and in academia, also credit MapReduce to this paper.
Richard Jones says:

January 2, 2009 at 3:57 pm

Here’s a couple of references dating the idea of using skeletons (such as mapreduce and others) in programming languages back to the early 90s:

http://spiral.imperial.ac.uk/handle/10044/1/694
http://www.di.unipi.it/~susanna/p3lintro.html
Michael Nielsen says:

January 2, 2009 at 4:01 pm

After a brief browse, those look very interesting. I’ll take a closer look. Thanks, Richard.
Richard Jones says:

January 2, 2009 at 4:01 pm

I should add that Google should certainly take the credit for making it work on their hundred-thousand node cluster, probably adding fault tolerance and the rest. That’s certainly no easy accomplishment in and of itself …
Greg Wilson says:

January 2, 2009 at 6:24 pm

The algorithmically adventurous will want to check out a third member of the trio: the “scan” operation (also called “cumulative sum”). The best explanation is still Blelloch’s “Vector Models for Data-Parallel Computing” (MIT Press, 1990). There is also Yang et al’s Map-Reduce-Merge work [1], which adds a join-like primitive to the suite. Interesting times…

[1] http://portal.acm.org/citation.cfm?doid=1247480.1247602
botchagalupe says:

January 2, 2009 at 6:26 pm

Great post. I love it when really smart guys like you make it easy to learn really hard things.

Thanks
johnmwillis.com
John Sidles says:

January 4, 2009 at 7:44 am

Michael’s Problem Assignment: Which algorithms lend themselves to being expressed in MapReduce? Which do not? This isn’t so much a problem as it is a suggestion for a research program.

Longest … assignment … ever! And, it’s why I’m tracking this series with such serious interest.

With regard to PageRank, it’s pretty clear that Michael’s lectures are going to teach us how to use MapReduce in at least the following two ways: (1) to learn the link structure matrix M and (2) to iteratively solve for the PageRank vector q. The former is the subject of the present lecture, the latter no doubt will be discussed pretty soon!

My own interest consists of mapping PageRank and MapReduce methods onto large-scale quantum simulations. Under this mapping, the link structure matrix M becomes a metric tensor on a tensor-product manifold (also called Kronecker product manifolds, product-sum manifolds, multilinear manifolds, Hartree state manifolds … the list of equivalent names is exceedingly long), and the PageRank vector q becomes the tangent vector to the local (stochastic) quantum trajectory.

The specific calculation is the following: we are given a (stochastic) one-form dv by the open-system quantum dynamics, and we seek to compute (and integrate to create a quantum trajectory) the tangent vector q by repeatedly solving Mq=dv.

We are motivated to examine PageRank and MapReduce for at least three reasons: (1) the dimension of q is typically of order 10^5-10^7, which is too large for direct techniques, hence we need a PageRank style iterative technique, (2) a very natural computational architecture for thousand-spin simulations is “one spin, one processor” for which MapReduce works well, and (3) there are numerous natural isomorphisms that include “spin” == “web page” and “link” == “Hamiltonian coupling”.

Of course, there are some significant differences too. For example, in quantum simulation we are interested in integrating dynamical evolution equations, whereas in PageRank/MapReduce we are interested in a static snapshot … but is that all we are *really* interested in about PageRank/MapReduce? Isn’t the web really a dynamical, evolving system? Shouldn’t PageRank (perhaps) be predictive of web evolution?

As Michael says, these lectures can be regarded as an extended suggestion for a research program. And in the words of young Anakin Skywalker … “It’s working!”
Pingback: sudo rm -r / » cool python tricks
Pingback: Cloud Droplets #36 Is Jack Welch the Smartest in the room if he is doing Quantum Computation on an AWS Mainframe? | IT Management and Cloud Blog
Scott says:

January 6, 2009 at 2:54 pm

I will be following these posts with much anticipation as well. Thanks for the easy to understand lesson.
Glenn G. Chappell says:

January 9, 2009 at 7:29 pm

Very nice article. A few comments:

(1) Regarding the comments by Richard Jones on the originality of MapReduce: It is true that the map & reduce (a.k.a. “fold”) pair of operations was around for a while before the 2004 paper. However, Google’s MapReduce is *not* the same as the old map + reduce. In particular, Google’s version deals only with key-value pairs, and their “reduce” step is only applied to the values corresponding to a single key. Thus it can easily be parallelized, while the old “reduce” often cannot.

Certainly, one can argue about how earthshaking the introduction of Google’s MapReduce was. But it was original (at least in part).

(2) Michael, one important point that I don’t think you made is that, as pointed out in the Dean-Ghemawat paper, the intermediate_value_list passed to the reducer may be *very* large. Therefore, the reducer should expect to access it via an iterator, rather than receiving an in-memory list.

Having had that thought, I noticed that, in your code, the only thing your reducer does with intermediate_value_list is to apply the Python “sum” built-in to it. And â€¦ “sum” works for any iterable. Perhaps you’re being sneaky, and next you’ll show us how you can alter map_reduce.py to pass the intermediate values using an iterator, without changing word_count.py at all?

(3) Lastly, as you point out, your filename list is formatted in a system-dependent manner. But Python is good at being system-independent. Instead of this:

filenames = [“text\\a.txt”,”text\\b.txt”,”text\\c.txt”]

How about this:

flist = [“a.txt”, “b.txt”, “c.txt”]
filenames = [ os.path.join(“text”, f) for f in flist ]

Again, a very nice article. I look forward to seeing your next installment.
Michael Nielsen says:

January 9, 2009 at 7:51 pm

Glenn – Thankyou very much for the clarifying comments about the differences between the Google MapReduce and earlier incarnations.

On your second point – this is a point I’ll have a lot to say about in a few posts time. I deliberately ignored it here, because the details don’t shed all that much light on the big picture of how to use MapReduce, which is the main point of this post. But you’re dead right that this is important. Ties into combiners, as well, which I’ll also address later.

On your third point – nice one. I’ll have to file that for future use.
Pingback: Michael Nielsen » Using MapReduce to compute PageRank
Pingback: Understanding Map/Reduce | Matt Williams
Matt Williams says:

January 18, 2009 at 12:55 pm

I’ve provided a high level overview of MapReduce in my brand spanking new blog here: http://wordflows.com/matt/2009/01/18/understanding-mapreduce/

Feedback is appreciated, as I’ve only just started blogging today!

I am in agreement that it is far from a new concept. Unless i am missing something obvious in their whitepaper, its a technique we have been using in distributed systems for at least a decade.
JK says:

February 18, 2009 at 11:12 am

> filenames = [ os.path.join(â€textâ€, f) for f in flist ]

Heh, you could even use `functools.partial` and then MapReduce to spread that over a whole bunch of clustered machines, if I got MR right from this well-explanatory article 😀
Michael Nielsen says:

February 18, 2009 at 6:17 pm

Thanks JK, I’ll look into it.
JK says:

February 18, 2009 at 9:22 pm

Please, don’t take that too serious 🙂 I just thought of it because I’d instantly rewrite that line to

> filenames = map(partial(os.path.join, ‘text’), flist)

when in the mood of doing things functional. That, however, doesn’t save many characters and additionally requires an import to work.

In any case, I really appreciate your article and I’m looking forward to more of that quality.

Oh, and let me point to the Disco project at http://discoproject.org/ which uses an Erlang server and a client library to write applications in Python. Their API for the map and reduce functions is a little different, though (which is interesting to note as there are similar implementation approaches that work).
Neal Richter says:

February 21, 2009 at 11:17 pm

Obviously Dean and Ghemawat are due some credit here, yet map and reduce have been around in languages like Lisp for approx 25 years prior to their paper. There were even companies/groups that parallelized their Lisp implementations in the 1980s:

http://ieeexplore.ieee.org/iel3/1244/519/00010373.pdf?arnumber=10373
http://search.barnesandnoble.com/Parallel-LISP/Takatoshi-Ito/e/9780387527826
http://www.springerlink.com/content/k560u307713j57r4/

Did Dean and Ghemawat ‘finish the job’ and do a great implementation at exactly the right time and at the right company to start a distributed computing revolution?
Absolutely and congrats to them for that.

Yet when first introduced to it, it was completely obvious to me and others who learned the map and reduce lisp/scheme primitives as undergrads. And ‘difference’ in their methodology is trivial.. of course this does describe the best revolutions.. they are all trivial in retrospect.

It’s also worth reading David DeWitt’s savaging of MapReduce to cure you of any further illusion of it being ‘invented’ by Dean and Ghemawat. (I think he goes way to far.. modern MapReduce is here to stay – we’re not going back to RDBMs systems for distributed computing anytime soon)
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
Asokan Pichai says:

April 15, 2009 at 12:40 am

The Dave Spencer link is wrong. AFAICS, it says tropo instead of chencer
Michael Nielsen says:

April 15, 2009 at 8:02 am

Asokan – Thanks, fixed.
Pingback: Michael Nielsen » Consistent hashing
Pingback: Jonathan Stray » Why We Need Open Search, and How to Make Money Doing It
Pingback: Bull … « Richard WM Jones
Pingback: Write your first MapReduce program in 20 minutes | Panicked Zebra
Pingback: 10001 ä»¶è³‡å·¥ç³»ç•¢æ¥å‰ä¸€å®šè¦åšçš„äº‹ » Mr. Jamie çœ‹ç¶²è·¯èˆ‡å‰µæŠ•
milosh says:

July 1, 2011 at 6:05 am

In JK’s map(partial(os.path.join, “text”), flist), is partial better than simply using a lambda?
Pingback: Noticeable Hacker news to read at 2011/06 | David Euler on programming, design and linux
piecurus says:

April 6, 2012 at 6:44 am

Regarding the originality of MapReduce paradigm, I completely agree with Neal Richter.
Map and Reduce are two primitives of functional programming that had particular use in Lisp.

Consider that those primitive are also implemented in Python ( Guido van Rossum suggested to remove those primivites in Python 3000 http://www.artima.com/weblogs/viewpost.jsp?thread=98196 ).

On the other side, consider that Peter Norvig, Lisp guru started as Research Director @ Google in 2001. That may have affected somehow to the development of that paradigm.
For more, http://norvig.com/
Pingback: MapReduce made easy to understand « Ram Prasad
Amp says:

July 30, 2013 at 9:30 am

A beginner on understanding Map-Reduce:

Just a clarification:

“i has three keys containing the filenames, and three corresponding values containing the contents of those files. ”

So:
1. key1: filename
2. key2: file contents
3. key3: ??

Am I interpreting in the right direction ?
Ankit says:

December 28, 2013 at 12:39 pm

Hi Michael,

I am a grad student and i want to create a graph processing framework(A smaller version)
which works on some standard graphs and may give higher performance than mpi/open mp for
shortest path and pagerank algorithm.Can you please give some suggestions on how to proceed as i have read about pregel,Giraph anf graphchi but i am not able to decide about parameters as to how to store and split the graph and how to assign the tasks in parallel and certain design issues.I want to design my own simulation just like giraph where user can define his own functions.

Comments are closed.