{"id":503,"date":"2008-11-27T09:44:04","date_gmt":"2008-11-27T13:44:04","guid":{"rendered":"http:\/\/michaelnielsen.org\/blog\/?page_id=503"},"modified":"2011-06-25T20:33:20","modified_gmt":"2011-06-26T00:33:20","slug":"lecture-course-the-google-technology-stack","status":"publish","type":"page","link":"https:\/\/michaelnielsen.org\/blog\/lecture-course-the-google-technology-stack\/","title":{"rendered":"The Google Technology Stack"},"content":{"rendered":"<h3>Posts<\/h3>\n<h4>PageRank and MapReduce<\/h4>\n<ul>\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=506\">Introduction to PageRank<\/a>  (<a href=\"http:\/\/michaelnielsen.org\/blog\/?p=509\">video<\/a>)\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=511\">Building our     PageRank intuition<\/a>   (<a href=\"http:\/\/michaelnielsen.org\/blog\/?p=520\">video<\/a>)\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=516\">The PageRank     distribution for the web<\/a> (no video, short supplement extending   the last post)\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=523\">Using your laptop     to compute PageRank for millions of webpages<\/a>\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=529\">Write your first     MapReduce program in 20 minutes<\/a>\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=534\">Using MapReduce to     compute PageRank<\/a>\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=613\">Consistent Hashing<\/a>\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/a-number-theoretic-approach-to-consistent-hashing\/\">A Number-Theoretic     Approach to Consistent Hashing<\/a>\n<li> <strong>To be added:<\/strong> We&#8217;ll put MapReduce on a cluster, use   it to compute PageRank, and much more. <\/ul>\n<h4>Data mining and machine learning<\/h4>\n<ul>\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=577\">Introduction to     statistical machine translation<\/a>.  
Good background is   <a href=\"http:\/\/norvig.com\/spell-correct.html\">Peter Norvig&#8217;s     introduction to spelling correction<\/a>.\n<li> <a href=\"http:\/\/michaelnielsen.org\/blog\/?p=587\">Implementing     statistical machine translation using MapReduce<\/a>\n<li> <strong>Much more to be added<\/strong> <\/ul>\n<h3>About<\/h3>\n<p>Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data.  According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: &#8220;Google uses Bayesian filtering the way Microsoft uses the if statement&#8221; (<a href=\"http:\/\/www.joelonsoftware.com\/items\/2005\/10\/17.html\">Credit:   Joel Spolsky<\/a>). This series of posts describes some of the technologies that make this high level of abstraction possible. The technologies I&#8217;ll describe include:<\/p>\n<ul>\n<li> <strong>The Google File System:<\/strong> a simple way of accessing   enormous amounts of data spread across a large cluster of machines.   Acts as a &#8220;virtual file system&#8221; that makes the cluster look to   developers more like a single machine, and eliminates the need to   think about details like what happens when a machine fails.\n<li> <strong>Bigtable:<\/strong> Building on the Google File System, Bigtable   provides a simple database model which can be run across large   clusters.  This allows developers to ignore many of the underlying   details of the cluster, and concentrate on getting things done.\n<li> <strong>MapReduce:<\/strong> A powerful programming model for processing   and generating very large data sets on large clusters, MapReduce   makes it easy to automatically parallelize many programming   tasks. 
It is used internally by Google developers to process many   petabytes of data every day, enabling developers to code and run   simple but large parallel jobs in minutes, instead of days. <\/ul>\n<p>Together, these technologies make it easy to run large parallel jobs on very big data sets.  Running on top of this technology stack are many powerful data mining and machine learning algorithms.  I&#8217;ll describe at least two of these: <\/p>\n<ul>\n<li> Web Crawling\n<li> PageRank <\/ul>\n<p> and may describe more, including statistical approaches to spell checking and machine translation, and recommendation algorithms.<\/p>\n<p>In addition to understanding how these technologies work, we&#8217;ll also:<\/p>\n<ul>\n<li> Develop toy implementations of some of the technologies.\n<li> Investigate some related open source technologies, such as   Hadoop, CouchDB, Nutch, Lucene, and others. <\/ul>\n<p>I&#8217;ve never worked at Google.  The posts are based on the published literature, especially some key technical publications released by Google describing the Google File System, Bigtable, MapReduce, and PageRank.  I&#8217;m sure there are some significant differences between the material I will describe, and the current internal state-of-the-art. Still, we should be able to build up a pretty good picture of a data-intensive web company, even if it&#8217;s not accurate about Google&#8217;s particulars.<\/p>\n<p><strong>Exercises and problems:<\/strong> The posts will contain exercises for you to do to master the material, and also some more challenging and often open-ended problems.  Note that while I&#8217;ve worked on many of the problems, I by no means have full solutions.<\/p>\n<p><strong>FriendFeed room:<\/strong> There is a <a href=\"http:\/\/friendfeed.com\/rooms\/lecture-course-on-the-google-techno\">FriendFeed   room<\/a> for the series.<\/p>\n<p><strong>Accompanying lectures:<\/strong> I&#8217;m basing a lecture series in Waterloo, Ontario, on the posts.  
Here are some details about the lectures.<\/p>\n<p><strong>Audience:<\/strong> The audience will include both theoretical physicists and developers; the lectures should be accessible to both sets of people.<\/p>\n<p><strong>Time:<\/strong> Lectures will be held every Tuesday, 7:00-8:30pm.  The exception is the first lecture, which will be Thursday, December 4, 2008, 7:00-8:30pm.<\/p>\n<p><strong>Location:<\/strong> <a href=\"http:\/\/www.postrank.com\">AideRSS<\/a> have kindly offered us a room for the course, at 505-180 King Street South, Waterloo &#8211; it&#8217;s opposite the Brick Brewery, and near the Hospital. <em>To get into the building, you need someone from AideRSS to let   you in.  If no one from AideRSS is by the entrance, call (519) 498   1476.<\/em><\/p>\n<h2>Background reading<\/h2>\n<p>The lectures are based primarily on the following sources:<\/p>\n<ol>\n<li> <strong>Original PageRank paper:<\/strong>   <a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.38.5427\">&#8220;The     PageRank Citation Ranking: Bringing Order to the Web&#8221;<\/a>, by   Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd   (1998).\n<li> <strong>Textbook on PageRank and related ideas:<\/strong>   <a href=\"http:\/\/pagerankandbeyond.com\/\">&#8220;Google&#8217;s     PageRank and Beyond: The Science of Search Engine Rankings&#8221;<\/a>, by   Amy N. 
Langville and Carl D. Meyer (2006).\n<li> <strong>Overview of Google&#8217;s early architecture:<\/strong>   <a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.109.4049\">&#8220;The     Anatomy of a Large-Scale Hypertextual Web Search Engine&#8221;<\/a>, by   Sergey Brin and Lawrence Page (1998).\n<li> <strong>More recent overview of Google&#8217;s architecture:<\/strong>   <a href=\"http:\/\/research.google.com\/archive\/googlecluster.html\">&#8220;Web     Search for a Planet: The Google Cluster Architecture&#8221;<\/a>, by Luiz   Barroso, Jeffrey Dean, and Urs Hoelzle (2003).\n<li> <strong>Original Google File System paper:<\/strong>   <a href=\"http:\/\/labs.google.com\/papers\/gfs.html\">&#8220;The Google File     System&#8221;<\/a>, by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung   (2003).\n<li> <strong>Original Bigtable paper:<\/strong>   <a href=\"http:\/\/labs.google.com\/papers\/bigtable.html\">&#8220;Bigtable: A     distributed storage system for structured data&#8221;<\/a>, by Fay Chang,   Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,   Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber (2006).\n<li> <strong>Original MapReduce paper:<\/strong>   <a href=\"http:\/\/labs.google.com\/papers\/mapreduce.html\">&#8220;MapReduce:     Simplified Data Processing on Large Clusters&#8221;<\/a>, by Jeffrey Dean   and Sanjay Ghemawat (2004). <\/ol>\n<p>Related resources include:<\/p>\n<ol>\n<li> <a href=\"http:\/\/www.youtube.com\/watch?v=yjPBkvYh-ss\">First of five     lectures on MapReduce<\/a>.\n<li> Lecture course on   <a href=\"http:\/\/www.cs.washington.edu\/education\/courses\/cse490h\/08au\/\">scalable     systems<\/a> at the University of Washington, covering many similar   topics.  Videos of the lectures are included. 
<\/ol>\n<p>Many more resources will be posted in the <a href=\"http:\/\/friendfeed.com\/rooms\/lecture-course-on-the-google-techno\">FriendFeed   room<\/a> for the course.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Posts PageRank and MapReduce Introduction to PageRank (video) Building our PageRank intuition (video) The PageRank distribution for the web (no video, short supplement extending the last post) Using your laptop to compute PageRank for millions of webpages Write your first MapReduce program in 20 minutes Using MapReduce to compute PageRank Consistent Hashing A Number-Theoretic Approach&hellip; <a class=\"more-link\" href=\"https:\/\/michaelnielsen.org\/blog\/lecture-course-the-google-technology-stack\/\">Continue reading <span class=\"screen-reader-text\">The Google Technology Stack<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"open","template":"","meta":{"footnotes":""},"class_list":["post-503","page","type-page","status-publish","hentry","entry"],"_links":{"self":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/pages\/503","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/comments?post=503"}],"version-history":[{"count":1,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/pages\/503\/revisions"}],"predecessor-version":[{"id":953,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/pages\/503\/revisions\/953"}],"wp:attachment":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/media?parent=503"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}
}