{"id":256,"date":"2007-08-08T19:24:22","date_gmt":"2007-08-09T02:24:22","guid":{"rendered":"http:\/\/michaelnielsen.org\/blog\/?p=256"},"modified":"2007-08-15T21:43:49","modified_gmt":"2007-08-16T01:43:49","slug":"open-source-google","status":"publish","type":"post","link":"https:\/\/michaelnielsen.org\/blog\/open-source-google\/","title":{"rendered":"Open source Google"},"content":{"rendered":"<p>Why can&#8217;t we ask arbitrarily complex questions of the whole web?<\/p>\n<p>Consider the questions we can ask the web. Type a name into Google and you see, very roughly, the top sites mentioning that name, and how often it is mentioned on the web.  At a more sophisticated level, Google makes available a limited API (see <a href=\"http:\/\/glinden.blogspot.com\/2007\/08\/google-search-api-for-research.html\">here<\/a>, <a href=\"http:\/\/research.google.com\/university\/search\/\">here<\/a>, and <a href=\"http:\/\/googleresearch.blogspot.com\/2007\/07\/drink-from-firehose-with-university.html\">here<\/a>) that lets you send simple queries to their back-end database.<\/p>\n<p>Compare that to what someone working internally for Google can do.  They can ask <em>arbitrarily complex<\/em> questions of the web as a whole, using powerful database query techniques.  They can even apply algorithms that leverage <em>all<\/em> the information available on the web, incorporating ideas from fields like machine learning to extract valuable information. This ability to query the web as a whole, together with Google&#8217;s massive computer cluster, enables not only Google search, but also many of the <a href=\"http:\/\/labs.google.com\/\">dozens of other applications<\/a> offered by Google. 
To do all this, Google constructs a local mirror of the web, which they then enhance by indexing and structuring it to make complex queries of the web possible.<\/p>\n<p>What I want is for <em>all developers<\/em> to have <em>full<\/em> <em>access<\/em> to such a mirror, enabling anyone to query the web as a whole.  Such a mirror would be an amazing development platform, leading to many entirely new types of applications and services. If developed correctly it would, in my opinion, eventually become a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Public_good\">public good<\/a> on a par with the electricity grid.<\/p>\n<p>A <a href=\"http:\/\/gigaom.com\/2007\/07\/30\/google-vs-jimmy-wales-and-open-source-search\/\">related idea<\/a> was announced last week by Wikipedia&#8217;s <a href=\"http:\/\/en.wikipedia.org\/wiki\/Jimbo_Wales\">Jimbo Wales<\/a>: the <a href=\"http:\/\/search.wikia.com\/wiki\/Search_Wikia\">Search Wikia<\/a> search engine is making available an open source web crawler which can be improved by the community at large. This great idea is, however, just the tip of a much larger iceberg.  Sure, an open source search tool might improve the quality and transparency of search, and provide some serious competition to Google.  But search is just a single application, no matter how important; it would be far more valuable to open up the entire underlying platform and computing infrastructure to developers. I predict that if Search Wikia is successful, then the developers contributing to it will inevitably drive it away from being a search application, and towards being a development platform.<\/p>\n<p>I believe such a platform can be developed as an open source project, albeit a most unconventional one. So far as I am aware, no-one has ever attempted to develop an open source massively distributed computing platform.  
Many of the required ideas can of course be found in massively distributed applications such as <a href=\"http:\/\/setiathome.berkeley.edu\/\">SETI@Home<\/a>, <a href=\"http:\/\/folding.stanford.edu\/\">Folding@Home<\/a>, and <a href=\"http:\/\/bramcohen.livejournal.com\/\">Bram Cohen&#8217;s<\/a> <a href=\"http:\/\/www.bittorrent.com\/\">BitTorrent<\/a>.  However, this project has many very challenging additional problems, such as privacy (who gets to see what data?) and resource allocation (how much time does any party get on the platform?)<\/p>\n<p>Once these problems are overcome, such an open source platform will enable us to query not only the web as a whole, but also what <a href=\"http:\/\/battellemedia.com\/archives\/000063.php\">John Battelle<\/a> has called the &#8220;database of human intentions&#8221; &#8211; all the actions ever taken by any user of the platform.  Indeed, Google&#8217;s most powerful applications increasingly integrate their mirror of the web with their proprietary database of human intentions.  It&#8217;d be terrific if these two databases &#8211; the web as a whole, and the database of human intentions &#8211; were available to and fully queryable by humanity at large.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why can&#8217;t we ask arbitrarily complex questions of the whole web? Consider the questions we can ask the web. Type a name into Google and you see, very roughly, the top sites mentioning that name, and how often it is mentioned on the web. 
At a more sophisticated level, Google makes available a limited API&hellip; <a class=\"more-link\" href=\"https:\/\/michaelnielsen.org\/blog\/open-source-google\/\">Continue reading <span class=\"screen-reader-text\">Open source Google<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[23,22,21,24],"tags":[],"class_list":["post-256","post","type-post","status-publish","format-standard","hentry","category-dreams","category-ideas","category-open-source-google","category-the-whole-web","entry"],"_links":{"self":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/posts\/256","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/comments?post=256"}],"version-history":[{"count":0,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/posts\/256\/revisions"}],"wp:attachment":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/media?parent=256"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/categories?post=256"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/tags?post=256"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}