Open source Google

Why can’t we ask arbitrarily complex questions of the whole web?

Consider the questions we can ask the web. Type a name into Google and you see, very roughly, the top sites mentioning that name, and how often it is mentioned on the web. At a more sophisticated level, Google makes available a limited API (see here, here, and here) that lets you send simple queries to their back-end database.

Compare that to what someone working internally for Google can do. They can ask arbitrarily complex questions of the web as a whole, using powerful database query techniques. They can even apply algorithms that leverage all the information available on the web, incorporating ideas from fields like machine learning to extract valuable information. This ability to query the web as a whole, together with Google’s massive computer cluster, enables not only Google search, but also many of the dozens of other applications offered by Google. To do all this, Google constructs a local mirror of the web, which they then enhance by indexing and structuring it to make complex queries of the web possible.
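To make the difference concrete, here is a toy sketch of the kind of question a structured local mirror can answer and a keyword search box cannot. Everything in it is an assumption made for illustration: the three-page `MIRROR`, the `.example` URLs, and the helper names are hypothetical, and this is in no way a description of how Google actually stores or indexes the web. It simply builds a word index and a link graph over the mirrored pages, then asks which pages are linked to from pages that mention a given word.

```python
from collections import defaultdict

# Hypothetical three-page "mirror": URL -> (page text, outgoing links).
MIRROR = {
    "a.example": ("quantum computing news", ["b.example", "c.example"]),
    "b.example": ("quantum error correction tutorial", ["c.example"]),
    "c.example": ("gardening tips", []),
}

def build_index(mirror):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, (text, _links) in mirror.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def build_inlinks(mirror):
    """Map each URL to the set of URLs that link to it."""
    inlinks = defaultdict(set)
    for url, (_text, links) in mirror.items():
        for target in links:
            inlinks[target].add(url)
    return inlinks

def linked_from_pages_mentioning(word, index, inlinks):
    """Return URLs that receive a link from at least one page containing `word`."""
    sources = index.get(word.lower(), set())
    return {url for url, linkers in inlinks.items() if linkers & sources}

index = build_index(MIRROR)
inlinks = build_inlinks(MIRROR)
result = linked_from_pages_mentioning("quantum", index, inlinks)
print(sorted(result))
```

The query combines page text with link structure, which is only possible when you hold the whole collection locally and can index it however you like.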

What I want is for all developers to have full access to such a mirror, enabling anyone to query the web as a whole. Such a mirror would be an amazing development platform, leading to many entirely new types of applications and services. If developed correctly it would, in my opinion, eventually become a public good on a par with the electricity grid.

A related idea was announced last week by Wikipedia’s Jimbo Wales: the Search Wikia search engine is making available an open source web crawler which can be improved by the community at large. This great idea is, however, just the tip of a much larger iceberg. Sure, an open source search tool might improve the quality and transparency of search, and provide some serious competition to Google. But search is just a single application, no matter how important; it would be far more valuable to open up the entire underlying platform and computing infrastructure to developers. I predict that if Search Wikia is successful, then the developers contributing to it will inevitably drive it away from being a search application, and towards being a development platform.

I believe such a platform can be developed as an open source project, albeit a most unconventional one. So far as I am aware, no-one has ever attempted to develop an open source massively distributed computing platform. Many of the required ideas can of course be found in massively distributed applications such as SETI@Home, Folding@Home, and Bram Cohen’s BitTorrent. However, this project has many very challenging additional problems, such as privacy (who gets to see what data?) and resource allocation (how much time does any party get on the platform?).

Once these problems are overcome, such an open source platform will enable us to query not only the web as a whole, but also what John Battelle has called the “database of human intentions” – all the actions ever taken by any user of the platform. Indeed, Google’s most powerful applications increasingly integrate their mirror of the web with their proprietary database of human intentions. It’d be terrific if these two databases – the web as a whole, and the database of human intentions – were available to and fully queryable by humanity at large.


  1. How about Amazon’s Alexa Web Search? It “offers programmatic access to Alexa’s web search engine. Developers can incorporate search results directly into their web sites or services, or answer complex queries that can’t be answered with traditional search engines.”

  2. Thanks for the pointer. Alexa Web Search is certainly in the direction of what I have in mind (as is the Google API). However, AWS is still very limited. You couldn’t run your own version of PageRank off AWS. You couldn’t write large scale clustering algorithms to (for example) identify emerging trends in technology, or fashion, or music, or whatever. More generally, there are loads of interesting techniques from AI that simply couldn’t be implemented through AWS.

  3. I think the reason for the unwillingness to open the computing platform is what you’ve already mentioned. Although the search engine is technically significant, it is still just an application, so open-sourcing it may not affect the company’s core technology. Open-sourcing the computing platform, however, which is the company’s real core technology, may harm its profits. As a for-profit company, it is willing to open-source things to improve efficiency only on the assumption that doing so does not harm its core profits.

  4. I think that next-generation online search algorithms could draw inspiration from models of human pattern recognition, such as Adaptive Resonance Theory (ART) (see

    The human brain has the wonderful capacity to focus attention on relevant input features during pattern recognition tasks, while ignoring irrelevant or distracting features. This allows the brain to continuously acquire new recognition memories without losing old memories. ART pattern recognition networks solve this so-called stability-plasticity problem and, in so doing, can be trained to recognize arbitrarily complex patterns.

    Could an ART-type network be trained to answer arbitrarily complex online search queries?

  5. DM: That’s one of the links I have in my post (where I talk about Google’s limited public API).

  6. Vlad: Google seems very interested in this kind of thing. Their director of research, Peter Norvig, is a noted AI researcher, and I am told that many of the people who work there have a background in AI. There have also been numerous rumours floating around like this about Google and AI.

  7. Duh. My bad. “here” is not the most descriptive link title, but I should have checked the URL nonetheless. Sorry.

  8. DM: No problem. Thanks for taking the time to suggest something interesting. It definitely IS really interesting, although, as I say in the post, it’s also a lot more limited than one might like.

  9. Michael: Thanks for the link. It seems to me that people sometimes use the term AI to refer to some mysterious form of sentience, as often portrayed in the movies. One of my favourite examples of this type of thinking can be found in The Terminator, where Arnie says something to the effect of “at time X the AI began to learn at an exponential rate, reaching consciousness at time Y.”

    The point of my post was simply to call attention to the distinction between AI and natural intelligence (or neural computation). It seems to me that the hallmark of natural intelligence is NOT how quickly brains process information per se. A humble laptop computer could, after all, beat most people at just about any brute-force computation you would care to name (some autistic savants can quickly perform remarkable feats of arithmetic, but that is another story). Rather, what characterizes normal human thought is the ability to quickly select the relevant features of a problem, focusing attention on these relevant features in a recursive manner, while ignoring irrelevant features.
