Open source Google

Why can’t we ask arbitrarily complex questions of the whole web?

Consider the questions we can ask the web. Type a name into Google and you see, very roughly, the top sites mentioning that name, and how often it is mentioned on the web. At a more sophisticated level, Google makes available a limited API that lets you send simple queries to their back-end database.

Compare that to what someone working internally for Google can do. They can ask arbitrarily complex questions of the web as a whole, using powerful database query techniques. They can even apply algorithms that leverage all the information available on the web, incorporating ideas from fields like machine learning to extract valuable information. This ability to query the web as a whole, together with Google’s massive computer cluster, enables not only Google search, but also many of the dozens of other applications offered by Google. To do all this, Google constructs a local mirror of the web, which they then enhance by indexing and structuring it to make complex queries of the web possible.
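To make the idea of "indexing and structuring" a mirror a little more concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: the three-page `MIRROR`, its example.com URLs, and the helper names `tokenize`, `build_index`, and `query_all` are all hypothetical, and nothing here reflects how Google's actual systems work. It builds a simple inverted index (term to set of pages) and answers a query that asks for pages containing every given term.

```python
import re
from collections import defaultdict

# Hypothetical miniature "mirror" of the web: URL -> page text.
# A real mirror would hold billions of crawled pages.
MIRROR = {
    "http://example.com/a": "open source search engine",
    "http://example.com/b": "distributed computing platform for search",
    "http://example.com/c": "open platform for developers",
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(mirror):
    """Build an inverted index: term -> set of URLs containing that term."""
    index = defaultdict(set)
    for url, text in mirror.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def query_all(index, *terms):
    """Return the set of URLs whose text contains every one of the terms."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


INDEX = build_index(MIRROR)

if __name__ == "__main__":
    print(sorted(query_all(INDEX, "open", "platform")))
    print(sorted(query_all(INDEX, "search")))
```

Even this tiny version hints at the gap: once the whole mirror and its index are in your hands, you can write any function you like over them, rather than being limited to the handful of queries a public API exposes.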

What I want is for all developers to have full access to such a mirror, enabling anyone to query the web as a whole. Such a mirror would be an amazing development platform, leading to many entirely new types of applications and services. If developed correctly it would, in my opinion, eventually become a public good on a par with the electricity grid.

A related idea was announced last week by Wikipedia’s Jimbo Wales: the Search Wikia search engine is making available an open source web crawler which can be improved by the community at large. This great idea is, however, just the tip of a much larger iceberg. Sure, an open source search tool might improve the quality and transparency of search, and provide some serious competition to Google. But search is just a single application, no matter how important; it would be far more valuable to open up the entire underlying platform and computing infrastructure to developers. I predict that if Search Wikia is successful, then the developers contributing to it will inevitably drive it away from being a search application, and towards being a development platform.

I believe such a platform can be developed as an open source project, albeit a most unconventional one. So far as I am aware, no one has ever attempted to develop an open source massively distributed computing platform. Many of the required ideas can of course be found in massively distributed applications such as SETI@Home, Folding@Home, and Bram Cohen’s BitTorrent. However, this project poses many very challenging additional problems, such as privacy (who gets to see what data?) and resource allocation (how much time does any party get on the platform?).

Once these problems are overcome, such an open source platform will enable us to query not only the web as a whole, but also what John Battelle has called the “database of human intentions” – all the actions ever taken by any user of the platform. Indeed, Google’s most powerful applications increasingly integrate their mirror of the web with their proprietary database of human intentions. It’d be terrific if these two databases – the web as a whole, and the database of human intentions – were available to and fully queryable by humanity at large.