The Future of Science

How is the web going to impact science?

At present, the impact of the web on science has mostly been to make access to existing information easier, through tools such as online journals and databases like the ISI Web of Knowledge and Google Scholar. There have also been some interesting attempts at developing other kinds of tools, although so far as I am aware none has gained much traction with the wider scientific community. (There are signs of exceptions to this rule on the horizon, especially some of the tools being developed by Timo Hannay’s team at Nature.)

The contrast with the internet at large is striking. eBay, Google, Wikipedia, Facebook, Flickr and many others are new types of institution enabling entirely new forms of co-operation. Furthermore, the rate of innovation in creating such new institutions is enormous, and these examples only scratch the surface of what will soon be possible.

Over the past few months I’ve drafted a short book on how I think science will change over the next few years as a result of the web. Although I’m still revising and extending the book, over the next few weeks I’ll be posting self-contained excerpts here that I think might be of some interest. Thoughtful feedback, argument, and suggestions are very welcome!

A few of the things I discuss in the book and will post about here include:

  • Micropublication: Allowing immediate publication in small incremental steps, both of conventional text and of more diverse media (e.g. commentary, code, data, simulations, explanations, suggestions, criticism and correction). All are to be treated as first-class, fully citable publications, creating an incentive for people to contribute far more rapidly and in a wider range of ways than is presently the case.
  • Open source research: Using version control systems to open up scientific publications so they can be extended, modified, reused, refactored and recombined by other users, all the while preserving a coherent and citable record of who did what, and when (see the sketch after this list).
  • The future of peer review: The present quality assurance system relies on refereeing as a filtering system, prior to publication. Can we move to a system where the filtering is done after publication?
  • Collaboration markets: How can we fully leverage individual expertise? Most researchers spend much of their time reinventing the wheel, or doing tasks at which they have relatively little comparative advantage. Can we provide mechanisms to easily outsource work like this?
  • Legacy systems and migration: Why has the scientific community been so slow to innovate on the internet? Many of the ideas above no doubt look like pipe dreams. Nonetheless, I believe that by carefully considering and integrating with today’s legacy incentive systems (citation, peer review, and journal publication), it will be possible to construct a migration path that incentivizes scientists to make the jump to new tools for doing research.
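To make the “open source research” item above a little more concrete, here is a minimal sketch of the kind of record a version control system keeps. It is purely illustrative: the `make_commit` function and its field names are my own assumptions, not part of any existing scientific publishing tool.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_commit(content, author, parent_hash=None):
    """Build a minimal, citable record of a single contribution.

    Each commit records who made the change, when, what the new content
    is, and which earlier commit it builds on. Hashing the whole record
    gives a stable identifier that others can cite.
    """
    record = {
        "author": author,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "parent": parent_hash,  # None for the first version
        "content": content,
    }
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    return digest, record

# A paper extended in two small, separately attributable steps.
h1, c1 = make_commit("Draft of the main theorem.", "Alice")
h2, c2 = make_commit("Draft of the main theorem, plus a corrected proof.", "Bob", parent_hash=h1)
print(h2[:12], "builds on", h1[:12], "by", c1["author"])
```

The point of the chain of parent hashes is simply that every later contribution carries an unforgeable reference to what it built on, which is what makes fine-grained reuse citable.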

The Research Funding “Crisis”

If you talk with academics for long, sooner or later you’ll hear one of them talk about a funding crisis in fundamental research (e.g. Google and Cosmic Variance).

There are two related questions that bother me.

First, how much funding is enough for fundamental research? What criterion should be used to decide how much money is the right amount to spend on fundamental research?

Second, the human race spent a lot more on fundamental research in the second half of the twentieth century than it did in the first. It’s hard to get a good handle on exactly how much more, in part because it depends on what you mean by fundamental research. At a guess, I’d say at least 1000 times as much was spent in the second half of the twentieth century. Did we learn 1000 times as much? In fact, did we learn as much, even without a multiplier?

Question for Marc Andreessen

A few weeks ago, Marc Andreessen invited his readers to submit a question to him. Here’s mine:

My question: do you think a technological singularity of the type Vernor Vinge has proposed is likely in the near future? If so, what shape do you think the singularity is likely to take? If not, why do you think it won’t occur?

I hope you have time to answer. My own (outsider’s) perspective is that an awfully large number of people (Google, eBay, Wikipedia, etc.) now seem to be working more or less directly towards such a singularity, and it is very suggestive that more and more of the world’s resources are being directed toward this end. Of course, eBay, Google and the rest don’t look at it that way, but from the perspective of a posthuman historian 50 years from now that may well be how it looks.

Andreessen hasn’t replied, but I think this point about the growing commercial utility of AI is fascinating. Here are a couple of quotes from Google co-founder Larry Page that could easily be cited by my putative posthuman historian:

We have some people at Google who are really trying to build artificial intelligence and to do it on a large scale […] to do the perfect job of search you could ask any query and it would give you the perfect answer and that would be artificial intelligence […] I don’t think it’s as far off as people think.

You think Google is good, I still think it’s terrible. […] There’s still a huge number of things that we can’t answer. You might have a more complicated question. Like why did the GNP of Uganda decline relative to the weather last year? You type that into Google, the keywords for that, and you might get a reasonable answer. But there is probably something there that explains that, which we may or may not find. Doing a good job doing search is basically artificial intelligence, we want it to be smart.

It’s interesting that the Director of Google research, Peter Norvig, wrote what appears to be the standard text on artificial intelligence. He’s also got a pretty interesting page of book reviews.

Open source Google

Why can’t we ask arbitrarily complex questions of the whole web?

Consider the questions we can ask the web. Type a name into Google and you see, very roughly, the top sites mentioning that name, and how often it is mentioned on the web. At a more sophisticated level, Google makes available a limited API (see here, here, and here) that lets you send simple queries to their back-end database.
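To give a flavour of what “simple queries” means in practice, here is a rough sketch of how a program might call such a web search API. The endpoint URL, parameter names, and response format below are placeholders of my own, not the actual Google API:

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint and parameters -- stand-ins for whichever
# search API is actually in use, not Google's real interface.
SEARCH_ENDPOINT = "https://api.example.com/search"

def simple_search(query, max_results=10):
    """Send a single keyword query and return the raw result list."""
    params = urllib.parse.urlencode({"q": query, "n": max_results})
    with urllib.request.urlopen(f"{SEARCH_ENDPOINT}?{params}") as response:
        return json.load(response)["results"]

# e.g. simple_search("quantum computation") -> a list of matching pages
```

The limitation is built into the shape of the interface: you get one keyword query in and a ranked list out, nothing like arbitrary questions over the whole corpus.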

Compare that to what someone working inside Google can do. They can ask arbitrarily complex questions of the web as a whole, using powerful database query techniques. They can even apply algorithms that leverage all the information available on the web, drawing on ideas from fields like machine learning to extract valuable patterns. This ability to query the web as a whole, together with Google’s massive computer cluster, enables not only Google search, but also many of the dozens of other applications offered by Google. To do all this, Google constructs a local mirror of the web, which they then index and structure to make complex queries possible.
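As a toy illustration of what “indexing and structuring” a mirror involves, here is a minimal sketch of an inverted index built over a handful of locally stored pages. Real systems work at an utterly different scale, and the data structures and page contents here are assumptions of mine, not a description of Google’s internals:

```python
from collections import defaultdict

# A tiny stand-in for a local mirror: URL -> page text.
mirror = {
    "example.org/a": "open source search engines index the web",
    "example.org/b": "machine learning applied to web scale data",
    "example.org/c": "the web as a platform for open science",
}

# Inverted index: word -> set of URLs containing that word.
index = defaultdict(set)
for url, text in mirror.items():
    for word in text.lower().split():
        index[word].add(url)

def query(*words):
    """Return the URLs containing every query word (a simple AND query)."""
    sets = [index[w.lower()] for w in words]
    return set.intersection(*sets) if sets else set()

print(query("open", "web"))  # pages mentioning both "open" and "web"
```

Once you hold the mirror and the index locally, far richer questions than keyword lookup become possible; that is the capability I would like to see opened up.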

What I want is for all developers to have full access to such a mirror, enabling anyone to query the web as a whole. Such a mirror would be an amazing development platform, leading to many entirely new types of applications and services. If developed correctly it would, in my opinion, eventually become a public good on a par with the electricity grid.

A related idea was announced last week by Wikipedia’s Jimbo Wales: the Search Wikia search engine is making available an open source web crawler which can be improved by the community at large. This great idea is, however, just the tip of a much larger iceberg. Sure, an open source search tool might improve the quality and transparency of search, and provide some serious competition to Google. But search is just a single application, no matter how important; it would be far more valuable to open up the entire underlying platform and computing infrastructure to developers. I predict that if Search Wikia is successful, then the developers contributing to it will inevitably drive it away from being a search application, and towards being a development platform.
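To see why a crawler alone is only the tip of the iceberg, it helps to notice how small the core crawling loop really is. Here is a bare-bones sketch of my own (not the Search Wikia crawler, which is a separate and far more capable piece of software): fetch a page, harvest its links, and repeat.

```python
import re
import urllib.request
from collections import deque

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch pages, harvest links, follow them."""
    seen, queue, pages = set(), deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html  # this dictionary is the beginnings of a "mirror"
        # Very crude link extraction; a real crawler uses a proper parser
        # and respects robots.txt, politeness delays, and so on.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            queue.append(link)
    return pages

# e.g. crawl("https://example.org/") -> a small dictionary of fetched pages
```

The hard and valuable part is everything that comes after this loop: storing, indexing, and exposing the resulting mirror as infrastructure others can build on.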

I believe such a platform can be developed as an open source project, albeit a most unconventional one. So far as I am aware, no one has ever attempted to develop an open source massively distributed computing platform. Many of the required ideas can of course be found in massively distributed applications such as SETI@Home, Folding@Home, and Bram Cohen’s BitTorrent. However, this project has many very challenging additional problems, such as privacy (who gets to see what data?) and resource allocation (how much time does any party get on the platform?).

Once these problems are overcome, such an open source platform will enable us to query not only the web as a whole, but also what John Battelle has called the “database of human intentions” – all the actions ever taken by any user of the platform. Indeed, Google’s most powerful applications increasingly integrate their mirror of the web with their proprietary database of human intentions. It’d be terrific if these two databases – the web as a whole, and the database of human intentions – were available to and fully queryable by humanity at large.