Machine-readable Open Access scientific publishing
Over the last 50 years, scientific publishing has become remarkably profitable. The growth of commercial publishers has greatly outstripped that of not-for-profit society journals, and some of those commercial publishers have achieved remarkable success. For example, in 2006 industry titan Elsevier had revenues of approximately €1.5 billion on its science and medical journals.
Against this backdrop, an Open Access movement has emerged which has lobbied with considerable success for various types of open access to the scientific literature. Many funding bodies (including six of the seven UK Research Councils, the Australian Research Council, and the US’s NIH) are now considering mandatory Open Access provisions for all research they support. A sign of the success of the Open Access movement is that some journal publishers have started an aggressive counter-lobbying effort, going under the Orwellian moniker “Publishers for Research Integrity in Science and Medicine” (PRISM). (For much more background, see some of the excellent blogs related to Open Access, e.g., those of Peter Suber, Stevan Harnad, Coturnix, and John Wilbanks.)
What does Open Access actually mean? In fact, there are many different types of Open Access, depending on exactly what types of access are allowed to the papers under consideration, and when. However, most of the effort in the Open Access movement seems to have focused on providing access to human readers of the papers, providing documents in formats like HTML or PDF. While these formats are good for humans, they are rather difficult for machines to break down and extract meaning from. To pick a simple example, it is not easy for a machine to reliably extract a list of authors and institutions from the raw PDF of a paper.
I believe it is important to establish a principle of Machine-Readable Open Access. This is the idea that both a paper and its metadata (such as citations, authors, title, and so on) should be made freely available in a format that is easily machine readable. Phrased another way, it means that publishers should provide Open APIs that allow other people and organizations access to their data.
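To make the contrast with raw PDF concrete, here is a minimal sketch in Python of what working with machine-readable metadata might look like. The record layout and field names (`title`, `authors`, `institution`, `citations`) are purely illustrative assumptions, not any publisher's actual schema, and the function name is hypothetical. The point is only that, once metadata is structured, extracting authors and institutions becomes a couple of lines rather than a fragile parsing problem.

```python
import json

# Hypothetical metadata record. The field names are illustrative assumptions,
# not a real publisher's format.
RECORD_JSON = """
{
  "title": "An example paper",
  "authors": [
    {"name": "A. Author", "institution": "Example University"},
    {"name": "B. Author", "institution": "Example Institute"}
  ],
  "citations": ["example-id-1", "example-id-2"]
}
"""


def authors_and_institutions(record_json):
    """Return a list of (name, institution) pairs from a JSON metadata record."""
    record = json.loads(record_json)
    return [(author["name"], author["institution"]) for author in record["authors"]]


if __name__ == "__main__":
    for name, institution in authors_and_institutions(RECORD_JSON):
        print(f"{name} ({institution})")
```

The same record could just as well be served as XML or through a web API; what matters is that the structure is explicit rather than buried in page layout.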
The key point of Machine-Readable Open Access is that it will enable other organizations to build value-added services on top of the scientific literature. At first, these services will be very simple: a better way of viewing the arXiv preprint server, or better search engines for the scientific literature. But as time goes on, Machine-Readable Open Access will enable more advanced services to be developed by anyone willing to spend the time to build them. Examples might include tools to analyse the research literature, to discover emerging trends (perhaps using data mining and artificial intelligence techniques applied to citation patterns), to recommend papers that might be of interest, to automatically produce analyses of grants or job applications, and to point out connections between papers that otherwise would be lost.
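As a rough illustration of the kind of simple service that open citation data would make possible, the sketch below suggests related papers by counting shared references (a technique usually called bibliographic coupling). The citation data, paper identifiers, and function names are all hypothetical; a real service would draw on a publisher's open metadata and would likely use something more sophisticated.

```python
from itertools import combinations

# Hypothetical citation data: paper id -> set of ids of the works it cites.
CITATIONS = {
    "paper-a": {"ref-1", "ref-2", "ref-3"},
    "paper-b": {"ref-2", "ref-3", "ref-4"},
    "paper-c": {"ref-5"},
}


def shared_reference_counts(citations):
    """Count the references shared by each pair of papers.

    Pairs with no references in common are omitted from the result.
    """
    counts = {}
    for first, second in combinations(sorted(citations), 2):
        shared = len(citations[first] & citations[second])
        if shared:
            counts[(first, second)] = shared
    return counts


def related_papers(paper_id, citations):
    """List papers sharing references with paper_id, most shared first."""
    scored = []
    for pair, shared in shared_reference_counts(citations).items():
        if paper_id in pair:
            other = pair[0] if pair[1] == paper_id else pair[1]
            scored.append((shared, other))
    # Sort by descending overlap, breaking ties alphabetically by paper id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored]


if __name__ == "__main__":
    print(related_papers("paper-a", CITATIONS))
```

None of this is possible without reliable access to each paper's reference list, which is exactly the kind of metadata that is hard to recover from documents formatted only for human readers.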
Indeed, provided such services themselves support open APIs, it will become possible to build still higher level services, and thus provide a greater return on our collective investment in the sciences. All this will be enabled by ensuring that at each level data is provided in a format that is not only human readable, but which is also designed to be accessed by machines. For this reason, I believe that the Open Access provisions now being considered by funding agencies would be greatly strengthened if they mandated Machine-Readable Open Access.