Machine-readable Open Access scientific publishing

Over the last 50 years, scientific publishing has become remarkably profitable. The growth of commercial publishers has greatly outstripped that of not-for-profit society journals, and some of those commercial publishers have achieved remarkable success – for example, in 2006 industry titan Elsevier had revenues of approximately €1.5 billion from its science and medical journals.

Against this backdrop, an Open Access movement has emerged which has lobbied with considerable success for various types of open access to the scientific literature. Many funding bodies (including six of the seven UK Research Councils, the Australian Research Council, and the US’s NIH) are now considering making Open Access provisions mandatory for all research they support. A sign of the success of the Open Access movement is that some journal publishers have started an aggressive counter-lobbying effort, going under the Orwellian moniker “Publishers for Research Integrity in Science and Medicine” (PRISM). (For much more background, see some of the excellent blogs related to Open Access – e.g., Peter Suber, Stevan Harnad, Coturnix, and John Wilbanks.)

What does Open Access actually mean? In fact, there are many different types of Open Access, depending on exactly what types of access to the papers are allowed, and when. However, most of the effort in the Open Access movement seems to have focused on providing access for human readers of the papers, supplying documents in formats like HTML or PDF. While these formats are good for humans, they are rather difficult for machines to parse and extract meaning from – to pick a simple example, it is not easy for a machine to reliably extract a list of authors and institutions from the raw PDF of a paper.

I believe it is important to establish a principle of Machine-readable Open Access. This is the idea that papers should be published in such a way that both the paper and its metadata (such as citations, authors, title, and so on) should be made freely available in a format that is easily machine readable. Phrased another way, it means that publishers should provide Open APIs that allow other people and organizations access to their data.
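To make the idea concrete, here is a minimal sketch of how a machine might consume metadata published in a standard format such as Dublin Core (the record below is invented for illustration) – a task that is far harder with a raw PDF:

```python
# Parse a Dublin Core metadata record of the kind an open
# publisher API (for example, an OAI endpoint) might return.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

record = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Machine-readable Open Access</dc:title>
  <dc:creator>A. Author</dc:creator>
  <dc:creator>B. Author</dc:creator>
  <dc:date>2007-10-01</dc:date>
</metadata>"""

root = ET.fromstring(record)
title = root.findtext(DC + "title")
authors = [e.text for e in root.findall(DC + "creator")]
print(title)    # Machine-readable Open Access
print(authors)  # ['A. Author', 'B. Author']
```

With metadata in this form, extracting titles, authors, and citations is a few lines of code rather than a research problem.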

The key point of Machine-Readable Open Access is that it will enable other organizations to build value-added services on top of the scientific literature. At first, these services will be very simple – a better way of viewing the arXiv preprint server, or better search engines for the scientific literature. But as time goes on, Machine-Readable Open Access will enable more advanced services to be developed by anyone willing to spend the time to build them. Examples might include tools to analyse the research literature, to discover emerging trends (perhaps using data mining and artificial intelligence techniques applied to citation patterns), to recommend papers that might be of interest, to automatically produce analyses of grants or job applications, and to point out connections between papers that otherwise would be lost.

Indeed, provided such services themselves support open APIs, it will become possible to build still higher level services, and thus provide a greater return on our collective investment in the sciences. All this will be enabled by ensuring that at each level data is provided in a format that is not only human readable, but which is also designed to be accessed by machines. For this reason, I believe that the Open Access provisions now being considered by funding agencies would be greatly strengthened if they mandated Machine-Readable Open Access.

13 comments

  1. If you don’t already read it, Peter Murray-Rust’s blog might interest you. He is probably the most prominent advocate of Open Data — by which he means not just free for eyeballs but also machine readable (see, e.g., the Wikipedia entry, which is mostly Peter’s work).

    If that link is of interest, the open access/open science section of my blogroll might have a few more you’d like — the cheminformatics blogs are particularly keen on machine readability, as is Cameron Neylon of Science in the Open.

    And just for kicks, this might also interest you.

    Although I agree with you on the power and necessity of machine readable data, I am not sure a mandate is the way to go — at least, not just yet. Semantic markup is hard, and it’s been enough of a struggle to get mandates for OA (which, as Stevan Harnad is fond of reminding the world, is only a matter of keystrokes).

  2. I think an open API for accessing metadata would be a tremendous step forward. Whole PhDs are written in our field (machine learning) about how one can figure out which people wrote which papers (as people do not always use the same name, and some people share the same name). Rexa (http://rexa.info/) is one such initiative that could be much simplified if one had access to metadata. Of course, this line of research still makes a lot of sense for mining older scientific works.

    Having access to large collections of paper content is already possible, and I think it won’t be long before we see things like topic modelling integrated into preprint and online journal databases. David Blei recently gave a talk in our group on his work on topic modelling more than 100 years of Science. (Slides: http://www.cs.princeton.edu/~blei/modeling-science.pdf)

    The only question is logistical: it is not too hard to organize an open access journal – it requires not much more than a webserver and, at most, a few gigabytes of storage. However, imagine we have some very interesting ways to annotate papers, sophisticated user interfaces, and data mining tools; the question then becomes: who is going to host this? Who will want to invest in infrastructure for this?

  3. Most funders (including the NIH) are not “making it mandatory for all research they support to be published in Open Access journals”. For example, the wording of the NIH policy is:

    “The Director of the National Institutes of Health shall require that all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication, to be made publicly available no later than 12 months after the official date of publication: Provided, That the NIH shall implement the public access policy in a manner consistent with copyright law.”

    See: What is the NIH Public Access Policy?

    So, it’s submission of a “postprint” to PubMed Central that’s been mandated, not publication in an Open Access journal. It’s the metadata provided by PubMed Central that will be relevant.

  4. Bill:

    Thank you for the links! I’ll be adding several of those to my blogroll.

    On the mandate issue, I’m personally not entirely sure a mandate is the right idea. I’m not against it, I just haven’t thought the issue through to my satisfaction.

    With that said, if Open Access is to be mandated, then I think it’s worth doing right, and that means, in my opinion, that provisions should be made for Machine-Readable Open Access. Such a mandate would prevent the problems I described in this post from arising.

  5. Bill:

    One more comment: you say that semantic markup is hard. I won’t argue the point, but I don’t think one needs to solve the problems you’re presumably thinking about to publish useful machine-readable data. A mandate to publish (for example) high-quality Open Archives Initiative (OAI) metadata is quite feasible right now, and would have great benefits.

  6. Jurgen: Thanks for the links, especially to Blei’s work, which is very interesting.

    On the infrastructure issue, it’s worth pointing out that the total infrastructure costs (computing + bandwidth) associated with the academic literature are quite modest. Depending on exactly what one counts, on the order of 1-10 million papers are published per year. This is storable on a single (powerful) machine, and easily on a rather small cluster of machines. Bandwidth is more of an issue, but is still relatively modest.

    Andrew Odlyzko has several excellent papers (see, e.g., this one) where he analyses the cost of online publication. The upshot is that even ten years ago (when the papers were written) the infrastructure costs were relatively small.
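    To put some rough numbers on this (the per-paper size is my assumption, not a figure from Odlyzko), a back-of-the-envelope estimate:

    ```python
    # Rough storage estimate for one year of the research literature.
    papers_per_year = 5_000_000   # mid-range of the 1-10 million estimate
    mb_per_paper = 1.0            # assumed average size of a PDF plus metadata
    total_tb = papers_per_year * mb_per_paper / 1_000_000
    print(f"{total_tb:.1f} TB per year")  # 5.0 TB per year
    ```

    A few terabytes per year is well within reach of a small cluster, which supports the point that storage is not the bottleneck.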

  7. Jim: Thanks for the correction. I’ve seen the policies described in detail before, but somehow this never sunk in – my mistake. I’ll amend the post slightly.

  8. I very much agree with you that the machine-readable journal article is long overdue. I wrote about the idea a bit here [http://i9606.blogspot.com/2007/10/where-is-api.html]

    Though it seems like a great way to move things forward eventually, I don’t think pushing for government mandates is really going to work any time soon. This is because a) the technology for doing this effectively is genuinely challenging to use, b) most people have a hard time understanding what “machine-readable” means, and c) all of the publishers will fight it (unless the standard comes from one of them – and then they will fight with each other).

    Before it can reach the level of broad standards, someone really needs to step up, build a working example, and demonstrate it. Ideally, an organization that is already well respected, like PLOS (http://www.plos.org/), could be convinced to host a machine-useful version of their journal. Failing that, a system could be built on top of a collection of existing journal articles. If the system could be demonstrated to do new and useful things (e.g. answer cross-paper queries), then perhaps the argument could be carried to government forces. Otherwise, perhaps the argument could be carried directly to scientists and a completely new journal created. (I kind of like the latter idea and would be keen to be a part of it.)

  9. Ben,

    Thanks for the pointer, great post at your blog (which I’ve added to my blogroll)! You might find the Physics arXiv’s API interesting (link). There is also a mailing list, linked from that page.

    On the issue of machine-readability standards, there is already a quite useful standard for metadata (OAI) (link), which is supported by many journals. I do think it can be improved a lot, and there is progress in this direction (OAI is already on version 2). Similar standards don’t yet seem to exist for entire articles, so there is an interesting opportunity there.
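    As a small illustration of how simple such an API can be to use: a query against the arXiv API is just an HTTP GET with a few parameters (the search terms here are made up; the endpoint and parameter names are from the arXiv API documentation):

    ```python
    # Build a query URL for the arXiv API, which returns an Atom
    # feed of matching papers with machine-readable metadata.
    from urllib.parse import urlencode

    base = "http://export.arxiv.org/api/query"
    params = {"search_query": "all:open access", "start": 0, "max_results": 5}
    url = base + "?" + urlencode(params)
    print(url)
    ```

    The Atom feed that comes back can then be parsed with any standard XML library – exactly the kind of machine access the post argues for.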

  10. There is also the XML metadata format that publishers use to provide metadata to Google Scholar. It does not include citations, though.

  11. David: Do you know of a place where Google Scholar’s format is described? I’ve often wondered what their arrangement with publishers is, but have never seen it mentioned in more than passing.

  12. Michael:

    I have documentation from working with them on the publisher side. I will follow up with you by email.

    David

Comments are closed.