January 2008 – Michael Nielsen

Why the h-index is little use

In 2005 Jorge E. Hirsch published an article in the Proceedings of the National Academy of Sciences (link), proposing the “h-index”, a metric for the impact of an academic’s publications.

Your h-index is the largest number n such that you have n papers with n or more citations. So, for example, if you have 21 papers with 21 or more citations, but don’t yet have 22 papers with 22 or more citations, then your h-index is 21.
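The definition above translates directly into a few lines of code. Here is a minimal sketch in Python; the function name `h_index` and the representation of a publication record as a plain list of per-paper citation counts are my own choices for illustration.

```python
def h_index(citations):
    """Return the largest n such that n papers have n or more citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            # The top `rank` papers all have at least `rank` citations.
            h = rank
        else:
            break
    return h
```

For instance, a record of exactly 21 papers with 21 citations each gives `h_index([21] * 21) == 21`, matching the example in the text.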

Hirsch claims that this measure is a better (or at least different) measure of impact than standard measures such as the total number of citations. He gives a number of apparently persuasive reasons why this might be the case.

In my opinion, for nearly all practical purposes, this claim is incorrect. In particular, and as I’ll explain below, you can to a good approximation work out the h-index as a simple function of the total number of citations, and so the h-index contains very little information beyond this already standard citation statistic.

Why am I mentioning this? Well, to my surprise the h-index is being taken very seriously by many people. A Google search shows the h-index is spreading and becoming very influential, very quickly. Standard citation services like the Web of Science now let you compute h-indices automatically. Promotion and grant evaluation committees are making use of them. And, of course, the h-index has been extensively discussed in the blogosphere (e.g., here, here, here, here, here, here, and here).

Hirsch focuses his discussion on physicists, and I’ll limit myself to that group, too; I expect the main conclusions to hold for other groups, with some minor changes in the constants. For the great majority of physicists (I’ll get to the exceptions), the h-index can be computed to a good approximation from the total number of citations they have received, as follows. Suppose T is the total number of citations. For most physicists, the following relationship holds to a very good approximation:

(*) h ~ sqrt(T)/2.

Thus, if someone has 400 citations, then their h-index is likely to be about half the square root of 400, which is 10. If someone has 2500 citations, then their h-index is likely to be about half the square root of 2500, which is 25.

The relationship (*) actually follows from the data Hirsch analysed. He notes in passing that he found empirically that T = a h^2, where a is between 3 and 5. Inverting the relationship gives h = sqrt(T/a), and so (*) holds to within an accuracy of about plus or minus 15%. That’s accurate enough – nobody cares whether your h-index is 20 or 23, particularly since citation statistics are already quite noisy. Provided a is in this range, h contains little additional information beyond T, which is already a commonly used citation statistic.
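To check the arithmetic, here is a small sketch that inverts T = a h^2 and measures how far the estimate moves as a ranges from 3 to 5. The function names are mine, and taking a = 4 as the central value is an assumption chosen because it reproduces (*).

```python
import math

def estimate_h(total_citations, a=4.0):
    """Estimate h from T by inverting T = a * h**2, i.e. h = sqrt(T / a)."""
    return math.sqrt(total_citations / a)

def relative_spread(a_low=3.0, a_high=5.0, a_mid=4.0):
    """Fractional change in the estimate at each end of the range of a."""
    over = math.sqrt(a_mid / a_low) - 1.0    # smaller a gives a larger h
    under = math.sqrt(a_mid / a_high) - 1.0  # larger a gives a smaller h
    return over, under
```

With the default a = 4, `estimate_h(400)` gives 10 and `estimate_h(2500)` gives 25, as in the examples above. `relative_spread()` returns roughly +0.155 and -0.106, which is where the figure of about plus or minus 15% comes from.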

What about the exceptions to this rule? I believe there are two main sources of exception.

The first class of exception is people with very few papers. Someone with 1-4 papers can easily evade the rule, simply because their distribution of citations across papers may be very unusual. In practice, though, this doesn’t much matter, since in such cases it’s possible to look at a person’s entire record, and measures of aggregate performance are not used so much in these cases, anyway.

The second class of exception is people who have one work which is vastly more cited than any other work. In that case the formula (*) tends to overstate the h-index. The effect is much smaller than you might think, though, since it seems that for the great majority of physicists their top-cited publication has many more citations than their next-most cited publication.

In any case, I hypothesize that this effect is mostly corrected by using the formula:

(**) h ~ b sqrt(T')

where T' is the total number of citations, less the citations to the most cited publication, and b is a constant which needs to be empirically determined. At a guess I’d believe that omitting the top two cited publications would work even better, but after that we’d hit the point of diminishing returns.
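I haven’t said how b should be determined. One possible approach, sketched below, is a least-squares fit of h against sqrt(T') through the origin over a sample of authors. The fitting method, the function names, and the `omit` parameter (how many top-cited papers to drop) are all my assumptions for illustration, and no real data is implied.

```python
import math

def citations_less_top(citations, omit=1):
    """Return T': total citations after dropping the `omit` most cited papers."""
    ranked = sorted(citations, reverse=True)
    return sum(ranked[omit:])

def fit_b(records, omit=1):
    """Least-squares fit of b in h = b * sqrt(T'), constrained through the origin.

    `records` is an iterable of (h, citations) pairs, one pair per author,
    where `citations` is that author's list of per-paper citation counts.
    """
    num = 0.0
    den = 0.0
    for h, citations in records:
        x = math.sqrt(citations_less_top(citations, omit))
        num += h * x
        den += x * x
    if den == 0:
        raise ValueError("no citations remain after omitting the top papers")
    return num / den
```

Running `fit_b` with `omit=1` and then `omit=2` on the same sample, and comparing the residuals, would be one way to test the guess that dropping the top two publications works better.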

Returning to the main point, my counter-claim to Hirsch is that the h-index contains very little additional information beyond the total number of citations. It’s not that the h-index is irrelevant, it’s just that in the great majority of cases the h-index is not very useful, given that the total number of citations is likely to already be known.

Anonymous browsing now possible at the Academic Reader

Another update: as of a few weeks ago, it’s possible to browse anonymously for up to three days at the Academic Reader. So you can now easily try the service out and see if it’s for you.

After three days of anonymous use, you’ll be required to create an account. This isn’t imposed frivolously – it’s because the anonymous browsing is implemented using browser-based cookies, which aren’t particularly stable over periods of months. Creating an account ensures that your data (e.g., choice of journals, comments, personal feeds) won’t be lost if you’re a long-term user.

Living Reviews in Relativity added!

I just added the entire back catalog of one of my favourite journals, “Living Reviews in Relativity”, to the Academic Reader. It’s not part of the default feed set, so you’ll need to click on “Find more external feeds”, near the top of the left-hand column, if you want to add it. The journal publishes only a few articles per year, so it’s not a particularly active feed. But the articles are usually very good, so if you’re interested in relativity, it’s worth adding to your feeds.

Machine-readable Open Access scientific publishing

Over the last 50 years, scientific publishing has become remarkably profitable. The growth of commercial publishers has greatly outstripped that of not-for-profit society journals, and some of those commercial publishers have achieved remarkable success – for example, in 2006 industry titan Elsevier had revenues of approximately EUR 1.5 billion on their science and medical journals.

Against this backdrop, an Open Access movement has emerged which has lobbied with considerable success for various types of open access to the scientific literature. Many funding bodies (including six of the seven UK Research Councils, the Australian Research Council, and the US’s NIH) are now considering making mandatory Open Access provisions for all research they support. A sign of the success of the Open Access movement is that some journal publishers have started an aggressive counter-lobbying effort, going under the Orwellian moniker “Publishers for Research Integrity in Science and Medicine” (PRISM). (For much more background, see some of the excellent blogs related to Open Access – e.g., Peter Suber, Stevan Harnad, Coturnix, and John Wilbanks).

What does Open Access actually mean? In fact, there are many different types of Open Access, depending on exactly what types of access (and when) are allowed to the papers under consideration. However, most of the effort in the Open Access movement seems to have focused on providing access to human readers of the papers, providing documents in formats like html or pdf. While these formats are good for humans, they are rather difficult for machines to break down and extract meaning from – to pick a simple example, it is not easy for a machine to reliably extract a list of authors and institutions from the raw pdf of a paper.

I believe it is important to establish a principle of Machine-readable Open Access. This is the idea that papers should be published in such a way that both the paper and its metadata (such as citations, authors, title, and so on) are freely available in a format that is easily machine readable. Phrased another way, it means that publishers should provide Open APIs that allow other people and organizations access to their data.
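To make the contrast with raw pdf concrete, here is a sketch of what a machine-readable record might look like, and how little code it takes to pull the author list out of it. The JSON layout and every field name below are hypothetical, invented purely for illustration; they are not taken from any publisher’s actual API.

```python
import json

# A hypothetical metadata record. The field names are illustrative only.
record = {
    "title": "An example paper title",
    "authors": [
        {"name": "A. Author", "institution": "Example University"},
        {"name": "B. Author", "institution": "Example Institute"},
    ],
    "citations": ["example-id-1", "example-id-2"],
}

def author_names(serialized):
    """Extract the list of author names from a JSON-encoded record."""
    data = json.loads(serialized)
    return [author["name"] for author in data["authors"]]

serialized = json.dumps(record)
print(author_names(serialized))
```

The point is not this particular format, but that once the structure is explicit, extracting authors and institutions becomes a lookup rather than a guessing game.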

The key point of Machine-Readable Open Access is that it will enable other organizations to build value-added services on top of the scientific literature. At first, these services will be very simple – a better way of viewing the arXiv preprint server, or better search engines for the scientific literature. But as time goes on, Machine-Readable Open Access will enable more advanced services to be developed by anyone willing to spend the time to build them. Examples might include tools to analyse the research literature, to discover emerging trends (perhaps using data mining and artificial intelligence techniques applied to citation patterns), to recommend papers that might be of interest, to automatically produce analyses of grants or job applications, and to point out connections between papers that otherwise would be lost.

Indeed, provided such services themselves support open APIs, it will become possible to build still higher level services, and thus provide a greater return on our collective investment in the sciences. All this will be enabled by ensuring that at each level data is provided in a format that is not only human readable, but which is also designed to be accessed by machines. For this reason, I believe that the Open Access provisions now being considered by funding agencies would be greatly strengthened if they mandated Machine-Readable Open Access.

arXiv now complete at the Academic Reader

A quick update: I recently completed uploading metadata from all the papers at the arXiv to the Academic Reader.

The tension between information creators and information organizers

In 2006, a group of Belgian newspapers sued Google, ostensibly to get snippets of their news stories removed from Google News (full story). In fact, the newspapers were well aware that this could be easily achieved by putting a suitable file on their webservers, instructing Google’s web crawler to ignore their webservers. What then was the real purpose of the lawsuit? It’s difficult to know for sure, but it seems likely that it was part of a ploy to pressure Google into paying the newspapers for permission to reuse the newspapers’ content.
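The “suitable file” is presumably a robots.txt file, though that is my reading rather than something stated in the news reports. As a sketch of how simple the mechanism is, the following uses Python’s standard `urllib.robotparser` module to check a set of rules. The rules, the user-agent names, and the example.com URL are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents asking one crawler to stay away from
# the whole site, while leaving other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Report whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these rules, `allowed("Googlebot", "http://example.com/news/story.html")` is false, while the same call for a crawler with a different name is true. Two lines of text on the server are enough, which is what makes the lawsuit look like it was about something else.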

This story is an example of a growing tension between creators of information, whether it be blogs, books, movies, music, or whatever, and organizers of information, such as Google. This tension is tightening sharply as people develop more services for organizing information, and profits increasingly flow toward the organizers rather than the creators.

As another example, in 2007, Google had advertising revenues of approximately 16 billion dollars(!), most of it from search. Yet, according to one study, approximately twenty-five percent of the number one search results on Google led to Wikipedia. Wikipedia, of course, does not directly benefit from Google’s advertising profits. I bet that at least some of Google’s best sources – e.g., Wikipedia, the New York Times, and some of the top blogs – are not happy that Google reaps what may seem a disproportionately large share of the advertising dollar.

Other examples of new niches in the organization of information include RSS readers (Bloglines, Netvibes); social news sites (Digg, Reddit); even my own Academic Reader. In each case, there is a natural tension between the creators of the underlying information, and the organizing service.

Now, of course, it’s greatly to the public benefit for such organizing services to thrive. However, for this to happen, a great deal of information must be made publicly available, preferably in a machine-readable format, like RSS or OAI. If the information is partially or completely locked up (think, e.g., Facebook’s friendship graph), then that enormously limits the web of value that can be built on top of the information. Yet Facebook is understandably very cautious about opening that information up, fearing that it would harm their business.

The situation is further complicated by the fact that the best people to organize and add value to information are often not the original creators of that information. This is for two reasons. The first is a lack of technical expertise – the New York Times does lots of good reporting, but this doesn’t mean they’ll do a good job at providing a search interface to their archive of old articles. The second is the problem of conflicts of interest – the New York Times would have a much harder time running something like Google News than Google does, since other news organizations would not co-operate with them.

Summing up the problem here in a single sentence, the question is this: to what extent should information be made freely accessible, in order to best serve the public interest?

There has, of course, been a lot of debate about this question, but much of that debate has centered around filesharing of music, movies and so on, where the additional value being added to the information is often minimal. The question becomes much more interesting when applied to services like Google News which add additional layers of meaning and organization to information.

At present, the legal situation is not clear. As an example, in the Belgian newspaper case, one might ask whether or not Google’s usage was acceptable under the fair use doctrine for copyright. After all, Google News only excerpted a few lines from the Belgian newspapers. Obviously, the Belgian Courts thought this was not fair use, but other jurisdictions are yet to follow suit.

If the situation today has not yet been resolved, then what might we see in an ideal future? On the one hand, it is highly desirable for information to be freely available for other people to add value. This will often mean making use of a large fraction (or all) of the content, a type of reuse not currently recognized as fair use, yet which is clearly in the public’s interest.

On the other hand, it is also highly desirable for content producers to have incentives to produce content. What we’re seeing at present is a migration of value up the chain from content creators like the New York Times to content organizers, like Google. This, in turn, is causing the content creators to erect fences around their data. The net result is not in anybody’s best interest.

I don’t know what the resolution of this problem is. But it is a real problem, and it’s going to get worse, and it worries me that we’ll end up in a world where the balance is too much one way or the other.

Basic papers on cluster-state quantum computation

Cluster-state quantum computing is an extremely interesting approach to quantum computing. Instead of doing lots of coherent “quantum gates”, as in the usual approach to quantum computing, cluster-state quantum computing provides a way of doing quantum computing with measurements alone. This is surprising from a fundamental point of view, and also turns out to be surprisingly practical – in many physical systems, but especially in optics, it seems like cluster-state quantum computation might be a lot easier to do than regular quantum computing.

Anyways, here’s a collection of basic papers on cluster-state quantum computing, which pretty much mirrors the papers I used to give my students as a way of learning the basics. It’s in no way meant to be complete, or slight anyone, or whatever – it’s just a starter pack describing a lot of the basic ideas.

(As will be obvious if you click on the link, I’m sharing this collection of papers using a website I’ve been developing, the Academic Reader. Suggestions for how to improve the sharing of collections of papers are welcome.)

APIs and the art of building powerful programs

In a recent essay, Steve Yegge observed, wisely in my opinion, that “the worst thing that can happen to a code base is size”.

Nowadays, I spend a fair bit of my time programming, and, like any programmer, I’m interested in building powerful programs with minimal effort. I want to use Yegge’s remark as a jumping off point for a few thoughts about how to build more powerful programs with less effort.

Let’s start by asking what, exactly, is the problem with big programs?

For solo programmers, I think the main difficulty is cognitive overload. As a program gets larger it gets harder to hold the details of the entire program in one’s head. This means that it gradually becomes more difficult to understand the side effects of changes to the code, and so alterations at one location become more likely to cause unintended (usually negative) consequences elsewhere. I’d be surprised if there are many programmers who can hold more than a few hundred thousand lines of code in their head, and most probably can’t hold more than a few tens of thousands. It is suggestive that these numbers are comparable to the sizes involved in other major coherent works of individual human creativity – for example, a symphony has roughly one hundred thousand notes, and a book has tens of thousands of words or phrases.

An obvious way to attempt to overcome this cognitive limit is to employ teams of programmers working together in collaboration. Collaborative programming is a fascinating topic, but I’m not going to discuss it here. Instead, in this essay I focus on how individual programmers can build the most powerful possible programs, given their cognitive limitations.

The usual way individual programmers overcome the limits caused by program size is to reuse code other people have built; perhaps the most familiar examples of this are tools such as programming languages and operating systems. This reuse effectively allows us to incorporate the codebase in these tools into our own codebase; the more powerful the tools, the more powerful the programs we can, in principle, build, without exceeding our basic cognitive limits.

I’ve recently experienced the empowerment such tools produce in a particularly stark way. Between 1984 and 1990, I wrote on the order of a hundred thousand lines of code, mostly in BASIC and 6502 assembly language, with occasional experimentation using languages such as Forth, Smalltalk and C. At that point I stopped serious programming, and only wrote a few thousand lines of code between 1990 and 2007. Now, over the last year, I’ve begun programming again, and it’s striking to compare the power of the tools I was using (and paying for) circa 1990 with the power of today’s free and open source tools. Much time that I would formerly have spent writing code is now instead spent learning APIs (application programming interfaces) for libraries which let me accomplish in one or a few lines of code what would have formerly taken hundreds or thousands of lines of code.

Let’s think in more detail about the mechanics of what is going on when we reuse someone else’s code in this way. In a well-designed tool, what happens is that the internals of the tool’s codebase are hidden behind an abstract external specification, the API. The programmer need only master the API, and can ignore the internal details of the codebase. If the API is sufficiently condensed then, in principle, it can hide a huge amount of functionality, and we need only learn a little in order to get the benefit of a huge codebase. A good API is like steroids for the programming mind, effectively expanding the size of the largest possible programs we can write.
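A deliberately tiny illustration of this hiding, sketched in Python: counting the most frequent words in a list, first by hand and then through the standard library’s `collections.Counter`. The function names and the sample sentence are mine. The gap here is only a few lines, not hundreds, but the shape of the trade is the same: the second version needs knowledge of one call, not of how the counting and ranking are done.

```python
from collections import Counter

def top_words_by_hand(words, n):
    """Count words and return the n most frequent, without library help."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # Sort by descending count; the sort is stable, so ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return ranked[:n]

def top_words_with_api(words, n):
    """The same task through the Counter API, with the internals hidden."""
    return Counter(words).most_common(n)

words = "the cat and the hat and the bat".split()
print(top_words_with_api(words, 2))
```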

All of this is common knowledge to any programmer. But it inspires many natural questions whose answers perhaps aren’t so obvious: How much of a gain does any particular API give you? What makes a particular API good or bad? Which APIs are worth learning? How should one design an API? These are difficult questions, but I think it’s possible to make some simple and helpful observations.

Let’s start with the question of how much we gain for any given API. One candidate figure of merit for an API is the ratio of the size of the codebase implementing the API to the size of the abstract specification of the API. If this ratio is say 1:1 or 3:2 then not much has been gained – you might just as well have mastered the entire codebase. But if the ratio is 100:1 then you’ve got one hundred times fewer details you need to master in order to get the benefit of the codebase. This is a major improvement, and potentially greatly expands the range of what we can accomplish within our cognitive limits.
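The figure of merit is just a ratio, but writing it down makes the units explicit. In this sketch the function name is mine, and measuring both sizes in the same unit (lines, pages, or whatever is convenient) is an assumption.

```python
def figure_of_merit(codebase_size, spec_size):
    """Ratio of implementation size to API specification size (same units)."""
    if spec_size <= 0:
        raise ValueError("spec_size must be positive")
    return codebase_size / spec_size
```

The cases from the text come out as `figure_of_merit(1, 1) == 1.0`, `figure_of_merit(3, 2) == 1.5`, and `figure_of_merit(100, 1) == 100.0`.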

One thing this figure of merit helps explain is when one starts to hit the point of diminishing returns in mastering a new API. For example, as we master the initial core of a new library, we’re often learning a relatively small number of things in the API that nonetheless enable us to wield an enormous codebase. Thus, the effective figure of merit is high. Later, when we begin to master more obscure features of the library, the figure of merit drops substantially.

Of course, this figure of merit shouldn’t be taken all that seriously. Among its many problems, the number of lines of code in the codebase implementing the API is only a very rough proxy for the sophistication of the underlying codebase. A better programmer could likely implement the same API in fewer lines of code, but obviously this does not make the API less powerful. An alternate and better figure of merit might be the ratio of the time required for you to produce a codebase implementing the API, versus the time required to master the API. The larger this ratio, the more effort the API saves you. Regardless of which measure you use, this ratio seems a useful way of thinking about the power of an API.

A quite different issue is the quality of the API’s design. Certain abstractions are more powerful and useful than others; for example, many programmers claim that programming in Lisp makes them think in better or more productive ways. This is not because the Lisp API is especially powerful according to the figure of merit I have described above, but rather because Lisp (so I am told) introduces and encourages programmers to use abstractions that offer particularly effective ways of programming. Understanding what makes such an abstraction “good” is not something I’ll attempt to do here, although obviously it’s a problem of the highest importance for a programmer!

Which APIs are worth learning? This is a complicated question, which deserves an essay in its own right. I will say this: one should learn many APIs, across a wide variety of areas, and make a point of studying multiple APIs that address the same problem space using greatly different approaches. This is not just for the obvious reason that learning APIs is often useful in practice. It’s because, just as writers need to read, and movie directors should watch movies from other directors, programmers should study other people’s APIs. They should study other people’s code, as well, but it’s a bit like a writer studying the sentence structure in Lord of the Rings; you’ll learn a lot, but you may just possibly miss the point that Frodo is carrying a rather nasty ring. As a programmer, studying APIs will alert you to concepts and tricks of abstraction that may very well help in your own work, and which will certainly help improve your skill at rapidly judging and learning unfamiliar APIs.

How does one go about mastering an API? At its most basic level mastery means knowing all or most of the details of the specification of the API. At the moment I’m trying to master a few different APIs – for Ruby, Ruby on Rails, MySQL, Amazon EC2, Apache, bash, and emacs. I’ve been finding it tough going, not because it’s particularly difficult, but just because it takes quite a lot of time, and is often tedious. After some experimentation, the way I’ve been going about it is to prepare my own cheatsheets for each API. So, for example, I’ll take 15 minutes or half an hour and work through part of the Ruby standard library, writing up in my cheatsheet any library calls that seem particularly useful (interestingly, I find that when I do this, I also retain quite a bit about other, less useful, library calls). I try to do this at least once a day.

(Suggestions from readers for better ways to learn an API would be much appreciated.)

Of course, knowing the specification of an API is just the first level of mastery. The second level is to know how to use it to accomplish real tasks. Marvin Minsky likes to say that the only way you can really understand anything is if you understand it in at least two different ways, and I think a similar principle applies to learning an API – for each library call (say), you need to know at least two (and preferably more) quite different ways of applying that library call to a real problem. Ideally, this will involve integrating the API call in non-trivial ways with other tools, so that you begin to develop an understanding of this type of integration; this has the additional benefit that it will simultaneously deepen your understanding of the other tools as well.

Achieving this second level of mastery takes a lot of time and discipline. While I feel as though I’ve made quite some progress with the first level of mastery, I must admit that this second level tries my patience. It certainly works best when I work hard at finding multiple imaginative ways of applying the API in my existing projects.

There’s a still higher level of mastery of an API, which is knowing the limits of the abstract specification, and understanding how to work around and within those limits. Consider the example of manipulating files on a computer file system. In principle, operations like finding and deleting files should be more or less instantaneous, and for many practical purposes they are. However, if you’ve ever tried storing a very large number of files inside a single directory (just how large depends on your file system), you’ll start to realize that actually there is a cost to file manipulation, and it can start to get downright slow with large numbers of files.
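One way to find out where this limit sits on a particular machine is simply to measure it. The sketch below creates a given number of empty files in a temporary directory and times a single directory listing. The function name is mine, and listing is only one of several operations one might time; the results will depend heavily on the file system, so I make no claim about what numbers to expect.

```python
import os
import tempfile
import time

def time_directory_listing(num_files):
    """Create `num_files` empty files in a fresh directory and time one listing."""
    with tempfile.TemporaryDirectory() as directory:
        for i in range(num_files):
            # Each file is created empty; only the directory entry matters here.
            open(os.path.join(directory, f"file_{i}"), "w").close()
        start = time.perf_counter()
        entries = os.listdir(directory)
        elapsed = time.perf_counter() - start
    return len(entries), elapsed
```

Calling it with, say, 1,000, then 10,000, then 100,000 files and comparing the elapsed times gives a rough picture of how the cost grows.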

In general, for any API the formal specification is never the entire story. Implicit alongside the formal specification is a meta-story about the limits to that specification. How does the API bend and break? What should you do when it bends or breaks? Knowing these things often means knowing a little about the innards of the underlying codebase. It’s a question of knowing the right things so you can get a lot of benefit, without needing to know a huge amount. Poorly designed APIs require a lot of this kind of meta-knowledge, which greatly reduces their utility, in accord with our earlier discussion of API figures of merit.

We’ve been talking about APIs as things to learn. Of course, they are also things you can design. I’m not going to talk about good API design practice here – I don’t yet have enough experience – but I do think it’s worth commenting on why one ought to spend some fraction of one’s time designing and implementing APIs, preferably across a wide variety of domains. My experience, at least, is that API design is a great way of improving my skills as a programmer.

Of course, API design has an immediate practical benefit – I get pieces of code that I can reuse at later times without having to worry about the internals of the code. But this is only a small part of the reason to design APIs. The greater benefit is to improve my understanding of the problems I am solving, of how APIs function, and what makes a good versus a bad API. This improved understanding makes it easier to learn other APIs, improves how I use them, and, perhaps most important of all, improves my judgement about which APIs to spend time learning, and which to avoid.

Fred Brooks famously claimed that there is “no silver bullet” for programming, no magical idea or technique that will make it much easier. But Brooks was wrong: there is a silver bullet for programming, and it’s this building of multiple layers of abstraction using ever more powerful tools. What’s really behind Brooks’ observation is a simple fact of human psychology: as more powerful tools become available, we start to take our new capabilities for granted and so, inevitably, set our programming sights higher, desiring ever more powerful programs. The result is that we have to work as hard as ever, but can build more powerful tools.

What’s the natural endpoint of this process? At the individual level, if, for example, you master the API for 20 programming tools, each containing approximately 50,000 lines of code, then you can wield the power of one million lines of code. That’s a lot of code, and may give you the ability to create higher level tools that simply couldn’t have been created at the lower level. Those higher level tools can be used to create still higher level tools, and so on. Stuff that formerly would have been impossible first becomes possible, then becomes trivial, and finally becomes invisible, absorbed into higher-level primitives. If we move up half a dozen levels, buying a factor of 2-5 in power at each layer, the result is that we can get perhaps 1,000 times as much done, or more. Collectively, the gain for programmers over the long run is even greater. As time goes on we will see more and more layers of abstraction, built one on top of the other, ever expanding the range of what is possible with our computing systems.
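The numbers in this paragraph are easy to check. In the sketch below the function names are mine, and treating the gain at each layer as a constant multiplicative factor is a simplifying assumption.

```python
def combined_lines(num_tools, lines_per_tool):
    """Total lines of code wielded through a set of equally sized tools."""
    return num_tools * lines_per_tool

def compounded_gain(levels, factor_per_level):
    """Overall gain when each of `levels` layers multiplies power by the same factor."""
    return factor_per_level ** levels
```

`combined_lines(20, 50_000)` is 1,000,000. Over six levels, a factor of 2 per layer compounds to 64 and a factor of 5 per layer to 15,625. A total gain of 1,000 corresponds to a per-layer factor of about 3.2, comfortably inside the 2-5 range.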