The mismeasurement of science

Albert Einstein’s greatest scientific “blunder” (his word) came as a sequel to his greatest scientific achievement. That achievement was his theory of gravity, the general theory of relativity, which he introduced in 1915. Two years later, in 1917, Einstein ran into a problem while trying to apply general relativity to the Universe as a whole. At the time, Einstein believed that on large scales the Universe is static and unchanging. But he realized that general relativity predicts that such a Universe can’t exist: it would spontaneously collapse in on itself. To solve this problem, Einstein modified the equations of general relativity, adding an extra term involving what is called the “cosmological constant”, which, roughly speaking, is a type of pressure which keeps a static Universe from collapsing.

Twelve years later, in 1929, Edwin Hubble discovered that the Universe isn’t static and unchanging, but is actually expanding. Upon hearing the news, Einstein quickly realized that if he’d taken his original 1915 theory seriously, he could have used it to predict the expansion that Hubble had observed. That would have been one of the great theoretical predictions of all time! It was this realization that led Einstein to describe the cosmological constant as the “biggest blunder” of his life.

The story doesn’t end there. Nearly seven decades later, in 1998, two teams of astronomers independently made some very precise measurements of the expansion of the Universe, and discovered that there really is a need for the cosmological constant (ref,ref). Einstein’s “biggest blunder” was, in fact, one of his most prescient achievements.

The point of the story of the cosmological constant is not that Einstein was a fool. Rather, the point is that it’s very, very difficult for even the best scientists to accurately assess the value of scientific discoveries. Science is filled with examples of major discoveries that were initially underappreciated. Alexander Fleming abandoned his work on penicillin. Max Born won the Nobel Prize in physics for a footnote he added in proof to a paper – a footnote that explains how the quantum mechanical wavefunction is connected to probabilities. That’s perhaps the most important idea anyone had in twentieth century physics. Assessing science is hard.

The problem of measuring science

Assessing science may be hard, but it’s also something we do constantly. Countries such as the United Kingdom and Australia have introduced costly and time-consuming research assessment exercises to judge the quality of scientific work done in those countries. In just the past few years, many new metrics purporting to measure the value of scientific research have been proposed, such as the h-index, the g-index, and many more. In June of 2010, the journal Nature ran a special issue on such metrics. Indeed, an entire field of scientometrics is being developed to measure science, and there are roughly 1,500 professional scientometricians.

There’s a slightly surreal quality to all this activity. If even Einstein demonstrably made enormous mistakes in judging his own research, why are the rest of us trying to measure the value of science systematically, and even organizing the scientific systems of entire countries around these attempts? Isn’t the lesson of the Einstein story that we shouldn’t believe anyone who claims to be able to reliably assess the value of science? Of course, the problem is that while it may be near-impossible to accurately evaluate scientific work, as a practical matter we are forced to make such evaluations. Every time a committee decides to award or decline a grant, or to hire or not hire a scientist, they are making a judgement about the relative worth of different scientific work. And so our society has evolved a mix of customs and institutions and technologies to answer the fundamental question: how should we allocate resources to science? The answer we give to that question is changing rapidly today, as metrics such as citation count and the h-index take on a more prominent role. In 2006, for example, the UK Government proposed changing their research assessment exercise so that it could be done in a largely automated fashion, using citation-based metrics. The proposal was eventually dropped, but nonetheless the UK proposal is a good example of the rise of metrics.

In this essay I argue that heavy reliance on a small number of metrics is bad for science. Of course, many people have previously criticised metrics such as citation count or the h-index. Such criticisms tend to fall into one of two categories. In the first category are criticisms of the properties of particular metrics, for example, that they undervalue pioneer work, or that they unfairly disadvantage particular fields. In the second category are criticisms of the entire notion of quantitatively measuring science. My argument differs from both these types of arguments. I accept that metrics in some form are inevitable – after all, as I said above, every granting or hiring committee is effectively using a metric every time they make a decision. My argument instead is essentially an argument against homogeneity in the evaluation of science: it’s not the use of metrics I’m objecting to, per se, rather it’s the idea that a relatively small number of metrics may become broadly influential. I shall argue that it’s much better if the system is very diverse, with all sorts of different ways being used to evaluate science. Crucially, my argument is independent of the details of what metrics are being broadly adopted: no matter how well-designed a particular metric may be, we shall see that it would be better to use a more heterogeneous system.

As a final word before we get to the details of the argument, I should perhaps mention my own prejudice about the evaluation of science, which is the probably not-very-controversial view that the best way to evaluate science is to ask a few knowledgeable, independent- and broad-minded people to take a really deep look at the primary research, and to report their opinion, preferably while keeping in mind the story of Einstein and the cosmological constant. Unfortunately, such a process is often not practically feasible.

Three problems with centralized metrics

I’ll use the term centralized metric as a shorthand for any metric which is applied broadly within the scientific community. Examples today include the h-index, the total number of papers published, and total citation count. I use this terminology in part because such metrics are often imposed by powerful central agencies – recall the UK government’s proposal to use a citation-based scheme to assess UK research. Of course, it’s also possible for a metric to be used broadly across science, without being imposed by any central agency. This is happening increasingly with the h-index, and has happened in the past with metrics such as the number of papers published, and the number of citations. In such cases, even though the metric may not be imposed by any central agency, it is still a central point of failure, and so the term “centralized metric” is appropriate. In this section, I describe three ways centralized metrics can inhibit science.

Centralized metrics suppress cognitive diversity: Over the past decade the complexity theorist Scott Page and his collaborators have proved some remarkable results about the use of metrics to identify the “best” people to solve a problem (ref,ref). Here’s the scenario Page and company consider. Suppose you have a difficult creative problem you want solved – let’s say, finding a quantum theory of gravity. Let’s also suppose that there are 1,000 people worldwide who want to work on the problem, but you have funding to support only 50 people. How should you pick those 50? One way to do it is to design a metric to identify which people are best suited to solve the problem, and then to pick the 50 highest-scoring people according to that metric. What Page and company showed is that it’s sometimes actually better to choose 50 people at random. That sounds impossible, but it’s true for a simple reason: selecting only the highest scorers will suppress cognitive diversity that might be essential to solving the problem. Suppose, for example, that the pool of 1,000 people contains a few mathematicians who are experts in the mathematical field of stochastic processes, but who know little about the topics usually believed to be connected to quantum gravity. Perhaps, however, unbeknownst to us, expertise in stochastic processes is actually critical to solving the problem of quantum gravity. If you pick the 50 “best” people according to your metric it’s likely that you’ll miss that crucial expertise. But if you pick 50 people at random you’ve got a chance of picking up that crucial expertise [1]. Richard Feynman made a similar point in a talk he gave shortly after receiving the Nobel Prize in physics (ref):

If you give more money to theoretical physics it doesn’t do any good if it just increases the number of guys following the comet head. So it’s necessary to increase the amount of variety… and the only way to do it is to implore you few guys to take a risk with your lives that you will never be heard of again, and go off in the wild blue yonder and see if you can figure it out.

What makes Page and company’s result so striking is that they gave a convincing general argument showing that this phenomenon occurs for any metric at all. They dubbed the result the diversity-trumps-ability theorem. Of course, exactly when the conclusion of the theorem applies depends on many factors, including the nature of the cognitive diversity in the larger group, the details of the problem, and the details of the metric. In particular, it depends strongly on something we can’t know in advance: how much or what type of cognitive diversity is needed to solve the problem at hand. The key point, though, is that it’s dangerously naive to believe that doing good science is just a matter of picking the right metric, and then selecting the top people according to that metric. No matter what the metric, it’ll suppress cognitive diversity. And that may mean suppressing knowledge crucial to solving the problem at hand.
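To put rough numbers on the stochastic-processes example, here is a minimal sketch in Python. It is not Page's theorem, only the elementary probability behind the example, and it rests on assumptions made purely for illustration: the pool holds a known number of hidden experts, the metric ranks all of them outside the top 50 (so selection by metric picks none of them), and the random alternative draws 50 people uniformly without replacement. The function name and the expert counts are hypothetical.

```python
from math import comb


def prob_random_pick_includes_expert(pool_size: int, picks: int, experts: int) -> float:
    """Chance that a uniformly random choice of `picks` people from a pool of
    `pool_size` includes at least one of the `experts` hidden specialists."""
    # Complement rule: first compute the chance that every pick is a non-expert.
    prob_none = comb(pool_size - experts, picks) / comb(pool_size, picks)
    return 1.0 - prob_none


if __name__ == "__main__":
    # Hypothetical numbers of hidden experts in the pool of 1,000.
    for hidden in (1, 5, 10, 20):
        p = prob_random_pick_includes_expert(1000, 50, hidden)
        print(f"{hidden:2d} hidden experts -> P(at least one is picked) = {p:.3f}")
```

With a single hidden expert the chance is 50/1000, or 5%, and it rises as the number of such experts grows. Under the assumption above, selection by metric stays at zero however many there are.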

Centralized metrics create perverse incentives: Imagine, for the sake of argument, that the US National Science Foundation (NSF) wanted to encourage scientists to use YouTube videos as a way of sharing scientific results. The videos could, for example, be used as a way of explaining crucial-but-hard-to-verbally-describe details of experiments. To encourage the use of videos, the NSF announces that from now on they’d like grant applications to include viewing statistics for YouTube videos as a metric for the impact of prior research. Now, this proposal obviously has many problems, but for the sake of argument please just imagine it was being done. Suppose also that after this policy was implemented a new video service came online that was far better than YouTube. If the new service was good enough then people in the general consumer market would quickly switch to the new service. But even if the new service was far better than YouTube, most scientists – at least those with any interest in NSF funding – wouldn’t switch until the NSF changed its policy. Meanwhile, the NSF would have little reason to change their policy, until lots of scientists were using the new service. In short, this centralized metric would incentivize scientists to use inferior systems, and so inhibit them from using the best tools.

The YouTube example is perhaps fanciful, at least today, but similar problems do already occur. At many institutions scientists are rewarded for publishing in “top-tier” journals, according to some central list, and penalized for publishing in “lower-tier” journals. For example, faculty at Qatar University are given a reward of 3,000 Qatari Rials (US $820) for each impact factor point of a journal they publish in. If broadly applied, this sort of incentive would create all sorts of problems. For instance, new journals in exciting emerging fields are likely to be still establishing themselves, and so have a lower impact factor. So the effect of this scheme will be to disincentivize scientists from participating in new fields; the newer the field, the greater the disincentive! Any time we create a centralized metric, we yoke the way science is done to that metric.

Centralized metrics misallocate resources: One of the causes of the financial crash of 2008 was a serious mistake made by rating agencies such as Moody’s, S&P, and Fitch. The mistake was to systematically underestimate the risk of investing in financial instruments derived from housing mortgages. Because so many investors relied on the rating agencies to make investment decisions, the erroneous ratings caused an enormous misallocation of capital, which propped up a bubble in the housing market. It was only after homeowners began to default on their mortgages in unusually large numbers that the market realized that the rating agencies were mistaken, and the bubble collapsed. It’s easy to blame the rating agencies for this collapse, but this kind of misallocation of resources is inevitable in any system which relies on centralized decision-making. The reason is that any mistakes made at the central point, no matter how small, then spread and affect the entire system.

In science, centralization also leads to a misallocation of resources. We’ve already seen two examples of how this can occur: the suppression of cognitive diversity, and the creation of perverse incentives. The problem is exacerbated by the fact that science has few mechanisms to correct the misallocation of resources. Consider, for example, the long-term fate of many fashionable fields. Such fields typically become fashionable as the result of some breakthrough result that opens up many new research possibilities. Encouraged by that breakthrough, grant agencies begin to invest heavily in the field, creating a new class of scientists (and grant agents) whose professional success is tied not just to the past success of the field, but also to the future success of the field. Money gets poured in, more and more people pursue the area, students are trained, and go on to positions of their own. In short, the field expands rapidly. Initially this expansion may be justified, but even after the field stagnates, there are few structural mechanisms to slow continued expansion. Effectively, there is a bubble in such fields, while less fashionable ideas remain underfunded as a result. Furthermore, we should expect such scientific bubbles to be more common than bubbles in the financial market, because decision making is more centralized in science. We should also expect scientific bubbles to last longer, since, unlike financial bubbles, there are few forces able to pop a bubble in science; there’s no analogue to the homeowner defaults to correct the misallocation of resources. Indeed, funding agencies can prop up stagnant fields of research for decades, in large part because the people paying the cost of the bubble – usually, the taxpayers – are too isolated from the consequences to realize that their money is being wasted.

One metric to rule them all

No-one sensible would staff a company by simply applying an IQ test and employing whoever scored highest (cf., though, ref). And yet there are some in the scientific community who seem to want to move toward staffing scientific institutions by whoever scores highest according to the metrical flavour-of-the-month. If there is one point to take away from this essay it is this: beware of anyone advocating or working toward the one “correct” metric for science. It’s certainly a good thing to work toward a better understanding of how to evaluate science, but it’s easy for enthusiasts of scientometrics to believe that they’ve found (or will soon find) the answer, the one metric to rule them all, and that that metric should henceforth be broadly used to assess scientific work. I believe we should strongly resist this approach, and aim instead to both improve our understanding of how to assess science, and also to ensure considerable heterogeneity in how decisions are made.

One tentative idea I have which might help address this problem is to democratize the creation of new metrics. This can happen if open science becomes the norm, so scientific results are openly accessible, online, making it possible, at least in principle, for anyone to develop new metrics. That sort of development will lead to a healthy proliferation of different ideas about what constitutes “good science”. Of course, if this happens then I expect it will lead to a certain amount of “metric fatigue” as people develop many different ways of measuring science, and there will be calls to just settle down on one standard metric. I hope those calls aren’t heeded. If science is to be anything more than lots of people following the comet head, we need to encourage people to move in different directions, and that means valuing many different ways of doing science.

Update: After posting this I Googled my title, out of curiosity to see if it had been used before. I found an interesting article by Peter Lawrence, which is likely of interest to anyone who enjoyed this essay.

Acknowledgements

Thanks to Jen Dodd and Hassan Masum for many useful comments. This is a draft of an essay to appear in a forthcoming volume on reputation systems, edited by Hassan Masum and Mark Tovey.

Footnotes

[1] Sometimes an even better strategy will be a mixed strategy, e.g., picking the top 40 people according to the metric, and also picking 10 at random. So far as I know this kind of mixed strategy hasn’t been studied. It’s difficult to imagine that the proposal to pick, say, one in five faculty members completely at random is going to receive much support at universities, no matter how well founded the proposal may be. We have too much intuitive sympathy for the notion that the best way to generate global optima is to locally optimize. Incidentally, the success of such mixed strategies is closely related to the phenomenon of stochastic resonance, wherein adding noise to a system can sometimes improve its performance.
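
The mixed strategy described in this footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary of scores, and the default 40/10 split are hypothetical, and nothing here says how well such a strategy would perform on a real problem.

```python
import random


def mixed_selection(scores, n_top=40, n_random=10, seed=None):
    """Pick the `n_top` highest scorers, then `n_random` more people drawn
    uniformly at random from everyone who was not already picked.

    `scores` is a hypothetical dict mapping each person to a metric score.
    """
    rng = random.Random(seed)  # local generator, so results are repeatable
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:n_top]
    rest = ranked[n_top:]
    # Sample without replacement; never ask for more people than remain.
    lucky = rng.sample(rest, min(n_random, len(rest)))
    return top, lucky
```

Calling `mixed_selection(scores)` on a pool of 1,000 scored people returns two lists: the 40 top scorers and 10 others chosen without regard to score.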


35 comments

  1. A couple of comments.

    1) I think it’s fairly obvious that relying on “centralized metrics” to allocate research funds and/or jobs leads to inferior outcomes. The confounding factor is that those who seem to rely most-heavily on centralized metrics are already in the 2nd (or 3rd or 4th or 5th…) tier. It’s not at all clear whether they are 2nd-rate because of their reliance on centralized metrics, or whether the reverse is true.

    See, the problem is this: it is relatively easy to name the top 20 people in a particular field; it’s much harder to name numbers 105-124. It’s in figuring out who the latter are that quantitative metrics become important (or, at least, that viable alternatives become hard to find).

    That’s why your concrete example involved Qatar University, not Harvard University.

    2) I think your account of scientific bubbles is seriously deficient. On the one hand, you describe a field as “stagnating”; on the other hand, you describe new researchers and new research dollars as continuing to flow into the field. That doesn’t sound much like “stagnation” and, from experience on Departmental committees that end up trying to tackle the phenomenon in the real world, that’s not what stagnation looks like in practice.

    In practice, what happens is that you have a once-hot field, with a group of now-tenured faculty members. These faculty continue to pull in substantial research grants. (Over at the funding agency, the program funding their research continues to receive large pots of money, for similar institutional reasons.) And they’d like to perpetuate that situation — through new faculty hires, recruiting more graduate students to the field (for some reason, most prospective graduate students don’t seem all that interested…), etc.

    The problem to be overcome is not — I would say — poor metrics, but rather institutional entrenchment.

    3) Your Feynman quote is typical self-serving Feynman. Of course the correct answer to the problem he poses is simply to fund more Feynmans!

  2. Well put, I agree with you: the use of metrics is in some regards useful and unavoidable anyway, but it’s the use of centralized criteria that brings trouble. The problem is that this is where we’re headed.

  3. Great essay. Probably worth a mention of the types of metrics (citation, usage, etc) and more on impact factor. Also probably worth a mention of efforts to bring in new metrics from social software mentions, etc (e.g. Priem and Costello).

  4. From the comment above: “Your Feynman quote is typical self-serving Feynman. Of course the correct answer to the problem he poses is simply to fund more Feynmans!”

    But that’s exactly the point. We’ve got to fund more Feynmans, more Einsteins, more Darwins if we want to advance science. Advances in science come not only in the form of technical advances (e.g. inventing the telescope, sequencing the human genome…). The greatest ***scientific*** advances are conceptual, paradigmatic ones (e.g. the theory of relativity, the theory of evolution…), and these kinds of advances are only made possible if we insert some diversity into our scientific fields. Assessment of science based on a few metrics hinders the funding and development of valuable, non-trendy research.

    What I learned from reading this article is that decisions about whether to award or decline a grant should also consider whether an individual project sheds some new light onto a problem.

  5. A couple more comments:

    1) From various points of view, you argue for injecting an element of randomness into the selection process. I would argue that is already part of the process, in that a significant part of the funding pie is typically devoted to new researchers.

    They, for obvious reasons, typically don’t score well on whatever metric you happen to be using. So selecting among them involves a level of guesswork (randomness) not present in selecting among “established” researchers.

    That doesn’t address all of your issues (e.g., there’s no particular reason to expect new researchers to possess the esoteric knowledge we don’t currently realize will later turn out to be crucial to cracking some particular problem), but it is baked into the cake already, and shouldn’t be neglected in your discussion.

    2) You suggest that “open science” could allow “anyone to develop new metrics,” and then make an elliptical reference to accessing server logs, as a prerequisite. Since you don’t give any examples, it’s not clear what you have in mind. But, if I had to guess at what is being suggested, all I can say is that would be an unmitigatedly bad idea.

    MN: Reference to server logs deleted, as it distracts from the main point.

  6. > 1) From various points of view, you argue for injecting an element of randomness into the selection process. I would argue that is already part of the process, in that a significant part of the funding pie is typically devoted to new researchers.

    Jacques, are new researchers really that random? I seem to recall a vein of research showing ‘Matthew effects’ where established researchers steer or otherwise increase the chances of associated juniors in getting grants, publications, and prizes. New researchers aren’t very random if they are merely following up unoriginal tangents thought up by their seniors.

  7. “Random” in the sense that some will turn out to be great, some will turn out to be duds, but your metric (whatever it is) is more-or-less useless in determining which will be great and which will be duds.

    Sorry if that was unclear.

  8. I think that central, government-based funding is a concern. It will unavoidably put in the hands of few people the power to direct science, more or less blindly. It is made worse when it is decided that fewer, but larger, grants are better.

    In effect, to get diverse metrics, you need diverse funding sources. We also need a more diverse pool of employers (or patrons) in science.

    The current system is fundamentally flawed.

    [MN: I quite agree: with more diverse funding sources, science itself will explore far more different directions, not just a few comet heads…]

  9. I’d love to see what statisticians come up with to satisfy the diversity problem with metrics. Perhaps they’ll invent a sort of affirmative action style policy. Of course that would fail, for all the reasons you mentioned, but I’m looking forward to it being implemented on account of these observations.

  11. The most frequently used measures of scientific impact tend to be a combination of the general impact factor of the scientific publication in which the work in question appears and the number of times the publication is subsequently cited by other scientists in their own published work. These data are commonly used to determine the scientific impact of an investigator when it comes to career advancement and the awarding of grant-in-aid funds. One significant problem is that it may take several years post-publication before a reasonable citation count can be established for a scientific manuscript, so the general impact score of the journal in which it appears actually dominates the assessment.

    In my observations and experiences, the premier scientific journals tend to be very trendy in terms of what they choose to publish. If someone works and publishes something truly novel, it is highly unlikely to appear in such journals. If one is engaged in research in a “hot” area, then there is also a much greater prospect that other scientists will bother to read and perhaps cite a publication in this field. This generates a lot of what I call “me too research” down well trodden paths. While it may seem that this published work is apparently having an impact, the real question is whether it is truly advancing science in new and productive directions.

    For most academic scientists, their key objective is to create new knowledge, but this need not have immediate practical outcomes. The expectation of the general public that actually funds research is that there should be some clear practical results with applications in, for just a few examples, medicine, agriculture, and energy production. Such translation of basic research can take decades so the real measure of scientific impact of even a body of work from an investigator will rarely become evident within a period that can have a bearing on their career progression and especially their grant funding. Nobel prizes almost never are awarded to young investigators.

    In view of this clear difficulty in assessment of the importance of the scientific contributions of an individual investigator and their proposed studies for allocation of grant funding as well as many other caveats with other aspects of this process, it is probably time for a complete overhaul of the system. The continuing trend has been to support a smaller percentage of the biomedical researchers with larger grants. I advocate that more people should be funded with smaller grants. With smaller allocations, investigators will form more transient, but natural alliances and collaborations that are much more likely to be diversified, fruitful, and require much less counter-productive administration.

  12. perhaps: for each grant, one can define an appropriate metric.
    perhaps: metrics need not operate on marginal statistics of individual scientists, but rather, joint statistics. that is, if one desires a diverse group of people, first choose the scientist with the highest score, then choose the next one discounting for overlap with the first one, etc. this will yield the best diverse group of scientists. to my knowledge, pairwise, or other higher order metrics are not available, although i’d be interested in seeing them….

    [MN: Collective metrics are an intriguing idea. The difficulty, of course, is that you then have to explain to the “second-best” (according to your basic metric) scientist that they didn’t get funding because they overlapped too much with number 1, while number 7 is getting funding, because they overlap much less. Not something we humans would have an easy time with, I suspect – the primate brain has some pretty inbuilt notions of fairness, and this would violate them! Still, I like this way of thinking, and maybe something satisfactory could be done.]

  13. You say that given a goodness g(p) for each person p it is not a good idea to pick the best 50 people out of 1000, for any goodness function g. Then, in the footnote, you propose picking the best 40 (according to some g) and 10 others at random. Isn’t that (almost) the same as picking the best 50 according to the goodness g'(p) = g(p) + (1 with probability 10/1000), where g(p) is normalized with maximum 1? How can that be?

    I guess what I’m saying is this: From a purely theoretical point of view, I have trouble understanding the difference between one (possibly randomized) metric and multiple metrics.

    [MN: Yes, it’s similar. But it’s not exactly the same. I expect (in fact, I’m quite certain) that under some circumstances what I propose would work better. But I haven’t explicitly worked out an example.]

  14. @Christina: Thanks, as always, for your comments. I do already mention several metrics, including citation count, number of papers and impact factor. I’m reluctant to add more on impact factor: it’s been done to death, and, of course, all my arguments apply to it, since it’s a centralized metric. I have a hard time getting worked up about the impact factor the way some people do, perhaps because none of the hiring committees I’ve ever sat on cared very much about impact factor. When writing the essay I considered discussing new uses of metrics in social software for scientists. It seemed to me best either to expand the essay quite a bit to talk about this very interesting subject, or else to remain silent, and take it as given that the reader will see the relevance. Ultimately, I decided to leave the essay as is, and go for silence.
