Ruby script to count the number of lines changed in a subversion repository

I like to set minimal targets for the amount of work I’m going to do on different projects during a week – things like word counts on papers, and so on. In that vein, here’s a short Ruby script to count the number of lines changed in a subversion repository. It returns the number of added lines and deleted lines from today’s edits. The advantage of this over most wordcount facilities is that it can be used both for programming and writing projects, and copes with an entire directory structure, not just a single file.

require 'date'

# Diff from the start of today to the start of tomorrow, i.e. today's edits.
# (The date calls here are a reconstruction; the originals were garbled.)
today    = Date.today.strftime('%Y-%m-%d')
tomorrow = (Date.today + 1).strftime('%Y-%m-%d')

print "Additions: "
puts `svn diff -r {#{today}}:{#{tomorrow}} | grep "^+" | grep -v "^+++" | wc -l`

print "Deletions: "
puts `svn diff -r {#{today}}:{#{tomorrow}} | grep "^-" | grep -v "^---" | wc -l`

It’s easily modified to run as a shell script, or to vary the dates.
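For instance, to vary the dates one can take them from the command line. Here is a minimal sketch along those lines (the counting is moved into Ruby, so the grep/wc pipeline is no longer needed; the script name and helper are my own invention):

```ruby
#!/usr/bin/env ruby
# Usage: svncount.rb [FROM] [TO]  -- dates as YYYY-MM-DD,
# defaulting to today and tomorrow (i.e. today's edits).
require 'date'

# Count added and deleted lines in a unified diff, skipping
# the "+++" and "---" file-header lines.
def diff_counts(diff_text)
  added   = diff_text.lines.count { |l| l.start_with?('+') && !l.start_with?('+++') }
  deleted = diff_text.lines.count { |l| l.start_with?('-') && !l.start_with?('---') }
  [added, deleted]
end

from = ARGV[0] || Date.today.strftime('%Y-%m-%d')
to   = ARGV[1] || (Date.today + 1).strftime('%Y-%m-%d')

# The rescue keeps the sketch from crashing when svn isn't available.
diff = (`svn diff -r {#{from}}:{#{to}}` rescue nil)
if diff
  added, deleted = diff_counts(diff)
  puts "Additions: #{added}"
  puts "Deletions: #{deleted}"
end
```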


The future of libraries

Via Peter Brantley, a beautiful comment from Winston Tabb, the Librarian at The Johns Hopkins University, which provides a nice lens through which to view the challenging situation libraries find themselves in:

Data centers are the new stacks.

It’s an interesting time to be a librarian. I spent a couple of months earlier this year working in the State Library of Queensland. I met some librarians who were incredibly upbeat about current opportunities, and others who were in denial. One day, at lunch, I overheard a conversation between two librarians discussing the “impossibility” of putting the entire back catalogue of major newspapers online. I wonder how long it will be before Google or someone else unveils such a catalogue.

Update: In comments, John Dupuis, Head of the Science Library at York University, points to a thoughtful series of essays on the opportunities and challenges now before libraries.


“The” Scientific Method

Cosma Shalizi ponders the Scientific Method and the Philosophy of Science:

Philosophy of science these days seems largely concerned with questions of method, justification and reliability — what do scientists do (and are they all doing the same thing? are they doing what they think they’re doing?), and does it work, and if so why, and what exactly does it produce? There are other issues, too, like, do scientific theories really tell us about the world, or just give us tools for making predictions (and is there a difference there?). The whole reductionism—emergence squabble falls under this discipline, too. But (so far as an outsider can judge), method is where most of the debate is these days.

Of course, most scientists proceed in serene indifference to debates in methodology, and indeed all other aspects of the philosophy of science. What Medawar wrote thirty years ago and more is still true today:

If the purpose of scientific methodology is to prescribe or expound a system of enquiry or even a code of practice for scientific behavior, then scientists seem to be able to get on very well without it. Most scientists receive no tuition in scientific method, but those who have been instructed perform no better as scientists than those who have not. Of what other branch of learning can it be said that it gives its proficients no advantage; that it need not be taught or, if taught, need not be learned?

(Actually, has anyone done a controlled study of that point?) One of the things a good methodology should therefore do is explain why scientists don’t have to know it.

An observation I find fascinating is that scientists employ very different norms when evaluating what it means to know something in their area of expertise versus what they know about doing science. An experimental physicist may have extremely rigorous standards of what it means to determine something experimentally, and a far more seat-of-the-pants means of evaluating knowledge about what it means to be a good experimental physicist. And, of course, they apply different standards again to everyday knowledge.

Cosma continues a little later:

Now of course working scientists do employ lots of different methods, which are of varying quality. The same is true of all learned professions, and it is probably also true that most professionals (lawyers, architects, doctors) pay no heed to foundational debates about what they are doing. Instead methods seem to breed within the profession — this technique is unreliable under these circumstances, that procedure works better than the old one, etc. — without, as it were, the benefit of philosophical clergy.

Feyerabend had a nice term for this – “anything goes”. I don’t think he meant this literally. Rather, he meant that method was something that scientists invented on a case-by-case basis, with formal methodology being only a heuristic guide, not gospel.


StartupCamp Waterloo

StartupCamp is on in Waterloo next Tuesday night, 6pm to 9pm, at the Waterloo Accelerator Centre, 295 Hagey Blvd., Waterloo. I went to a similar event, DemoCamp Guelph, in June, and it was terrific – lots of energy and great ideas in the room, with a nice informal atmosphere that made it easy to meet people. Great for anyone interested in entrepreneurship, or just in interesting new technology.

More generally, if you’re in the Greater Toronto area, and haven’t already done so, you should check out the TorCamp page, which lists all the BarCamps, DemoCamps and so on going on in the area. There’s a lot of really amazing stuff going on!


Academic Reader Database

Just a brief note for users of the Academic Reader – we lost a few hours worth of database updates today. It was an error on my part, and I apologize if anyone lost anything significant because of it. Ironically, the error occurred as I was updating the code so that database backups are made more frequently than once every 24 hours.


Open source software at centralized servers?

Does anyone know of examples of open source software projects which are developing software that is run on large centralized servers? I can think of one example off the top of my head – Second Life – but can’t think of any others.

(I am, of course, asking for a reason – I’m interested in whether open source might be a viable development model for tools for scientific collaboration and publication.)

My impression at the moment is that there are few centralized web services which are open source. I can think of a couple of natural reasons why this might be the case.

First are security issues. In any software development one needs to be sure the programmers are honest, and not slipping back doors into the code, or making unethical use of the database. This is potentially harder to control in an open source software project.

Second, although the software may in some sense be owned by the wider community, it does not necessarily follow that the server is owned by the wider community. The group that owns the servers has a much greater incentive to contribute, and other people less so, which lessens the advantages to be had by open sourcing the project.

Are there any reasons I’m missing? Centralized services other than Second Life which are open source?


Scientific communication in the 21st century

By guest blogger Peter Rohde

In the last year the number of papers I have read in full can easily be counted on my hands. For the larger part I only read abstracts. Why is this? Because for most academic works I’m not especially interested in the details of calculations or the nitty-gritty fine points of results. That’s something I’ll refer back to if and when I need it. For the larger part I’m only interested in understanding what has been done, what approaches were used to obtain the results, and what the remaining unanswered questions are. Typically these things can be characterized much more compactly than via a full scientific paper.

Aside from reading abstracts I gain much of my knowledge by speaking to people. This is a particularly useful way of learning for two reasons. First, it is efficient, unlike verbose papers; second, it is interactive. If a particular point is not clear to me, I can grill for more detail. So, for the larger part, verbose scientific papers are far less useful to me than their abstracts or talking to other people. Both of these points concur with the suggestions made in Robin Blume-Kohout’s contribution to this blog, where he advocates the “choose your own adventure”, or hierarchically structured, model. Evidently, speaking to other people is an example of this model – we prefer the terse over the verbose, with elaborations only when required.

In such a structure, as I would envisage it, the abstract would be the root node of a tree. It would summarize the paper in a condensed but completely self-contained way – a micro-publication in very compact form. Each of the components in the abstract could be folded out to reveal further underlying details. This way the content is tailored to every reader. It means that I can continue doing what I normally do – only reading abstracts – with the bonus that if a particular aspect of the abstract interests me, I can delve into it a little further without having to read the entire paper.
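As a toy illustration of the fold-out tree described above (the class and method names are my own invention, not part of any existing system):

```ruby
# A node in a hierarchically structured paper: a summary sentence
# plus optional children that "fold out" into more detail.
class PaperNode
  attr_reader :summary, :children

  def initialize(summary, children = [])
    @summary  = summary
    @children = children
  end

  # Render the tree down to a chosen depth: depth 0 is just the
  # abstract; larger depths fold out successive layers of detail.
  def render(depth, indent = 0)
    out = "#{'  ' * indent}#{summary}\n"
    if depth > 0
      children.each { |c| out << c.render(depth - 1, indent + 1) }
    end
    out
  end
end

paper = PaperNode.new("We show X implies Y.", [
  PaperNode.new("Background on X.", [PaperNode.new("Full definitions.")]),
  PaperNode.new("Proof sketch.",    [PaperNode.new("Complete proof.")])
])

puts paper.render(1)   # the abstract plus one layer of detail
```

A reader who only wants the abstract renders depth 0; each extra level of depth folds out one more layer of the paper.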

This type of scientific communication lends itself exclusively to online publication. Indeed electronic media provides a plethora of new ways to structure and modularize information. Despite this, scientific publication has been stuck in a time warp where the archaic form of publication has been preserved. Essentially, present day electronic publications are structured and organized in exactly the same way as printed publications were 50 years ago, the only difference being that an LCD replaces paper. This is a sad misuse of resources.

Almost every other aspect of e-society has adopted, to some extent, the ideas advocated here and by Robin. Wikipedia is the obvious, and perhaps most sophisticated, example: every point in every article cross-references other articles, creating a highly modularized and hierarchical structure. There are also less obvious examples. These days I never purchase newspapers, and it’s not a matter of saving money; it’s a matter of structural design. If I go to any major online news source, I’m presented with a very elegantly structured, hyperlinked front page. At the top of the page are all the headlines, each with a single-line summary. Below this are sections for international news, politics, technology, science, etc., each with their own headlines and single-line summaries. In principle I could read just the front page and have a pretty good idea of what’s going on in the world, and if I want more detail I can follow the links. This is much more efficient than the style adopted by many conventional newspapers of having one main story on the front page, a few other headlines crammed at the bottom, and all the rest jammed into separate pullouts.

Another area where the e-world is a step ahead of the paper world is in creating awareness of content. In present day scientific communication awareness of articles is created via two primary means. The first is by speaking with fellow scientists who draw our attention to articles that interested them. The second is by stumbling across things by oneself, for example, by reading the daily arXiv feeds. The trouble is that nowadays there is so much throughput that it becomes increasingly difficult to keep track of it all. A good analogy is the internet itself. Clearly the amount of material becoming available online is far too large to manage on one’s own. So to increase awareness of things that are of general interest, sites such as Slashdot, reddit and Digg have emerged. All these sites use some voting mechanism to create a list of pages that are of most interest to the online community. I think it is rapidly reaching the point where coping with the massive quantity of scientific communication will necessitate these kinds of approaches.

Another example of awareness creation, which is perhaps more suited to scientific publication, is the recommendation system. Some well-known examples of recommendation systems are Amazon, iTunes and StumbleUpon. Here users’ preferences for pages/books/music are tracked, but not with the intention of creating a popularity list. Instead the preferences are hidden and only used internally by the service provider, who cross-correlates your preferences with other users’ to suggest pages/books/music that might be of interest to you. This approach to discovering material is clearly much more effective than trawling through the immense amount of material out there on my own. Instead I can exploit the fact that others have done it for me.
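A toy sketch of that cross-correlation idea – a user-based recommender weighting other users by cosine similarity of their ratings. The data and all names are invented for illustration, not taken from any of the services above:

```ruby
# Each user is a hash from item to rating. Score unseen items by
# summing other users' ratings, weighted by taste similarity.
def similarity(a, b)
  shared = a.keys & b.keys
  return 0.0 if shared.empty?
  dot = shared.sum { |k| a[k] * b[k] }
  mag = ->(v) { Math.sqrt(v.values.sum { |x| x * x }) }
  dot / (mag.call(a) * mag.call(b))
end

def recommend(me, others)
  scores = Hash.new(0.0)
  others.each do |other|
    w = similarity(me, other)
    other.each { |item, rating| scores[item] += w * rating unless me.key?(item) }
  end
  scores.sort_by { |_, s| -s }.map(&:first)
end

alice  = { 'paper A' => 1.0, 'paper B' => 1.0 }
others = [
  { 'paper A' => 1.0, 'paper B' => 1.0, 'paper C' => 1.0 },  # similar taste
  { 'paper D' => 1.0 }                                       # no overlap
]
puts recommend(alice, others).first   # prints "paper C"
```

The user with overlapping tastes carries all the weight, so their extra item comes out on top, while items from the dissimilar user score zero.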

In summary: first, the structure of present day scientific communication is inherently archaic – it replaces paper with an LCD while taking little advantage of the abundance of possibilities for structuring information. Second, the sheer magnitude of scientific communication necessitates new means of creating awareness of material, using, for example, recommendation systems. While it’s very easy for me to sit here and hurl criticism at the current system, it’s not so straightforward to actually effect a transition to a different model. One route would be to convince a major publisher to adopt some of the aforementioned suggestions, and hope that it’s a success. The other would be to set up a new system (e.g. a wiki or the like) and convince a group of reputable scientists to transition to it. In either case, success would require a certain critical mass.