Announcement: RailsNite Waterloo

Announcing RailsNite Waterloo, to be held Monday, Nov 12, 6pm-9pm, at Caesar Martini’s (140 University Ave West).

Ruby on Rails is a powerful and easy-to-use Web development framework, based on the Ruby programming language. It’s being widely used by web startups, including Twitter, 37 Signals (Basecamp), and 43 Things.

RailsNite Waterloo is an opportunity for people in the Waterloo region who are interested in Rails to meet, share knowledge, and ask questions. The format will be informal, oriented mostly towards meeting people and discussion. Everyone is encouraged to attend, from those who are simply curious about Rails through to experienced developers.

We’ll have one or more short presentations during the evening. Please RSVP to Michael Nielsen (mnielsen@perimeterinstitute.ca) by Nov 12 if you’re attending. Please also encourage other people to attend!

Hope to see you there!

Ilya Grigorik and Michael Nielsen (Organizers)


Is Google buying local bandwidth?

I’ve been using the Unix utility ping to check the speed at which I can relay information to and from various organizations. Across the sites I tried, round-trip times of 30-100 ms were common in North America, and 80-150 ms outside. I’m based in Waterloo, Canada, so that distribution is not surprising.

Google, however, consistently came in with times of 5-20 ms. So far as I know, they don’t have any local data centers, so I’m wondering how they’re doing this – and, if they’re creating local infrastructure to serve results quickly, how broadly they’re doing it.

Could readers in other cities try doing the same experiment, and put the results in comments?
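If you’d like to automate the experiment, here’s a minimal Ruby sketch. The host list is just an example, and the parsing assumes the min/avg/max summary line printed by a typical Unix ping:

# Ping each host a few times and report the average round-trip time.
hosts = %w[google.com yahoo.com microsoft.com]

hosts.each do |host|
  output = `ping -c 5 #{host}`
  # Typical summary line: "round-trip min/avg/max/stddev = 20.1/23.4/30.2/3.1 ms"
  if output =~ %r{= [\d.]+/([\d.]+)/}
    puts "#{host}: #{$1} ms average"
  else
    puts "#{host}: no reply (or unexpected ping output)"
  end
end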

Update: Well, that was quick. A helpful commenter points me to just-ping.com, which basically automates this. Looks like I’m seeing a local anomaly – presumably Google has a data center somewhere nearby.


Refactoring Prose

One of the most interesting ideas from software development is that of refactoring code.

A famous book on refactoring identifies a set of “code smells” that indicate when something is wrong with your code. The book is in part diagnostic, explaining how to identify poor code and determine what the problem is. It is also prescriptive, explaining how to fix many common problems.

I like the idea of producing a similar list of “writing smells” to help writers identify bad prose, figure out what the problem is, and fix it. Such a list would be much more useful than the laundry lists of do’s and don’ts that form the bulk of style manuals such as Strunk and White. Those laundry lists usually focus on the problem of writing clearly, for which it is possible to give relatively precise rules, not on the vaguer problem of writing in a way that is interesting. For the latter problem, the concept of writing smells and prose refactoring is, I think, just what is needed.

An example of such a list of writing smells is provided by the book “Made to Stick” by Chip and Dan Heath. Summarizing the book here doesn’t do it justice, but it does suggest the following list of prose smells:

  • Is the prose unnecessarily complex?
  • Is the prose surprising? If not, why bore your reader?
  • Are abstract points illustrated by concrete examples and stories?
  • Is the prose credible enough? Where are the most significant credibility gaps?
  • Does the prose engage the reader’s emotions at all?
  • Does the prose engage issues the reader considers important?

These are all “just” common sense, of course, but I definitely find that when I take the time to apply the Heaths’ list, my writing improves considerably.


Information Aggregators

One of the most interesting classes of tools I know of is information aggregators. These are tools that pull in information from multiple sources and consolidate it into a smaller, more easily digested number of streams.

There are many such information aggregators, and we use them all the time without thinking – things like the encyclopedia, the newspaper, the library, and so on. I’d never thought much about them until a few years ago, when I started using RSS feedreaders like Google Reader and Bloglines to keep up with my blog reading.

Up to that point, I used to read a dozen or so blogs regularly. Every day I’d do the rounds, laboriously checking each blog for updates.

RSS feedreaders changed that. They pull information from many blogs into a single stream of information, giving a huge boost to reading efficiency. Nowadays I read 200 or so blogs regularly (see my blogroll, on the right), and it takes perhaps half an hour or so a day. This is partially because I’m pretty ruthless about skimming content, and only focus on a small fraction of what is posted. But it’s also because the feedreader makes it much easier to track a large number of sources. Even the mechanics of skimming / focused reading are made easier by the feedreaders, because they simplify all the other mechanics of reading.
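To make the idea concrete, here’s a toy aggregator using Ruby’s standard rss library. The feed URLs are placeholders, and it assumes RSS feeds that supply publication dates:

require 'rss'
require 'open-uri'

# Example feeds only - substitute your own.
feeds = %w[
  http://example.com/alice/rss.xml
  http://example.com/bob/rss.xml
]

# Pull every item from every feed into one list of [date, blog, title].
items = feeds.flat_map do |url|
  feed = RSS::Parser.parse(URI.open(url).read, false) # false: skip strict validation
  feed.items.map { |item| [item.date, feed.channel.title, item.title] }
end

# Drop undated items, then print a single merged stream, newest first.
items.select { |date, _, _| date }.sort_by { |date, _, _| date }.reverse_each do |date, blog, title|
  puts "#{date}  [#{blog}]  #{title}"
end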

The more I thought about this, the more surprising I found it. Today I process several times more information from blogs than I did a few years ago, and yet it takes me about the same amount of time. A huge fraction of my former “reading” was not really reading at all; the time was going on the switching costs caused by the heterogeneity of my information sources.

Okay, that’s blogs. What about everything else? I, and, I suspect, many of my readers, spend a great deal of time dealing with heterogeneous, fine-grained sources of information – someone mentions something interesting in conversation, passes me a note, recommends a book, or whatever. I use an ad hoc personal system to integrate all this, and it’s really not a very good system. How much more can I improve my ability to deal with information? How can I better integrate all this information into my creative workflow so that I have access to information precisely when I need to know it?

Lots of people talk about information overload. But, actually, a huge fraction of the problem isn’t overload per se; it’s that (1) the problem is too open-ended, absent good tools for organizing information and integrating it into our lives; and (2) we waste a lot of our time on the switching costs associated with multiple sources of information.

This post has been pretty personal, up to now. Let’s take a few steps back, and look at things in a broader context. Over the past 15 years, we’ve seen an explosion in the number of different sources of information. Furthermore, much of that information has been pretty heterogeneous. If they’re smart, content producers don’t sit around waiting for standards committees to agree on common formats for information; they put it up on the network, and worry about standards when (if) they ever become an issue.

There’s an interesting arms race going on on the web. First, people innovate, producing new content in a multitude of new media types and formats; then people consolidate, producing aggregators which pull in the other direction, consolidating information so that it can be more easily digested, and reducing the heterogeneity in formats. (They also produce filters, and organize the information in other ways, but that’s a whole other topic!) We saw a cycle in this arms race play out first with information about things like travel, for which it was relatively straightforward to aggregate the information. Now we’re seeing more complex types of data being aggregated – stuff like financial data, with single portals like mint.com through which we can manage all our financial transactions across multiple institutions. This week’s announcement of OpenSocial by Google is a nice example – it’ll help aggregate and consolidate the data produced by social networks like LinkedIn, MySpace, and SixApart. The result of this arms race is a gradual improvement in our ability to manage large swathes of information.

A notable aspect of all this is that a lot of power and leverage ends up in the hands not of the people who originally produce the information, but of the people who aggregate it, especially if they add layers of value by organizing that information in useful ways. We can expect to see a major associated economic shift, with people moving away from being content producers and towards adding value through the aggregation and organization of information. It will be less visible, but I think this shift is in some ways as important as the 19th-century shift in which work moved off farms and into factories.

The aggregation metaphor is a useful one, and at the moment lots of really successful tools do aggregation and not much else. I think, though, that in the future the best aggregators will combine with other metaphors to help us more actively engage with information, by organizing it better, and better incorporating it into our workflow.

We’re a long way short of this at present. For example, although Bloglines has organizational tools built in, they’re not very good; still, the overall service is good enough to be worth using. Another example: some newspapers rely on syndicated content from multiple sources for their best news and do relatively little reporting of their own, yet they still offer a useful service to their communities.

Part of the problem is that we don’t yet have very powerful metaphors for organizing information, and for better integrating it into our workflow. Files and tags are awful ways of organizing information, although they are getting better in settings where collective intelligence can be brought to bear. Search is a more powerful metaphor for organizing personal information, as Gmail and Google Desktop show, but it’s still got a long way to go. We need more and better metaphors for organizing information. And workflow management is something that no-one has figured out. Why don’t my tools tell me what I need to know, when I need to know it?

With that said, there are some useful tools out there for organizing information and integrating it into our workflow. For an aggregation service like Bloglines, it’d be nice to have simple, well-designed integration with tools like Flickr and del.icio.us, which do provide useful ways of organizing content, or with Basecamp or Trac, which are oriented towards managing one’s workflow. Part of the success of Facebook and Gmail is that they integrate, in limited ways, information aggregation, organization, and workflow management.

To do this kind of integration, one can either try to do it all, like Facebook, or else try to integrate with other products. There are a lot of problems with the latter. Developers distrust APIs controlled by other companies: if del.icio.us can, with the flick of a switch, shut down a large part of your functionality, that’s a problem, and it might cause you to think about developing your own tagging system. Sounds great, except now your users have two sets of tags, and everyone loses. The only solution I know of to this problem is open standards, which bring problems of their own.

I’ve been talking about this from the point of view of typical users. To finish off, I want to switch to a different type of user: the tool-building programmer. It’s interesting how even advanced programming languages like Python and Ruby are not designed to deal with aggregating and organizing large quantities of information. Joel Spolsky has a great related quote:

A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the if statement,” he said. That’s true. Google also uses full-text-search-of-the-entire-Internet the way Microsoft uses little tables that list what error IDs correspond to which help text. Look at how Google does spell checking: it’s not based on dictionaries; it’s based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn’t.

Where are the programming languages that have Bayesian filters, PageRank, and other types of collective intelligence as a central, core part of the language? I don’t mean libraries or plugins; I mean integrated into the core of the language in the same way the if statement is. Instead, awareness of the net is glued on through things like XML libraries, REST, and so on.
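As a toy illustration of the kind of primitive I have in mind, here’s a minimal naive Bayes classifier in Ruby – a sketch only, with illustrative names, using add-one smoothing so unseen words don’t zero out the probabilities:

class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, label| h[label] = Hash.new(0) }
    @doc_counts  = Hash.new(0)
  end

  # Record one training document under the given label.
  def train(label, text)
    @doc_counts[label] += 1
    text.downcase.scan(/\w+/).each { |w| @word_counts[label][w] += 1 }
  end

  # Return the label with the highest (log) posterior probability.
  def classify(text)
    words = text.downcase.scan(/\w+/)
    total_docs = @doc_counts.values.inject(:+).to_f
    @doc_counts.keys.max_by do |label|
      counts = @word_counts[label]
      total_words = counts.values.inject(:+).to_f
      score = Math.log(@doc_counts[label] / total_docs) # prior
      words.each { |w| score += Math.log((counts[w] + 1) / (total_words + counts.size + 1)) }
      score
    end
  end
end

filter = NaiveBayes.new
filter.train(:spam, "buy cheap pills now")
filter.train(:ham,  "notes from the library board meeting")
puts filter.classify("cheap pills") # => spam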

(I can’t resist two side remarks. First, what little I know of Prolog suggests that it has something of the flavour I’m talking about. And I also have to mention that my friend Laird Breyer taught his Bayesian spam filter to play chess.)

An example which captures part (though only part) of what I’m talking about is emacs. One of the reasons emacs is such a wonderful text editor is that it isn’t really a text editor. It’s a programming language and development platform specifically geared toward the development of text editing tools, and which happens to have a very nice text editor built into it. This is one of the reasons why people like emacs – it’s possible to do everything related to text through a single programmable interface which is specifically geared towards the processing of text. What would a language and development platform for information aggregation, organization and workflow management look like?


Ruby script to count the number of lines changed in a Subversion repository

I like to set minimal targets for the amount of work I’m going to do on different projects during a week – things like word counts on papers, and so on. In that vein, here’s a short Ruby script to count the number of lines changed in a Subversion repository. It reports the number of lines added and deleted in today’s edits. The advantage of this over most word-count facilities is that it can be used both for programming and writing projects, and it copes with an entire directory structure, not just a single file.

print "Additions: "

# Diff the repository between yesterday and tomorrow (i.e. today's edits),
# keep the added lines, and count them. The grep -v drops the "+++"
# file-header lines, which aren't real additions.
puts `svn diff -r {#{(Time.now - 86400).strftime('%Y-%m-%d')}}:{#{(Time.now + 86400).strftime('%Y-%m-%d')}} | grep "^+" | grep -v "^+++" | wc -l`

print "Deletions: "

puts `svn diff -r {#{(Time.now - 86400).strftime('%Y-%m-%d')}}:{#{(Time.now + 86400).strftime('%Y-%m-%d')}} | grep "^-" | grep -v "^---" | wc -l`

It’s easily modified to run as a shell script, or to vary the dates.
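For instance, here’s a sketch of a variant that takes the number of days to look back as a command-line argument (the option handling is deliberately minimal):

days = (ARGV[0] || 1).to_i
from = (Time.now - days * 86400).strftime('%Y-%m-%d')
to   = (Time.now + 86400).strftime('%Y-%m-%d')

{ 'Additions' => '+', 'Deletions' => '-' }.each do |label, sign|
  # Keep lines starting with the diff marker, drop the +++/--- file headers.
  count = `svn diff -r {#{from}}:{#{to}} | grep "^[#{sign}]" | grep -v "^[#{sign}][#{sign}][#{sign}]" | wc -l`.strip
  puts "#{label}: #{count}"
end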


The future of libraries

Via Peter Brantley, a beautiful comment from Winston Tabb, the Librarian at The Johns Hopkins University, which provides a nice lens through which to view the challenging situation libraries find themselves in:

Data centers are the new stacks.

It’s an interesting time to be a librarian. I spent a couple of months earlier this year working in the State Library of Queensland. I met some librarians who were incredibly upbeat about current opportunities, and others who were in denial. One day, at lunch, I overheard a conversation between two librarians about the “impossibility” of putting the entire back catalogue of major newspapers online. I wonder how long it will be before Google or someone else unveils exactly such a catalogue.

Update: In comments, John Dupuis, Head of the Science Library at York University, points to a thoughtful series of essays on the opportunities and challenges now before libraries.


“The” Scientific Method

Cosma Shalizi ponders the Scientific Method and the Philosophy of Science:

Philosophy of science these days seems largely concerned with questions of method, justification and reliability — what do scientists do (and are they all doing the same thing? are they doing what they think they’re doing?), and does it work, and if so why, and what exactly does it produce? There are other issues, too, like, do scientific theories really tell us about the world, or just give us tools for making predictions (and is there a difference there?). The whole reductionism—emergence squabble falls under this discipline, too. But (so far as an outsider can judge), method is where most of the debate is these days.

Of course, most scientists proceed in serene indifference to debates in methodology, and indeed all other aspects of the philosophy of science. What Medawar wrote thirty years ago and more is still true today:

If the purpose of scientific methodology is to prescribe or expound a system of enquiry or even a code of practice for scientific behavior, then scientists seem to be able to get on very well without it. Most scientists receive no tuition in scientific method, but those who have been instructed perform no better as scientists than those who have not. Of what other branch of learning can it be said that it gives its proficients no advantage; that it need not be taught or, if taught, need not be learned?

(Actually, has anyone done a controlled study of that point?) One of the things a good methodology should do, therefore, is explain why scientists don’t have to know it.

An observation I find fascinating is that scientists employ very different norms when evaluating what it means to know something in their area of expertise versus what they know about doing science. An experimental physicist may have extremely rigorous standards for what it means to determine something experimentally, and a far more seat-of-the-pants way of evaluating claims about what it means to be a good experimental physicist. And, of course, they apply different standards again to everyday knowledge.

Cosma continues a little later:

Now of course working scientists do employ lots of different methods, which are of varying quality. The same is true of all learned professions, and it is probably also true that most professionals (lawyers, architects, doctors) pay no heed to foundational debates about what they are doing. Instead methods seem to breed within the profession — this technique is unreliable under these circumstances, that procedure works better than the old one, etc. — without, as it were, the benefit of philosophical clergy.

Feyerabend had a nice term for this – “anything goes”. I don’t think he meant this literally. Rather, he meant that method was something that scientists invented on a case-by-case basis, with formal methodology being only a heuristic guide, not gospel.


StartupCamp Waterloo

StartupCamp is on in Waterloo next Tuesday night, 6pm to 9pm, at the Waterloo Accelerator Centre, 295 Hagey Blvd., Waterloo. I went to a similar event, DemoCamp Guelph, in June, and it was terrific – lots of energy and great ideas in the room, with a nice informal atmosphere that made it easy to meet people. It’s great for anyone interested in entrepreneurship, or just in interesting new technology.

More generally, if you’re in the Greater Toronto area, and haven’t already done so, you should check out the TorCamp page, which lists all the BarCamps, DemoCamps and so on going on in the area. There’s a lot of really amazing stuff going on!


Academic Reader Database

Just a brief note for users of the Academic Reader – we lost a few hours’ worth of database updates today. It was an error on my part, and I apologize if anyone lost anything significant because of it. Ironically, the error occurred as I was updating the code so that database backups are made more frequently than once every 24 hours.
