Information Aggregators

One of the most interesting classes of tools that I know of is information aggregators. These are tools which pull in information from multiple sources, and consolidate it into a smaller and more easily digested number of streams.

There are many such information aggregators, and we use them all the time without thinking – things like the encyclopedia, the newspaper, the library, and so on. I’d never thought much about them until a few years ago, when I started using RSS feedreaders like Google Reader and Bloglines to keep up with my blog reading.

Up to that point, I used to read a dozen or so blogs regularly. Every day I’d do the rounds, laboriously checking each blog for updates.

RSS feedreaders changed that. They pull information from many blogs into a single stream, giving a huge boost to reading efficiency. Nowadays I read 200 or so blogs regularly (see my blogroll, on the right), and it takes perhaps half an hour a day. This is partially because I’m pretty ruthless about skimming content, and only focus on a small fraction of what is posted. But it’s also because the feedreader makes it much easier to track a large number of sources. Even switching between skimming and focused reading is easier, because the feedreader takes care of all the other mechanics of reading.
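
To make the consolidation step concrete, here’s a minimal sketch of what such a tool does under the hood: fetch several heterogeneous feeds and merge them into one chronologically sorted stream. This isn’t the implementation of any particular feedreader; the feed URLs are placeholders, and it assumes the third-party Python feedparser library.

```python
# Sketch only: merge several feeds into one newest-first stream, which is
# essentially the consolidation step a feedreader performs. Assumes the
# third-party "feedparser" library; the URLs below are placeholders.
import time
import feedparser

FEEDS = [
    "http://example.com/blog-a/rss.xml",
    "http://example.com/blog-b/atom.xml",
]

def aggregate(feed_urls):
    entries = []
    for url in feed_urls:
        parsed = feedparser.parse(url)  # handles RSS and Atom alike
        for entry in parsed.entries:
            entries.append({
                "source": parsed.feed.get("title", url),
                "title": entry.get("title", "(untitled)"),
                "link": entry.get("link", ""),
                # fall back to the epoch if a feed omits timestamps
                "published": entry.get("published_parsed") or time.gmtime(0),
            })
    # One consolidated stream, newest first.
    return sorted(entries, key=lambda e: e["published"], reverse=True)

for item in aggregate(FEEDS)[:20]:
    print(time.strftime("%Y-%m-%d", item["published"]), "|", item["source"], "|", item["title"])
```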

The more I thought about this, the more surprising I found it. Today I process several times more information from blogs than I did a few years ago, and yet it takes me about the same amount of time. A huge fraction of my former “reading” was not really reading at all. Rather, that time was going to the switching costs caused by the heterogeneity of the information sources I was using.

Okay, that’s blogs. What about everything else? I, and, I suspect, many of my readers, spend a great deal of time dealing with heterogeneous, fine-grained sources of information – someone mentions something interesting in conversation, passes me a note, recommends a book, or whatever. I use an ad hoc personal system to integrate all this, and it’s really not a very good system. How much more can I improve my ability to deal with information? How can I better integrate all this information into my creative workflow so that I have access to information precisely when I need to know it?

Lots of people talk about information overload. But, actually, a huge fraction of the problem isn’t overload per se. It’s that (1) the problem is too open-ended, absent good tools for organizing information and integrating it into our lives; and (2) we waste a lot of our time dealing with the switching costs associated with multiple sources of information.

This post has been pretty personal, up to now. Let’s take a few steps back, and look at things in a broader context. Over the past 15 years, we’ve seen an explosion in the number of different sources of information. Furthermore, much of that information has been pretty heterogeneous. If they’re smart, content producers don’t sit around waiting for standards committees to agree on common formats for information; they put it up on the network, and worry about standards when (if) they ever become an issue.

There’s an interesting arms race playing out on the web. First, people innovate, producing new content in a multitude of new media types and formats; then people consolidate, producing aggregators which pull in the other direction, consolidating information so that it can be more easily digested, and reducing the heterogeneity in formats. (They also produce filters, and organize the information in other ways, but that’s a whole other topic!) We saw a cycle in this arms race play out first with information about things like travel, for which it was relatively straightforward to aggregate the information. Now we’re seeing more complex types of data being aggregated – stuff like financial data, with single portals like mint.com through which we can manage all our financial transactions, across multiple institutions. This week’s announcement of OpenSocial by Google is a nice example – it’ll help aggregate and consolidate the data produced by social networks like LinkedIn, MySpace, and SixApart. The result of this arms race is gradually improving our ability to manage large swathes of information.

A notable aspect of all this is that a lot of power and leverage ends up not in the hands of the people who originally produce the information, but in the hands of the people who aggregate it, especially if they add additional layers of value by organizing that information in useful ways. We can expect to see a major associated economic shift, as people move away from being content producers towards adding value through the aggregation and organization of information. It will be less visible, but I think this shift is in some ways as important as the 19th-century shift that moved work off farms and into factories.

The aggregation metaphor is a useful one, and at the moment lots of really successful tools do aggregation and not much else. I think, though, that in the future the best aggregators will combine with other metaphors to help us more actively engage with information, by organizing it better, and better incorporating it into our workflow.

We’re a long way short of this at present. For example, although Bloglines has organizational tools built in, they’re not very good. Still, the overall service is good enough to be worth using. Another example: some newspapers rely on syndicated content from multiple sources for their best news, and do relatively little reporting of their own; they still offer a useful service to their communities.

Part of the problem is that we don’t yet have very powerful metaphors for organizing information, and for better integrating it into our workflow. Files and tags are awful ways of organizing information, although they are getting better in settings where collective intelligence can be brought to bear. Search is a more powerful metaphor for organizing personal information, as Gmail and Google Desktop show, but it’s still got a long way to go. We need more and better metaphors for organizing information. And workflow management is something that no one has figured out. Why don’t my tools tell me what I need to know, when I need to know it?

With that said, there are some useful tools out there for organizing information and integrating it into our workflow. For an aggregation service like Bloglines it’d be nice to have simple, well-designed integration with tools like Flickr and del.icio.us, which do provide useful ways of organizing content, or with Basecamp or Trac, which are oriented towards managing one’s workflow. Part of the success of Facebook and Gmail is that they integrate, in limited ways, information aggregation, organization, and workflow management.

To do this kind of integration one can either try to do it all, like Facebook, or else try to integrate with other products. There are a lot of problems with this kind of integration. Developers distrust APIs controlled by other companies. If del.icio.us can, with the flick of a switch, shut down a large part of your functionality, that’s a problem, and it might cause you to think about developing your own tagging system. That sounds great, except now your users have two sets of tags, and everyone loses. The only solution I know of to this problem is open standards, which bring problems of their own.

I’ve been talking about this from the point of view of typical users. To finish off, I want to switch to a different type of user, the tool-building programmer. It’s interesting how even advanced programming languages like Python and Ruby are not designed to deal with aggregating and organizing large quantities of information. Joel Spolsky has a great related quote:

A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the if statement,” he said. That’s true. Google also uses full-text-search-of-the-entire-Internet the way Microsoft uses little tables that list what error IDs correspond to which help text. Look at how Google does spell checking: it’s not based on dictionaries; it’s based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn’t.
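
The spell-checking example in that quote is worth making concrete. Here’s a toy sketch of correction driven by word-usage statistics rather than a fixed dictionary – a deliberate simplification close in spirit to Peter Norvig’s well-known demo, and certainly not Google’s actual method. The corpus file name is a placeholder for any large body of text.

```python
# Toy spelling corrector driven by word-usage statistics, not a dictionary.
# "corpus.txt" is a placeholder for any large body of text.
import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Prefer the word itself if known, else the most frequent nearby word."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])
```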

Where are the programming languages that have Bayesian filters, PageRank, and other types of collective intelligence as a central, core part of the language? I don’t mean libraries or plugins, I mean integrated into the core of the language in the same way the if statement is. Instead, awareness of the net is glued on through things like XML libraries, REST, and so on.
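
For contrast, here’s roughly what “Bayesian filtering” looks like in its glued-on form, as an ordinary library rather than a language construct – a minimal, hypothetical naive Bayes classifier with Laplace smoothing, not any production spam filter.

```python
# Hypothetical illustration only: a tiny naive Bayes classifier written as a
# plain library class, i.e. the glued-on form the paragraph above laments.
import math
from collections import Counter, defaultdict

class BayesFilter:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()               # label -> number of examples

    def train(self, label, words):
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)

    def classify(self, words):
        total_docs = sum(self.doc_counts.values())
        vocab = {w for counts in self.word_counts.values() for w in counts}
        best_label, best_score = None, float("-inf")
        for label, docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(docs / total_docs)  # log prior
            for w in words:
                # Laplace-smoothed log likelihood of each word given the label
                score += math.log((self.word_counts[label][w] + 1) /
                                  (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

f = BayesFilter()
f.train("spam", ["cheap", "pills", "limited", "offer"])
f.train("ham", ["meeting", "notes", "attached", "offer"])
print(f.classify(["cheap", "offer"]))  # -> "spam"
```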

(I can’t resist two side remarks. First, what little I know of Prolog suggests that it has something of the flavour I’m talking about. And I also have to mention that my friend Laird Breyer taught his Bayesian spam filter to play chess.)

An example which captures part (though only part) of what I’m talking about is emacs. One of the reasons emacs is such a wonderful text editor is that it isn’t really a text editor. It’s a programming language and development platform specifically geared toward the development of text editing tools, and which happens to have a very nice text editor built into it. This is one of the reasons why people like emacs – it’s possible to do everything related to text through a single programmable interface which is specifically geared towards the processing of text. What would a language and development platform for information aggregation, organization and workflow management look like?

10 comments

  1. This is slightly off-topic (but there are really three posts in this one anyway). One major thing that aggregators are missing is the ability to keep track of comments per post (ideally threaded, although that is also a major limitation of blog software). For instance, it’s possible to get an RSS feed for all the comments on your blog, but it’s impossible to have the aggregator present them nicely to me together with the blog post, and let me know when new ones are posted for the entries I care about. There is already a standard developing for threading for Atom (see RFC 4685), but until common aggregators support it, it’s not going to get going.

    For me personally (as I suspect for many other people), it’s the single greatest obstacle for following conversations in the comments on most blogs. It’s also coincidentally one of the things that NNTP readers did spectacularly well back when USENET was the main game in town.

  2. Hi Mohan,

    I’ll be interested to see how Twine goes when it is released. There’s certainly been a fair bit of hype about Semantic Web stuff, but it always seems to have languished.

    And thanks for the link to the Tech Review!

  3. DM,

    Integrating comments in a more friction-free way would be terrific, especially if it became possible to contribute content (i.e., post comments, not just read) from the aggregators. I gather that there are efforts in this direction, but the main players still seem stuck with kludges.

    (More generally, the ability to post content is a very useful add-on for any aggregator.)

  4. Off topic to your content a bit, but it might help people who come across this page looking for something else. Information Aggregators are useful in research driven projects as well, the Census being one of the best examples.

    It would be interesting to have a Census like search engine that pulls information from the web for research based projects where quantitative data is needed.

Comments are closed.