SciBarCamp 2

Last year’s SciBarCamp was one of my favourite events ever – here’s a great explanation of why, from Jim Thomas. It’s on again this year, May 8-9, in Toronto. Take a look at the participant list, and sign up (space is limited)! Here’s more, from the organizers:

SciBarCamp is a gathering of scientists, artists, and technologists for a weekend of talks and discussions. The goal is to create connections between science, entrepreneurs and local businesses, and arts and culture.

In the tradition of BarCamps, otherwise known as “unconferences”, the program is decided by the participants at the beginning of the meeting, in the opening reception. SciBarCamp will require active participation; while not everybody will present or lead a discussion, everybody will be expected to contribute substantially – this will help make it a really creative event.

Our venue, Hart House, is a congenial space with plenty of informal areas to work or talk. The space, which made such a wonderful venue for last year’s SciBarCamp, is being made available through a collaboration with Science Rendezvous.

Biweekly links for 04/20/2009

  • singletasking: Caterina Fake
  • Massively Multiplayer Online Game service granted banking license
    • “MMO operator MindArk has been granted a banking license for its virtual world Entropia Universe, by the Swedish Financial Supervisory Authority.

      MindArk says the move will allow it to act as a central bank for all variations of Entropia Universe and integrate the in-game economies with the real world.

      “This is an exciting and important development for the future of all virtual worlds being built using the Entropia Platform,” commented MindArk CEO, Jan Welter Timkrans.

      “Together with our partner planet owner companies we will be in a position to offer real bank services to the inhabitants of our virtual universe.”

      Entropia Universe acts as a platform from which partners can launch virtual worlds within, with the focus being on microtransactions and virtual currency monetisation.”

  • Luis von Blog

Click here for all of my del.icio.us bookmarks.

Biweekly links for 04/17/2009

  • Pooling of Unshared Information in Group Decision Making: Biased Information Sampling During Discussion
    • “Decision-making groups can potentially benefit from pooling members’ information, particularly when members individually have partial and biased information but collectively can compose an unbiased characterization of the decision alternatives. The proposed biased sampling model of group discussion, however, suggests that group members often fail to effectively pool their information because discussion tends to be dominated by (a) information that members hold in common before discussion and (b) information that supports members’ existent preferences. In a political caucus simulation, group members individually read candidate descriptions that contained partial information biased against the most favorable candidate and then discussed the candidates as a group. Even though groups could have produced unbiased composites of the candidates through discussion, they decided in favor of the candidate initially preferred by a plurality rather than the most favorable candidate…”
  • SciBarCamp Toronto 2
    • SciBarCamp Toronto 2 is happening May 8-9, Hart House, Toronto. See the Participant page to register!
  • Killer Bean Forever
    • Feature-length movie animated entirely by one person, Jeff Lew (of The Matrix Reloaded). It will be released on DVD in July (US and Canada).
  • arXiview: A New iPhone App for the arXiv
    • Browse the preprint arXiv from your iPhone.
  • A Comparison of Approaches to Large-Scale Data Analysis
    • “There is currently considerable enthusiasm around the MapReduce (MR) paradigm… Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper… we evaluate both kinds of systems in terms of performance and development complexity… we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes… Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures” (a toy sketch contrasting the two approaches follows this list)
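
To make the paper’s contrast concrete, here is a toy sketch of the two programming styles it compares. This is my own illustration, not code from the benchmark: the MapReduce half is plain Python standing in for a Hadoop job, and the SQL string shows how a parallel DBMS would express the same aggregation declaratively.

    from collections import defaultdict

    # Toy input: (url, visits) records, the kind of log both systems might process.
    log = [("a.com", 1), ("b.com", 1), ("a.com", 1)]

    # "Map" phase: emit a key/value pair for each record.
    mapped = [(url, visits) for url, visits in log]

    # "Shuffle" groups the pairs by key; "reduce" then aggregates each group.
    groups = defaultdict(list)
    for url, visits in mapped:
        groups[url].append(visits)
    totals = {url: sum(v) for url, v in groups.items()}
    print(totals)  # {'a.com': 2, 'b.com': 1}

    # The equivalent declarative query a parallel DBMS would plan, partition,
    # and execute across its nodes:
    query = "SELECT url, SUM(visits) FROM log GROUP BY url;"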

Click here for all of my del.icio.us bookmarks.

Biweekly links for 04/06/2009

Click here for all of my del.icio.us bookmarks.

Biweekly links for 04/03/2009

  • Amazon Elastic MapReduce
    • “Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).”
  • Data produced, analyzed and consumed. The impact of big science : business|bytes|genes|molecules
    • “The fact remains that today we are moving towards a clear separation between data producers, data consumers and methods developers. There was a time that a small group of people could cover all that ground, but with the industrialization of data production (microarrays are already there, mass specs and sequencers not quite yet), traditional roles, even in an academic setting are not efficient.”
  • Adding Noughts in Vain
    • Andrew Doherty’s wonderful blog about politics, climate, New Zealand, and whatever else strikes his fancy.
  • Mathemata: the blog of Francois Dorais
  • Noam Chomsky on Post-Modernism
    • “There are lots of things I don’t understand — say, the latest debates over whether neutrinos have mass or the way that Fermat’s last theorem was … proven … But from 50 years in this game, I have learned two things: (1) I can ask friends who work in these areas to explain it to me at a level that I can understand, and they can do so…; (2) if I’m interested, I can proceed to learn more so that I will come to understand it. Now Derrida, Lacan, Lyotard, Kristeva, etc. — even Foucault, whom I knew and liked, and who was somewhat different from the rest — write things that I also don’t understand, but (1) and (2) don’t hold: no one who says they do understand can explain it to me and I haven’t a clue as to how to proceed to overcome my failures. That leaves one of two possibilities: (a) some new advance in intellectual life has been made… which has created a form of “theory” that is beyond quantum theory, topology, etc., in depth and profundity; or (b) … I won’t spell it out.”
  • Caveat Lector » Blog preservation
    • “I suggest mildly that this [blog preservation] would be a fantastic problem to tackle for an academic library looking to make a name for itself. If you can’t make the argument for a general blog-preservation program (and that’s hard, because libraries are so inward-looking at times of crisis), dig up the ten or fifteen best blogs published by people at your institution and make an argument about those. Then release the code you write to the rest of us who want to do this!”
  • Preservation for scholarly blogs – Gavin Baker
    • How will we preserve scholarly blogs for the future?
  • A Blog Around The Clock : Defining the Journalism vs. Blogging Debate, with a Science Reporting angle
    • Thoughtful and thought-provoking.
  • Anarchism Triumphant: Free Software and the Death of Copyright (Eben Moglen)
  • Western internet censorship: The beginning of the end or the end of the beginning? – Wikileaks

Click here for all of my del.icio.us bookmarks.

Conscious modularity and scaling open collaboration

I’ve recently been reviewing the history of open source software, and one thing I’ve been struck by is the enormous effort many open source projects put into making their development modular. They do this so work can be divided up, making it easier to scale the collaboration, and so get the benefits of diverse expertise and more aggregate effort.

I’m struck by this because I’ve sometimes heard sceptics of open science assert that software has a natural modularity which makes it easy to scale open source software projects, but that difficult science problems often have less natural modularity, and this makes it unlikely that open science will scale.

It looks to me like what’s really going on is that the open sourcers have adopted a posture of conscious modularity. They’re certainly not relying on any sort of natural modularity, but are instead working hard to achieve and preserve a modular structure. Here are three striking examples:

  • The open source Apache webserver software was originally a fork of a public domain webserver developed by the US National Center for Supercomputing Applications (NCSA). The NCSA project was largely abandoned in 1994, and the group that became Apache took over. It quickly became apparent that the old code base was far too monolithic for a distributed effort, and the code base was completely redesigned and overhauled to make it modular.
  • In September 1998 and June 2002, crises arose in Linux development because of community unhappiness at the slow rate at which new code contributions were being accepted into the kernel. In some cases contributions from major contributors were being ignored completely. The problem in both 1998 and 2002 was that an overloaded Linus Torvalds was becoming a single point of failure. The situation was famously summed up in 1998 by Linux developer Larry McVoy, who said simply “Linus doesn’t scale”, a phrase repeated in a 2002 call-to-arms by Linux developer Rob Landley. The resolution in both cases was a major re-organization of the project that allowed tasks formerly managed by Torvalds to be split up among the Linux community. In 2002, for instance, Linux switched to an entirely new way of managing code, using a package called BitKeeper, designed in part to make modular development easier.
  • One of the Mozilla projects is an issue tracking system (Bugzilla), designed to make modular development easy, and which Mozilla uses to organize development of the Firefox web browser. Developing Bugzilla is a considerable overhead for Mozilla, but it’s worth it to keep development modular.

The right lesson to learn from open source software, I think, is that it may be darned hard to achieve modularity in software development, but it can be worth it to reap the benefits of large-scale collaboration. Some parts of science may not be “naturally” modular, but that doesn’t mean they can’t be made modular with conscious effort on the part of scientists. It’s a problem to be solved, not to give up on.

First Principles

How would you use 100 million dollars if someone asked you to set up and run an Institute for Theoretical Physics?  My friend Howard Burton has written a memoir of his 8 years as the founding Executive Director of the Perimeter Institute, taking it from conception to being one of the world’s best known institutes for theoretical physics. I’ve heard many people theorize about how a scientific institution ideally should be organized (“consider a spherical physicist…”), and I’ve contributed more than a few thoughts of my own to such discussions. What I really liked about this book, and what gives it a unique perspective, is that it’s from someone who was actually in the hot seat, from the get-go.

Biweekly links for 03/30/2009

Click here for all of my del.icio.us bookmarks.

On scaling up the Polymath project

Tim Gowers has an interesting post on the problem of scaling up the Polymath project to involve more contributors. Here are a few comments on the start of Tim’s post. I’ll return to the remainder of the post tomorrow:

“As I have already commented, the outcome of the Polymath experiment differed in one important respect from what I had envisaged: though it was larger than most mathematical collaborations, it could not really be described as massive. However, I haven’t given up all hope of a much larger collaboration, and in this post I want to think about ways that that might be achieved.”

As discussed in my earlier post, I think part of the reason for the limited size was the short time-frame of the project. The history of open source software suggests that building a large community usually takes considerably more time than Polymath had available – Polymath’s community of contributors likely grew faster than the communities of open projects like Linux and Wikipedia. In that sense, Polymath’s limited scale may have been in part a consequence of its own rapid success.

With that said, it’s not clear that the Polymath community could have scaled up much further, even had it taken much longer for the problem to be solved, without significant changes to the collaborative design. The trouble with scaling conversation is that as the number of people participating goes up, the effort required to track the conversation also goes up. The result is that beyond a certain point, participants are no longer working on the problem at hand, but instead simply trying to follow the conversation (cf. Brooks’ law). My guess is that Polymath was near that limit, and, crucially, was beyond that limit for some people who would otherwise have liked to be involved. The only way to avoid this problem is to introduce new social and technical means for structuring the conversation, limiting the amount of attention participants need to pay to each other, and so increasing the scale at which conversation can take place. The trick is to do this without simultaneously destroying the effectiveness of the medium as a means of problem-solving.

(As an aside, it’s interesting to think about what properties of a technological platform make it easy to rapidly assemble and grow communities. I’ve noticed, for example, that the communities in FriendFeed rooms can grow incredibly rapidly, under the right circumstances, and this growth seems to be a result of some very particular and clever features of the way information is propagated in FriendFeed. But that’s a discussion for another day.)

“First, let me say what I think is the main rather general reason for the failure of Polymath1 to be genuinely massive. I had hoped that it would be possible for many people to make small contributions, but what I had not properly thought through was the fact that even to make a small contribution one must understand the big picture. Or so it seems: that is a question I would like to consider here.

“One thing that is undeniable is that it was necessary to have a good grasp of the big picture to contribute to Polymath1. But was that an essential aspect of any large mathematical collaboration, or was it just a consequence of the particular way that Polymath1 was organized? To make this question more precise, I would like to make a comparison with the production of open-source software (which was of course one of the inspirations for the Polymath idea). There, it seems, it is possible to have a large-scale collaboration in which many of the collaborators work on very small bits of code that get absorbed into a much larger piece of software. Now it has often struck me that producing an elaborate mathematical proof is rather like producing a complex piece of software (not that I have any experience of the latter): in both cases there is a clearly defined goal (in one case, to prove a theorem, and in the other, to produce a program that will perform a certain task); in both cases this is achieved by means of a sequence of strings written in a formal language (steps of the proof, or lines of code) that have to obey certain rules; in both cases the main product often splits into smaller parts (lemmas, subroutines) that can be treated as black boxes, and so on.

“This makes me want to ask what it is that the producers of open software do that we did not manage to do.”

Here are two immediate thoughts inspired by that question, both of which concern ways large open-source projects (a) reduce barriers to entry, and (b) limit the amount of attention required from potential contributors.

Clear separation of what is known from how it is known: In some sense, to get involved in an open source project, all you need do is understand the current source code. (In many projects, the code is modular, which means you may only need to understand a small part of the code.) You don’t need to understand all the previous versions of the code, or read through all the previous discussion that led to those versions. By contrast, it was, I think, somewhat difficult to follow the Polymath project without also following a considerable fraction of the discussion along the way.

Bugtracking: One of the common answers to the question “How can I get involved in open source?” is “Submit a bug report to your favourite open source project’s bugtracking system”. The next step up the ladder is: “Fix a bug listed in the bugtracking system”. Bugtracking systems are a great way of providing an entry point for new contributors, because they narrow the scope of problems down, limiting what a new contributor needs to learn, and how many other contributors they need to pay attention to. Of course, many bugs will be beyond a beginning contributor’s ability to fix. But it’s easy to browse through the bug database to find something within your ability to solve. While I don’t think bugtracking is quite the right model for doing mathematics, it’s possible that a similar system for managing problems of limited scope may help in projects like Polymath.
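
To make that last suggestion a little more concrete, here is a purely hypothetical sketch of what an entry in such a “problem tracker” might record. The structure and field names are my own invention, not an existing tool; the point is only that each item should be self-contained enough that a newcomer can pick it up without following the whole discussion.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class OpenProblem:
        title: str
        statement: str                      # self-contained statement of the sub-problem
        prerequisites: List[str] = field(default_factory=list)  # lemmas/definitions assumed
        status: str = "open"                # "open", "claimed", or "resolved"
        notes: List[str] = field(default_factory=list)          # short progress updates

    # A couple of invented entries, standing in for narrowly scoped tasks
    # a Polymath-style project might post.
    tracker = [
        OpenProblem(
            title="Verify the k=3 base case",
            statement="Check the claimed density bound directly for k=3.",
            prerequisites=["Definition of the density increment"],
        ),
        OpenProblem(
            title="Tidy the write-up of Lemma 2",
            statement="Rewrite the proof of Lemma 2 so that it stands alone.",
            status="claimed",
        ),
    ]

    # A newcomer browses for something open and narrow in scope:
    for p in tracker:
        if p.status == "open" and len(p.prerequisites) <= 1:
            print(p.title)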

More tomorrow.