directed procrastination: April 2013

First, this is not a post about economists or about how we need more stimulus spending or something like that. This is actually about software, primarily. So, be not afraid fiscal conservatives. If you are interested in the economics, Krugman's articles and blog entries make up a pretty thorough analysis of the whole ordeal.

This has been irking me for the last week or so. From Krugman's article in the NY Times:

Finally, Ms. Reinhart and Mr. Rogoff allowed researchers at the University of Massachusetts to look at their original spreadsheet — and the mystery of the irreproducible results was solved. First, they omitted some data; second, they used unusual and highly questionable statistical procedures; and finally, yes, they made an Excel coding error. Correct these oddities and errors, and you get what other researchers have found: some correlation between high debt and slow growth, with no indication of which is causing which, but no sign at all of that 90 percent “threshold.”

After reading about this a bit, as a person that performs publicly funded research for a living (at least right now) I find myself with four questions. First...

Why are you using Excel for important, presumably mathematically involved, statistical work?

Now this is a minor issue, more of just a curiosity really and a particular pet peeve of mine: Excel is a spreadsheet program. It is meant for things like making invoices, calculating grades, and balancing your check book, and even for those purposes it is not the best tool for the job. Each of those tasks have specialized programs written expressly for them. Excel is a "jack of all trades" sort of program, which is good (I tend to like those types of programs), but if your profession is to run statistics on data, is that the tool you should use? We have tools that are designed specifically for statistical analysis. We even have tools that look basically exactly like Excel but have a bunch of statistical tools built in as well.

Yes, Excel can be used for more complicated things. I knew a guy who used Excel to perform a discrete element method calculation of the heat flow throughout the complex geometry of an internal combustion engine. It worked, but it is not what Excel is meant to do. I'll even go so far is say that you should not use Excel to solve PDEs in complicated geometries; just a blanket no, don't do it. Excel is certainly Turing complete, that doesn't mean that you should use it for everything. I can use LaTeX to compute as well (TeX is also Turing complete), but I would never use it for something that wasn't document formatting.

Should you use tools for things that they are not intended to be used for? Sure, if the tool is well suited for that task. Sometimes tools are well designed and they can be used for tasks that the designers never intended them for, sometimes wildly outside the target use case. This usually falls under what we would call a hack. Should you use hacks routinely in your professional code? Probably not.

Perhaps I am off base. I'm not a statistician or an economist, perhaps the mathematics/calculations involved in economics is so brain-dead simple that Excel or a pocket calculator is the perfect tool for the job.

Why is it that the simple coding error resonates so well with the public whereas the general bad statistics falls flat?

The answer (which Krugman draws attention to elsewhere) is actually pretty obvious, one is embarrassing because people understand it and have probably made a similar mistake themselves while the other is considered complicated. Thus, people tend to give a pass on the latter. There is also the fact that the coding error resonates better with the media. It is a much easier story to tell, and media outlets routinely talk to the lowest common denominator (how else are you going to make money on the news?).

My understanding is that the paper did some pretty sketchy things regarding selecting what data to include and what to exclude in their analysis. There was also some pretty bad logic involved as well: the paper was published based on a correlation found in some data but never made a causality argument. I think that this is actually pretty common in social sciences, but data mining for correlations and then never attempting to figure out the reason behind them is a pretty shoddy research method.

Of all of the errors that seem to have gone into this research, the error in the spreadsheet seems to be the least offensive. That is until you consider my next question.

Why did this take so long to uncover?

The answer to this is almost certainly the fact that the code that was used for these calculations was not made available to the public. This was a contentious result and others have attempted and failed to reproduce the results. This is exactly the time when having source available to others would have helped this dispute get resolved much faster.

I know that there are people that don't share my opinion that Free/Libre Software is (to first approximation) always to be preferred over proprietary alternatives, but there are places where it is wholly inappropriate to use proprietary software. Perhaps the most important place for Free/Libre Software is the source code and interpreters/compilers for that source code that is used in public research. Note that I'm not talking about Open Source development; we are talking about the freedom to inspect the code, use it in your own research, and distribute it to others, not a development model, though that might be a good fit for some projects.

In my opinion, and hopefully more and more researchers share this opinion, this research is flawed largely because it is not available for inspection. It is not available for inspection for two reasons: 1) Excel, the interpreter, is not Free Software nor gratis, and 2) the Excel document that they performed the calculation in was not made immediately available. Though, to their credit, it was made available to another researcher upon request and Excel is widely used and has several Free Software spreadsheet programs that very likely would run these files. Think of how much better things would have been if the files were posted where anybody could get at them on a whim rather than by request. Anybody could execute them without buying a software license (due to gratis software) and understand what is happening in the software (freedom 1) and have the right to pass on altered versions to other researchers (freedom 2 and 3).

In addition, the proprietary nature of Excel causes concern as well as there may be internal errors in Excel that invalidate user programs that are correct. I should point out that errors within Excel itself become more and more likely as you start using Excel for things that it really wasn't made for, like advanced statistical analysis/modelling.

Thus, I think that people performing public research should publish any associated code to the public. And yes, I know that this is a scary prospect. I recently found an error in some code that I had written for a paper after it had been submitted to a journal. I had neglected to initialize a floating point variable in my program. Luckily, it did not effect the end result, and I really appreciate the luck in being able to fix this before someone else found it. If I am honest to myself, I am grateful that I didn't get caught making a huge mistake; I am grateful that I avoided shame, and that I got to analyze the situation and (if it had been necessary) get a head start on damage control.

But scary as it is, let's face it, the right thing to do for the advancement of research is to have open access of source code to other researchers and have that source code licensed in such a way that others can use it in their work. After all, what is worse, your work misleading people for years before it goes down in flames, or someone pointing it out early before anything bad happens? More eyes mean errors get caught faster, even if some of those eyes are out to completely discredit your work; perhaps even more so when this is the case.

Do we need a "Free/Libre Research" Movement?

When pondering this last question, I initially thought that it might be useful to define a sort of "Free/Libre Research" movement, where publications themselves and all source code and data associated with it are made available to the public (or at least the portion of the public that funded the research) at approximately the cost required to package and deliver it if not free and can be freely reused in derivative work. There are hints of this happening, particularly in the field of Computer Science. After some thought, I realized that such a movement shouldn't actually be necessary.

Defining such a movement is akin to saying that we want to do scientific research, i.e. we should be doing it this way already. These conditions I've described in a so called "Free/Libre Research" movement clearly falls under the very definition of scientific research. It is all part of reproducibility (i.e. have other researchers attempt to reproduce your results) and falsifiability (allowing other researchers to potentially challenge your hypothesis). And, while it is not absolutely necessary to provide source to people in order to maintain reproducibility and falsifiability, shouldn't we be actively trying to make the scientific process work better rather than hindering it?

directed procrastination

Friday, April 26, 2013

Rogoff, Reinhart, and Research: Four Questions

Wednesday, April 10, 2013

Programming Competition News