Friday, August 24, 2007

Study: Reduced Open Source developer productivity linked to "restrictive" FLOSS licenses (where "restrictive"=GPL and "non-restrictive"=BSD)

A study by economists from Tel Aviv University and the Centre for Economic Policy Research (CEPR) entitled "Open source software: Motivation and restrictive licensing"[1] (pre-print) looks at the productivity of developers on Open Source projects and concludes:

"...that the output per contributor in open source projects is much higher when licenses are less restrictive and more commercially oriented."

and observe:
"Projects written for the Linux operating system have lower output per contributor than projects written for other operating systems..."
and:
"Output per contributor in projects oriented towards end users
(DESKTOP) is significantly lower than that in projects for developers."
They also observed that the median number of contributors in "restrictive" projects (13) was much lower than in "non-restrictive" projects (35).

They chose the 71 most active projects on SourceForge in January 2000 and studied them over an 18-month period starting in January 2002. They measured these projects every 2 months over this period, resulting in 9 samples. The metrics they used include: source lines of code (SLOC), number of contributors, the "restrictiveness" of the license (GPL = very; LGPL, Mozilla, NPL, MPL = moderate; BSD = non), operating system, age of project, whether it is a desktop or system application, language (C++ or C = 1; all others = 0), and others. To account for differences in LOC between languages, they also looked separately at just the C++ or C projects.
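
To make the variable coding above concrete, here is a minimal sketch of how the license and language variables could be encoded. This is my own illustration, not the authors' code: the function names, the dictionary and the string labels are all hypothetical, and only the groupings themselves come from the description above.

```python
# Hypothetical encoding of two of the study's variables.
# Only the groupings come from the study description; the names are assumptions.
RESTRICTIVENESS = {
    "GPL": "very",
    "LGPL": "moderate",
    "Mozilla": "moderate",
    "NPL": "moderate",
    "MPL": "moderate",
    "BSD": "non",
}


def language_dummy(language: str) -> int:
    """Return 1 for C or C++ projects, 0 for every other language."""
    return 1 if language in {"C", "C++"} else 0


def encode_project(license_name: str, language: str) -> dict:
    """Map a project's license and language onto the two coded variables."""
    return {
        "restrictiveness": RESTRICTIVENESS[license_name],
        "c_family": language_dummy(language),
    }
```

For example, `encode_project("GPL", "C++")` gives a "very" restrictive project with the C-family dummy set to 1, while a BSD project written in Perl gets "non" and 0.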

I do not understand the lag between choosing the projects (January 2000) and the start of the data sampling (January 2002). This in itself could have skewed the results, i.e. the 71 most active projects in 2000 would almost certainly NOT be the most active 2 years later. I think this may be a major flaw in this study.

I also don't think that the sample size is large enough, and I think the sampling method should have been a random selection of projects that met some reasonable criteria, like:
  • had at least C contributors
  • had at least L lines of code contributed over the last M months
  • had at least D downloads over the last M months (this penalizes very new & very unpopular projects??)
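
The selection criteria above could be sketched roughly as follows. This is only an illustration under my own assumptions: the field names (`contributors`, `recent_loc`, `recent_downloads`) are hypothetical, the thresholds stand in for C, L and D, and "recent" is assumed to already mean "over the last M months".

```python
import random


def is_eligible(project, min_contributors, min_loc, min_downloads):
    """Check one project against the three thresholds (C, L and D)."""
    return (
        project["contributors"] >= min_contributors
        and project["recent_loc"] >= min_loc
        and project["recent_downloads"] >= min_downloads
    )


def sample_projects(projects, k, min_contributors, min_loc, min_downloads, seed=0):
    """Randomly pick up to k projects from those meeting all criteria."""
    pool = [
        p for p in projects
        if is_eligible(p, min_contributors, min_loc, min_downloads)
    ]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(pool, min(k, len(pool)))
```

The point of filtering first and sampling second is that every project meeting the criteria has the same chance of being studied, instead of only the most active ones at a single point in time.
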
I also believe that they made another possible error: they observe in their discussion that the median number of LOC per project was 53K for "non-restrictive" and 60K for "restrictive". They suggest that this is not a big difference (they do not appear to verify statistically the nature of the distribution of LOC in projects by license grouping). But I would suggest that 500 lines of code contributed to a project that has 5K LOC can often be a more significant contribution than 500 LOC contributed to a 100K LOC project. They should have looked into the effect of normalizing the contributed LOC by the total LOC in the project.
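
The arithmetic behind that normalization is simple. Here is a minimal sketch; the function name is hypothetical, and dividing by total project LOC is just one possible normalization, not something taken from the study.

```python
def relative_contribution(contributed_loc: int, project_loc: int) -> float:
    """Contributed LOC as a fraction of the project's total LOC."""
    if project_loc <= 0:
        raise ValueError("project_loc must be positive")
    return contributed_loc / project_loc


# The same 500 LOC measured against the two project sizes used above.
small_project = relative_contribution(500, 5_000)    # 10% of the code base
large_project = relative_contribution(500, 100_000)  # 0.5% of the code base
```

By this measure the same 500 lines count twenty times more in the 5K LOC project than in the 100K LOC project.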

I haven't taken too much time to go over all of their experimental design, model & stats....

This study builds on an earlier study titled "The Scope of Open Source Licensing"[2] 2005, (pre-print), which is where the authors get their view of "restrictiveness" for licenses. This study found:
"Projects geared toward end-users tend to have restrictive licenses, while those oriented toward developers are less likely to do so. Projects that are designed to run on commercial operating systems and whose primary language is English are less likely to have restrictive licenses. Projects that are likely to be attractive to consumers—such as games—and software developed in a corporate setting are more likely to have restrictive licenses. Projects with unrestricted licenses attract more contributors."
This study used all 40k SourceForge projects available (2002).

[1] Fershtman, C. & N. Gandal. 2007. Open source software: Motivation and restrictive licensing. International Economics and Economic Policy. http://dx.doi.org/10.1007/s10368-007-0086-4

[2] Lerner, J. & J. Tirole. 2005. The scope of open source licensing. Journal of Law, Economics and Organization 21:20–56

8 comments:

zbrown said...

Somehow I'm not 100% sure I buy into this. Maybe it's just me, but I don't typically consider the license I'm writing for while I write the open source software that I do. The license is almost a secondary problem to me. I rarely consider it in how I write my software and generally accept whatever the lead developer (unless it's myself) chooses as a license.

But that's just me :)

Glen Newton said...

I agree with you: I am not at all sure of the result of this study. I think it is like the first couple of medical studies on a previously unstudied topic or disorder: you have to wait until a number of studies come out before getting close to the truth.

--Andy said...

This is a poorly designed study. You have already done a good job of discussing many of the problems with this protocol, but I thought I would add one more to your list.

SourceForge is undeniably one of the most important OSS repositories in the world. On the other hand, many of the biggest, most important OSS projects are hosted elsewhere.

For example, KDE has more than 13 developers and it is licensed under the "restrictive" GPL. Firefox has more than 13 developers and is under the "moderately restrictive" MPL. This list could go on and on.

These guys made the classic research faux pas. They looked at a skewed, unrepresentative sample and then announced that their results are globally relevant. They would have been much better off restricting their commentary to what they can prove about the projects they looked at.

Tad said...

Another thing to consider is that developers aren't interchangeable cogs. It seems possible to me that BSD might attract a more hard-core technical type developer, whereas the greater distribution of Linux might attract more casual programmers. This could cause the statistics to make the Linux projects look less efficient, but in reality, it argues against the study's conclusion as it might indicate that the GPL projects do a better job of attracting all sorts of people, whereas the BSD projects only appeal to a small group of developers.

Tad said...
This comment has been removed by the author.
Jose said...

Here is my slanted report. I can keep up with the best of them:

In engineering.. in life.. there is always a trade-off. This means that marketing will always have an angle no matter how messed up their intentions.

>> "...that the output per contributor in open source projects is much higher when licenses are less restrictive and more commercially oriented."

C'mon guys. First to code up 1,000,000 lines gets $100.

Ready set go!

Ah, chucks. I'm being bad.. The real reason for the increased production per BSD developer is that there is a lot more competition to come up with GPL code so each person has a harder time getting quality code in ahead of someone else. This means that each person *officially contributes* less [there being so many more people contributing].

There could even be another reason. With so much more GPL code out there than BSD code, there is a lot more for BSD people to copy from that is new to them than there is for GPL people to copy [copy via studying not copy/paste]. I'd say the GPL people are pulling everyone else. The much smaller number of BSD people (to a greater extent) are like the worker bees trying to help keep the proprietary vendors from falling too far behind by bringing to them the fruits n honey of GPL developers.

Hence the greater sloc for BSD worker bees.

Can I slant or what?

Jose said...

When you are a monopolist -- and I am not saying that the report was paid for or will be paid for by or on behalf of any monopolist -- and can push your software through to the customer -- and I am not saying that just because you can push it through that it is utter crap -- then it becomes more difficult to judge the productivity of your development team because of the missing free market feedback. Under such a scenario something generally as worthless as SLOC might pass as a respectable metric for performance -- and note that I am not saying that any monopolist out there relies on worthless SLOC even if they could. SLOC doesn't measure code quality. What it helps measure is how much code your paid employee is pushing per hour.

In the FLOSS world, the software either works or it doesn't. It doesn't matter how many eyeballs it took because many of the eyeballs are not being paid. The eyeballs frequently can afford to be extra careful and take their time because of not having an externally imposed deadline. The people that own the eyeballs, who use the product, gain directly from the quality of the product not from the lines of code it took to write it. A single fix might be all that a particular eyeball really cares about. In this world, SLOC doesn't even guide the decisions of a typical intelligent coder artificially. [Bad pun I agree.]

As stated in the earlier posting, if the less permissive licenses are more popular, it means that the average sloc will go down in all likelihood with more people competing for the set of ideas and features. Also, as stated earlier, with more code ideas being expressed first with less permissive licenses, everyone else coming after can more easily pad their sloc's. And then there is the possibility that BSD people are simply more greedy and exclusive.

Did a proprietary monopolist -- or any company -- have anything to do with a SLOC report that appears to indicate that coders that use license terms that benefit the monopolist have a bigger ... um, produce more babies.. I mean, are more fertile and prolific? I don't know. I doubt it. Could this monopolist -- or any company, even a company that produces horribly malfunctioning gaming systems that should be given away, not sold -- sponsor multiple experiments in parallel allowing only the "good" ones -- or should I say lucky ones wink wink -- to be published? I don't know, and I doubt it.

Regardless of who dunnit or why -- and it may have been a monopolist or it may not have been a monopolist -- the bottom line is that there are more reports and statistics to "prove" I am superman (as if we don't all already know that I am) than there is land in Florida.

And as for the value of SLOC.. Surely, 100 less tasty pies is a better thing than 85 tastier pies. Surely. Surely with (eg) Windows filled with so much more bloat than Linux, Windows has to be better. Surely.

Surely. Surely. Surely.

So all of you go out and code with less permissive licenses so you can all be real macho men (or lovely little fertile princesses).

OK, still don't believe me?

Can you write any more than me here now where I am writing a lt as clearlay I am a more powful wirter adn coder tan andyone else as I write a lot and you dont nannynanyboobobo.. Ha seell ##RWERWETDR i knalkdfn sf sdlf you lfklsadjf klasdlfk h e afsdlkfjaks df lkasthey are coming lkjfklsjfkldsj fdkswhoah .jskdfjslkdfj sdfj asdfk umm faskdfjaslkdfj asdfk sdf comem and get it. slkdjfalksjdflkasjdfk lasdjf sdf whay chu mean lkdjflskjfalksdjfklasdfj asdkfj hello lksjdflkasjdfkljsd testing testing kldsjfklasdjfklasdjf kj 1kj klfjaskldj fklj3kj kljdkfljds kjkt6jkjf kdjfkla sdklfj askdlfj sdkfjklsdfjklsjfa;lksjfl;kasjdfl;kasjdfasdjf
aklsdfj;lasjdf;laksdjf;lkasdf
asdkfjal;skdfjjjjjjjjjjjjfalksdjfasdjfalksdjflkasdjflkajsdflkjasdlkfjasldkfjsdlkfjalksjfsdaf

I hope I made my point because I am not getting paid for any of this.

sourceview said...

This is not astounding information. Just look at the productivity of PostgreSQL and Apache, and then count the number of GPL projects on sourceforge.net that haven't been active for five years!