Monday, December 20, 2010

Research Data and Metadata at Risk: Degradation over Time

Research data and metadata - usually derived from an experiment or series of experiments - are the researcher's central focus during the experiment and the subsequent interpretation, paper writing, and publishing. But once the researcher has moved on to their next effort, this data is very much at risk. Often (the rule rather than the exception), the data is not properly managed or archived, and resides as a single copy on the researcher's desktop or perhaps a research server.

In addition, the metadata is minimal or non-existent, and where it does exist it is interpretable only by the researcher and their colleagues or students. Over time, the chance that this data will be lost, or that useful knowledge about it will be forgotten by the researcher, increases, and the information content of the data and metadata rapidly decreases. Some events can seriously accelerate this decrease: data loss (media failure, computer replacement, other serious accidents or failures, etc.); career changes and retirement; and the death of the researcher.

This common scenario was first described in published form (to my knowledge) in 1997, in a paper entitled Nongeospatial Metadata for the Ecological Sciences (citation below). It included a very expressive diagram, which I have re-created for another paper; you can see it below:





A higher resolution jpeg image can be found here.
A high resolution PDF can be found here.

The diagram was created using LaTeX and TikZ. The source files can be found at the github repo.
From: de la Sablonnière, Auger, Sabourin and Newton. 2012. Facilitating Data Sharing in the Behavioral Sciences. Data Science Journal, Volume 11, 23 March 2012.
After: Michener, W., J. Brunt, J. Helly, T. Kirchner and S. Stafford. 1997. Nongeospatial Metadata for the Ecological Sciences. Ecological Applications 7(1):330-342. DOI: 10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2
The Michener paper is excellent and was very much ahead of its time. Its many insights generalize to other research domains.
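For anyone who wants to play with the idea without the LaTeX/TikZ toolchain, here is a rough Python/matplotlib sketch of the same general shape. It is my own simplified approximation, not the published figure: the decay rate, event positions and step sizes are invented, and only the event labels come from the text above.

```python
# A rough, simplified approximation of the degradation curve described above --
# NOT the published Michener/TikZ figure (see the github repo for the real source).
# The decay rate, event years, and step sizes are made up purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

years = np.linspace(0, 20, 500)
content = np.exp(-0.15 * years)          # assumed gradual loss of information content

# Events that accelerate the loss (from the post above); placements are arbitrary.
events = {
    "Publication": 2,
    "Accident / data loss": 7,
    "Retirement or career change": 13,
    "Death of researcher": 18,
}
for year in events.values():
    content[years >= year] *= 0.7        # each event knocks the curve down a step

fig, ax = plt.subplots()
ax.plot(years, content)
for label, year in events.items():
    ax.axvline(year, linestyle="--", linewidth=0.8)
    ax.text(year, 1.0, " " + label, rotation=90, va="top", fontsize=8)
ax.set_xlabel("Time since the experiment (years)")
ax.set_ylabel("Information content of data and metadata")
ax.set_title("Degradation of research data and metadata over time (sketch)")
plt.show()
```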

Wednesday, November 17, 2010

Canadian "Open Data" Cities: 'No stars' in Tim Berners-Lee Five Star Rating for Open Government Data

At the International Open Government Data Conference (IOGDC), Tim Berners-Lee "...reiterated his five star system" for open government data:
  • 1 Star for putting data on the Web at all, with an open license.
  • 2 Stars if it's machine-readable.
  • 3 Stars for machine-readable, non-proprietary formats.
  • 4 Stars if the data is converted into open linked data standards like RDF or SPARQL.
  • 5 Stars when people have gone through the trouble of linking it."
From: http://gov20.govfresh.com/open-data-accountability-citizen-utility-and-economic-opportunity/
Original TB-L ref: http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/

So according to this metric from TB-L, the Canadian cities (Ottawa, Vancouver, Toronto, Edmonton, etc.) that have recently released data under licenses that are not open get ZERO stars!!!

Friday, September 17, 2010

Not 'in spite of' but 'because of': agile: Multics--The first seven years

In the Multics--The first seven years paper (1972 Spring Joint Computer Conference), Corbató et al. make the interesting statement:
"In spite of [bold added] the unexpected design iteration phase, the Multics system became sufficiently effective by late 1968 to allow system programmers to use the system while still developing it."
I think that many from the agile community might change the "in spite of" to "because of"...

:-)

Mars Inc. Cacao Genome Database claims Open Access, public domain: falls short

This initially looked very promising: Mars, along with a number of collaborators (USDA; IBM; the Clemson University Genomics Institute; the Public Intellectual Property Resource for Agriculture at the University of California-Davis; the National Center for Genome Resources; the Center for Genomics and Bioinformatics at Indiana University; the HudsonAlpha Institute for Biotechnology; and Washington State University), has sequenced the cacao genome and released it as "Open Access" and "public domain" for the benefit of all, at a site called the Cacao Genome Project:
McLean, VA –Today, Mars, Incorporated, the U.S. Department of Agriculture-Agricultural Research Service (USDA-ARS), and IBM released the preliminary findings of their breakthrough cacao genome sequence and made it available in the public domain.
- From the Mars Inc. press release 15 September 2010
A quote from one of the collaborators on the project, from the Independent article on the release (First rice, then wheat – now cocoa genome unravelled, 15 Sep 2010):
Professor Shapiro, a molecular biologist, said: "We thought: 'Let's put this in the public domain so everyone has free access to it for eternity'. It could be patented and it can't be now. We have full open access.
"public domain"

"full open access"

As this is data, we could also be talking about Open Data.

Let's take a look at how 'open' this Cacao Genome Project is by examining the fine print (of the license):

In order to get access to the data, you have to create an account (no anonymous access; obligatory registration runs counter to Open Access and is arguably not 'public domain'). In order to get an account, you have to agree to a license.

Registration & license

From the license:
The Provider is making available the information and data found in the cocoa genome databases for general information purposes for scientific research, germplasm conservation and enhancement such as plant breeding, technical training, general education, academic use, or personal use.
Restricted use, appearing not to include commercial use. So more of a GPL-ish license than a BSD-ish one (before anyone calls me out: I am not saying the GPL is non-commercial, just that it is generally viewed as less commercial-friendly than BSD).

Moving on:
Anytime the User consults the data base through the cocoa genome database web site, he/she shall be bound to the same obligations under this IAA. Should the User store the information and data for future use he/she shall be bound to the same obligations under this IAA.

The User shall not transfer the information referred to in this agreement, or any copy of them, to a third party without obtaining written authorization from the Providers which will only be provided subject to the third party user entering into this same IAA.

Wow. That is particularly extraordinary. A WTF moment.

Fortunately I didn't agree to the license so I AM able to talk about it now.

Not allowing third parties to even see the license is inherently incompatible with the idea of Open Access, Open Source, Open Data, and the public domain.

It is simply bizarre in these modern times.


Moving on:
The User shall not claim legal ownership over the information and data found in the data base nor seek intellectual property protection under any form over these information, data and data base. For clarity, the user agrees not to claim any of the sequences disclosed in these databases in any patent application.

Translation: don't claim legal ownership, because we own the IP for the data AND the sequences, and (maybe) we will be claiming patents, etc. some time in the future. I have not been able to find anything on the site to the contrary (see 'Deluded or disingenuous?' below).

Moving on:
However, the foregoing shall not prevent the User from releasing, reproducing or seeking intellectual property protection on improved seeds or plants that may be developed using the information for purposes of making such seeds or plants available to farmers for cultivation.
This appears to allow commercial use of the database ("make available" can include selling the seeds), which seems to conflict with the earlier clause.


Conclusion

Clearly, this data set has not been released as Open Access and certainly not released into the public domain.

Instead of Open Access or public domain, they have a restrictive license, which allows gated access for a restricted set of uses.

They should therefore not be claiming Open Access or public domain for this data.



Deluded or disingenuous?

The "About" page of the Cacao Genome Project claims that the license is in place to defensibly block patents of the sequences. While this may be true, claiming an Open Access AND public domain release of the data is either disingenuous or deluded.
Public access to the genome will be available permanently without patent via the Cacao Genome Database. Before viewing the data, users have to agree that they will not seek any intellectual property protection over the data, including gene sequences contained in the database. The Information Access Agreement allows any cacao breeders and other researchers to freely use the genome information to develop new cacao varieties. This allows for a level playing field and a healthy competitive environment that will ultimately benefit the sustainability of cacao production in the long term.

'Free' as in 'beer' they should have said.

Saturday, August 21, 2010

Conference proceedings: Electronic Government and the Information Systems Perspective

Volume 6267: Electronic Government and the Information Systems Perspective. First International Conference, EGOVIS 2010, Bilbao, Spain, August 31 - September 2, 2010. [NB: Behind paywall]

Wednesday, August 18, 2010

What is Open Gov Data? The Sunlight Foundation: Ten Principles for Opening Up Government Information

My earlier entry/rant, It's not Open Data, so stop calling it that..., about the non-Open Data nature of a number of Canadian cities' Open Data initiatives is supported by the just-released Ten Principles for Opening Up Government Information from the Sunlight Foundation. Specifically:
  • 6. Non-discrimination

    "Non-discrimination" refers to who can access data and how they must do so. Barriers to use of data can include registration or membership requirements. Another barrier is the uses of "walled garden," which is when only some applications are allowed access to data. At its broadest, non-discriminatory access to data means that any person can access the data at any time without having to identify him/herself or provide any justification for doing so.

  • 8. Licensing

    The imposition of "Terms of Service," attribution requirements, restrictions on dissemination and so on acts as barriers to public use of data. Maximal openness includes clearly labeling public information as a work of the government and available without restrictions on use as part of the public domain.

  • 9. Permanence

    The capability of finding information over time is referred to as permanence. Information released by the government online should be sticky: It should be available online in archives in perpetuity. Often times, information is updated, changed or removed without any indication that an alteration has been made. Or, it is made available as a stream of data, but not archived anywhere. For best use by the public, information made available online should remain online, with appropriate version-tracking and archiving over time.

Friday, August 13, 2010

ARL Report: E-Science and Data Support Services

The U.S. Association of Research Libraries (ARL) has produced a new report (E-Science and Data Support Services).

Alzheimer's Spinal Fluid Test and Research Data Sharing

The recent reports on being able to predict Alzheimer's (Alzheimer's predicted by spinal-fluid test -- CBC, 2010.08.10) are the direct result of data sharing by scores of biomedical researchers (Sharing of Data Leads to Progress on Alzheimer's -- New York Times, 2010.08.10). The sharing included both academic researchers and drug company researchers. The data sets are available online:
Companies as well as academic researchers are using the data. There have been more than 3,200 downloads of the entire massive data set and almost a million downloads of the data sets containing images from brain scans.
The Alzheimer's Disease Neuroimaging Initiative (ADNI), the organization looking after the data, has a very complete policy (Alzheimer's Disease Neuroimaging Initiative (ADNI) Data Sharing and Publication Policy) governing its data sharing.

Tuesday, July 27, 2010

It's not Open Data, so stop calling it that...

While it is a great positive change that data is being released through numerous efforts around the world, data release is not the same as Open Data release. A number of Canadian cities have announced Open Data initiatives, but they are not releasing Open Data. They are just releasing data. Of course, this is better than not releasing data. But let's at least be honest about what we are doing.

Why aren't they Open Data? Because their licenses are not Open Data licenses:
  • Not Open Data: Edmonton: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use
  • Not Open Data: Vancouver: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - Terms of Use
  • Not Open Data: Ottawa: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use
  • Not Open Data: Toronto: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use
All of these licenses also suffer from the additional mis-feature of arbitrary retroactivity:
"The City may at any time and from time to time add, delete, or change the datasets or these Terms of Use. Notice of changes may be posted on the home page for these datasets or this page. Any change is effective immediately upon posting, unless otherwise stated"

These two clauses mean that there is no stability for someone using this data. If something they do or say (data-related or not) is not liked by the city whose data they are using, they can lose access. Or if the city finds that many data users are doing things it does not like, it can change the terms of use in ways that affect data previously obtained by users.

How to fix
Obligatory versioning of both datasets and licenses, and removal of the two clauses above. When a dataset is released, it is given a version, and that release is matched to a (usually the most recent) license version, which will always apply to that version of that data release. Any change to a license generates a new license version, applicable only to subsequent releases that choose to use it.
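To make this concrete, here is a minimal sketch of what pinning a dataset release to a license version could look like. It is only an illustration: the class and field names are hypothetical, not any city's actual scheme.

```python
# Minimal sketch of dataset/license version pinning -- hypothetical names,
# not any city's actual data model.
from dataclasses import dataclass

@dataclass(frozen=True)
class LicenseVersion:
    version: str              # e.g. "1.0"; any change to the terms creates a new version
    text: str

@dataclass(frozen=True)
class DatasetRelease:
    name: str                 # e.g. "transit-stops"
    version: str              # e.g. "2010-07-27"
    license: LicenseVersion   # fixed at release time; later license changes never apply

terms_v1 = LicenseVersion("1.0", "You may use, copy and redistribute this dataset ...")
terms_v2 = LicenseVersion("2.0", "Revised terms ...")

july_release = DatasetRelease("transit-stops", "2010-07-27", terms_v1)
august_release = DatasetRelease("transit-stops", "2010-08-15", terms_v2)

# Users of the July release keep the v1.0 terms, regardless of the later v2.0 terms.
print(july_release.license.version)    # -> 1.0
print(august_release.license.version)  # -> 2.0
```

Under this arrangement a license change can only affect future releases, which is exactly the stability property described next.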

This is how things work in the Open Source world. It means that if you possess a piece of Open Source software, under a license of a specific version, someone half-way across the world cannot turn you into a criminal and/or shut you down by retroactively changing the license. Of course, you may be shut out of the next version if they change its license, but that doesn't necessarily shut you down today. You have some level of stability.

An example: an SME builds a business based on data released by the cities. This business perhaps includes data-mining tools that reveal things that some of the cities do not like revealed or discussed. A city changes the license (remember: "...cancel or suspend ...without notice and for any reason...") or simply cancels or suspends the company's data access, and the company goes out of business.

-----

So, if you want to release Open Source code or Open Data, you must be willing to accept that it will be used in ways that you (and/or your constituents) may find offensive. That is how it works.


Update: 2010 10 14: Eight Principles of Open Data from Open Government Data Principles:
  1. Primary: Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
  2. Complete: All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
  3. Timely: Data is made available as quickly as necessary to preserve the value of the data.
  4. Accessible: Data is available to the widest range of users for the widest range of purposes.
  5. Machine processable: Data is reasonably structured to allow automated processing.
  6. Non-discriminatory: Data is available to anyone, with no requirement of registration.
  7. Non-proprietary: Data is available in a format over which no entity has exclusive control.
  8. License-free: Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
The above cities' licenses are not compliant with #4 and #6 of these eight principles. See also http://zzzoot.blogspot.com/2010/08/what-is-open-gov-data-sunlight.html

Update: 2010 Nov 7: http://acrosscanadatrails.posterous.com/civicaccess-discuss-importance-of-true-open-d

Saturday, July 24, 2010

University visitor @ Australian National University


Tomorrow is (sadly) my last official day* as a university visitor at the Australian National University (ANU), Canberra.
I've been here since late June, invited by ANU adjunct and Funnelback chief scientist (and ex-CSIRO) David Hawking, to visit the Algorithms and Data Research Group, School of Computer Science, College of Engineering and Computer Science.

I was installed in a lovely office



looking out into the campus,


where I've been working on large-scale journal visualization, a continuation of the Torngat project. I've been working on a couple of things, including applying Mulan to the multi-label problem of the corpus I am working with, so I can get precision and recall to evaluate this method empirically. My productivity has been hampered by a recurring stomach problem (which appears to be gone this last week: yay!), so I've not progressed as much as I would have wanted to... :-(
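As an aside for readers unfamiliar with multi-label evaluation: precision and recall are computed over label-indicator matrices, where each document can carry several labels at once. The tiny sketch below uses scikit-learn rather than Mulan (my actual setup is Mulan/Java) and invented data, purely to show the shape of the computation.

```python
# Multi-label precision/recall sketch -- scikit-learn, not Mulan, with invented data.
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Rows = documents, columns = labels; 1 means the label applies to that document.
y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

print(precision_score(y_true, y_pred, average="micro"))  # 1.0: every predicted label is correct
print(recall_score(y_true, y_pred, average="micro"))     # ~0.67: 4 of the 6 true labels recovered
```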
At the end of last week I gave a presentation at CSIRO (in the same building) on this work entitled: Search refinement: visualizing research journals in semantic space. After this talk I had a discussion with Alex Krumpholz and Hanna Suominen, and it is looking like we will be working together on a project involving Torngat.

I've also enjoyed the company of John Maindonald, one of the important players in the R universe (I went on a very nice Sunday walk with him, his wife, and some of their friends). He's arranged an invite for me to talk tomorrow to the Canberra R Users Group about how I've used R in the Torngat project.

I also enjoyed an afternoon this past week meeting with the Australian National Data Service (ANDS) people, arranged by the wonderful Monica Omodei (formerly Berko), learning about their success in putting together ANDS and where they were going. They were also interested in Torngat, so I gave them a brief presentation on it.

A bit of a surprise collaboration: I have committed myself to helping improve the single-threaded Lucene indexing benchmark in the DaCapo Java benchmarks, after discussions with ANU's Steve Blackburn, a Java VM and GC guru. I've also committed to implementing a new multi-threaded indexing benchmark. Most of the code will be derived from existing code from my LuSql tool (it is actually from the yet un-released LuSql v1.0 codebase).

While it has been winter (spring/fall by Canadian standards...) here in Canberra, I have still been amazed at the fantastic birds that are (still) here. Like nothing we have at home, there are the loud and raucous-yet-endearing sulphur-crested cockatoos:



Gang-gang cockatoos, crimson and eastern rosellas, and many others that flittered past in a rush of colour or sang in the distance, unseen and unrecognized by me. These guys below were also fairly common, but I'm not sure what they are:
A number of people have been very helpful and gracious with their time, and I'm going to list them here, in no particular order: Alex Krumpholz & family (thanks for the hike up Black Mountain & dinner & talk afterwards!); Tom Rowlands; Tim Jones; Paul Thomas; David Hawking & Kathy Griffiths; Diane Kossatz; Chelsea Holton; John Maindonald; Monica Omodei; Hanna Suominen; Steve Blackburn.

------------------------------

I think this is Mount Stromlo (the low hill/mountain to the right, with the mountains of the Brindabella Range (I think) in the background), seen from Black Mountain. You can see some of the Mount Stromlo Observatory as white dots on the crest of Mt. Stromlo. The observatory and the forest that was on Mt. Stromlo were mostly destroyed in the 2003 Canberra bushfires. When I was up at the observatory earlier in the month there were many burnt-out tree stumps to be seen. And 'roos. :-)








*Today is July 25th: here in Canberra I am 14 hours ahead of the eastern time zone in North America. But Blogger thinks I am there on the 24th...

Thursday, July 08, 2010

E-Government ICT Conference

The recent conference E-Government ICT Professionalism and Competences Service Science (IFIP International Federation for Information Processing, IFIP 20th World Computer Congress, Industry-Oriented Conferences, September 7-10, 2008, Milano, Italy) is of interest to those involved with ICT in government and eGovernment in general (although the conference is rather EU-centred).
Note the content is behind a pay-wall, so you can't read the articles unless you belong to an institution that has a subscription or you have one yourself.

Of interest:

Search: "open source":
  • Business Process Monitoring: BT Italy case study
  • The Italian Public Administration Electronic Market: Scenario, Operation, Trends:
    "The selected software to develop the dashboard has been Pentaho suite, which is an open source application. It better fits all the project needs that can be summarized by the following drivers:
    • Low license costs. It has no license costs.
    • Low impact on current systems architecture. It does not need a complex integration with source systems.
    • Availability of “off the shelf” features (reporting and KPIs analysis). It has rich libraries of graphical objects and reports to better show indicators.
    • Short Time to Delivery. The Dashboard has been delivered in three months including a tuning phase in which some new features had been added."

Table of Contents:

Tuesday, June 22, 2010

JCDL2010 Research Data Papers

I am at the 2010 Joint Conference on Digital Libraries (JCDL2010) at Surfer's Paradise in Queensland, Australia. Among the many interesting papers are two that are very relevant to those interested in research data issues.

[JCDL2011 will be held in Ottawa, Canada June 13-17 2011. I am the general chair for the conference]

Digital Libraries for Scientific Data Discovery and Reuse: From Vision to Practical Reality
Jillian Wallis, Matthew Mayernik, Christine Borgman and Alberto Pepe
Abstract. Science and technology research is becoming not only more distributed and collaborative, but more highly instrumented. Digital libraries provide a means to capture, manage, and access the data deluge that results from these research enterprises. We have conducted research on data practices and participated in developing data management services for the Center for Embedded Networked Sensing since its founding in 2002 as a National Science Foundation Science and Technology Center. Over the course of 8 years, our digital library strategy has shifted dramatically in response to changing technologies, practices, and policies. We report on the development of several DL systems and on the lessons learned, which include the difficulty of anticipating data requirements from nascent technologies, building systems for highly diverse work practices and data types, the need to bind together multiple single-purpose systems, the lack of incentives to manage and share data, the complementary nature of research and development in understanding practices, and sustainability.

Discovering Australia's Research Data
Stefanie Kethers, Xiaobin Shen, Andrew Treloar and Ross Wilkinson
Abstract. Access to data crucial to research is often slow and difficult. When research problems cross disciplinary boundaries, problems are exacerbated. This paper argues that it is important to make it easier to find and access data that might be found in an institution, in a disciplinary data store, in a government department, or held privately. We explore how to meet ad hoc needs that cannot easily be supported by a disciplinary ontology, and argue that web pages that describe data collections with rich links and rich text are valuable. We describe the approach followed by the Australian National Data Service (ANDS) in making such pages available. Finally, we discuss how we plan to evaluate this approach.

Saturday, June 12, 2010

Sydney in Winter is great!

I am in Sydney, soon to go to Surfer's Paradise for JCDL2010. It is winter here in Australia, with Sydney being ~2C overnight and 17-22C during the day. Fortunately the days have been sunny and dry:




Had a great tour of the University of Sydney's libraries by John Shipp, university librarian, who was very generous with his time:

Wednesday, April 21, 2010

W3C to lead Linked Open Data Camp at WWW2010

Linked Open Data (LOD) Camp
  • The year open data went worldwide, by Tim Berners-Lee
  • Open data deployments in electronic Government, by Sandro Hawke
  • Linked data: What about privacy?, by Thomas Roessler
  • Linked Open Data: What about Applications?, by Ivan Herman
  • Selection of Topics of Discussion for the afternoon sessions
Afternoon lightning talk subjects:

Wiki

Thursday, April 08, 2010

Article: Open Source econometric software: better performance, accuracy and bug fixing than commercial software

Yalta and Yalta 2010 ("Should Economists Use Open Source Software for Doing Research?") examine the reliability, accuracy and bug-fixing time for an Open Source econometric software package and five commercial econometric software packages. They find that after 5 years many of the bugs in the commercial software have not been fixed, whereas similar bugs in the Open Source software were fixed, and new versions released, within a week of the bugs being discovered.

Building on the work done by McCullough in 2004, which applied a set of tests known as Wilkinson's tests to the five commercial software packages, they re-apply these tests to the new versions of the commercial software, and also apply them to the Open Source econometrics package Gretl.
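Wilkinson's tests exercise exactly this kind of numerical fragility: extreme values, large common offsets, missing values and zero-variance variables. As a rough illustration of the failure mode such tests probe -- not the actual Wilkinson data, and not code from any of the packages tested -- here is the classic problem a naive one-pass variance formula has with data sharing a large common offset:

```python
# Illustration of the numerical fragility Wilkinson-style tests probe.
# The data below is invented (a large common offset, exaggerated so the failure
# is visible in 64-bit floats); it is not the actual Wilkinson test data.

def naive_variance(xs):
    # "Textbook" one-pass formula: (sum(x^2) - n*mean^2) / (n - 1).
    # Suffers catastrophic cancellation when the values share a large offset.
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)

def two_pass_variance(xs):
    # Numerically stable: sum of squared deviations from the mean.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [1e15 + i for i in range(1, 10)]   # nine values with a huge common offset

print(two_pass_variance(data))   # 7.5 -- the correct sample variance
print(naive_variance(data))      # wildly wrong: cancellation has destroyed the answer
```

The stable two-pass formula returns the correct 7.5; the naive formula's answer is destroyed by cancellation. Surfacing this kind of defect is what the accuracy tests are designed to do.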


The idea behind the Yalta and Yalta paper is to evaluate the bug-fix times of the Open Source software and compare them to the fixes -- if any -- and their times for the commercial software since the 2004 McCullough paper; more specifically, to examine the bugs found by McCullough and see whether they have been fixed.


The five packages examined by Yalta and Yalta, and earlier by McCullough, are:

Bugs and bug fixing times for Gretl and commercial software

Package   Bug                                   Time to fix bug and release new version

Gretl (open source)
Gretl     Reading files                         3 days
Gretl     Rounding error                        4 days
Gretl     Standard deviation                    1 day
Gretl     Spearman value                        1 day

Commercial packages
EViews    Correlation coefficients unit bug     <5 yrs
LIMDEP    All                                   DNA (a)
RATS      ZERO correlations                     >5 yrs (not fixed)
RATS      Singularity correlation estimates     >5 yrs (not fixed)
SHAZAM    Missing values                        <5 yrs
SHAZAM    X, BIG, LITTLE, MISS                  <5 yrs
SHAZAM    Correlations, Spearman correlation    >5 yrs (not fixed)
SHAZAM    MISS correlation                      >5 yrs (not fixed)
SHAZAM    ZERO correlation                      >5 yrs (not fixed)
TSP       ZERO correlation                      <5 yrs
TSP       Test IIB                              Failed

a We could not apply the tests on Package2 [LIMDEP] because, unlike the other packages, the demo version offered by the vendor only allows using several built-in data sets. As a result, it was not possible without payment to know whether or not they have fixed the flaws in their product. - Yalta and Yalta 2010

Conclusions

The authors make a number of important points in their conclusions:

  • On the other hand, studies in the last 15 years show that commercial software vendors can also introduce various difficulties to the research process by not correcting the known errors, avoiding to give details on the algorithms, or providing false information regarding their programs. Closed source software can hurt the reliability of computational results by making it impossible to study and verify the programming code performing the myriad functions expected from today's typical econometric package. It also complicates the process of research replication, which is already an exception and not a rule in the field of economics.

  • The open source movement, which has started to pick momentum after 1998, is now resulting in scientific software reaching and in some cases surpassing in terms of features and usability some of the proprietary alternatives. This new paradigm also brings its own set of inefficiencies such as an over-supply or under-supply of certain types of software, a surplus of licenses as well as the potential for wasted effort due to 'hijacking,' 'forking,' and 'abandoning.' When it comes to reliability and accountability, however, FLOSS helps avoid some of the difficulties associated with proprietary programs. Open source development is a transparent and merit based process similar in some ways to academics. The availability of the source code enables its verification by a large number of people within the economics profession. Because it is free, everyone has access to it. It is flexible and future proof. These not only result in software of a high standard, but also facilitate peer review and help advance research replication.

  • In an attempt to assess reliability and accountability, we applied an entry level test suite of accuracy on the gretl econometric package and discovered a number of software defects. However, because gretl is open source, our experience was considerably different in comparison to earlier studies assessing various proprietary packages...unlike the other studies, all of the errors were corrected within a week of our reporting. Moreover, each time there was a revision to one of the source files, the updated version of the program was immediately available for download and inspection...When we applied the same tests on four widely-used proprietary econometric programs, we found that the various flaws uncovered and reported in an earlier study were not necessarily corrected. Despite the 5 years passing, only two of the software vendors have fixed all of the reported errors and still there were problems in all of the packages that we were able to test.



The authors also list what they consider significant Open Source software in the economics and econometrics space:











Project      Category               Year    Developers   SLOC        Effort (est. person-years)
GNU Octave   Numerical analysis     1988    74           853,439     238
Gnumeric     Spreadsheet            2001    9            384,341     100
Gnuplot      Scientific plotting    1986    6            95,380      24
Gretl        Econometrics           2000a   10           361,393     94
Maxima       Algebra                1998a   17           616,576     167
PSPP         Statistics             1998    3            152,593     39
R            Statistics             1997    13           549,780b    151
Sage         Mathematics            2005    142          195,602     51
Scilab       Numerical analysis     1994    35           1,234,895   341
SciPy        Mathematical library   2001    31           455,903     124

Source: Ohloh.net
a Shows the year the program became Open Source
b Base system only; the more than 1700 contributed R extension packages are not included


New York Times article:
Data Analysts Captivated by R’s Power

References and related references

Thursday, March 04, 2010

Open Data and Open Source Tools: Examine BC Power Transmission

"*pybctc* is a python package that makes access to British Columbia Transmission Corporation (BCTC) electric data easier.

The British Columbia Transmission Corporation is a crown corporation with a mandate to plan, build, and operate the province of British Columbia's electricity transmission system. It publishes valuable information on electricity generation, transmission, and consumption to its website. This information is useful for many purposes including economic analysis, power trading, electric system study, and forecasting. The first step in using such information is to download it and parse it into useful data structures - a task performed by this library. The processed data normally will feed statistical methods, heuristics, and system models to provide a useful analysis of the British Columbia electric system.

The *pybctc* project is hosted at http://bitbucket.org/kc/pybctc"


The data this library accesses can be found here: B.C. Transmission Corp historical transmission data.
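For context, here is a generic sketch of that "download it and parse it into useful data structures" step: plain Python with the standard library, not pybctc's actual API (see the bitbucket project for that), and the URL is a placeholder.

```python
# Generic download-and-parse sketch -- NOT pybctc's API; the URL is a placeholder.
import csv
import io
import urllib.request

def fetch_rows(url):
    """Download a CSV report and parse each row into a dict keyed by column name."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical usage; the parsed rows would then feed statistical models,
# heuristics, or forecasting code as described above.
# rows = fetch_rows("https://example.org/bctc/historical_load.csv")
# print(rows[0])
```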

Sunday, February 28, 2010

Paper: "Sociological implications of scientific publishing"

Sociological implications of scientific publishing: Open access, science, society, democracy, and the digital divide
Ulrich Herb.
First Monday, Volume 15, Number 2 - 1 February 2010

Abstract
Claims for open access are mostly underpinned with

  1. science–related arguments (open access accelerates scientific communication);
  2. financial arguments (open access relieves the serials crisis);
  3. social arguments (open access reduces the digital divide);
  4. democracy–related arguments (open access facilitates participation); and,
  5. socio–political arguments (open access levels disparities).

Using sociological concepts and notions, this article focuses strongly on Pierre Bourdieu’s theory of (scientific) capital and its implications for the acceptance of open access, Michel Foucault’s discourse analysis and the implications of open access for the concept of the digital divide. Bourdieu’s theory of capital implies that the acceptance of open access depends on the logic of power and the accumulation of scientific capital. It does not depend on slogans derived from hagiographic self–perceptions of science (e.g., the acceleration of scientific communication) and scientists (e.g., their will to share their information freely). According to Bourdieu’s theory, it is crucial for open access (and associated concepts like alternative impact metrics) to understand how scientists perceive its potential influence on existing processes of capital accumulation and how open access will affect their demand for status. Foucault’s discourse analysis suggests that open access may intensify disparities, scientocentrism and ethnocentrism. Additionally, several concepts from the philosophy of sciences (Popper, Kuhn, Feyerabend) and their implicit connection to the concept of open access are described in this paper.

Thursday, January 28, 2010

Suite of ecology journals moving to open access research data policy

In a recent editorial in The American Naturalist (Am Nat 2010, Vol. 175, pp. 145–146), it was announced that the journals "The American Naturalist, Evolution, the Journal of Evolutionary Biology, Molecular Ecology, Heredity, and other key journals in evolution and ecology..." would be introducing data-archiving policies supporting access, re-use and long-term preservation. These policies are to be put in place within one year, and the example policy for The American Naturalist is given:
This journal requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as GenBank, TreeBASE, Dryad, or the Knowledge Network for Biocomplexity. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
They go on to explain the rationale and workings of the policy:
The data‐archiving policy is designed to address several concerns that some researchers may have about data sharing. To protect the ability of individual researchers to use the data that they have collected, the policy allows an embargo period after publication. While the data will be entered into an archive at the time of publication, the data may be restricted from public view for up to a year. This allows the original researcher time to publish other papers based on the data set. The policy also allows longer embargo periods at the discretion of the editor in exceptional cases. In addition, the requirement is only for data that have already been used in the publication in question; other data from the same research project that have not yet been described in a publication need not be archived. Finally, data that are particularly sensitive, such as location information for endangered species subject to poaching, should not be archived in a publicly accessible format. Human subject data should be anonymized (see the recommendations of the National Human Subjects Protection Advisory Committee 2002).

Sunday, January 03, 2010

Mattress Tobogganing in Quebec

In mid-December the new bed was delivered, and the delivery guys were supposed to take away the old bed. But - as I live in the country and it was snowy - the delivery truck got stuck down the road from my place. While the delivery guys were waiting for the tow truck, they decided to carry up the mattress, box spring and bed frame, and to carry down the old mattress and box spring (wrapped in the plastic that the new items had arrived in). Spontaneously, they decided to use these as plastic-wrapped toboggans: first down my driveway, and then, after carrying them around the corner, down the hill on my street to their truck.



The mattress proved to be the better of the two, getting some pretty good speed down the road.




I almost regret not having kept the old mattress for tobogganing! :-)