Monday, December 20, 2010

Research Data and Metadata at Risk: Degradation over Time

Research data and metadata, usually derived from an experiment or series of experiments, are the researcher's central focus during the experiment and the subsequent interpretation, paper writing and publishing. But once the researcher has moved on to their next effort, this data is very much at risk. Often (read 'the rule rather than the exception'), the data is not properly managed or archived, and resides as a single copy on the researcher's desktop or perhaps a research server.

In addition, the metadata is minimal or non-existent and, where it does exist, is interpretable only by the researcher and their colleagues or students. Over time, the chance increases that the data will be lost or that the researcher will forget useful knowledge about it, and the information content of the data and metadata rapidly decreases. Some events can seriously accelerate this decline: data loss (media failure, computer replacement, other serious accidents or failures, etc.); a change of career or retirement; and the death of the researcher.

This common scenario was first described in published form (to my knowledge) in 1997, in a paper entitled Nongeospatial Metadata for the Ecological Sciences (citation below). It included a very expressive diagram, which I have re-created for another paper and which you can see below:

A higher-resolution PDF can be found here.

The diagram was created using LaTeX and TikZ. The source files can be found here.

From: de la Sablonnière, Auger, Sabourin and Newton. 2012. Facilitating Data Sharing in the Behavioral Sciences. Data Science Journal, Volume 11, 23 March 2012.
After: Michener, W., J. Brunt, J. Helly, T. Kirchner & S. Stafford. 1997. Nongeospatial Metadata for the Ecological Sciences. Ecological Applications 7(1):330-342. DOI: 10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2
The Michener paper is excellent and was well ahead of its time. Many of its insights generalize to other research domains.