User:OpenScientist/DataONE Workshop on Data Governance 2011
- 1 Background
- 2 Timeline
- 3 Data governance barriers and challenges
- 3.1 Lack of incentives
- 3.2 Attitudes towards data sharing
- 3.3 Technical difficulties with data sharing
- 3.4 Legal issues
- 3.5 Legitimate reasons for withholding data
- 3.6 Institutional practices
- 3.7 Data processing
- 3.8 Data publishing
- 3.9 Data citation
- 3.10 Metadata
- 3.11 Ontologies
- 3.12 Scaling issues
- 3.13 Persistence
- 4 See also
Background
On Dec 14-15, 2011, DataONE hosted a workshop on data governance in Washington, right after CNI 2011. It aimed to provide research funders like NSF-OCI with recommendations regarding the sharing of data related to the research they fund.
The ca. 25 participants were "asked to provide a brief, one-page position paper on data governance barriers and challenges they have experienced or would like to be addressed, such as: license choice (or copy/data rights, contracts, and public domain dedications), attribution, publishing, citing, provenance, metadata, and standards adoption (e.g. for ontologies)."
This page hosts the drafting of that position paper from the perspective of the Open Knowledge Foundation and has been updated with some of the points that were discussed during the workshop. Anyone is still invited to join in to make this document useful for further activities in this area.
Data governance barriers and challenges
Lack of incentives
- Researchers typically receive no rewards for sharing data
- Data sharing requirements (e.g. data management plans) published by research funders have so far had no teeth, as they were not enforced
Attitudes towards data sharing
- Due to the closed nature of most research projects, data sharing typically takes place a considerable time after data acquisition and is often perceived as a burden. For projects conducted in the open, data sharing is mostly part of the routine of taking lab notes.
- Employment contracts often expressly forbid data sharing
- I get the feeling (difficult to evidence) that some people mistakenly assume their employers don't want data shared, even when this isn't actually the case
- Exploitation of reusers by requiring co-authorship on more than one paper using the data (e.g. ADNI)
- What researchers want is credit, what scientific norms usually give them is citations, what the law requires is attribution. The last of these does not guarantee either of the first two.
- Funding contracts from industry may impose restrictions on data sharing
Technical difficulties with data sharing
- Large file sizes (e.g. >1 GB) tend to pose a problem for some people. But this isn't a 'real' technical barrier; it just requires strategy/knowledge to overcome.
- Lack of standards for data sharing
Legal issues
There are three main legal formats that could be used to share data: contracts, licenses and waivers. In the following, these are loosely lumped together under "licensing".
- People who generate raw data often don't care about or understand data licensing. Therefore, they either neglect to explicitly license their data at all, or, if they do actively choose a license, they may choose an inappropriate one because of a lack of understanding of its implications.
If it is not clear under what terms public data is available for reuse, it is effectively not available for reuse.
- Public data available under incompatible licenses (e.g. CC BY-SA and CC BY-NC-SA) cannot be mashed up in a way that would allow resharing the resulting mashup.
- Licenses are not necessarily compatible with things like patient privacy restrictions
If a project reuses public data available under different but compatible licenses (e.g. CC BY and CC BY-SA), any resulting mashup would have to be released under the more restrictive license (in this case, CC BY-SA). Large-scale reuse thus benefits most from the fewest restrictions on data reuse; where restrictions do apply, such reuse contributes to spreading them.
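The compatibility reasoning above can be sketched in a few lines of Python. This is only a simplified illustration, not legal advice: it assumes that each license can be reduced to the set of conditions it imposes (attribution, share-alike, non-commercial), leaves out no-derivatives licenses, and adds CC0 as an example of a waiver. The table `LICENSES` and the function `mashup_license` are hypothetical names introduced here.

```python
# Simplified model (an assumption of this sketch): a license is the set of
# conditions it imposes. BY = attribution, SA = share-alike, NC = non-commercial.
LICENSES = {
    "CC0": frozenset(),
    "CC BY": frozenset({"BY"}),
    "CC BY-SA": frozenset({"BY", "SA"}),
    "CC BY-NC": frozenset({"BY", "NC"}),
    "CC BY-NC-SA": frozenset({"BY", "NC", "SA"}),
}


def mashup_license(*names):
    """Return the license a mashup of the named inputs could be reshared
    under, or None if the inputs are incompatible in this simplified model."""
    terms = [LICENSES[name] for name in names]
    # The mashup has to honour every condition of every input.
    combined = frozenset().union(*terms)
    # A share-alike input insists that the result carries exactly its own
    # terms, so an extra condition from another input rules the mashup out.
    for conditions in terms:
        if "SA" in conditions and conditions != combined:
            return None
    # Look up the license whose conditions match the combined set.
    for name, conditions in LICENSES.items():
        if conditions == combined:
            return name
    return None
```

Under these assumptions, `mashup_license("CC BY", "CC BY-SA")` returns `"CC BY-SA"`, while `mashup_license("CC BY-SA", "CC BY-NC-SA")` returns `None`, matching the two examples given above.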
Legitimate reasons for withholding data
- Privacy (patients, survey respondents etc.)
- Endangered species or artwork
Institutional practices
- Not really geared towards data sharing yet
Data processing
In many circumstances, the raw data is useless if not accompanied by the protocols and code used to process it.
Data citation
Via the papers originally reporting the data
In the references
This is the usual way to give credit for a dataset, but may be too crude to allow for proper citation of a particular piece of information. Also, if the data is only in the paper and not posted in a dedicated database, it is likely in a format that impedes reuse (e.g. a table in a PDF document).
Via supplementary materials
- Data re-use in papers, particularly (but not solely) meta-analyses, is often not properly counted, as the references to the datasets used are put in supplementary materials. The authors of the original data then don't get citation credit (Seeber, 2008; Kueffer, 2011).
- A few journals like Nature have taken steps to rectify this problem. Nature indexes references supplied in a special Methods Supplementary Information section; however, many articles with un-indexed 'hidden citations' can still be found even in this journal (see Piwowar, 2011 and comments).
Seeber, F. Citations in supplementary information are invisible. Nature 451, 887 (2008). URL http://dx.doi.org/10.1038/451887d
Kueffer, C. et al. Fame, glory and neglect in meta-analyses. Trends in Ecology & Evolution 26, 493-494 (2011). URL http://dx.doi.org/10.1016/j.tree.2011.07.007
Piwowar, H. Indexing citations. (2011). URL http://researchremix.wordpress.com/2011/08/17/indexing-citations/
Via accession numbers
This is another common way of citing data, and it is likewise not commonly indexed by bibliometric and other scientometric tools.
Via Data DOIs
- Fits in with classical citation tracking
Via short-lived websites
Many authors who do share their data use their personal or institutional websites for this purpose. As a consequence, the datasets are often lost a few years down the line, e.g. when the researchers move elsewhere or when the server or content management system receives an upgrade.
Metadata
Data unaccompanied by proper and consistent metadata is effectively not usable.
Ontologies
Ontologies exist for a huge variety of applications, but most of them are not widely used, not even in the primary area for which they were designed. Ontologies with a more general scope often have compatibility issues with more specialized ones.
Scaling issues
Beyond a certain scale (which varies according to circumstances), the sharing of data becomes difficult to manage. Automation is an obvious solution if the kind of data and its processing are fairly well established, but it is not always achievable for new kinds of data.
Persistence
Permanent archiving is a problem for anything digital, and no widely applicable solution exists. The scale and diversity of datasets (e.g. as compared to papers) make them especially vulnerable to loss or damage. Part of these problems can be remedied by sharing data openly and encouraging mirrors and other ways of systematic copying, indexing and reuse. A complicating factor is that, as mentioned above, data alone is useless unless accompanied by metadata, protocols and tools for processing and visualization, all of which will have to be preserved in some form as well.
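One way mirrors can help against silent loss or damage is by comparing each copy against a checksum recorded when the dataset was archived. The following is a minimal sketch of that idea, assuming SHA-256 as the checksum and local file paths for the copies; the function names `sha256_of` and `damaged_copies` are hypothetical and not part of any existing archiving tool.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large datasets do not need to fit into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def damaged_copies(reference_digest, copies):
    """Return the paths among `copies` whose content no longer matches
    the digest recorded when the dataset was archived."""
    return [path for path in copies if sha256_of(path) != reference_digest]
```

A copy flagged by `damaged_copies` could then be restored from any mirror that still matches the reference digest. The same check would have to cover the accompanying metadata, protocols and code as well.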
See also
- Open Definition by the Open Knowledge Foundation
- Integrating wikis with scholarly workflows
- Wikipedia article on Data management plan
- Data Management Plan Tool
- Open Access and Open Data policy of the Wikimedia Foundation (draft)
- Wikimedia Foundation's response to the OSTP's Request for Information on Public Access to Digital Data and Scientific Publications
- Publishing Open Data Working Group, includes recommendations on data sharing statements and supplementary materials