R packages for data management and sharing

A basic principle in reproducible research is that the data and code used to generate results should be made available. Ideally, each component of a reproducible report (paper, thesis or similar) should be documented and organized to allow for independent execution of the code that supports its conclusions (Peng 2011). Publishers and journals have started to highlight reproducible scientific work, sometimes under the term open research¹. Similarly, funding agencies are taking steps to support or even demand open research practices, including reproducible research. Even though the principles are well described and researchers have many incentives to publish reproducible open research, many still fail to do so, and reproducible research is not yet common practice. Reasons include a lack of skills and knowledge on how to do it, insecurity or embarrassment about sharing behind-the-scenes work, and fear of inappropriate use (Gomes et al. 2022). Such subjective concerns can be effectively addressed with a small to medium investment of time in learning the skills and available solutions (Gomes et al. 2022) and, more importantly, by using reproducible practices throughout your workflow.

Efficiently sharing code and data can be challenging if you do not rely on software and workflows specifically designed for this purpose. The R ecosystem has many advantages, and an obvious one in this context is the possibility to formally combine data, code, and documentation in a portable unit called a package. An R package can be shared through CRAN or through code hosting platforms such as GitHub. Alternatively, a package may be compressed and shared as a single file in a data repository such as DataverseNO. From a data-sharing perspective, an R package may include the initial data-cleaning steps that transform raw, unprocessed data into data suitable for statistical analyses. Such steps are part of the package source and may be documented as part of the package. Complex projects often involve multiple datasets that do not share a common structure: some variables are measured across multiple time points, while others are measured at a single time point but across several domains or items. In such cases, it is not possible to curate a single, combined tabular dataset without losing information or introducing complexity.
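The problem with forcing differently structured data into one table can be illustrated with a minimal sketch in base R. The data sets, variable names and values below are invented for illustration and do not come from any real study.

```r
# Hypothetical example data; all names and values are invented.

# Repeated measures: one row per participant and time point.
strength <- data.frame(
  participant = rep(c("p1", "p2"), each = 3),
  time = rep(c("pre", "mid", "post"), times = 2),
  load_kg = c(50, 55, 60, 40, 42, 47)
)

# Questionnaire: one row per participant and item, measured once.
questionnaire <- data.frame(
  participant = rep(c("p1", "p2"), each = 2),
  item = rep(c("q1", "q2"), times = 2),
  score = c(3, 4, 2, 5)
)

# Forcing both into a single table crosses every time point with
# every item, so each measurement is repeated several times.
combined <- merge(strength, questionnaire, by = "participant")

nrow(strength)       # 6 rows
nrow(questionnaire)  # 4 rows
nrow(combined)       # 12 rows
```

The combined table holds each strength measurement twice and each questionnaire score three times. Anyone analysing it has to know which rows to drop before, for example, averaging `load_kg`. Keeping the two tables separate avoids that duplication.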

In an R package, you can store several data sets of different types side by side and combine them with analytic code only when a specific purpose requires it. Additionally, “helper functions” can be included in the package to aid users in reproducing analyses or preparing data for novel analyses.
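As a minimal sketch of such a helper function, the example below attaches one questionnaire item to a set of repeated measurements. The function `add_item_score()` and the two data sets are hypothetical and invented for illustration. In a package, the function would typically live in the `R/` directory and the data sets in `data/`.

```r
# Hypothetical data sets, standing in for data shipped with a package.
strength <- data.frame(
  participant = rep(c("p1", "p2"), each = 3),
  time = rep(c("pre", "mid", "post"), times = 2),
  load_kg = c(50, 55, 60, 40, 42, 47)
)

questionnaire <- data.frame(
  participant = rep(c("p1", "p2"), each = 2),
  item = rep(c("q1", "q2"), times = 2),
  score = c(3, 4, 2, 5)
)

# Hypothetical helper: add the score of a single questionnaire item
# as a new column to the repeated strength measurements.
add_item_score <- function(strength, questionnaire, item) {
  stopifnot(item %in% questionnaire$item)
  # Keep only the requested item and name the score column after it.
  selected <- questionnaire[questionnaire$item == item,
                            c("participant", "score")]
  names(selected)[names(selected) == "score"] <- item
  # Left join: keep every strength measurement.
  merge(strength, selected, by = "participant", all.x = TRUE)
}

analysis_data <- add_item_score(strength, questionnaire, item = "q1")
```

Here the combination happens on demand and for one stated purpose, so the stored data sets keep their original structure and the user gets a table with one row per participant and time point.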

This part of the course presents package development in R as a tool for data curation and sharing, enabling reproducible data analysis.

¹ Wiley marks publications with “open research badges” to indicate shared data, pre-registration and open material. The journal Biostatistics was an early adopter of marking papers to indicate code, data and reproducibility (Peng 2011).