7  A tentative workflow for scientific writing

The online remote repository can serve several purposes when writing a scientific paper. It is a shared workspace for you and your collaborators. It can be used to showcase code and computations, thereby supporting your paper. It can also act as a backup and reference for yourself. As mentioned earlier, GitHub provides multiple tools that can make the writing process more effective and produce several outputs (a manuscript, a website, issue tracking, and more). In this chapter we will discuss a suggested basic repository template for writing a scientific paper. We will also address writing workflows for two different scenarios: writing on your own and writing in collaboration.

7.1 A basic template for an effective repository

Table 7.1 suggests a basic repository structure for a scientific writing project. The template aims to balance flexibility with generalizability. We want to be able to tweak the organization of the project if needed, but we also want to feel at home when switching between projects. The first two files in Table 7.1 are important for maintaining a clean workflow. The README file can be used to outline the project and list dependencies. Dependencies, in this context, could be software, data, or any other components needed to reproduce the analysis. The .gitignore file should be actively maintained to ensure that we version control only the files we want. For example, rendered outputs should often be excluded from version control, as they change with every render and can trigger merge conflicts. Later in the project, we might want to add selected output files to the online repository, as these are easier to read than source files. When that happens, we need to update .gitignore with new rules.

Table 7.1: Basic repository template for a scientific writing project
| File or folder | Description |
|---|---|
| .gitignore | An up-to-date registry of what git should ignore. Importantly, this avoids trouble created by adding, e.g., large files to the version history. It also avoids merge conflicts due to changes in, e.g., HTML files (output). |
| README.md | A README file used to give a synopsis of the project and to describe the structure of the repository and any dependencies (software, data, etc.) needed to reproduce the results. |
| /data | When analytic data is stored together with the repository, it goes into the data folder. This folder should be further subdivided. |
| /data/raw-data | Raw data, as collected, goes into its own folder. It might need processing before analysis. |
| /data/derived-data | Intermediate data processed from raw data, or data derived from analyses, can be placed in a specific sub-folder. In principle, all data in the derived data folder is reproducible using scripts. |
| /R | The R folder contains all scripts used to pre-process and analyse data. |
| /figures | It is generally tidy practice to keep figure files in a separate folder. |
| /resources | Resources are additional files needed for compiling the manuscript file. |
| /resources/citations-style.csl | A Citation Style Language (CSL) file contains the information Quarto needs to render a bibliography from citation entries. |
| /resources/bibliography.bib | A bibliography file contains all entries used for citation. Usually, this is a BibTeX file, editable as a plain text file. |
| manuscript.qmd | The manuscript file is the source file for the manuscript output file. |

The data folder is subdivided into raw and derived data. The idea is to clearly separate collected data from intermediate data created during data cleaning or by, e.g., running statistical models. The derived data folder could, in principle, be deleted and recreated using scripts. The reason for not doing this routinely while developing the analysis is to avoid waiting for analyses to complete.
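The idea of reusing derived data until it needs to be rebuilt can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the file names, the simulated raw data, and the cleaning step are all assumptions, and a temporary directory stands in for the project root so that the example is self-contained.

```r
# Sketch: recreate derived data only when it is missing.
# All paths and the cleaning step are hypothetical.
root <- file.path(tempdir(), "derived-data-demo")
raw_dir <- file.path(root, "data", "raw-data")
derived_dir <- file.path(root, "data", "derived-data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(derived_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for collected raw data (one missing value to clean away)
raw_file <- file.path(raw_dir, "measurements.csv")
write.csv(data.frame(id = 1:4, value = c(2.1, NA, 3.4, 5.0)),
          raw_file, row.names = FALSE)

derived_file <- file.path(derived_dir, "measurements-clean.rds")

# Run the (potentially slow) processing step only if no derived file exists
if (!file.exists(derived_file)) {
  raw <- read.csv(raw_file)
  clean <- raw[!is.na(raw$value), ]
  saveRDS(clean, derived_file)
}

# Later scripts and the manuscript read the derived file directly
clean <- readRDS(derived_file)
```

Deleting the derived file forces the processing step to run again on the next call, which is what makes the derived data folder disposable.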

The R folder contains all scripts needed to perform the analysis (except figure scripts, see below). R scripts are used to import and clean data, run analyses (fit models, clean output), and organize results. R files can be sourced in subsequent scripts and in the manuscript file. My suggestion for a clean repository is to keep figure source files in a separate folder together with the output from these files. When we submit a scientific paper, the figures are often submitted as standalone files. Having a dedicated subfolder for this work makes it easier to manage final editing. Figure source files can easily be sourced into the main manuscript file if we want to add them as part of the manuscript output.
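A figure source file that writes a standalone figure could look something like the sketch below. The file name figure-1.pdf, the toy data, and the use of base R graphics are assumptions made for illustration, and a temporary directory stands in for the /figures folder.

```r
# Sketch of a figure script, e.g. figures/figure-1.R (hypothetical name).
fig_dir <- file.path(tempdir(), "figures")  # stands in for /figures
dir.create(fig_dir, showWarnings = FALSE)

# Toy data; in a real project this would be read from data/derived-data
d <- data.frame(x = 1:10, y = (1:10)^2)

# Write the figure to a standalone PDF file
fig_file <- file.path(fig_dir, "figure-1.pdf")
pdf(fig_file, width = 5, height = 4)
plot(d$x, d$y, type = "b", xlab = "x", ylab = "y")
dev.off()
```

Keeping the script and its output side by side means the standalone file can be regenerated at any time, for example when a journal asks for different dimensions.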

The resources folder is used to collect all files needed to build the manuscript output. This usually starts with adding a Citation Style Language (CSL) file and a file containing the bibliography in BibTeX format. A CSL file relevant for the planned submission can often be found in the Citation Style Language GitHub repository, which hosts more than 1000 different styles for specific scientific journals. A bibliography file can be built using, e.g., your own Zotero library or online tools such as doi2bib. RStudio also has integrated tools for working with citations, which can be used to update a maintained BibTeX file.

The manuscript file is the source file for the manuscript output. This is where we combine the text outline of the paper with R (or other language) code to produce the manuscript. Some analyses can be incorporated into the manuscript file, but heavy computations and data pre-processing might benefit from having dedicated R scripts (placed in the R folder).
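The template described in this section can be set up by hand, but as a rough sketch it could also be scripted. In the example below the project name my-paper and the ignore rules are assumptions to adapt, and the skeleton is created in a temporary directory rather than in a real project. The CSL file is left out on purpose, since it is normally downloaded rather than created empty.

```r
# Sketch: create the folder skeleton from Table 7.1.
root <- file.path(tempdir(), "my-paper")  # hypothetical project name

folders <- c("data/raw-data", "data/derived-data", "R", "figures", "resources")
for (folder in folders) {
  dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Empty placeholder files to be filled in later
files <- c("README.md", "manuscript.qmd", "resources/bibliography.bib")
created <- file.create(file.path(root, files))

# Example ignore rules (assumptions; adjust to the project)
writeLines(
  c("# Rendered output",
    "*.html",
    "*_files/",
    "",
    "# R session files",
    ".Rhistory",
    ".RData",
    ".Rproj.user/"),
  file.path(root, ".gitignore")
)

# Inspect the result, including hidden files such as .gitignore
list.files(root, recursive = TRUE, all.files = TRUE)
```

Scripting the skeleton is optional. Its main benefit is that every new project starts with the same layout.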

7.2 A workflow for solo writing

After setting up a basic repository outline, the primary purpose of keeping track of updates using the version control system should be to maintain a functional copy of the project. This means that changes committed to the repository history should work and, e.g., not break the rendering of output. This requires that the project be maintained through small incremental changes that are tested before being committed to the version history. An “informal” test could be to inspect the output of rendering the manuscript file. In more complex analyses, a set of tests can be incorporated into the project, ensuring that, for example, derived data are created with expected data types or dimensions, and that statistical models run correctly. Such tests could be created programmatically (see below).

When an update to the project is satisfactory, a commit to the version history is made. At the end of the day, a push to the remote repository makes it function as a backup. Additionally, the capabilities available at GitHub make it possible to keep notes on milestones and issues for the project.

7.2.1 Large revisions - Branches

When a writing project is due for a significant revision, maybe as a result of peer review, we might want to keep an unrevised copy of the project for reference. We can do this by using the branching functionality of git. A branch is a parallel version history branching from a specific time point in a project (see note 1). In this context, we are using a branch to incorporate changes into the project without breaking the functionality in the main branch.

To create a new branch, we would use the command line and write git branch revision, where revision is the name of the new branch. This creates a new branch that starts from the current commit (here, the tip of the main branch) and from then on has its own version history. The command does not switch to the new branch. To switch between branches in our project, we use git checkout revision. We can now incrementally add changes to the new branch and still have the functioning main branch as a reference or fallback. A clean workflow could make use of GitHub issues together with branching to systematically address reviewer comments (as suggested by e.g., Van den Burg (2019)).

Git is designed to make branching easy and lightweight (Chacon 2014, 63). A branch can develop into a new branch and merge again with the parent branch. This allows for experimental work on a separate branch from the revision branch, enabling safe experimentation with potentially breaking changes to the project.

7.3 Writing in a team

If you own a repository and want to invite collaborators to contribute to it, you need to add them to the project in GitHub. This is done under Settings and Collaborators. When working as a team, the remote repository becomes more important because it is where you get the latest version of the project at the start of a writing day. git pull incorporates the latest changes into your local repository, and git push makes your changes available to all other team members. When several people work on the same branch, it is important to check for updates regularly.

Using git pull regularly ensures that any merge conflicts can be resolved locally and that the fix is incorporated in your commit. A merge conflict occurs when two versions of the same file contain overlapping changes that git cannot reconcile automatically. When you encounter a merge conflict, you will need to fix it manually by choosing which version you want to keep. Git flags a merge conflict with a message similar to this:

$ git merge
Auto-merging manuscript.qmd
CONFLICT (content): Merge conflict in manuscript.qmd
Automatic merge failed; fix conflicts and then commit the result.

You can use git status to see which files are affected by the conflict. When opening a flagged file, you will encounter:

<<<<<<< HEAD
Text in your local version of the file
=======
Text in the remote version of the file
>>>>>>> REMOTE

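By removing the whole segment (from the line starting with <<<<<<< HEAD to the closing marker line) and replacing it with the version you think should go into the file, you resolve the conflict. Afterwards, stage the file with git add and commit to complete the merge.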
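Before committing a resolved file, it can be useful to confirm that no conflict markers were left behind. The sketch below is one possible way to do this from R. The helper find_conflict_markers() is hypothetical (it is not part of git or of any package), and the demonstration file is created in a temporary location.

```r
# Sketch: detect leftover git conflict markers in a text file.
# find_conflict_markers() is a hypothetical helper written for this example.
find_conflict_markers <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Marker lines start with seven "<", "=" or ">" characters
  which(grepl("^(<{7}|={7}|>{7})", lines))
}

# Demonstration on a temporary file containing an unresolved conflict
demo_file <- tempfile(fileext = ".qmd")
writeLines(c(
  "Introduction",
  "<<<<<<< HEAD",
  "Text in your local version of the file",
  "=======",
  "Text in the remote version of the file",
  ">>>>>>> REMOTE",
  "Methods"
), demo_file)

# Line numbers of any remaining markers; an empty result means none were found
marker_lines <- find_conflict_markers(demo_file)
marker_lines
```

A check like this is deliberately crude. A line of seven or more equals signs used for other purposes, such as a heading underline, would also be reported.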

7.3.1 Branches and pull requests

Teamwork on writing could also benefit from creating pull requests from branches. A pull request is a formal request to incorporate changes made to a branch into the main (or any selected) branch. When a collaborator has been working on a large change to the project as part of a branch, its incorporation could be reviewed as part of a pull request. GitHub makes it easy to discuss pull requests.

7.4 Doing formal testing

When writing an R package, the use of formal tests is a way to ensure that components that are part of the package (e.g., functions and data) work the way they are intended. Formal tests can be written, for example, to check if a data set used for analysis contains all expected rows. The toy example below evaluates the data frame stored in d and checks its dimensions. The “test” passes silently if everything works, but raises an error if any check does not evaluate to TRUE. This test could be incorporated as part of sanity checks in R scripts and the manuscript file, and is part of good practice when working on complex projects.

d <- data.frame(var1 = c(1, 2, 3), 
                var2 = c("a", "b", "c")) 

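# Check that d has the expected dimensions.
# stopifnot() raises an error unless every expression evaluates to TRUE.
# Each check is passed as its own argument: wrapped together in { },
# only the last expression would be tested.
stopifnot(
  nrow(d) == 3,
  ncol(d) == 2
)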

More complicated tests can be added to the project. For this purpose, the testing functions in the testthat package could be helpful. When writing tests for a paper, we could use these functions as part of scripts, giving us a “Test passed” message when the test was executed with the expected results.
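As a minimal sketch of such a test, the toy data frame from above can be checked with testthat expectations. The test description and the choice of expectations are illustrative assumptions, and the example requires the testthat package to be installed.

```r
library(testthat)

# Same toy data as in the stopifnot() example
d <- data.frame(var1 = c(1, 2, 3),
                var2 = c("a", "b", "c"))

# Each expectation is checked; a failing expectation raises an error
# that reports which expectation failed and why.
test_that("d has the expected structure", {
  expect_equal(dim(d), c(3, 2))
  expect_named(d, c("var1", "var2"))
  expect_type(d$var1, "double")
})
```

Compared with stopifnot(), the expectations give more informative failure messages, which helps when a check breaks long after it was written.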

7.5 Going further - Make files

In the suggested workflow outlined at the beginning of this chapter, the manuscript file essentially works as a makefile: it sources R scripts to rerun analyses. However, when a data analysis grows complex, more formalism may be needed to ensure that things work out as intended. The targets package is an R package designed to create data analysis pipelines in R. The idea is that each step of a pipeline is rerun only when its code or its upstream inputs have changed, so that the output always reflects an up-to-date pipeline without unnecessary recomputation. targets supplies R-based tools for creating data pipelines that could include data cleaning, modelling, etc.
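A pipeline in targets is defined in a file called _targets.R at the project root. The sketch below shows roughly what such a file could contain for the repository template in this chapter. The target names, the raw data file name, and the cleaning and modelling steps are assumptions made for illustration, and the targets package must be installed.

```r
# Sketch of a _targets.R file (names and steps are hypothetical).
library(targets)

pipeline <- list(
  # Track the raw data file itself, so that edits to the file
  # invalidate every target that depends on it
  tar_target(raw_file, "data/raw-data/measurements.csv", format = "file"),
  # Import, clean, and model; each step depends on the one before
  tar_target(raw, read.csv(raw_file)),
  tar_target(clean, raw[!is.na(raw$value), ]),
  tar_target(model, lm(value ~ group, data = clean))
)

# The file must end with the list of targets
pipeline
```

With this file in place, calling targets::tar_make() from the project root builds the pipeline and skips targets that are already up to date, and targets::tar_outdated() lists the targets that would be rerun.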


  1. See here for a detailed overview.