4 Reproducible data analysis: Projects and Quarto

A key goal of this crash course is to introduce R and R-friendly tools as part of a workflow for reproducible data analysis. This means that we want to be able to share not only the results from an analysis but also all the parts that created the results. The degree to which an analysis is reproducible is determined by the availability of (Peng, Dominici, and Zeger 2006):

Peng, R. D., F. Dominici, and S. L. Zeger. 2006. “Reproducible Epidemiologic Research.” Am J Epidemiol 163 (9): 783–89. https://doi.org/10.1093/aje/kwj093.

the data
code to analyse the data
text to describe the code

To make these ingredients even more tasty, we might want to have them nicely stored together. Using the tools we discuss in this course, we can think of data analysis projects as self-contained projects with all necessary ingredients. RStudio projects can help you organize your data, code, and text in one place. You can also link your project to an online repository for others to access. In addition, combining computer code and text in one document makes it really easy to accomplish reproducible reporting. This is made possible by the Quarto publishing system.

4.1 RStudio projects and your reproducible report

When you build an analysis using scripts and Quarto files (presented below), you need to have R recognize that you are working in a specific folder. The folder where you keep your source files and any subfolders is the root directory. This directory (or folder) is the top directory in a file system used for your analysis. R will look for data or other files used to generate the report in this folder structure, if we have told R to do so. Think of this folder as ./ (confusing, I know! But bear with me!). Any sub-folders to the root directory can be called things like

./data/ (a folder where you keep data files),
./figures/ (a folder where you output figures from analyses).

A Quarto file can be used to generate a report. For convenience, it should be placed in the root directory, where it could have the “address” ./my_analysis.qmd.

This has several advantages, as long as you stick to one rule: When doing an analysis, always use relative paths (“addresses” to files and folders). Never reference a folder or file by its absolute path.¹

¹ The absolute path for the file I’m writing in now is C:/Users/Daniel1/Documents/projects/r-crash-course/04-quarto-basics.qmd. The relative path is ./04-quarto-basics.qmd. When working in a “project” you may move the folder containing your project to other locations. When you use relative paths, links between files will not break.

If you want to share your analysis, all you need to do is share the folder with all content with your friend. If you use relative paths, everything will work on your friend's computer. If you use absolute paths, nothing will work, unless your friend's computer uses the same folder structure (highly unlikely).
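As a minimal sketch of the rule above, the code below writes and reads a small data set using only relative paths. The file name example-data.csv and the data values are made up for illustration, and the sketch assumes that the working directory is the project root.

```r
# Assumption: the working directory is the project root (./).
# The file name and data values below are hypothetical examples.

# Create the data folder if it does not already exist.
dir.create("data", showWarnings = FALSE)

# A relative path, built from the root directory.
data_path <- file.path("data", "example-data.csv")

# Store a small data set and read it back using the same relative path.
example_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))
write.csv(example_data, data_path, row.names = FALSE)

dat <- read.csv(data_path)
print(dat)
```

Because data_path never mentions anything above the project folder, the same code should run unchanged after the folder is moved or shared.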

RStudio projects make it easy to jump back and forth between projects and are a way to tell R what folder should be considered the root directory. The project menu (top right corner in RStudio) contains all your recent projects. When starting a new project, RStudio will create a .Rproj file that contains the settings for your project. If you start a project and click this file, a settings menu will appear where you can customize settings for your particular project.

What does this have to do with my Quarto file? As mentioned above, the source file is often written in a context where you have data and other files that help you create your desired output. Always working in a project makes it easy to keep every file in the right place while doing interactive analysis. When rendering a Quarto file, R and Quarto will consider the folder where the Quarto file is located as the root directory.

4.2 Getting started with R projects

To start a new project in RStudio:

Press the project menu in the upper right corner, choose “Start a project in a brand new working directory”
In the next menu, select “New Project” and choose a suitable location on your machine for the project to live.
Un-check the option of creating a git repository. We will do this later.
Name the project with an informative name. “Project1” is not good enough, “rproject-tutorial” or “rproject-report-lesson” is better as you will be able to track it down later.

We have now started up a brand new project without version control. The next step is to make sure the settings of the project are up to date with our Global settings in RStudio. By clicking the .Rproj file in our files tab, we will open up a settings window. These are the settings for the project. Under General we see that we can set RStudio to handle the workspace and history as default. This means that our global options will be used. The global options regarding workspace should be to never save workspace, do not restore on start up and do not save history. Why? Because this will ensure that your environment is free from any stored objects that will make reproducing results from scripts difficult.

4.2.1 What folder am I in?

The great advantage of an RStudio Project is that it will make it easier to keep everything contained in our folder. To check what folder we are currently in, type getwd() in the console. R should return the full path to the project folder. If this is the case, success! If not, you have probably not succeeded in opening up a project, or you have told R to set another directory as the working directory.

The working directory is the R term for the root directory. It is possible to set the working directory manually. However, we should aim not to do that! The R command setwd() should be avoided!
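A quick check from the console could look like the sketch below. Looking for an .Rproj file in the working directory is only a convenience suggested here, not something R or RStudio requires.

```r
# Full path to the current working directory.
# In an open RStudio project this should be the project folder.
wd <- getwd()
print(wd)

# List any .Rproj file in the working directory. An empty result suggests
# that the working directory is not the root of an RStudio project.
rproj_files <- list.files(wd, pattern = "\\.Rproj$")
print(rproj_files)
```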

See R for Data Science, chapter 7 for more details on RStudio projects.

4.3 Authoring reports in Quarto

So much fuss just for writing a report? Yes, it is a bit more work to get started. The upside is that this system is easier to navigate with increasing complexity compared to a system where text, figures, tables and software are stored in different locations on your computer and the final report requires copy-paste operations.

We will focus on the most recent format for authoring reports in R, Quarto. In this section we will introduce the basic building blocks of a report and how to put them together.

4.3.1 The Markdown syntax, and friends

The markup language markdown² enables an author like yourself to format your text in a plain text editor. This has the advantage of keeping formatting explicit and available from the keyboard. In a word editor like MS Word, formatting is sometimes not obvious and you need to point and click to make changes. Using markdown in Quarto means that you write in a source document (a .qmd-file) which is rendered into a report. The output will thus be a separate file from the source document.

² Markdown was introduced in 2004 as a syntax to convert plain text to formatted HTML. Markdown is primarily attributed to John Gruber.

The R Markdown style of markdown includes the ability to combine code in code chunks and code embedded in text. This makes it possible to include code output in the final report. Another technical achievement that makes Quarto possible is Pandoc, a general document conversion software. Pandoc can convert files from one format to another, including the operations that we will use, from markdown to HTML, PDF or Word. Both markdown and Pandoc are free and open source software that make life easy for us!

4.3.2 Markdown basics

The idea of using markdown is that everything is formatted using plain text. This requires a little bit of extra syntax. We can use bold or italic, ~~strikethrough~~ and ^superscript^. Lists are also an option as numbered:

1. Item one
2. Item two

And, as unordered

- Item x
- Item y
  - With sub item z

Links can be added like this.

A table can be added also, like this:

Column 1	Column2
Item1	Item 2

The whole section above will look like this in your plain text editor:


The idea of using markdown is that everything is formatted in plain text. 
This requires a little bit of extra syntax. We can use **bold** or *italic*, 
~~strikethrough~~ and ^superscript^. Lists are also an option as numbered:

1. Item one
2. Item two

And, as unordered

* Item x
* Item y
  + With sub item z
  
Links can be added [like this](https://rmarkdown.rstudio.com/authoring_basics.html).

A table can be added also, like this:

|Column 1|Column2|
|---| ---|
|Item1 | Item 2|

4.3.3 Additional formatting

In addition to plain markdown, we can also write HTML or LaTeX in Quarto files.

HTML is convenient when we want to add formatted text beyond the capabilities of markdown, such as color. Some formatting might be considered more easily remembered such as _subscript and ^superscript. Notice that HTML and markdown syntax can be combined:

Some Markdown text with some blue text, ^superscript.

See here for syntax

HTML is convenient when we want to add formatted text 
beyond the capabilities of markdown, such as 
<span style="color:red">color</span>. Some formatting 
might be considered more easily remembered such as 
<sub>subscript</sub> and <sup>superscript</sup>. 

Notice that HTML and markdown syntax can be combined:

Some Markdown text with <span style="color:blue">some *blue* 
  text, <sup><span style="color:red">super</span>**script**</sup></span>.

LaTeX is another plain text formatting system, or markup language, but it is far more complex than markdown. Text formatting using LaTeX is probably not needed for simpler documents as markdown and HTML will be enough. The additional advantage of using LaTeX comes with equations.

Equations can be written inline, such as the standard deviation \(s = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n-1}}\). An equation can also be written and placed in the center of the document, as in Equation 4.1.

\[ F=ma \tag{4.1}\]

We are also able to cross-reference Equation 4.1 for force (\(F\)).

A larger collection of equations is sometimes needed to describe a statistical model, as in Equation 4.2.

\[ \begin{aligned} \begin{split} \text{y}_i &\sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1 \text{x}_i, \end{split} \end{aligned} \tag{4.2}\]

The equation above could look like this in your editor, including the tag ({#eq-model}) used for cross-referencing:

$$
\begin{align}
\text{y}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = \beta_0 + \beta_1 \text{x}_i,
\end{align}
$$ {#eq-model}

See this wikibook on LaTeX for an overview on mathematics in LaTeX.
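To connect the inline standard deviation formula above to R, the sketch below computes it term by term and compares the result with the built-in sd() function. The values in x are made up for illustration.

```r
# Example values, made up for illustration.
x <- c(4, 8, 15, 16, 23, 42)

# Standard deviation computed term by term from the formula:
# squared deviations from the mean, summed, divided by n - 1.
n <- length(x)
s_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Built-in function for comparison.
s_builtin <- sd(x)

print(c(manual = s_manual, builtin = s_builtin))
```

In a Quarto file, a value like this can be embedded in the text with inline code, for example `r round(s_manual, 2)`.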

4.3.4 Code chunks

Using Quarto syntax we can add a code chunk using:

```{r}
#| label: fig-simple-plot
#| message: false
#| echo: true

dat <- data.frame(a = rnorm(10, 10, 10), 
                  b = runif(10, 1, 20))

plot(dat)

```

We recognize the R code inside the code chunk but we have only touched upon code chunk settings. These are settings that tell R (or Quarto) how to handle the R code and output from the code chunk. message: false indicates that any messages from R code should not be displayed in the output document. echo: true indicates that the code in the code chunk should be displayed in the output document. The label is important as it enables cross-referencing the output. If your code chunk outputs a figure, the label must start with the prefix fig- to enable correct cross-referencing of the figure. Likewise, if your code chunk creates a table, the label must start with the prefix tbl-. Possible code chunk settings also include figure and table captions and multiple other settings.³

³ See the Quarto documentation for details. Specifically, see here for execution options for code chunks in Quarto. See also Chapter 29 in R for Data Science for a more extended discussion.
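As a sketch of the caption and label settings mentioned above, the body of a code chunk producing a captioned figure might look like this. The label fig-scatter and the caption text are made up for illustration; fig-cap is the Quarto chunk option for figure captions. In a Quarto file, these lines would sit inside an {r} code chunk like the one shown earlier.

```r
#| label: fig-scatter
#| fig-cap: "Simulated values of b plotted against a."
#| echo: false
#| warning: false

# Simulate a small data set (the seed makes the example reproducible).
set.seed(1)
dat <- data.frame(a = rnorm(10, 10, 10),
                  b = runif(10, 1, 20))

# The plot is the chunk output that receives the caption.
plot(dat$a, dat$b, xlab = "a", ylab = "b")
```

With this label, writing @fig-scatter in the text would insert a numbered reference to the figure.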

⁴ The world of markup languages is confusing, see here for context on YAML.

Settings can also be specified in the YAML field in Quarto files. Stand-alone Quarto source files can include a field written in YAML, a plain-text format for structured data,⁴ that holds settings that should be common to the whole document, such as author and date, or code chunk settings. We might not want to display our code, messages or warnings in the final output. We would specify this in the YAML field as detailed in the example below. Notice also the inclusion of a title and author name as part of the YAML field.

---
title: "A basic quarto report without code"
author: "Name Nameson"
execute:
  echo: false
  message: false
  warning: false
---

4.3.5 Cross-referencing, references and footnotes

We have mentioned cross-referencing above. This basically means referencing specific parts of your document in the text, or in generated content such as a list of figures. A figure might be mentioned in the text, such as Figure 4.1. To insert the cross-reference in text, use the @fig-label syntax, where fig- is the required prefix for figures and label is a user-defined unique identifier. The label should be set in the code chunk that creates the figure, as in #| label: fig-label. The equivalent prefix for tables is tbl-.

Figure 4.1: This is an example of a Figure with a caption.

We might want to cross-reference a section in our document. This is easily done by inserting a tag at the section header, such as {#sec-cross-reference}. The tag can then be referenced in text using @sec-cross-reference, resulting in Section 4.3.5. The sec- part is the required prefix for a section.⁵

⁵ For additional details on cross-referencing, see the quarto documentation on cross-referencing.

Citations are mandatory in academic writing, so be sure to take advantage of the built-in support for citations. When writing in Quarto we can think of a reference as having three parts: the identifier, the reference and the style. We use the identifier when authoring. For example, to cite the R for Data Science book we use the syntax [@r4ds]. The syntax requires that we have linked a bibliography to the document. The bibliography should include the reference, with the same identifier. The bibliography is a collection of reference entries written in BibTeX format (see below). It must be included in the document metadata field (YAML field) using the bibliography key, for example bibliography: resources/references.bib, where the file name is only an example.

@book{r4ds,
  title={R for data science},
  author={Wickham, Hadley and {\c{C}}etinkaya-Rundel, Mine and Grolemund, Garrett},
  year={2023},
  publisher={O'Reilly Media, Inc.}
}

Notice the identifier in the first row of the entry. When we add the citation [@r4ds], it will be rendered as (Wickham, Çetinkaya-Rundel, and Grolemund 2023) in the formatted text, and the full reference will be added to the bottom of the document. If we want another citation style we can specify a file responsible for citation styles. The default is the Chicago Manual of Style. Specifying a citation style file in YAML will change the style, for example csl: my-citation-style.csl tells Quarto to use the file my-citation-style.csl when formatting citations. This file can be edited or copied from a large collection of possible styles located in the citation style language repository. The repository is hosted on GitHub and searchable: click “Go to file” and type “vancouver” to get examples of CSL files that use a Vancouver-type citation style.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media, Inc.

⁶ See here for more on the Visual Editor in RStudio.

Notice also, when authoring Quarto files in RStudio you might consider using the Visual Editor. This editor is a bit more WYSIWYG (what you see is what you get), and it includes shortcuts for adding references from different sources, including online databases and local Zotero libraries.⁶

Footnotes can be handy when writing. In the default mode, these will be included as superscript numbers, like this⁷, numbered by order of appearance.

⁷ This is a footnote.

⁸ See the Quarto documentation on citations and footnotes. See also Chapter 29 in R for Data Science for more details.

The syntax for including footnotes is straightforward. Notice that the text for the footnote is included below the paragraph using the identifier created in the text.⁸

Footnotes can be handy when writing. In the default mode, 
these will be included as superscript numbers, like 
this[^footnote], numbered by order of appearance. 

[^footnote]: This is a footnote.

4.4 Additional files and folder structures in a complete analysis project

As we notice from the above discussion, a report authored in Quarto often requires additional files to render properly. We might have a collection of references, some data sets and possibly some analysis files that are not included in the Quarto file. To keep everything organized I recommend a general folder structure for every analysis project. This structure might change as the project grows. The parts listed below are what I usually end up with as a common set in the majority of projects I work with⁹.

⁹ This organization was initially inspired by Karl Broman’s steps towards reproducible science.
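The folder structure described in the sub-sections below could be set up from R, as in the following sketch. The project name rproject-tutorial is taken from the naming example earlier in this chapter, and the project is placed in a temporary folder only so that the sketch can be run without touching a real project.

```r
# Hypothetical project root, placed in a temporary folder so that the
# sketch can be run without touching any real project.
project_root <- file.path(tempdir(), "rproject-tutorial")

# Sub-folders described in the sub-sections below.
folders <- c("resources",
             file.path("data", "raw-data"),
             file.path("data", "derived-data"),
             "figures",
             "R")

# Create each folder, including parent folders when needed.
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Add an empty README file at the top level.
invisible(file.create(file.path(project_root, "README.md")))

# Show the resulting structure.
print(list.files(project_root, recursive = TRUE, include.dirs = TRUE))
```

In a real project, the same dir.create() calls would be run with paths relative to the project root instead of a temporary folder.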

4.4.1 The readme-file

The README-file can be, or should be, an important file for you. When a project is larger than very tiny, it becomes complex and you should include a README-file to tell others and yourself what the project is about and how it is organized. The inclusion of a README-file is considered standard in data-intensive projects. This is evident as creating a file called README.md in a GitHub folder automatically renders it on the main page of your repository (more about that later). In the README-file you have the opportunity to outline the purpose of your project and explain the organization of your project files.

I find it very helpful to work with the README-file continuously as the project evolves. It helps me remember where the project is going.

A very basic outline of the README-file can be

# My project

Author: 
Date: 

## Project description 
A description of what this project is about, the 
purpose and how to get there. 

## Organization of the repository

Files are organized as...

## Changes and logs
2023-08-15: Added a description of the project...

4.4.2 `/resources`

I usually include a sub-folder called resources. Here I keep CSL-files, the bibliography, any styling or templates used to render the report. Keeping this in a separate folder keeps the top-folder clean.

4.4.3 `/data`

The data folder is an important one. Here I keep all data that exists as e.g., .csv or .xlsx files. If I create data in the project, such as combined data sets that are stored for more convenient use, I keep these in a sub-folder (e.g., data/derived-data/)¹⁰. If there is a lot of raw unprocessed data, these might be stored in data/raw-data/ with specific sub-folders.

¹⁰ Again, an important note from Karl Broman, “Organize your data and code”

4.4.4 `/figures`

If you want to make figures for presentations or submission to a journal or as part of your thesis, you might want to save output as .tiff or .pdf files. When doing this it might be a good idea to structure a figure folder with e.g. figure1.R that renders to e.g. figure1.pdf. If you only include figure output in the Quarto document, the figure folder might contain R scripts that produce the figures. The end results are included in the Quarto document by sourcing the R script. This detour might make it easier to find code for a specific figure once your project is large enough.
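A script such as figure1.R might look like the sketch below. The simulated data stand in for data that would normally be read from the data folder, the figure size is an arbitrary choice, and the sketch assumes that the working directory is the project root.

```r
# Sketch of a hypothetical script, figures/figure1.R.
# Assumption: the working directory is the project root.

# Make sure the output folder exists.
dir.create("figures", showWarnings = FALSE)

# Simulated data, standing in for data read from the data folder.
set.seed(1)
dat <- data.frame(a = rnorm(10, 10, 10),
                  b = runif(10, 1, 20))

# Open a PDF device, draw the figure and close the device to write the file.
pdf(file.path("figures", "figure1.pdf"), width = 5, height = 4)
plot(dat$a, dat$b, xlab = "a", ylab = "b")
invisible(dev.off())
```

Running source("figures/figure1.R") from the project root re-creates figure1.pdf. A script meant to be sourced from a code chunk, so that the figure shows up in the report, would typically leave out the pdf() and dev.off() calls so that the plot becomes chunk output.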

4.4.5 `/R`

R scripts that do not produce figures but contain analyses, data cleaning or the like can be stored in a specific folder. The reason to keep R scripts separate from a Quarto file might be that they are large and produce some output, like a data set, that is later used in the report file. This makes it easier to find and work on specific code without breaking other parts of your project. Actually, it is a good idea to “build” your analysis from smaller parts.
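As an illustration of a script that produces a data set for later use, consider the sketch below. The script name R/clean-data.R, the file name clean-data.rds and the simulated data are hypothetical, and the working directory is assumed to be the project root.

```r
# Sketch of a hypothetical script, R/clean-data.R.
# Assumption: the working directory is the project root.

# Make sure the folder for derived data exists.
dir.create(file.path("data", "derived-data"),
           recursive = TRUE, showWarnings = FALSE)

# Simulated raw data with one missing value, standing in for files
# that would normally be read from data/raw-data/.
raw <- data.frame(id = 1:5,
                  value = c(10.2, NA, 9.8, 11.4, 10.9))

# Data cleaning step: drop rows with missing values.
clean <- raw[!is.na(raw$value), ]

# Store the derived data set for later use in the report.
saveRDS(clean, file.path("data", "derived-data", "clean-data.rds"))

# In the Quarto file, the derived data can then be read back.
dat <- readRDS(file.path("data", "derived-data", "clean-data.rds"))
print(dat)
```

Keeping the cleaning step in its own script means the report only needs the last two lines, which keeps the Quarto file short.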

4.5 Quarto formats

Quarto brings many possibilities for authoring data-driven formats, including but not restricted to websites, books, blogs and presentations. This course website is built using Quarto! A book format can be used to create a thesis with multiple chapters. Advanced customization can be used to follow university guidelines on the style of the book/thesis. Quarto makes it possible to take full control over your writing projects.

4.5.1 Microsoft Word integration in Quarto

Sometimes it is useful to render Quarto documents to a Word file, for example when you want to share a report with fellow students who are not familiar with R. Quarto can be used as a source for Word documents (.docx).

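Settings for the output are specified in the YAML metadata field. Quarto (through Pandoc) writes the .docx file directly, so Microsoft Word is not needed for rendering, but you will need Word or a compatible word processor to open the file and edit its styles. When you want Quarto to create a Word file from your qmd-file you specify it like this: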

---
title: "A title"
author: Daniel Hammarström
date: 2020-09-05
format: docx
---

The format: docx setting tells Quarto to create a Word file. If you are not happy with the style of the Word document (e.g. size and font of text) you can tell Quarto to use a template file. Save a Word file that you have rendered as reference.docx and specify in the YAML field that you will use this as the reference document.

---
title: "A title"
author: Daniel Hammarström
date: 2020-09-05
format:
  docx:
    reference-doc: reference.docx
---

Edit styles (Stiler in Norwegian) used in the reference file (right click on the style and edit). For example, editing the “Title” style (Tittel in Norwegian) will change the main title of the document. After you have edited the document, save it.

When you render the document again, your updated styles will be used in your Word document.¹¹

¹¹ See the Quarto documentation on word integration here.