Data wrangling and tables, part 2

See part 1 of this workshop here.

A file-type for reports

Tables are generated in R in specific output formats. These formats are commonly html, pdf or docx files. The output format type determines how a table is coded by the table generator. However, common to all outputs is the source file. Reports with tables, figures and text are built from Rmarkdown (Rmd) or quarto (qmd) files.

The transition between Rmarkdown-files and quarto is effortless as they use the same syntax but quarto has benefits when it comes to publishing.

Quarto requires the installation of plug-in software. The workshop is possible to follow using a Rmarkdown file also.

We will talk more about publishing reports later. For now, we just need a source file to create our table. Start a new quarto or Rmarkdown file and save it somewhere on your computer. Preferably as part of a PROJECT

---
title: "Untitled"
---

Table generators

There are at least 15 commonly used packages in R used for tables.1 This really creates a jungle for the user.

This course have previously been focusing on knitr and the kable function as well as flextable. Both has their benefits, however, both also has drawbacks.

During this workshop we will focus on the gt package. This package is promising in bridging gaps in previous packages.

The gt package needs to be installed from CRAN.

A basic table

We will produce a very basic table of group means at baseline from the cyclingstudy data set.

First we load packages and data

library(gt)
library(exscidata)
library(tidyverse)


data("cyclingstudy")

The next step is to summarize the data of interest. We will use VO2.max, sj.max and weight.T1 in our table, averaged over group

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        print()
# A tibble: 3 × 4
  group VO2max    sj weight
  <chr>  <dbl> <dbl>  <dbl>
1 DECR   4864.  31.9   83.5
2 INCR   4988.  31.5   81.3
3 MIX    4419.  28.8   75.3

The output from a code chunk without a table generator will display what your would typically see in the console. To format the output you will need the table generator.

We will add the generator as the last part of our pipe.

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt()
group VO2max sj weight
DECR 4864.247 31.90143 83.52857
INCR 4987.845 31.50000 81.25714
MIX 4418.663 28.78000 75.30000

If the code is executed without compiling the source-file you will see a table in the Viewer pane in RStudio.

We proceed by formatting numbers. gt has many formatting functions created to make formatting of cell values easy. The function fmt_numbers() needs the columns to be formatted, otherwise defaults seems to give reasonable output. Be aware however of the sep_mark = "," option.

Exercises (1)
  • Find documentation for the fmt_number function and find out what how you may format numeric columns, including separation of thousands.
  • Look up the argument drop_trailing_zeros in the help page for fmt_number.
cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt() %>%
        fmt_number(columns = VO2max:weight, sep_mark = "")
group VO2max sj weight
DECR 4864.25 31.90 83.53
INCR 4987.84 31.50 81.26
MIX 4418.66 28.78 75.30

For our variables, VO2max may not need to have two decimal points. We can remove it from the first formatting and add another one.

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt() %>%
        fmt_number(columns = sj:weight) %>%
        fmt_number(columns = VO2max, 
                   decimals = 0) 
group VO2max sj weight
DECR 4,864 31.90 83.53
INCR 4,988 31.50 81.26
MIX 4,419 28.78 75.30

Column names can be added with the cols_label function, added as the other as part of the pipe

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt() %>%
        fmt_number(columns = sj:weight) %>%
        fmt_number(columns = VO2max, 
                   decimals = 0)  %>%
        cols_label(group = "Group", 
                   VO2max = md("VO<sub>2max</sub>"), 
                   sj = "Squat jump", 
                   weight = "Weight")
Group

VO2max

Squat jump Weight
DECR 4,864 31.90 83.53
INCR 4,988 31.50 81.26
MIX 4,419 28.78 75.30

Notice that I’ve added the md() function around “VO2max” and added some additional code. This indicates with html tags that 2max should be written in subscript.

We can also add units to each column name and print them on a different row as long as we use the md() function to wrap formatted text.

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt() %>%
        fmt_number(columns = sj:weight) %>%
        fmt_number(columns = VO2max, 
                   decimals = 0)  %>%
        cols_label(group = "Group", 
                   VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"), 
                   sj = md("Squat jump<br><small>(cm)"), 
                   weight = md("Weight<br><small>(kg)"))
Group

VO2max
(ml min-1)

Squat jump
(cm)

Weight
(kg)

DECR 4,864 31.90 83.53
INCR 4,988 31.50 81.26
MIX 4,419 28.78 75.30

In addition to html tags, we may use plain markdown inside the md() function. We will talk more about text formatting in coming workshops.

Next we will add a caption. A caption may be added to the gt function directly.

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt(caption = "Group characteristics") %>%
        fmt_number(columns = sj:weight) %>%
        fmt_number(columns = VO2max, 
                   decimals = 0)  %>%
        cols_label(group = "Group", 
                   VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"), 
                   sj = md("Squat jump<br><small>(cm)"), 
                   weight = md("Weight<br><small>(kg)"))
Group characteristics
Group

VO2max
(ml min-1)

Squat jump
(cm)

Weight
(kg)

DECR 4,864 31.90 83.53
INCR 4,988 31.50 81.26
MIX 4,419 28.78 75.30

An alternative is to add a tab_header with a title. However, we can also add a caption as part of the code chunk. This will replace the caption used in gt. This comes with the additional benefit of being able to label tables. From the quarto documentation we can read that “for tables produced by executable code cells, include a label with a tbl- prefix to make them cross-referenceable”.

The top of the code chunk should therefore look like this

```{r}
#| label: tbl-char
#| tbl-cap: "Group characteristics (code chunk)"
```

The label part must start with tbl- followed by a unique identifier. This identifier can be used to cross reference the table that will be auto numbered using the syntax @tbl-char, which will produce a link to the table: Table 1

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt(caption = "Group characteristics (Caption)") %>%
        fmt_number(columns = sj:weight) %>%
        fmt_number(columns = VO2max, 
                   decimals = 0)  %>%
        cols_label(group = "Group", 
                   VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"), 
                   sj = md("Squat jump<br><small>(cm)"), 
                   weight = md("Weight<br><small>(kg)")) %>%
        tab_header(title = "Group characteristics (Title)")
Table 1: Group characteristics (code chunk)
Group characteristics (Caption)
Group characteristics (Title)
Group

VO2max
(ml min-1)

Squat jump
(cm)

Weight
(kg)

DECR 4,864 31.90 83.53
INCR 4,988 31.50 81.26
MIX 4,419 28.78 75.30

Finally we will add footnotes to the table. This is done using the tab_footnote function.

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt(caption = "Group characteristics (Caption)") %>%
        fmt_number(columns = sj:weight) %>%
        fmt_number(columns = VO2max, 
                   decimals = 0)  %>%
        cols_label(group = "Group", 
                   VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"), 
                   sj = md("Squat jump<br><small>(cm)"), 
                   weight = md("Weight<br><small>(kg)")) %>%
        tab_header(title = "Group characteristics (Title)") %>%
        tab_footnote(footnote = "Values are means")
Table 2: Group characteristics (code chunk)
Group characteristics (Caption)
Group characteristics (Title)
Group

VO2max
(ml min-1)

Squat jump
(cm)

Weight
(kg)

DECR 4,864 31.90 83.53
INCR 4,988 31.50 81.26
MIX 4,419 28.78 75.30
Values are means

We can specify a location for a reference using the locations argument.

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
        group_by(group) %>%
        summarise(VO2max = mean(VO2.max), 
                  sj = mean(sj.max),
                  weight = mean(weight.T1)) %>%
        gt(caption = "Group characteristics (Caption)") %>%
        fmt_number(columns = sj:weight) %>%
        fmt_number(columns = VO2max, 
                   decimals = 0)  %>%
        cols_label(group = "Group", 
                   VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"), 
                   sj = md("Squat jump<br><small>(cm)"), 
                   weight = md("Weight<br><small>(kg)")) %>%
        tab_header(title = "Group characteristics (Title)") %>%
        tab_footnote(footnote = "Values are means for weight", 
        locations = cells_column_labels(columns = weight))
Table 3: Group characteristics (code chunk)
Group characteristics (Caption)
Group characteristics (Title)
Group

VO2max
(ml min-1)

Squat jump
(cm)

Weight
(kg)

1
DECR 4,864 31.90 83.53
INCR 4,988 31.50 81.26
MIX 4,419 28.78 75.30
1 Values are means for weight

Adding more statistics

We might want to combine multiple statistics in one table, for example the mean and stamdard deviation for each variable. Below I use a different approach to summarise values per variable and group by first adding a pivot_longer which makes it possible to do this operation without creating multiple new variables.

A mean-and-standard-deviation-variable could be created by combining vectors. The mean and standard deviation is commonly presented as mean (SD). Data from a column of means and a column of SD’s can be combined to create a nice display using the paste0() function. Example:

data.frame(m = c(46.7, 47.89, 43.5),  # A vector of means
           s = c(4.21, 4.666, 3.1)) %>% # A vector of SD's
        mutate(stat = paste0(round(m, 1), 
                             " (",
                             round(s, 1), 
                             ")")) %>%
        print()

However, the gt package has a nice helper function that makes it possible to do this in the gt part of the pipe.

cyclingstudy %>%
        filter(timepoint == "pre") %>%
        select(subject, group, VO2.max, sj.max, weight.T1) %>%
1        pivot_longer(names_to = "variable",
                     values_to = "value",
                     cols = VO2.max:weight.T1) %>%
2        group_by(group, variable) %>%
        summarise(mean = mean(value),
                  sd = sd(value)) %>%
3        pivot_wider(names_from = variable,
                        values_from = c(mean, sd)) %>%
        select(group, mean_VO2.max, sd_VO2.max,
                        mean_sj.max, sd_sj.max,
                        mean_weight.T1, sd_weight.T1) %>%
4        ungroup() %>%
        
        gt(caption = "Group characteristics (Caption)") %>%
        
5        fmt_number(columns = mean_VO2.max:sd_VO2.max, decimals = 0) %>%
        fmt_number(columns = mean_sj.max:sd_weight.T1, decimals = 1) %>%
         
6        cols_merge(columns = c("mean_VO2.max", "sd_VO2.max"),
                   pattern = "<<{1}>> <<({2})>>") %>%
        
        cols_merge(columns = c("mean_sj.max", "sd_sj.max"),
                   pattern = "<<{1}>> <<({2})>>") %>%
        
        cols_merge(columns = c("mean_weight.T1", "sd_weight.T1"),
                   pattern = "<<{1}>> <<({2})>>") %>%
                   

        cols_label(group = "Group", 
                   mean_VO2.max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"), 
                   mean_sj.max = md("Squat jump<br><small>(cm)"), 
                   mean_weight.T1 = md("Weight<br><small>(kg)")) %>%
        tab_header(title = "Group characteristics (Title)") %>%
        tab_footnote(footnote = "Values are means for weight")
1
Using pivot_longer to combine multiple variables in one value/name combination.
2
Here we group by variable and group and create two new variables in a summarised data frame. This requires less code than the above example.
3
Making the table “wide” again and using select to sort the columns.
4
gt tables are group-sensitive. This means that if we have a grouped data frame/tibble, gt will use this grouping when it creates the table.
5
It is good practice to have the same number of decimals in both the mean and SD.
6
This is where we combine the columns using column_merge, for each new column we need a specified column_merge.
`summarise()` has grouped output by 'group'. You can override using the
`.groups` argument.
Table 4: Group characteristics (code chunk)
Group characteristics (Caption)
Group characteristics (Title)
Group

VO2max
(ml min-1)

Squat jump
(cm)

Weight
(kg)

DECR 4,864 (541) 31.9 (2.6) 83.5 (10.7)
INCR 4,988 (488) 31.5 (2.6) 81.3 (7.9)
MIX 4,419 (253) 28.8 (4.0) 75.3 (9.9)
Values are means for weight

Exercise/Homework: Recreating Table 1 in (Haun et al. 2018).

Data from (Haun et al. 2018) are part of the exscidata package as the hypertrophy data set. Access it by using

library(exscidata); data(hypertrophy)

glimpse(hypertrophy)

Try to find as many variables presented in Table 1 in the original publication as possible in the available data and format your version of the table to the best of your ability!

References

Haun, C. T., C. G. Vann, C. B. Mobley, P. A. Roberson, S. C. Osburn, H. M. Holmes, P. M. Mumford, et al. 2018. “Effects of Graded Whey Supplementation During Extreme-Volume Resistance Training.” Front Nutr 5: 84. https://doi.org/10.3389/fnut.2018.00084.

Footnotes

  1. See the list in the R Markdown Cookbook↩︎