---
: "Untitled"
title---
Data wrangling and tables, part 2
See part 1 of this workshop here.
A file-type for reports
Tables are generated in R in specific output formats. These formats are commonly html
, pdf
or docx
files. The output format type determines how a table is coded by the table generator. However, common to all outputs is the source file. Reports with tables, figures and text are built from Rmarkdown (Rmd
) or quarto (qmd
) files.
The transition between Rmarkdown-files and quarto is effortless as they use the same syntax but quarto has benefits when it comes to publishing.
Quarto requires the installation of plug-in software. The workshop is possible to follow using a Rmarkdown file also.
We will talk more about publishing reports later. For now, we just need a source file to create our table. Start a new quarto or Rmarkdown file and save it somewhere on your computer. Preferably as part of a PROJECT
Table generators
There are at least 15 commonly used packages in R used for tables.1 This really creates a jungle for the user.
This course have previously been focusing on knitr
and the kable
function as well as flextable
. Both has their benefits, however, both also has drawbacks.
During this workshop we will focus on the gt
package. This package is promising in bridging gaps in previous packages.
The gt
package needs to be installed from CRAN.
A basic table
We will produce a very basic table of group means at baseline from the cyclingstudy
data set.
First we load packages and data
library(gt)
library(exscidata)
library(tidyverse)
data("cyclingstudy")
The next step is to summarize the data of interest. We will use VO2.max
, sj.max
and weight.T1
in our table, averaged over group
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
print()
# A tibble: 3 × 4
group VO2max sj weight
<chr> <dbl> <dbl> <dbl>
1 DECR 4864. 31.9 83.5
2 INCR 4988. 31.5 81.3
3 MIX 4419. 28.8 75.3
The output from a code chunk without a table generator will display what your would typically see in the console. To format the output you will need the table generator.
We will add the generator as the last part of our pipe.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt()
group | VO2max | sj | weight |
---|---|---|---|
DECR | 4864.247 | 31.90143 | 83.52857 |
INCR | 4987.845 | 31.50000 | 81.25714 |
MIX | 4418.663 | 28.78000 | 75.30000 |
If the code is executed without compiling the source-file you will see a table in the Viewer pane in RStudio.
We proceed by formatting numbers. gt
has many formatting functions created to make formatting of cell values easy. The function fmt_numbers()
needs the columns to be formatted, otherwise defaults seems to give reasonable output. Be aware however of the sep_mark = ","
option.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt() %>%
fmt_number(columns = VO2max:weight, sep_mark = "")
group | VO2max | sj | weight |
---|---|---|---|
DECR | 4864.25 | 31.90 | 83.53 |
INCR | 4987.84 | 31.50 | 81.26 |
MIX | 4418.66 | 28.78 | 75.30 |
For our variables, VO2max may not need to have two decimal points. We can remove it from the first formatting and add another one.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt() %>%
fmt_number(columns = sj:weight) %>%
fmt_number(columns = VO2max,
decimals = 0)
group | VO2max | sj | weight |
---|---|---|---|
DECR | 4,864 | 31.90 | 83.53 |
INCR | 4,988 | 31.50 | 81.26 |
MIX | 4,419 | 28.78 | 75.30 |
Column names can be added with the cols_label
function, added as the other as part of the pipe
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt() %>%
fmt_number(columns = sj:weight) %>%
fmt_number(columns = VO2max,
decimals = 0) %>%
cols_label(group = "Group",
VO2max = md("VO<sub>2max</sub>"),
sj = "Squat jump",
weight = "Weight")
Group | VO2max |
Squat jump | Weight |
---|---|---|---|
DECR | 4,864 | 31.90 | 83.53 |
INCR | 4,988 | 31.50 | 81.26 |
MIX | 4,419 | 28.78 | 75.30 |
Notice that I’ve added the md()
function around “VO2max” and added some additional code. This indicates with html
tags that 2max
should be written in subscript.
We can also add units to each column name and print them on a different row as long as we use the md()
function to wrap formatted text.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt() %>%
fmt_number(columns = sj:weight) %>%
fmt_number(columns = VO2max,
decimals = 0) %>%
cols_label(group = "Group",
VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"),
sj = md("Squat jump<br><small>(cm)"),
weight = md("Weight<br><small>(kg)"))
Group | VO2max |
Squat jump |
Weight |
---|---|---|---|
DECR | 4,864 | 31.90 | 83.53 |
INCR | 4,988 | 31.50 | 81.26 |
MIX | 4,419 | 28.78 | 75.30 |
In addition to html tags, we may use plain markdown inside the md()
function. We will talk more about text formatting in coming workshops.
Next we will add a caption. A caption may be added to the gt
function directly.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt(caption = "Group characteristics") %>%
fmt_number(columns = sj:weight) %>%
fmt_number(columns = VO2max,
decimals = 0) %>%
cols_label(group = "Group",
VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"),
sj = md("Squat jump<br><small>(cm)"),
weight = md("Weight<br><small>(kg)"))
Group | VO2max |
Squat jump |
Weight |
---|---|---|---|
DECR | 4,864 | 31.90 | 83.53 |
INCR | 4,988 | 31.50 | 81.26 |
MIX | 4,419 | 28.78 | 75.30 |
An alternative is to add a tab_header
with a title
. However, we can also add a caption as part of the code chunk. This will replace the caption used in gt
. This comes with the additional benefit of being able to label tables. From the quarto documentation we can read that “for tables produced by executable code cells, include a label with a tbl- prefix to make them cross-referenceable”.
The top of the code chunk should therefore look like this
```{r}
#| label: tbl-char
#| tbl-cap: "Group characteristics (code chunk)"
```
The label part must start with tbl-
followed by a unique identifier. This identifier can be used to cross reference the table that will be auto numbered using the syntax @tbl-char
, which will produce a link to the table: Table 1
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt(caption = "Group characteristics (Caption)") %>%
fmt_number(columns = sj:weight) %>%
fmt_number(columns = VO2max,
decimals = 0) %>%
cols_label(group = "Group",
VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"),
sj = md("Squat jump<br><small>(cm)"),
weight = md("Weight<br><small>(kg)")) %>%
tab_header(title = "Group characteristics (Title)")
Group characteristics (Title) | |||
---|---|---|---|
Group | VO2max |
Squat jump |
Weight |
DECR | 4,864 | 31.90 | 83.53 |
INCR | 4,988 | 31.50 | 81.26 |
MIX | 4,419 | 28.78 | 75.30 |
Finally we will add footnotes to the table. This is done using the tab_footnote
function.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt(caption = "Group characteristics (Caption)") %>%
fmt_number(columns = sj:weight) %>%
fmt_number(columns = VO2max,
decimals = 0) %>%
cols_label(group = "Group",
VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"),
sj = md("Squat jump<br><small>(cm)"),
weight = md("Weight<br><small>(kg)")) %>%
tab_header(title = "Group characteristics (Title)") %>%
tab_footnote(footnote = "Values are means")
Group characteristics (Title) | |||
---|---|---|---|
Group | VO2max |
Squat jump |
Weight |
DECR | 4,864 | 31.90 | 83.53 |
INCR | 4,988 | 31.50 | 81.26 |
MIX | 4,419 | 28.78 | 75.30 |
Values are means |
We can specify a location for a reference using the locations argument.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
group_by(group) %>%
summarise(VO2max = mean(VO2.max),
sj = mean(sj.max),
weight = mean(weight.T1)) %>%
gt(caption = "Group characteristics (Caption)") %>%
fmt_number(columns = sj:weight) %>%
fmt_number(columns = VO2max,
decimals = 0) %>%
cols_label(group = "Group",
VO2max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"),
sj = md("Squat jump<br><small>(cm)"),
weight = md("Weight<br><small>(kg)")) %>%
tab_header(title = "Group characteristics (Title)") %>%
tab_footnote(footnote = "Values are means for weight",
locations = cells_column_labels(columns = weight))
Group characteristics (Title) | |||
---|---|---|---|
Group | VO2max |
Squat jump |
Weight |
DECR | 4,864 | 31.90 | 83.53 |
INCR | 4,988 | 31.50 | 81.26 |
MIX | 4,419 | 28.78 | 75.30 |
1 Values are means for weight |
Adding more statistics
We might want to combine multiple statistics in one table, for example the mean and stamdard deviation for each variable. Below I use a different approach to summarise values per variable and group by first adding a pivot_longer
which makes it possible to do this operation without creating multiple new variables.
A mean-and-standard-deviation-variable could be created by combining vectors. The mean and standard deviation is commonly presented as mean (SD). Data from a column of means and a column of SD’s can be combined to create a nice display using the paste0()
function. Example:
data.frame(m = c(46.7, 47.89, 43.5), # A vector of means
s = c(4.21, 4.666, 3.1)) %>% # A vector of SD's
mutate(stat = paste0(round(m, 1),
" (",
round(s, 1),
")")) %>%
print()
However, the gt package has a nice helper function that makes it possible to do this in the gt part of the pipe.
%>%
cyclingstudy filter(timepoint == "pre") %>%
select(subject, group, VO2.max, sj.max, weight.T1) %>%
1pivot_longer(names_to = "variable",
values_to = "value",
cols = VO2.max:weight.T1) %>%
2group_by(group, variable) %>%
summarise(mean = mean(value),
sd = sd(value)) %>%
3pivot_wider(names_from = variable,
values_from = c(mean, sd)) %>%
select(group, mean_VO2.max, sd_VO2.max,
mean_sj.max, sd_sj.max,%>%
mean_weight.T1, sd_weight.T1) 4ungroup() %>%
gt(caption = "Group characteristics (Caption)") %>%
5fmt_number(columns = mean_VO2.max:sd_VO2.max, decimals = 0) %>%
fmt_number(columns = mean_sj.max:sd_weight.T1, decimals = 1) %>%
6cols_merge(columns = c("mean_VO2.max", "sd_VO2.max"),
pattern = "<<{1}>> <<({2})>>") %>%
cols_merge(columns = c("mean_sj.max", "sd_sj.max"),
pattern = "<<{1}>> <<({2})>>") %>%
cols_merge(columns = c("mean_weight.T1", "sd_weight.T1"),
pattern = "<<{1}>> <<({2})>>") %>%
cols_label(group = "Group",
mean_VO2.max = md("VO<sub>2max</sub><br><small>(ml min<sup>-1</sup>)"),
mean_sj.max = md("Squat jump<br><small>(cm)"),
mean_weight.T1 = md("Weight<br><small>(kg)")) %>%
tab_header(title = "Group characteristics (Title)") %>%
tab_footnote(footnote = "Values are means for weight")
- 1
-
Using
pivot_longer
to combine multiple variables in one value/name combination. - 2
- Here we group by variable and group and create two new variables in a summarised data frame. This requires less code than the above example.
- 3
- Making the table “wide” again and using select to sort the columns.
- 4
-
gt
tables are group-sensitive. This means that if we have a grouped data frame/tibble,gt
will use this grouping when it creates the table. - 5
- It is good practice to have the same number of decimals in both the mean and SD.
- 6
-
This is where we combine the columns using
column_merge
, for each new column we need a specifiedcolumn_merge
.
`summarise()` has grouped output by 'group'. You can override using the
`.groups` argument.
Group characteristics (Title) | |||
---|---|---|---|
Group | VO2max |
Squat jump |
Weight |
DECR | 4,864 (541) | 31.9 (2.6) | 83.5 (10.7) |
INCR | 4,988 (488) | 31.5 (2.6) | 81.3 (7.9) |
MIX | 4,419 (253) | 28.8 (4.0) | 75.3 (9.9) |
Values are means for weight |
Exercise/Homework: Recreating Table 1 in (Haun et al. 2018).
Data from (Haun et al. 2018) are part of the exscidata
package as the hypertrophy
data set. Access it by using
library(exscidata); data(hypertrophy)
glimpse(hypertrophy)
Try to find as many variables presented in Table 1 in the original publication as possible in the available data and format your version of the table to the best of your ability!
References
Footnotes
See the list in the R Markdown Cookbook↩︎