library(tidyverse) # for data wrangling
library(gt) # for creating tables
library(exscidata) # the dxadata
6 Wrangling data to create your first table
6.1 Introduction
We can use tables to communicate a lot of information in a compact form while maintaining precision. This advantage is why creating tables is essential for effectively communicating data. We can easily create tables programmatically as part of an R markdown or quarto document. R has many “table generator” packages that translate your draft table to an output format of your choice. A great format to start authoring an analysis in is HTML; however, most “table generators” need to know your output format to be properly formatted and work in the output format. Below we will introduce a new package for the purpose of creating tables. This package, gt
has the advantage that it does not require the user to change code when we switch to another output format. The gt
package can create tables in HTML, PDF and word format.
Since we are concerned with reproducibility, we would like to avoid copy-and-paste operations. The strength of writing reports in R markdown or quarto is the ability to combine data, code, and text to produce a formatted output programmatically. Therefore, We will choose a table generator that allows for consistently selecting multiple formats. One such table generator is part of the gt
package.
As mentioned previously, authoring in R markdown and quarto makes little difference. However, we will now focus on the more modern file format, quarto, knowing that examples and tutorials written in R markdown will translate to quarto with few problems.
The basic workflow of creating a table in R markdown or quarto is to transform the data into a nice format and then get this underlying data into the table generator. The table generator is written in a code chunk, and upon rendering the source file, the table generator will create, for example, HTML output. In this chapter, we will introduce some data-wrangling tools since the table we will produce consists of summarized data. The functions we introduce are found in the packages dplyr
and tidyr
. These packages are loaded as part of the tidyverse
package.
6.1.1 Resources
All tidyverse
packages are well documented and generally well represented in help forums. Google is your friend when looking for help.
The gt
package is now a mature package for generating tables in R. This chapter is written on the basis of this package. If you are looking for alternatives, the kable
function from the knitr
package is described in a newly developed book, available online called the R Markdown Cookbook. The package, kableExtra
comes with excellent vignettes for both html and pdf outputs. kableExtra
provides extra functions to customize your basic knitr
table. Note that kabale and kableExtra will only produce output in HTML and pdf-formats. Another package the can create tables in HTML, pdf, and word formats is the flextable
package
6.2 Making “Table 1”
The first table in many reports in sport and exercise studies is the “Participant characteristics” table. This first table summarizes background information on the participants. We will try to create this table based on data from (Hammarström et al. 2020). These data can be found in the exscidata
package. To load the data and other required packages run the following code.
The end result of this exercise can be found below in Table 6.1. This summary table contains the average and standard deviation per group for the variables age, body mass and stature (height) and body fat as a percentage of the body mass. This table is a reproduction of the first part of Table 1 from (Hammarström et al. 2020).
Female | Male | |||
---|---|---|---|---|
Included | Excluded | Included | Excluded | |
N | 18 | 4 | 16 | 3 |
Age (years) | 22 (1.3) | 22.9 (1.6) | 23.6 (4.1) | 24.3 (1.5) |
Mass (kg) | 64.4 (10) | 64.6 (9.7) | 75.8 (11) | 88.2 (22) |
Stature (cm) | 168 (6.9) | 166 (7.6) | 183 (5.9) | 189 (4.6) |
Body fat (%) | 34.1 (5.6) | 28.8 (8.7) | 20.4 (6) | 24.3 (15) |
Values are mean and (SD) |
We have to make several operations to re-create this table. First we can select the columns we want to work with further from the data set that also contains a lot of other variables. Let us start by looking at the full data set. Below we use the function glmipse
from the dplyr
package (which is loaded with tidyverse
).
data("dxadata")
glimpse(dxadata)
Rows: 80
Columns: 59
$ participant <chr> "FP28", "FP40", "FP21", "FP34", "FP23", "FP26", "FP36…
$ time <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre…
$ multiple <chr> "L", "R", "R", "R", "R", "R", "L", "R", "R", "L", "L"…
$ single <chr> "R", "L", "L", "L", "L", "L", "R", "L", "L", "R", "R"…
$ sex <chr> "female", "female", "male", "female", "male", "female…
$ include <chr> "incl", "incl", "incl", "incl", "incl", "excl", "incl…
$ age <dbl> 24.5, 22.1, 26.8, 23.1, 24.8, 24.2, 20.5, 20.6, 37.4,…
$ height <dbl> 170.0, 175.0, 184.0, 164.0, 176.5, 163.0, 158.0, 181.…
$ weight <dbl> 66.5, 64.0, 85.0, 53.0, 68.5, 56.0, 60.5, 83.5, 65.0,…
$ BMD.head <dbl> 2.477, 1.916, 2.306, 2.163, 2.108, 2.866, 1.849, 2.21…
$ BMD.arms <dbl> 0.952, 0.815, 0.980, 0.876, 0.917, 0.973, 0.871, 0.91…
$ BMD.legs <dbl> 1.430, 1.218, 1.598, 1.256, 1.402, 1.488, 1.372, 1.42…
$ BMD.body <dbl> 1.044, 0.860, 1.060, 0.842, 0.925, 0.984, 0.923, 1.01…
$ BMD.ribs <dbl> 0.770, 0.630, 0.765, 0.636, 0.721, 0.737, 0.648, 0.70…
$ BMD.pelvis <dbl> 1.252, 1.078, 1.314, 1.044, 1.154, 1.221, 1.194, 1.32…
$ BMD.spine <dbl> 1.316, 0.979, 1.293, 0.899, 1.047, 1.089, 1.006, 1.14…
$ BMD.whole <dbl> 1.268, 1.082, 1.325, 1.119, 1.181, 1.350, 1.166, 1.24…
$ fat.left_arm <dbl> 1168, 715, 871, 610, 788, 372, 932, 1312, 388, 668, 5…
$ fat.left_leg <dbl> 4469, 4696, 3467, 3023, 3088, 2100, 4674, 5435, 1873,…
$ fat.left_body <dbl> 6280, 4061, 7740, 3638, 6018, 2328, 4896, 9352, 2921,…
$ fat.left_whole <dbl> 12365, 9846, 12518, 7565, 10259, 5048, 10736, 16499, …
$ fat.right_arm <dbl> 1205, 769, 871, 610, 741, 374, 940, 1292, 413, 716, 5…
$ fat.right_leg <dbl> 4497, 4900, 3444, 3017, 3254, 2082, 4756, 5455, 1782,…
$ fat.right_body <dbl> 6082, 3923, 8172, 3602, 5699, 2144, 4705, 8674, 2640,…
$ fat.right_whole <dbl> 12102, 9862, 12856, 7479, 10020, 4821, 10806, 15876, …
$ fat.arms <dbl> 2373, 1484, 1742, 1220, 1529, 747, 1872, 2604, 802, 1…
$ fat.legs <dbl> 8965, 9596, 6911, 6040, 6342, 4182, 9430, 10890, 3655…
$ fat.body <dbl> 12362, 7984, 15912, 7239, 11717, 4472, 9601, 18026, 5…
$ fat.android <dbl> 1880, 963, 2460, 1203, 1933, 527, 1663, 3183, 1240, 1…
$ fat.gynoid <dbl> 5064, 5032, 4779, 3739, 4087, 2740, 5217, 6278, 2309,…
$ fat.whole <dbl> 24467, 19708, 25374, 15044, 20278, 9869, 21542, 32375…
$ lean.left_arm <dbl> 1987, 1931, 2884, 1753, 2652, 2425, 1913, 2266, 3066,…
$ lean.left_leg <dbl> 7059, 7190, 10281, 6014, 8242, 7903, 6829, 8889, 9664…
$ lean.left_body <dbl> 9516, 10693, 13847, 9736, 11387, 10573, 8954, 11482, …
$ lean.left_whole <dbl> 20305, 21778, 29332, 19143, 24185, 22946, 18809, 2431…
$ lean.right_arm <dbl> 2049, 2081, 2888, 1754, 2487, 2439, 1930, 2236, 3253,…
$ lean.right_leg <dbl> 7104, 7506, 10200, 6009, 8685, 7841, 6950, 8923, 9198…
$ lean.right_body <dbl> 9199, 10304, 14593, 9636, 10779, 9733, 8602, 10672, 1…
$ lean.right_whole <dbl> 19605, 21310, 29643, 18792, 23653, 21837, 19407, 2372…
$ lean.arms <dbl> 4036, 4012, 5773, 3508, 5139, 4864, 3843, 4501, 6319,…
$ lean.legs <dbl> 14163, 14696, 20482, 12023, 16928, 15744, 13779, 1781…
$ lean.body <dbl> 18715, 20998, 28440, 19372, 22166, 20306, 17556, 2215…
$ lean.android <dbl> 2669, 2782, 3810, 2455, 2904, 2656, 2297, 3094, 3344,…
$ lean.gynoid <dbl> 6219, 7209, 10233, 5866, 7525, 5970, 5825, 8175, 7760…
$ lean.whole <dbl> 39910, 43088, 58976, 37934, 47837, 44783, 38216, 4804…
$ BMC.left_arm <dbl> 181, 138, 204, 144, 180, 173, 140, 173, 220, 226, 225…
$ BMC.left_leg <dbl> 567, 508, 728, 441, 562, 574, 482, 631, 633, 630, 672…
$ BMC.left_body <dbl> 622, 414, 696, 367, 526, 465, 370, 629, 473, 629, 509…
$ BMC.left_whole <dbl> 1680, 1321, 1945, 1201, 1527, 1580, 1131, 1688, 1544,…
$ BMC.right_arm <dbl> 198, 150, 210, 142, 176, 183, 140, 176, 224, 251, 226…
$ BMC.right_leg <dbl> 574, 514, 739, 431, 552, 565, 491, 641, 622, 636, 690…
$ BMC.right_body <dbl> 592, 428, 730, 351, 502, 409, 358, 582, 420, 616, 483…
$ BMC.right_whole <dbl> 1582, 1288, 1958, 1130, 1451, 1466, 1229, 1668, 1478,…
$ BMC.arms <dbl> 379, 288, 414, 285, 356, 357, 280, 348, 444, 478, 451…
$ BMC.legs <dbl> 1142, 1022, 1467, 872, 1115, 1139, 974, 1272, 1255, 1…
$ BMC.body <dbl> 1214, 842, 1426, 718, 1028, 874, 728, 1211, 893, 1245…
$ BMC.android <dbl> 80, 57, 90, 44, 56, 54, 43, 77, 52, 72, 59, 60, 65, 5…
$ BMC.gynoid <dbl> 314, 285, 427, 245, 299, 262, 241, 379, 335, 378, 332…
$ BMC.whole <dbl> 3261, 2609, 3903, 2331, 2978, 3046, 2360, 3356, 3022,…
We can see that we got 80 rows and 59 columns in the data set. The columns of interest to us are:
- participant
- time
- sex
- include
- age
- height
- weight
- fat.whole
For a full description of the data set, you can type ?dxadata
in your console. The participant
column is good to have to keep track of the data set in the beginning. time
is needed to remove some observation that are not needed. This pre-training table only sums up information from before the intervention starts. sex
is a grouping variable together with include
, Table 1 in (Hammarström et al. 2020) uses Sex and Inclusion in data analysis as grouping for descriptive data. The other variables are used to describe the data sample.
6.2.1 The pipe operator and select
As mentioned above, we will start by selecting the variables we want to work further with. Using the select
function from dplyr
we can select columns that we need. In the code below I will use select as part of a “pipe”. Think of the pipe as doing operations in sequel. Each time you use the pipe operator (%>%
) you say “then do”. The code below translates to:
- Take the data set
dxadata
, then do select()
the following variables, then doprint()
print
, is a function that outputs the results of the operations. In each new function of a pipe, the data that we take with us from the above line ends up as the first argument. A representation of this behavior can be expressed as:
DATA %>% FUNCTION(DATA, ARGUMENTS) %>% FUNCTION(DATA, ARGUMENTS) %>% FUNCTION(DATA)
We do not need to type the data part, instead the pipe operator (%>%
) gathers the data from each step and puts it in the subsequent function.
Copy the code below to your own quarto document and run it. When using quarto you might want to set “Chunk output in console” in the settings menu. In my experience, this makes developing code a bit faster.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
print() # print the output
Notice that I have added short comments after each line to make it clear what I want to accomplish. We will build further on the above code, and this is a common workflow. Using pipes, it is easy to extend the code by adding lines doing certain operations, one at the time. Notice also that the select function uses a list of variable names with include:weight
being short for “take all variables from include to weight”.
6.2.2 Filter observations
The next step will be to filter observations. We need to remove the observations that comes from the post-intervention tests. The time
variable contains to values pre
and post
to remove post-values we need to tell R to remove all observations (rows) containing post
. We will use the filter
function from dplyr. This will be our first experience with logical operators. Let’s try out two alternatives, copy the code to your console to see the results.
## Alternative 1: ##
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter away all observation with "post"
filter(time != "post") %>%
print() # print the output
## Alternative 2: ##
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
print() # print the output
The above code should give the same output. The operator !=
says “not equal to”, the operator ==
says “equal to”. Notice that R uses two equal signs to say equal to. A single equal sign is used as an assignment operator in R.
6.2.3 Create or change variables
The next problem for us is that we need to manipulate or combine information from two variables in order to calculate body fat percentage. The formula that we will use is simply expressing body fat as a percentage of the body weight.
\[\text{Body fat (\%)} = \frac{\text{Body fat (g)}/1000}{\text{Body weight (kg)}} \times 100\] By using the mutate function we can add or manipulate existing variables in a pipe. Mutate takes as arguments a list of new variables:
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
print() # print the output
In the code above, we overwrite the variable fat.whole
with the re-calculated variable.
6.2.4 Grouped operations and summary statistics
In a pipe, we can group the data set giving us opportunities to calculate summary statistics over one or several grouping variables. In Table 1 in (Hammarström et al. 2020), include
and sex
are the two grouping variables. Using the group_by()
function from dplyr
sets the grouping of the data frame. If we use functions that summarizes data, such summaries will be per group. In Table 1 in (Hammarström et al. 2020) the number of participants in each group are specified. We can use the function n()
to calculate the number of observations per group in a mutate
call.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
print() # print the output
The new variable n
now contains the number of observations in each group. For now we can regard this as a new variable. Each participant belongs to a specified group, and this specific group has n
number of members.
We can now go further and use the summarise
function. Instead of adding variables to the existing data set, summarize reduces the data set using some summarizing function, such as mean()
or sd()
. These summary statistics are what we are looking for in our data set. Example of other summarizing functions for descriptive data are min()
and max()
for the minimum and maximum.
We can use the summarise()
function to calculate the mean and standard deviation for the weight variable.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Summarise weight
summarise(weight.m = mean(weight),
weight.s = sd(weight)) %>%
print() # print the output
Try out the code in your own quarto document. The above example gives us what we want, however, it means that we need to type a lot. Instead of needing to make a summary for each variable, we can combine the variables in a long format. To get to the long format we will use the pivot_longer()
function. This function gathers several variables into two columns, one with the variables names as values and a second column with each value from the original variables. In our case we want to gather the variables age
, height
, weight
, fat.whole
and n
. I will call the new variables that we create variable
and value
.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
print()
The cols = age:n
part of pivot_longer specifies what columns to gather. The data set is still grouped by include
and sex
. We may now proceed by summarizing over these groups, however, we need to add another group to specify that we want different values per variable
. To do this we re-specify the grouping. After this we add the summarise
function.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
print()
If you run the above code you will notice that the the standard deviation of each variable is larger than zero except for n
which has no variability. This is because we created it per group and simply calculated it as the sum of observations.
Take a look at Table 1 in (Hammarström et al. 2020). The format of the descriptive statistics are mean (SD), this is a preferred way of reporting these statistics. In order to achieve this we need to “manually” convert the numbers. In the example below, I will start by making a new variable by simply pasting the numbers together. I will also add the parentheses.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
# Add descriptive statistics together for nice formatting
mutate(ms = paste0(m, " (", s, ")"))
print()
In mutate(ms = paste0(m, " (", s, ")"))
, the paste0
function simply glues components together to form a string of text. First, the vector of means are being used, then we add a parenthesis, followed by the SD and finally a parenthesis.
If you run the above code you will notice that you end up with numbers looking like this:
167.666666666667 (6.86851298231541)
This is neither good or good looking. We have to take care of the decimal places. There are a number of ways to do this but in this case the function signif
seems to make the situation better. signif
rounds to significant digits. This means that we will get different rounding depending on the “size” of the value. I find signif(m, 3)
to be a good starting point.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
# Add descriptive statistics together for nice formatting
mutate(ms = paste0(signif(m, 3), # Use signif to round to significant numbers
" (",
signif(s, 3),
")")) %>%
print()
Things are starting to look good. Run the code, what do you think. A problem with the above is that we do not want any variability in the n
variable. So if the variable is n
we do not want that kind of formatting. It is time to add a conditional statement. In dplyr
there are easy-to-use if/else functions. The function if_else
sets a condition, if this is met then we can decide what to do, and likewise decide what to do if it is not met.
This looks something like this inside a dplyr
pipe:
%>%
... if_else(IF_THIS_IS_TRUE, THE_DO_THIS, OTHERWISE_DO_THIS) %>%
print()
If variable
is n
, then we only want to display m
otherwise we want the full code as described above: paste0(signif(m, 3), " (", signif(s, 3), ")")
. We add this to the code:
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
# Add descriptive statistics together for nice formatting
mutate(ms = if_else(variable == "n", # If the variable is n
as.character(m), # the only display the mean, otherwise:
paste0(signif(m, 3), # Use signif to round to significant numbers
" (",
signif(s, 3),
")"))) %>%
print()
The as.character
part is needed because the output of if_else
must be the same regardless of what the outcome of the test is.
We are getting close to something!
The next step is to remove variables that we do not longer need. The select
function will help us with that. we can remove m
and s
by select(-m, -s)
, the minus sign tells R to remove them from the list of variables in the data set. We can then combine the grouping variables into a include_sex
variable. Similarly to what we did above, we can simply paste them together. Now we will use the paste
(function instead of paste0
). In paste
we specify a separator, maybe _
is a nice alternative. Selecting away the individual variables from the new combined one leaves us with this code and data set.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
# Add descriptive statistics together for nice formatting
mutate(ms = if_else(variable == "n", # If the variable is n
as.character(m), # the only display the mean, otherwise:
paste0(signif(m, 3), # Use signif to round to significant numbers
" (",
signif(s, 3),
")")),
# Doing a new grouping variable
include_sex = paste(include, sex, sep = "_")) %>%
# removing unnecessary variables after ungrouping
ungroup() %>%
select(-sex, -include, -m, -s) %>%
print()
If ungroup
is not used, we cannot select away variables since they are used to group the data set. We will now perform the last operations before we can make it a table. To make it formatted as in Table 1 in (Hammarström et al. 2020), we can make the present data set wider. Each group as its own column in addition to the variable name column. We will use the opposite function to pivot_longer
, namely pivot_wider
1. pivot_wider
takes a variable or “key” column and a “values” column and divide the values based on the “key”.
1 All this talk about pivot, take a break and watch this clip from the hit series “Friends”, its about “pivot”!
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
# Add descriptive statistics together for nice formatting
mutate(ms = if_else(variable == "n", # If the variable is n
as.character(m), # the only display the mean, otherwise:
paste0(signif(m, 3), # Use signif to round to significant numbers
" (",
signif(s, 3),
")")),
# Doing a new grouping variable
include_sex = paste(include, sex, sep = "_")) %>%
# removing unnecessary variables after ungrouping
ungroup() %>%
select(-sex, -include, -m, -s) %>%
# pivot wider to match the desired data
pivot_wider(names_from = include_sex,
values_from = ms) %>%
print()
A final step is to format the variable
variable(!). The easiest is to make it a factor variable with specified levels and names. In the factor function we use levels = c("n", "age", "weight", "height", "fat.whole")
to specify the order of values contained in the variable. Using labels = c("N", "Age (years)", "Mass (kg)", "Stature (cm)", "Body fat (%)"
, we set corresponding labels on each level. After we have added this information to the variable we can use arrange
to sort the data set. arrange
will sort the data set based on the order we have given to the variable. select
will help you sort the columns to match what we want.
#| eval: false
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
# Add descriptive statistics together for nice formatting
mutate(ms = if_else(variable == "n", # If the variable is n
as.character(m), # the only display the mean, otherwise:
paste0(signif(m, 3), # Use signif to round to significant numbers
" (",
signif(s, 3),
")")),
# Doing a new grouping variable
include_sex = paste(include, sex, sep = "_")) %>%
# removing unnecessary variables after ungrouping
ungroup() %>%
select(-sex, -include, -m, -s) %>%
# pivot wider to match the desired data
pivot_wider(names_from = include_sex,
values_from = ms) %>%
mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"),
labels = c("N", "Age (years)", "Mass (kg)",
"Stature (cm)", "Body fat (%)"))) %>%
arrange(variable) %>%
print()
`summarise()` has grouped output by 'include', 'sex'. You can override using
the `.groups` argument.
# A tibble: 5 × 5
variable excl_female excl_male incl_female incl_male
<fct> <chr> <chr> <chr> <chr>
1 N 4 3 18 16
2 Age (years) 22.9 (1.57) 24.3 (1.46) 22 (1.25) 23.6 (4.11)
3 Mass (kg) 64.6 (9.71) 88.2 (22.4) 64.4 (10.4) 75.8 (10.7)
4 Stature (cm) 166 (7.59) 189 (4.58) 168 (6.87) 183 (5.88)
5 Body fat (%) 28.8 (8.69) 24.3 (15.3) 34.1 (5.64) 20.4 (5.99)
6.2.5 Starting the table generator - The gt()
function.
The next step is to “pipe” everything into the table generator.
%>% # take the dxadata data set
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
# select participant, time, sex, include to height and fat.whole
# Filter to keep all observations with pre
filter(time == "pre") %>%
# Calculate body fat
# fat.whole in grams, needs to be divided by 1000 to express as kg
# Multiply by 100 to get percentage
mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
# Group the data frame and add a variable specifying the number of observations per group
group_by(include, sex) %>%
mutate(n = n()) %>%
# Collect all variables for convenient summarizing
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
# Create a new grouping, adding variable
group_by(include, sex, variable) %>%
# Summarize in two new variables m for mean and s for SD
summarise(m = mean(value),
s = sd(value)) %>%
# Add descriptive statistics together for nice formatting
mutate(ms = if_else(variable == "n", # If the variable is n
as.character(m), # the only display the mean, otherwise:
paste0(signif(m, 3), # Use signif to round to significant numbers
" (",
signif(s, 3),
")")),
# Doing a new grouping variable
include_sex = paste(include, sex, sep = "_")) %>%
# removing unnecessary variables after ungrouping
ungroup() %>%
select(-sex, -include, -m, -s) %>%
# pivot wider to match the desired data
pivot_wider(names_from = include_sex,
values_from = ms) %>%
mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"),
labels = c("N", "Age (years)", "Mass (kg)",
"Stature (cm)", "Body fat (%)"))) %>%
arrange(variable) %>%
# Piping into the table generator (gt)
gt()
As per our strategy to first summarize and set up the data table, we already have nice first draft. However, we need to format variable names and add column labels. We can also add a footnote.
The gt
package has several functions for manipulating the raw tables created with the gt()
function. The gt
package also use a consistent vocabulary for tables as seen in [this figure]((https://gt.rstudio.com/reference/figures/gt_parts_of_a_table.svg).
First, using tab_footnote()
we can add the footnote indicating that “Values are mean and (SD)”. We do this by piping the whole table, created with gt()
into tab_footnote(footnote = "Values are mean and (SD)")
. We have two columns representing females and two representing males, these can be more clearly separated by adding a spanner column label. This column label adds rows to the table. Using tab_spanner(label = "Female", columns = c("female_incl", "female_excl"))
we add “Female” above the two columns representing females. We can do the same for males. Using cols_label()
we specify new column names to match what we want. The resulting full code can be seen below.
%>%
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
mutate(fat.whole = ((fat.whole / 1000) / weight) * 100) %>%
filter(time == "pre") %>%
group_by(sex, include) %>%
mutate(n = n()) %>%
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
group_by(sex, include, variable) %>%
summarise(m = mean(value),
s = sd(value)) %>%
ungroup() %>%
mutate(m = signif(m, digits = 3),
s = signif(s, digits = 2),
ms = if_else(variable == "n", as.character(m), paste0(m, " (", s, ")")),
sex_incl = paste(sex, include, sep = "_")) %>%
::select(-m, -s, - sex, -include) %>%
dplyr
pivot_wider(names_from = sex_incl,
values_from = ms) %>%
select(variable, female_incl, female_excl, male_incl, male_excl) %>%
mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"),
labels = c("N", "Age (years)", "Mass (kg)",
"Stature (cm)", "Body fat (%)"))) %>%
arrange(variable) %>%
gt() %>%
tab_footnote(footnote = "Values are mean and (SD)") %>%
tab_spanner(label = "Female", columns = c("female_incl", "female_excl")) %>%
tab_spanner(label = "Male", columns = c("male_incl", "male_excl")) %>%
cols_label(variable = " ",
female_incl = "Included",
female_excl = "Excluded",
male_incl = "Included",
male_excl = "Excluded")
6.2.6 Working with tables in quarto
If we where to use this table in a report created with quarto we would like to be able to cross-reference it. This will work if we add information to the code chunk where the table is created. More specifically we need to set a table label and a table caption. Quarto has built in support for cross referencing figures and tables. Adding chunk options that sets a label and table caption will make it possible to reference the table. Note that the label must start with “tbl-” to make quarto identify it as a table.
```{r}
#| label: tbl-participant-characteristics
#| tbl-cap: "Participant characteristics"
dxadata %>%
select(participant, time, sex, include:weight, fat.whole) %>%
mutate(fat.whole = ((fat.whole / 1000) / weight) * 100) %>%
filter(time == "pre") %>%
group_by(sex, include) %>%
mutate(n = n()) %>%
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
group_by(sex, include, variable) %>%
summarise(m = mean(value),
s = sd(value)) %>%
ungroup() %>%
mutate(m = signif(m, digits = 3),
s = signif(s, digits = 2),
ms = if_else(variable == "n", as.character(m), paste0(m, " (", s, ")")),
sex_incl = paste(sex, include, sep = "_")) %>%
dplyr::select(-m, -s, - sex, -include) %>%
pivot_wider(names_from = sex_incl,
values_from = ms) %>%
select(variable, female_incl, female_excl, male_incl, male_excl) %>%
mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"),
labels = c("N", "Age (years)", "Mass (kg)",
"Stature (cm)", "Body fat (%)"))) %>%
arrange(variable) %>%
gt() %>%
tab_footnote(footnote = "Values are mean and (SD)") %>%
tab_spanner(label = "Female", columns = c("female_incl", "female_excl")) %>%
tab_spanner(label = "Male", columns = c("male_incl", "male_excl")) %>%
cols_label(variable = " ",
female_incl = "Included",
female_excl = "Excluded",
male_incl = "Included",
male_excl = "Excluded")
```
The above code will produce referable table, as seen in Table 6.2!
%>%
dxadata select(participant, time, sex, include:weight, fat.whole) %>%
mutate(fat.whole = ((fat.whole / 1000) / weight) * 100) %>%
filter(time == "pre") %>%
group_by(sex, include) %>%
mutate(n = n()) %>%
pivot_longer(names_to = "variable",
values_to = "value",
cols = age:n) %>%
group_by(sex, include, variable) %>%
summarise(m = mean(value),
s = sd(value)) %>%
ungroup() %>%
mutate(m = signif(m, digits = 3),
s = signif(s, digits = 2),
ms = if_else(variable == "n", as.character(m), paste0(m, " (", s, ")")),
sex_incl = paste(sex, include, sep = "_")) %>%
::select(-m, -s, - sex, -include) %>%
dplyr
pivot_wider(names_from = sex_incl,
values_from = ms) %>%
select(variable, female_incl, female_excl, male_incl, male_excl) %>%
mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"),
labels = c("N", "Age (years)", "Mass (kg)",
"Stature (cm)", "Body fat (%)"))) %>%
arrange(variable) %>%
gt() %>%
tab_footnote(footnote = "Values are mean and (SD)") %>%
tab_spanner(label = "Female", columns = c("female_incl", "female_excl")) %>%
tab_spanner(label = "Male", columns = c("male_incl", "male_excl")) %>%
cols_label(variable = " ",
female_incl = "Included",
female_excl = "Excluded",
male_incl = "Included",
male_excl = "Excluded")
`summarise()` has grouped output by 'sex', 'include'. You can override using
the `.groups` argument.
Female | Male | |||
---|---|---|---|---|
Included | Excluded | Included | Excluded | |
N | 18 | 4 | 16 | 3 |
Age (years) | 22 (1.3) | 22.9 (1.6) | 23.6 (4.1) | 24.3 (1.5) |
Mass (kg) | 64.4 (10) | 64.6 (9.7) | 75.8 (11) | 88.2 (22) |
Stature (cm) | 168 (6.9) | 166 (7.6) | 183 (5.9) | 189 (4.6) |
Body fat (%) | 34.1 (5.6) | 28.8 (8.7) | 20.4 (6) | 24.3 (15) |
Values are mean and (SD) |
6.3 An exercise in data wrangling and tables
In total 30 college students performed a heavy-resistance training protocol where training volume was constantly increased over six weeks. In (Haun et al. 2018), a part of the study, focusing on supplementation was reported. In (Haun et al. 2019), participants were divided into two clusters based on training responses and the authors aimed to answer the question what separates high- from low-responders to resistance training.
In this exercise we want to reproduce a big part of Table 1 in (Haun et al. 2019). The Table as re-produced here can be seen below. See the original article for explanation of clusters. To select variables, see the data description in the exscidata
package, the data set is called hypertrophy
.
HIGH (n = 10) | LOW (n = 10) | |
---|---|---|
Age (years) | 20.9 (1.9) | 21.5 (1) |
Training age (years) | 5.5 (2.3) | 5.5 (2) |
Body mass (kg) | 78.8 (8) | 83.1 (12.8) |
DXA LBM (kg) | 62.2 (5.9) | 65.1 (9.7) |
DXA FM (kg) | 13.5 (4.9) | 14.5 (4.9) |
Type II fiber (%) | 59.4 (16.9) | 50.2 (13.6) |
3RM back squat (kg) | 127 (23.3) | 135.4 (14.1) |
Total back squat training volume (kg) | 106610.2 (18679.4) | 111820.8 (12962.5) |
Values are mean and (SD) |
A possble solution
# load the data
data(hypertrophy)
%>%
hypertrophy # Select the variables needed to reproduce the table
::select(PARTICIPANT,
dplyr
GROUP,
CLUSTER,
AGE,
BODYMASS_T1,
TRAINING_AGE,
PERCENT_TYPE_II_T1,
SQUAT_3RM,
DXA_LBM_T1,
DXA_FM_T1, %>%
SQUAT_VOLUME) # Pivot longer to gather all variables
pivot_longer(cols = AGE:SQUAT_VOLUME, names_to = "variable", values_to = "values") %>%
# Remove participants not belonging to a cluster
filter(!is.na(CLUSTER)) %>%
# Create a grouping before summarizing
group_by(CLUSTER, variable) %>%
summarise(m = mean(values),
s = sd(values)) %>%
# For nice printing, paste mean and SD
mutate(m.s = paste0(round(m,1), " (", round(s,1), ")")) %>%
# Select only variables needed for the table
select(CLUSTER, variable, m.s) %>%
# Transform the data set to a wide format based on clusters
pivot_wider(names_from = CLUSTER, values_from = m.s) %>%
# Re-arrange the "variable" variable, correct order with levels, and correct labels
mutate(variable = factor(variable, levels = c("AGE",
"TRAINING_AGE",
"BODYMASS_T1",
"DXA_LBM_T1",
"DXA_FM_T1",
"PERCENT_TYPE_II_T1",
"SQUAT_3RM",
"SQUAT_VOLUME"),
labels = c("Age (years)",
"Training age (years)",
"Body mass (kg)",
"DXA LBM (kg)",
"DXA FM (kg)",
"Type II fiber (%)",
"3RM back squat (kg)",
"Total back squat training volume (kg)"))) %>%
# Sort/order the dataset
arrange(variable) %>%
# Use gt to output the table with appropriate caption and column names.
gt() %>%
cols_label(variable = " ", HIGH = "HIGH (n = 10)", LOW = "LOW (n = 10)") %>%
tab_footnote(footnote = "Values are mean and (SD)")