6  Wrangling data to create your first table

6.1 Introduction

We can use tables to communicate a lot of information in a compact form while maintaining precision. This advantage is why creating tables is essential for effectively communicating data. We can easily create tables programmatically as part of an R markdown or quarto document. R has many “table generator” packages that translate your draft table to an output format of your choice. A great format to start authoring an analysis in is HTML; however, most “table generators” need to know your output format to be properly formatted and work in the output format. Below we will introduce a new package for the purpose of creating tables. This package, gt has the advantage that it does not require the user to change code when we switch to another output format. The gt package can create tables in HTML, PDF and word format.

Since we are concerned with reproducibility, we would like to avoid copy-and-paste operations. The strength of writing reports in R markdown or quarto is the ability to combine data, code, and text to produce a formatted output programmatically. Therefore, We will choose a table generator that allows for consistently selecting multiple formats. One such table generator is part of the gt package.

As mentioned previously, authoring in R markdown and quarto makes little difference. However, we will now focus on the more modern file format, quarto, knowing that examples and tutorials written in R markdown will translate to quarto with few problems.

The basic workflow of creating a table in R markdown or quarto is to transform the data into a nice format and then get this underlying data into the table generator. The table generator is written in a code chunk, and upon rendering the source file, the table generator will create, for example, HTML output. In this chapter, we will introduce some data-wrangling tools since the table we will produce consists of summarized data. The functions we introduce are found in the packages dplyr and tidyr. These packages are loaded as part of the tidyverse package.

6.1.1 Resources

All tidyversepackages are well documented and generally well represented in help forums. Google is your friend when looking for help.

The gt package is now a mature package for generating tables in R. This chapter is written on the basis of this package. If you are looking for alternatives, the kable function from the knitr package is described in a newly developed book, available online called the R Markdown Cookbook. The package, kableExtra comes with excellent vignettes for both html and pdf outputs. kableExtra provides extra functions to customize your basic knitr table. Note that kabale and kableExtra will only produce output in HTML and pdf-formats. Another package the can create tables in HTML, pdf, and word formats is the flextable package

6.2 Making “Table 1”

The first table in many reports in sport and exercise studies is the “Participant characteristics” table. This first table summarizes background information on the participants. We will try to create this table based on data from (Hammarström et al. 2020). These data can be found in the exscidata package. To load the data and other required packages run the following code.

library(tidyverse) # for data wrangling
library(gt) # for creating tables
library(exscidata) # the dxadata

The end result of this exercise can be found below in Table 6.1. This summary table contains the average and standard deviation per group for the variables age, body mass and stature (height) and body fat as a percentage of the body mass. This table is a reproduction of the first part of Table 1 from (Hammarström et al. 2020).

Table 6.1: Participant characteristics
Female Male
Included Excluded Included Excluded
N 18 4 16 3
Age (years) 22 (1.3) 22.9 (1.6) 23.6 (4.1) 24.3 (1.5)
Mass (kg) 64.4 (10) 64.6 (9.7) 75.8 (11) 88.2 (22)
Stature (cm) 168 (6.9) 166 (7.6) 183 (5.9) 189 (4.6)
Body fat (%) 34.1 (5.6) 28.8 (8.7) 20.4 (6) 24.3 (15)
Values are mean and (SD)

We have to make several operations to re-create this table. First we can select the columns we want to work with further from the data set that also contains a lot of other variables. Let us start by looking at the full data set. Below we use the function glmipse from the dplyr package (which is loaded with tidyverse).

data("dxadata")

glimpse(dxadata)
Rows: 80
Columns: 59
$ participant      <chr> "FP28", "FP40", "FP21", "FP34", "FP23", "FP26", "FP36…
$ time             <chr> "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre…
$ multiple         <chr> "L", "R", "R", "R", "R", "R", "L", "R", "R", "L", "L"…
$ single           <chr> "R", "L", "L", "L", "L", "L", "R", "L", "L", "R", "R"…
$ sex              <chr> "female", "female", "male", "female", "male", "female…
$ include          <chr> "incl", "incl", "incl", "incl", "incl", "excl", "incl…
$ age              <dbl> 24.5, 22.1, 26.8, 23.1, 24.8, 24.2, 20.5, 20.6, 37.4,…
$ height           <dbl> 170.0, 175.0, 184.0, 164.0, 176.5, 163.0, 158.0, 181.…
$ weight           <dbl> 66.5, 64.0, 85.0, 53.0, 68.5, 56.0, 60.5, 83.5, 65.0,…
$ BMD.head         <dbl> 2.477, 1.916, 2.306, 2.163, 2.108, 2.866, 1.849, 2.21…
$ BMD.arms         <dbl> 0.952, 0.815, 0.980, 0.876, 0.917, 0.973, 0.871, 0.91…
$ BMD.legs         <dbl> 1.430, 1.218, 1.598, 1.256, 1.402, 1.488, 1.372, 1.42…
$ BMD.body         <dbl> 1.044, 0.860, 1.060, 0.842, 0.925, 0.984, 0.923, 1.01…
$ BMD.ribs         <dbl> 0.770, 0.630, 0.765, 0.636, 0.721, 0.737, 0.648, 0.70…
$ BMD.pelvis       <dbl> 1.252, 1.078, 1.314, 1.044, 1.154, 1.221, 1.194, 1.32…
$ BMD.spine        <dbl> 1.316, 0.979, 1.293, 0.899, 1.047, 1.089, 1.006, 1.14…
$ BMD.whole        <dbl> 1.268, 1.082, 1.325, 1.119, 1.181, 1.350, 1.166, 1.24…
$ fat.left_arm     <dbl> 1168, 715, 871, 610, 788, 372, 932, 1312, 388, 668, 5…
$ fat.left_leg     <dbl> 4469, 4696, 3467, 3023, 3088, 2100, 4674, 5435, 1873,…
$ fat.left_body    <dbl> 6280, 4061, 7740, 3638, 6018, 2328, 4896, 9352, 2921,…
$ fat.left_whole   <dbl> 12365, 9846, 12518, 7565, 10259, 5048, 10736, 16499, …
$ fat.right_arm    <dbl> 1205, 769, 871, 610, 741, 374, 940, 1292, 413, 716, 5…
$ fat.right_leg    <dbl> 4497, 4900, 3444, 3017, 3254, 2082, 4756, 5455, 1782,…
$ fat.right_body   <dbl> 6082, 3923, 8172, 3602, 5699, 2144, 4705, 8674, 2640,…
$ fat.right_whole  <dbl> 12102, 9862, 12856, 7479, 10020, 4821, 10806, 15876, …
$ fat.arms         <dbl> 2373, 1484, 1742, 1220, 1529, 747, 1872, 2604, 802, 1…
$ fat.legs         <dbl> 8965, 9596, 6911, 6040, 6342, 4182, 9430, 10890, 3655…
$ fat.body         <dbl> 12362, 7984, 15912, 7239, 11717, 4472, 9601, 18026, 5…
$ fat.android      <dbl> 1880, 963, 2460, 1203, 1933, 527, 1663, 3183, 1240, 1…
$ fat.gynoid       <dbl> 5064, 5032, 4779, 3739, 4087, 2740, 5217, 6278, 2309,…
$ fat.whole        <dbl> 24467, 19708, 25374, 15044, 20278, 9869, 21542, 32375…
$ lean.left_arm    <dbl> 1987, 1931, 2884, 1753, 2652, 2425, 1913, 2266, 3066,…
$ lean.left_leg    <dbl> 7059, 7190, 10281, 6014, 8242, 7903, 6829, 8889, 9664…
$ lean.left_body   <dbl> 9516, 10693, 13847, 9736, 11387, 10573, 8954, 11482, …
$ lean.left_whole  <dbl> 20305, 21778, 29332, 19143, 24185, 22946, 18809, 2431…
$ lean.right_arm   <dbl> 2049, 2081, 2888, 1754, 2487, 2439, 1930, 2236, 3253,…
$ lean.right_leg   <dbl> 7104, 7506, 10200, 6009, 8685, 7841, 6950, 8923, 9198…
$ lean.right_body  <dbl> 9199, 10304, 14593, 9636, 10779, 9733, 8602, 10672, 1…
$ lean.right_whole <dbl> 19605, 21310, 29643, 18792, 23653, 21837, 19407, 2372…
$ lean.arms        <dbl> 4036, 4012, 5773, 3508, 5139, 4864, 3843, 4501, 6319,…
$ lean.legs        <dbl> 14163, 14696, 20482, 12023, 16928, 15744, 13779, 1781…
$ lean.body        <dbl> 18715, 20998, 28440, 19372, 22166, 20306, 17556, 2215…
$ lean.android     <dbl> 2669, 2782, 3810, 2455, 2904, 2656, 2297, 3094, 3344,…
$ lean.gynoid      <dbl> 6219, 7209, 10233, 5866, 7525, 5970, 5825, 8175, 7760…
$ lean.whole       <dbl> 39910, 43088, 58976, 37934, 47837, 44783, 38216, 4804…
$ BMC.left_arm     <dbl> 181, 138, 204, 144, 180, 173, 140, 173, 220, 226, 225…
$ BMC.left_leg     <dbl> 567, 508, 728, 441, 562, 574, 482, 631, 633, 630, 672…
$ BMC.left_body    <dbl> 622, 414, 696, 367, 526, 465, 370, 629, 473, 629, 509…
$ BMC.left_whole   <dbl> 1680, 1321, 1945, 1201, 1527, 1580, 1131, 1688, 1544,…
$ BMC.right_arm    <dbl> 198, 150, 210, 142, 176, 183, 140, 176, 224, 251, 226…
$ BMC.right_leg    <dbl> 574, 514, 739, 431, 552, 565, 491, 641, 622, 636, 690…
$ BMC.right_body   <dbl> 592, 428, 730, 351, 502, 409, 358, 582, 420, 616, 483…
$ BMC.right_whole  <dbl> 1582, 1288, 1958, 1130, 1451, 1466, 1229, 1668, 1478,…
$ BMC.arms         <dbl> 379, 288, 414, 285, 356, 357, 280, 348, 444, 478, 451…
$ BMC.legs         <dbl> 1142, 1022, 1467, 872, 1115, 1139, 974, 1272, 1255, 1…
$ BMC.body         <dbl> 1214, 842, 1426, 718, 1028, 874, 728, 1211, 893, 1245…
$ BMC.android      <dbl> 80, 57, 90, 44, 56, 54, 43, 77, 52, 72, 59, 60, 65, 5…
$ BMC.gynoid       <dbl> 314, 285, 427, 245, 299, 262, 241, 379, 335, 378, 332…
$ BMC.whole        <dbl> 3261, 2609, 3903, 2331, 2978, 3046, 2360, 3356, 3022,…

We can see that we got 80 rows and 59 columns in the data set. The columns of interest to us are:

  • participant
  • time
  • sex
  • include
  • age
  • height
  • weight
  • fat.whole

For a full description of the data set, you can type ?dxadata in your console. The participant column is good to have to keep track of the data set in the beginning. time is needed to remove some observation that are not needed. This pre-training table only sums up information from before the intervention starts. sex is a grouping variable together with include, Table 1 in (Hammarström et al. 2020) uses Sex and Inclusion in data analysis as grouping for descriptive data. The other variables are used to describe the data sample.

6.2.1 The pipe operator and select

As mentioned above, we will start by selecting the variables we want to work further with. Using the select function from dplyr we can select columns that we need. In the code below I will use select as part of a “pipe”. Think of the pipe as doing operations in sequel. Each time you use the pipe operator (%>%) you say “then do”. The code below translates to:

  • Take the data set dxadata, then do
  • select() the following variables, then do
  • print()

print, is a function that outputs the results of the operations. In each new function of a pipe, the data that we take with us from the above line ends up as the first argument. A representation of this behavior can be expressed as:

DATA %>% FUNCTION(DATA, ARGUMENTS) %>% FUNCTION(DATA, ARGUMENTS) %>% FUNCTION(DATA)

We do not need to type the data part, instead the pipe operator (%>%) gathers the data from each step and puts it in the subsequent function.

Copy the code below to your own quarto document and run it. When using quarto you might want to set “Chunk output in console” in the settings menu. In my experience, this makes developing code a bit faster.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  print() # print the output

Notice that I have added short comments after each line to make it clear what I want to accomplish. We will build further on the above code, and this is a common workflow. Using pipes, it is easy to extend the code by adding lines doing certain operations, one at the time. Notice also that the select function uses a list of variable names with include:weight being short for “take all variables from include to weight”.

6.2.2 Filter observations

The next step will be to filter observations. We need to remove the observations that comes from the post-intervention tests. The time variable contains to values pre and post to remove post-values we need to tell R to remove all observations (rows) containing post. We will use the filter function from dplyr. This will be our first experience with logical operators. Let’s try out two alternatives, copy the code to your console to see the results.

## Alternative 1: ##
dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter away all observation with "post"
  filter(time != "post") %>%
  
  print() # print the output


## Alternative 2: ##
dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  print() # print the output

The above code should give the same output. The operator != says “not equal to”, the operator == says “equal to”. Notice that R uses two equal signs to say equal to. A single equal sign is used as an assignment operator in R.

6.2.3 Create or change variables

The next problem for us is that we need to manipulate or combine information from two variables in order to calculate body fat percentage. The formula that we will use is simply expressing body fat as a percentage of the body weight.

\[\text{Body fat (\%)} = \frac{\text{Body fat (g)}/1000}{\text{Body weight (kg)}} \times 100\] By using the mutate function we can add or manipulate existing variables in a pipe. Mutate takes as arguments a list of new variables:

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  print() # print the output

In the code above, we overwrite the variable fat.whole with the re-calculated variable.

6.2.4 Grouped operations and summary statistics

In a pipe, we can group the data set giving us opportunities to calculate summary statistics over one or several grouping variables. In Table 1 in (Hammarström et al. 2020), include and sex are the two grouping variables. Using the group_by() function from dplyr sets the grouping of the data frame. If we use functions that summarizes data, such summaries will be per group. In Table 1 in (Hammarström et al. 2020) the number of participants in each group are specified. We can use the function n() to calculate the number of observations per group in a mutate call.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  print() # print the output

The new variable n now contains the number of observations in each group. For now we can regard this as a new variable. Each participant belongs to a specified group, and this specific group has n number of members.

We can now go further and use the summarise function. Instead of adding variables to the existing data set, summarize reduces the data set using some summarizing function, such as mean() or sd(). These summary statistics are what we are looking for in our data set. Example of other summarizing functions for descriptive data are min() and max() for the minimum and maximum.

We can use the summarise() function to calculate the mean and standard deviation for the weight variable.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Summarise weight
  summarise(weight.m = mean(weight), 
            weight.s = sd(weight)) %>%
  
  print() # print the output

Try out the code in your own quarto document. The above example gives us what we want, however, it means that we need to type a lot. Instead of needing to make a summary for each variable, we can combine the variables in a long format. To get to the long format we will use the pivot_longer() function. This function gathers several variables into two columns, one with the variables names as values and a second column with each value from the original variables. In our case we want to gather the variables age, height, weight, fat.whole and n. I will call the new variables that we create variable and value.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  print()

The cols = age:n part of pivot_longer specifies what columns to gather. The data set is still grouped by include and sex. We may now proceed by summarizing over these groups, however, we need to add another group to specify that we want different values per variable. To do this we re-specify the grouping. After this we add the summarise function.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  print()

If you run the above code you will notice that the the standard deviation of each variable is larger than zero except for n which has no variability. This is because we created it per group and simply calculated it as the sum of observations.

Take a look at Table 1 in (Hammarström et al. 2020). The format of the descriptive statistics are mean (SD), this is a preferred way of reporting these statistics. In order to achieve this we need to “manually” convert the numbers. In the example below, I will start by making a new variable by simply pasting the numbers together. I will also add the parentheses.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  # Add descriptive statistics together for nice formatting
  mutate(ms = paste0(m, " (", s, ")"))
  print()

In mutate(ms = paste0(m, " (", s, ")")), the paste0 function simply glues components together to form a string of text. First, the vector of means are being used, then we add a parenthesis, followed by the SD and finally a parenthesis.

If you run the above code you will notice that you end up with numbers looking like this:

167.666666666667 (6.86851298231541)

This is neither good or good looking. We have to take care of the decimal places. There are a number of ways to do this but in this case the function signif seems to make the situation better. signif rounds to significant digits. This means that we will get different rounding depending on the “size” of the value. I find signif(m, 3) to be a good starting point.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  # Add descriptive statistics together for nice formatting
  mutate(ms = paste0(signif(m, 3), # Use signif to round to significant numbers
                     " (",
                     signif(s, 3), 
                     ")")) %>%
  print()

Things are starting to look good. Run the code, what do you think. A problem with the above is that we do not want any variability in the n variable. So if the variable is n we do not want that kind of formatting. It is time to add a conditional statement. In dplyr there are easy-to-use if/else functions. The function if_else sets a condition, if this is met then we can decide what to do, and likewise decide what to do if it is not met.

This looks something like this inside a dplyr pipe:

... %>%
  if_else(IF_THIS_IS_TRUE, THE_DO_THIS, OTHERWISE_DO_THIS) %>%
  print()

If variable is n, then we only want to display m otherwise we want the full code as described above: paste0(signif(m, 3), " (", signif(s, 3), ")"). We add this to the code:

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  # Add descriptive statistics together for nice formatting
  mutate(ms = if_else(variable == "n", # If the variable is n
                      as.character(m), # the only display the mean, otherwise:
                        paste0(signif(m, 3), # Use signif to round to significant numbers
                        " (",
                        signif(s, 3), 
                        ")"))) %>%
  print()

The as.character part is needed because the output of if_else must be the same regardless of what the outcome of the test is.

We are getting close to something!

The next step is to remove variables that we do not longer need. The select function will help us with that. we can remove m and s by select(-m, -s), the minus sign tells R to remove them from the list of variables in the data set. We can then combine the grouping variables into a include_sex variable. Similarly to what we did above, we can simply paste them together. Now we will use the paste (function instead of paste0). In paste we specify a separator, maybe _ is a nice alternative. Selecting away the individual variables from the new combined one leaves us with this code and data set.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  # Add descriptive statistics together for nice formatting
  mutate(ms = if_else(variable == "n", # If the variable is n
                      as.character(m), # the only display the mean, otherwise:
                        paste0(signif(m, 3), # Use signif to round to significant numbers
                        " (",
                        signif(s, 3), 
                        ")")), 
         # Doing a new grouping variable
         include_sex = paste(include, sex, sep = "_")) %>%
  # removing unnecessary variables after ungrouping
  ungroup() %>%
  select(-sex, -include, -m, -s) %>%
  print()

If ungroup is not used, we cannot select away variables since they are used to group the data set. We will now perform the last operations before we can make it a table. To make it formatted as in Table 1 in (Hammarström et al. 2020), we can make the present data set wider. Each group as its own column in addition to the variable name column. We will use the opposite function to pivot_longer, namely pivot_wider1. pivot_wider takes a variable or “key” column and a “values” column and divide the values based on the “key”.

Hammarström, Daniel, Sjur Øfsteng, Lise Koll, Marita Hanestadhaugen, Ivana Hollan, William Apró, Jon Elling Whist, Eva Blomstrand, Bent R. Rønnestad, and Stian Ellefsen. 2020. “Benefits of Higher Resistance-Training Volume Are Related to Ribosome Biogenesis.” Journal Article. The Journal of Physiology 598 (3): 543–65. https://doi.org/10.1113/JP278455.

1 All this talk about pivot, take a break and watch this clip from the hit series “Friends”, its about “pivot”!

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  # Add descriptive statistics together for nice formatting
  mutate(ms = if_else(variable == "n", # If the variable is n
                      as.character(m), # the only display the mean, otherwise:
                        paste0(signif(m, 3), # Use signif to round to significant numbers
                        " (",
                        signif(s, 3), 
                        ")")), 
         # Doing a new grouping variable
         include_sex = paste(include, sex, sep = "_")) %>%
  # removing unnecessary variables after ungrouping
  ungroup() %>%
  select(-sex, -include, -m, -s) %>%
  # pivot wider to match the desired data
  pivot_wider(names_from = include_sex, 
              values_from = ms) %>%
  print()

A final step is to format the variable variable(!). The easiest is to make it a factor variable with specified levels and names. In the factor function we use levels = c("n", "age", "weight", "height", "fat.whole") to specify the order of values contained in the variable. Using labels = c("N", "Age (years)", "Mass (kg)", "Stature (cm)", "Body fat (%)", we set corresponding labels on each level. After we have added this information to the variable we can use arrange to sort the data set. arrange will sort the data set based on the order we have given to the variable. select will help you sort the columns to match what we want.

#| eval: false

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  # Add descriptive statistics together for nice formatting
  mutate(ms = if_else(variable == "n", # If the variable is n
                      as.character(m), # the only display the mean, otherwise:
                        paste0(signif(m, 3), # Use signif to round to significant numbers
                        " (",
                        signif(s, 3), 
                        ")")), 
         # Doing a new grouping variable
         include_sex = paste(include, sex, sep = "_")) %>%
  # removing unnecessary variables after ungrouping
  ungroup() %>%
  select(-sex, -include, -m, -s) %>%
  # pivot wider to match the desired data
  pivot_wider(names_from = include_sex, 
              values_from = ms) %>%
    mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"), 
                           labels = c("N", "Age (years)", "Mass (kg)", 
                                      "Stature (cm)", "Body fat (%)"))) %>%
  arrange(variable) %>%
  print()
`summarise()` has grouped output by 'include', 'sex'. You can override using
the `.groups` argument.
# A tibble: 5 × 5
  variable     excl_female excl_male   incl_female incl_male  
  <fct>        <chr>       <chr>       <chr>       <chr>      
1 N            4           3           18          16         
2 Age (years)  22.9 (1.57) 24.3 (1.46) 22 (1.25)   23.6 (4.11)
3 Mass (kg)    64.6 (9.71) 88.2 (22.4) 64.4 (10.4) 75.8 (10.7)
4 Stature (cm) 166 (7.59)  189 (4.58)  168 (6.87)  183 (5.88) 
5 Body fat (%) 28.8 (8.69) 24.3 (15.3) 34.1 (5.64) 20.4 (5.99)

6.2.5 Starting the table generator - The gt() function.

The next step is to “pipe” everything into the table generator.

dxadata %>% # take the dxadata data set
  select(participant, time, sex, include:weight, fat.whole) %>% 
  # select participant, time, sex, include to height and fat.whole
  
  # Filter to keep all observations with pre
  filter(time == "pre") %>%
  
  # Calculate body fat
  # fat.whole in grams, needs to be divided by 1000 to express as kg
  # Multiply by 100 to get percentage
  mutate(fat.whole = ((fat.whole/1000) / weight) * 100) %>%
  
  # Group the data frame and add a variable specifying the number of observations per group
  group_by(include, sex) %>%
  mutate(n = n()) %>%
  # Collect all variables for convenient summarizing
  pivot_longer(names_to = "variable", 
               values_to = "value", 
               cols = age:n) %>%
  # Create a new grouping, adding variable
  group_by(include, sex, variable) %>%
  # Summarize in two new variables m for mean and s for SD
  summarise(m = mean(value), 
            s = sd(value)) %>%
  # Add descriptive statistics together for nice formatting
  mutate(ms = if_else(variable == "n", # If the variable is n
                      as.character(m), # the only display the mean, otherwise:
                        paste0(signif(m, 3), # Use signif to round to significant numbers
                        " (",
                        signif(s, 3), 
                        ")")), 
         # Doing a new grouping variable
         include_sex = paste(include, sex, sep = "_")) %>%
  # removing unnecessary variables after ungrouping
  ungroup() %>%
  select(-sex, -include, -m, -s) %>%
  # pivot wider to match the desired data
  pivot_wider(names_from = include_sex, 
              values_from = ms) %>%
    mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"), 
                           labels = c("N", "Age (years)", "Mass (kg)", 
                                      "Stature (cm)", "Body fat (%)"))) %>%
  arrange(variable) %>%
  
  # Piping into the table generator (gt)

  gt()  

As per our strategy to first summarize and set up the data table, we already have nice first draft. However, we need to format variable names and add column labels. We can also add a footnote.

The gt package has several functions for manipulating the raw tables created with the gt() function. The gt package also use a consistent vocabulary for tables as seen in [this figure]((https://gt.rstudio.com/reference/figures/gt_parts_of_a_table.svg).

First, using tab_footnote() we can add the footnote indicating that “Values are mean and (SD)”. We do this by piping the whole table, created with gt() into tab_footnote(footnote = "Values are mean and (SD)"). We have two columns representing females and two representing males, these can be more clearly separated by adding a spanner column label. This column label adds rows to the table. Using tab_spanner(label = "Female", columns = c("female_incl", "female_excl")) we add “Female” above the two columns representing females. We can do the same for males. Using cols_label() we specify new column names to match what we want. The resulting full code can be seen below.

dxadata %>%
  select(participant, time, sex, include:weight, fat.whole) %>%
  mutate(fat.whole = ((fat.whole / 1000) / weight) * 100) %>%
  filter(time == "pre") %>%
  group_by(sex, include) %>%
  mutate(n = n()) %>%

  pivot_longer(names_to =  "variable", 
               values_to = "value", 
               cols = age:n) %>%
  group_by(sex, include, variable) %>%
  summarise(m = mean(value), 
            s = sd(value)) %>%
  ungroup() %>%
  mutate(m = signif(m, digits = 3), 
         s = signif(s, digits = 2), 
         ms = if_else(variable == "n", as.character(m), paste0(m, " (", s, ")")), 
         sex_incl = paste(sex, include, sep = "_")) %>%
  dplyr::select(-m, -s, - sex, -include) %>%

  pivot_wider(names_from = sex_incl, 
              values_from = ms) %>%
  select(variable, female_incl, female_excl, male_incl, male_excl) %>%
  mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"), 
                           labels = c("N", "Age (years)", "Mass (kg)", 
                                      "Stature (cm)", "Body fat (%)"))) %>%
  arrange(variable) %>%
  
  gt() %>%
  tab_footnote(footnote = "Values are mean and (SD)") %>%
  tab_spanner(label = "Female", columns = c("female_incl", "female_excl")) %>%
  tab_spanner(label = "Male", columns = c("male_incl", "male_excl")) %>%
  cols_label(variable = " ", 
             female_incl = "Included", 
             female_excl = "Excluded", 
             male_incl = "Included", 
             male_excl = "Excluded")

6.2.6 Working with tables in quarto

If we where to use this table in a report created with quarto we would like to be able to cross-reference it. This will work if we add information to the code chunk where the table is created. More specifically we need to set a table label and a table caption. Quarto has built in support for cross referencing figures and tables. Adding chunk options that sets a label and table caption will make it possible to reference the table. Note that the label must start with “tbl-” to make quarto identify it as a table.

```{r}
#| label: tbl-participant-characteristics
#| tbl-cap: "Participant characteristics"

dxadata %>%
select(participant, time, sex, include:weight, fat.whole) %>%
mutate(fat.whole = ((fat.whole / 1000) / weight) * 100) %>%
filter(time == "pre") %>%
group_by(sex, include) %>%
mutate(n = n()) %>%

pivot_longer(names_to =  "variable", 
             values_to = "value", 
             cols = age:n) %>%
group_by(sex, include, variable) %>%
summarise(m = mean(value), 
          s = sd(value)) %>%
ungroup() %>%
mutate(m = signif(m, digits = 3), 
       s = signif(s, digits = 2), 
       ms = if_else(variable == "n", as.character(m), paste0(m, " (", s, ")")), 
       sex_incl = paste(sex, include, sep = "_")) %>%
dplyr::select(-m, -s, - sex, -include) %>%

pivot_wider(names_from = sex_incl, 
            values_from = ms) %>%
select(variable, female_incl, female_excl, male_incl, male_excl) %>%
mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"), 
                         labels = c("N", "Age (years)", "Mass (kg)", 
                                    "Stature (cm)", "Body fat (%)"))) %>%
arrange(variable) %>%

gt() %>%
tab_footnote(footnote = "Values are mean and (SD)") %>%
tab_spanner(label = "Female", columns = c("female_incl", "female_excl")) %>%
tab_spanner(label = "Male", columns = c("male_incl", "male_excl")) %>%
cols_label(variable = " ", 
           female_incl = "Included", 
           female_excl = "Excluded", 
           male_incl = "Included", 
           male_excl = "Excluded")
           
```

The above code will produce referable table, as seen in Table 6.2!

  dxadata %>%
  select(participant, time, sex, include:weight, fat.whole) %>%
  mutate(fat.whole = ((fat.whole / 1000) / weight) * 100) %>%
  filter(time == "pre") %>%
  group_by(sex, include) %>%
  mutate(n = n()) %>%

  pivot_longer(names_to =  "variable", 
               values_to = "value", 
               cols = age:n) %>%
  group_by(sex, include, variable) %>%
  summarise(m = mean(value), 
            s = sd(value)) %>%
  ungroup() %>%
  mutate(m = signif(m, digits = 3), 
         s = signif(s, digits = 2), 
         ms = if_else(variable == "n", as.character(m), paste0(m, " (", s, ")")), 
         sex_incl = paste(sex, include, sep = "_")) %>%
  dplyr::select(-m, -s, - sex, -include) %>%

  pivot_wider(names_from = sex_incl, 
              values_from = ms) %>%
  select(variable, female_incl, female_excl, male_incl, male_excl) %>%
  mutate(variable = factor(variable, levels = c("n", "age", "weight", "height", "fat.whole"), 
                           labels = c("N", "Age (years)", "Mass (kg)", 
                                      "Stature (cm)", "Body fat (%)"))) %>%
  arrange(variable) %>%
  
  gt() %>%
  tab_footnote(footnote = "Values are mean and (SD)") %>%
  tab_spanner(label = "Female", columns = c("female_incl", "female_excl")) %>%
  tab_spanner(label = "Male", columns = c("male_incl", "male_excl")) %>%
  cols_label(variable = " ", 
             female_incl = "Included", 
             female_excl = "Excluded", 
             male_incl = "Included", 
             male_excl = "Excluded")
`summarise()` has grouped output by 'sex', 'include'. You can override using
the `.groups` argument.
Table 6.2: Participant characteristics
Female Male
Included Excluded Included Excluded
N 18 4 16 3
Age (years) 22 (1.3) 22.9 (1.6) 23.6 (4.1) 24.3 (1.5)
Mass (kg) 64.4 (10) 64.6 (9.7) 75.8 (11) 88.2 (22)
Stature (cm) 168 (6.9) 166 (7.6) 183 (5.9) 189 (4.6)
Body fat (%) 34.1 (5.6) 28.8 (8.7) 20.4 (6) 24.3 (15)
Values are mean and (SD)

6.3 An exercise in data wrangling and tables

In total 30 college students performed a heavy-resistance training protocol where training volume was constantly increased over six weeks. In (Haun et al. 2018), a part of the study, focusing on supplementation was reported. In (Haun et al. 2019), participants were divided into two clusters based on training responses and the authors aimed to answer the question what separates high- from low-responders to resistance training.

Haun, C. T., C. G. Vann, C. B. Mobley, P. A. Roberson, S. C. Osburn, H. M. Holmes, P. M. Mumford, et al. 2018. “Effects of Graded Whey Supplementation During Extreme-Volume Resistance Training.” Journal Article. Front Nutr 5: 84. https://doi.org/10.3389/fnut.2018.00084.
Haun, C. T., C G. Vann, C. Brooks Mobley, Shelby C. Osburn, Petey W. Mumford, Paul A. Roberson, Matthew A. Romero, et al. 2019. “Pre-Training Skeletal Muscle Fiber Size and Predominant Fiber Type Best Predict Hypertrophic Responses to 6 Weeks of Resistance Training in Previously Trained Young Men.” Journal Article. Frontiers in Physiology 10 (297). https://doi.org/10.3389/fphys.2019.00297.

In this exercise we want to reproduce a big part of Table 1 in (Haun et al. 2019). The Table as re-produced here can be seen below. See the original article for explanation of clusters. To select variables, see the data description in the exscidata package, the data set is called hypertrophy.

Baseline characteristics at PRE and back squat training volume between clusters
HIGH (n = 10) LOW (n = 10)
Age (years) 20.9 (1.9) 21.5 (1)
Training age (years) 5.5 (2.3) 5.5 (2)
Body mass (kg) 78.8 (8) 83.1 (12.8)
DXA LBM (kg) 62.2 (5.9) 65.1 (9.7)
DXA FM (kg) 13.5 (4.9) 14.5 (4.9)
Type II fiber (%) 59.4 (16.9) 50.2 (13.6)
3RM back squat (kg) 127 (23.3) 135.4 (14.1)
Total back squat training volume (kg) 106610.2 (18679.4) 111820.8 (12962.5)
Values are mean and (SD)
A possble solution
# load the data
data(hypertrophy)


hypertrophy %>%
# Select the variables needed to reproduce the table
  dplyr::select(PARTICIPANT, 
                GROUP, 
                CLUSTER,
                AGE, 
                BODYMASS_T1, 
                TRAINING_AGE,
                PERCENT_TYPE_II_T1, 
                SQUAT_3RM, 
                DXA_LBM_T1,
                DXA_FM_T1, 
                SQUAT_VOLUME) %>%
  # Pivot longer to gather all variables
  pivot_longer(cols = AGE:SQUAT_VOLUME, names_to = "variable", values_to = "values") %>%
  # Remove participants not belonging to a cluster
  filter(!is.na(CLUSTER)) %>%
  # Create a grouping before summarizing
  group_by(CLUSTER, variable) %>%
  summarise(m = mean(values), 
            s = sd(values)) %>%
  # For nice printing, paste mean and SD 
  mutate(m.s = paste0(round(m,1), " (", round(s,1), ")")) %>%
  # Select only variables needed for the table
  select(CLUSTER, variable, m.s) %>%
  # Transform the data set to a wide format based on clusters
  pivot_wider(names_from = CLUSTER, values_from = m.s) %>%
  # Re-arrange the "variable" variable, correct order with levels, and correct labels
  mutate(variable = factor(variable, levels = c("AGE",
                                                "TRAINING_AGE", 
                                                "BODYMASS_T1", 
                                                "DXA_LBM_T1", 
                                                "DXA_FM_T1", 
                                                "PERCENT_TYPE_II_T1", 
                                                "SQUAT_3RM", 
                                                "SQUAT_VOLUME"), 
                                      labels = c("Age (years)", 
                                                "Training age (years)", 
                                                "Body mass (kg)", 
                                                "DXA LBM (kg)", 
                                                "DXA FM (kg)", 
                                                "Type II fiber (%)", 
                                                "3RM back squat (kg)", 
                                                "Total back squat training volume (kg)"))) %>%
  # Sort/order the dataset
  arrange(variable) %>%
  # Use gt to output the table with appropriate caption and column names. 
  gt() %>%
  cols_label(variable = " ", HIGH = "HIGH (n = 10)", LOW = "LOW (n = 10)") %>%
    tab_footnote(footnote = "Values are mean and (SD)")

6.4 References and footnotes