Data wrangling and creating summaries

Introduction

This course is about learning how to communicate data. Data are everywhere: sometimes we collect them in the lab, sometimes they are sitting on our computer, and sometimes they are teased out of other people’s research projects. Before the data can tell us anything we must often perform a lot of operations on them, test them in different statistical models, and visualize them. The aim of the course is to increase your data analysis proficiency.

Skills in data analysis are seldom taught in programs outside data science or statistics. There are courses in statistics and report writing, but students often struggle to get the data into the computer and/or into the right form for statistical tests before they can write the report. Here we address the parts between the raw data and the report.

R has excellent capabilities for “data wrangling.” This means that we can perform operations to import, clean and transform data to suit downstream analyses.

Resources

Chapters 9-15 in R for Data Science are good for more depth. We will use tidyr to create tidy data and dplyr to manipulate data and create summaries.

Learning objectives

After the session, you should be able to answer:

What is a pipe?
What can these functions do?
- mutate()
- select()
- filter()
- group_by() and summarise()
- arrange()
What is tidy data?
How do pivot_longer() and pivot_wider() work?

Piping operations

In R we can use the %>% pipe operator to do sequential tasks, that is, operations that follow each other. The pipe operator works with packages like dplyr and tidyr. These packages are designed to make it easier to work with data and transform them from “raw” form to a form that is suitable for statistical modeling and visualization.

To load dplyr and tidyr into our R session we can use the package tidyverse. In addition to dplyr and tidyr, it contains packages for reading data and visualizing them.

library(tidyverse)
library(readxl)

When creating a data pipe with %>% we send the results from one function to the first argument of the next function.

function_1(data) %>% function_2(data_transformed) %>% function_3(data_transformed2)

In the pipe above, each function does something with the data. The result from each function gets passed on to the next function. In this way we can read the pipe operator as “then do this”.

Wait a minute, the first argument in a function?? R is built on functions. Functions do stuff, for example, calculate something. Arguments are the information we give to the function. A function has the form fun(argument_1, argument_2, ...). Each argument tells the function, for example, what data to use or what operation to do.
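As a small illustration of arguments and of how the pipe fills in the first one, here is a minimal sketch using the base R functions sqrt() and round(). The object names x, nested and piped are only made up for this example.

```r
library(dplyr)  # attaches the %>% pipe

x <- 10

# round() takes the number to round as its first argument (x) and the
# number of decimal places as its second argument (digits).
nested <- round(sqrt(x), digits = 2)

# The same calculation as a pipe: x goes into sqrt(), then the result
# becomes the first argument of round().
piped <- x %>%
    sqrt() %>%
    round(digits = 2)

print(nested)
print(piped)
identical(nested, piped)  # both approaches give the same value
```

The nested version must be read from the inside out, while the pipe can be read from top to bottom.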

We will use the cycling data set for examples. We will start by loading it.

cyclingStudy <- read_excel("./data/cyclingStudy.xlsx", na = "NA") # remember to use the na argument

The above code loads the data into the environment. Let’s say that we want to calculate the squat jump height (sj.max) per kilogram of body mass (weight.T1). We could describe this operation as

take the cycling data set then do
divide squat jump height by body mass then do
select the important variables then do
show the results.

In code it would look like this.

cyclingStudy %>% # Take the cycling data set, then do
        mutate(sqj.bm = sj.max / weight.T1) %>% # divide sj.max by weight, then do
        select(subject, group, timepoint, sqj.bm) %>% # select the important variables, then
        print() # show the results

The new variable has a funny name, but it stands for squat jump per body mass. The mutate function is used to create a new variable. The select function is used to select columns in the data frame. The print function prints the results of a pipe.

The pipe puts the result from a function in as the first argument of the following function. If the subsequent function expects the data somewhere other than the first argument, you can use . (period) as a placeholder to show where the data should go. The above can also be written as

cyclingStudy %>% # Take the cycling data set, then do
        mutate(., sqj.bm = sj.max / weight.T1) %>% # divide sj.max by weight, then do
        select(., subject, group, timepoint, sqj.bm) %>% # select the important variables, then
        print(.) # show the results

The dot . tells R where to put the result from the previous function, or the data at the start of the pipe.
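The placeholder matters most when the data is not the first argument of a function. As a minimal sketch, the base R function lm() takes a formula first and the data later. The example uses mtcars, a data set that ships with R and is not part of the course data, and the object name model is made up for this example.

```r
library(dplyr)  # attaches the %>% pipe

# lm() expects a formula as its first argument and the data in the
# data argument, so the dot tells the pipe where to put the data frame.
model <- mtcars %>%
    lm(mpg ~ wt, data = .)

print(coef(model))  # intercept and slope of the fitted line
```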

Filter

Now we have tried mutate() and select(). Let’s look into filter().

Filtering is good when you only want to select a few rows of a data frame, corresponding to specific criteria. The cycling data set consists of several time-points. We can see what values the timepoint variable can take using the following code:

cyclingStudy %>%
    distinct(timepoint) %>%
    print()

The distinct function returns unique entries of a variable. Read more about it with ?distinct.

We now have information on what values the timepoint variable can take. In the study, testing was performed after each of three meso-cycles. Each meso-cycle was four weeks long. If we are interested only in data from the pre-training tests, timepoint should be pre.

To accomplish this we have to tell R to look for entries that match our criteria exactly.

cyclingStudy %>%
    filter(timepoint == "pre") %>%
    print()

Looking at the output will reveal that the tibble contains 20 rows and 101 columns. This is what we want: we have filtered based on timepoint == "pre". Notice that we are using double equal signs here, as the single equal sign is an assignment operator. The double equal sign is used for testing equality.

Wait a minute, assignment operator?? We have previously seen that <- assigns a value to an object. For example, to store the value 3 in an object that we name three we can use the following code: three <- 3. Here this gives the same result as writing three = 3. Why bother using the “arrow”? There are some differences in how = and <- are used by R, and the arrow is more general. In functions, we use the equal sign to tell what argument to use. We can also reverse the arrow, which means that we can also assign the number 3 to the object three by using 3 -> three. My rule is to use <- when assigning variables, and to use = in functions and in mutate to create new variables (the arrow does not work for naming new variables in mutate), mostly because I think it’s easier to read.
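The three ways of assigning can be compared directly. This is only a sketch; the object names three_b, three_c and mean_value are made up for the example.

```r
three <- 3      # the usual way to assign
three_b = 3     # gives the same object when used like this
3 -> three_c    # the arrow reversed

identical(three, three_b)
identical(three, three_c)

# Inside a function call, = names an argument (here the argument x of
# mean()), while <- stores the result in an object.
mean_value <- mean(x = c(1, 2, 3))
print(mean_value)
```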

So, the double equal sign is used for testing equality. If two values are equal it is TRUE otherwise it is FALSE.

We can test this in code.

2 == 2 # should be TRUE

"pre" == "meso2" # Should be false

"pre" == "pre"

So, under the hood the filter function tests each row against the condition and returns the rows where the test is TRUE.

In R we can invert a logical test by putting a ! before a statement. We can also use != which means not equal to. Below we are using this logic in two ways.

!("pre" == "pre")

"pre" != "pre"

If we want to remove only the pre test and keep every other test we can use

cyclingStudy %>%
    filter(timepoint != "pre") %>%
    print()

This will give us a data frame with 60 rows.

If we want to filter to keep two time-points, say pre and meso1, we need to do it a bit differently.

cyclingStudy %>%
    filter(timepoint %in% c("pre", "meso1")) %>%
    print()

The expression timepoint %in% c("pre", "meso1") can be read as “keep observations whose timepoint is found in the vector c("pre", "meso1")”. Using the equality sign == does not work here, because == compares values element by element and returns one answer per element of the vector, and no observation is both "pre" and "meso1". Test for yourself:

"pre" == c("pre", "meso1")

"pre" %in% c("pre", "meso1")

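The difference becomes clearer on a whole column. The sketch below uses a hypothetical toy data frame (not the cycling data), and the object names toy, eq_result, in_result and kept are made up for the example.

```r
library(dplyr)

# Hypothetical toy data with one row per time-point
toy <- tibble(subject = c(1, 1, 1, 1),
              timepoint = c("meso1", "pre", "meso2", "meso3"))

# == compares element by element and recycles the shorter vector:
# "meso1" vs "pre", "pre" vs "meso1", "meso2" vs "pre", "meso3" vs "meso1"
eq_result <- toy$timepoint == c("pre", "meso1")
print(eq_result)

# %in% asks, for each row, whether the value is found anywhere in the vector
in_result <- toy$timepoint %in% c("pre", "meso1")
print(in_result)

# filter() keeps the rows where the test is TRUE
kept <- toy %>%
    filter(timepoint %in% c("pre", "meso1"))
print(kept)
```

With == every comparison fails here because of the row order, even though the data contain both pre and meso1.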
The filter function can also be used to filter based on numeric variables. Let’s say that we want to see rows corresponding to values of squat jump height higher than 30 units.

cyclingStudy %>%
    filter(sj.max > 30) %>%
    print()

This should produce a data frame with 41 rows.

We can use multiple arguments in filter. For example, squat jump height above 30 and time-point equal to “pre”.

cyclingStudy %>%
    filter(timepoint == "pre", sj.max > 30) %>%
    print()

Did you get 13 rows?
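Separating conditions with a comma means that all of them must be TRUE, which is the same as combining them with &. The sketch below shows this on a hypothetical toy data frame (not the cycling data) and, as an addition that the text above does not cover, uses | to keep rows where at least one condition is TRUE.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(timepoint = c("pre", "pre", "meso1", "meso1"),
              sj.max = c(28, 33, 31, 29))

# Comma and & give the same rows: both conditions must hold
both_comma <- toy %>% filter(timepoint == "pre", sj.max > 30)
both_and <- toy %>% filter(timepoint == "pre" & sj.max > 30)

# | keeps rows where at least one of the conditions holds
either <- toy %>% filter(timepoint == "pre" | sj.max > 30)

print(both_comma)
print(either)
```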

Arrange

A data frame can be arranged based on values of a variable. This may be useful when we want to get an overview of the data. Let’s use the sj.max variable again.

cyclingStudy %>%
    arrange(sj.max) %>%
    print()

Compare the results of the above to below.

cyclingStudy %>%
    arrange(desc(sj.max)) %>%
    print()

The desc() function reverses the order of the arrange() function. This can also be done in another way, by putting a minus sign in front of the variable we want to use for ordering the data frame.

cyclingStudy %>%
    arrange(-sj.max) %>%
    print()
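The minus sign works because sj.max is numeric. As a minimal sketch with a hypothetical toy data frame (not the cycling data), desc() can also reverse the order of a character variable such as timepoint, where a minus sign would give an error.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(timepoint = c("pre", "meso1", "meso2"),
              sj.max = c(30, 32, 31))

# desc() works for character variables (reverse alphabetical order)
sorted_chr <- toy %>%
    arrange(desc(timepoint))

# For a numeric variable, desc() and the minus sign give the same order
sorted_desc <- toy %>% arrange(desc(sj.max))
sorted_minus <- toy %>% arrange(-sj.max)

print(sorted_chr)
print(sorted_desc)
```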

Group by and summarise

A real super-power of the dplyr package is the capability to group data and then summarise them. Often we want to know the mean or standard deviation of variables, but we are also interested in calculating these within “groups”. The group_by function tells R that operations should be done per specified grouping.

We want to know the mean squat jump per time-point in the cycling data set. We will combine group_by with summarise.

cyclingStudy %>%
    group_by(timepoint) %>%
    summarise(m = mean(sj.max)) %>%
    print()

Notice that the code above returns a smaller data frame with two variables: timepoint and m, which in this case is short for “mean”.

Notice also that we get some NA (not available). This is because we tried to calculate the mean on a vector with missing values (NA). The mean() function has the capabilities to exclude NAs. For example:

mean(c(2, 3, 4, NA, 30)) # Gives NA

## [1] NA

mean(c(2, 3, 4, NA, 30), na.rm = TRUE) # Gives a mean

## [1] 9.75

na.rm = TRUE means that we don’t want to include NAs in the calculation. We will get the mean of values, not including the missing data.

cyclingStudy %>%
    group_by(timepoint) %>%
    summarise(m = mean(sj.max, na.rm = TRUE)) %>%
    print()

A grouping can be formed from multiple variables. Let’s say that we want to extend our analysis to time-point and group. We also want to include the standard deviation.

cyclingStudy %>%
    group_by(timepoint, group) %>%
    summarise(m = mean(sj.max, na.rm = TRUE), 
              s = sd(sj.max, na.rm = TRUE)) %>%
    print()
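To check by hand what group_by() and summarise() do, it can help to run them on a data frame that is small enough to calculate from memory. The sketch below uses hypothetical toy values (not the cycling data). The column n, which counts non-missing values with sum(!is.na()), is an addition for this example.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(timepoint = c("pre", "pre", "pre", "meso1", "meso1", "meso1"),
              sj.max = c(30, 32, NA, 31, 33, 35))

toy_summary <- toy %>%
    group_by(timepoint) %>%
    summarise(m = mean(sj.max, na.rm = TRUE),  # mean per time-point
              s = sd(sj.max, na.rm = TRUE),    # standard deviation per time-point
              n = sum(!is.na(sj.max)))         # number of non-missing values

print(toy_summary)
```

The pre group has a mean of 31 based on two values, and the meso1 group has a mean of 33 based on three values.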

Group by and mutate

Group by can also be used with the mutate function. Let’s say that we want to calculate the squat jump height as a percentage of the time-point specific mean.

\[\text{Squat jump}_t~\%~\text{of mean}_t = \frac{\text{Squat jump}_t}{\text{mean}_t} \times 100\] where the subscript \(t\) tells us that it is the squat jump at a specific time-point (t).

cyclingStudy %>%
    group_by(timepoint) %>%
    mutate(sqjmp = (sj.max / mean(sj.max, na.rm = TRUE)) * 100) %>%
    select(subject, timepoint, sj.max, sqjmp) %>%
    print()
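Unlike summarise(), a grouped mutate() keeps every row and only uses the groups when calculating. The sketch below shows this on hypothetical toy values (not the cycling data). The call to ungroup(), which removes the grouping afterwards, is an addition for this example.

```r
library(dplyr)

# Hypothetical toy data: two subjects at two time-points
toy <- tibble(subject = c(1, 2, 1, 2),
              timepoint = c("pre", "pre", "meso1", "meso1"),
              sj.max = c(30, 34, 33, 39))

toy_pct <- toy %>%
    group_by(timepoint) %>%
    # mean(sj.max) is calculated within each time-point
    mutate(sqjmp = (sj.max / mean(sj.max, na.rm = TRUE)) * 100) %>%
    ungroup()

print(toy_pct)
```

The mean at pre is 32, so the two pre rows become 93.75 and 106.25 percent of the mean.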

It is a good exercise to actually read the code before you execute it. Read a pipe as “take the data, then do group by, then do…”.

Pivot longer and wider

Data are not always in tidy form, meaning that we do not have one observation per row and one variable per column. The cycling data set contains such a situation, as several lactate measurements are gathered at the same time-point. If we want to model lactate threshold, this is a problem.

Let’s first select variables that we may use to do lactate analysis. In the cycling data set lactate measurements were collected in a submaximal test from 125 to 375 watts. The variables lac.125, lac.150, lac.175, lac.200 and so on are lactate values from each intensity (watt).

cyclingStudy %>%
    select(subject, timepoint, lac.125:lac.375) %>%
    print()

The select function selects subject, timepoint and columns lac.125 to lac.375.

The above code does not produce a tidy data set (each observation a row, each column a variable, each cell a value). The data set can be said to be in wide format: a variable (watt) is spread over multiple columns. We can use the function pivot_longer() to get the data into long form.

pivot_longer() takes three important arguments (together with several others). names_to specifies the name of the column that gets all the previous column names (think of it as “column names to”). values_to specifies the name of the column that gets the values. The cols argument specifies which columns to use in the operation.

cyclingStudy %>%
    select(subject, timepoint, lac.125:lac.375) %>%
    pivot_longer(names_to = "watt", values_to = "lactate", cols = lac.125:lac.375) %>%
    print()

Notice that the watt column has the exact names of the previous columns. We can remove the prefix by using the names_prefix argument.

cyclingStudy %>%
    select(subject, timepoint, lac.125:lac.375) %>%
    pivot_longer(names_to = "watt", 
                 values_to = "lactate", 
                 cols = lac.125:lac.375, 
                 names_prefix = "lac.") %>%
    print()

Using names_prefix = "lac." removes “lac.” from all values. Notice that the variable is still a character variable indicated by <chr> under the variable name.

We need to use names_transform to tell R that the new column should be a numeric value.

cyclingStudy %>%
    select(subject, timepoint, lac.125:lac.375) %>%
    pivot_longer(names_to = "watt", 
                 values_to = "lactate", 
                 cols = lac.125:lac.375, 
                 names_prefix = "lac.", 
                 names_transform = list(watt = as.numeric)) %>%
    print()

This is a bit non-intuitive. names_transform takes a named list in which each name is a new column (here watt) and each value is the function used to convert that column (here as.numeric).

We now have a tidy data set that can be used for modeling/visualizations.
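To see exactly what moves where, the same pivot can be run on a data frame small enough to follow by eye. This is a minimal sketch with hypothetical toy values (not the cycling data).

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per subject, one column per intensity
toy_wide <- tibble(subject = c(1, 2),
                   lac.125 = c(1.1, 0.9),
                   lac.150 = c(1.4, 1.2))

toy_long <- toy_wide %>%
    pivot_longer(cols = lac.125:lac.150,       # columns to gather
                 names_to = "watt",            # old column names go here
                 values_to = "lactate",        # cell values go here
                 names_prefix = "lac.",        # strip the prefix from the names
                 names_transform = list(watt = as.numeric))

print(toy_long)
```

Two rows and two lactate columns become four rows, one per subject and intensity.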

Pivot wider

Sometimes it is convenient to be able to make a data set wide. For example, even though other solutions exist, we might want a wide data set to calculate the percentage change in a variable. Let’s say sj.max is to be expressed as a percentage change from the pre value. Let’s select the needed variables.

cyclingStudy %>%
    select(subject, timepoint, sj.max) %>%
    print()

pivot_wider does the opposite of pivot_longer. Similarly, names_from and values_from specify which columns should be used to make the data set wide.

cyclingStudy %>%
    select(subject, timepoint, sj.max) %>%
    pivot_wider(names_from = timepoint, values_from = sj.max) %>%
    print()

There are several more arguments not needed in the simple case. Look at ?pivot_wider and vignette("pivot") for more.

From the resulting data set we can calculate a percentage change by using mutate().

cyclingStudy %>%
    select(subject, timepoint, sj.max) %>%
    pivot_wider(names_from = timepoint, values_from = sj.max) %>%
    mutate(change = ((meso3/pre)-1)*100) %>%
    print()

The calculation looks a bit complicated, but it calculates the percentage change from pre to meso3. First we divide meso3 by pre; if meso3 is greater than pre we will have a number larger than 1. Next we subtract 1, which gives a fraction that represents the change from pre to meso3, and we multiply by 100 to express it as a percentage.
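The arithmetic can be checked on numbers that are easy to calculate by hand. The sketch below uses hypothetical toy values (not the cycling data): subject 1 goes from 30 to 33, which is an increase of 10 percent, and subject 2 goes from 40 to 38, which is a decrease of 5 percent.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: two subjects measured at pre and meso3
toy_long <- tibble(subject = c(1, 1, 2, 2),
                   timepoint = c("pre", "meso3", "pre", "meso3"),
                   sj.max = c(30, 33, 40, 38))

toy_change <- toy_long %>%
    pivot_wider(names_from = timepoint, values_from = sj.max) %>%
    mutate(change = ((meso3 / pre) - 1) * 100)  # percentage change from pre

print(toy_change)
```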

Save pipe operations

Above we have built pipes, function by function. This is a good idea: we can add one function at a time and use print to look at the results. But at some point, we want to save the data set or the results from the pipe. We can use the assignment operator at the top of the pipe or at the bottom. Remember that the arrow can point to the left and to the right; this means that the examples below give the same result.

percentage_change <- cyclingStudy %>%
    select(subject, timepoint, sj.max) %>%
    pivot_wider(names_from = timepoint, values_from = sj.max) %>%
    mutate(change = ((meso3/pre)-1)*100) %>%
    print()


cyclingStudy %>%
    select(subject, timepoint, sj.max) %>%
    pivot_wider(names_from = timepoint, values_from = sj.max) %>%
    mutate(change = ((meso3/pre)-1)*100) %>%
    print() -> percentage_change

I like the first best because it is easier to read the “intention of the code.”