Data wrangling

Tidy tabular data

Tidy tabular data has
- One variable per column
- One observation per row
- One value per cell
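
As a minimal sketch of these three rules, the same made-up measurements are shown below in a tidy and a non-tidy layout. The data, the object names and the column names are all hypothetical and only serve as illustration.

```r
# Tidy: one variable per column, one observation per row, one value per cell.
# Each row is one participant measured at one time point (made-up values).
tidy_data <- data.frame(
        participant = c("p1", "p1", "p2", "p2"),
        timepoint   = c("pre", "post", "pre", "post"),
        weight_kg   = c(70.2, 69.8, 81.5, 80.9)
)

# Not tidy: the variable "timepoint" is hidden in the column names,
# so each row holds two observations
wide_data <- data.frame(
        participant = c("p1", "p2"),
        weight_pre  = c(70.2, 81.5),
        weight_post = c(69.8, 80.9)
)

tidy_data
wide_data
```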

Data wrangling

“Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.”

Data wrangling

Aggregation through summaries
- E.g. variables are summarized over multiple observations
Transformation of data
- E.g. new variables are created on existing data
Arranging data
- E.g. sorting data based on values of observations

`dplyr`

dplyr provides verbs for wrangling data
“Translate thoughts to code”
Using the package we can
- mutate (create) new variables
- select variables
- filter observations
- summarise values
- arrange observations or rows
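
The five verbs can be sketched one at a time on the built-in mtcars data set. The choice of data set and the names wrangled, hp_per_wt and mpg_summary are assumptions made for this illustration, not part of the original slides.

```r
library(dplyr)

# mutate: create a new variable from existing ones
wrangled <- mutate(mtcars, hp_per_wt = hp / wt)

# select: keep only some variables
wrangled <- select(wrangled, mpg, cyl, hp_per_wt)

# filter: keep observations that meet a condition
wrangled <- filter(wrangled, cyl %in% c(4, 8))

# summarise: collapse many observations into summary values
mpg_summary <- summarise(wrangled, mean_mpg = mean(mpg), n = n())

# arrange: sort observations by the values of a variable
wrangled_sorted <- arrange(wrangled, desc(mpg))

mpg_summary
head(wrangled_sorted)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what later makes them easy to chain.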

R functions

Functions are created to perform a specific task
Functions exist in packages, or are defined by the user in the current environment
Functions can take arguments as input

important_function(arg1 = "A", arg2 = "B") # 1
important_function("A", "B")               # 2
important_function()                       # 3

1: Arguments are named.
2: Arguments are matched by their position.
3: The function is used with its default argument values.
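
To make the three call styles concrete, here is a minimal sketch of a user-defined function with default argument values. The body of important_function and its defaults "X" and "Y" are made up for this example.

```r
# Hypothetical function: both arguments have default values
important_function <- function(arg1 = "X", arg2 = "Y") {
        paste(arg1, arg2, sep = "-")
}

important_function(arg1 = "A", arg2 = "B") # named arguments
important_function(arg2 = "B", arg1 = "A") # named arguments can come in any order
important_function("A", "B")               # arguments matched by position
important_function()                       # default values are used
```

The first three calls return the same value, while the last one falls back on the defaults.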

R functions in tidyverse

Functions written specifically for the tidyverse are “pipeable”: the data argument comes first. An example:

pipe_function(data = my.data, arg2 = "a", arg3 = "b", arg4 = "etc") # 1
pipe_function(my.data, arg2 = "a", arg3 = "b", arg4 = "etc")        # 2
pipe_function(my.data)                                              # 3

1: All arguments are named.
2: The first argument is not named; it is specified by position.
3: The function is used with default argument values, except for the first argument, data.
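
A data-first function does not have to come from the tidyverse. The sketch below defines a hypothetical helper, keep_top, in base R; its name, arguments and defaults are assumptions for illustration only.

```r
# Hypothetical data-first function: the data frame is the first argument,
# the remaining arguments have default values
keep_top <- function(data, by = "mpg", n = 3) {
        ordered <- data[order(data[[by]], decreasing = TRUE), ]
        head(ordered, n)
}

keep_top(data = mtcars, by = "hp", n = 5) # all arguments named
keep_top(mtcars, by = "hp", n = 5)        # data specified by position
keep_top(mtcars)                          # defaults for everything except data
```

Because the data comes first, a function written this way can also be placed on the right-hand side of a pipe.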

Data pipes

Using pipes we can execute data verbs in sequence
The pipe operator passes the “left hand” data to the first position in the following function.

# This is equivalent...
data |>
        pipe_function() 
# ... to this
pipe_function(data)

Why pipe?

Piping makes code more readable. An example:

# No pipes
print(fun_c(fun_b(fun_a(data))))

# Using pipes
data |>
        fun_a() |>
        fun_b() |>
        fun_c() |>
        print()
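
The same comparison with real base R functions, as a small sketch. The vector values is made up, and the native pipe requires R 4.1.0 or later.

```r
values <- c(4, 9, 16, 25)

# No pipes: the code is read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Using pipes: the code is read from top to bottom
piped <- values |>
        sqrt() |>
        mean() |>
        round(1)

identical(nested, piped)
```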

Pipes in R

Two pipe operators are available in R
- |> is part of base R (version 4.1.0 and later)
- %>% comes from the magrittr package and is loaded with the tidyverse

Two pipe operators in action

library(dplyr)

data |>
 filter(var1 > 10) |>
 mutate(var3 = var1 + var2) |>
 select(var1, varX) |>
 print()

library(dplyr)

data %>% 
 filter(var1 > 10) %>% 
 mutate(var3 = var1 + var2) %>% 
 select(var1, varX) %>% 
 print()
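
A runnable version of the same pattern, sketched on the built-in mtcars data set. The data set, the cutoff of 25 and the new variable hp_per_cyl are assumptions chosen for illustration.

```r
library(dplyr) # also makes %>% available

# Native pipe
with_native <- mtcars |>
        filter(mpg > 25) |>
        mutate(hp_per_cyl = hp / cyl) |>
        select(mpg, hp_per_cyl)

# magrittr pipe
with_magrittr <- mtcars %>%
        filter(mpg > 25) %>%
        mutate(hp_per_cyl = hp / cyl) %>%
        select(mpg, hp_per_cyl)

# Both pipelines give the same result
identical(with_native, with_magrittr)
```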

Data placeholder

If the data argument is not the first argument in a function, use a placeholder

data %>% 
        fun(argument)

## Is equivalent to
data %>% 
        fun(., argument)


## If the data is to be used in another place we need the placeholder
data %>% 
        fun(argument, data = .)
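
As a concrete sketch, lm() takes the formula first and the data second, so the placeholder is needed. The native pipe is not covered in the placeholder examples above; it has its own placeholder, _, available from R 4.2.0, which must be passed to a named argument. The model and the mtcars data are chosen only for illustration.

```r
library(magrittr) # provides %>% and the . placeholder

# magrittr pipe: . marks where the left-hand data goes
fit_magrittr <- mtcars %>%
        lm(mpg ~ wt, data = .)

# Native pipe (R 4.2.0 or later): _ marks where the left-hand data goes
fit_native <- mtcars |>
        lm(mpg ~ wt, data = _)

# Both fits estimate the same coefficients
all.equal(coef(fit_magrittr), coef(fit_native))
```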

Data placeholder: a realistic example, and saving output

library(exscidata); library(tidyverse)
model1 <- cyclingstudy %>%                          # 1
        filter(timepoint == "pre") %>%              # 2
        mutate(VO2max.kg = VO2.max / weight.T1) %>% # 3
        lm(tte ~ VO2max.kg, data = .)               # 4
summary(model1)                                     # 5

1: Assign the output to an object, starting from the cycling study data
2: Filter to keep only pre-intervention data
3: Create a new variable, VO2max relative to body mass
4: Fit a linear regression model explaining time to exhaustion (tte) with VO2max relative to body mass
5: Show the summary of the model
