Data wrangling

Tidy tabular data

  • Tidy tabular data has
    • One variable per column
    • One observation per row
    • One value per cell
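
As a rough illustration of these three rules, the sketch below reshapes a small table from a wide layout into a tidy one. The data values, the column names, and the choice of tidyr::pivot_longer() are assumptions made for this example and do not come from the slides.

```r
library(tidyr)

# Made-up data: one row per participant, one column per timepoint.
# Not tidy, because the variable "timepoint" is spread across column names.
wide <- data.frame(
  participant = c("p1", "p2"),
  pre  = c(50, 55),
  post = c(54, 58)
)

# Tidy version: one variable per column, one observation per row
tidy <- pivot_longer(
  wide,
  cols = c(pre, post),
  names_to = "timepoint",
  values_to = "vo2max"
)

print(tidy)
```

Each row of the tidy table now holds one measurement of one participant at one timepoint.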

Data wrangling


“Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.”

Data wrangling

  • Aggregation through summaries
    • E.g. variables are summarized over multiple observations
  • Transformation of data
    • E.g. new variables are created on existing data
  • Arranging data
    • E.g. sorting data based on values of observations

dplyr

  • dplyr provides verbs for wrangling data
  • “Translate thoughts to code”
  • Using the package we can
    • mutate (create) new variables
    • select variables
    • filter observations
    • summarise values
    • arrange observations or rows
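
A minimal sketch of the five verbs, one call per verb. The data frame dat and its columns are made up for illustration; only the dplyr functions themselves are taken from the list above (plus group_by(), which is assumed here to get per-group summaries).

```r
library(dplyr)

# Made-up data for illustration only
dat <- data.frame(
  group  = c("a", "a", "b", "b"),
  weight = c(70, 80, 60, 90),
  vo2    = c(3500, 4400, 3000, 4500)
)

# mutate: create a new variable from existing ones
dat2 <- mutate(dat, vo2_kg = vo2 / weight)

# select: keep only some variables
dat3 <- select(dat2, group, vo2_kg)

# filter: keep observations that meet a condition
heavy <- filter(dat, weight > 65)

# summarise: collapse observations into summary values, here per group
means <- summarise(group_by(dat2, group), mean_vo2_kg = mean(vo2_kg))

# arrange: sort rows by the values of a variable, largest first
sorted <- arrange(dat2, desc(vo2_kg))
```

Every verb takes a data frame as its first argument and returns a data frame.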

R functions

  • Functions are created to perform a specific task
  • Functions exist in packages, or are defined by the user in the environment
  • Functions can take arguments as input
# Arguments are matched by name
important_function(arg1 = "A", arg2 = "B")

# Arguments are matched by position
important_function("A", "B")

# The function is used with its default argument values
important_function()
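
To make the three calling styles concrete, here is a small sketch that defines important_function itself. The function body and the default values "X" and "Y" are hypothetical, chosen only so that the calls can be run.

```r
# Hypothetical definition: two arguments, both with default values
important_function <- function(arg1 = "X", arg2 = "Y") {
  paste(arg1, arg2)
}

by_name     <- important_function(arg1 = "A", arg2 = "B")  # matched by name
by_position <- important_function("A", "B")                # matched by position
by_default  <- important_function()                        # defaults are used

# Named arguments can be given in any order
reordered <- important_function(arg2 = "B", arg1 = "A")

print(c(by_name, by_position, by_default, reordered))
```

Calling by name is more verbose but does not depend on remembering the argument order.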

R functions in tidyverse

  • Functions written specifically for the tidyverse are “pipeable”: the data argument comes first. An example:
# All arguments are named
pipe_function(data = my.data, arg2 = "a", arg3 = "b", arg4 = "etc")

# The first argument is not named; it is matched by position
pipe_function(my.data, arg2 = "a", arg3 = "b", arg4 = "etc")

# Default values are used for all arguments except the first, data
pipe_function(my.data)
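
The same convention can be followed in user-written functions. The sketch below defines a hypothetical helper, add_ratio(), with the data argument in the first position; the function, its argument names, and the example data are assumptions for illustration.

```r
# Hypothetical helper: data comes first, so the function is pipeable
add_ratio <- function(data, numerator, denominator, name = "ratio") {
  data[[name]] <- data[[numerator]] / data[[denominator]]
  data
}

my.data <- data.frame(x = c(10, 20), y = c(2, 5))

# All arguments named
res1 <- add_ratio(data = my.data, numerator = "x", denominator = "y")

# Data matched by position
res2 <- add_ratio(my.data, numerator = "x", denominator = "y")

# Data supplied by the base R pipe (requires R 4.1.0 or later)
res3 <- my.data |> add_ratio("x", "y")
```

Because the function returns the modified data frame, its output can be passed on to another data-first function.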

Data pipes

  • Using pipes we can execute data verbs in sequence
  • The pipe operator passes the object on its left-hand side as the first argument of the function on its right-hand side.
# This is equivalent...
data |>
        pipe_function() 
# ... to this
pipe_function(data) 

Why pipe?

  • Piping makes code more readable. An example:
# No pipes
print(fun_c(fun_b(fun_a(data))))

# Using pipes
data |>
        fun_a() |>
        fun_b() |>
        fun_c() |>
        print()
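
The comparison can be tried with real functions in place of fun_a, fun_b and fun_c. This sketch uses sum(), sqrt() and round() from base R on a made-up vector and assumes R 4.1.0 or later for the |> operator.

```r
x <- c(4, 5, 7)

# No pipes: read from the inside out
nested <- round(sqrt(sum(x)))

# Using pipes: read from top to bottom
piped <- x |>
  sum() |>
  sqrt() |>
  round()

# Both versions give the same result
identical(nested, piped)
```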

Pipes in R

  • Two pipe operators are available in R
    • |> is built into base R (version 4.1.0 and later)
    • %>% comes from the magrittr package and is loaded with the tidyverse

Two pipe operators in action

# Using the base R pipe
library(dplyr)

data |>
 filter(var1 > 10) |>
 mutate(var3 = var1 + var2) |>
 select(var1, varX) |>
 print()

# Using the magrittr pipe
library(dplyr)

data %>%
 filter(var1 > 10) %>%
 mutate(var3 = var1 + var2) %>%
 select(var1, varX) %>%
 print()
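
A runnable variant of the two versions above, using the built-in mtcars data set in place of the placeholder data. The choice of columns and the derived variable hp_per_wt are assumptions for illustration.

```r
library(dplyr)

# Base R pipe
base_pipe <- mtcars |>
  filter(mpg > 25) |>
  mutate(hp_per_wt = hp / wt) |>
  select(mpg, hp_per_wt)

# magrittr pipe
magrittr_pipe <- mtcars %>%
  filter(mpg > 25) %>%
  mutate(hp_per_wt = hp / wt) %>%
  select(mpg, hp_per_wt)

# The two pipelines produce the same data frame
identical(base_pipe, magrittr_pipe)
```

For simple chains like this, where the data is always the first argument, the two operators can be used interchangeably.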

Data placeholder

  • If the data argument is not the first argument in a function, use a placeholder
data %>% 
        fun(argument)

## Is equivalent to
data %>% 
        fun(., argument)


## If the data is to be used in another place we need the placeholder
data %>% 
        fun(argument, data = .) 
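
The examples above use the magrittr placeholder, a dot. The base R pipe has its own placeholder, an underscore, available from R 4.2.0 and only usable with a named argument. The sketch below fits the same model both ways; the mtcars data and the model formula are assumptions for illustration.

```r
library(magrittr)

# magrittr pipe: the placeholder is a dot
fit_magrittr <- mtcars %>%
  lm(mpg ~ wt, data = .)

# Base R pipe: the placeholder is an underscore and must be passed
# to a named argument (R 4.2.0 or later)
fit_base <- mtcars |>
  lm(mpg ~ wt, data = _)

# Both fits have the same coefficients
all.equal(coef(fit_magrittr), coef(fit_base))
```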

Data placeholder: a realistic example, and saving output

library(exscidata); library(tidyverse)

# Specify an object for saving the output, starting from the cycling study data
model1 <- cyclingstudy %>%
        # Filter to keep only pre-intervention data
        filter(timepoint == "pre") %>%
        # Create a new variable, VO2max relative to body mass
        mutate(VO2max.kg = VO2.max / weight.T1) %>%
        # Fit a linear regression model explaining time to exhaustion (tte) with VO2max
        lm(tte ~ VO2max.kg, data = .)

# Show the summary of the model
summary(model1)