Summarising data

“Summary” functions

R has several “summary” functions used to summarise data into descriptive characteristics. These could describe the location (mean, median, mode), spread (variance, standard deviation) or rank (minimum, quantile, maximum).

A vector of numeric values (integers or decimal numbers) can be summarised using a set of functions in R that gives us several summary statistics.

# Set a seed for the random number generator
set.seed(1)
# Generate 10 random numbers from a normal distribution with mean 0 and SD 1
x <- rnorm(10)
# Calculate summary statistics
summary(x)
fivenum(x)
What statistics?

What statistics do we get from summary and fivenum? Use the R help pages to explore this!

The summary function gives us a multi-statistic summary of vectors (or data frames). We can get these numbers using built in functions also.

Function call Statistic
mean() Mean
median() Median
sd() Standard deviation
var() Variance
min() Minimum
max() Maximum
quantile() Quantile
What statistics?

Use summary functions above to calculate the summary statistics used in summary(x).

Missing values

A common feature of summary functions are the inability to calculate the mean from a set of values that contain missing values.

To overcome this problem we need to add the na.rm = TRUE argument to our summary function.

E.g.:

x <- c(rnorm(10), NA) 

mean(x)

mean(x, na.rm = TRUE)

Summaries in a pipe

dplyr has a function designed to create summaries. The summarise function will use “summary” functions that returns a single value to summarise the data set.

library(tidyverse)
library(exscidata)


cyclingstudy %>%
        summarise(m = mean(VO2.max, na.rm = TRUE))

Summaries can be create on grouped data frames. A grouped data frame has additional meta data that groups the data set and many dplyr verbs will use the grouping when performing its actions (e.g. mutate, filter, summarise).

To add a grouping to a data frame use group_by(var), where var is a variable you would want to group on.

Construct a grouped data frame and summarise?

Complete the following code chunk

cyclingstudy %>%
        # select participant, time-point and VO2max
        select(subject, timepoint, VO2.max) %>%
        # Group the data frame by timepoint
        
        # Summarise with mean and standard deviation for vo2max
        
        # Print the results
        print()

Another alternative is to use the .by argument in a summary function. To group a summary by timepoint and group we would do

cyclingstudy %>%
        summarise(.by = c(timepoint, group), 
                  m = mean(VO2.max, na.rm = TRUE), 
                  s = sd(VO2.max, na.rm = TRUE))
Summarise data?
  1. What is the average (mean) cmj.max height in group == "INCR" at timepoint == "meso2"?
  2. What is the standard deviation of VO2.max at time-point meso3?

Summarise the number of observations per group

The n() function can be used to give us the group size of a grouped summary.

cyclingstudy %>%
        summarise(n = n(), 
                  .by = c(group, timepoint))

Re-create a summary table

Work in pairs to reproduce this summary data frame:

# A tibble: 3 × 8
  group m.age sd.age m.height sd.height m.weight sd.weight     n
  <chr> <dbl>  <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <int>
1 INCR   34.7   4.79     181       7.87     81.3      7.86     7
2 DECR   38.4   5.59     178.      4.67     83.5     10.7      7
3 MIX    37.8   7.94     179       5.90     75.3      9.87     6