13  Categorical predictors and multiple regression

We have up to now used a single continuous predictor to predict a dependent variable. We will now show that the ordinary regression model can be extended and modified to perform statistical tests such as the t-test. The ordinary regression model is very flexible!

13.1 Linear models can be used instead of t-tests

t-tests are designed to compare means. A question you might want to answer using a t-test is how unlikely your observed result would be if there were no true difference between values from two groups (independent t-test), between two samples from the same individuals (paired sample t-test), or between a group and a specific value (one sample t-test). Another way of describing these tests is as tests of a difference from zero in the one-sample case, or of differences between groups with paired or unpaired observations.

We can perform t-tests on data from the cyclingstudy data set. In the code below we select the squat jump variable (sj.max) and filter the data to two time-points. pivot_wider is used to put squat jump performance from the two time-points in separate columns. A change score is then calculated.

library(tidyverse)
library(exscidata)


cyc_select <- cyclingstudy %>%
  # select a subset of variables, squat jump max is the outcome of interest.
        select(subject, timepoint, sj.max) %>%
  # time-points of interest
        filter(timepoint %in% c("pre", "meso3")) %>%
  # spread the data based on time-points
        pivot_wider(names_from = timepoint, 
                    values_from = sj.max) %>%
  # create a change score
        mutate(change = meso3 - pre) 

The data above may be used to perform a paired sample t-test or a one sample t-test. These are basically equivalent. In the first case we use both vectors of numbers and test if the difference between them is different from zero. In the second case we calculate the differences first and then test if the mean change score is different from zero. In the paired sample t-test, the argument paired = TRUE must be added to the t.test call to make sure you do a paired comparison. In the one-sample case we have to set the mean we want to compare to, in this case zero (mu = 0).

paired <- t.test(cyc_select$meso3, cyc_select$pre, paired = TRUE)
one_sample <- t.test(cyc_select$change, mu = 0)

These tests are equal, as we can see from the comparison below. They are also equivalent to a linear model where we simply model the mean of the change score. When fitting the change variable without any predictors we estimate a single parameter in our model, the intercept. The intercept coefficient can be used to test against the same null hypothesis (change not different from zero), and the same results will appear. We can add all results to the same table (Table 13.1).

linear_model <- lm(change ~ 1, data = cyc_select)

 
library(gt)

data.frame(test = c("Paired sample t-test", "One sample t-test", "Linear model"), 
           t.value = c(paired$statistic, one_sample$statistic, coef(summary(linear_model))[1, 3]), 
           p.value = c(paired$p.value, one_sample$p.value, coef(summary(linear_model))[1, 4]), 
           estimate = c(paired$estimate, one_sample$estimate, coef(summary(linear_model))[1, 1]), 
           lwr.ci = c(paired$conf.int[1], one_sample$conf.int[1], confint(linear_model)[1]),
           upr.ci = c(paired$conf.int[2], one_sample$conf.int[2], confint(linear_model)[2]),
           se = c(paired$stderr, one_sample$stderr, coef(summary(linear_model))[1, 2])) %>%
  tibble() %>%
  
  gt() %>%
  fmt_auto() %>%
  cols_label(test = "Test", t.value = md("*t*-value"), p.value = md("*p*-value"), estimate = "Estimate", lwr.ci = "Lower CI", upr.ci = "Upper CI", se = "Standard error")
Table 13.1: Comparing a paired sample t-test, a one sample t-test and a linear regression model

Test                   t-value  p-value  Estimate  Lower CI  Upper CI  Standard error
Paired sample t-test    −1.788    0.091    −0.891    −1.938     0.156           0.498
One sample t-test       −1.788    0.091    −0.891    −1.938     0.156           0.498
Linear model            −1.788    0.091    −0.891    −1.938     0.156           0.498
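Notice that in all three tests the t-value is simply the estimate divided by its standard error:

\[t = \frac{\text{Estimate}}{\text{SE}} = \frac{-0.891}{0.498} \approx -1.79\]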

We can also test if there is a true difference in VO2.max change between the groups INCR and DECR using a t-test.

cyc_select <- cyclingstudy %>%
  # Select appropriate variables and filter time-points
        select(subject,group, timepoint, VO2.max) %>%
        filter(timepoint %in% c("pre", "meso3"), 
               group != "MIX") %>%
  # make the data set wider and calculate a change score.
        pivot_wider(names_from = timepoint, 
                    values_from = VO2.max) %>%
        mutate(change = meso3-pre) %>%
        print()
# A tibble: 14 × 5
   subject group   pre meso3 change
     <dbl> <chr> <dbl> <dbl>  <dbl>
 1       1 INCR  5629   5330 -299  
 2       2 DECR  4471   4886  415  
 3       3 INCR  5598.  6380  782. 
 4       4 DECR  4944.  5171  227. 
 5       5 DECR  5748     NA   NA  
 6       6 INCR  4633.  4864  231. 
 7      10 INCR  5226   5755  529  
 8      13 DECR  4981.  5045   63.6
 9      15 INCR  4486   4991  505  
10      16 INCR  4604   5087  483  
11      18 DECR  5135.  5315  180. 
12      19 DECR  4018.  4404  386. 
13      20 INCR  4739   5119  380  
14      21 DECR  4753.  4730  -22.5
# Perform a two-sample t-test
t.test(change ~ group, data = cyc_select, var.equal = TRUE)

    Two Sample t-test

data:  change by group
t = -1.0703, df = 11, p-value = 0.3074
alternative hypothesis: true difference in means between group DECR and group INCR is not equal to 0
95 percent confidence interval:
 -503.6842  174.0833
sample estimates:
mean in group DECR mean in group INCR 
          208.2119           373.0123 

Above we use the formula method to specify the t-test (see ?t.test). var.equal = TRUE tells the t.test function to assume equal variances between groups.

The same result will appear when we fit the data to a linear model. The summary function is a generic function, meaning that many types of R objects have summary methods. From summary we get a regression table of estimates. The first row is the intercept, which we can interpret as the mean change in one of the groups (DECR). This row has all the statistics associated with this estimate, including the average (Estimate), standard error, t-value and p-value.

The second row represents the difference between groups. The INCR group has a change score that is 164.8 units larger than the DECR group. The associated statistics can be used to assess whether this difference would be surprisingly large if the null hypothesis were actually true.

lin_mod <- lm(change ~ group, data = cyc_select)

summary(lin_mod)

Call:
lm(formula = change ~ group, data = cyc_select)

Residuals:
    Min      1Q  Median      3Q     Max 
-672.01 -141.94   18.76  155.99  409.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    208.2      113.0   1.843   0.0924 .
groupINCR      164.8      154.0   1.070   0.3074  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 276.7 on 11 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.09433,   Adjusted R-squared:  0.01199 
F-statistic: 1.146 on 1 and 11 DF,  p-value: 0.3074

If you compare the two tests, do they tell you the same?
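As a quick check, we can compute the group means directly from the cyc_select data created above; the difference between them should match the groupINCR coefficient.

# Group means of the change score
cyc_select %>%
  group_by(group) %>%
  summarise(mean_change = mean(change, na.rm = TRUE))

# 373.0 - 208.2 = 164.8, the same as the `groupINCR` estimate above.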

Using var.equal = TRUE in the unpaired t-test we assumed that the variation was similar in both groups. This might not be the case, and R uses the Welch two-sample t-test by default, which does not assume equal variance between groups. Even the Welch two-sample t-test can be replicated using a linear model. However, we have to specify it in a slightly different framework using the gls() function from the nlme package.

library(nlme)

welch_twosample <- t.test(change ~ group, data = cyc_select, var.equal = FALSE)

lin_mod_var <- gls(change ~ group, data = cyc_select, weights = varIdent(form = ~1|group), na.action = na.exclude, method = "ML")


welch_twosample
summary(lin_mod_var)

You are not required to master gls at this time. It does, however, show that the linear model framework is very flexible, as it can in this case also be adjusted to handle heteroscedasticity.

13.2 Dummy variables

The group variable that we used in the code above introduces something new to the linear model, namely dummy variables. When we put a categorical variable in the lm call, R will code it as a dummy variable. This variable will be 0 if an observation belongs to the first level of the categorical variable (coded as a factor) and 1 if it belongs to the second level.

In the simplest case (as above) we will get a linear model looking like this:

\[y_i = \beta_0 + \beta_1x_i\]

Where \(x\) is the grouping variable, set to 0 for the reference group and 1 for the second level. The coefficient \(\beta_1\) only kicks in when the group is coded 1. This also means that when group = 0, \(y\) corresponds to the intercept of the model. If group = 1 we have the intercept + the slope. The slope thus represents the difference between the reference group (group = 0) and the group coded 1.
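In the VO2.max example above, this means that the predicted change in each group can be read directly from the coefficients in the regression table:

\[\hat{y}_{\text{DECR}} = \beta_0 = 208.2, \qquad \hat{y}_{\text{INCR}} = \beta_0 + \beta_1 = 208.2 + 164.8 = 373.0\]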

If the grouping variable had more levels, R would add more dummy variables, each comparing one of the remaining levels to the reference level.
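To see the dummy coding R creates behind the scenes, we can inspect the design matrix of the model fitted above using the standard model.matrix function:

# Inspect the design matrix of the two-group model fitted above
model.matrix(lin_mod)

# The first column is the intercept (all 1s); the second column, `groupINCR`,
# is 0 for participants in the DECR group and 1 for those in the INCR group.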

13.2.1 An exercise

Using all groups in the data set, fit a model and interpret the results.

A possible solution
cyc_subset <- cyclingstudy %>%
        select(subject,group, timepoint, VO2.max) %>%
        filter(timepoint %in% c("pre", "meso3")) %>%
        pivot_wider(names_from = timepoint, 
                    values_from = VO2.max) %>%
        mutate(change = 100 * (meso3-pre)/pre) %>%
        print()

mod <- lm(change ~ group, data = cyc_subset)

summary(mod)

# The `DECR` group is the reference group, the intercept shows the mean of this group. Each parameter shows the difference from the reference.

# The same assumptions are made with these kinds of models and they can be checked with the same methods as described above. 
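# Confidence intervals for the intercept and each group difference can be
# obtained with confint(); this is an optional extension of the solution above.
confint(mod)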

13.3 Multiple regression

In contrast to the t-tests used above, the linear model can be extended by adding more predictor variables (independent variables). When multiple independent variables are included in the model, we control for their relationships to the dependent variable when we evaluate each of the other variables. As with univariate regression, we can examine each individual parameter from the summary.

In a previous example we used height.T1 to predict VO2.max. We might want to add information to the model. We might wonder if the age (age) of the participants has a relationship with VO2.max. To fit this model, use the code below.

cycling_data <-  cyclingstudy %>%
       # select(subject, timepoint, VO2.max, weight.T1, height.T1) %>%
        filter(timepoint == "pre") %>%
        print()  
          

mod1 <- lm(VO2.max ~  height.T1 + age, data = cycling_data)

summary(mod1)

From the output we can see that there is a negative relationship: as age increases, VO2.max decreases. We can compare this model to the simpler model by looking at the \(R^2\) value. We first fit the simpler model.

mod0 <- lm(VO2.max ~  height.T1, data = cycling_data)

summary(mod0)

We can interpret \(R^2\) as the proportion of the variation explained by the model. When we add more variables to the model, we add information that collectively explains a larger portion of the observed data. When adding variables we also face the risk of over-fitting our model. With enough variables the model will explain the observed data with less and less uncertainty; however, new data will probably not validate the model.
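A minimal way to compare the two models is to extract the \(R^2\) values directly from the model summaries:

# Compare the proportion of variance explained by the two models
summary(mod0)$r.squared  # height.T1 only
summary(mod1)$r.squared  # height.T1 + age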

The same assumptions apply to the multiple regression model as with the univariate regression model. We have to take care that we have homoscedasticity, independent observations and normally distributed errors.
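As a starting point, the standard diagnostic plots for lm objects can be used to check these assumptions; here is a minimal sketch using mod1 from above:

# Residuals vs fitted values, a check of homoscedasticity
plot(mod1, which = 1)

# Normal Q-Q plot of the residuals, a check of normality
plot(mod1, which = 2)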