13 Categorical predictors and multiple regression
We have up to now used a single continuous predictor to predict a dependent variable. We will now show that the ordinary regression model can be extended and modified to perform statistical tests such as the t-test. The ordinary regression model is very flexible!
13.1 Linear models can be used instead of t-tests
t-tests are designed to compare means. A question you might want to answer using a t-test is how unlikely the results from your test are if there were no true differences between values from two groups (independent t-test), between two samples from the same individuals (paired sample t-test), or between a group and a specific value (one sample t-test). Another way of describing these tests is: tests of differences from zero in the one-sample case, or differences between groups with paired or unpaired observations.
We can perform t-tests on data from the cyclingstudy data set. In the code below we select the squat jump variable (sj.max) and filter it from two time-points. pivot_wider is used to put squat jump performance from the two time-points in separate columns. A change score is then calculated.
library(tidyverse)
library(exscidata)

cyc_select <- cyclingstudy %>%
  # select a subset of variables, squat jump max is the outcome of interest.
  select(subject, timepoint, sj.max) %>%
  # time-points of interest
  filter(timepoint %in% c("pre", "meso3")) %>%
  # spread the data based on time-points
  pivot_wider(names_from = timepoint,
              values_from = sj.max) %>%
  # create a change score
  mutate(change = meso3 - pre)
The data above may be used to perform a paired sample t-test or a one sample t-test. These are basically equivalent. In the first case we use both vectors of numbers and test if the difference between them is different from zero. In the second case we calculate the differences first and then test if the mean change score is different from zero. In the paired sample t-test, the argument paired = TRUE must be added to the t.test call to make sure you do a paired comparison. In the one-sample case we have to set the mean we want to compare to, in this case zero (mu = 0).
paired <- t.test(cyc_select$meso3, cyc_select$pre, paired = TRUE)

one_sample <- t.test(cyc_select$change, mu = 0)
These tests are equal, as we can see from the comparison below. They are also equivalent to a linear model where we simply model the mean of the change score. When fitting the change variable without any predictors we estimate a single parameter in our model, the intercept. The intercept coefficient can be used to test against the same null hypothesis (change not different from zero), and the same results will appear. We can add all results to the same table (Table 13.1).
linear_model <- lm(change ~ 1, data = cyc_select)
library(gt)
data.frame(test = c("Paired sample t-test", "One sample t-test", "Linear model"),
t.value = c(paired$statistic, one_sample$statistic, coef(summary(linear_model))[1, 3]),
p.value = c(paired$p.value, one_sample$p.value, coef(summary(linear_model))[1, 4]),
estimate = c(paired$estimate, one_sample$estimate, coef(summary(linear_model))[1, 1]),
lwr.ci = c(paired$conf.int[1], one_sample$conf.int[1], confint(linear_model)[1]),
upr.ci = c(paired$conf.int[2], one_sample$conf.int[2], confint(linear_model)[2]),
se = c(paired$stderr, one_sample$stderr, coef(summary(linear_model))[1, 2])) %>%
as_tibble() %>%
gt() %>%
fmt_auto() %>%
cols_label(test = "Test", t.value = md("*t*-value"), p.value = md("*p*-value"), estimate = "Estimate", lwr.ci = "Lower CI", upr.ci = "Upper CI", se = "Standard error")
Test | t-value | p-value | Estimate | Lower CI | Upper CI | Standard error
---|---|---|---|---|---|---
Paired sample t-test | −1.788 | 0.091 | −0.891 | −1.938 | 0.156 | 0.498
One sample t-test | −1.788 | 0.091 | −0.891 | −1.938 | 0.156 | 0.498
Linear model | −1.788 | 0.091 | −0.891 | −1.938 | 0.156 | 0.498
We can also test if there is a true difference in VO2.max change between the INCR and DECR groups using a t-test.
cyc_select <- cyclingstudy %>%
  # Select appropriate variables and filter time-points
  select(subject, group, timepoint, VO2.max) %>%
  filter(timepoint %in% c("pre", "meso3"),
         group != "MIX") %>%
  # make the data set wider and calculate a change score
  pivot_wider(names_from = timepoint,
              values_from = VO2.max) %>%
  mutate(change = meso3 - pre) %>%
  print()
# A tibble: 14 × 5
subject group pre meso3 change
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 INCR 5629 5330 -299
2 2 DECR 4471 4886 415
3 3 INCR 5598. 6380 782.
4 4 DECR 4944. 5171 227.
5 5 DECR 5748 NA NA
6 6 INCR 4633. 4864 231.
7 10 INCR 5226 5755 529
8 13 DECR 4981. 5045 63.6
9 15 INCR 4486 4991 505
10 16 INCR 4604 5087 483
11 18 DECR 5135. 5315 180.
12 19 DECR 4018. 4404 386.
13 20 INCR 4739 5119 380
14 21 DECR 4753. 4730 -22.5
# Perform a two-sample t-test
t.test(change ~ group, data = cyc_select, var.equal = TRUE)
Two Sample t-test
data: change by group
t = -1.0703, df = 11, p-value = 0.3074
alternative hypothesis: true difference in means between group DECR and group INCR is not equal to 0
95 percent confidence interval:
-503.6842 174.0833
sample estimates:
mean in group DECR mean in group INCR
208.2119 373.0123
Above we use the formula method to specify the t-test (see ?t.test). var.equal = TRUE tells the t.test function to assume equal variances between groups.
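As a quick check of that assumption (a minimal sketch using the cyc_select data created above), we can compare the spread of the change scores in the two groups:
# Compare the standard deviation of the change score in each group;
# var.equal = TRUE treats these as estimates of a common variance.
cyc_select %>%
  group_by(group) %>%
  summarise(sd_change = sd(change, na.rm = TRUE))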
The same result will appear when we fit the data to a linear model. The summary function is a generic function, meaning that many types of R objects have summary methods. From summary we get a regression table of estimates. The first row is the intercept, which we can interpret as the mean change in one of the groups (DECR). This row has all the statistics associated with the estimate, including the average (Estimate), standard error, t-value and p-value.
The second row represents the difference between groups. The INCR group has a change score that is 164.8 units larger than that of the DECR group. The associated statistics can be used to assess whether this difference would be surprisingly large if the null hypothesis were actually true.
lin_mod <- lm(change ~ group, data = cyc_select)

summary(lin_mod)
Call:
lm(formula = change ~ group, data = cyc_select)
Residuals:
Min 1Q Median 3Q Max
-672.01 -141.94 18.76 155.99 409.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 208.2 113.0 1.843 0.0924 .
groupINCR 164.8 154.0 1.070 0.3074
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 276.7 on 11 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.09433, Adjusted R-squared: 0.01199
F-statistic: 1.146 on 1 and 11 DF, p-value: 0.3074
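As a small sanity check (a sketch assuming the lin_mod object fitted above), the coefficients can be combined to recover the group means reported in the t-test output:
# The intercept is the mean change in the reference group (DECR); adding the
# groupINCR coefficient gives the mean change in the INCR group.
coef(lin_mod)[1]                     # mean change, DECR
coef(lin_mod)[1] + coef(lin_mod)[2]  # mean change, INCR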
If you compare the two tests, do they tell you the same?
Using var.equal = TRUE in the unpaired t-test we assumed that the variation was similar in both groups. This might not be the case, and R uses the Welch two-sample t-test by default, which does not assume equal variance between groups. Even the Welch two-sample t-test can be replicated using a linear model. However, we have to specify it in a slightly different framework using the gls() function from the nlme package.
library(nlme)

welch_twosample <- t.test(change ~ group, data = cyc_select, var.equal = FALSE)

lin_mod_var <- gls(change ~ group, data = cyc_select,
                   weights = varIdent(form = ~1|group),
                   na.action = na.exclude, method = "ML")

welch_twosample
summary(lin_mod_var)
You are not required to master gls at this time. It does, however, show that the linear model framework is very flexible, as in this case it can also be adjusted to take care of heteroscedasticity.
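To see the correspondence for yourself (a minimal sketch assuming the welch_twosample and lin_mod_var objects created above), we can place the two group-difference estimates side by side; they should agree closely:
# t-value from the Welch two-sample t-test
welch_twosample$statistic
# Coefficient table from the gls fit; the groupINCR row holds the estimated
# difference between groups, its standard error and t-value.
summary(lin_mod_var)$tTable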
13.2 Dummy variables
The group variable that we used in the code above introduces something new to the linear model, namely dummy variables. When we put a categorical variable in the lm command, R will code it as a dummy variable. This variable will be 0 if the group corresponds to the first level of the categorical variable (coded as a factor variable) and 1 if it is the second level.
In the simplest case (as above) we will get a linear model looking like this:
\[y_i = \beta_0 + \beta_1x_i\]
Where \(x\) is the grouping variable, set to 0 when we have the reference group and 1 when we have the second-level group. The coefficient \(\beta_1\) only kicks in when the group is 1. This also means that when group = 0, \(y\) corresponds to the intercept of the model. If group = 1 we have the intercept + the slope. The slope therefore represents the difference between the reference group (group = 0) and the group coded 1.
If the grouping variable had more levels, R would add more dummy variables to create several comparisons between the reference group and the other levels of the variable, as illustrated below.
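A minimal sketch of what R does behind the scenes: model.matrix() returns the design matrix, where each dummy variable appears as a 0/1 column next to the intercept. Using the full cyclingstudy data (three groups), two dummy variables are created:
# Design matrix for a model with group as the only predictor; DECR is the
# reference level, so groupINCR and groupMIX are the dummy variables.
model.matrix(~ group, data = filter(cyclingstudy, timepoint == "pre")) %>%
  head()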
13.2.1 An exercise
Using all groups in the data set, fit a model and interpret the results.
A possible solution
cyc_subset <- cyclingstudy %>%
  select(subject, group, timepoint, VO2.max) %>%
  filter(timepoint %in% c("pre", "meso3")) %>%
  pivot_wider(names_from = timepoint,
              values_from = VO2.max) %>%
  mutate(change = 100 * (meso3 - pre) / pre) %>%
  print()

mod <- lm(change ~ group, data = cyc_subset)

summary(mod)
# The `DECR` group is the reference group, the intercept shows the mean of this group. Each parameter shows the difference from the reference.
# The same assumptions are made with these kinds of models and they can be checked with the same methods as described above.
13.3 Multiple regression
Contrary to the t-tests used above, the linear model can be extended by adding predictor variables (independent variables). When multiple independent variables are included in the model, we control for their relationship to the dependent variable when we evaluate each of the other variables. As in univariate regression, we can examine each individual parameter from the summary.
In a previous example we used height.T1 to predict VO2.max. We might want to add information to the model. We might wonder if the age (age) of participants has a relationship with VO2max. To fit this model, use the code below.
cycling_data <- cyclingstudy %>%
  # select(subject, timepoint, VO2.max, weight.T1, height.T1) %>%
  filter(timepoint == "pre") %>%
  print()

mod1 <- lm(VO2.max ~ height.T1 + age, data = cycling_data)

summary(mod1)
From the output we can see that there is a negative relationship: as age increases, VO2max decreases. We can compare this model to a simpler model by looking at the \(R^2\) value. We fit the simpler model below.
mod0 <- lm(VO2.max ~ height.T1, data = cycling_data)

summary(mod0)
We can interpret \(R^2\) as the percentage of the variation explained by the model. When we add more variables to the model, we add information that collectively explains a larger portion of the observed data. When adding variables we also face the risk of over-fitting our model: with enough variables the model will explain the observed data with less and less uncertainty, but new data will probably not validate the model.
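For a rough comparison (a sketch assuming the mod0 and mod1 objects fitted above), we can extract \(R^2\) together with the adjusted \(R^2\), which penalizes the extra predictor:
# R-squared never decreases when predictors are added;
# adjusted R-squared only increases if the added predictor is worth its cost.
summary(mod0)$r.squared
summary(mod1)$r.squared
summary(mod0)$adj.r.squared
summary(mod1)$adj.r.squared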
The same assumptions apply to the multiple regression model as with the univariate regression model. We have to take care that we have homoscedasticity, independent observations and normally distributed errors.
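These assumptions can be inspected with the same tools as before; a minimal sketch using the mod1 object fitted above:
# Standard diagnostic plots: residuals vs fitted (homoscedasticity and linearity),
# normal Q-Q plot of residuals, scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(mod1)
par(mfrow = c(1, 1))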