Chapter 1: Linear Mixed Models

Author

Aritz Adin and Jaione Etxeberria

Published

Jan/2026

Exercise 1: plasma data

Below are the results of a randomised complete block experiment to compare the effects on the clotting time of plasma (mins) of four different methods for the treatment of plasma (material extracted from Gallagher, 2023). Samples of plasma were taken from a random sample of 8 volunteers and were subjected to all 4 treatments.

                     Treatment
Volunteer      1      2      3      4
    1        8.4    9.4    9.8   12.2
    2       12.8   15.2   12.9   14.4
    3        9.6    9.1   11.2    9.8
    4        9.8    8.8    9.9   12.0
    5        8.4    8.2    8.5    8.5
    6        8.6    9.9    9.8   10.9
    7        8.9    9.0    9.2   10.4
    8        7.9    8.1    8.2   10.0

1. Load the data stored in the Plasma.txt file, convert the Volunteer and Treatment variables to factors, and make a descriptive plot to visualize differences between treatments and/or subjects.

Plasma <- read.table("Plasma.txt", header=TRUE)
head(Plasma, n=8)
  Volunteer Treatment Clotting
1         1         1      8.4
2         1         2      9.4
3         1         3      9.8
4         1         4     12.2
5         2         1     12.8
6         2         2     15.2
7         2         3     12.9
8         2         4     14.4
Plasma$Volunteer <- as.factor(Plasma$Volunteer)
Plasma$Treatment <- as.factor(Plasma$Treatment)
str(Plasma)
'data.frame':   32 obs. of  3 variables:
 $ Volunteer: Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ Treatment: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
 $ Clotting : num  8.4 9.4 9.8 12.2 12.8 15.2 12.9 14.4 9.6 9.1 ...

We can make a descriptive plot of the data using the ggplot2 package

library(ggplot2)

ggplot(Plasma, aes(x=Clotting, y=Volunteer, color=Treatment)) +
  geom_point() +
  labs(x="Clotting time (mins)", y="Volunteer") +
  theme_minimal()

We observe systematic differences between subjects and treatments.

2. Which variable should be included as a fixed effect, and which as a random effect? Make a design plot to visually compare the magnitude of the effects of the Treatment and Volunteer factors.

  • We want to compare these particular types of treatments (experimental factor), so we use fixed effects for the Treatment factor.

  • The eight subjects represent a sample from the population about which we wish to make inferences (random factor), so we use random effects to model the Volunteer factor.

plot.design(Clotting ~ Treatment + Volunteer, data=Plasma)

Average clotting time for each level of the factors Treatment and Volunteer
  • We see that the variability associated with the Treatment factor is lower than the variability associated with the Volunteer factor.

  • We also see that the average clotting time according to the treatment type is in the order \(T1 \leq T2 \leq T3 \leq T4\).

3. Write the linear mixed model equation.

\[y_{ij} = \beta_j + u_i + \epsilon_{ij}, \quad i=1,\ldots,8, \quad j=1,\ldots,4,\] \[u_i \sim N(0,\sigma^2_u), \quad \epsilon_{ij} \sim N(0,\sigma^2)\] where

  • \(\beta_j\) is the mean clotting time from the \(j\)-th treatment
  • \(u_i\) is a random variable associated with the \(i\)-th individual
  • \(\epsilon_{ij}\) are independent random errors

Using matrix notation

\[\boldsymbol{y} = \begin{pmatrix} \boldsymbol{I}_4 \\ \vdots \\ \boldsymbol{I}_4 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_4 \end{pmatrix} + \begin{pmatrix} \boldsymbol{1}_4 & \boldsymbol{0} & \ldots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{1}_4 & \ldots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{1}_4 \\ \end{pmatrix} \boldsymbol{u} + \boldsymbol{\epsilon}\]
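The design matrices in this formulation can be inspected directly in R. The following sketch (assuming the Plasma data have been loaded and converted to factors as above) builds them with model.matrix():

```r
## Fixed-effects design matrix X: stacked 4x4 identity blocks (one per volunteer)
X <- model.matrix(~ -1 + Treatment, data=Plasma)
## Random-effects design matrix Z: block-diagonal matrix of ones (one column per volunteer)
Z <- model.matrix(~ -1 + Volunteer, data=Plasma)
dim(X)  # 32 x 4
dim(Z)  # 32 x 8
```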

4. Test if random effects are necessary.

## Fit the models ##
library(lme4)

plasma.null <- lm(Clotting ~ -1 + Treatment, data=Plasma)
plasma.lmer <- lmer(Clotting ~ -1 + Treatment + (1|Volunteer), data=Plasma)

## LRT for the variance component sigma2_u ##
## (sigma2_u = 0 lies on the boundary of the parameter space, so the LRT
## statistic follows a 50:50 mixture of chi-squared distributions with 0 and 1 df)
LRT <- -2*(logLik(plasma.null,REML=T)-logLik(plasma.lmer,REML=T))
mean(1-pchisq(LRT,df=c(0,1)))
[1] 2.290542e-07
## We can also use the AIC/BIC to compare the models ##
data.frame(AIC=c(AIC(plasma.null), AIC(plasma.lmer)),
           BIC=c(BIC(plasma.null), BIC(plasma.lmer)),
           row.names=c("plasma.null","plasma.lmer"))
                 AIC      BIC
plasma.null 134.8699 142.1986
plasma.lmer 107.8852 116.6796
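Since testing \(\sigma^2_u = 0\) places the parameter on the boundary of its space, the chi-squared mixture above is only an approximation. As an alternative (a sketch, assuming the RLRsim package is installed), an exact finite-sample restricted likelihood ratio test for a single variance component is available:

```r
## Exact (simulation-based) restricted LRT for the variance component
library(RLRsim)
exactRLRT(plasma.lmer)
```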

5. Check if the treatment effect is significant.

We compare nested models (with and without the Treatment factor) using the LRT with the anova() function, based on fitting the models through the ML method

M1 <- lmer(Clotting ~ -1 + (1|Volunteer), data=Plasma, REML=FALSE)
M2 <- lmer(Clotting ~ -1 + Treatment + (1|Volunteer), data=Plasma, REML=FALSE)
anova(M1,M2)
Data: Plasma
Models:
M1: Clotting ~ -1 + (1 | Volunteer)
M2: Clotting ~ -1 + Treatment + (1 | Volunteer)
   npar    AIC   BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
M1    2 145.57 148.5 -70.784   141.568                         
M2    6 107.80 116.6 -47.902    95.804 45.764  4  2.757e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We conclude that the Treatment effect is significant.

6. Verify whether the model assumptions are satisfied.

a) Assessing assumptions on the within-group errors

plot(plasma.lmer)

plot(plasma.lmer, Volunteer ~ resid(., type="pearson"), abline=0, lty=2)

res <- resid(plasma.lmer, type="pearson")
qqnorm(res)
qqline(res)

shapiro.test(res)

    Shapiro-Wilk normality test

data:  res
W = 0.94578, p-value = 0.1093
  • We observe that the residuals are centred at zero and approximately normally distributed, but there seems to be a lack of constant variance in the residuals (heteroscedasticity).

b) Assessing assumptions on the random-effects

u <- unlist(ranef(plasma.lmer)$Volunteer)
qqnorm(u)
qqline(u)

shapiro.test(u)

    Shapiro-Wilk normality test

data:  u
W = 0.76565, p-value = 0.01215
  • Again, there are some issues with the normality assumption for the random effects.

7. Examine the model results (estimated fixed effects and variance components). Compute and interpret the intra-class correlation coefficient. Compute the predicted random effects and fitted values.

a) Estimated fixed effects and 95% confidence intervals

fixef(plasma.lmer)
Treatment1 Treatment2 Treatment3 Treatment4 
    9.3000     9.7125     9.9375    11.0250 
beta.CI <- confint(plasma.lmer, parm="beta_")
beta.CI
              2.5 %   97.5 %
Treatment1 8.000009 10.59999
Treatment2 8.412509 11.01249
Treatment3 8.637509 11.23749
Treatment4 9.725009 12.32499
i <- length(unique(Plasma$Volunteer))
j <- length(unique(Plasma$Treatment))

plot(1:j, fixef(plasma.lmer), pch=19, cex=0.5, xaxt="n",
     xlim=0.5+c(0,j), ylim=range(c(beta.CI))*c(0.9,1.1),
     xlab="Type", ylab="Estimated fixed effects")
axis(1, at=1:j, labels=rownames(beta.CI))
segments(1:j, beta.CI[,"2.5 %"], 1:j, beta.CI[,"97.5 %"])

We can use the emmeans() function to compute pairwise comparisons between treatments

library(emmeans)
emmeans(plasma.lmer, pairwise~Treatment, infer=T)$contrasts
 contrast                estimate    SE df lower.CL upper.CL t.ratio p.value
 Treatment1 - Treatment2   -0.412 0.405 21    -1.54   0.7162  -1.019  0.7405
 Treatment1 - Treatment3   -0.637 0.405 21    -1.77   0.4912  -1.574  0.4139
 Treatment1 - Treatment4   -1.725 0.405 21    -2.85  -0.5963  -4.260  0.0018
 Treatment2 - Treatment3   -0.225 0.405 21    -1.35   0.9037  -0.556  0.9440
 Treatment2 - Treatment4   -1.312 0.405 21    -2.44  -0.1838  -3.241  0.0189
 Treatment3 - Treatment4   -1.087 0.405 21    -2.22   0.0412  -2.686  0.0616

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 4 estimates 
P value adjustment: tukey method for comparing a family of 4 estimates 

Two out of six pairwise comparisons are statistically significant at the 5% level:

  • The mean clotting time for treatment 4 is significantly longer than that for treatments 1 and 2.

b) Estimated variance components and 95% confidence intervals

VarCorr(plasma.lmer)
 Groups    Name        Std.Dev.
 Volunteer (Intercept) 1.63005 
 Residual              0.80987 
confint(plasma.lmer, parm="theta_")
Computing profile confidence intervals ...
           2.5 %   97.5 %
.sig01 0.9562558 2.794239
.sigma 0.5849697 1.035247

Intra-class correlation coefficient: \(\rho=\frac{\sigma^2_u}{\sigma^2_u+\sigma^2} = \frac{1.63^2}{1.63^2+0.81^2}=0.802\), which means that 80.2% of the overall variability can be attributed to the differences between the individuals.
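The ICC can also be computed directly from the fitted model, which avoids rounding error (a sketch using the plasma.lmer object fitted above):

```r
## Extract the estimated variance components and compute the ICC
vc <- as.data.frame(VarCorr(plasma.lmer))   # rows: Volunteer, Residual
ICC <- vc$vcov[1]/sum(vc$vcov)
ICC
```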

c) Compute the predicted random effects and fitted values

t(ranef(plasma.lmer)$Volunteer)
                      1        2           3        4         5          6
(Intercept) -0.04120702 3.608557 -0.06475388 0.123621 -1.501113 -0.1824882
                     7         8
(Intercept) -0.5827849 -1.359832
plasma.fitted <- cbind(Plasma,
                       Clotting.fit=fitted(plasma.lmer),
                       error=resid(plasma.lmer))
head(plasma.fitted, n=8)
  Volunteer Treatment Clotting Clotting.fit       error
1         1         1      8.4     9.258793 -0.85879298
2         1         2      9.4     9.671293 -0.27129298
3         1         3      9.8     9.896293 -0.09629298
4         1         4     12.2    10.983793  1.21620702
5         2         1     12.8    12.908557 -0.10855720
6         2         2     15.2    13.321057  1.87894280
7         2         3     12.9    13.546057 -0.64605720
8         2         4     14.4    14.633557 -0.23355720
plot(plasma.fitted$Clotting, plasma.fitted$Clotting.fit,
     xlab="Observed", ylab="Predicted", main="Clotting time (mins)")
lines(c(0,20), c(0,20))
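Note that fitted() returns conditional predictions, which include the predicted random effects \(u_i\). Population-level (marginal) predictions, based on the fixed effects only, can be obtained with the re.form=NA argument (a sketch):

```r
## Conditional vs. marginal predictions from the fitted model
pred.cond <- fitted(plasma.lmer)                # X*beta + Z*u
pred.marg <- predict(plasma.lmer, re.form=NA)   # X*beta only
head(cbind(conditional=pred.cond, marginal=pred.marg))
```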

Exercise 2: spider data

Oxbrough et al. (2005) investigated how spider communities change over forestation cycles in conifer and broadleaf plantations. They identified environmental and structural features of the habitat that can be used as indicators of spider biodiversity. Different plots were surveyed, each comprising 5 to 7 sampling sites separated by a minimum of 50 metres. More than 100 species of spiders were observed.

The Spiders.txt file contains some of the data recorded from the original study (material extracted from Zuur et al., 2013). We are interested in analyzing the relationship between the spider diversity in each site with some environmental explanatory variables. The data set contains the following variables:

  • DivIndex: Variable of interest. Lower values of this index indicate a lower abundance of different species.
  • HerbLayer: Percentage of Herb Layer Cover
  • Litter: Percentage of Litter Content
  • GroundVeg: Percentage of Ground Vegetation
  • Plot: Factor indicating the surveyed plot.

1. Load the data stored in the Spiders.txt file and convert the Plot variable to a factor. Make descriptive graphs to visualize relationships between the variable of interest and the explanatory variables, taking into account the Plot factor.

Spiders <- read.table("Spiders.txt", header=T)
Spiders$Plot <- as.factor(Spiders$Plot)
str(Spiders)
'data.frame':   168 obs. of  5 variables:
 $ DivIndex : num  1.05 1.13 1 0.87 0.94 1.01 1.2 0.38 0.71 0.82 ...
 $ HerbLayer: num  0.27 0.45 0.63 0.49 0.32 0.73 0.68 0.2 0.39 0.38 ...
 $ GroundVeg: num  0.01 0.01 0.01 0.01 0.03 0.27 0.15 0.15 0.59 0.35 ...
 $ Litter   : num  0 0 0 0 0 0 0 0.63 0.06 0.49 ...
 $ Plot     : Factor w/ 30 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...

We can make descriptive graphs of the data using the ggplot2 package

library(ggplot2)

ggplot(Spiders, aes(x=HerbLayer, y=DivIndex, color=Plot)) +
  geom_point() +
  labs(x="Percentage of Herb Layer Cover", y="Diversity Index") +
  theme_minimal()

ggplot(Spiders, aes(x=Litter, y=DivIndex, color=Plot)) +
  geom_point() +
  labs(x="Percentage of Litter Content", y="Diversity Index") +
  theme_minimal()

ggplot(Spiders, aes(x=GroundVeg, y=DivIndex, color=Plot)) +
  geom_point() +
  labs(x="Percentage of Ground Vegetation", y="Diversity Index") +
  theme_minimal()

Design graph to compare average diversity indexes for each level of the Plot factor

plot.design(DivIndex ~ Plot, data=Spiders)

2. Test if random effects are necessary

## Fit the models ##
library(lme4)

spiders.null <- lm(DivIndex ~ 1 + HerbLayer + Litter + GroundVeg, data=Spiders)
spiders.lmer <- lmer(DivIndex ~ 1 + HerbLayer + Litter + GroundVeg + (1|Plot), data=Spiders)

## LRT for the variance component sigma2_u ##
LRT <- -2*(logLik(spiders.null,REML=T)-logLik(spiders.lmer,REML=T))
mean(1-pchisq(LRT,df=c(0,1)))
[1] 0.0002102636

3. Check which environmental variables should be included in the model.

We compare nested models:

M0 <- lmer(DivIndex ~ 1 + (1|Plot), data=Spiders, REML=FALSE)
M1 <- lmer(DivIndex ~ 1 + HerbLayer + (1|Plot), data=Spiders, REML=FALSE)
M2 <- lmer(DivIndex ~ 1 + HerbLayer + Litter + (1|Plot), data=Spiders, REML=FALSE)
M3 <- lmer(DivIndex ~ 1 + HerbLayer + Litter + GroundVeg + (1|Plot), data=Spiders, REML=FALSE)
anova(M0,M1,M2,M3)
Data: Spiders
Models:
M0: DivIndex ~ 1 + (1 | Plot)
M1: DivIndex ~ 1 + HerbLayer + (1 | Plot)
M2: DivIndex ~ 1 + HerbLayer + Litter + (1 | Plot)
M3: DivIndex ~ 1 + HerbLayer + Litter + GroundVeg + (1 | Plot)
   npar     AIC     BIC  logLik -2*log(L)   Chisq Df Pr(>Chisq)    
M0    3 -186.68 -177.31  96.339   -192.68                          
M1    4 -198.50 -186.01 103.252   -206.50 13.8261  1  0.0002005 ***
M2    5 -205.88 -190.26 107.940   -215.88  9.3759  1  0.0021985 ** 
M3    6 -204.26 -185.52 108.132   -216.26  0.3830  1  0.5360224    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We conclude that only the HerbLayer and Litter covariates are statistically significant.

We can also check whether an interaction between these two variables should be included in the model

M4 <- lmer(DivIndex ~ 1 + HerbLayer + Litter + HerbLayer*Litter + (1|Plot), data=Spiders, REML=FALSE)
anova(M2,M4)
Data: Spiders
Models:
M2: DivIndex ~ 1 + HerbLayer + Litter + (1 | Plot)
M4: DivIndex ~ 1 + HerbLayer + Litter + HerbLayer * Litter + (1 | Plot)
   npar     AIC     BIC logLik -2*log(L)  Chisq Df Pr(>Chisq)
M2    5 -205.88 -190.26 107.94   -215.88                     
M4    6 -204.09 -185.34 108.05   -216.09 0.2089  1     0.6476

4. Fit the final linear mixed model and write its equation

We fit the following linear mixed model \[y_{ij} = \beta_0 + \beta_1 \times HerbLayer_{ij} + \beta_2 \times Litter_{ij} + u_i + \epsilon_{ij}\] \[u_i \sim N(0,\sigma^2_u), \quad \epsilon_{ij} \sim N(0,\sigma^2)\] where

  • \(y_{ij}\) is the diversity index at site \(j\) in plot \(i\)
  • \(\beta_0\) is a global intercept
  • \(\beta_1\) and \(\beta_2\) are regression coefficients associated with the continuous covariates HerbLayer and Litter, respectively
  • \(u_i\) is a random variable associated with the \(i\)-th surveyed plot
  • \(\epsilon_{ij}\) are independent random errors
Model <- lmer(DivIndex ~ 1 + HerbLayer + Litter + (1|Plot), data=Spiders)
summary(Model)
Linear mixed model fit by REML ['lmerMod']
Formula: DivIndex ~ 1 + HerbLayer + Litter + (1 | Plot)
   Data: Spiders

REML criterion at convergence: -201.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.6298 -0.4518  0.0329  0.5959  2.6373 

Random effects:
 Groups   Name        Variance Std.Dev.
 Plot     (Intercept) 0.004145 0.06438 
 Residual             0.013854 0.11770 
Number of obs: 168, groups:  Plot, 30

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.93674    0.01994  46.985
HerbLayer    0.16628    0.04301   3.866
Litter      -0.22189    0.07077  -3.135

Correlation of Fixed Effects:
          (Intr) HrbLyr
HerbLayer -0.609       
Litter    -0.278  0.026

5. Verify whether the model assumptions are satisfied.

a) Assessing assumptions on the within-group errors

plot(Model)

plot(Model, Plot ~ resid(., type="pearson"), abline=0, lty=2)

res <- resid(Model, type="pearson")
qqnorm(res)
qqline(res)

shapiro.test(res)

    Shapiro-Wilk normality test

data:  res
W = 0.96873, p-value = 0.0007681
  • The Shapiro-Wilk test rejects normality of the within-group residuals at the 5% level, suggesting a mild departure from the normality assumption.

b) Assessing assumptions on the random effects

u <- unlist(ranef(Model)$Plot)
qqnorm(u)
qqline(u)

shapiro.test(u)

    Shapiro-Wilk normality test

data:  u
W = 0.96304, p-value = 0.3695
  • There is no evidence against the normality assumption for the random effects.

6. Examine the model results (estimated fixed effects and variance components). Compute and interpret the intra-class correlation coefficient. Make a plot of observed vs predicted diversity index values.

a) Estimated fixed effects and 95% confidence intervals

fixef(Model)
(Intercept)   HerbLayer      Litter 
  0.9367414   0.1662767  -0.2218901 
beta.CI <- confint(Model, parm="beta_")
Computing profile confidence intervals ...
beta.CI
                  2.5 %     97.5 %
(Intercept)  0.89767549  0.9755674
HerbLayer    0.08255604  0.2501822
Litter      -0.36550678 -0.0822844

b) Estimated variance components and 95% confidence intervals

VarCorr(Model)
 Groups   Name        Std.Dev.
 Plot     (Intercept) 0.064379
 Residual             0.117703
confint(Model, parm="theta_")
Computing profile confidence intervals ...
            2.5 %     97.5 %
.sig01 0.03582034 0.09292245
.sigma 0.10450819 0.13254556

Intra-class correlation coefficient: \(\rho=\frac{\sigma^2_u}{\sigma^2_u+\sigma^2} = \frac{0.0644^2}{0.0644^2+0.1177^2}=0.23\), which means that only 23% of the overall variability can be attributed to the differences between the surveyed plots.

c) Compare observed and predicted diversity index values

spiders.fitted <- cbind(Spiders,
                        DivIndex.fit=fitted(Model),
                        error=resid(Model))
head(spiders.fitted)
  DivIndex HerbLayer GroundVeg Litter Plot DivIndex.fit       error
1     1.05      0.27      0.01      0    1    0.9863934  0.06360665
2     1.13      0.45      0.01      0    1    1.0163232  0.11367684
3     1.00      0.63      0.01      0    1    1.0462530 -0.04625298
4     0.87      0.49      0.01      0    1    1.0229742 -0.15297423
5     0.94      0.32      0.03      0    1    0.9947072 -0.05470719
6     1.01      0.73      0.27      0    1    1.0628806 -0.05288065
plot(spiders.fitted$DivIndex, spiders.fitted$DivIndex.fit,
     xlab="Observed", ylab="Predicted", main="Diversity Index",
     xlim=c(0.4,1.5), ylim=c(0.4,1.5))
lines(c(0,2),c(0,2))

Exercise 3: split-plot experiment on varieties of oats

These data were introduced by Yates (1935) as an example of a split-plot design (material extracted from Durban, 2014). The experimental units were arranged into six blocks using a \(3 \times 4\) full factorial design, with three varieties of oats and four nitrogen concentrations. The term full factorial means that every variety was used with every nitrogen concentration.

Oats <- read.table("Oats.txt", header=T, stringsAsFactors = T)
head(Oats, n=12)
   Block     Variety nitro yield
1      I     Victory   0.0   111
2      I     Victory   0.2   130
3      I     Victory   0.4   157
4      I     Victory   0.6   174
5      I Golden Rain   0.0   117
6      I Golden Rain   0.2   114
7      I Golden Rain   0.4   161
8      I Golden Rain   0.6   141
9      I  Marvellous   0.0   105
10     I  Marvellous   0.2   140
11     I  Marvellous   0.4   118
12     I  Marvellous   0.6   156
str(Oats)
'data.frame':   72 obs. of  4 variables:
 $ Block  : Factor w/ 6 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Variety: Factor w/ 3 levels "Golden Rain",..: 3 3 3 3 1 1 1 1 2 2 ...
 $ nitro  : num  0 0.2 0.4 0.6 0 0.2 0.4 0.6 0 0.2 ...
 $ yield  : int  111 130 157 174 117 114 161 141 105 140 ...
library(ggplot2)

ggplot(Oats, aes(x=factor(nitro), y=yield, color=Variety)) +
  geom_line(aes(group=Variety)) + 
  geom_point() + 
  labs(x="Nitrogen concentration", y="Yield") +
  ggtitle("Yield of oats by variety and nitrogen level") +
  facet_wrap(~ Block, ncol=2) +
  theme_minimal()

Design graph to compare average yields for each level of Block, Variety and nitro factors

plot.design(yield ~ Block + Variety + factor(nitro), data=Oats)

1. Test if random effects are necessary

2. Choose the correct structure for the fixed effects

3. Fit the final linear mixed model and write its equation

4. Verify whether the model assumptions are satisfied.

a) Assessing assumptions on the within-group errors

b) Assessing assumptions on the random-effects

5. Examine the model results.

a) Estimated fixed effects and 95% confidence intervals

b) Estimated variance components and 95% confidence intervals

c) Predicted random effects and yield values
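As a starting point for this exercise (a sketch, not a full solution): in a split-plot design the whole plots (here, each Variety within each Block) form an additional level of grouping, which is commonly modelled with nested random effects:

```r
## One candidate model for the split-plot structure:
## random intercepts for blocks and for whole plots (Variety nested in Block)
Oats$nitroF <- factor(Oats$nitro)
oats.lmer <- lmer(yield ~ Variety*nitroF + (1|Block/Variety), data=Oats)
summary(oats.lmer)
```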