
Estimating and visualizing multilevel models for change in R

Longitudinal data is fascinating because it enables us to examine change over time, better understand causal relationships, and explain events and their timing. To use this type of data, we typically need to move beyond classical statistical methods, such as OLS regression and ANOVA, to models that can deal with the extra complexity of the data.

One popular model for analyzing longitudinal data is the Multilevel Model for Change (MLMC). It enables the estimation of change over time while accounting for the hierarchical nature of the data (multiple time points nested within individuals). It is similar to the Latent Growth Model, but it is estimated within the multilevel modeling framework (also known as hierarchical modeling or random effects modeling) and uses the long data format (each row is a specific time point for an individual).

More precisely, the MLMC can help:

  • understand how change happens in time
  • explain change using time-varying and time-constant predictors
  • decompose variance into between and within variation
  • deal efficiently with continuous time, unbalanced data (not all individuals are present at all times), and different timings (not everyone provides data at the same time)

Here, I’ll give a high-level introduction to MLMC, how to estimate it in R, and how to visualize change using it.

Preparing the data

You can follow this guide by running the code in R on your computer. Our examples will use synthetic (simulated) data modeled after Understanding Society, a comprehensive panel study from the UK. The actual data can be accessed for free from the UK Data Archive.


In a previous blog post, we cleaned the data and created both the long and the wide formats. We will use that cleaned data here. If you want to follow along, you can download the data here and all the code from here.

Before we start, make sure you have the tidyverse and lme4 packages installed and loaded by running the following code:

install.packages("lme4")
install.packages("tidyverse")

library(lme4)
library(tidyverse)

We will mainly use the lme4 package to run multilevel models and tidyverse for some light data cleaning.

Importing and exploring the data

Next, we will load the data prepared in the previous post. This contains long and wide format data in an “RData” file. We will load the long-format data and have a quick look at it:

load("./data/us_clean_syn.RData")

glimpse(usl)
## Rows: 204,028
## Columns: 32
## $ pidp        <dbl> 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10, 11, 11…
## $ wave        <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4…
## $ age         <hvn_lbll> 28, 28, 28, 28, 80, 80, 80, 80, 60, 60, 60, 60, 42, 4…
## $ hiqual      <hvn_lbll> 1, 1, 1, 1, 9, 9, 9, 9, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1,…
## $ single      <hvn_lbll> 1, 1, 1, 0, 0, 0, 0, 0, 1, NA, NA, NA, 0, NA, 0, NA, …
## $ fimngrs     <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ sclfsato    <hvn_lbll> -9, 6, 2, 1, 6, 6, -8, 3, 2, NA, NA, NA, -9, NA, 5, N…
## $ sf12pcs     <dbl> 54.51, 62.98, 56.97, 56.15, 53.93, 46.39, NA, 46.16, 33.18…
## $ sf12mcs     <dbl> 55.73, 36.22, 60.02, 59.04, 46.48, 45.39, NA, 37.02, 46.80…
## $ istrtdaty   <hvn_lbll> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009,…
## $ sf1         <hvn_lbll> 2, 2, 1, 3, 4, 3, 3, 3, 5, NA, NA, NA, 2, NA, 2, NA, …
## $ present     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, …
## $ gndr.fct    <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male…
## $ single.fct  <fct> Single, Single, Single, In relationship, In relationship, …
## $ urban.fct   <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban, Urban, Ur…
## $ degree.fct  <fct> Degree, Degree, Degree, Degree, No degree, No degree, No d…
## $ vote6.fct   <fct> Not very, Not at all, Not at all, Not at all, Not at all, …
## $ sati        <dbl> NA, 6, 2, 1, 6, 6, NA, 3, 2, NA, NA, NA, NA, NA, 5, NA, NA…
## $ sati.fct    <fct> NA, Mostly satisfied, Mostly dissatisfied, Completely diss…
## $ fimngrs.cap <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ logincome   <dbl> 8.099818, 8.297170, 8.196070, 6.757514, 6.809039, 6.577861…
## $ year        <dbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, NA, …
## $ sati.ind    <dbl> 3.000000, 3.000000, 3.000000, 3.000000, 5.000000, 5.000000…
## $ sati.dev    <dbl> NA, 3.000000, -1.000000, -2.000000, 1.000000, 1.000000, NA…
## $ sf12pcs.ind <dbl> 57.65250, 57.65250, 57.65250, 57.65250, 48.82667, 48.82667…
## $ sf12pcs.dev <dbl> -3.1425000, 5.3275000, -0.6825000, -1.5025000, 5.1033333, …
## $ sf12mcs.ind <dbl> 52.75250, 52.75250, 52.75250, 52.75250, 42.96333, 42.96333…
## $ sf12mcs.dev <dbl> 2.977500, -16.532500, 7.267500, 6.287500, 3.516667, 2.4266…
## $ waves       <int> 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4…
## $ sati.lag    <dbl> NA, NA, 6, 2, NA, 6, 6, NA, NA, 2, NA, NA, NA, NA, NA, 5, …
## $ sf12pcs.lag <dbl> NA, 54.51, 62.98, 56.97, NA, 53.93, 46.39, NA, NA, 33.18, …
## $ sf12mcs.lag <dbl> NA, 55.73, 36.22, 60.02, NA, 46.48, 45.39, NA, NA, 46.80, …

Before we get into the MLMC, let's examine the data we want to analyze. Imagine we are interested in how income changes over time and what the main sources of that change are. More precisely, we want to separate between variation (how people differ compared to others) from within variation (how people vary relative to their own average or trend).

Let’s explore the data. First, let’s look at the long data that we will use for the modeling and graphs:

head(usl, n = 10)
## # A tibble: 10 × 32
##     pidp  wave      age hiqual single fimngrs sclfsato sf12pcs sf12mcs istrtdaty
##    <dbl> <dbl> <hvn_lb> <hvn_> <hvn_>   <dbl> <hvn_lb>   <dbl>   <dbl> <hvn_lbl>
##  1     5     1       28      1      1   3284.       -9    54.5    55.7      2009
##  2     5     2       28      1      1   4002.        6    63.0    36.2      2010
##  3     5     3       28      1      1   3617.        2    57.0    60.0      2011
##  4     5     4       28      1      0    850.        1    56.2    59.0      2012
##  5     6     1       80      9      0    896         6    53.9    46.5      2010
##  6     6     2       80      9      0    709         6    46.4    45.4      2011
##  7     6     3       80      9      0    702        -8    NA      NA        2012
##  8     6     4       80      9      0    829.        3    46.2    37.0      2013
##  9     7     1       60      3      1   1458.        2    33.2    46.8      2009
## 10     7     2       60      3     NA     NA        NA    NA      NA          NA
## # ℹ 22 more variables: sf1 <hvn_lbll>, present <lgl>, gndr.fct <fct>,
## #   single.fct <fct>, urban.fct <fct>, degree.fct <fct>, vote6.fct <fct>,
## #   sati <dbl>, sati.fct <fct>, fimngrs.cap <dbl>, logincome <dbl>, year <dbl>,
## #   sati.ind <dbl>, sati.dev <dbl>, sf12pcs.ind <dbl>, sf12pcs.dev <dbl>,
## #   sf12mcs.ind <dbl>, sf12mcs.dev <dbl>, waves <int>, sati.lag <dbl>,
## #   sf12pcs.lag <dbl>, sf12mcs.lag <dbl>

We see that each row is a combination of individual (variable “pidp”) and time (variable “wave”). This is also the format we need for doing visualization using ggplot2.
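
As a quick sanity check, we can confirm that each person-wave combination appears exactly once; a small dplyr sketch:

# quick check: each person-wave combination should appear exactly once
usl %>%
  count(pidp, wave) %>%
  filter(n > 1) # zero rows means no duplicates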

To see what we will be modeling, we can make a simple graph showing the average change over time for the whole sample as well as a line for the change of each individual (intro to graphs for longitudinal data here):

ggplot(usl, aes(wave, logincome, group = pidp)) +
  geom_line(alpha = 0.01) +
  stat_summary( 
    aes(group = 1),
    fun = mean,
    geom = "line",
    size = 1.5,
    color = "red"
  ) +
  theme_bw() + # nice theme
  labs(x = "Wave", y = "Logincome") # nice labels
[Figure: change over time of income in the UK]

So we see a positive average change over time but also quite a lot of variation in how people change (gray lines all over the place). The MLMC can estimate both of these things at the same time!

What is multilevel modeling?

Multilevel modeling is an extension of regression modeling in which we can decompose different sources of variation. Why is this important? Traditional OLS regression assumes that all the cases are independent. This implies that there is no correlation between the observed measures due to things like clustering. This is often not true in social science data. For example, individuals are nested within households, neighborhoods, regions, and countries. Students are nested in classes, schools, areas, and countries. This nested structure makes individuals similar to each other. For example, health outcomes will probably be similar for people living in the same neighborhood because of self-selection (similar income and education), quality of health provision, or air quality. If this is true, we can't treat people from the same neighborhood as independent.

Multilevel modeling solves this issue by estimating random effects (i.e., variation) for the different levels present in the data. This way, the regression estimates (in particular, the standard errors) will be corrected for this nested structure. Additionally, these models can estimate how much variation comes from each level. This can be very informative. For example, by decomposing the variation in student outcomes at the student, class, and school levels, we can understand the main factors impacting students' grades. This can inform theory and policy.
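
For instance, here is a hypothetical sketch of what such a nested model could look like in lme4 (the data frame and variable names are made up purely for illustration):

# hypothetical example: students nested in classes nested in schools
# "(1 | school/class)" expands to a random intercept for schools
# plus a random intercept for classes within schools
m_school <- lmer(grade ~ 1 + (1 | school/class), data = students)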

Between and within variation

Longitudinal data is also nested: measures at different points in time are nested within an individual. These observations are not independent, as stable individual characteristics (like genes, personality, or context) lead to consistent responses within individuals and across time. So, multilevel modeling can help account for this nesting and facilitate the separation of the two primary sources of variation in longitudinal data: between variation and within variation. The former refers to how individuals differ from each other in the outcome of interest (e.g., one individual has, on average, a higher income than another), while the latter refers to how a measure in a particular year differs from the individual's own average (e.g., whether they have a lower or higher income compared to their average income).
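
To make this distinction concrete, here is a minimal dplyr sketch that splits log income into the two parts by hand: the person mean captures the between component, and the deviation from that mean captures the within component (the ".ind" and ".dev" variables we saw in the data appear to have been built in a similar way; the variable names below are just illustrative):

# sketch: decompose logincome into a between part (person mean)
# and a within part (deviation from the person mean)
usl_decomp <- usl %>%
  group_by(pidp) %>%
  mutate(
    logincome.ind = mean(logincome, na.rm = TRUE), # between: person average
    logincome.dev = logincome - logincome.ind      # within: wave-specific deviation
  ) %>%
  ungroup()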

To better understand these sources of variation, let’s look at this graph that shows three hypothetical individuals and how they change in time:

[Figure: three trajectories over time in a longitudinal study]


The trajectories capture two sources of variation. To understand how that works, we can imagine fitting a linear trend for each individual. Now, each individual has their own trajectory:

[Figure: between and within variation decomposition in longitudinal data]

When we discuss between variation, we refer to how the individual lines are different from each other. On the other hand, when we discuss within variation, we refer to the difference between each individual’s trend and their observed scores (represented by dots in the above graph).

To make this even clearer, we could completely separate the two sources of variation. On the one hand, we could extract just the individual trends, which would capture just the between variation. On the other hand, we can subtract these trends from the data, and we would be left just with the within variation:

[Figure: the between variation (individual trends only) and the within variation (detrended scores) shown side by side]

The multilevel model for change allows us to model these two sources of variation concurrently, capturing what is happening at the individual level while also estimating the average effects that describe the overall patterns in the data.

The unconditional means model (a.k.a. random effects)

Now that we have a basic understanding of multilevel modeling and longitudinal data, we can try them out. We typically start with a simple model that aims to separate within and between variation. We can define the model as follows:

$$Y_{ij} = \gamma_{00} + \xi_{0i} + \epsilon_{ij}$$

Where:

  • $Y_{ij}$ is the variable of interest (for example, log income), and it varies by individual ($i$) and time ($j$)
  • $\gamma_{00}$ is the intercept in the regression. We can interpret it as the grand mean, i.e., the average of the outcome over all the individuals and time points
  • $\xi_{0i}$ is the between variation and tells us how different people are from each other in their income. If everyone had the same income, this would be 0. The more different they are, the larger its variance will be
  • $\epsilon_{ij}$ is the residual, which here also has the interpretation of within variation; it tells us how much each individual varies around their own average. The larger its variance, the more individuals fluctuate in their outcomes

Now that we know what we want to model, let's see how to do it in R with lme4. To estimate multilevel models, we use the lmer() command. We need to give it the data and the formula. Our outcome is "logincome", so that goes on the left side of ~. We have "1" on the right side, which stands for the intercept. In the brackets, we define the random effects. Here we say that we want the intercept ("1") to vary by individual ("| pidp"). This all leads to the following syntax:

# unconditional means model (a.k.a. random effects model)
m0 <- lmer(data = usl, logincome ~ 1 + (1 | pidp))

# check results
summary(m0)
## Linear mixed model fit by REML ['lmerMod']
## Formula: logincome ~ 1 + (1 | pidp)
##    Data: usl
## 
## REML criterion at convergence: 488875.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.1929 -0.1640  0.1208  0.3555  4.0537 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  pidp     (Intercept) 1.1198   1.0582  
##  Residual             0.8712   0.9334  
## Number of obs: 152509, groups:  pidp, 50995
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 6.782034   0.005373    1262

So, let’s interpret the main coefficients:

  • under "Fixed effects," we have the "(Intercept)", 6.78, which is the grand mean ($\gamma_{00}$) and tells us that over all the time points and individuals, the average log income is 6.78
  • under "Random effects," everything that varies by "pidp" represents between variation. In this case, the between variation for the intercept (the variance of $\xi_{0i}$) is 1.12
  • under "Random effects," the "Residual" term represents within variation. In this case, the within variation (the variance of $\epsilon_{ij}$) is 0.87

To make the between and within variation easier to interpret, we can calculate the Intraclass Correlation Coefficient (ICC):

$$\rho = \frac{\sigma^2_0}{\sigma^2_0 + \sigma^2_\epsilon} = \frac{1.12}{1.12 + 0.87} \approx 0.56$$

Here $\sigma^2_0$ stands for the between variation and $\sigma^2_\epsilon$ for the within variation. The equation is just a ratio of the two and tells us the proportion of the variation that is between individuals. In our case, it shows that around 56% of the variation in log income is due to differences between people, while the rest (~44%) is within people. Substantively, this would indicate that the differences between people in income matter more than the differences within people (i.e., Bob can work as much as he wants, but it's doubtful that he will become the next Jeff Bezos).
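
Rather than copying the numbers by hand, we can also pull the variance components directly from the fitted model. A minimal sketch using lme4's VarCorr():

# compute the ICC from the model's variance components
vc <- as.data.frame(VarCorr(m0))
icc <- vc$vcov[vc$grp == "pidp"] / sum(vc$vcov)
icc # should be around 0.56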

To better understand what the model does, we can predict the scores and make a graph showing the predicted individual scores (lines) and the observed ones (points) for three randomly selected individuals:

usl$pred_m0 <- predict(m0, newdata = usl, allow.new.levels = TRUE)

set.seed(1234)
random_people <- unique(usl$pidp) |> sample(3)

usl %>%
  filter(pidp %in% random_people) %>% # select just three individuals
  ggplot(aes(wave, pred_m0, color = as.factor(pidp))) +
  geom_point(aes(wave, logincome)) + # points for observed logincome
  geom_smooth(method = lm, se = FALSE, size = 0.7) + # individual lines based on prediction
  geom_smooth(data = usl, group = 1, method = lm,
              se = FALSE, color = "black") + # overall line based on prediction
  theme_bw() + # nice theme
  labs(x = "Wave", y = "Logincome") + # nice labels
  theme(legend.position = "none") # hide legend


[Figure: predicted scores based on a random effects model]

A few things to notice here. We are predicting just one score for each individual, so at this point we don't have any change over time (the lines are horizontal). The differences between the lines represent between variation (e.g., the blue line is higher than the green one), while the differences between the lines and the points of the same color represent within variation (the observed individual scores differ from their prediction).
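
Relatedly, the estimated person-specific deviations from the grand mean (the $\xi_{0i}$ terms) can be inspected directly; a quick sketch:

# person-specific deviations from the grand mean (the xi_0i terms)
head(ranef(m0)$pidp)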

The unconditional change model

This model gives us an idea of how much variation we have at each level, but we also want to look at change over time! So, let's expand the model:

$$Y_{ij} = \gamma_{00} + \gamma_{10}\text{TIME}_{ij} + \xi_{0i} + \epsilon_{ij}$$

In this model, we add time ("wave" in our case) as a predictor. Now $\gamma_{00}$ represents the average log income when time is 0, while $\gamma_{10}$ represents the rate of change in income when time goes up by 1.

To make things easier to interpret, let's rescale our time variable so that it starts from 0. That way, $\gamma_{00}$ has the nice interpretation of the expected log income at the start of the study. So here we create a new variable by subtracting 1 from wave:

# create new variable starting from 0
usl <- mutate(usl, wave0 = wave - 1)

# see what it looks like
count(usl, wave, wave0)
## # A tibble: 4 × 3
##    wave wave0     n
##   <dbl> <dbl> <int>
## 1     1     0 51007
## 2     2     1 51007
## 3     3     2 51007
## 4     4     3 51007

Now we can run our model by adding the new time variable as a fixed effect:

# unconditional change model (a.k.a. MLMC)
m1 <- lmer(data = usl, logincome ~ 1 + wave0 + (1 | pidp))
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: logincome ~ 1 + wave0 + (1 | pidp)
##    Data: usl
## 
## REML criterion at convergence: 486596.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.4071 -0.1733  0.1190  0.3684  4.1306 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  pidp     (Intercept) 1.1099   1.0535  
##  Residual             0.8565   0.9254  
## Number of obs: 152509, groups:  pidp, 50995
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 6.661570   0.005903 1128.58
## wave0       0.107170   0.002231   48.03
## 
## Correlation of Fixed Effects:
##       (Intr)
## wave0 -0.424

The main difference from the previous model is in the fixed effects part. Now, we interpret the intercept (6.66) as the expected log income at the beginning of the study (wave 1). The coefficient for "wave0", 0.107, tells us the average rate of change from one wave to the next. So, individual log income slowly increases each wave.
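
As a quick check, we can reconstruct this average trajectory by hand from the fixed effects. A small sketch using fixef():

# expected average logincome at waves 1-4 (wave0 = 0, 1, 2, 3)
fixef(m1)[1] + fixef(m1)[2] * 0:3
# roughly 6.66, 6.77, 6.88, 6.98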

Again, let’s visualize the results based on the model:

usl$pred_m1 <- predict(m1, newdata = usl, allow.new.levels = TRUE)

usl %>%
  filter(pidp %in% random_people) %>% # select just three individuals
  ggplot(aes(wave, pred_m1, color = as.factor(pidp))) +
  geom_point(aes(wave, logincome)) + # points for observed logincome
  geom_smooth(method = lm, se = FALSE, size = 0.7) + # individual lines based on prediction
  geom_smooth(data = usl, group = 1,
              method = lm, se = FALSE, color = "black") + # overall line based on prediction
  theme_bw() + # nice theme
  labs(x = "Wave", y = "Logincome") + # nice labels
  theme(legend.position = "none") # hide legend
[Figure: predicted change over time based on a multilevel model for change with no between variation in the rate of change]

So now we see a positive trend, which is due to the coefficient for time. We also notice that the lines are parallel. This means we are assuming the change over time is the same for all individuals. In more technical terms, we assume there is no between variation in the rate of change. This is typically quite a strong assumption, and in our case we wouldn't expect it to hold, given the first graph we created. So, let's expand the model to include between variation in the rate of change:

$$Y_{ij} = \gamma_{00} + \gamma_{10}\text{TIME}_{ij} + \xi_{0i} + \xi_{1i}\text{TIME}_{ij} + \epsilon_{ij}$$

Now we have two sources of between variation. The term $\xi_{0i}$ represents the between variation at the start of the study, while $\xi_{1i}$ represents the between variation in the rate of change. This means that we allow individuals to have both different log incomes at wave 1 and different trends.

To run such a model in lme4, we can simply add "wave0" to the random part of the model:

# unconditional change model (a.k.a. MLMC) with re for change
m2 <- lmer(data = usl, logincome ~ 1 + wave0 + (1 + wave0 | pidp))
summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: logincome ~ 1 + wave0 + (1 + wave0 | pidp)
##    Data: usl
## 
## REML criterion at convergence: 481633.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.6628 -0.1560  0.1112  0.3260  4.6631 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  pidp     (Intercept) 1.64416  1.2822        
##           wave0       0.08809  0.2968   -0.71
##  Residual             0.71638  0.8464        
## Number of obs: 152509, groups:  pidp, 50995
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 6.657833   0.006575 1012.62
## wave0       0.115675   0.002460   47.02
## 
## Correlation of Fixed Effects:
##       (Intr)
## wave0 -0.624
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00502323 (tol = 0.002, component 1)

You can ignore the warning regarding convergence for now.
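
If you did want to address the warning, one common remedy is to refit the model with a different optimizer; a sketch:

# refit the same model with the bobyqa optimizer
m2b <- lmer(logincome ~ 1 + wave0 + (1 + wave0 | pidp),
            data = usl,
            control = lmerControl(optimizer = "bobyqa"))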

In the results, we see that the random part of the model now has two terms that vary by "pidp". The variance of the "(Intercept)", 1.64, represents between variation at the starting point of the study ($\xi_{0i}$), while the variance of "wave0", 0.088, represents between variation in the rates of change ($\xi_{1i}$).
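
If you want a formal test of whether letting the rate of change vary improves the model, you can compare the two specifications with a likelihood-ratio test (note that anova() will refit both models using ML instead of REML); a sketch:

# compare random intercept only (m1) vs. random intercept + slope (m2)
anova(m1, m2)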

If we now investigate the predictions, we see that individuals are allowed to have both different starting points and different trends:

# let's look at predictions based on this model
usl$pred_m2 <- predict(m2, newdata = usl, allow.new.levels = TRUE)

usl %>%
  filter(pidp %in% random_people) %>% # select just three individuals
  ggplot(aes(wave, pred_m2, color = as.factor(pidp))) +
  geom_point(aes(wave, logincome)) + # points for observed logincome
  geom_smooth(method = lm, se = FALSE, size = 0.7) + # individual lines based on prediction
  geom_smooth(data = usl, group = 1,
              method = lm, se = FALSE, color = "black") + # overall line based on prediction
  theme_bw() + # nice theme
  labs(x = "Wave", y = "Logincome") + # nice labels
  theme(legend.position = "none") # hide legend
[Figure: predicted trajectories of change in logincome based on the multilevel model for change]

The larger the variance of $\xi_{0i}$, the more pronounced the differences between people at the start of the study, while a larger variance of $\xi_{1i}$ indicates more diverse rates of change.
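
If you are curious about the implied person-specific intercepts and slopes, coef() combines the fixed effects with each individual's random effects; a minimal sketch:

# person-specific intercepts and slopes (fixed effects + random effects)
head(coef(m2)$pidp)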

Conclusions

Hopefully, this gives you an idea of what the multilevel model for change is, how to estimate it in R, and how to visualize change with it. This model is similar to the Latent Growth Model, and I made an intro to that here. You can learn how to expand the model to include time-constant predictors and time-varying predictors here and here.

