Multilevel versus Latent Growth Model Trajectories

Comparing the multilevel model with the Latent Growth Model

In previous blog posts, I introduced two of the most popular models for estimating change over time: the multilevel model for change (MLM) and the Latent Growth Model (LGM). In my experience teaching this topic, researchers tend to select one of these models depending on the framework they are more familiar with: regression/multilevel versus Structural Equation Modeling. As a result, they know less about the other method and when to use it. Both of these models answer similar research questions: how does change happen, and how do individuals differ in their patterns of change?

There are a couple of reasons why you should try to understand both the multilevel model for change and the latent growth model. Firstly, while the two methods are similar, they have different strengths and weaknesses, so it can make sense to switch to the other approach in certain situations. Secondly, by default, the two models make slightly different assumptions that you should be aware of. Thirdly, understanding how both of them work makes it easier to engage with the academic literature.

In this post, I’m going to compare the two methods, discuss the differences in their default assumptions, and show how to test those assumptions. Finally, I will discuss some of each method’s strengths and weaknesses.

Set-up

First, let’s set things up for the comparison. I’m going to use the “lavaan” package to run the LGM. This package was developed to estimate Structural Equation Models and is well suited to running LGMs. We will use “lme4” to run the MLM. Here, I load the two packages (they are already installed):

# package for LGM
library(lavaan)

# package for MLM
library(lme4)
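
If you do not have them on your machine, both packages can be grabbed from CRAN first (a one-off step, shown here only for completeness):

# install both packages from CRAN (only needed once)
install.packages(c("lavaan", "lme4"))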

One crucial difference between the two models is that they expect the data to be structured differently. The LGM uses the wide data format, where each row represents an individual and the outcome appears in multiple columns, one for its value at each wave. Here is what the wide data looks like:

# wide data
usw

## # A tibble: 8,752 x 5
##    pidp  logincome_1 logincome_2 logincome_3 logincome_4
##    <fct>       <dbl>       <dbl>       <dbl>       <dbl>
##  1 1            7.07        7.20        7.21        7.16
##  2 2            7.15        7.10        7.15        6.86
##  3 3            6.83        6.93        7.57        7.27
##  4 4            7.49        7.63        7.52        7.46
##  5 5            8.35        5.57        7.78        7.73
##  6 6            7.80        7.82        7.89        6.16
##  7 7            6.98        7.00        7.01        7.04
##  8 8            7.03        7.26        7.23        7.35
##  9 9            6.39        5.66        6.54        6.64
## 10 10           8.28        8.36        8.07        7.92
## # ... with 8,742 more rows

The MLM uses the data in long format, where each row is a combination of individual and wave. This format has fewer columns but more rows, because the values from different waves are stacked in a single column:

# long data
usl

## # A tibble: 35,008 x 4
##    pidp   wave wave0 logincome
##    <fct> <dbl> <dbl>     <dbl>
##  1 1         1     0      7.07
##  2 1         2     1      7.20
##  3 1         3     2      7.21
##  4 1         4     3      7.16
##  5 2         1     0      7.15
##  6 2         2     1      7.10
##  7 2         3     2      7.15
##  8 2         4     3      6.86
##  9 3         1     0      6.83
## 10 3         2     1      6.93
## # ... with 34,998 more rows
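
Since the two models need the data in different shapes, you will often have to reshape from one format to the other. Here is a minimal sketch using tidyr, assuming the long data usl with the columns shown above (the wide-to-long direction works analogously with pivot_longer()):

# reshape the long data into the wide format needed for the LGM
library(tidyr)

usw2 <- pivot_wider(usl,
                    id_cols = pidp,
                    names_from = wave,
                    names_prefix = "logincome_",
                    values_from = logincome)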

The statistical models

I have introduced the two models in previous posts (the multilevel model for change and the Latent Growth Model), so do check those out if this is new to you. Here, I want to present the statistical notation of the two models side by side so that the similarities are easier to see.

The notation for the multilevel model for change is:

Yij = γ00 + γ10TIMEij + ξ0i + ξ1iTIMEij + ϵij

The notation for the Latent Growth Model is:

yj = α0 + α1λj + ζ00 + ζ11λj + ϵj

A few things to note. Typically, the individual subscript, i, is missing from SEM notation. Also, time is an explicit variable (TIME) in the MLM, while in the LGM it is captured by the factor loadings, λ, which we fix by hand when specifying the model. Otherwise, the two models are very similar.

Just a reminder of what all this Greek means:

  • Yij/yj is the variable of interest (logincome for us) that changes over time (j).
  • γ00/α0 represents the average value at the start of the data collection.
  • γ10/α1 is the average rate of change over time.
  • ξ0i/ζ00 is the between variation at the start of the data, summarizing how different the individual starting points are from the average starting point.
  • ξ1i/ζ11 is the between variation in the rate of change, summarizing how different the individual slopes are from the average rate of change.
  • ϵij/ϵj is the within variation, or how different the observed scores of each individual are from their expected values.

Comparing the results

Next, let’s compare two simple models where we explore the change over time in log income across four waves of the Understanding Society survey.

The first model we run is the LGM (which we cover in more depth in this post):

# latent growth model
model <- ' i =~ 1*logincome_1 + 1*logincome_2 + 1*logincome_3 +
                  1*logincome_4
            s =~ 0*logincome_1 + 1*logincome_2 + 2*logincome_3 + 
                  3*logincome_4'

lgm1 <- growth(model, data = usw)

summary(lgm1, standardized = TRUE)

## lavaan 0.6-9 ended normally after 51 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         9
##                                                       
##   Number of observations                          8752
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                83.795
##   Degrees of freedom                                 5
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   i =~                                                                  
##     logincome_1       1.000                               0.828    0.854
##     logincome_2       1.000                               0.828    0.924
##     logincome_3       1.000                               0.828    0.949
##     logincome_4       1.000                               0.828    0.958
##   s =~                                                                  
##     logincome_1       0.000                               0.000    0.000
##     logincome_2       1.000                               0.148    0.165
##     logincome_3       2.000                               0.296    0.339
##     logincome_4       3.000                               0.444    0.513
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   i ~~                                                                  
##     s                -0.052    0.003  -16.641    0.000   -0.425   -0.425
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .logincome_1       0.000                               0.000    0.000
##    .logincome_2       0.000                               0.000    0.000
##    .logincome_3       0.000                               0.000    0.000
##    .logincome_4       0.000                               0.000    0.000
##     i                 7.048    0.010  715.699    0.000    8.512    8.512
##     s                 0.052    0.003   19.277    0.000    0.353    0.353
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .logincome_1       0.254    0.007   36.846    0.000    0.254    0.271
##    .logincome_2       0.200    0.004   47.413    0.000    0.200    0.249
##    .logincome_3       0.197    0.004   48.753    0.000    0.197    0.258
##    .logincome_4       0.177    0.006   31.356    0.000    0.177    0.237
##     i                 0.686    0.013   52.076    0.000    1.000    1.000
##     s                 0.022    0.001   16.826    0.000    1.000    1.000
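
Besides the parameter estimates, it is worth glancing at the global fit of the LGM. A quick way to pull out a few common fit indices in lavaan (output not shown) is:

# extract a few common global fit indices for the LGM
fitMeasures(lgm1, c("chisq", "df", "cfi", "tli", "rmsea", "srmr"))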

Next, we run the MLM using the long data (which we cover in more depth in this post):

mlm1 <- lmer(data = usl, logincome ~ 1 + wave0 + 
             (1 + wave0 | pidp))
summary(mlm1)

## Linear mixed model fit by REML ['lmerMod']
## Formula: logincome ~ 1 + wave0 + (1 + wave0 | pidp)
##    Data: usl
## 
## REML criterion at convergence: 69594.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.1893 -0.2258  0.0543  0.3052  4.9783 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  pidp     (Intercept) 0.7057   0.8401        
##           wave0       0.0238   0.1543   -0.46
##  Residual             0.2039   0.4515        
## Number of obs: 35008, groups:  pidp, 8752
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 7.044933   0.009846  715.52
## wave0       0.053728   0.002716   19.78
## 
## Correlation of Fixed Effects:
##       (Intr)
## wave0 -0.518
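
Note that lme4 does not print p-values. If you want uncertainty estimates for the MLM coefficients, one option is profile likelihood confidence intervals (this can take a little while on a dataset of this size):

# profile likelihood confidence intervals for the fixed and random effects
confint(mlm1)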

To make it easier to compare, we can extract the main coefficients in a table:

Coefficient                    Multilevel   Latent growth
Fixed effect: intercept        7.045        7.048
Fixed effect: slope            0.054        0.052
Between variance: intercept    0.706        0.686
Between variance: slope        0.024        0.022

Table comparing coefficients from the multilevel model and the latent growth model
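
If you prefer to pull these numbers out of the fitted objects rather than copy them from the printed summaries, here is a minimal sketch (the extractor functions are standard lme4/lavaan; assembling them into a table is left to you):

# fixed effects and variance components from the MLM
fixef(mlm1)
as.data.frame(VarCorr(mlm1))

# growth factor means, variances and covariances from the LGM
parameterEstimates(lgm1)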

First, we see that the estimates are very similar but not identical. The main reason is the treatment of the residuals, or within variation. The MLM estimates a single residual variance (0.204), while the LGM estimates four, one per wave. And this is the big assumption the MLM makes by default: it assumes that the residuals, or within variation, are the same at every point in time. The LGM, by default, does not make this assumption and estimates a separate coefficient for each wave.

This assumption can matter. The residual variances are substantively interesting because they tell us how much within variation is left unexplained. If the residuals are not actually equal over time, this single coefficient will be incorrect, and the other coefficients in the model can be biased as a result.

Restricting the LGM

While there is some debate about whether the assumption of equal residuals over time is reasonable, I believe the best way to deal with it is to investigate it empirically. This is easy to do in the LGM: we can run the model again, this time with the restriction that the residuals are equal over time, and then compare this model with the previous one to decide whether the assumption holds.

# latent growth model with restriction
model <- ' i =~ 1*logincome_1 + 1*logincome_2 + 1*logincome_3 +
                  1*logincome_4
            s =~ 0*logincome_1 + 1*logincome_2 + 2*logincome_3 + 
                  3*logincome_4

            logincome_1 ~~ a*logincome_1
            logincome_2 ~~ a*logincome_2
            logincome_3 ~~ a*logincome_3
            logincome_4 ~~ a*logincome_4'

lgm2 <- growth(model, data = usw)

summary(lgm2, standardized = TRUE)

## lavaan 0.6-9 ended normally after 50 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         9
##   Number of equality constraints                     3
##                                                       
##   Number of observations                          8752
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               195.275
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   i =~                                                                  
##     logincome_1       1.000                               0.840    0.881
##     logincome_2       1.000                               0.840    0.932
##     logincome_3       1.000                               0.840    0.961
##     logincome_4       1.000                               0.840    0.961
##   s =~                                                                  
##     logincome_1       0.000                               0.000    0.000
##     logincome_2       1.000                               0.154    0.171
##     logincome_3       2.000                               0.309    0.353
##     logincome_4       3.000                               0.463    0.530
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   i ~~                                                                  
##     s                -0.060    0.003  -20.757    0.000   -0.463   -0.463
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .logincome_1       0.000                               0.000    0.000
##    .logincome_2       0.000                               0.000    0.000
##    .logincome_3       0.000                               0.000    0.000
##    .logincome_4       0.000                               0.000    0.000
##     i                 7.045    0.010  715.556    0.000    8.387    8.387
##     s                 0.054    0.003   19.782    0.000    0.348    0.348
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .logincom_1 (a)    0.204    0.002   93.552    0.000    0.204    0.224
##    .logincom_2 (a)    0.204    0.002   93.552    0.000    0.204    0.251
##    .logincom_3 (a)    0.204    0.002   93.552    0.000    0.204    0.267
##    .logincom_4 (a)    0.204    0.002   93.552    0.000    0.204    0.267
##     i                 0.706    0.013   54.639    0.000    1.000    1.000
##     s                 0.024    0.001   22.261    0.000    1.000    1.000

Now, we see that the residual variance is the same at each point in time. If we put all the coefficients in a table, we can see that the restricted LGM and the MLM now give identical results:

Coefficient                    Multilevel   Latent growth   Latent growth (restricted)
Fixed effect: intercept        7.045        7.048           7.045
Fixed effect: slope            0.054        0.052           0.054
Between variance: intercept    0.706        0.686           0.706
Between variance: slope        0.024        0.022           0.024
Within variation               0.204        0.254           0.204

Comparing the multilevel model with the latent growth model with restrictions

We can compare the model with restrictions and the original one to see which fits the data best. The anova() command is an easy way to do this:

anova(lgm1, lgm2)

## Chi-Squared Difference Test
## 
##      Df   AIC   BIC   Chisq Chisq diff Df diff Pr(>Chisq)    
## lgm1  5 69483 69547  83.795                                  
## lgm2  8 69589 69631 195.275     111.48       3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All the fit indices indicate that the original model, which allowed a different residual variance at each point in time, fits the data better. This suggests that the assumption of equal within variation over time is probably not appropriate for these particular data.

When to use each model

In addition to this assumption regarding the within variation, which can be freed and tested, there are a couple of other things to consider when choosing between the multilevel model for change and latent growth models.

The MLM is handy when analyzing data with continuous time or when data is collected at different points in time for each individual. Because it uses long data, it can easily deal with these situations, which can be problematic for the LGM. Additionally, if you have many time points, it might be easier to write up the model.
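
For example, if people were measured at different, person-specific times, the MLM would only need the time variable swapped out. A hypothetical sketch (years_since_start is an illustrative column that does not exist in our data):

# hypothetical: time measured continuously and differently for each person
# (years_since_start is an illustrative column, not part of usl)
# mlm_cont <- lmer(logincome ~ 1 + years_since_start +
#                    (1 + years_since_start | pidp), data = usl)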

Conversely, the LGM is useful if you want to take advantage of the other tools available in the SEM framework. For example, you can easily do multi-group analysis, comparing trajectories across different groups. You can also combine the LGM with mixture (latent class) models to run a Mixture Latent Growth Model, which makes it possible to find clusters of people based on their patterns of change over time. Additionally, you can include the LGM in path models, making it possible to examine the relationship between the rate of change and other variables of interest. You can also correct for measurement error by using second-order latent growth models and investigate measurement invariance over time. Finally, SEM can deal effectively with missing data by using Full Information Maximum Likelihood.
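
The last point is particularly convenient in lavaan: using Full Information Maximum Likelihood for missing data is just one extra argument when fitting the growth model. A sketch, assuming the wide data contained missing values:

# refit the LGM using Full Information Maximum Likelihood for missing data
lgm_fiml <- growth(model, data = usw, missing = "fiml")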

Conclusions

Hopefully, this gave you an idea of the strengths and weaknesses of the multilevel model for change and the latent growth model. Both can be valuable tools for understanding individual-level change over time, but there are situations where one is a better fit than the other.

