Introduction to Path and Mediation Analysis with R

Mediation and path analysis are two of the most popular methods in the social sciences because they allow researchers to move beyond simple regression and study complex networks of relationships. Instead of looking only at one outcome at a time, these methods make it possible to examine how multiple variables influence each other. For example, we might want to understand how parental education shapes income, not only through a direct effect but also indirectly through educational attainment and self-esteem. In psychology, we might study whether stress affects health outcomes directly or indirectly by reducing sleep quality. By separating direct, indirect, and total effects, path and mediation analysis provide a clearer picture of the mechanisms that connect variables—making them powerful tools for answering “how” and “why” questions in research.

In this post, we will introduce path analysis as a method for empirically testing mediation relationships.

Investigating mediation using path analysis

Path analysis can be considered a special case of Structural Equation Modeling (SEM) in which we include only observed variables. Both path analysis and SEM are typically represented using graphs to facilitate understanding of the model of interest. An example of a path diagram is shown below.

Basic path analysis diagram showing a mediation model

Mediation analysis builds on path analysis by focusing on how and through what mechanisms one variable influences another. In a path diagram, variables are shown as squares (observed variables), while single-headed arrows represent causal paths or regression slopes. For example, if we want to know whether parental education influences income through education, we would draw an arrow from parental education to education, and another from education to income. The indirect path formed by these arrows shows the mediating process. By following the arrows in the diagram, we can separate direct effects (“c”, the arrow from X to Y) from indirect effects (“a” and “b” in the figure above).
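Written as equations, the simple mediation diagram corresponds to two regressions. The following is a sketch in common notation, where X is the predictor, M the mediator, and Y the outcome. The intercepts (i) and residuals (ε) are not labelled in the diagram and are added here as assumptions:

```latex
\begin{aligned}
M &= i_M + aX + \varepsilon_M \\
Y &= i_Y + cX + bM + \varepsilon_Y \\
\text{indirect effect} &= a \times b \\
\text{total effect} &= c + a \times b
\end{aligned}
```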

Sometimes a predictor no longer has a direct effect on the outcome once the mediator is included. This is called complete mediation. For instance, if parental education affects income only through education, the direct path from parental education to income would be close to zero once education is controlled for. In contrast, if both the direct and indirect effects remain strong, we have partial mediation. Complete mediation is important because it suggests that the mediator fully explains the mechanism linking the two variables. Identifying such cases helps researchers build stronger theoretical models and design interventions that target the key mediating variable.

Basic path analysis diagram showing a complete mediation

Real-world data often involves more than one mediator or multiple outcomes. Imagine a study in public health where researchers aim to understand the impact of childhood socioeconomic status (SES) on adult health outcomes. SES may have a direct effect on health, but it may also operate indirectly through multiple mediators, such as educational attainment, job opportunities, lifestyle factors (including diet, exercise, and smoking), and psychological stress. In this case, the path diagram would contain one square for each variable, with arrows tracing both direct and indirect routes from childhood SES to adult health. By estimating all these paths simultaneously, path and mediation analysis can reveal not just whether SES matters, but how exactly it translates into long-term health inequalities.

Let’s examine an example model that can be used to investigate the impact of parents’ education on an individual’s income. The path analysis can be represented in this graph:

In this model, we have five variables: parental education, education, IQ score, self-esteem, and respondent income. In such graphs, single-headed arrows represent regression slopes. For example, the arrow from parental education to income (labelled “f”) represents a regression slope. Squares with arrows coming toward them are outcomes in regression equations, while those from which arrows leave are the predictors. In this model, we have three regressions:

  1. Education (“edu”) is explained by parental education (arrow “c”) and IQ (arrow “a”)
  2. Self-esteem is explained by education (arrow “b”)
  3. Income is explained by parental education (arrow “f”), education (arrow “d”), and self-esteem (arrow “e”)

To locate indirect effects, we can follow the paths in the graph. For example, parental education has an indirect effect on self-esteem through education. We can calculate this effect by multiplying the coefficients along the path, in this case “c” and “b”.

If we look at the effect of parental education on income, we see that it has both a direct effect (arrow “f”) and an indirect effect. More precisely, there are two possible indirect effects: going only through education (arrows “c” and “d”) or going through education and self-esteem (arrows “c”, “b”, and “e”). We can calculate the size of the two indirect effects by multiplying the coefficients on the path. For the first path, this would be equal to c × d, while for the second one, it would be c × b × e.

As we can see, this framework divides the relationships into direct and indirect effects. Using path analysis we can also sum these effects up to calculate the total effect of a variable on another. To do this, we add the direct and the indirect effects. For example, for the relationship between parental education and income, we have:

\text{total}_{\text{parent\_edu} \rightarrow \text{income}} = \text{direct} + \text{indirect} = f + c \times d + c \times b \times e
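To see how this decomposition could look in lavaan, here is a minimal sketch on simulated data. Everything about the data is an assumption made for illustration: the variable names (parent_edu, iq, edu, esteem, income), the sample size, and the effect sizes are invented, and only the arrow labels a to f follow the description above.

```r
library(lavaan)

set.seed(123)
n <- 2000

# Simulated data: every variable name and effect size here is hypothetical
parent_edu <- rnorm(n)
iq         <- rnorm(n)
edu        <- 0.5 * parent_edu + 0.4 * iq + rnorm(n)
esteem     <- 0.3 * edu + rnorm(n)
income     <- 0.2 * parent_edu + 0.6 * edu + 0.25 * esteem + rnorm(n)
sim <- data.frame(parent_edu, iq, edu, esteem, income)

# The three regressions, with labels matching the arrows a to f
model_sim <- '
  edu    ~ c*parent_edu + a*iq
  esteem ~ b*edu
  income ~ f*parent_edu + d*edu + e*esteem

  # indirect effects of parental education on income
  via_edu        := c*d
  via_edu_esteem := c*b*e

  # total effect = direct + both indirect effects
  total := f + c*d + c*b*e
'

fit_sim <- sem(model_sim, data = sim)

# Split the estimates into regression paths and defined (":=") effects
pe      <- parameterEstimates(fit_sim)
paths   <- pe[pe$op == "~",  c("lhs", "rhs", "label", "est")]
effects <- pe[pe$op == ":=", c("lhs", "est", "se", "pvalue")]

paths
effects
```

The `effects` table should contain one row per defined quantity, so the two indirect routes and the total can be compared side by side.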

If an arrow between two variables is absent from a path/SEM diagram, the model assumes that the corresponding direct relationship is 0. For example, by not including an arrow from parental education to IQ, we fix that regression coefficient to 0 (so, no direct relationship). Decisions about which regression coefficients to estimate and which to fix to 0 should be based on theory and prior empirical results.

Using R and lavaan to estimate direct and indirect effects

To make these ideas more concrete, let’s look at a real example using the Understanding Society dataset. Here, we aim to determine whether education level (“degree”) directly influences mental health (SF-12 MCS score) or if part of this effect is mediated through income. This is a classic mediation problem: education may improve health partly because it increases income, which in turn shapes mental well-being.

Before running the model, we need to perform some data cleaning. The lavaan package does not always work well with factors. To make sure the coding is correct, we will make a new version of the “degree” variable that will be coded as a dummy (1 or 0) to make it easier to interpret:

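library(dplyr)

# recode the factor as a 0/1 dummy; missing values stay NA
usw <- mutate(usw,
              degree = ifelse(degree.fct == "Degree", 1, 0))

# cross-tabulate the old and new variables to check the recoding
count(usw, degree.fct, degree)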
## # A tibble: 3 × 3
##   degree.fct degree     n
##   <fct>       <dbl> <int>
## 1 Degree          1 16547
## 2 No degree       0 34349
## 3 <NA>           NA   111

Now we can run our path model. We will first write down the equations, then use the sem() command to estimate the model, and finally, look at the results.

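library(lavaan)

# one regression per line: outcome ~ predictor
model <- 'sf12mcs_1  ~ degree
         logincome_1 ~ degree
         sf12mcs_1  ~ logincome_1'

# estimate the path model
fit <- sem(model, data = usw)

# print the estimates together with standardized coefficients
summary(fit, standardized = TRUE)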
## lavaan 0.6-19 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##                                                   Used       Total
##   Number of observations                         47350       51007
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   sf12mcs_1 ~                                                           
##     degree            1.003    0.101    9.891    0.000    1.003    0.047
##   logincome_1 ~                                                         
##     degree            0.737    0.015   50.647    0.000    0.737    0.227
##   sf12mcs_1 ~                                                           
##     logincome_1       0.162    0.031    5.211    0.000    0.162    0.025
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .sf12mcs_1       102.222    0.664  153.867    0.000  102.222    0.997
##    .logincome_1       2.224    0.014  153.867    0.000    2.224    0.949

The results show an interesting pattern. Having a degree has a direct association with mental health: holding income constant, the coefficient of 1.003 indicates that people with a degree score about one point higher on the SF-12 MCS than those without. At the same time, having a degree is strongly and positively related to income, with a coefficient of 0.737 on logged income. Income, in turn, is positively associated with mental health, with a coefficient of 0.162 per unit of logged income, holding degree constant. Taken together, these findings suggest that education is linked to mental health both directly and indirectly through income.
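The indirect effect can be worked out by hand from the coefficients reported above. Here is a small sketch using the rounded estimates from the output; the object names a, b, and c_direct are chosen here for illustration only:

```r
# Rounded coefficients copied from the summary() output above
a        <- 0.737  # degree -> logincome_1
b        <- 0.162  # logincome_1 -> sf12mcs_1
c_direct <- 1.003  # degree -> sf12mcs_1, holding income constant

indirect      <- a * b                 # product of the two paths
total         <- c_direct + indirect   # direct + indirect
prop_mediated <- indirect / total      # share of the total running through income

round(c(indirect = indirect, total = total, prop_mediated = prop_mediated), 3)
# indirect = 0.119, total = 1.122, prop_mediated = 0.106
```

A hand calculation like this gives point estimates only, with no standard error or p-value for the indirect effect, and small differences from software output can arise because the inputs are rounded.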

The output is informative, but it can be challenging to understand at times. The semPlot package provides a neat way to visualize the mediation model as a path diagram:

library(semPlot)

# draw the fitted model as a path diagram with the estimates on the arrows
semPaths(
  fit,
  what = "path",
  whatLabels = "est",
  residuals = FALSE,
  rotation = 2
)
R lavaan mediation analysis output showing path coefficients from education to income to mental health.

This diagram shows variables as boxes, arrows as paths, and estimated coefficients on each line. It provides a quick overview of the direction and size of each estimated effect. With these settings the plot shows only the estimates, so statistical significance still has to be read from the summary() output.

One of the biggest advantages of path and mediation analysis is that it allows us to decompose effects into direct, indirect, and total components. Instead of manually multiplying coefficients, we can tell lavaan to calculate these quantities for us. This has the advantage of also calculating standard errors and p-values.

For this, we need to label the main paths in the model and explicitly define the indirect and total effects. In this specification, the paths are labeled “a”, “b”, and “c”, which makes it easy to refer to them when calculating the effects. The indirect effect of education on mental health through income is expressed as the product of “a” and “b”, while the total effect is defined as the sum of the direct effect “c” and the indirect effect.

model <- 'sf12mcs_1  ~ c*degree
         logincome_1 ~ a*degree
         sf12mcs_1  ~ b*logincome_1

        # indirect effect (a*b)
          ab := a*b
          
        # total effect
          total := c + (a*b)'

fit <- sem(model, data = usw)

summary(fit, standardized = TRUE)
## lavaan 0.6-19 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##                                                   Used       Total
##   Number of observations                         47350       51007
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   sf12mcs_1 ~                                                           
##     degree     (c)    1.003    0.101    9.891    0.000    1.003    0.047
##   logincome_1 ~                                                         
##     degree     (a)    0.737    0.015   50.647    0.000    0.737    0.227
##   sf12mcs_1 ~                                                           
##     logincom_1 (b)    0.162    0.031    5.211    0.000    0.162    0.025
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .sf12mcs_1       102.222    0.664  153.867    0.000  102.222    0.997
##    .logincome_1       2.224    0.014  153.867    0.000    2.224    0.949
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     ab                0.120    0.023    5.184    0.000    0.120    0.006
##     total             1.122    0.099   11.365    0.000    1.122    0.052

The regression estimates are the same as before, but we now also obtain the indirect and total effects. The indirect effect, calculated as the product of the path from education to income (“a”) and the path from income to mental health (“b”), is 0.120. When the direct and indirect effects are combined, the total effect amounts to 1.122. The overall association of education with mental health in this model is therefore positive, and most of it comes from the direct effect: 1.003 of 1.122, or roughly 89%, with the remaining 11% or so running through income.

This illustrates the strength of mediation analysis: rather than stopping at a single regression, we can break down how much of the effect flows directly versus indirectly, which helps answer questions about mechanisms and pathways.
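The link between a single regression and the decomposition can be checked with ordinary least squares. The sketch below uses simulated data in base R (the variables x, m, and y and their effect sizes are invented for illustration) and shows that, in a simple one-mediator model, the slope from regressing the outcome on the predictor alone equals the direct effect plus the product of the two mediating paths:

```r
set.seed(42)
n <- 1000

# Hypothetical data: x is a 0/1 predictor, m a mediator, y an outcome
x <- rbinom(n, 1, 0.3)
m <- 0.7 * x + rnorm(n)
y <- 1.0 * x + 0.2 * m + rnorm(n)

# Path a: predictor -> mediator
a <- coef(lm(m ~ x))[["x"]]

# Paths c (direct) and b (mediator -> outcome) from one regression
fit_y    <- lm(y ~ x + m)
c_direct <- coef(fit_y)[["x"]]
b        <- coef(fit_y)[["m"]]

# Total effect from the single regression that ignores the mediator
total_reg <- coef(lm(y ~ x))[["x"]]

# The two routes to the total effect agree up to floating-point error
all.equal(total_reg, c_direct + a * b)
```

This identity holds for linear models fitted by least squares on the same observations. It is one way to see that the path model does not change the total effect but splits it into its components.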

Expanding path models for mediation

Path analysis can be extended in many different directions. We can add control variables, include multiple mediators, model sequential processes, or move toward full structural equation modeling with latent variables. Each of these extensions helps us answer more realistic questions about social processes.

Including control variables

In real-world research based on observational data, it’s rarely enough to only consider the main variables of interest. Other factors may also influence both the mediator and the outcome. If we don’t control for these, our mediation results might be biased. Extending the model to include these controls makes the results more credible.

As an example of how we can extend the model, here is the updated version with controls for age and gender. The focus remains on the direct and indirect effects, but now we are accounting for these two confounders.

model <- 'sf12mcs_1  ~ c*degree + age + gndr.fct
         logincome_1 ~ a*degree + age + gndr.fct
         sf12mcs_1  ~ b*logincome_1

        # indirect effect (a*b)
          ab := a*b
          
        # total effect
          total := c + (a*b)'

fit2 <- sem(model, data = usw)

summary(fit2, standardized = TRUE)
## lavaan 0.6-19 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         9
## 
##                                                   Used       Total
##   Number of observations                         47350       51007
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   sf12mcs_1 ~                                                           
##     degree     (c)    1.200    0.102   11.796    0.000    1.200    0.056
##     age               0.033    0.003   12.499    0.000    0.033    0.058
##     gndr.fct         -1.684    0.094  -17.972    0.000   -1.684   -0.083
##   logincome_1 ~                                                         
##     degree     (a)    0.787    0.014   55.124    0.000    0.787    0.242
##     age               0.015    0.000   41.489    0.000    0.015    0.182
##     gndr.fct         -0.315    0.013  -23.353    0.000   -0.315   -0.102
##   sf12mcs_1 ~                                                           
##     logincom_1 (b)    0.029    0.032    0.905    0.365    0.029    0.004
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .sf12mcs_1       101.208    0.658  153.867    0.000  101.208    0.987
##    .logincome_1       2.121    0.014  153.867    0.000    2.121    0.905
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     ab                0.023    0.025    0.905    0.365    0.023    0.001
##     total             1.222    0.099   12.398    0.000    1.222    0.057

Adding age and gender as control variables changes the picture in several ways. The direct effect of education on mental health becomes slightly stronger, from 1.003 to 1.200. The effect of education on income also increases slightly, from 0.737 to 0.787. By contrast, the link between income and mental health, which was previously significant, is now very weak (0.029) and not statistically different from zero. As a result, the indirect effect of education on mental health through income (0.023) is no longer significant, suggesting that what appeared to be mediation before was largely explained by age and gender. The total effect of education on mental health remains close to its earlier value, at 1.222 compared with 1.122, confirming that education continues to predict mental health overall. The controls themselves seem to be important, with older individuals reporting slightly higher mental health and income, while women report both lower mental health and lower income.

In short, without controls, we saw evidence of mediation through income. With controls, the mediation path disappears, indicating that part of what appeared to be mediation was actually due to age and gender differences. This illustrates why including controls is crucial in applied path analysis; otherwise, we risk attributing indirect effects to the wrong mechanisms.
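The general mechanism can be illustrated with a small simulation in base R. This is a sketch under invented assumptions, not a reconstruction of the Understanding Society data: a confounder (called age here) raises both the mediator and the outcome, while the mediator itself has no effect on the outcome.

```r
set.seed(2024)
n <- 5000

# Hypothetical data-generating process: age raises both m and y,
# and m has NO effect on y
age <- rnorm(n)
x   <- rbinom(n, 1, 0.3)
m   <- 0.5 * x + 0.8 * age + rnorm(n)
y   <- 1.0 * x + 1.0 * age + rnorm(n)

# Mediator -> outcome path without and with the confounder
b_naive    <- coef(lm(y ~ x + m))[["m"]]
b_adjusted <- coef(lm(y ~ x + m + age))[["m"]]

# Predictor -> mediator path without and with the confounder
a_naive    <- coef(lm(m ~ x))[["x"]]
a_adjusted <- coef(lm(m ~ x + age))[["x"]]

round(c(b_naive = b_naive, b_adjusted = b_adjusted), 3)
round(c(indirect_naive    = a_naive * b_naive,
        indirect_adjusted = a_adjusted * b_adjusted), 3)
```

In this setup the unadjusted model reports a sizeable path from the mediator to the outcome, and therefore an apparent indirect effect, while the adjusted model puts that path close to zero.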

More complex mediation pathways in path analysis

So far, we’ve looked at a basic mediation model with a single mediator. But real social processes are rarely that simple. Often, multiple pathways connect predictors and outcomes. For example, education may affect health not only through income but also through factors such as life satisfaction, which can be influenced by income. By allowing for multiple mediators, path analysis can capture a more realistic web of relationships.

In the model below, we extend the analysis by adding life satisfaction (sati_1) as a second mediator:

Multiple mediation pathways in R path analysis: education affects income, satisfaction, and mental health through several indirect effects

In this extended path analysis, education influences both income and life satisfaction directly, with paths from degree to income and from degree to satisfaction. Income, in turn, is specified as a predictor of life satisfaction, which creates a sequential pathway from education to income to satisfaction. Both income and satisfaction are then allowed to affect mental health, capturing two additional routes through which education can have an impact. To make these mechanisms explicit, three indirect effects are defined: the pathway from education through income to mental health, the longer chain from education through income, satisfaction, and then to mental health, and the route from education directly to satisfaction and then to mental health. The total effect is then calculated as the sum of the direct path and all of these indirect components, giving a complete picture of how education influences mental health across multiple pathways.
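The same decomposition can be written out as equations. This is a sketch that uses the path labels a to f from the model specification below and leaves out intercepts and residuals for brevity. The shorthand D (degree), I (logged income), S (life satisfaction), and H (mental health) is introduced here only for illustration:

```latex
\begin{aligned}
I &= aD \\
S &= dD + eI \\
H &= cD + bI + fS \\
\text{total}_{D \rightarrow H} &= c + a \times b + a \times e \times f + d \times f
\end{aligned}
```

Substituting the first two equations into the third gives the last line: each indirect effect is the product of the coefficients along one route from D to H.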

Here is the code for running such a model:

model <- '
    sf12mcs_1 ~ c*degree
    logincome_1 ~ a*degree
    sati_1 ~ d*degree + e*logincome_1
    sf12mcs_1 ~ b*logincome_1 + f*sati_1
    
  # indirect effects
    ab := a*b
    aef := a*e*f
    df := d*f
    
  # total effect
    total := c + ab + aef + df
'

fit3 <- sem(model, data = usw)

summary(fit3)

Running this extended model shows how the effect of education is distributed across several pathways. Education continues to have a direct association with mental health, while income provides one route of mediation on its own. A second, more complex chain emerges when income influences satisfaction, which in turn contributes to mental health, capturing the idea that education indirectly shapes well-being through both economic and psychological channels (although this is not significant here). There is also a pathway in which education directly affects satisfaction, which in turn influences mental health. By summing these together, the total effect of education reflects both its direct role and its various indirect contributions. Rather than focusing only on a single route, this approach reveals how education exerts its influence through multiple overlapping processes, offering a more nuanced understanding of the mechanisms that sustain social inequalities.

The semPlot package can again help make sense of this more complex diagram:

semPaths(
  fit3,
  what = "path",
  whatLabels = "est",
  residuals = FALSE,
  rotation = 2,
  edge.label.cex = 1.2,
  nCharNodes = 10,
  nCharEdges = 5,
  label.cex = 1.2,
  sizeMan = 8
)
Comprehensive path and mediation analysis diagram in R illustrating direct and indirect effects of education on mental health through income and satisfaction

The plot clearly shows multiple mediators and pathways, with the estimated coefficients displayed on each arrow. Visualizing these models is especially useful once they become more complex, as it helps to see at a glance how the variables connect.

Conclusions on path models and mediation

Mediation and path analysis allow us to move beyond simple regression and uncover the mechanisms that link predictors and outcomes. In this post, we examined how education can impact mental health both directly and indirectly, and how these pathways are altered when control variables are introduced or multiple mediators are considered. The ability to separate effects into direct, indirect, and total components makes path analysis a powerful tool for testing theories in the social sciences.

At the same time, the examples here were based on cross-sectional data, which only capture relationships at a single point in time. Many of the most interesting questions are longitudinal in nature. We may want to know how changes in education, income, or life satisfaction affect mental health across the life course, whether early-life advantages accumulate to shape outcomes years later, or whether mediation effects are stable across different stages of life.

Path analysis naturally extends to longitudinal models such as autoregressive models, cross-lagged panel models, and latent growth models. These approaches allow us to investigate not only whether mediation occurs, but also when and how it unfolds over time. In future posts, we will explore these longitudinal extensions and demonstrate how the same principles of direct and indirect effects can be applied to repeated measures data, thereby opening the way to richer insights into how social processes develop.

