
Attrition bias: what is it and how to measure it

Longitudinal studies are powerful because they allow us to observe change over time. But this ability also has a big drawback: if people drop out of the study, we lose information about their trajectories. This is called attrition, and it can bias our results. In this post, you will find out what attrition is, how to calculate attrition rates, and how to assess attrition bias.

What is attrition?

Attrition occurs when respondents who participated earlier in a study are no longer observed in later waves. This can happen for various reasons: they may move away, lose interest, or be unable to participate due to health issues. This comes on top of the missing-data challenges that already occur in cross-sectional studies.

For example, in a cross-sectional study, we have two main types of missing data. Item (or question) missing occurs when a respondent participates in the study but does not answer a specific question, for example because the question is sensitive, confusing, or simply skipped. On the other hand, unit missing (or non-response) occurs when a respondent does not participate in the survey at all, for example because they could not be contacted or refused to take part.

We can visualise this as below, where “1” represents an answer and “0” represents missing data.

Figure: non-response patterns in a cross-sectional study

Our ideal is the first unit, or individual, that answers all the questions. Individuals 2 and 3 participate in the study but decide not to answer some questions. This is item missing. Individual 4 does not participate in the study at all, which is unit missing.
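To make the pattern concrete, we can sketch it as a small table in R (a purely hypothetical illustration; the `id` and `q1`–`q3` names are made up for this example):

```r
library(tibble)

# Hypothetical version of the figure above: rows are individuals,
# columns are questions; 1 = answered, 0 = missing.
patterns <- tribble(
  ~id, ~q1, ~q2, ~q3,
    1,   1,   1,   1,  # complete case
    2,   1,   0,   1,  # item missing
    3,   1,   1,   0,  # item missing
    4,   0,   0,   0   # unit missing (non-response)
)
```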

If we collect data from the same cases multiple times, as in a longitudinal study, we can have different patterns of missing:

Figure: attrition patterns in a longitudinal study

Again, our ideal is the first unit, or individual, who participates in all waves. Individuals 2 and 3 participate in the first wave but then drop out. They exhibit different patterns of participation, with individual 3 later rejoining the study. Individual 4 does not participate in the study at all.

In general, we define attrition as cases that participated at least once in the study but then dropped out at some point. Based on this definition, individuals 2 and 3 have attrited.

Why is attrition important?

When people drop out of a study, we lose information about their trajectories. This can lead to biased results if the people who drop out are systematically different from those who remain in the study. For example, if people with lower income are more likely to drop out of a study on income and well-being, then our results may overestimate the average income and well-being of the population. The extent to which this is a problem is estimated using attrition bias.

The second issue is that our samples shrink as cases drop out. This can lead to less precise estimates and less statistical power to detect effects. To understand this process, we can use attrition rates.

How to calculate attrition rate?

One way to understand attrition is to examine patterns of participation in the study. As an illustration, we will use data based on the first four waves of Understanding Society. We will first load the tidyverse package for data cleaning and exploration, and haven to read the data from Stata files.

Understanding Society stores each wave of data in a separate file, so we need to load each wave and then merge them. Each file has a respondent identifier variable called “pidp”, which we will use for the merge.

library(tidyverse)
library(haven)

us1 <- read_dta("./data/a_indresp_syn.dta")
us2 <- read_dta("./data/b_indresp_syn.dta")
us3 <- read_dta("./data/c_indresp_syn.dta")
us4 <- read_dta("./data/d_indresp_syn.dta")

Before we bring the data together, we will create an indicator telling us whether a respondent participated in each wave. By default, when you merge data, R does not tell you whether a case is present in one dataset or both (unlike Stata, for example). A way around this is to add a variable to each wave marking that the individual is present there. After we merge the data, if an individual has missing information on one of these indicators, it means that they did not participate in that wave.

us1 <- mutate(us1, present_1 = T)
us2 <- mutate(us2, present_2 = T)
us3 <- mutate(us3, present_3 = T)
us4 <- mutate(us4, present_4 = T)

We then merge waves using the respondent identifier. Here we will use the left_join() command. This will keep all the cases from the first dataset we input, and append information from the second dataset. In this study, people can join later, for example, if a new member joins a family. To keep things simple, we will assume we care only about the original sample, so we will use the first wave as our main dataset and merge the others with it. If we wanted to include new members, we could use a different merge type, such as full_join(), which would keep all cases from both datasets.

We will merge the data in a stepwise way, first merging the first two waves, then merging the result with the third wave and finally merging that result with the fourth wave.

us12 <- left_join(us1, us2, by = "pidp")
us123 <- left_join(us12, us3, by = "pidp")
us1234 <- left_join(us123, us4, by = "pidp")
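As a side note, the three stepwise calls above can also be written as a single call using purrr's reduce(), which applies left_join() across a list of data frames. Here is a self-contained sketch with toy stand-ins for the four waves (the real data would come from read_dta() as above):

```r
library(dplyr)
library(purrr)

# Toy stand-ins for the four waves, each with an identifier and a
# presence indicator (in the real analysis these come from read_dta()):
us1 <- tibble(pidp = 1:4,        present_1 = TRUE)
us2 <- tibble(pidp = c(1, 2, 3), present_2 = TRUE)
us3 <- tibble(pidp = c(1, 3),    present_3 = TRUE)
us4 <- tibble(pidp = 1,          present_4 = TRUE)

# Equivalent to the three stepwise left_join() calls:
us1234 <- reduce(list(us1, us2, us3, us4), left_join, by = "pidp")
```

This scales more gracefully if you later add further waves to the list.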

Now “us1234” has individuals who participated in the first wave and information from four waves. We can now explore the patterns of participation in the study. For example, we can look at how many people participated in the first two waves by using the code below, where count() creates a table, and then we also calculate the proportions using mutate():

count(us1234, present_1, present_2) |> 
  mutate(prop = n/sum(n))
## # A tibble: 2 × 4
##   present_1 present_2     n  prop
##   <lgl>     <lgl>     <int> <dbl>
## 1 TRUE      TRUE      38291 0.751
## 2 TRUE      NA        12716 0.249

Here we see that 38,291 people participated in both waves, while 12,716 dropped out after the first wave. We can now also calculate the attrition rate, which is simply the proportion of people who dropped out after the first wave. In this case, it is 12,716/(38,291+12,716) = 0.25, which means that 25% of the original sample dropped out after the first wave.
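The same arithmetic can be done directly in R, plugging in the counts from the table above:

```r
n_stayed  <- 38291  # participated in both wave 1 and wave 2
n_dropped <- 12716  # participated in wave 1 only

# Attrition rate = drop-outs as a share of the original sample
attrition_rate <- n_dropped / (n_stayed + n_dropped)
round(attrition_rate, 2)
# 0.25
```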

We can follow this procedure for multiple waves. Let’s imagine we want to analyse four data waves. We can use all four indicators:

count(us1234, present_1, present_2, present_3, present_4) |> 
  mutate(prop = n/sum(n))
## # A tibble: 8 × 6
##   present_1 present_2 present_3 present_4     n    prop
##   <lgl>     <lgl>     <lgl>     <lgl>     <int>   <dbl>
## 1 TRUE      TRUE      TRUE      TRUE      26748 0.524  
## 2 TRUE      TRUE      TRUE      NA         4146 0.0813 
## 3 TRUE      TRUE      NA        TRUE       1618 0.0317 
## 4 TRUE      TRUE      NA        NA         5779 0.113  
## 5 TRUE      NA        TRUE      TRUE       1581 0.0310 
## 6 TRUE      NA        TRUE      NA          622 0.0122 
## 7 TRUE      NA        NA        TRUE        332 0.00651
## 8 TRUE      NA        NA        NA        10181 0.200

We now see more complex participation patterns. After four waves, we have lost almost half of the sample. The most common patterns are: participating in all four waves (52%), dropping out after the first wave (20%), dropping out after the first two waves (11%), and dropping out after the first three waves (8%).

When we analyse more than two waves of data, there are different ways to define attrition rates. Probably the most common one is to compare people who participated in all the waves you plan to analyse with those who dropped out at some point. This also matches the definition of a balanced panel in longitudinal data analysis.

If we plan to analyse all four waves here, we can create an indicator of whether someone is present in all four waves. We can do this using the case_when() command, which allows us to create a new variable based on multiple conditions. Here, we check whether the person is present in all four waves. If they are, we give them a value of FALSE (as they did not attrit); otherwise, we give them a value of TRUE (as they attrited).

us1234 <- mutate(
  us1234,
  attrition = case_when(
    present_1 == T &
      present_2 == T &
      present_3 == T & 
      present_4 == T ~ F,
    TRUE ~ T
  ))

Based on this definition, the attrition rate over the first four waves is around 48%:

count(us1234, attrition) |> 
  mutate(prop = n/sum(n))
## # A tibble: 2 × 3
##   attrition     n  prop
##   <lgl>     <int> <dbl>
## 1 FALSE     26748 0.524
## 2 TRUE      24259 0.476

Attrition versus item missing

Attrition depends on whether an individual participates in a particular wave. But in addition to this, we still have the challenge of item missing, which is when a respondent participates in a wave but does not answer a specific question. It is useful to separate these two types of missing data because they may have different mechanisms and solutions.

As an illustration, let’s look at the patterns of missing data in income. We will take the original income variables and create new versions that indicate whether an individual has missing data in each wave. The code below uses the mutate_at() command, which allows us to apply a function to multiple variables at once. Here, we use vars(matches("grs")) to select all variables that have “grs” in their name, which are the income variables in our dataset. We then apply is.na() to check whether each value is missing. This creates new variables with the suffix “_miss” that indicate whether the corresponding original variable is missing.

us1234 <- mutate_at(us1234, vars(matches("grs")), list("miss" = ~ is.na(.)))

Now we can explore the patterns of missing income. For example, we can look at how many people have missing data on income in each wave:

us1234 |> 
  count(a_fimngrs_dv_miss, b_fimngrs_dv_miss, 
        c_fimngrs_dv_miss, d_fimngrs_dv_miss) |> 
  mutate(prop = n/sum(n)) |>
  arrange(desc(n)) 
## # A tibble: 16 × 6
##    a_fimngrs_dv_miss b_fimngrs_dv_miss c_fimngrs_dv_miss d_fimngrs_dv_miss     n
##    <lgl>             <lgl>             <lgl>             <lgl>             <int>
##  1 FALSE             FALSE             FALSE             FALSE             26635
##  2 FALSE             TRUE              TRUE              TRUE              10175
##  3 FALSE             FALSE             TRUE              TRUE               5771
##  4 FALSE             FALSE             FALSE             TRUE               4178
##  5 FALSE             FALSE             TRUE              FALSE              1629
##  6 FALSE             TRUE              FALSE             FALSE              1611
##  7 FALSE             TRUE              FALSE             TRUE                632
##  8 FALSE             TRUE              TRUE              FALSE               334
##  9 TRUE              FALSE             FALSE             FALSE                13
## 10 TRUE              TRUE              TRUE              TRUE                 12
## 11 TRUE              TRUE              FALSE             FALSE                 8
## 12 TRUE              FALSE             TRUE              TRUE                  5
## 13 TRUE              FALSE             FALSE             TRUE                  1
## 14 TRUE              FALSE             TRUE              FALSE                 1
## 15 TRUE              TRUE              FALSE             TRUE                  1
## 16 TRUE              TRUE              TRUE              FALSE                 1
## # ℹ 1 more variable: prop <dbl>

We see that 26,635 respondents have income data in all four waves, while the rest show different patterns of missing. While this is useful, it combines attrition and item missing. If we want to understand the relative importance of attrition and item missingness in this process, we have two options.

The first approach is to look only at individuals who are present in all the waves and then calculate the number of items missing:

us1234 |> 
  filter(attrition == F) |>
  count(a_fimngrs_dv_miss, b_fimngrs_dv_miss, 
        c_fimngrs_dv_miss, d_fimngrs_dv_miss) |> 
  mutate(prop = n/sum(n)) |>
  arrange(desc(n)) 
## # A tibble: 7 × 6
##   a_fimngrs_dv_miss b_fimngrs_dv_miss c_fimngrs_dv_miss d_fimngrs_dv_miss     n
##   <lgl>             <lgl>             <lgl>             <lgl>             <int>
## 1 FALSE             FALSE             FALSE             FALSE             26635
## 2 FALSE             FALSE             FALSE             TRUE                 46
## 3 FALSE             TRUE              FALSE             FALSE                35
## 4 FALSE             FALSE             TRUE              FALSE                13
## 5 TRUE              FALSE             FALSE             FALSE                13
## 6 TRUE              TRUE              FALSE             FALSE                 5
## 7 FALSE             FALSE             TRUE              TRUE                  1
## # ℹ 1 more variable: prop <dbl>

Now it becomes clear that most of the missing income data are due to attrition. Among those who participated in all four waves, over 99% have income data in all four.

The second approach is to look at unit and item missing at the same time. For example, for the first two waves, these are the patterns of missing data:

us1234 |> 
  count(present_1, a_fimngrs_dv_miss, present_2, b_fimngrs_dv_miss) |> 
    mutate(prop = n/sum(n)) |>
  arrange(desc(n)) 
## # A tibble: 6 × 6
##   present_1 a_fimngrs_dv_miss present_2 b_fimngrs_dv_miss     n     prop
##   <lgl>     <lgl>             <lgl>     <lgl>             <int>    <dbl>
## 1 TRUE      FALSE             TRUE      FALSE             38213 0.749   
## 2 TRUE      FALSE             NA        TRUE              12702 0.249   
## 3 TRUE      FALSE             TRUE      TRUE                 50 0.000980
## 4 TRUE      TRUE              TRUE      FALSE                20 0.000392
## 5 TRUE      TRUE              NA        TRUE                 14 0.000274
## 6 TRUE      TRUE              TRUE      TRUE                  8 0.000157

Again, it is clear that the main mechanism is people dropping out of the study.

What is attrition bias?

So far, we have looked at the rate at which people drop out of the study. But the more important question is whether this attrition is random or systematic. If the people who drop out differ from those who remain in the study, we have attrition bias. This can lead to incorrect results and misleading conclusions.

We can assess attrition bias by comparing the characteristics of people who dropped out with those who remained in the study. Typically, we use information from the first wave to do this, since we have data on all respondents at the beginning of the study. Alternatively, other sources, such as administrative records or information about where people live, can be used for the comparison.

Here, we will explore attrition bias by comparing the characteristics of people who dropped out with those who remained in the study, using wave 1 data. For example, we can look at the attrition rate separately by sex:

us1234 |> 
  group_by(sex) |>
  summarise(attrition_rate = mean(attrition))
## # A tibble: 2 × 2
##   sex    attrition_rate
##   <fct>           <dbl>
## 1 Male            0.489
## 2 Female          0.465

In this case, men have a higher attrition rate (48.9%) than women (46.5%). This indicates that we have attrition bias by sex. The differences are not huge, but they are there.

As another example, we can look at the average age of people who dropped out compared to those who remained in the study:

us1234 |> 
  group_by(attrition) |>
  summarise(age = mean(age, na.rm = T))
## # A tibble: 2 × 2
##   attrition   age
##   <lgl>     <dbl>
## 1 FALSE      47.5
## 2 TRUE       43.4

Here we see that people who dropped out are, on average, younger (43.4 years) than those who remained in the study (47.5 years). This indicates that we have attrition bias on age as well.

Estimating attrition bias using regressions

A more systematic way to assess the extent of attrition bias is to run a regression model in which the outcome is the attrition indicator and the predictors are the characteristics of people at the beginning of the study. The advantage of this approach is that it accounts for the potential bias of other variables. In this way, we can better understand the main mechanisms related to attrition.

As an example, we can run a logistic regression model where the outcome is attrition in the first four waves and the predictors are sex, age, single, urban and degree:

m1 <- glm(attrition ~ sex + age + single + urban + degree,
          data = us1234, family = "binomial")

exp(cbind(OR = coef(m1), confint(m1)))
##                        OR     2.5 %    97.5 %
## (Intercept)     1.2653067 1.1928472 1.3421986
## sexFemale       0.8891965 0.8582038 0.9213022
## age             0.9883368 0.9873441 0.9893292
## singleYes       1.3336241 1.2852941 1.3837777
## urbanrural area 0.8771820 0.8393900 0.9166397
## degreeNo degree 1.3012695 1.2528007 1.3516534

We can interpret the results as odds ratios, where values above 1 indicate a higher chance of attrition and values below 1 a lower chance. For example, women have approximately 11% lower odds of attrition than men, while each additional year of age is associated with roughly 1% lower odds of attrition. The results indicate that all of these variables are associated with attrition.

To make the results a little more intuitive, we can also look at the predicted probabilities. This will create an indicator of the attrition risk for each individual in the dataset, based on their characteristics. We can then use this indicator to explore how individual characteristics relate to the likelihood of attrition.

us1234$attrit_pred <- predict(m1, type = "response", newdata = us1234)

For example, we can look at the predicted probabilities of attrition by age using a graph:

ggplot(us1234, aes(age, attrit_pred)) +
  geom_point(alpha = 0.1)
Figure: attrition bias by age in a longitudinal study

The relationship between age and attrition is now clearer.

As another example, we could examine the predicted attrition probabilities by degree. This allows us to see how the chance of dropping out depends on education:

us1234 |> 
  group_by(degree) |>
  summarise(pred_attrition = mean(attrit_pred))
## # A tibble: 3 × 2
##   degree    pred_attrition
##   <chr>              <dbl>
## 1 Degree             0.433
## 2 No degree          0.495
## 3 <NA>              NA

We see that people with a degree have a predicted attrition rate of 43%, compared with 49% for those without a degree.

The predicted attrition probabilities can also be used to create weights that correct for attrition bias in our analyses. This is a more advanced topic that we will cover in a future post.
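As a brief preview of that idea, a standard approach is inverse probability weighting: respondents who stayed in the sample are weighted by the inverse of their predicted probability of staying, so that cases resembling the drop-outs count for more. Below is a minimal sketch using toy values in place of the attrit_pred variable created above:

```r
library(dplyr)

# Toy data: predicted attrition probabilities and observed attrition status
df <- tibble(
  attrit_pred = c(0.2, 0.5, 0.8, 0.4),
  attrition   = c(FALSE, FALSE, TRUE, FALSE)
)

# Inverse probability weights: respondents who stayed get a weight equal
# to 1 over their predicted probability of staying; drop-outs get NA
df <- mutate(df,
  ipw = if_else(!attrition, 1 / (1 - attrit_pred), NA_real_)
)
```

In practice these weights are usually combined with the survey's design weights and may need trimming or normalisation, which is why we leave the details for a future post.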

Conclusions on attrition and attrition bias

Attrition is a common problem in longitudinal studies, and it can lead to smaller samples and biased results if the people who drop out are systematically different from those who remain in the study. As a result, it is essential to calculate attrition rates and estimate the extent of attrition bias. This can be done by comparing the characteristics of people who dropped out with those who remained in the study, either through simple comparisons or by running regression models. By understanding attrition and its potential bias, we can take steps to mitigate its effects and improve the validity of our longitudinal research.

