Data structures for longitudinal analysis: wide versus long

Posted by:

|

On:

|

Longitudinal data is essential for understanding how individuals, organizations and societies change. But what exactly is longitudinal data? How do we create, store and reshape such data to answer our research questions?

What is longitudinal data?

Different academic fields have different definitions of longitudinal data. To investigate individual-level change and apply the methods often discussed in this blog, longitudinal data needs to meet one important criterion: data has to be collected multiple times from the same elements. Elements can be individuals, organizations, societies, sensors and so on. This definition is not met by designs such as repeated cross-sectional studies (where there are new individuals each wave) but is met by designs such as panel studies (where the same individuals are interviewed regularly). This definition covers a wide range of other data sources, such as administrative data and sensor data.

Types of variables

Before we discuss how to create and store longitudinal data, it is important to understand the different types of variables we can have in this context. When working with longitudinal data, we can consider variables as time-varying or time-constant. For example, if we collect a variable such as general satisfaction each wave, we can treat it as time-varying. As a result, for each individual at each wave, this value could change (although it is not required). Other variables that do not change in time, such as year of birth, are treated as time-constant variables. It is enough to collect this information once from each individual, as we do not expect it to change.

There are also some situations where we can choose if we treat a variable as time constant or time-varying. For example, we could recode birth year as age, which is a time-varying variable. Similarly, we could take each individual’s average of “general satisfaction” and make that variable time constant. In such situations, the decision to treat a variable as time constant or variable should be based on theoretical considerations, the data available and the modelling strategy used.

Creating longitudinal data

Typically, data is stored separately for each wave in larger studies. For example, after downloading Understanding Society, we have one dataset for each wave and each level/type (e.g., individual-level data, household-level data, relationship data). To do longitudinal data analysis, we need to bring these datasets together.

For example, imagine we have these two datasets:

data_w1
## # A tibble: 4 × 3
##   id         sf1 birthy
##   <chr>    <dbl>  <dbl>
## 1 68004087     4   1986
## 2 68008847     2   2001
## 3 68009527     3   1978
## 4 68009521     5   1988
data_w2
## # A tibble: 4 × 2
##   id         sf1
##   <chr>    <dbl>
## 1 68004087     3
## 2 68008847     3
## 3 68009527     3
## 4 68009522     4

To bring the data together, we need a unique ID for each individual (the “id” column in this case) and decide if variables will be treated as time-constant or time-varying. In this example, we can treat “sf1” (a measure of general health) as a time-varying (it is measured at multiple points in time and can change) while we can treat “birthy” as a time-constant. The latter is measured just once and does not change in time.

We can then use one of two strategies to merge the data. The first strategy is to merge the datasets by column. There are different commands for this, but the …_join() commands in the “tidyverse” package are very useful here. To bring all the data together, we can use full_join() by adding the datasets and the unique ID as inputs.

Before we attempt this type of data merging, we need to make sure that the time-varying variables have unique names. It is usually good practice to add a wave-specific suffix:

library(tidyverse)
data_w1b <- rename(data_w1, sf1_1 = sf1)
data_w1b
## # A tibble: 4 × 3
##   id       sf1_1 birthy
##   <chr>    <dbl>  <dbl>
## 1 68004087     4   1986
## 2 68008847     2   2001
## 3 68009527     3   1978
## 4 68009521     5   1988
data_w2b <- rename(data_w2, sf1_2 = sf1)
data_w2b
## # A tibble: 4 × 2
##   id       sf1_2
##   <chr>    <dbl>
## 1 68004087     3
## 2 68008847     3
## 3 68009527     3
## 4 68009522     4

Now we can join the data:

data_w12 <- full_join(data_w1b, data_w2b, by = "id")
data_w12
## # A tibble: 5 × 4
##   id       sf1_1 birthy sf1_2
##   <chr>    <dbl>  <dbl> <dbl>
## 1 68004087     4   1986     3
## 2 68008847     2   2001     3
## 3 68009527     3   1978     3
## 4 68009521     5   1988    NA
## 5 68009522    NA     NA     4

We now have five individuals in the data and four variables. Three of the individuals are present in both waves, while an individual is present in just one of the waves. This could be due to missing data (people dropping out) or new people joining the study.

There are different versions of the ..._join() commands that can facilitate selecting the cases we want. You can find out more here.

The data we created is known as unbalanced data. This means that we include in the data any case, even if it is present just once. If, on the other hand, we wanted to select only cases present in all the waves, also known as balanced data, we could use inner_join():

inner_join(data_w1b, data_w2b, by = "id")
## # A tibble: 3 × 4
##   id       sf1_1 birthy sf1_2
##   <chr>    <dbl>  <dbl> <dbl>
## 1 68004087     4   1986     3
## 2 68008847     2   2001     3
## 3 68009527     3   1978     3

The alternative way to combine data is to join them by row. Before that, it is good practice to create a “time” or “wave” variable so we know where the data comes from.

data_w1c <- mutate(data_w1, wave = 1)
data_w1c
## # A tibble: 4 × 4
##   id         sf1 birthy  wave
##   <chr>    <dbl>  <dbl> <dbl>
## 1 68004087     4   1986     1
## 2 68008847     2   2001     1
## 3 68009527     3   1978     1
## 4 68009521     5   1988     1
data_w2c <- mutate(data_w2, wave = 2)
data_w2c
## # A tibble: 4 × 3
##   id         sf1  wave
##   <chr>    <dbl> <dbl>
## 1 68004087     3     2
## 2 68008847     3     2
## 3 68009527     3     2
## 4 68009522     4     2

We can then use the bind_rows() command to do the merger:

data_l12 <- bind_rows(data_w1c, data_w2c)
data_l12
## # A tibble: 8 × 4
##   id         sf1 birthy  wave
##   <chr>    <dbl>  <dbl> <dbl>
## 1 68004087     4   1986     1
## 2 68008847     2   2001     1
## 3 68009527     3   1978     1
## 4 68009521     5   1988     1
## 5 68004087     3     NA     2
## 6 68008847     3     NA     2
## 7 68009527     3     NA     2
## 8 68009522     4     NA     2

Notice that because the time constant variable is present only in wave 1, it only has missing information in the second one (NA). We could copy the values from the first wave to all the others by first grouping the data by individual and then taking the maximum of the variable:

data_l12 %>% 
  group_by(id) %>% 
  mutate(birthy = max(birthy, na.rm = T))
## # A tibble: 8 × 4
## # Groups:   id [5]
##   id         sf1 birthy  wave
##   <chr>    <dbl>  <dbl> <dbl>
## 1 68004087     4   1986     1
## 2 68008847     2   2001     1
## 3 68009527     3   1978     1
## 4 68009521     5   1988     1
## 5 68004087     3   1986     2
## 6 68008847     3   2001     2
## 7 68009527     3   1978     2
## 8 68009522     4   -Inf     2

Notice that one individual has -Inf because they have no information available (they joined the study in wave 2).

Which strategy should we use when merging longitudinal data? My general recommendation is to use the first approach, where we merge by adding columns (also known as mutating joins). One reason is that it gives more control regarding which cases to include in the data. It also makes it easier to identify errors and understand the patterns of participation. Finally, in large datasets, we tend to have a mix of time-constant and time-varying variables. Mergers by rows can be complicated in such situations.

Data structures

The two different strategies for merging longitudinal data also highlight that such data can be stored in two different ways: wide and long formats. Merging the columns leads to a wide format. Here is another example with four waves:

## # A tibble: 3 × 6
##   id       sf1_1 sf1_2 sf1_3 sf1_4 birthy
##   <chr>    <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 68004087     4     3     3     4   1986
## 2 68008847     2     3     4     1   2001
## 3 68009527     3     3     3     3   1978

We see that this is similar to cross-sectional data because each row represents data for an individual (with a unique “id”). If we have information from multiple waves, it appears in different columns. For example, “sf1” is measured four times. The suffix after “_” represents the wave in which the data was collected. We can also have time-constant variables that do not have the suffix (like birth year: “birthy”). In this type of data, if we want to look at change in time, we need to look at how values change within each row. For example, the person with id “68004087” has an initial decrease on “sf1”, and then, by wave 4, it returns to the original value. On the other hand, the individual with id “68009527” shows no change at all. We call this the wide data format, as new waves of data will bring in more columns, making the data wider.

Alternatively, we can store the data in the long format. This looks like this:

## # A tibble: 12 × 4
##          id  wave   sf1 birthy
##       <dbl> <dbl> <dbl>  <dbl>
##  1 68004087     1     4   1986
##  2 68004087     2     3   1986
##  3 68004087     3     3   1986
##  4 68004087     4     4   1986
##  5 68008847     1     2   2001
##  6 68008847     2     3   2001
##  7 68008847     3     4   2001
##  8 68008847     4     1   2001
##  9 68009527     1     3   1978
## 10 68009527     2     3   1978
## 11 68009527     3     3   1978
## 12 68009527     4     3   1978

Now, each row is a combination of person and time. Because each person participated in the study four times, there were four different rows. Now we have a new variable, “wave”, which tells us when the data on that row was collected. This way of storing data is called the long format, as new waves of data make the data longer.

One format is not better than the other; they have different strengths and uses. The wide format is somewhat easier to work with and is the format used by models based on the Structural Equation Modeling framework, like the latent growth model and the cross-lagged model. It is also easier to use if you want to create correlation matrices and transition tables. The long format is used by some other models, like the fixed effects model and the multilevel model for change. It is also generally easier to clean and visualise the data in the long format. It is possible to restructure the data relatively easily from one format to another as we will discuss next.

Reshaping data wide to long

Imagine we just imported and merged our data. Often, cleaning longitudinal data is easier to do in the long format. There are multiple ways of doing this, but a flexible approach is the pivot_longer() command from “tidyverse”. Here, we are going to use the following options:

  • the data,
  • the columns that are time-varying (notice here that we select the time constant variables and put “!” to take the opposite, i.e., the other variables),
  • the separator that is between the name of the variable and the wave number,
  • the pattern in the name. Here, we say that the first part of the name is the variable name (defined as “.value”) and that the suffix should be a new variable called “wave”.

You can find out more about this command by running vignette("pivot") in the R console.

Here is the final command:

data_long <- pivot_longer(
  data_wide,
  cols = !c(id, birthy),
  names_sep = "_",
  names_to = c(".value", "wave")
  )
data_long
## # A tibble: 12 × 4
##    id       birthy wave    sf1
##    <chr>     <dbl> <chr> <dbl>
##  1 68004087   1986 1         4
##  2 68004087   1986 2         3
##  3 68004087   1986 3         3
##  4 68004087   1986 4         4
##  5 68008847   2001 1         2
##  6 68008847   2001 2         3
##  7 68008847   2001 3         4
##  8 68008847   2001 4         1
##  9 68009527   1978 1         3
## 10 68009527   1978 2         3
## 11 68009527   1978 3         3
## 12 68009527   1978 4         3

Reshaping data long to wide

Imagine that we have finished cleaning the data and now want to run a Latent Growth Model. For such a model, we need the data in wide format. The command we will use is pivot_wider() (the opposite of pivot_longer()). We use the following inputs:

  • name of long data
  • time-varying variables (values_from)
  • how to separate the suffix (names_sep)
  • where to take the suffix from (names_from)
pivot_wider(
  data_long,
  values_from = !c(id, birthy),
  names_sep = "_",
  names_from = wave
  ) %>% 
  select(-starts_with("wave"))
## # A tibble: 3 × 6
##   id       birthy sf1_1 sf1_2 sf1_3 sf1_4
##   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68004087   1986     4     3     3     4
## 2 68008847   2001     2     3     4     1
## 3 68009527   1978     3     3     3     3

By default, this creates wave variables for each wave, which is not very useful in the wide format. The select() command removes them from the data.

Conclusions

Here, we covered some of the key decisions we need to consider when preparing longitudinal data for analysis. First, we must decide how to treat the variables, time constant or varying. Second, we must decide how to merge the data and what cases to keep. Here, I recommend merging the data by columns (using the …_join() commands) and creating wide data. You can decide to use balanced or unbalanced data depending on the research question. Third, I recommend restructuring the data in the long format to make it easier to clean. The final data version (long or wide) will depend on how you plan to use it. Visualization using ggplot and models such as fixed effects, multilevel models for change and survival models work with data in the long format while models such as cross-lagged or latent growth use the wide data.


Was the information useful?

Consider supporting the site by: