Longitudinal data can be complex: it includes multiple cases observed at different points in time. This complexity grows with missing-data patterns, nested structures (such as individuals within households), and different variable types (such as time-constant versus time-varying variables). Visualizing such data can help us understand it better. This post covers the main types of graphs used to explore longitudinal data in R. We will apply these techniques to the synthetic data from our previous post, which replicates a large social science panel study. Visualization complements the exploration using tables and summary statistics that we covered in a previous post.
Preparing for visualization
You can follow this guide by running the code in R on your own computer. Our examples will use synthetic (simulated) data modelled after Understanding Society, a comprehensive panel study from the UK. The real data can be accessed for free from the UK Data Archive.
In a previous blog, we cleaned the data and stored it in both the long and the wide formats. We will use that cleaned data here. If you want to follow along, you can download the data from here and all the code from here.
Before we start, make sure you have the tidyverse and haven packages installed and loaded by running the following code:
install.packages("tidyverse")

library(tidyverse)
library(haven)
Although the haven package is not required for importing the data here, it simplifies working with and recoding data that was originally imported using it. We will use the ggplot2 package from the tidyverse for visualization.
Next, we will load the data prepared in the previous post. This contains both the long and the wide format data in an “RData” file. We will load the long-format data and have a quick look at it:
load("./data/us_clean_syn.RData") glimpse(usl)
## Rows: 204,028
## Columns: 32
## $ pidp        <dbl> 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10, 11, 11…
## $ wave        <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4…
## $ age         <dbl+lbl> 28, 28, 28, 28, 80, 80, 80, 80, 60, 60, 60, 60, 42, 42…
## $ hiqual      <dbl+lbl> 1, 1, 1, 1, 9, 9, 9, 9, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, …
## $ single      <dbl+lbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, NA, NA, NA, 0, NA…
## $ fimngrs     <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ sclfsato    <dbl+lbl> -9, 6, 2, 1, 6, 6, -8, 3, 2, NA, NA, NA, -9, NA…
## $ sf12pcs     <dbl> 54.51, 62.98, 56.97, 56.15, 53.93, 46.39, NA, 46.16, 33.18…
## $ sf12mcs     <dbl> 55.73, 36.22, 60.02, 59.04, 46.48, 45.39, NA, 37.02, 46.80…
## $ istrtdaty   <dbl+lbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, …
## $ sf1         <dbl+lbl> 2, 2, 1, 3, 4, 3, 3, 3, 5, NA, NA, NA, 2, NA…
## $ present     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, …
## $ gndr.fct    <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male…
## $ single.fct  <fct> Single, Single, Single, In relationship, In relationship, …
## $ urban.fct   <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban, Urban, Ur…
## $ degree.fct  <fct> Degree, Degree, Degree, Degree, No degree, No degree, No d…
## $ vote6.fct   <fct> Not very, Not at all, Not at all, Not at all, Not at all, …
## $ sati        <dbl> NA, 6, 2, 1, 6, 6, NA, 3, 2, NA, NA, NA, NA, NA, 5, NA, NA…
## $ sati.fct    <fct> NA, Mostly satisfied, Mostly dissatisfied, Completely diss…
## $ fimngrs.cap <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ logincome   <dbl> 8.099818, 8.297170, 8.196070, 6.757514, 6.809039, 6.577861…
## $ year        <dbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, NA, …
## $ sati.ind    <dbl> 3.000000, 3.000000, 3.000000, 3.000000, 5.000000, 5.000000…
## $ sati.dev    <dbl> NA, 3.000000, -1.000000, -2.000000, 1.000000, 1.000000, NA…
## $ sf12pcs.ind <dbl> 57.65250, 57.65250, 57.65250, 57.65250, 48.82667, 48.82667…
## $ sf12pcs.dev <dbl> -3.1425000, 5.3275000, -0.6825000, -1.5025000, 5.1033333, …
## $ sf12mcs.ind <dbl> 52.75250, 52.75250, 52.75250, 52.75250, 42.96333, 42.96333…
## $ sf12mcs.dev <dbl> 2.977500, -16.532500, 7.267500, 6.287500, 3.516667, 2.4266…
## $ waves       <int> 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4…
## $ sati.lag    <dbl> NA, NA, 6, 2, NA, 6, 6, NA, NA, 2, NA, NA, NA, NA, NA, 5, …
## $ sf12pcs.lag <dbl> NA, 54.51, 62.98, 56.97, NA, 53.93, 46.39, NA, NA, 33.18, …
## $ sf12mcs.lag <dbl> NA, 55.73, 36.22, 60.02, NA, 46.48, 45.39, NA, NA, 46.80, …
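If you are not sure which objects the RData file added to your workspace, you can list everything currently loaded (in our case the file also contains the wide-format version of the data):

# list all objects in the current R session, including those added by load()
ls()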
Simple univariate visualizations
We will start with some simple visualizations to get to know ggplot2. For example, we could look at the distribution of the voting variable using a barplot.
ggplot2 is a powerful package for creating visualizations in R. It is based on the grammar of graphics and allows for a high degree of customization. It has two main components: the data and the main dimensions of the graph, which are fed into the ggplot() command, and the geometries, which are added using the geom_ commands. The dimensions of the graph are set using the aes() command, and the different parts of the graph are layered on top of each other using +.
Having set the stage, let's create a barplot of the voting variable to demonstrate these concepts in practice. Here, the variable of interest is "vote6.fct", and the geometry is a barplot (geom_bar()):
ggplot(usl, aes(vote6.fct)) + geom_bar()
We see that ggplot2 automatically calculates the number of cases in each category and adjusts the bar heights accordingly. Sometimes we do not have access to the raw data, or we prefer to compute our own summary statistics for a tailored visualization. The package allows us to do that as well. Let's imagine that instead of the number of cases, we want to see the proportion of cases in each category. We can do this by making a table with the absolute frequencies and then creating a new variable with the proportions:
tab1 <- count(usl, vote6.fct) |>
  mutate(prop = n / sum(n))

tab1
## # A tibble: 5 × 3
##   vote6.fct      n   prop
##   <fct>      <int>  <dbl>
## 1 Very       14638 0.0717
## 2 Fairly     49363 0.242
## 3 Not very   41082 0.201
## 4 Not at all 38036 0.186
## 5 <NA>       60909 0.299
Note that we are using the |> (pipe) operator here, which we introduced in a previous post.
Now we can redo the barplot using this table. For the dimensions, we use the new variables, and for the geometry, we use geom_bar(stat = "identity"). This is because we are supplying the bar heights directly rather than letting ggplot2 calculate them:
ggplot(tab1, aes(vote6.fct, prop)) + geom_bar(stat = "identity")
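As a side note, geom_col() is a shorthand for geom_bar(stat = "identity") and produces the same graph when the bar heights are already in the data:

# geom_col() uses the values in the data as bar heights by default
ggplot(tab1, aes(vote6.fct, prop)) +
  geom_col()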
Let's next look at a continuous variable. We can start by looking at the distribution of the monthly income variable. We can use a histogram (geom_histogram()) for this:
ggplot(usl, aes(x = fimngrs.cap)) + geom_histogram()
We see that income is skewed, with most people on the left of the distribution and relatively few on the right side (with larger incomes).
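Because income is so skewed, it is often easier to inspect on the log scale. The cleaned data from the previous post already includes a logged version of income (logincome), so we can simply plot that instead:

# the log transformation pulls in the long right tail of the income distribution
ggplot(usl, aes(x = logincome)) +
  geom_histogram()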
Univariate visualizations over time
Continuous variables
So far, we have looked at the distribution of variables at one point in time. We can also look at how the distribution of a variable changes over time. For example, we can investigate how the distribution of mental health (sf12mcs) changes over time. One way to do that is with a boxplot, which shows the median, the interquartile range, and the outliers, using the geom_boxplot() command. As we want to look at the distribution over time, we use the wave variable as the x-axis (the first input in aes()) and sf12mcs as the y-axis (the second input in aes()):
ggplot(usl, aes(as.factor(wave), sf12mcs)) + geom_boxplot()
Note that here we treat "wave" as a categorical variable (using as.factor()) so that we get a distinct distribution at each wave.
We see that the median and the spread are pretty stable over time. We can also use a violin plot to see the distribution of the variable over time. This is similar to a boxplot but also shows the density of the distribution. We can use the geom_violin() command for this:
ggplot(usl, aes(as.factor(wave), sf12mcs)) + geom_violin()
Your visualization goals determine whether to use a boxplot or a violin plot. If you aim to highlight the median and distribution spread, use a boxplot. Conversely, use the violin plot for a detailed view of the distribution’s density.
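If you want both at once, one common option is to overlay a narrow boxplot inside each violin. This is just a quick sketch of the idea, not something we rely on later:

# a narrow boxplot inside each violin shows the median and interquartile range
# alongside the density; outliers are hidden to reduce clutter
ggplot(usl, aes(as.factor(wave), sf12mcs)) +
  geom_violin() +
  geom_boxplot(width = 0.1, outlier.shape = NA)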
Categorical variables
We can also look at how the distribution of a categorical variable changes over time. For example, to look at how voting behaviour changes over time, we can use the same geom_bar() command and add the fill argument to colour the bars by the categories of the voting variable. As before, we put the wave variable on the x-axis to show the change over time.
usl |> ggplot(aes(wave, fill = vote6.fct)) + geom_bar()
The most obvious trend is the increasing amount of missing data due to drop-out from the study. We can also look at the proportion of each category over time while ignoring the missing cases. We first filter out all the missing cases and then use the position = "fill" option to calculate the proportions within each wave. This results in:
usl |>
  filter(!is.na(vote6.fct)) |>
  ggplot(aes(wave, fill = vote6.fct)) +
  geom_bar(position = "fill")
Once we remove the missing cases, we will see that the proportions for each category are pretty stable over time. This aggregate stability could hide individual-level changes. We explore this in a separate post using a Sankey plot/river/alluvial graph.
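For a quick feel of that individual-level change before building such a graph, one option is to tabulate transitions between consecutive waves. The sketch below creates a lagged copy of the voting variable (vote6.prev is just a helper name used here, not part of the cleaned data):

# count how people move between voting categories from one wave to the next
usl |>
  arrange(pidp, wave) |>
  group_by(pidp) |>
  mutate(vote6.prev = lag(vote6.fct)) |>
  ungroup() |>
  filter(!is.na(vote6.prev), !is.na(vote6.fct)) |>
  count(vote6.prev, vote6.fct)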
We can further explore the aggregate distribution of voting over time by creating our own statistics. We again use the count() command to calculate the number of cases in each category and then calculate the proportion of cases in each category:
tab4 <- usl |>
  filter(!is.na(vote6.fct)) |>
  group_by(wave) |>
  count(vote6.fct) |>
  mutate(prop = n / sum(n))

tab4
## # A tibble: 16 × 4
## # Groups:   wave [4]
##     wave vote6.fct      n   prop
##    <dbl> <fct>      <int>  <dbl>
##  1     1 Very        4948 0.104
##  2     1 Fairly     16107 0.339
##  3     1 Not very   13968 0.294
##  4     1 Not at all 12520 0.263
##  5     2 Very        3824 0.106
##  6     2 Fairly     12732 0.353
##  7     2 Not very   10161 0.282
##  8     2 Not at all  9334 0.259
##  9     3 Very        3020 0.0971
## 10     3 Fairly     10759 0.346
## 11     3 Not very    8927 0.287
## 12     3 Not at all  8404 0.270
## 13     4 Very        2846 0.100
## 14     4 Fairly      9765 0.344
## 15     4 Not very    8026 0.282
## 16     4 Not at all  7778 0.274
We can use this table to create an alternative barplot with separate bars for each category at each wave, where the bar heights are the proportions we calculated. The position = "dodge" option puts the bars next to each other, while the stat = "identity" option makes the bars take the heights we calculated:
tab4 |>
  ggplot(aes(wave, prop, fill = as.factor(vote6.fct))) +
  geom_bar(position = "dodge", stat = "identity")
We can flip the axes and put the categories on the x-axis to make it easier to see the changes over time. We do this by switching the x and fill arguments in the aes() command:
tab4 |>
  ggplot(aes(x = vote6.fct, prop, fill = as.factor(wave))) +
  geom_bar(position = "dodge", stat = "identity")
Again, we see that the proportions are pretty stable over time. Which version of the graph you should use depends on what you want to show. The first version is better for showing the overall distributions and how they change over time, while the second facilitates comparisons within categories over time.
Simple change plots for 2-3 time points
Sometimes, we want to compare the distribution of a variable at two or three time points. This could be due to data limitations (we only have 2-3 waves) or the fact that we want to highlight the overall change in a certain period. There are two popular ways to do this, both of which combine points and lines.
Imagine we want to see the overall change in the likelihood of voting from the start of the study (2009 in our case) to the end (2014). Let's start by calculating the proportion of people in each voting category in each year. We can use the filter() command to keep only the relevant years and then use the group_by() and count() commands to calculate the number of cases in each category. We can then calculate the proportion of cases in each category:
tab2 <- usl |>
  filter(year %in% c(2009, 2014)) |>
  group_by(year) |>
  count(vote6.fct) |>
  mutate(prop = n / sum(n))

tab2
## # A tibble: 10 × 4
## # Groups:   year [2]
##     year vote6.fct      n   prop
##    <dbl> <fct>      <int>  <dbl>
##  1  2009 Very        2446 0.0998
##  2  2009 Fairly      7846 0.320
##  3  2009 Not very    7048 0.288
##  4  2009 Not at all  6661 0.272
##  5  2009 <NA>         512 0.0209
##  6  2014 Very         122 0.0985
##  7  2014 Fairly       412 0.333
##  8  2014 Not very     302 0.244
##  9  2014 Not at all   292 0.236
## 10  2014 <NA>         110 0.0889
One strategy to visualize such data is to have a line for each category and points for the proportions at each time point, with a line between the points to make the change easier to see. In the plot below, we put the proportions on the x-axis and the categories on the y-axis. We colour the lines and points by year and group them by category. We use the geom_point() command to add the points and the geom_line() command to add the lines:
tab2 |>
  ggplot(aes(prop, vote6.fct, color = as.factor(year), group = vote6.fct)) +
  geom_point() +
  geom_line()
Note that we must make "year" a factor to treat the years as distinct categories. We also need to set the voting variable as the group to get a line for each category.
We see that the proportions of people choosing "Not very" and "Not at all" have decreased somewhat between 2009 and 2014, while the proportion choosing "Very" has remained pretty stable (and the share of missing answers has grown).
We can make the graph a little nicer by excluding missing cases, reversing the order of the labels (using fct_rev()), adding labels (labs()) and changing the theme (theme_bw()):
tab2 |>
  na.omit() |>
  mutate(Year = as.factor(year),
         Vote = fct_rev(vote6.fct)) |>
  ggplot(aes(prop, Vote, color = Year, group = Vote)) +
  geom_point() +
  geom_line() +
  labs(y = "Vote", x = "Proportion") +
  theme_bw()
The second strategy for visualizing such data is to put time on the x-axis and draw a line for each category, with points for the proportions at each time point. This is similar to the previous plot, but with the x-axis and y-axis switched; here, the slope of the lines shows the amount of change in our data. We can do this by switching the prop and year variables in the aes() command:
tab2 |>
  na.omit() |>
  ggplot(aes(as.factor(year), prop, color = vote6.fct, group = vote6.fct)) +
  geom_point() +
  geom_line() +
  labs(y = "Proportion", x = "Year", color = "Voting") +
  theme_bw()
Note that we need to add group = vote6.fct to force the creation of lines across the discrete time points.
A similar approach lets us track how a continuous variable evolves over time. For example, if we want to see how satisfaction changed between 2009 and 2014 depending on the residence of the respondents, we can filter the data, group it by "urban.fct" and "year", and calculate the mean satisfaction. We can then visualize this using the same approach as before:
tab3 <- usl |>
  filter(year %in% c(2009, 2014)) |>
  group_by(year, urban.fct) |>
  summarise(mean_sati = mean(sati, na.rm = T))

tab3
## # A tibble: 4 × 3
## # Groups:   year [2]
##    year urban.fct mean_sati
##   <dbl> <fct>         <dbl>
## 1  2009 Rural          5.39
## 2  2009 Urban          5.24
## 3  2014 Rural          5.20
## 4  2014 Urban          5.02
The first type of plot we covered above would be:
tab3 |>
  ggplot(aes(mean_sati, urban.fct, color = as.factor(year), group = urban.fct)) +
  geom_point() +
  geom_line() +
  labs(y = "Residence", x = "Mean satisfaction", color = "Year") +
  theme_bw()
While the second would be created using:
tab3 |>
  ggplot(aes(as.factor(year), mean_sati, color = urban.fct, group = urban.fct)) +
  geom_point() +
  geom_line() +
  labs(y = "Mean satisfaction", x = "Year", color = "Residence") +
  theme_bw()
Satisfaction deteriorated for both groups at similar rates. The second graph makes it a little easier to see the differences between the groups in terms of the rate of change.
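If you also want readers to see the exact values, one option is to add text labels to the points with geom_text(). A minimal sketch building on the plot above:

# label each point with the rounded mean; vjust nudges the labels above the points
tab3 |>
  ggplot(aes(as.factor(year), mean_sati, color = urban.fct, group = urban.fct)) +
  geom_point() +
  geom_line() +
  geom_text(aes(label = round(mean_sati, 2)), vjust = -1, show.legend = FALSE) +
  labs(y = "Mean satisfaction", x = "Year", color = "Residence") +
  theme_bw()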
Line graphs
The line graph is the most popular type of visualization when we have longitudinal data. It is pretty flexible, as it can capture different types of changes and can be done both at the individual level and the aggregate level.
Individual-level line graphs
A good way to get an intuition about the data, especially when it is large, is to sample just a few cases and see how they change over time. We can do this by randomly sampling a few people from the data. Before that, it is useful to set the seed of the random number generator. This ensures that we consistently get the same results (and your graphs will look the same as mine if you run this on your own computer). We can do this using the set.seed() command:
set.seed(1234)
We can then sample 20 people from the data and create a smaller version with only these people in it. Here, we first select all unique IDs (using unique(usl$pidp)) and then use the sample() command to select 20 of them randomly:
random_people <- unique(usl$pidp) |> sample(20)
We then create a smaller dataset using only these individuals:
susl <- filter(usl, pidp %in% random_people)
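As a quick optional check, we can confirm that the subset contains the 20 people we sampled and see how many rows each of them has:

# number of distinct individuals and rows per person in the subset
n_distinct(susl$pidp)
count(susl, pidp)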
Now we can use this data to look at individual changes over time. We will look at how mental health changed over time using the geom_line() command. The group argument is needed to make sure that the lines are connected within each individual:
ggplot(susl, aes(wave, sf12mcs, group = pidp)) + geom_line()
We see that there is quite a bit of variation, with some people improving while others deteriorate. To make the individual trends easier to read, let's create a separate graph for each person using facet_wrap():
susl |>
  ggplot(aes(wave, sf12mcs, group = pidp)) +
  geom_line() +
  facet_wrap(~ pidp, ncol = 5) +
  theme(strip.text.x = element_blank(),
        strip.background = element_blank())
Note that we ask for the graphs to be shown in five columns using ncol = 5 and remove the heading with the individual ID that appears by default using theme().
Such graphs are also useful for deciding how to model change over time in the variable of interest. For example, if we estimate linear models using either multilevel or latent growth modelling, the change over time for each individual is represented by a straight line. We can see how well this would represent the individual change by adding a linear trend for each individual. The geom_smooth() command with the argument method = "lm" will add a linear trend line to each graph:
susl |>
  ggplot(aes(wave, fimngrs.cap, group = pidp)) +
  geom_line() +
  geom_smooth(method = "lm", se = F, alpha = 0.5) +
  facet_wrap(~ pidp, ncol = 5) +
  theme(strip.text.x = element_blank(),
        strip.background = element_blank())
Note the se = F argument, which removes the confidence intervals around the trend lines.
We see that a linear trend would capture the change in time pretty well for most individuals.
Aggregate line graphs
Another aspect we might want to explore is how the average changes over time. We can do this by adding a summary line to the graph. The stat_summary() command will calculate the mean at each wave and then add a line and a point for it. We use the fun = mean argument to tell ggplot2 to calculate the mean and the geom = "line" and geom = "point" arguments to tell it to add a line and a point, respectively:
susl |>
  ggplot(aes(wave, fimngrs.cap)) +
  geom_line(aes(group = pidp), alpha = 0.3) +
  stat_summary(fun = mean, geom = "line", color = "blue") +
  stat_summary(fun = mean, geom = "point", size = 2, color = "blue")
When looking at aggregate change over time, it is useful to use the full dataset. We can use the geom_smooth() command to estimate a trend, or calculate the average directly from the data. Below is a graph that shows both:
usl |>
  ggplot(aes(wave, sf12mcs)) +
  geom_smooth(method = "lm", se = F) +
  stat_summary(fun = mean, geom = "line", color = "red")
In this case, the linear model deviates slightly from the observed averages, which suggests that a nonlinear model might better represent the change in mental health. We explore this in a separate post using growth curve modelling and the multilevel model for change.
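As a quick visual check of that idea, we could compare the observed means with a quadratic trend instead of a straight line. This is just an exploratory sketch, not a formal model:

# a quadratic fit (second-degree polynomial) can follow a bend in the averages
usl |>
  ggplot(aes(wave, sf12mcs)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = F) +
  stat_summary(fun = mean, geom = "line", color = "red")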
Trends by groups
In addition to looking at the overall change over time, we might want to see how this differs between groups. For example, we might want to see how the change in mental health differs by education level. We can do this by giving each group its own coloured line. The color argument in the aes() command colours the lines by group, and the group argument makes sure that the lines are connected within each group:
ggplot(usl, aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm")
We could make this graph a little nicer by removing the missing cases, adding the labels and changing the theme:
usl |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(y = "Mental health", x = "Wave", color = "Education")
We see that people with a degree have higher mental health than those without. We also see that the difference between the two groups decreases over time.
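Because geom_smooth(method = "lm") imposes a straight line, it can be worth checking the pattern against the raw wave means for each group. A minimal sketch using stat_summary():

# observed mean mental health at each wave, separately by education level
usl |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  theme_bw() +
  labs(y = "Mental health", x = "Wave", color = "Education")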
We can add another dimension to the graph by using faceting (i.e., multiple graphs by group). We can do this by adding the facet_wrap() command to the graph. In this case, we will make a separate graph for each category of the "urban.fct" variable:
ggplot(usl, aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm") +
  facet_wrap(~ urban.fct)
This type of graph is useful for exploring moderation effects. In this case, we could explore how the change over time for people with and without a degree depends on where they live.
To make it a little easier to compare, we can remove the missing cases:
usl |>
  filter(!is.na(urban.fct)) |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm") +
  facet_wrap(~ urban.fct)
Here, we see that there is a convergence in mental health for people with and without a degree. This is happening both in rural and urban areas, although the initial difference is larger in urban areas.
Sometimes, having all the groups in the same graph is easier to interpret. We can add an interaction directly in the aes() by putting : between the two variables:
usl |>
  filter(!is.na(urban.fct)) |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs,
             color = degree.fct:urban.fct,
             group = degree.fct:urban.fct)) +
  geom_smooth(method = "lm")
Now, it’s clearer that people with no degree in urban areas have the lowest levels of mental health, and while the differences go down over time, they are still present in wave 4.
Area graphs
Another way to visualize change over time is with area graphs. These are similar to line graphs, but the area under the line is filled with colour. This can be useful when we have multiple groups and want to see how the change over time is distributed between them. We can use the geom_area() command to create area graphs, with the fill argument colouring the areas by group and the position = "stack" argument stacking the areas on top of each other. To see how the likelihood of voting has changed over time, we can use:
usl |>
  group_by(wave) |>
  count(vote6.fct) |>
  na.omit() |>
  mutate(prop = n / sum(n),
         wave = as.numeric(wave)) |>
  ggplot(aes(wave, n, fill = vote6.fct)) +
  geom_area(position = "stack")
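Note that the plot above stacks the raw counts. Since we already calculated prop in the pipeline, we could equally plot stacked proportions, which removes the effect of the shrinking sample size across waves:

# same graph, but with proportions within each wave instead of counts
usl |>
  group_by(wave) |>
  count(vote6.fct) |>
  na.omit() |>
  mutate(prop = n / sum(n),
         wave = as.numeric(wave)) |>
  ggplot(aes(wave, prop, fill = vote6.fct)) +
  geom_area(position = "stack")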
We can use a similar approach if we want to summarize a continuous variable by a categorical one. For example, we can look at how mental health has changed over time for people with a degree and people without one. To facilitate comparisons, we can overlay the two areas using position = "identity" and make them transparent with alpha = 0.4:
usl |>
  filter(!is.na(degree.fct)) |>
  group_by(year, degree.fct) |>
  summarise(mental_health = mean(sf12mcs, na.rm = T)) |>
  ggplot(aes(year, mental_health, fill = degree.fct)) +
  geom_area(position = "identity", alpha = 0.4)
Overall, the two series are quite similar, indicating only small differences in average mental health by education level.
We can look at another example where we average satisfaction by voting behaviour. Here, we stack the areas and remove missing cases in voting to make it easier to read:
usl |>
  filter(!is.na(vote6.fct)) |>
  group_by(year, vote6.fct) |>
  summarise(satisfaction = mean(sati, na.rm = T)) |>
  ggplot(aes(year, satisfaction, fill = vote6.fct)) +
  geom_area(position = "stack")
Exporting the graphs
Once we have created the graphs we want, we can easily export them using the ggsave() command. By default, it saves the last graph displayed and only needs the name of the file we want to write. For example, to save the last graph we created as a png file, we can run:
ggsave("satisfaction_by_vote.png")
The command is pretty flexible. You can also assign a graph to an object and export it later. Other common options you might want to change are the format, height, width, and resolution (dpi). For example, below we create a graph and export it as a tiff file with a resolution of 300 dpi, a width of 10 inches and a height of 5 inches:
grph1 <- ggplot(usl, aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm")

ggsave("mental_health_degree.tiff", plot = grph1, device = "tiff",
       dpi = 300, width = 10, height = 5)
Conclusions
In this guide, we journeyed through the art of visualizing longitudinal data with R. We started by looking at simple univariate visualizations and then moved to more complex ones, discussing both continuous and categorical data. This should cover the most common types of graphs used with longitudinal data. You can see a summary of the key commands in the figure.
Now that you’re armed with these techniques, why not dive into your data and uncover the stories waiting to be told? Have we overlooked anything? We welcome your questions and insights in the comments below.