Longitudinal data can be complex: it includes multiple cases observed at different points in time. This complexity grows with missing-data patterns, nested structures (such as individuals within households), and different variable types (such as time-constant versus time-varying variables). Visualizing such data can help us understand it better. This post covers the main types of graphs used to explore longitudinal data in R. We will apply these techniques to the synthetic data from our previous post, which replicates a large social science panel study. Visualization complements the exploration using tables and summary statistics that we covered in a previous post.
Preparing for visualization
You can follow this guide by running the code in R on your own computer. Our examples will use synthetic (simulated) data modelled after Understanding Society, a comprehensive panel study from the UK. The real data can be accessed for free from the UK Data Archive.
In a previous blog, we cleaned the data and stored it in both the long and the wide formats. We will use that cleaned data here. If you want to follow along, you can download the data from here and all the code from here.
Before we start, make sure you have the tidyverse and haven packages installed and loaded by running the following code:
install.packages("tidyverse")

library(tidyverse)
library(haven)
Although the haven package is not required for importing the data here, it simplifies working with and recoding data that was originally imported using it. We will use the ggplot2 package from the tidyverse for visualization.
Next, we will load the data prepared in the previous post. This contains both the long and the wide format data in an “RData” file. We will load the long-format data and have a quick look at it:
load("./data/us_clean_syn.RData") glimpse(usl)
## Rows: 204,028
## Columns: 32
## $ pidp        <dbl> 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10, 11, 11…
## $ wave        <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4…
## $ age         <dbl+lbl> 28, 28, 28, 28, 80, 80, 80, 80, 60, 60, 60, 60, 42, 42…
## $ hiqual      <dbl+lbl> 1, 1, 1, 1, 9, 9, 9, 9, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, …
## $ single      <dbl+lbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, NA, NA, NA, 0, NA…
## $ fimngrs     <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ sclfsato    <dbl+lbl> -9, 6, 2, 1, 6, 6, -8, 3, 2, NA, NA, NA, -9, NA…
## $ sf12pcs     <dbl> 54.51, 62.98, 56.97, 56.15, 53.93, 46.39, NA, 46.16, 33.18…
## $ sf12mcs     <dbl> 55.73, 36.22, 60.02, 59.04, 46.48, 45.39, NA, 37.02, 46.80…
## $ istrtdaty   <dbl+lbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, …
## $ sf1         <dbl+lbl> 2, 2, 1, 3, 4, 3, 3, 3, 5, NA, NA, NA, 2, NA…
## $ present     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, …
## $ gndr.fct    <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male…
## $ single.fct  <fct> Single, Single, Single, In relationship, In relationship, …
## $ urban.fct   <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban, Urban, Ur…
## $ degree.fct  <fct> Degree, Degree, Degree, Degree, No degree, No degree, No d…
## $ vote6.fct   <fct> Not very, Not at all, Not at all, Not at all, Not at all, …
## $ sati        <dbl> NA, 6, 2, 1, 6, 6, NA, 3, 2, NA, NA, NA, NA, NA, 5, NA, NA…
## $ sati.fct    <fct> NA, Mostly satisfied, Mostly dissatisfied, Completely diss…
## $ fimngrs.cap <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ logincome   <dbl> 8.099818, 8.297170, 8.196070, 6.757514, 6.809039, 6.577861…
## $ year        <dbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, NA, …
## $ sati.ind    <dbl> 3.000000, 3.000000, 3.000000, 3.000000, 5.000000, 5.000000…
## $ sati.dev    <dbl> NA, 3.000000, -1.000000, -2.000000, 1.000000, 1.000000, NA…
## $ sf12pcs.ind <dbl> 57.65250, 57.65250, 57.65250, 57.65250, 48.82667, 48.82667…
## $ sf12pcs.dev <dbl> -3.1425000, 5.3275000, -0.6825000, -1.5025000, 5.1033333, …
## $ sf12mcs.ind <dbl> 52.75250, 52.75250, 52.75250, 52.75250, 42.96333, 42.96333…
## $ sf12mcs.dev <dbl> 2.977500, -16.532500, 7.267500, 6.287500, 3.516667, 2.4266…
## $ waves       <int> 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4…
## $ sati.lag    <dbl> NA, NA, 6, 2, NA, 6, 6, NA, NA, 2, NA, NA, NA, NA, NA, 5, …
## $ sf12pcs.lag <dbl> NA, 54.51, 62.98, 56.97, NA, 53.93, 46.39, NA, NA, 33.18, …
## $ sf12mcs.lag <dbl> NA, 55.73, 36.22, 60.02, NA, 46.48, 45.39, NA, NA, 46.80, …
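If you are not sure which objects the RData file added to your workspace, you can list everything currently loaded (in our case the file also contains the wide-format version of the data):

# list all objects in the current R session, including those added by load()
ls()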
Simple univariate visualizations
We will start with some simple visualizations to get to know ggplot2. For example, we could look at the distribution of the voting variable using a barplot.
ggplot2 is a powerful package for creating visualizations in R. It is based on the grammar of graphics and allows for a high degree of customization. It has two main components: the data and the main dimensions of the graph, which are fed into the ggplot() command, and the geometries, which are added using the geom_ commands. The dimensions of the graph are set using the aes() command, and the different parts of the graph are layered on top of each other using +.
Having set the stage, let's create a barplot of the voting variable to demonstrate these concepts in practice. Here, the variable of interest is "vote6.fct", and the geometry is a barplot (geom_bar()):
ggplot(usl, aes(vote6.fct)) + geom_bar()
We see that ggplot2 automatically calculates the number of cases in each category and adjusts the bar heights accordingly. Sometimes we do not have access to the raw data, or we prefer to compute our own summary statistics for a tailored visualization. The package allows us to do that as well. Let's imagine that instead of the number of cases, we want to see the proportion of cases in each category. We can do this by making a table with the absolute frequencies and then creating a new variable with the proportions:
tab1 <- count(usl, vote6.fct) |>
  mutate(prop = n / sum(n))

tab1
## # A tibble: 5 × 3
##   vote6.fct      n   prop
##   <fct>      <int>  <dbl>
## 1 Very       14638 0.0717
## 2 Fairly     49363 0.242
## 3 Not very   41082 0.201
## 4 Not at all 38036 0.186
## 5 <NA>       60909 0.299
Note that we are using the |> (pipe) operator here, which we introduced in a previous post.
Now we can redo the barplot using this table. For the dimensions, we use the new variables, and for the geometry, we use geom_bar(stat = "identity"). This is because we are supplying the bar heights directly rather than letting ggplot2 calculate them:
ggplot(tab1, aes(vote6.fct, prop)) + geom_bar(stat = "identity")
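As a side note, geom_col() is a shorthand for geom_bar(stat = "identity") and produces the same graph when the bar heights are already in the data:

# geom_col() uses the values in the data as bar heights by default
ggplot(tab1, aes(vote6.fct, prop)) +
  geom_col()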
Let's next look at a continuous variable. We can start by looking at the distribution of the monthly income variable. We can use a histogram (geom_histogram()) for this:
ggplot(usl, aes(x = fimngrs.cap)) + geom_histogram()
We see that income is skewed, with most people on the left of the distribution and relatively few on the right side (with larger incomes).
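Because income is so skewed, it is often easier to inspect on the log scale. The cleaned data from the previous post already includes a logged version of income (logincome), so we can simply plot that instead:

# the log transformation pulls in the long right tail of the income distribution
ggplot(usl, aes(x = logincome)) +
  geom_histogram()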
Univariate visualizations over time
Continuous variables
So far, we have looked at the distribution of variables at one point in time. We can also look at how the distribution of a variable changes over time. For example, we can investigate how the distribution of mental health (sf12mcs) changes over time. One way to do that is with a boxplot, which shows the median, the interquartile range, and the outliers, using the geom_boxplot() command. As we want to look at the distribution over time, we use the wave variable as the x-axis (the first input in aes()) and sf12mcs as the y-axis (the second input in aes()):
ggplot(usl, aes(as.factor(wave), sf12mcs)) + geom_boxplot()
Note that here we treat "wave" as a categorical variable (using as.factor()) so that we get a distinct distribution at each wave.
We see that the median and the spread are pretty stable over time. We can also use a violin plot to see the distribution of the variable over time. This is similar to a boxplot but also shows the density of the distribution. We can use the geom_violin() command for this:
ggplot(usl, aes(as.factor(wave), sf12mcs)) + geom_violin()
Your visualization goals determine whether to use a boxplot or a violin plot. If you aim to highlight the median and distribution spread, use a boxplot. Conversely, use the violin plot for a detailed view of the distribution’s density.
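If you want both at once, one common option is to overlay a narrow boxplot inside each violin. This is just a quick sketch of the idea, not something we rely on later:

# a narrow boxplot inside each violin shows the median and interquartile range
# alongside the density; outliers are hidden to reduce clutter
ggplot(usl, aes(as.factor(wave), sf12mcs)) +
  geom_violin() +
  geom_boxplot(width = 0.1, outlier.shape = NA)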
Categorical variables
We can also look at how the distribution of a categorical variable changes over time. For example, to look at how voting behaviour changes over time, we can use the same geom_bar() command and add the fill argument to colour the bars by the categories of the voting variable. As before, we put the wave variable on the x-axis to show the change over time.
usl |> ggplot(aes(wave, fill = vote6.fct)) + geom_bar()
The most obvious trend is the increasing amount of missing data due to drop-out from the study. We can also look at the proportion of each category over time while ignoring the missing cases. We first filter out all the missing cases and then use the position = "fill" option to calculate the proportions within each wave. This results in:
usl |>
  filter(!is.na(vote6.fct)) |>
  ggplot(aes(wave, fill = vote6.fct)) +
  geom_bar(position = "fill")
Once we remove the missing cases, we will see that the proportions for each category are pretty stable over time. This aggregate stability could hide individual-level changes. We explore this in a separate post using a Sankey plot/river/alluvial graph.
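For a quick feel of that individual-level change before building such a graph, one option is to tabulate transitions between consecutive waves. The sketch below creates a lagged copy of the voting variable (vote6.prev is just a helper name used here, not part of the cleaned data):

# count how people move between voting categories from one wave to the next
usl |>
  arrange(pidp, wave) |>
  group_by(pidp) |>
  mutate(vote6.prev = lag(vote6.fct)) |>
  ungroup() |>
  filter(!is.na(vote6.prev), !is.na(vote6.fct)) |>
  count(vote6.prev, vote6.fct)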
We can further explore the aggregate distribution of voting over time by creating our own statistics. We again use the count() command to calculate the number of cases in each category and then calculate the proportion of cases in each category:
tab4 <- usl |>
  filter(!is.na(vote6.fct)) |>
  group_by(wave) |>
  count(vote6.fct) |>
  mutate(prop = n / sum(n))

tab4
## # A tibble: 16 × 4
## # Groups:   wave [4]
##     wave vote6.fct      n   prop
##    <dbl> <fct>      <int>  <dbl>
##  1     1 Very        4948 0.104
##  2     1 Fairly     16107 0.339
##  3     1 Not very   13968 0.294
##  4     1 Not at all 12520 0.263
##  5     2 Very        3824 0.106
##  6     2 Fairly     12732 0.353
##  7     2 Not very   10161 0.282
##  8     2 Not at all  9334 0.259
##  9     3 Very        3020 0.0971
## 10     3 Fairly     10759 0.346
## 11     3 Not very    8927 0.287
## 12     3 Not at all  8404 0.270
## 13     4 Very        2846 0.100
## 14     4 Fairly      9765 0.344
## 15     4 Not very    8026 0.282
## 16     4 Not at all  7778 0.274
We can use this table to create an alternative barplot with separate bars for each category at each wave, where the bar heights are the proportions we calculated. The position = "dodge" option puts the bars next to each other, while the stat = "identity" option makes the bars take the heights we calculated:
tab4 |>
  ggplot(aes(wave, prop, fill = as.factor(vote6.fct))) +
  geom_bar(position = "dodge", stat = "identity")
We can flip the axes and put the categories on the x-axis to make it easier to see the changes over time. We do this by switching the x and fill arguments in the aes() command:
tab4 |>
  ggplot(aes(x = vote6.fct, prop, fill = as.factor(wave))) +
  geom_bar(position = "dodge", stat = "identity")
Again, we see that the proportions are pretty stable over time. Which version of the graph you should use depends on what you want to show. The first version is better for showing the overall distributions and how they change over time, while the second facilitates comparisons within categories over time.
Simple change plots for 2-3 time points
Sometimes, we want to compare the distribution of a variable at two or three time points. This could be due to data limitations (we only have 2-3 waves) or the fact that we want to highlight the overall change in a certain period. There are two popular ways to do this, both of which combine points and lines.
Imagine we want to see the overall change in the likelihood of voting from the start of the study (2009 in our case) to the end (2014). Let's start by calculating the proportion of people in each voting category in each year. We can use the filter() command to keep only the relevant years and then use the group_by() and count() commands to calculate the number of cases in each category. We can then calculate the proportion of cases in each category:
tab2 <- usl |>
  filter(year %in% c(2009, 2014)) |>
  group_by(year) |>
  count(vote6.fct) |>
  mutate(prop = n / sum(n))

tab2
## # A tibble: 10 × 4
## # Groups:   year [2]
##     year vote6.fct      n   prop
##    <dbl> <fct>      <int>  <dbl>
##  1  2009 Very        2446 0.0998
##  2  2009 Fairly      7846 0.320
##  3  2009 Not very    7048 0.288
##  4  2009 Not at all  6661 0.272
##  5  2009 <NA>         512 0.0209
##  6  2014 Very         122 0.0985
##  7  2014 Fairly       412 0.333
##  8  2014 Not very     302 0.244
##  9  2014 Not at all   292 0.236
## 10  2014 <NA>         110 0.0889
One strategy to visualize such data is to have a line for each category and points for the proportions at each time point, with a line between the points to make the change easier to see. In the plot below, we put the proportions on the x-axis and the categories on the y-axis. We colour the lines and points by year and group them by category. We use the geom_point() command to add the points and the geom_line() command to add the lines:
tab2 |>
  ggplot(aes(prop, vote6.fct, color = as.factor(year), group = vote6.fct)) +
  geom_point() +
  geom_line()
Note that we must make "year" a factor to treat the years as distinct categories. We also need to set the voting variable as the group to get a line for each category.
We see that the proportions of people choosing "Not very" and "Not at all" have decreased somewhat between 2009 and 2014, while the proportion choosing "Very" has remained pretty stable (and the share of missing answers has grown).
We can make the graph a little nicer by excluding missing cases, reversing the order of the labels (using fct_rev()), adding labels (labs()) and changing the theme (theme_bw()):
tab2 |>
  na.omit() |>
  mutate(Year = as.factor(year),
         Vote = fct_rev(vote6.fct)) |>
  ggplot(aes(prop, Vote, color = Year, group = Vote)) +
  geom_point() +
  geom_line() +
  labs(y = "Vote", x = "Proportion") +
  theme_bw()
The second strategy for visualizing such data is to put time on the x-axis and draw a line for each category, with points for the proportions at each time point. This is similar to the previous plot, but with the x-axis and y-axis switched; here, the slope of the lines shows the amount of change in our data. We can do this by switching the prop and year variables in the aes() command:
tab2 |>
  na.omit() |>
  ggplot(aes(as.factor(year), prop, color = vote6.fct, group = vote6.fct)) +
  geom_point() +
  geom_line() +
  labs(y = "Proportion", x = "Year", color = "Voting") +
  theme_bw()
Note that we need to add group = vote6.fct to force the creation of lines across the discrete time points.
A similar approach lets us track how a continuous variable evolves over time. For example, if we want to see how satisfaction changed between 2009 and 2014 depending on the residence of the respondents, we can filter the data, group it by "urban.fct" and "year", and calculate the mean satisfaction. We can then visualize this using the same approach as before:
tab3 <- usl |>
  filter(year %in% c(2009, 2014)) |>
  group_by(year, urban.fct) |>
  summarise(mean_sati = mean(sati, na.rm = T))

tab3
## # A tibble: 4 × 3
## # Groups:   year [2]
##    year urban.fct mean_sati
##   <dbl> <fct>         <dbl>
## 1  2009 Rural          5.39
## 2  2009 Urban          5.24
## 3  2014 Rural          5.20
## 4  2014 Urban          5.02
The first type of plot we covered above would be:
tab3 |>
  ggplot(aes(mean_sati, urban.fct, color = as.factor(year), group = urban.fct)) +
  geom_point() +
  geom_line() +
  labs(y = "Residence", x = "Mean satisfaction", color = "Year") +
  theme_bw()
While the second would be created using:
tab3 |>
  ggplot(aes(as.factor(year), mean_sati, color = urban.fct, group = urban.fct)) +
  geom_point() +
  geom_line() +
  labs(y = "Mean satisfaction", x = "Year", color = "Residence") +
  theme_bw()
Satisfaction deteriorated for both groups at similar rates. The second graph makes it a little easier to see the differences between the groups in terms of the rate of change.
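If you also want readers to see the exact values, one option is to add text labels to the points with geom_text(). A minimal sketch building on the plot above:

# label each point with the rounded mean; vjust nudges the labels above the points
tab3 |>
  ggplot(aes(as.factor(year), mean_sati, color = urban.fct, group = urban.fct)) +
  geom_point() +
  geom_line() +
  geom_text(aes(label = round(mean_sati, 2)), vjust = -1, show.legend = FALSE) +
  labs(y = "Mean satisfaction", x = "Year", color = "Residence") +
  theme_bw()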
Line graphs
The line graph is the most popular type of visualization when we have longitudinal data. It is pretty flexible, as it can capture different types of changes and can be done both at the individual level and the aggregate level.
Individual-level line graphs
A good way to get an intuition about the data, especially when it is large, is to sample just a few cases and see how they change over time. We can do this by randomly sampling a few people from the data. Before that, it is useful to set the seed of the random number generator. This ensures that we consistently get the same results (and your graphs will look the same as mine if you run this on your own computer). We can do this using the set.seed() command:
set.seed(1234)
We can then sample 20 people from the data and create a smaller version with only these people in it. Here, we first select all unique IDs (using unique(usl$pidp)) and then use the sample() command to select 20 of them randomly:
random_people <- unique(usl$pidp) |> sample(20)
We then create a smaller dataset using only these individuals:
susl <- filter(usl, pidp %in% random_people)
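As a quick optional check, we can confirm that the subset contains the 20 people we sampled and see how many rows each of them has:

# number of distinct individuals and rows per person in the subset
n_distinct(susl$pidp)
count(susl, pidp)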
Now we can use this data to look at individual changes over time. We will look at how mental health changed over time using the geom_line() command. The group argument is needed to make sure that the lines are connected within each individual:
ggplot(susl, aes(wave, sf12mcs, group = pidp)) + geom_line()
We see that there is quite a bit of variation, with some people improving while others deteriorate. To make the individual trends easier to read, let's create a separate graph for each person using facet_wrap():
susl |>
  ggplot(aes(wave, sf12mcs, group = pidp)) +
  geom_line() +
  facet_wrap(~ pidp, ncol = 5) +
  theme(strip.text.x = element_blank(),
        strip.background = element_blank())
Note that we ask for the graphs to be shown in five columns using ncol = 5 and remove the heading with the individual ID that appears by default using theme().
Such graphs are also useful for deciding how to model change over time in the variable of interest. For example, if we estimate linear models using either multilevel or latent growth modelling, the change over time for each individual is represented by a straight line. We can see how well this would represent the individual change by adding a linear trend for each individual. The geom_smooth() command with the argument method = "lm" will add a linear trend line to each graph:
susl |>
  ggplot(aes(wave, fimngrs.cap, group = pidp)) +
  geom_line() +
  geom_smooth(method = "lm", se = F, alpha = 0.5) +
  facet_wrap(~ pidp, ncol = 5) +
  theme(strip.text.x = element_blank(),
        strip.background = element_blank())
Note the se = F argument, which removes the confidence intervals around the trend lines.
We see that a linear trend would capture the change in time pretty well for most individuals.
Aggregate line graphs
Another aspect we might want to explore is how the average changes over time. We can do this by adding a summary line to the graph. The stat_summary() command will calculate the mean at each wave and then add a line and a point for it. We use the fun = mean argument to tell ggplot2 to calculate the mean and the geom = "line" and geom = "point" arguments to tell it to add a line and a point, respectively:
susl |>
  ggplot(aes(wave, fimngrs.cap)) +
  geom_line(aes(group = pidp), alpha = 0.3) +
  stat_summary(fun = mean, geom = "line", color = "blue") +
  stat_summary(fun = mean, geom = "point", size = 2, color = "blue")
When looking at aggregate change over time, it is useful to use the full dataset. We can use the geom_smooth() command to estimate a trend, or calculate the average directly from the data. Below is a graph that shows both:
usl |>
  ggplot(aes(wave, sf12mcs)) +
  geom_smooth(method = "lm", se = F) +
  stat_summary(fun = mean, geom = "line", color = "red")
In this case, the linear model deviates slightly from the observed averages, which suggests that a nonlinear model might better represent the change in mental health. We explore this in a separate post using growth curve modelling and the multilevel model for change.
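As a quick visual check of that idea, we could compare the observed means with a quadratic trend instead of a straight line. This is just an exploratory sketch, not a formal model:

# a quadratic fit (second-degree polynomial) can follow a bend in the averages
usl |>
  ggplot(aes(wave, sf12mcs)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = F) +
  stat_summary(fun = mean, geom = "line", color = "red")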
Trends by groups
In addition to looking at the overall change over time, we might want to see how this differs between groups. For example, we might want to see how the change in mental health differs by education level. We can do this by giving each group its own coloured line. The color argument in the aes() command colours the lines by group, and the group argument makes sure that the lines are connected within each group:
ggplot(usl, aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm")
We could make this graph a little nicer by removing the missing cases, adding the labels and changing the theme:
usl |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(y = "Mental health", x = "Wave", color = "Education")
We see that people with a degree have higher mental health than those without. We also see that the difference between the two groups decreases over time.
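Because geom_smooth(method = "lm") imposes a straight line, it can be worth checking the pattern against the raw wave means for each group. A minimal sketch using stat_summary():

# observed mean mental health at each wave, separately by education level
usl |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  theme_bw() +
  labs(y = "Mental health", x = "Wave", color = "Education")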
We can add another dimension to the graph by using faceting (i.e., multiple graphs by group). We can do this by adding the facet_wrap() command to the graph. In this case, we will make a separate graph for each category of the "urban.fct" variable:
ggplot(usl, aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm") +
  facet_wrap(~ urban.fct)
This type of graph is useful for exploring moderation effects. In this case, we could explore how the change over time for people with and without a degree depends on where they live.
To make it a little easier to compare, we can remove the missing cases:
usl |>
  filter(!is.na(urban.fct)) |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm") +
  facet_wrap(~ urban.fct)
Here, we see that there is a convergence in mental health for people with and without a degree. This is happening both in rural and urban areas, although the initial difference is larger in urban areas.
Sometimes, having all the groups in the same graph is easier to interpret. We can add an interaction directly in the aes() by putting : between the two variables:
usl |>
  filter(!is.na(urban.fct)) |>
  filter(!is.na(degree.fct)) |>
  ggplot(aes(wave, sf12mcs,
             color = degree.fct:urban.fct,
             group = degree.fct:urban.fct)) +
  geom_smooth(method = "lm")
Now, it’s clearer that people with no degree in urban areas have the lowest levels of mental health, and while the differences go down over time, they are still present in wave 4.
Area graphs
Another way to visualize change over time is with area graphs. These are similar to line graphs, but the area under the line is filled with colour. This can be useful when we have multiple groups and want to see how the change over time is distributed between them. We can use the geom_area() command to create area graphs, with the fill argument colouring the areas by group and the position = "stack" argument stacking the areas on top of each other. To see how the likelihood of voting has changed over time, we can use:
usl |>
  group_by(wave) |>
  count(vote6.fct) |>
  na.omit() |>
  mutate(prop = n / sum(n),
         wave = as.numeric(wave)) |>
  ggplot(aes(wave, n, fill = vote6.fct)) +
  geom_area(position = "stack")
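Note that the plot above stacks the raw counts. Since we already calculated prop in the pipeline, we could equally plot stacked proportions, which removes the effect of the shrinking sample size across waves:

# same graph, but with proportions within each wave instead of counts
usl |>
  group_by(wave) |>
  count(vote6.fct) |>
  na.omit() |>
  mutate(prop = n / sum(n),
         wave = as.numeric(wave)) |>
  ggplot(aes(wave, prop, fill = vote6.fct)) +
  geom_area(position = "stack")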
We can use a similar approach if we want to summarize a continuous variable by a categorical one. For example, we can look at how mental health has changed over time for people with a degree and people without one. To facilitate comparisons, we can overlay the two areas using position = "identity" and make them transparent with alpha = 0.4:
usl |>
  filter(!is.na(degree.fct)) |>
  group_by(year, degree.fct) |>
  summarise(mental_health = mean(sf12mcs, na.rm = T)) |>
  ggplot(aes(year, mental_health, fill = degree.fct)) +
  geom_area(position = "identity", alpha = 0.4)
Overall, the two series are quite similar, indicating only small differences in average mental health by education level.
We can look at another example where we average satisfaction by voting behaviour. Here, we stack the areas and remove missing cases in voting to make it easier to read:
usl |>
  filter(!is.na(vote6.fct)) |>
  group_by(year, vote6.fct) |>
  summarise(satisfaction = mean(sati, na.rm = T)) |>
  ggplot(aes(year, satisfaction, fill = vote6.fct)) +
  geom_area(position = "stack")
Exporting the graphs
Once we have created the graphs we want, we can easily export them using the ggsave() command. By default, it saves the last graph displayed and only needs the name of the file we want to write. For example, to save the last graph we created as a png file, we can run:
ggsave("satisfaction_by_vote.png")
The command is pretty flexible. You can also assign a graph to an object and export it later. Other common options you might want to change are the format, height, width, and resolution (dpi). For example, below we create a graph and export it as a tiff file with a resolution of 300 dpi, a width of 10 inches and a height of 5 inches:
grph1 <- ggplot(usl, aes(wave, sf12mcs, color = degree.fct, group = degree.fct)) +
  geom_smooth(method = "lm")

ggsave("mental_health_degree.tiff", plot = grph1, device = "tiff",
       dpi = 300, width = 10, height = 5)
Conclusions
In this guide, we journeyed through the art of visualizing longitudinal data with R. We started by looking at simple univariate visualizations and then moved to more complex ones, discussing both continuous and categorical data. This should cover the most common types of graphs used with longitudinal data. You can see a summary of the key commands in the figure.
Now that you’re armed with these techniques, why not dive into your data and uncover the stories waiting to be told? Have we overlooked anything? We welcome your questions and insights in the comments below.