Understanding the longitudinal data workflow: a comprehensive guide

Longitudinal data analysis provides a unique window into the evolution of phenomena over time, capturing the dynamism of real-world processes in ways that cross-sectional studies simply cannot. It also helps shed light on causal relationships. From understanding how economic policies affect growth year after year to tracking students’ progress through their education, longitudinal analysis offers novel insights.

However, the richness of longitudinal data also brings its own set of challenges: managing multiple datasets at various levels, handling different types of missing data, and working with many kinds of variables. The data sources themselves are also varied, ranging from panel surveys to log data collected from phones. To analyse such data, it is useful to step back and understand the workflow needed to turn these sources into insights.

In this comprehensive guide, you’ll learn the essential steps to transform raw longitudinal data into insightful analyses that are ready for communication. From importing data to communicating results, we’ll guide you through each stage with practical tips, paving your way to mastering longitudinal data analysis. One way to visualize this entire process is with the following diagram, which includes the main steps and the associated commands in R. We will next discuss each of these stages to explore the key aspects you need to consider in your work.

Illustration showing the workflow of longitudinal data analysis, including data import, transformation, cleaning, exploration, analysis, and communication.

1. Importing data

The first step in the journey to data analysis is to import the data. Longitudinal data is stored in different ways. Traditional social science studies typically store data separately for each wave and each level. For example, Understanding Society, a large UK panel study, has a separate dataset for wave 1 individual-level information and wave 1 household-level information. Other datasets include details about relationships between household members or biological attributes. The following waves follow a similar structure. On the other hand, a study like the Health and Retirement Study, an ageing study in the US, has a separate file for each survey section at each point in time. Given this complexity and diversity, your first step should be understanding how your data is stored and identifying which datasets are essential for your analysis.

Once we have decided on that, we can download the data. Typically, this is stored either as CSV (comma-separated values) or in a more specialised format, like “.sav” (SPSS), “.dta” (Stata) or “.sas7bdat” (SAS). Most statistical software can open the “.csv” format, but it tends to occupy more space and lacks labels and other metadata. Because of that, it can be useful to import the data in one of the other formats, which come with labels. R’s “haven” package can import such data together with labels and other attributes.

When working with large datasets, it can be useful to import only the necessary variables. For example, the import commands from “haven” have a col_select option where we can list the variables of interest. These have to be identified in advance using the study codebook or questionnaire. In this way, we can import just a handful of essential variables instead of hundreds. Similarly, we can select only the cases of interest at this point to make the data size more manageable. Because we will combine data from multiple waves, longitudinal data can get quite big.
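As an illustration, here is a minimal sketch of importing a Stata file with “haven” while keeping only a few variables. The file name (“a_indresp.dta”) and variable names (pidp, a_sex, a_hiqual_dv) are taken from Understanding Society as an example; check the codebook of your own study for the correct names.

```r
library(haven)

# Import only the variables we need from the wave 1 individual file
us1 <- read_dta(
  "a_indresp.dta",
  col_select = c(pidp, a_sex, a_hiqual_dv)
)
```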

Blog posts that can guide you on this topic:

–          Data structures for longitudinal analysis: wide versus long

–          Preparing longitudinal data for analysis in R: a step-by-step guide

2. Transforming data

Diagram illustrating various data transformation techniques applied in longitudinal data analysis.

Once we decide what data and variables are essential, we can combine them. Before that, it is useful to rename variables to make them easier to identify. One format often used is “varname_wavenumber” (e.g., education_2 for education in wave 2) for time-varying variables (i.e., those that change over time). For time-constant predictors (those that do not change over time), we can just use the variable name.
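Here is a brief sketch of this renaming step with “dplyr”, continuing the Understanding Society example above; the original variable names (a_sex, a_hiqual_dv, b_hiqual_dv) are illustrative.

```r
library(dplyr)

us1 <- us1 %>%
  rename(sex = a_sex,                 # time-constant: keep a simple name
         education_1 = a_hiqual_dv)   # time-varying: add the wave number

us2 <- us2 %>%                        # assuming wave 2 was imported as "us2"
  rename(education_2 = b_hiqual_dv)
```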

Effective data merging requires identifying a unique ID for each case. For example, “pidp” is the individual-level unique ID in Understanding Society. It is also time constant, meaning that we can use it to merge cases over time. On the other hand, we need to use the household ID (“hidp”) to merge individual-level and household-level information. However, “hidp” is not time constant: it changes over time, so it is useful only when merging data within the same wave. Understanding such issues is essential for merging the data correctly. Usually, studies have user guides that advise on this process.

Once we have the data of interest with the renamed variables and the ID information, we can merge them. There are different strategies to do this. My recommendation is to merge the data in the wide format. This can be done easily in R with the xxx_join() commands. For example, full_join(us1, us2, by = "pidp") will merge all the cases in waves 1 and 2 of Understanding Society using the “pidp” variable. The xxx_join() commands give the flexibility to decide which cases to include and make it easier to see whether the merge was successful. At this stage, we can also filter cases. For example, if we want to keep only cases that are present in all waves (balanced data), this can be done using the filter() command.
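A minimal sketch of this merge and filter step, using the renamed example data from above:

```r
library(dplyr)

# Merge waves 1 and 2 in the wide format, keeping cases from either wave
us12 <- full_join(us1, us2, by = "pidp")

# Balanced data: keep only cases observed in both waves
us12_balanced <- us12 %>%
  filter(!is.na(education_1) & !is.na(education_2))
```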

Finally, we need to prepare the data for cleaning. My recommendation is to do that in the long format. This tends to be easier as you can clean each variable once for all the points in time. Also, the process would be the same regardless of how many waves of data you have. To do this, we can reshape the data using, for example, the pivot_longer() command.
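For example, a sketch of this reshaping step with “tidyr”, assuming the time-varying variables follow the “varname_wavenumber” pattern used above:

```r
library(dplyr)
library(tidyr)

us_long <- us12 %>%
  pivot_longer(
    cols = matches("_[0-9]+$"),       # all time-varying variables
    names_to = c(".value", "wave"),   # split names into variable and wave number
    names_pattern = "(.*)_([0-9]+)$"
  )
```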

Blog posts that can guide you on this topic:

–          Data structures for longitudinal analysis: wide versus long

–          Preparing longitudinal data for analysis in R: a step-by-step guide

3. Cleaning data

Diagram illustrating various data cleaning techniques applied in longitudinal data analysis.

Once we have just the variables of interest from all the waves in the long format, we can start cleaning the data. This involves describing each variable and recoding the categories needed for our research. It also involves recoding missing value codes. For example, in Understanding Society, negative values tend to be associated with missing information. These would need to be coded as NA in R. We might also want to recode categorical variables as factors and give them appropriate labels, or collapse variables with many categories. Similarly, we might want to transform continuous variables if they are skewed, or lag them.
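A sketch of typical cleaning steps, assuming “education” was imported with “haven” as a labelled variable and that negative values are missing-data codes; adapt the codes and labels to your own study.

```r
library(dplyr)
library(haven)

us_clean <- us_long %>%
  mutate(
    education = replace(education, education < 0, NA),  # negative codes become NA
    education = as_factor(education)                     # use the value labels as factor levels
  )
```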

In this process, we must always remember what we are trying to achieve and how best to code variables for it. We should also understand the potential limitations of our data. For example, are there variables with lots of missing information? If so, we might need to find alternative measures, impute the missing information or drop those variables from our analysis.
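A quick way to check this is to look at the proportion of missing values in each variable before deciding how to deal with it:

```r
# Proportion of missing values per variable
colMeans(is.na(us_clean))
```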

Blog post that can guide you on this topic:

–          Cleaning longitudinal data in R: a step-by-step guide

4. Exploring data

Set of graphs displaying initial exploratory data analysis of a longitudinal dataset, including trends over time and distributions.

The next step is also known as Exploratory Data Analysis. Once we have the data in the right format and cleaned, we can start to understand the patterns in the data better. This can be done with tables, summary statistics and visualizations.

I recommend starting with simple analyses and progressively moving to more complex ones: begin with univariate statistics, then explore changes over time and differences between groups. In this process, we can develop hypotheses about relationships and find anomalies in the data.

Again, try to keep the research questions and models of interest in mind to guide the exploration of the data. For example, if we want to look at change over time, we can use line graphs. If we need to decide how to model change over time (linear or non-linear), graphs and tables can inform that decision.
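As an example, here is a sketch of a simple line graph of a continuous outcome by wave and group, using “dplyr” and “ggplot2”; “income” is a placeholder variable name.

```r
library(dplyr)
library(ggplot2)

us_clean %>%
  group_by(wave, sex) %>%
  summarise(mean_income = mean(income, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = as.numeric(wave), y = mean_income, colour = factor(sex))) +
  geom_line() +
  geom_point() +
  labs(x = "Wave", y = "Mean income", colour = "Sex")
```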

Blog posts that can guide you on this topic:

–          Exploring longitudinal data in R: tables and summaries

–          Complete guide to visualizing longitudinal data in R

5. Analysing data

Diagram showing main types of statistical models for longitudinal data analysis

There are many types of models and ways to set them up. Before running the models, focus on the research questions and hypotheses to determine the appropriate models. Make a plan for the sequence of models needed to answer the research questions. In general, starting from a simple model and building it up towards more complex ones is recommended.

Let’s look at an example. If our main research question concerns change over time in children’s cognitive ability and whether that change is affected by parents’ education, then suitable models would be the multilevel model for change or the latent growth model. Let’s imagine we select the former; we can then plan a sequence of models. For example, we might start with an empty model to estimate how much of the variation is between versus within children (summarised by the intraclass correlation). Then, we might run a sequence of models to decide whether the change over time is linear or non-linear. Finally, we could start adding variables to the model. This could be done in blocks: first the variables of interest (like parental education), then the control variables (like gender and type of school), and finally any interactions we are interested in (like the interaction between time and parental education).
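To make this concrete, here is a minimal sketch of such a model-building sequence using the “lme4” package. The dataset “children” and the variables id, wave, cognitive, parent_edu and sex are hypothetical placeholders.

```r
library(lme4)

# 1. Empty model: how much variation is between vs. within children (ICC)
m0 <- lmer(cognitive ~ 1 + (1 | id), data = children)

# 2. Add time, allowing each child their own starting point and growth rate
m1 <- lmer(cognitive ~ wave + (wave | id), data = children)

# 3. Add the predictor of interest, controls and the interaction with time
m2 <- lmer(cognitive ~ wave * parent_edu + sex + (wave | id), data = children)

anova(m0, m1, m2)  # compare the nested models
```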

Doing the analysis might also lead to new questions or the realization that we need other variables. This is where iteration comes in, and we might need to go back to the data-cleaning step to recode new variables. For example, we might need to recode parental education to better capture its impact on the child’s cognitive ability before redoing the analysis.

Blog posts that can guide you on this topic:

–          Estimating and visualizing multilevel models for change in R

–          Understanding causal direction using the cross-lagged model

–          Estimating and visualizing Latent Growth Models with R

6. Communicating results

Infographic summarizing key findings from longitudinal data analysis and tips for effectively communicating results

At this stage, it is useful to consider the audience and our aims. These will shape both the communication methods we use, such as text, tables and graphs, and the format of the outputs: reports, papers, presentations, etc.

For example, if we want to share intermediate results with colleagues, we can use a dynamic document with outputs from the analysis. In this case, things don’t have to be polished and can be more technical. On the other hand, a final report for policymakers would need to be more polished and less technical. We may need to focus more on creating visualizations and text that can easily convey the findings to a broader audience.

R offers a powerful framework for creating dynamic documents. This framework combines text, R code, and R outputs to create presentations, papers, books, and reports. I recommend learning more about R Markdown and its newer sibling, Quarto, as they can save considerable time in the long run.
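As a flavour of what this looks like, here is a minimal Quarto document; the title and chunk contents are placeholders. Rendering it re-runs the R code, so the outputs always match the current data.

````
---
title: "Changes in cognitive ability over time"
format: html
---

Results are based on the multilevel model for change described above.

```{r}
#| echo: false
# R code for tables and figures goes here; it is re-run each time
# the document is rendered.
```
````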

Blog posts that can guide you on this topic:

–          Exploring longitudinal data in R: tables and summaries

–          Complete guide to visualizing longitudinal data in R

Other considerations

Storage and version control are some of the other practical things to remember in this process.

In general, it’s recommended that you have backups of your work. This can be achieved by working in folders synced with services like Dropbox, OneDrive, or Google Drive. This way, we know that in case of computer failure or loss there is a backup of our work. One thing to keep in mind is our data agreement, which might restrict the possibility of storing the data on such platforms. A solution is to keep only the code and outputs on these platforms while storing the data locally.

The other thing to consider is version control. Version control software keeps a history of our work and backs it up. The most popular tool for this is Git, typically used together with a hosting platform such as GitHub. There is increasing pressure to use such tools both from the scientific community (to encourage reproducible research) and from data science practice. My recommendation is to learn to use them in your own work. For R users, a good starting point is happygitwithr.com.

Conclusions

Longitudinal data is complex, and working with it is time-consuming. Here, we discussed the main stages of going from raw data to communicating results when using such data. While this cannot cover all the complex decisions to be made and all the commands to be used, we hope it provides a useful overview of the process. You can use it as a map to guide you in this process and to show the way forward.

Illustration showing the workflow of longitudinal data analysis, including data import, transformation, cleaning, exploration, analysis, and communication.
