Longitudinal data is exciting because it can be used to explore individual-level change over time and to better understand causal relationships. Nevertheless, preparing data for longitudinal analysis is often complex and time-consuming. The process from raw data to analysis typically follows several steps: importing data, selecting and renaming variables, merging, selecting cases, and reshaping the data.
In this guide, we simplify the task of preparing longitudinal data for analysis in R, breaking it down into smaller, manageable chunks to make the complex process more understandable.
0. Setting up what we need
You can follow this guide on your own device by running the code in R. We will be using synthetic data (i.e., simulated) that is based on Understanding Society, a large panel study in the UK. The data are synthetic and do not contain any sensitive information. They can be downloaded using the following links: wave 1 data, wave 2 data, wave 3 data, and wave 4 data. You can access the real data for free from the UK Data Archive.
I recommend setting your working directory to the folder where you saved the data. You can do this by running the following code:
setwd("path_to_your_folder")
If you encounter issues, and assuming you’re using RStudio, navigate through the menu: “Session”->“Set Working Directory”->“Choose Directory” to select your data folder.
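A quick way to check that R is looking in the right place is to print the working directory and list the files R can see there; this uses only base R functions:

getwd()       # the folder R currently reads from and writes to
list.files()  # the downloaded files (or the folder containing them) should show up here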
Before we start, make sure you have the following packages installed: “tidyverse” (for data cleaning) and “haven” (for importing data). You can install them by running the following code:
install.packages("tidyverse")
install.packages("haven")
Once the packages are installed, we can load them:
library(tidyverse)
library(haven)
1. Importing the data
We first need to import the data. For that, we will use the read_dta() function from the haven package, as the data are stored in the “.dta” format (created using Stata). We will import the data separately for each wave of the study. Here is the code for importing four waves of data:
us1_full <- read_dta("./data/a_indresp_syn.dta")
us2_full <- read_dta("./data/b_indresp_syn.dta")
us3_full <- read_dta("./data/c_indresp_syn.dta")
us4_full <- read_dta("./data/d_indresp_syn.dta")
Here, I saved my data in a subfolder of the working directory called “data”. I am using a relative path where “.” represents the working directory and “data” is the subfolder. If your data are directly in the working directory, you can use just the file name as the input (e.g., read_dta("a_indresp_syn.dta")). If you have issues importing the data, you can use the RStudio menu: “File”->“Import Dataset”->“From Stata” and select the file you want to import.
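As an optional aside, the four calls can also be generated programmatically. The snippet below is only a sketch, assuming the same folder and file names as above; it uses str_c() from stringr and map() from purrr (both loaded with the tidyverse):

# build the four file paths from the wave prefixes and import them in one go
wave_prefixes <- c("a", "b", "c", "d")
file_paths <- str_c("./data/", wave_prefixes, "_indresp_syn.dta")
us_list <- map(file_paths, read_dta)  # a list holding one data frame per wave

The rest of the guide keeps the four separate objects, which is easier to follow step by step.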
2. Selecting variables and renaming them
The first steps when working with new data normally involve exploring it and selecting the variables of interest. A useful command for this is glimpse(), which gives a quick overview of the data. Let’s look at the wave 1 data:
glimpse(us1_full)
## Rows: 51,007
## Columns: 13
## $ pidp         <dbl> 5, 6, 7, 10, 11, 13, 14, 16, 18, 19, 21, 22, 23, 24, 25, …
## $ a_sex        <dbl+lbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2,…
## $ a_dvage      <dbl+lbl> 28, 80, 60, 42, 53, 51, 38, 27, 56, 31, 46, 30, 47, 3…
## $ a_hiqual_dv  <dbl+lbl> 1, 9, 3, 1, 1, 9, 4, 5, 2, 9, 1, 1, 9, 4, 1, 3, 3, 1,…
## $ a_single_dv  <dbl+lbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
## $ a_urban_dv   <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ a_vote6      <dbl+lbl> 3, 4, 2, 1, -7, 3, 3, 4, 3, 4, 2, 4, 4, …
## $ a_fimngrs_dv <dbl+lbl> 3283.87, 896.00, 1458.33, 2916.67, 4306.27, 1121.13,…
## $ a_sclfsato   <dbl+lbl> -9, 6, 2, -9, -7, 3, 5, 6, -9, -9, 6, 3, 5, …
## $ a_sf12pcs_dv <dbl+lbl> 54.51, 53.93, 33.18, 58.30, -7.00, 46.66, 49.93, 30.7…
## $ a_sf12mcs_dv <dbl+lbl> 55.73, 46.48, 46.80, 49.41, -7.00, 37.17, 56.10, 45.2…
## $ a_istrtdaty  <dbl+lbl> 2009, 2010, 2009, 2009, 2010, 2009, 2010, 2010, 2010,…
## $ a_sf1        <dbl+lbl> 2, 4, 5, 2, 3, 5, 3, 2, 3, 2, 2, 2, 4, 3, 2, 4, 2, 2,…
We see that we have ~51,000 cases and 13 variables. I have already selected a subset of the variables in the study; in your own work, you would need to check the questionnaire, codebook and documentation to select the variables of interest (normally, there would be hundreds of variables). From wave 1, I want to keep all the variables. Some of them I will treat as time-constant, which means they do not need to be selected again from future waves. For this example, I will treat “sex”, “age”, and “hiqual_dv” (education) as time-constant variables. The rest of the variables will be treated as time-varying (see this blog post for a discussion of the distinction). As such, from waves 2-4 we only need to select the time-varying variables plus the id.
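A note on the dbl+lbl columns above: because the data come from Stata, haven imports the variables as labelled vectors that keep both the numeric codes and their value labels. A quick sketch for inspecting one variable (print_labels() comes with haven, count() with dplyr):

print_labels(us1_full$a_sex)  # show the value labels attached to the numeric codes
count(us1_full, a_sex)        # frequency table of the codes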
Let’s check what variables we have in wave 2:
glimpse(us2_full)
## Rows: 54,462
## Columns: 13
## $ pidp         <dbl> 1, 2, 3, 4, 5, 6, 9, 11, 12, 13, 14, 16, 17, 18, 19, 20, …
## $ b_sex        <dbl+lbl> 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2,…
## $ b_dvage      <dbl+lbl> 49, 28, 65, 18, 29, 81, 47, 54, 16, 52, 39, 28, 18, 5…
## $ b_hiqual_dv  <dbl+lbl> 4, 3, 9, 9, 1, 9, 5, 1, 2, 9, 4, 5, 3, 1, 9, 5, 1, 1,…
## $ b_single_dv  <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,…
## $ b_urban_dv   <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ b_vote6      <dbl+lbl> 3, 3, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, -7, …
## $ b_fimngrs_dv <dbl+lbl> 940.10, 1344.00, 514.66, 1111.65, 4002.50, 709.00,…
## $ b_sclfsato   <dbl+lbl> 6, -8, 5, 1, 6, 6, 6, 6, -8, 3, 5, 6, -7, …
## $ b_sf12pcs_dv <dbl+lbl> 57.49, -8.00, 51.84, 36.36, 62.98, 46.39, -9.00, 54.8…
## $ b_sf12mcs_dv <dbl+lbl> 54.20, -8.00, 54.96, 28.25, 36.22, 45.39, -9.00, 54.3…
## $ b_istrtdaty  <dbl+lbl> 2010, 2010, 2010, 2011, 2010, 2011, 2010, 2011, 2010,…
## $ b_sf1        <dbl+lbl> 2, 1, 3, 5, 2, 3, 2, 2, 2, 3, 2, 3, 4, 5, 3, 2, 3, 2,…
We can use the select() function to make smaller datasets with the variables we want. When we have a sequence of variables, we can use the “:” operator to select all the variables in a range. We will use this to make smaller versions of the data for waves 2 to 4. Note that in wave 4, we have another variable we do not want to keep, “ncars”, which we can exclude by putting a “-” in front of it. Here is the code for selecting the variables of interest in waves 2 to 4.
us2 <- select(us2_full, pidp, b_single_dv:b_sf1)
us3 <- select(us3_full, pidp, c_single_dv:c_sf1)
us4 <- select(us4_full, pidp, d_single_dv:d_sf1, -d_ncars)
We can also rename variables to make them easier to work with. The command is simply rename(). For example, we can rename the variable “dvage” to “age” and “hiqual_dv” to “hiqual” using the following code:
us1_full <- rename(us1_full, a_age = a_dvage, a_hiqual = a_hiqual_dv)
Before merging the data, we need to do two more types of renaming. First, for time-varying variables it is generally easier to use suffixes instead of prefixes, as this facilitates data reshaping, and numbers are easier to work with than letters as wave indicators. As a result, we will switch the naming convention from letters as prefixes (the default in Understanding Society) to numbers as suffixes. Second, we do not need to keep the wave indicator at all in the names of the variables we will treat as time-constant.
We can use the rename_all() function to apply the renaming to all the variables. It takes as input the dataset and then a function that is applied to all the variable names. In this case, we will use the str_remove() function to remove the wave reference from the variable names. Here is the code:
us1_full <- rename_all(us1_full, ~str_remove(., "a_"))
Two things to notice are that before the function, we need to add the “~” symbol and that we need to use the “.” to refer to the variable names.
We can use the same approach for the other waves of data, changing the letter used as the prefix accordingly using the code:
us2 <- rename_all(us2, ~str_remove(., "b_"))
us3 <- rename_all(us3, ~str_remove(., "c_"))
us4 <- rename_all(us4, ~str_remove(., "d_"))
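As a side note, rename_all() and rename_at() are superseded in newer versions of dplyr. If you prefer the current interface, rename_with() covers both cases; the sketch below shows equivalents of the wave 1 calls (to run instead of, not in addition to, the code above and below):

# equivalent of rename_all(): apply the function to every variable name
us1_full <- rename_with(us1_full, ~ str_remove(.x, "a_"))

# equivalent of rename_at(): restrict the renaming with the .cols argument
us1_full <- rename_with(us1_full, ~ str_c(.x, "_1"), .cols = single_dv:sf1)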
Next, we need to add the new suffix, which will be the wave number. We can use the rename_at() function. This function takes as input the dataset, the variables we want to rename, and the function to apply to them. In this case, we will use the str_c() function to add the wave number as a suffix. We also want to do this only for the time-varying variables.
For wave 1, the command would be:
us1_full <- rename_at(us1_full, vars(single_dv:sf1), ~str_c(., "_1"))
Let’s check if this worked as expected:
glimpse(us1_full)
## Rows: 51,007
## Columns: 13
## $ pidp         <dbl> 5, 6, 7, 10, 11, 13, 14, 16, 18, 19, 21, 22, 23, 24, 25, …
## $ sex          <dbl+lbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2,…
## $ age          <dbl+lbl> 28, 80, 60, 42, 53, 51, 38, 27, 56, 31, 46, 30, 47, 3…
## $ hiqual       <dbl+lbl> 1, 9, 3, 1, 1, 9, 4, 5, 2, 9, 1, 1, 9, 4, 1, 3, 3, 1,…
## $ single_dv_1  <dbl+lbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
## $ urban_dv_1   <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ vote6_1      <dbl+lbl> 3, 4, 2, 1, -7, 3, 3, 4, 3, 4, 2, 4, 4, …
## $ fimngrs_dv_1 <dbl+lbl> 3283.87, 896.00, 1458.33, 2916.67, 4306.27, 1121.13,…
## $ sclfsato_1   <dbl+lbl> -9, 6, 2, -9, -7, 3, 5, 6, -9, -9, 6, 3, 5, …
## $ sf12pcs_dv_1 <dbl+lbl> 54.51, 53.93, 33.18, 58.30, -7.00, 46.66, 49.93, 30.7…
## $ sf12mcs_dv_1 <dbl+lbl> 55.73, 46.48, 46.80, 49.41, -7.00, 37.17, 56.10, 45.2…
## $ istrtdaty_1  <dbl+lbl> 2009, 2010, 2009, 2009, 2010, 2009, 2010, 2010, 2010,…
## $ sf1_1        <dbl+lbl> 2, 4, 5, 2, 3, 5, 3, 2, 3, 2, 2, 2, 4, 3, 2, 4, 2, 2,…
Let’s do the same for waves 2 to 4:
us2 <- rename_at(us2, vars(single_dv:sf1), ~str_c(., "_2"))
us3 <- rename_at(us3, vars(single_dv:sf1), ~str_c(., "_3"))
us4 <- rename_at(us4, vars(single_dv:sf1), ~str_c(., "_4"))

glimpse(us2)
## Rows: 54,462
## Columns: 10
## $ pidp         <dbl> 1, 2, 3, 4, 5, 6, 9, 11, 12, 13, 14, 16, 17, 18, 19, 20, …
## $ single_dv_2  <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,…
## $ urban_dv_2   <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ vote6_2      <dbl+lbl> 3, 3, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, -7, …
## $ fimngrs_dv_2 <dbl+lbl> 940.10, 1344.00, 514.66, 1111.65, 4002.50, 709.00,…
## $ sclfsato_2   <dbl+lbl> 6, -8, 5, 1, 6, 6, 6, 6, -8, 3, 5, 6, -7, …
## $ sf12pcs_dv_2 <dbl+lbl> 57.49, -8.00, 51.84, 36.36, 62.98, 46.39, -9.00, 54.8…
## $ sf12mcs_dv_2 <dbl+lbl> 54.20, -8.00, 54.96, 28.25, 36.22, 45.39, -9.00, 54.3…
## $ istrtdaty_2  <dbl+lbl> 2010, 2010, 2010, 2011, 2010, 2011, 2010, 2011, 2010,…
## $ sf1_2        <dbl+lbl> 2, 1, 3, 5, 2, 3, 2, 2, 2, 3, 2, 3, 4, 5, 3, 2, 3, 2,…
We are almost done with the renaming. One thing you need to ensure before reshaping data is that a unique character separates the variable name from the wave number. If that is not the case, the data restructuring will not work properly. If we look at wave 2, we see that “_” appears multiple times in some of the variable names (e.g., “fimngrs_dv_2”). We can remove the “_dv” part to avoid this issue and make the names simpler (“_dv” is used by Understanding Society to mark variables derived after data collection). Here is the code for all four waves:
us1_full <- rename_all(us1_full, ~str_remove(., "_dv"))
us2 <- rename_all(us2, ~str_remove(., "_dv"))
us3 <- rename_all(us3, ~str_remove(., "_dv"))
us4 <- rename_all(us4, ~str_remove(., "_dv"))
Let’s check the result:
glimpse(us2)
## Rows: 54,462
## Columns: 10
## $ pidp        <dbl> 1, 2, 3, 4, 5, 6, 9, 11, 12, 13, 14, 16, 17, 18, 19, 20, 2…
## $ single_2    <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, …
## $ urban_2     <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ vote6_2     <dbl+lbl> 3, 3, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, -7, 4…
## $ fimngrs_2   <dbl+lbl> 940.10, 1344.00, 514.66, 1111.65, 4002.50, 709.00, …
## $ sclfsato_2  <dbl+lbl> 6, -8, 5, 1, 6, 6, 6, 6, -8, 3, 5, 6, -7, 5…
## $ sf12pcs_2   <dbl+lbl> 57.49, -8.00, 51.84, 36.36, 62.98, 46.39, -9.00, 54.80…
## $ sf12mcs_2   <dbl+lbl> 54.20, -8.00, 54.96, 28.25, 36.22, 45.39, -9.00, 54.37…
## $ istrtdaty_2 <dbl+lbl> 2010, 2010, 2010, 2011, 2010, 2011, 2010, 2011, 2010, …
## $ sf1_2       <dbl+lbl> 2, 1, 3, 5, 2, 3, 2, 2, 2, 3, 2, 3, 4, 5, 3, 2, 3, 2, …
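Before moving on, we can also check programmatically that no variable name still contains more than one “_”; a small sketch using str_count() from stringr:

# any names with more than one "_" left would break the reshape below
names(us2)[str_count(names(us2), "_") > 1]  # character(0) means we are fine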
We are now ready to move to the next stage.
3. Merging the data
Before we merge the data, it is useful to create a variable that tells us if someone is present in each wave. This will be useful later on when exploring attrition patterns (i.e., people dropping out) and deciding whether to use a balanced dataset or not (more about this in this blog post).
We can create a simple variable that takes the value TRUE if the person is present in that wave. When we merge the data, anyone not present in a given wave will be coded as missing on this variable. We can use the mutate() function to create these variables. Here is the code for all four waves:
us1_full <- mutate(us1_full, present_1 = TRUE)
us2 <- mutate(us2, present_2 = TRUE)
us3 <- mutate(us3, present_3 = TRUE)
us4 <- mutate(us4, present_4 = TRUE)
Now we have all we need to merge the data (always make sure you have a unique, time-constant identifier before you do this; in our case, we can use the “pidp” variable). We will do this in a sequence, starting with waves 1 and 2 and then adding the other waves. We will use the full_join() function, which takes as inputs the two datasets we want to merge and the variable to merge on. This command keeps all the cases, even if some people are present in only one of the waves. We can later decide whom to keep using the “present” variables.
Here is the code needed to merge the four datasets:
us12 <- full_join(us1_full, us2, by = "pidp")
us123 <- full_join(us12, us3, by = "pidp")
us1234 <- full_join(us123, us4, by = "pidp")
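With more waves, chaining full_join() calls gets repetitive. As an optional alternative sketch (same result, assuming the four datasets above), reduce() from purrr applies full_join() across a list of datasets:

# merge all four datasets in one call; equivalent to the chain above
us1234 <- reduce(list(us1_full, us2, us3, us4), full_join, by = "pidp")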
Let’s check our new data:
glimpse(us1234)
## Rows: 73,101
## Columns: 44
## $ pidp        <dbl> 5, 6, 7, 10, 11, 13, 14, 16, 18, 19, 21, 22, 23, 24, 25, 2…
## $ sex         <dbl+lbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, …
## $ age         <dbl+lbl> 28, 80, 60, 42, 53, 51, 38, 27, 56, 31, 46, 30, 47, 32…
## $ hiqual      <dbl+lbl> 1, 9, 3, 1, 1, 9, 4, 5, 2, 9, 1, 1, 9, 4, 1, 3, 3, 1, …
## $ single_1    <dbl+lbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ urban_1     <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, …
## $ vote6_1     <dbl+lbl> 3, 4, 2, 1, -7, 3, 3, 4, 3, 4, 2, 4, 4, 3…
## $ fimngrs_1   <dbl+lbl> 3283.87, 896.00, 1458.33, 2916.67, 4306.27, 1121.13, …
## $ sclfsato_1  <dbl+lbl> -9, 6, 2, -9, -7, 3, 5, 6, -9, -9, 6, 3, 5, 3…
## $ sf12pcs_1   <dbl+lbl> 54.51, 53.93, 33.18, 58.30, -7.00, 46.66, 49.93, 30.75…
## $ sf12mcs_1   <dbl+lbl> 55.73, 46.48, 46.80, 49.41, -7.00, 37.17, 56.10, 45.20…
## $ istrtdaty_1 <dbl+lbl> 2009, 2010, 2009, 2009, 2010, 2009, 2010, 2010, 2010, …
## $ sf1_1       <dbl+lbl> 2, 4, 5, 2, 3, 5, 3, 2, 3, 2, 2, 2, 4, 3, 2, 4, 2, 2, …
## $ present_1   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ single_2    <dbl+lbl> 1, 0, NA, NA, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0…
## $ urban_2     <dbl+lbl> 1, 1, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
## $ vote6_2     <dbl+lbl> 4, 4, NA, NA, 2, 4, 4, 4, 4, 3, 2, 4, 4, 2…
## $ fimngrs_2   <dbl+lbl> 4002.50, 709.00, NA, NA, 4010.01, 280.17, …
## $ sclfsato_2  <dbl+lbl> 6, 6, NA, NA, 6, 3, 5, 6, 5, 5, 4, 6, 6, 1…
## $ sf12pcs_2   <dbl+lbl> 62.98, 46.39, NA, NA, 54.80, 62.65, 56.10, 50.45…
## $ sf12mcs_2   <dbl+lbl> 36.22, 45.39, NA, NA, 54.37, 31.34, 57.16, 58.47…
## $ istrtdaty_2 <dbl+lbl> 2010, 2011, NA, NA, 2011, 2010, 2011, 2011, 2011, …
## $ sf1_2       <dbl+lbl> 2, 3, NA, NA, 2, 3, 2, 3, 5, 3, 3, 2, 5, 2…
## $ present_2   <lgl> TRUE, TRUE, NA, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ single_3    <dbl+lbl> 1, 0, NA, 0, 0, 0, 0, 1, 1, NA, 0, 0, 1, NA…
## $ urban_3     <dbl+lbl> 1, 1, NA, 1, 1, 1, 1, 1, 1, NA, 1, 1, 2, NA…
## $ vote6_3     <dbl+lbl> 4, 4, NA, 3, 3, 4, 3, 4, 4, NA, 2, 4, 4, NA…
## $ fimngrs_3   <dbl+lbl> 3616.67, 702.00, NA, 336.58, 2751.66, 666.67, …
## $ sclfsato_3  <dbl+lbl> 2, -8, NA, 5, 6, -8, 3, 6, 3, NA, 6, 6, 3, NA…
## $ sf12pcs_3   <dbl+lbl> 56.97, -8.00, NA, 56.15, 57.28, -8.00, 58.05, 58.57…
## $ sf12mcs_3   <dbl+lbl> 60.02, -8.00, NA, 49.12, 49.52, -8.00, 57.16, 55.17…
## $ istrtdaty_3 <dbl+lbl> 2011, 2012, NA, 2012, 2012, 2011, 2012, 2012, 2012, …
## $ sf1_3       <dbl+lbl> 1, 3, NA, 2, 3, 3, 2, 3, 3, NA, 2, 2, 4, NA…
## $ present_3   <lgl> TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, TR…
## $ single_4    <dbl+lbl> 0, 0, NA, NA, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0…
## $ urban_4     <dbl+lbl> 1, 1, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
## $ vote6_4     <dbl+lbl> 4, 3, NA, NA, 1, 4, 4, 4, 3, -7, 2, 4, 4, 3…
## $ fimngrs_4   <dbl+lbl> 850.50, 829.15, NA, NA, 4322.68, 423.58, …
## $ sclfsato_4  <dbl+lbl> 1, 3, NA, NA, 2, 6, 5, 5, 5, -7, 6, 4, 6, 4…
## $ sf12pcs_4   <dbl+lbl> 56.15, 46.16, NA, NA, 56.42, 42.78, 56.40, 52.24…
## $ sf12mcs_4   <dbl+lbl> 59.04, 37.02, NA, NA, 54.45, 46.40, 60.02, 52.40…
## $ istrtdaty_4 <dbl+lbl> 2012, 2013, NA, NA, 2013, 2012, 2013, 2013, 2014, …
## $ sf1_4       <dbl+lbl> 3, 3, NA, NA, 2, 4, 2, 3, 4, 3, 2, 2, 4, 2…
## $ present_4   <lgl> TRUE, TRUE, NA, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
We have ~73,000 cases and 44 variables in this “wide” data. Before we reshape the data, let’s explore participation in the study using the count() command and the “present” variables. Here, we make a table of participation across all four waves:
count(us1234, present_1, present_2, present_3, present_4)
## # A tibble: 15 × 5
##    present_1 present_2 present_3 present_4     n
##    <lgl>     <lgl>     <lgl>     <lgl>     <int>
##  1 TRUE      TRUE      TRUE      TRUE      26748
##  2 TRUE      TRUE      TRUE      NA         4146
##  3 TRUE      TRUE      NA        TRUE       1618
##  4 TRUE      TRUE      NA        NA         5779
##  5 TRUE      NA        TRUE      TRUE       1581
##  6 TRUE      NA        TRUE      NA          622
##  7 TRUE      NA        NA        TRUE        332
##  8 TRUE      NA        NA        NA        10181
##  9 NA        TRUE      TRUE      TRUE      10805
## 10 NA        TRUE      TRUE      NA         2069
## 11 NA        TRUE      NA        TRUE        634
## 12 NA        TRUE      NA        NA         2663
## 13 NA        NA        TRUE      TRUE       2334
## 14 NA        NA        TRUE      NA         1039
## 15 NA        NA        NA        TRUE       2550
We see that around 26,700 people were present in all four waves, ~4,100 dropped out at wave 4, etc. We also see that some people joined the study later. For example, over 16,000 people joined at wave 2 (the rows where “present_1” is missing but “present_2” is TRUE). It is important to understand these patterns before deciding how to handle the missing data. For example, everyone who joined after wave 1 will be missing in the wave 1 data, including on our time-constant variables. We could choose to drop these people or take their time-constant information from wave 2. For now, we will keep all the people present in wave 1, even if they drop out later. This is still unbalanced data (some people drop out), but the reference population is the sample in wave 1 (more about balanced and unbalanced data can be found in this post).
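A related summary worth computing is the number of waves in which each person was observed. A short sketch based on the “present” flags (the n_waves name is just illustrative):

# count, for each person, the waves in which they appear
n_waves <- rowSums(!is.na(select(us1234, starts_with("present"))))
table(n_waves)  # distribution of observed waves per person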
Here is the code for filtering the cases and describing the resulting data:
us <- filter(us1234, present_1 == TRUE)

count(us, present_1, present_2, present_3, present_4)
## # A tibble: 8 × 5
##   present_1 present_2 present_3 present_4     n
##   <lgl>     <lgl>     <lgl>     <lgl>     <int>
## 1 TRUE      TRUE      TRUE      TRUE      26748
## 2 TRUE      TRUE      TRUE      NA         4146
## 3 TRUE      TRUE      NA        TRUE       1618
## 4 TRUE      TRUE      NA        NA         5779
## 5 TRUE      NA        TRUE      TRUE       1581
## 6 TRUE      NA        TRUE      NA          622
## 7 TRUE      NA        NA        TRUE        332
## 8 TRUE      NA        NA        NA        10181
Now, we have all the variables and cases we want and can move to the next step.
4. Reshaping the data
In general, it is easier to clean data in the long format, as we need to recode each variable just once, regardless of the number of time points. This does assume that the variables are coded consistently over time (i.e., the same values mean the same thing in every wave). We will use the pivot_longer() command to do this (see an overall introduction to pivoting here and an introduction to how to apply it to longitudinal data here). We will use the following inputs:
- the data,
- the columns that are time-varying (notice that we select the id and the time-constant variables and put “!” in front to take the opposite, i.e., all the other columns),
- the separator that divides the variable name from the wave number,
- the pattern in the names. Here, we say that the first part of each name is the variable name (indicated by “.value”) and that the suffix should go to a new variable called “wave”.
This applied to our data results in the following code:
usl <- pivot_longer(
  us,
  cols = !c(pidp, sex, age, hiqual),
  names_sep = "_",
  names_to = c(".value", "wave")
)

glimpse(usl)
## Rows: 204,028
## Columns: 15
## $ pidp      <dbl> 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10, 11, 11, …
## $ sex       <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ age       <dbl+lbl> 28, 28, 28, 28, 80, 80, 80, 80, 60, 60, 60, 60, 42, 42, …
## $ hiqual    <dbl+lbl> 1, 1, 1, 1, 9, 9, 9, 9, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1,…
## $ wave      <chr> "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", …
## $ single    <dbl+lbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, NA, NA, NA, 0, NA, …
## $ urban     <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, 1, NA, …
## $ vote6     <dbl+lbl> 3, 4, 4, 4, 4, 4, 4, 3, 2, NA, NA, NA, 1, NA, …
## $ fimngrs   <dbl+lbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 7…
## $ sclfsato  <dbl+lbl> -9, 6, 2, 1, 6, 6, -8, 3, 2, NA, NA, NA, -9, NA, …
## $ sf12pcs   <dbl+lbl> 54.51, 62.98, 56.97, 56.15, 53.93, 46.39, -8.00, 46.16, …
## $ sf12mcs   <dbl+lbl> 55.73, 36.22, 60.02, 59.04, 46.48, 45.39, -8.00, 37.02, …
## $ istrtdaty <dbl+lbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, …
## $ sf1       <dbl+lbl> 2, 2, 1, 3, 4, 3, 3, 3, 5, NA, NA, NA, 2, NA, …
## $ present   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA…
We now have many more rows (~204,000) and fewer columns (15). Each row now represents an individual measured at a particular point in time.
To make the structure of the data easier to see, let’s reorganize the columns and rows, positioning “pidp” and “wave” at the start and ordering the data by these two variables. This rearrangement lets us easily follow each individual’s progression over time. For sorting the rows, we’ll use the arrange() function. Here’s the code, which then displays the first lines of the data for review:
usl <- select(usl, pidp, wave, everything())
usl <- arrange(usl, pidp, wave)

head(usl)
## # A tibble: 6 × 15
##    pidp wave  sex       age     hiqual  single  urban   vote6   fimngrs sclfsato
##   <dbl> <chr> <dbl+lbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+lb>
## 1     5 1     1 [male]  28      1 [Deg… 1 [Yes] 1 [urb… 3 [not…   3284. -9 [mis…
## 2     5 2     1 [male]  28      1 [Deg… 1 [Yes] 1 [urb… 4 [or …   4002.  6 [mos…
## 3     5 3     1 [male]  28      1 [Deg… 1 [Yes] 1 [urb… 4 [or …   3617.  2 [mos…
## 4     5 4     1 [male]  28      1 [Deg… 0 [No]  1 [urb… 4 [or …    850.  1 [com…
## 5     6 1     1 [male]  80      9 [No … 0 [No]  1 [urb… 4 [or …    896   6 [mos…
## 6     6 2     1 [male]  80      9 [No … 0 [No]  1 [urb… 4 [or …    709   6 [mos…
## # ℹ 5 more variables: sf12pcs <dbl+lbl>, sf12mcs <dbl+lbl>,
## #   istrtdaty <dbl+lbl>, sf1 <dbl+lbl>, present <lgl>
Note that here everything() is a helper function that keeps all the other variables in their current order.
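In newer versions of dplyr, relocate() offers a more direct way to move columns to the front; a hedged equivalent of the select() call above:

# move pidp and wave to the front; all other columns keep their order
usl <- relocate(usl, pidp, wave)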
When we finish cleaning the data, we can also create a wide version of the data. This is useful for some types of analysis and descriptive statistics. We can use the pivot_wider() command with the following inputs:
- the long data,
- the time-varying variables (values_from),
- how to separate the suffix (names_sep),
- where to take the suffix from (names_from).
In our case, reshaping the data to wide format can be done using the following code:
usw <- pivot_wider(
  usl,
  values_from = !c(pidp, sex, age, hiqual, wave),
  names_sep = "_",
  names_from = wave
)

glimpse(usw)
## Rows: 51,007
## Columns: 44
## $ pidp        <dbl> 5, 6, 7, 10, 11, 13, 14, 16, 18, 19, 21, 22, 23, 24, 25, 2…
## $ sex         <dbl+lbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, …
## $ age         <dbl+lbl> 28, 80, 60, 42, 53, 51, 38, 27, 56, 31, 46, 30, 47, 32…
## $ hiqual      <dbl+lbl> 1, 9, 3, 1, 1, 9, 4, 5, 2, 9, 1, 1, 9, 4, 1, 3, 3, 1, …
## $ single_1    <dbl+lbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ single_2    <dbl+lbl> 1, 0, NA, NA, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0…
## $ single_3    <dbl+lbl> 1, 0, NA, 0, 0, 0, 0, 1, 1, NA, 0, 0, 1, NA…
## $ single_4    <dbl+lbl> 0, 0, NA, NA, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0…
## $ urban_1     <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, …
## $ urban_2     <dbl+lbl> 1, 1, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
## $ urban_3     <dbl+lbl> 1, 1, NA, 1, 1, 1, 1, 1, 1, NA, 1, 1, 2, NA…
## $ urban_4     <dbl+lbl> 1, 1, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2…
## $ vote6_1     <dbl+lbl> 3, 4, 2, 1, -7, 3, 3, 4, 3, 4, 2, 4, 4, 3…
## $ vote6_2     <dbl+lbl> 4, 4, NA, NA, 2, 4, 4, 4, 4, 3, 2, 4, 4, 2…
## $ vote6_3     <dbl+lbl> 4, 4, NA, 3, 3, 4, 3, 4, 4, NA, 2, 4, 4, NA…
## $ vote6_4     <dbl+lbl> 4, 3, NA, NA, 1, 4, 4, 4, 3, -7, 2, 4, 4, 3…
## $ fimngrs_1   <dbl+lbl> 3283.87, 896.00, 1458.33, 2916.67, 4306.27, 1121.13, …
## $ fimngrs_2   <dbl+lbl> 4002.50, 709.00, NA, NA, 4010.01, 280.17, …
## $ fimngrs_3   <dbl+lbl> 3616.67, 702.00, NA, 336.58, 2751.66, 666.67, …
## $ fimngrs_4   <dbl+lbl> 850.50, 829.15, NA, NA, 4322.68, 423.58, …
## $ sclfsato_1  <dbl+lbl> -9, 6, 2, -9, -7, 3, 5, 6, -9, -9, 6, 3, 5, 3…
## $ sclfsato_2  <dbl+lbl> 6, 6, NA, NA, 6, 3, 5, 6, 5, 5, 4, 6, 6, 1…
## $ sclfsato_3  <dbl+lbl> 2, -8, NA, 5, 6, -8, 3, 6, 3, NA, 6, 6, 3, NA…
## $ sclfsato_4  <dbl+lbl> 1, 3, NA, NA, 2, 6, 5, 5, 5, -7, 6, 4, 6, 4…
## $ sf12pcs_1   <dbl+lbl> 54.51, 53.93, 33.18, 58.30, -7.00, 46.66, 49.93, 30.75…
## $ sf12pcs_2   <dbl+lbl> 62.98, 46.39, NA, NA, 54.80, 62.65, 56.10, 50.45…
## $ sf12pcs_3   <dbl+lbl> 56.97, -8.00, NA, 56.15, 57.28, -8.00, 58.05, 58.57…
## $ sf12pcs_4   <dbl+lbl> 56.15, 46.16, NA, NA, 56.42, 42.78, 56.40, 52.24…
## $ sf12mcs_1   <dbl+lbl> 55.73, 46.48, 46.80, 49.41, -7.00, 37.17, 56.10, 45.20…
## $ sf12mcs_2   <dbl+lbl> 36.22, 45.39, NA, NA, 54.37, 31.34, 57.16, 58.47…
## $ sf12mcs_3   <dbl+lbl> 60.02, -8.00, NA, 49.12, 49.52, -8.00, 57.16, 55.17…
## $ sf12mcs_4   <dbl+lbl> 59.04, 37.02, NA, NA, 54.45, 46.40, 60.02, 52.40…
## $ istrtdaty_1 <dbl+lbl> 2009, 2010, 2009, 2009, 2010, 2009, 2010, 2010, 2010, …
## $ istrtdaty_2 <dbl+lbl> 2010, 2011, NA, NA, 2011, 2010, 2011, 2011, 2011, …
## $ istrtdaty_3 <dbl+lbl> 2011, 2012, NA, 2012, 2012, 2011, 2012, 2012, 2012, …
## $ istrtdaty_4 <dbl+lbl> 2012, 2013, NA, NA, 2013, 2012, 2013, 2013, 2014, …
## $ sf1_1       <dbl+lbl> 2, 4, 5, 2, 3, 5, 3, 2, 3, 2, 2, 2, 4, 3, 2, 4, 2, 2, …
## $ sf1_2       <dbl+lbl> 2, 3, NA, NA, 2, 3, 2, 3, 5, 3, 3, 2, 5, 2…
## $ sf1_3       <dbl+lbl> 1, 3, NA, 2, 3, 3, 2, 3, 3, NA, 2, 2, 4, NA…
## $ sf1_4       <dbl+lbl> 3, 3, NA, NA, 2, 4, 2, 3, 4, 3, 2, 2, 4, 2…
## $ present_1   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ present_2   <lgl> TRUE, TRUE, NA, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ present_3   <lgl> TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, TR…
## $ present_4   <lgl> TRUE, TRUE, NA, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
5. Saving the data
Now that we have the data in both long and wide formats, we can save them for future use. For example, we could save the long data as an “.rds” file using write_rds():
write_rds(usl, "./data/usl_syn.rds")
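We might also want to save the wide version and, in a later session, read the data back in. A short sketch, assuming the same “data” subfolder (the usw_syn.rds file name is just illustrative):

write_rds(usw, "./data/usw_syn.rds")   # save the wide version as well

usl <- read_rds("./data/usl_syn.rds")  # reload the long data in a future session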
Conclusions
The process of preparing data for longitudinal analysis can be complex and time-consuming, yet it is the cornerstone of sound research. The journey begins with importing data, followed by selecting and renaming variables, merging datasets, choosing relevant cases, and reshaping the data. This step-by-step guide aimed to equip you with the necessary tools to prepare your own data for analysis efficiently. After following the guide on your computer, try adapting the code to your own data. Put any questions you might have in the comments below!
After transforming the data, the next stage is to clean it by recoding the variables. Then, it can be explored using tables and visualization.