
Merge Data in R: Complete Guide to Joining Datasets

When working with more complex data, information is often stored in different files. For example, individual and neighbourhood data might come from different sources, and you might need to merge them before you can analyse them. Similarly, in panel data we might have individual- and household-level data, as well as data from multiple time points, with each piece typically stored in a separate file. To make the most of this information, we need to merge data at different levels. This can be a complex process, and it is important to do it correctly to avoid mistakes and to use the data as effectively as possible.

In this post, I will provide a step-by-step guide to merging data in R, with a focus on panel data.

Different ways to merge data

There are different types of merges, or joins, possible. In social science we typically perform a mutating join, which brings information together from multiple sources for analysis. In the tidyverse, these come in four slightly different flavours. At one extreme, full_join() brings all variables and cases together into one dataset, while at the other, inner_join() keeps only cases present in both datasets. In between, we have left_join() and right_join(), which keep all cases from one dataset and only the matching cases from the other. Here is a visual representation of these different types of joins:

[Figure: the four mutating joins in R]
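To make the difference concrete, here is a toy comparison of the four joins. The data are hypothetical, just two small tables sharing the key "id":

```r
library(dplyr)

# Two toy datasets sharing the key "id" (hypothetical values)
demo1 <- tibble(id = c(1, 2, 3), age = c(34, 51, 28))
demo2 <- tibble(id = c(2, 3, 4), income = c(21000, 35000, 18000))

full_join(demo1, demo2, by = "id")   # ids 1-4; NA where there is no match
inner_join(demo1, demo2, by = "id")  # only ids 2 and 3
left_join(demo1, demo2, by = "id")   # ids 1-3; income is NA for id 1
right_join(demo1, demo2, by = "id")  # ids 2-4; age is NA for id 4
```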

Another dimension to consider when joining data is the level at which it happens. The most straightforward approach would be a “1:1” merger, in which each case in one dataset has a unique match in the other. For example, if we have data at two points in time at the individual level, we can do a 1:1 merger. However, sometimes we also deal with “1:m” or “m:m” mergers. For example, if we have individual-level and household-level data, we would need to perform a “1:m” merger, since multiple individuals can belong to the same household. Similarly, if we have neighbourhood and school data that we want to link to students, we would have a “m:m” merger.

In R, the join functions infer the type of join (1:1, 1:m, m:m) from the data, so we do not need to specify it explicitly. As a result, we must ensure that the data and IDs are structured correctly before we merge them.
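That said, recent versions of dplyr (1.1.0 and later) let us state the structure we expect via the optional relationship argument, which turns a silent mismatch into an error. A minimal sketch with hypothetical toy data:

```r
library(dplyr)

# Toy data (hypothetical): several people share one household
households <- tibble(hid = c(10, 20), tenure = c("own", "rent"))
people <- tibble(pid = 1:3, hid = c(10, 10, 20))

# Declaring the expected structure makes the join fail loudly
# if the keys do not match it (e.g. duplicated household IDs)
left_join(people, households, by = "hid", relationship = "many-to-one")
```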

Merging panel data

Longitudinal or panel data can be particularly complex to merge, as they often include data at different levels (e.g., individual, household, neighbourhood) and multiple time points. We will use this setting to illustrate how to merge data in general.

As an example, we will use data from the Understanding Society study to explore the main types of merging you might want to use. This study includes data at different levels. Here we will focus on the following types of data:

  • “indresp”: individual-level data with answers from the main survey
  • “hhresp”: household-level data with information about the household
  • “youth”: survey answers from those aged 10 to 15 in the household
  • “egoalt”: data about the relationships in the household

Before merging, it is also important to understand the IDs (sometimes called “keys”). In this case, we can use the following variables:

  • “pidp”: individual-level ID that can also be used across time
  • “x_hidp”: household-level ID that can be used within each point in time but not across time
  • “x_pn”: identifies the individual in the household. Could be used in combination with “x_hidp” to identify individuals, as an alternative to “pidp”
  • “apidp”: identifies the other person in the relationship file. Could be used in combination with “pidp” to identify the partner, child, sibling, etc.
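As a hypothetical illustration of a composite key such as “x_hidp” plus “x_pn”, a join can match on several columns at once (toy data, not from the study):

```r
library(dplyr)

# Toy data: the person number (pn) is only unique within a household (hid)
adults <- tibble(hid = c(10, 10, 20), pn = c(1, 2, 1), age = c(40, 38, 62))
extra  <- tibble(hid = c(10, 20), pn = c(2, 1), smoker = c(TRUE, FALSE))

# Both columns must match for two rows to be linked
left_join(adults, extra, by = c("hid", "pn"))
```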

The other thing to keep in mind with this study is that file and variable names are prefixed with a letter denoting the wave: “a_indresp” is the wave 1 data, while “b_indresp” is the wave 2 data. The exception is “pidp”, which is stable over time, so it does not need a prefix.
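A pattern that often goes with this naming convention is stripping the wave prefix so that the same variable has the same name in every wave, for example before stacking waves in long format. A sketch with a hypothetical wave 1 extract:

```r
library(dplyr)

# Hypothetical wave 1 extract; "pidp" has no prefix, so sub() leaves it alone
a_small <- tibble(pidp = 1:2, a_sf12mcs = c(50.1, 44.8))

a_long <- a_small |>
  rename_with(~ sub("^a_", "", .x)) |>  # a_sf12mcs -> sf12mcs
  mutate(wave = 1)

names(a_long)  # "pidp" "sf12mcs" "wave"
```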

We will explore four different scenarios of merging panel data:

  1. add household-level information to individual-level data
  2. combine info from the youth data with that from the household and the adults
  3. add information about the partner at the individual level data
  4. merge across waves

We can visualise the data and the different levels in the following way:

[Figure: the data files and levels used in the four merging scenarios]

Let’s prepare the ground for our scenarios. We will import the data we want to use (using the col_select option to select just the variables of interest) and do some basic cleaning to make it easier to work with:

library(tidyverse)
library(haven)

a_indresp <- read_dta("./data/a_indresp.dta",
                    col_select = all_of(c("pidp", "a_hidp", "a_dvage", "a_sex",
                                          "a_marstat_dv", "a_sf12mcs_dv",
                                          "a_hiqual_dv")))
a_hhresp <- read_dta("./data/a_hhresp.dta",
                    col_select = all_of(c("a_hidp", "a_hhsize", "a_nkids_dv",
                                          "a_hhtype_dv", "a_tenure_dv")))
a_youth <- read_dta("./data/a_youth.dta",
                    col_select = all_of(c("pidp", "a_hidp", "a_yp2uni",
                                          "a_ypacvwell", "a_ypargm")))
a_egoalt <- read_dta("./data/a_egoalt.dta")

b_indresp <- read_dta("./data/b_indresp.dta",
                      col_select = all_of(c("pidp", "b_hidp",
                                            "b_marstat_dv", "b_sf12mcs_dv")))


# some data cleaning
a_indresp <- a_indresp |>
  mutate(a_degree = ifelse(a_hiqual_dv %in% 1:2, T, F),
         a_partner = ifelse(a_marstat_dv %in% 1:2, T, F),
         a_sf12mcs = ifelse(a_sf12mcs_dv < 0, NA, a_sf12mcs_dv))

Merge data scenario 1: add household-level information to individual-level data

The first scenario is the most straightforward one. We want to add household-level information to individual-level data. This is a “1:m” merger, as multiple individuals can belong to the same household. The ID we will use is “a_hidp”, which is present in both datasets. We will apply the left_join() command to keep all cases from the individual-level data and add only the matching cases from the household-level data.

We use this strategy because the main unit of analysis for us is the individual, and we want to retain all individuals in our data. If we used an inner_join(), we would lose all the individuals that do not have a match in the household-level data. Similarly, if we used a full_join(), we would get many missing values in the household-level variables for individuals who do not have a match in the household data.
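Before settling on a join type, it can be useful to inspect which cases would be lost. anti_join() returns the rows of the first dataset that have no match in the second, which is exactly what an inner_join() would drop. A toy sketch with hypothetical data:

```r
library(dplyr)

# Hypothetical data: one person has a household ID with no match
people <- tibble(pid = 1:3, hid = c(10, 10, 99))
hholds <- tibble(hid = c(10, 20), size = c(2, 1))

# Rows of `people` that an inner_join() with `hholds` would drop
anti_join(people, hholds, by = "hid")  # the person with hid 99
```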

The command is pretty straightforward once we understand what we are doing:

a_ind_hh <- left_join(a_indresp, a_hhresp, by = "a_hidp")
glimpse(a_ind_hh)
## Rows: 50,994
## Columns: 14
## $ pidp         <dbl> 68001367, 68004087, 68006127, 68006135, 68006807, 6800748…
## $ a_hidp       <dbl+lbl> 68001363, 68004083, 68006123, 68006123, 68006803, 680…
## $ a_sex        <dbl+lbl> 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2,…
## $ a_dvage      <dbl+lbl> 39, 59, 39, 17, 72, 57, 59, 25, 22, 38, 51, 51, 31, 2…
## $ a_marstat_dv <dbl+lbl> 6, 4, 2, 6, 4, 1, 1, 6, 6, 1, 1, 3, 6, 6, 4, 6, 4, 6,…
## $ a_hiqual_dv  <dbl+lbl> 3, 2, 4, 2, 9, 2, 3, 3, 3, 4, 3, 5, 2, 3, 1, 3, 1, 9,…
## $ a_sf12mcs_dv <dbl+lbl> 61.13, 58.17, 26.66, 41.58, 43.89, 59.53, -7.00, -7.0…
## $ a_degree     <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALS…
## $ a_partner    <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALS…
## $ a_sf12mcs    <dbl> 61.13, 58.17, 26.66, 41.58, 43.89, 59.53, NA, NA, NA, 24.…
## $ a_hhsize     <dbl+lbl> 1, 1, 4, 4, 1, 4, 4, 4, 4, 2, 2, 1, 1, 1, 2, 2, 1, 4,…
## $ a_hhtype_dv  <dbl+lbl>  3,  3, 11, 11,  2, 19, 19, 19, 19,  6,  6,  3,  3,  …
## $ a_nkids_dv   <dbl+lbl> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ a_tenure_dv  <dbl+lbl> 2, 7, 6, 6, 2, 2, 2, 2, 2, 1, 1, 2, 2, 3, 6, 6, 2, 6,…

Now we have a combination of individual and household-level data. We can use this information to explore how individual and household-level information explain outcomes of interest. For example, let’s imagine we want to see how the mental health of individuals (measured by “a_sf12mcs” from the individual-level data) differs for those with children in the household (measured by “a_nkids_dv” from the household-level data). We can do this by grouping the data by whether they have children and then calculating the average mental health score for each group.

a_ind_hh |>
  group_by(a_nkids_dv > 0) |>
  summarise(mean_sf12mcs = mean(a_sf12mcs, na.rm = T),
            n = n())
## # A tibble: 2 × 3
##   `a_nkids_dv > 0` mean_sf12mcs     n
##   <lgl>                   <dbl> <int>
## 1 FALSE                    50.8 32968
## 2 TRUE                     49.8 18026

In this case, it appears that people who have children in the household have lower mental health on average. This is just an example, but it shows how we can use the merged data to explore interesting research questions.

Merge data scenario 2: add info to youth data from household and adult

In addition to the individual-level interview, Understanding Society also collects data from those aged 10 to 15 in the household. This is stored in the file called “youth”.

When analysing this kind of data, we might want to explore how context, such as the household, influenced the outcomes of interest for the youth. We could get this kind of information in two ways.

First, we could merge the youth data with the household-level data. This is similar to what we did above with the main individual data. In this case, the code would be:

a_youth_hh <- left_join(a_youth, a_hhresp, by = "a_hidp")
glimpse(a_youth_hh)
## Rows: 4,899
## Columns: 9
## $ pidp        <dbl> 68006139, 68026539, 68028579, 68028583, 68028587, 68029935…
## $ a_hidp      <dbl> 68006123, 68026523, 68028563, 68028563, 68028563, 68029923…
## $ a_ypargm    <dbl+lbl> 4, 4, 2, 1, 4, 3, 4, 2, 3, 3, 3, 3, 4, 4, 2, 4, 4, 4, …
## $ a_ypacvwell <dbl+lbl> 1, 2, 1, 3, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, …
## $ a_yp2uni    <dbl+lbl>  1,  1,  1, -9,  1,  1,  1,  1,  1,  1, -8,  1,  1, -8…
## $ a_hhsize    <dbl+lbl> 4, 5, 6, 6, 6, 4, 2, 3, 5, 4, 5, 5, 4, 4, 4, 3, 5, 4, …
## $ a_hhtype_dv <dbl+lbl> 11, 12, 12, 12, 12, 11,  4, 10, 12, 20, 20, 20,  5,  5…
## $ a_nkids_dv  <dbl+lbl> 1, 2, 3, 3, 3, 2, 1, 1, 3, 1, 2, 2, 3, 3, 3, 1, 2, 2, …
## $ a_tenure_dv <dbl+lbl> 6, 6, 2, 2, 2, 2, 7, 1, 3, 2, 2, 2, 3, 3, 6, 2, 2, 2, …

The second way to get household information is to aggregate data from the main interview with adults and then merge it with the youth data. For example, we could take the “a_degree” variable from the main interview and, for each household, calculate whether at least one person has a degree and the proportion of people who do. This would give us a different perspective on the household context.

Here, we group the data by “a_hidp” and use transmute() to create new variables that summarise the information about degrees in the household. The first variable, “a_onedegree”, is a binary indicator of whether at least one person in the household has a degree. The second variable, “a_avg_degree”, is the proportion of individuals with a degree. We then filter the data to keep only one row per household (using row_number() == 1) and ungroup it.

a_agg <- a_indresp  |>
  group_by(a_hidp) |>
  transmute(a_onedegree = sum(a_degree, na.rm = T) > 0,
            a_avg_degree = mean(a_degree, na.rm = T))  |>
  filter(row_number() == 1) |>
  ungroup()

glimpse(a_agg)
## Rows: 30,035
## Columns: 3
## $ a_hidp       <dbl+lbl> 68001363, 68004083, 68006123, 68006803, 68007483, 680…
## $ a_onedegree  <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE…
## $ a_avg_degree <dbl> 0.00, 1.00, 0.50, 0.00, 0.25, 0.00, 0.00, 1.00, 0.00, 0.5…

Now we can link this information to the youth data using left_join(), since we want to keep all cases from the youth data and only add matching cases from the aggregated data.

a_youth_hh2 <- left_join(a_youth_hh, a_agg, by = "a_hidp")
glimpse(a_youth_hh2)
## Rows: 4,899
## Columns: 11
## $ pidp         <dbl> 68006139, 68026539, 68028579, 68028583, 68028587, 6802993…
## $ a_hidp       <dbl+lbl> 68006123, 68026523, 68028563, 68028563, 68028563, 680…
## $ a_ypargm     <dbl+lbl> 4, 4, 2, 1, 4, 3, 4, 2, 3, 3, 3, 3, 4, 4, 2, 4, 4, 4,…
## $ a_ypacvwell  <dbl+lbl> 1, 2, 1, 3, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1,…
## $ a_yp2uni     <dbl+lbl>  1,  1,  1, -9,  1,  1,  1,  1,  1,  1, -8,  1,  1, -…
## $ a_hhsize     <dbl+lbl> 4, 5, 6, 6, 6, 4, 2, 3, 5, 4, 5, 5, 4, 4, 4, 3, 5, 4,…
## $ a_hhtype_dv  <dbl+lbl> 11, 12, 12, 12, 12, 11,  4, 10, 12, 20, 20, 20,  5,  …
## $ a_nkids_dv   <dbl+lbl> 1, 2, 3, 3, 3, 2, 1, 1, 3, 1, 2, 2, 3, 3, 3, 1, 2, 2,…
## $ a_tenure_dv  <dbl+lbl> 6, 6, 2, 2, 2, 2, 7, 1, 3, 2, 2, 2, 3, 3, 6, 2, 2, 2,…
## $ a_onedegree  <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FAL…
## $ a_avg_degree <dbl> 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.…

We could use the contextual data to explain the outcomes of the children in the “youth” data. For example, let’s see whether the likelihood of them wanting to go to university (measured by “a_yp2uni”) differs for those with at least one household member with a degree (measured by “a_onedegree”). We can do this by grouping the data by whether the household has at least one person with a degree, and then calculating the proportion who want to go to university for each group.

a_youth_hh2 |>
  filter(a_yp2uni %in% 1:2) |> # keep only valid yes/no answers
  group_by(a_onedegree) |>
  count(a_yp2uni) |>
  na.omit() |> # drop rows with missing a_onedegree
  mutate(prop = n/sum(n))
## # A tibble: 4 × 4
## # Groups:   a_onedegree [2]
##   a_onedegree a_yp2uni      n   prop
##   <lgl>       <dbl+lbl> <int>  <dbl>
## 1 FALSE       1 [yes]    1688 0.938 
## 2 FALSE       2 [no]      112 0.0622
## 3 TRUE        1 [yes]    1591 0.971 
## 4 TRUE        2 [no]       48 0.0293

We see that most youth in the sample want to go to university, and the proportion is slightly higher among those with at least one person in the household who has a degree. Again, this is just an example, but it shows how we can use the merged data to explore interesting research questions.

Merge data scenario 3: add information about the partner at the individual level

Another interesting aspect of this study is that it is a household panel. That means that all the adults in the household are interviewed. This also allows us to explore how relationships among individuals in the household influence the outcomes of interest. For example, we could explore how the mental health of individuals is influenced by the mental health of their partner.

To understand how individuals relate to each other, we need to use the “egoalt” data. In this data, each row represents a relationship between two individuals in the same household.

glimpse(a_egoalt)
## Rows: 182,802
## Columns: 18
## $ pidp              <dbl> 68006127, 68006127, 68006127, 68006131, 68006131, 68…
## $ a_hidp            <dbl> 68006123, 68006123, 68006123, 68006123, 68006123, 68…
## $ a_pno             <dbl+lbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 1, 2, …
## $ apidp             <dbl> 68006131, 68006135, 68006139, 68006127, 68006135, 68…
## $ a_apno            <dbl+lbl> 2, 3, 4, 1, 3, 4, 1, 2, 4, 1, 2, 3, 2, 3, 4, 1, …
## $ a_hhorig          <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ a_memorig         <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ a_psu             <dbl+lbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, …
## $ a_strata          <dbl+lbl> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, …
## $ a_sampst          <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ a_sex             <dbl+lbl> 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, …
## $ a_relationship    <dbl+lbl>  2,  9,  9,  2, 12, 12,  4,  7, 14,  4,  7, 14, …
## $ a_rel_dv          <dbl+lbl>  3,  4,  4,  3,  7,  7, 13, 25, 10, 13, 25, 10, …
## $ a_relationship_dv <dbl+lbl>  2,  9,  9,  2, 12, 12,  4,  7, 14,  4,  7, 14, …
## $ a_asex            <dbl+lbl> 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, …
## $ pid               <dbl> -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, …
## $ a_enwstat         <dbl+lbl> 2, 2, 1, 3, 3, 3, 3, 3, 3, 1, 2, 2, 1, 2, 1, 1, …
## $ a_anwstat         <dbl+lbl> 3, 3, 1, 2, 3, 2, 2, 3, 2, 1, 3, 3, 1, 3, 1, 1, …

The main variables of interest are “apidp”, the ID of the other person in the relationship, and “a_rel_dv”, which indicates the type of relationship between the two individuals.

count(a_egoalt, a_rel_dv)
## # A tibble: 27 × 2
##    a_rel_dv                        n
##    <dbl+lbl>                   <int>
##  1  0 [Undefined]                256
##  2  2 [Lawful spouse]          29466
##  3  3 [Live-in partner]         6568
##  4  4 [Natural child]          43874
##  5  5 [Adopted child]            315
##  6  6 [Foster child]             122
##  7  7 [Step/partner's-child]    1844
##  8  9 [Daughter/son-in-law]      631
##  9 10 [Natural brother/sister] 35018
## 10 11 [Other brother/sister]    1026
## # ℹ 17 more rows

We will focus on the relationships that are coded as 2 or 3, which indicate a partner. We will select the “pidp” and “apidp” variables to create a dataset containing the individual’s ID and their partner’s ID.

a_partner <- a_egoalt |>
  filter(a_rel_dv %in% 2:3) |>
  select(pidp, apidp)

Now that we have a dataset containing each individual’s ID and their partner’s, we can merge it with the individual-level data to get information about the partners’ mental health. We will use a left_join() to keep all cases from the individual-level data and add only the matching cases from the partner data. We will also select only the relevant variables and rename them to make them easier to work with later.

Note that here we use “apidp” to do the merging (using by = c("apidp" = "pidp")) so the information we select is that of the partner.

a_partner_data <- left_join(a_partner, a_indresp, by = c("apidp" = "pidp")) |>
  select(pidp, apidp, a_sf12mcs) |>
  rename(a_partner_pidp = apidp,
         a_partner_sf12mcs = a_sf12mcs)

Now we can merge this partner data with the individual-level data to create a dataset that includes information on the individuals and their partners’ mental health.

a_indresp2 <- left_join(a_indresp, a_partner_data, by = "pidp")

With this data, we can explore how the mental health of individuals is related to the mental health of their partners. For example, we could calculate the correlation between the individual’s mental health and their partner’s mental health.

a_indresp2 |>
  select(a_sf12mcs, a_partner_sf12mcs) |>
  cor(use = "complete.obs")
##                   a_sf12mcs a_partner_sf12mcs
## a_sf12mcs         1.0000000         0.2889263
## a_partner_sf12mcs 0.2889263         1.0000000

There seems to be a small positive correlation between an individual’s mental health and their partner’s. This approach could be extended to examine relationships such as that between mothers and daughters or between siblings, just to give some examples.

Merge data scenario 4: merge across waves

Finally, the most common type of merger we use when working with longitudinal data is merging data across waves. One useful strategy here is to create indicators for whether someone is present in your data before you perform the merger. This will help you identify patterns of participation and attrition.

a_indresp2 <- mutate(a_indresp2, a_present = T)
b_indresp <- mutate(b_indresp, b_present = T)

Once we have these indicators, we can do the merger. I tend to use full_join() to bring all the cases together, then use the “present” variables to decide which ones I want to keep for my analysis. For example, if I want to conduct a balanced panel analysis, I would keep only cases present in both waves. If I want to conduct an unbalanced panel analysis, I would keep all cases that appear in at least one wave.

Let’s merge the data across time:

ab_indresp <- full_join(a_indresp2, b_indresp, by = "pidp")

And let’s explore the patterns of participation:

count(ab_indresp, a_present, b_present)
## # A tibble: 3 × 3
##   a_present b_present     n
##   <lgl>     <lgl>     <int>
## 1 TRUE      TRUE      38364
## 2 TRUE      NA        12630
## 3 NA        TRUE      16205

Here we see that around 38,000 cases were present in both waves, while around 12,600 dropped out after the first wave. We also have around 16,200 new cases that were not present in the first wave but appear in the second.

If we want to do a balanced panel analysis, we would keep only those cases that are present in both waves. We can do this by filtering the data to keep only those cases where both “a_present” and “b_present” are TRUE:

ab_ind_balanced <- filter(ab_indresp, a_present == T, b_present == T)
count(ab_ind_balanced, a_present, b_present)
## # A tibble: 1 × 3
##   a_present b_present     n
##   <lgl>     <lgl>     <int>
## 1 TRUE      TRUE      38364

Tips on how to better merge data

A few things to consider before starting to merge panel data:

Tip 1: Understand data sources

It is important to spend time at the beginning of your research to become familiar with your study and how it organises the data. Each study is different, and it’s important to know what data is available, how it is structured and how the different files relate to each other. This will help you to plan your strategy for merging data. It will also help you to understand the limitations of your data and to make informed decisions about how to use it. For example, you can look at the documentation for Understanding Society and that of the Health and Retirement Study to see how they organise their data and what information they provide.

Tip 2: Understand your IDs

In addition to the types of files you have, it is important to know how the IDs are coded in each file and how they relate to each other. This will help you to identify the correct IDs to use for merging and to avoid mistakes. For example, in Understanding Society, we have a “pidp” variable that is stable across time and can be used to merge data across waves. We also have the “x_hidp” variables, which are valid only within each wave and cannot be used to merge data across waves.
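A quick sanity check before a 1:1 merge is to confirm that the key really is unique. Any rows returned by the snippet below point to duplicated IDs that would silently turn a 1:1 merge into a 1:m merge (toy data for illustration):

```r
library(dplyr)

# Hypothetical IDs with one accidental duplicate
ids <- tibble(pidp = c(1, 2, 2, 3))

# Any rows returned here are duplicated IDs
ids |>
  count(pidp) |>
  filter(n > 1)
```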

Tip 3: Use the correct sequence of merging

When merging panel data, it is important to use the correct sequence. One useful strategy is to first merge the data within each wave (e.g., individual and household) and then merge across waves.
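The sequence can be sketched as follows, with toy tables standing in for the wave files (in the real study, the within-wave key would be “x_hidp” and the across-wave key “pidp”):

```r
library(dplyr)

# Toy stand-ins for the wave 1 and wave 2 individual and household files
a_ind <- tibble(pidp = 1:2, a_hidp = c(10, 10))
a_hh  <- tibble(a_hidp = 10, a_hhsize = 2)
b_ind <- tibble(pidp = c(2, 3), b_hidp = c(30, 40))
b_hh  <- tibble(b_hidp = c(30, 40), b_hhsize = c(1, 4))

# 1. merge within each wave first, on the wave-specific household ID...
a_full <- left_join(a_ind, a_hh, by = "a_hidp")
b_full <- left_join(b_ind, b_hh, by = "b_hidp")

# 2. ...then merge across waves on the time-stable individual ID
ab <- full_join(a_full, b_full, by = "pidp")
```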

Tip 4: Define the target sample and use the correct type of merging

Before starting to merge panel data, it is important to define your target sample and to use the correct type of merging. For example, if you want to do a balanced panel analysis, you would need to use an inner_join() to keep only those cases that are present in both waves. If you want to do an unbalanced panel analysis, you would need to use a full_join() to keep all the cases that are present in at least one wave.

Alternatively, you could create indicators of presence in each wave and then use a full_join() to get all the cases together and then use the indicators to decide which ones to keep for your analysis.

It is important to ensure that the data you have reflects the population of interest. For example, if my study focuses on participants in the first wave of the study, I can still use unbalanced data but will exclude any individuals who join the study after wave 1.

Conclusions on how to merge data

Merging data is a complex process that requires careful planning and attention to detail. It is important to define your target sample, understand your data sources and IDs, and plan an appropriate sequence of merges. By following the procedures and tips presented here, you can avoid mistakes and ensure you use the data appropriately.

