Data Preparation Checklist: Correct Variable Types, Missing Categories Coded, Categories Labeled Properly, Check Data Distribution, Identify Outliers, Rescale Variables.

Data cleaning tips for making the most of longitudinal data

Data visualisation is a key step when exploring data, developing hypotheses and preparing for more advanced statistical models. Often, researchers rush to this step before properly cleaning their data. In this guide, we will cover key data cleaning tips before visualisation.

Data cleaning tips: use the data checklist

A good starting point is our data preparation checklist. It covers some of the key things to look out for before you visualise and analyse your data.

Data cleaning tips: Data preparation checklist

Is the variable in the correct format?

The first data cleaning tip is to ensure that the variables you will use are coded in the correct format. Many functions in R behave differently depending on the input. For example, the statistics that the summary() command returns differ between numeric variables and factors (which are used to code categorical data). When importing data into R, variables are sometimes not coded in the format you expect, which can lead to errors downstream. For example, if a variable is treated as text, you cannot use it to calculate a mean.
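As a quick illustration (using made-up values, not the survey data), the same numbers stored as text behave very differently from numbers stored as numeric:

```r
# The same values stored as text vs. as numbers (made-up example)
x_chr <- c("1", "2", "3")
mean(x_chr)              # NA, with a warning that the argument is not numeric
mean(as.numeric(x_chr))  # 2
class(x_chr)             # "character" -- a quick way to check how a variable is stored
```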

We can look at data from the Understanding Society study to see how this can be problematic. We have two variables measuring time: the wave and the interview year (“istrtdaty”). If we select them and use glimpse(), we see:

usl |> 
  select(wave, istrtdaty) |> 
  glimpse()
## Rows: 204,028
## Columns: 2
## $ wave      <chr> "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", …
## $ istrtdaty <dbl+lbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009,   …

First, the wave is treated as “character” or text (“<chr>” in the output). This is incorrect; it should be numeric instead. The date of the interview is <dbl+lbl>. This stands for numeric (or “double”) with labels. This is a format that is often used when importing data using the haven package. A better approach would be to code this either as a numeric or a date variable.

We can convert the original data to numeric using the as.numeric() function. For the date format, we can use the ymd() function. Here, we will just assume the data is collected on the first day of each year, so we combine the year with “0101”. For most purposes, just having the numeric year is enough, but if we need the date format, now we will also have that:

usl |> 
  transmute(wave = as.numeric(wave),
            year_num = as.numeric(istrtdaty),
            year_date = ymd(paste0(year_num, "0101"))) |> 
  glimpse()
## Rows: 204,028
## Columns: 3
## $ wave      <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, …
## $ year_num  <dbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, NA, NA…
## $ year_date <date> 2009-01-01, 2010-01-01, 2011-01-01, 2012-01-01, 2010-01-01,…

Are missing categories coded accordingly?

One key decision is how to treat missing values. In R, these should typically be coded as NA unless you plan to use them explicitly.

So, a first step is to check that missing values are coded as NA. Below, when we graph mental health, we see some extreme negative values.

qplot(us1$a_sf12mcs_dv)
original sf12 distribution with missing data

A closer look at the attributes indicates that these are different types of missing answers.

attributes(us1$a_sf12mcs_dv)
## $label
## [1] "SF-12 Mental Component Summary (PCS)"
## 
## $format.stata
## [1] "%12.0g"
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $labels
##          missing     inapplicable Proxy respondent 
##               -9               -8               -7

The simplest way to deal with this is to recode negative values as NA using the code below:

us1 |> 
  mutate(mcs = ifelse(a_sf12mcs_dv < 0, NA, a_sf12mcs_dv)) |> 
  select(mcs) |> 
  summary()
##       mcs       
##  Min.   : 0.00  
##  1st Qu.:45.91  
##  Median :53.04  
##  Mean   :50.49  
##  3rd Qu.:57.16  
##  Max.   :77.11  
##  NA's   :3594

Missing data can be complex, and when the amount of missing data is large or systematic, it is worth considering more advanced methods for addressing it, such as imputation or weighting. More on this in a future guide.

Are categories labelled correctly?

In the social sciences, we often work with categorical data. In R, it is best to code these as factors. Factors come with their own issues and commands. One common problem concerns the ordering of the categories. In the example below, the categories are not in a sensible order.

count(usl, vote6)
## # A tibble: 5 × 2
##   vote6          n
##   <fct>      <int>
## 1 Fairly     49363
## 2 Very       14638
## 3 Not very   41082
## 4 Not at all 38036
## 5 <NA>       60909

We can put them in the order we want using the fct_relevel() command. This moves the categories we name to the front of the level order. For example, if we want “Very” to be the first category, we can use the following command.

mutate(usl, vote6 = fct_relevel(vote6, "Very")) |> 
  count(vote6)
## # A tibble: 5 × 2
##   vote6          n
##   <fct>      <int>
## 1 Very       14638
## 2 Fairly     49363
## 3 Not very   41082
## 4 Not at all 38036
## 5 <NA>       60909

We will discuss more issues regarding factors at the end of this guide. For now, remember this data cleaning tip for categorical variables: check that the categories have appropriate labels and are in the correct order.

Does the distribution look reasonable?

The next data cleaning tip is to check the distribution of the variables of interest. Here, having some domain expertise is invaluable as it will allow you to spot strange patterns or errors.

For example, do you spot any strange things about the income variable?

qplot(us1$a_fimngrs_dv)
histogram of income with negative values

From this distribution, it looks like there are some negative values. Let’s check this also using summary():

summary(us1$a_fimngrs_dv)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -12436.6    508.7   1097.0   1455.2   1946.2  22666.0

There do indeed seem to be some negative values, which looks implausible for gross income. These might be mistakes introduced earlier on. It would be good to investigate this, for example by checking the documentation or contacting the data collection team. Here, we will assume this is an error and treat all negative values as missing:

us1 <- us1 |> 
  mutate(gross_income = ifelse(a_fimngrs_dv < 0, NA, a_fimngrs_dv)) 

qplot(us1$gross_income)
histogram of income without negative values

Sometimes it is possible to do other types of logical checks. For example, you could investigate whether there is income for those who don’t work or are not allowed to work (e.g., children). The more checks you can do, the more confidence you can have in your data.
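As a hedged sketch of such a check (assuming an age variable named a_dvage, which may be named differently in your extract), we could count respondents under 16 who nevertheless report positive income:

```r
# Hypothetical logical check -- a_dvage is an assumed age variable name
us1 |> 
  filter(a_dvage < 16, gross_income > 0) |> 
  count()
```

If this returns a non-zero count, those cases deserve a closer look before analysis.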

Are there extreme cases or outliers?

For continuous variables, such as income, we often encounter extreme cases or outliers, and the next data cleaning tip focuses on these. A good way to investigate them is the box-and-whisker plot, where the box covers the middle 50% of the distribution (the interquartile range). Points plotted beyond the whiskers are potential outliers. If we look at the graph for income, we get:

ggplot(us1, aes(gross_income)) + geom_boxplot()
box and whiskers plot of income

We see relatively few cases above roughly £10,000. Here, there are a couple of decisions we can make. One of the most common is to cap the variable at some extreme cutoff point; this is sometimes called Winsorisation. As an example, we will cap income at a maximum of £10,000 using the ifelse() command. We supply the condition (the value is greater than 10,000), what to do when it is met (set the value to 10,000), and what to do otherwise (keep the original value).

us1 <- us1 |> 
  mutate(gross_income_cap = ifelse(gross_income > 10000, 10000, gross_income))

qplot(us1$gross_income_cap)
histogram of income capped

Now the distribution looks a little more reasonable.

A more flexible way to do this is to calculate the percentile and cap it at that level. For example, if we want to see what value separates the top 1% of the distribution from the rest we can use the command below:

cap <- quantile(us1$gross_income, probs = 0.99, na.rm = T)
cap
##      99% 
## 7991.999

Now we can use that value to apply the cap. We can choose different cut-off points (like the top 5% or 10%) depending on how skewed the distribution is.

us1 |> 
  transmute(gross_income_cap = ifelse(gross_income > cap, cap, gross_income)) |>
  ggplot(aes(gross_income_cap)) + geom_histogram()
histogram of income capped at 99%

Do we need to rescale the variables?

Sometimes we need to rescale some of the variables we have. A common one in the social sciences is to take the log of income. This is done because the distribution is skewed, with relatively few people having very large incomes. Taking a log of 0 yields an unwanted result (-Infinity), so we typically add a constant to avoid losing those cases.

qplot(log(us1$gross_income_cap + 0.01))
log of income histogram

Alternatively, we could use rankings of people by income. This would address the skewness of the data but would lose the original scale. The command percent_rank() is useful for this. You can see what the distribution would look like if we used this for income:

qplot(percent_rank(us1$gross_income))
histogram of income ranking

There are other ways in which we might want to change the scale of variables. When working with longitudinal income data, we need to be aware that currency values may change due to inflation. So, £1 does not mean the same thing in 1990 as in 2000.

We can use the priceR package to adjust income for inflation. We provide our gross income and the year it was collected, and ask it to adjust the values to 2009 prices, the first year of data collection.

library(priceR)


usl <- usl |>
  mutate(fimngrs.cap.adjust =
           adjust_for_inflation(fimngrs.cap, year, "GB", to_date = 2009))

If we compare the original income variable and the adjusted one, we see quite big differences over a relatively small period of time:

usl |> 
  group_by(year) |> 
  summarise(mean_income = mean(fimngrs.cap, na.rm = T),
            mean_income_adj = mean(fimngrs.cap.adjust, na.rm = T)) 
## # A tibble: 7 × 3
##    year mean_income mean_income_adj
##   <dbl>       <dbl>           <dbl>
## 1  2009       1462.           1462.
## 2  2010       1506.           1477.
## 3  2011       1616.           1546.
## 4  2012       1699.           1565.
## 5  2013       1740.           1563.
## 6  2014       1785.           1567.
## 7    NA       1498.            NaN

Other types of standardisation could account for currency differences across countries or household composition (e.g., using the OECD standardised scale). More generally, in data science, we often recode variables to a 0-1 scale to facilitate the estimation of more complex models.
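As a minimal sketch of the 0-1 recoding mentioned above (min-max rescaling; the helper name rescale01 is ours, not from any package):

```r
# Min-max rescaling: map a numeric vector onto the 0-1 range
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

rescale01(c(0, 50, 100))  # 0.0 0.5 1.0

# Applied to the capped income variable from earlier:
# us1 |> mutate(income_01 = rescale01(gross_income_cap))
```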

Other tips

Data cleaning tips: Adjusting factors

As mentioned above, factors come with their own complications, so it’s useful to discuss other typical issues that might arise.

Let’s look at a barplot for satisfaction:

usl |> 
  ggplot(aes(sati.fct)) +
  geom_bar() +
  coord_flip()
barplot of satisfaction

One thing we might want to do is to reverse the order of the categories. We can use fct_rev() to do this quickly:

usl |> 
  ggplot(aes(fct_rev(sati.fct))) +
  geom_bar() +
  coord_flip()
bar plot of satisfaction with reverse scale

This does what we want, but highlights another issue. How best to treat missing data? Sometimes, if the focus is just on the available data, we can first filter out the missing data and then do the graph:

usl |> 
  filter(!is.na(sati.fct)) |> 
  ggplot(aes(sati.fct)) +
  geom_bar() +
  coord_flip()
barplot without missing data

Alternatively, if we want to show the amount of missing data, it might be better to treat it as a normal category:

usl |> 
  mutate(sati.fct = fct_explicit_na(sati.fct, "Missing")) |> 
  ggplot(aes(sati.fct)) +
  geom_bar() +
  coord_flip()
barplot of satisfaction with explicit missing category

We can move the missing category to the end using the fct_relevel() command with the after = Inf argument.

usl |> 
  mutate(sati.fct = fct_explicit_na(sati.fct, "Missing"),
         sati.fct = fct_relevel(sati.fct, "Missing", after = Inf)) |> 
  ggplot(aes(sati.fct)) +
  geom_bar() +
  coord_flip()
barplot of satisfaction with missing category at the end

Let’s look at another example where we first compute summary statistics and then plot them. Here, we use the highest qualification achieved. The as_factor() command takes value labels imported from Stata or SPSS and uses them as labels for a factor variable.

us1 <- us1 |> 
  mutate(edu = as_factor(a_hiqual_dv))

We then use this variable to calculate the average income by education level as well as the number of cases in each:

edu_stats <- us1 |> 
  group_by(edu) |> 
  summarise(income = mean(gross_income_cap, na.rm = T),
            n = n())
edu_stats
## # A tibble: 7 × 3
##   edu          income     n
##   <fct>         <dbl> <int>
## 1 Missing       1518.    92
## 2 Degree        2355. 10954
## 3 Other higher  1732.  5537
## 4 A level etc   1318.  9591
## 5 GCSE etc      1161. 10526
## 6 Other qual    1166.  5225
## 7 No qual        844.  9069

We can use it to make a graph:

edu_stats |> 
  ggplot(aes(income, edu)) +
  geom_point()
average income by education

Another data cleaning tip is to reorder the categories of a factor based on a second variable. For example, in this graph, we might want the categories to be ordered by their average income. We can use fct_reorder() for that:

edu_stats |> 
  mutate(edu = fct_reorder(edu, income)) |> 
  ggplot(aes(income, edu)) +
  geom_point()
change category order for average income by education

Another thing we might sometimes want to do is to collapse or lump some categories together. This is useful when we have categories with few cases. If we want to show only the three most frequent categories, with all the others lumped together, we can use fct_lump_n(edu, 3). This leads to a much simpler graph.

us1 |> 
  mutate(edu = fct_lump_n(edu, 3)) |> 
  group_by(edu) |> 
  summarise(income = mean(gross_income_cap, na.rm = T),
            n = n()) |> 
  ggplot(aes(income, edu)) +
  geom_point()
average income by education with 4 categories

Data cleaning tips: Changing the time variable when plotting

When working with longitudinal data, the way we code the time variable is key. Let’s see how this can be done in different ways, with a simple example: the change in income over time, by wave.

usl |>
  ggplot(aes(wave, fimngrs.cap.adjust)) +
  geom_smooth(method = "lm")
plot of change in time with issue

For some reason, we get an empty graph. If we look at how the variable is coded, we see it is a character (text).

typeof(usl$wave)
## [1] "character"

We can code this as numeric and redo the graph:

usl |>
  mutate(wave = as.numeric(wave)) |> 
  ggplot(aes(wave, fimngrs.cap.adjust)) +
  geom_smooth(method = "lm")
change in time graph

This is better. Sometimes, treating a variable with few categories as continuous can yield scales that do not make sense, such as a wave of “2.5”. Here is an example:

usl |>
  mutate(wave = as.numeric(wave)) |> 
  filter(wave < 4) |> 
  ggplot(aes(wave, fimngrs.cap.adjust)) +
  geom_smooth(method = "lm")
change in time over three waves with problematic scale

There are two strategies to deal with this. First, we can define the x scale by hand:

usl |>
  filter(wave %in% 1:3) |> 
  mutate(wave = as.numeric(wave)) |> 
  ggplot(aes(wave, fimngrs.cap.adjust)) +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = 1:3)
change in time with three waves with cleaner scale

Second, we can code time as a factor. When doing this, we need to add group = 1 inside aes() so that geom_smooth() treats all the categories as one series and can estimate a trend line:

usl |>
  filter(wave %in% 1:3) |> 
  mutate(wave = as.factor(wave)) |> 
  ggplot(aes(wave, fimngrs.cap.adjust, group = 1)) +
  geom_smooth(method = "lm")
change in time with three waves with scale as a factor

This has the benefit of including only valid values on the axis.

Another data cleaning tip for working with longitudinal data is to always consider how best to define the time metric. For example, with this data, we could also define time as years:

usl |>
  ggplot(aes(year, fimngrs.cap.adjust)) +
  geom_smooth(method = "lm")
change in time for income by year

Or we could look at change by age:

usl |>
  ggplot(aes(age, fimngrs.cap.adjust)) +
  geom_smooth(method = "lm")
change in time by age of income

Make sure that whatever time metric you choose makes substantive sense and that it is coded correctly.

Data cleaning tips: Creating weighted graphs

The final data cleaning tip to consider before doing visualisations concerns weighting.

Weights are variables created to adjust for the sampling design used or for issues such as non-response and coverage bias. Sometimes, ignoring the provided weights can lead to incorrect visualisations and results.

There are different ways to compute weighted statistics. Here we will use the survey package. We first define a simple design with no clustering or stratification, but with a weight variable (“a_indinus_xw”):

library(survey)


des <- svydesign(
  ids = ~1,
  weights = ~a_indinus_xw,
  data = us1
)

We then create a table, as before, showing the average income by education level. This now accounts for the weights and also provides a corrected confidence interval.

edu_stats_w <- svyby(
  ~gross_income_cap,
  ~edu,
  des,
  svymean,
  na.rm = TRUE,
  vartype = "ci"
)

edu_stats_w
##                       edu gross_income_cap      ci_l      ci_u
## Missing           Missing         988.6604  581.2113 1396.1095
## Degree             Degree        2479.7152 2434.8854 2524.5450
## Other higher Other higher        1779.8557 1736.8228 1822.8886
## A level etc   A level etc        1383.6905 1353.2563 1414.1246
## GCSE etc         GCSE etc        1197.7313 1172.5039 1222.9586
## Other qual     Other qual        1187.9602 1154.2907 1221.6297
## No qual           No qual         859.1098  840.2124  878.0071

Now we can use this to create a graph with the weighted statistics:

ggplot(edu_stats_w, aes(gross_income_cap, edu)) +
  geom_pointrange(aes(xmin = ci_l,
                      xmax = ci_u)) 
weighted average income by education

We can compare this with the unweighted statistics from before, and we see that there are some important differences, especially for the “Missing” category:

ggplot(edu_stats_w, aes(gross_income_cap, edu)) +
  geom_pointrange(aes(xmin = ci_l,
                      xmax = ci_u)) +
  geom_point(data = edu_stats, aes(income, edu), color = "red")
average income by education with and without weights

This approach is very flexible and can be used with more complex survey designs and other statistical methods as well.
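As a hedged sketch of what a more complex design might look like (the cluster and stratum variable names a_psu and a_strata are assumptions and may differ in your data file), svydesign() accepts them directly:

```r
# Hypothetical complex design: clustering (a_psu) and stratification (a_strata)
# are assumed variable names, not confirmed from the example above
des_complex <- svydesign(
  ids     = ~a_psu,       # primary sampling units (clusters)
  strata  = ~a_strata,    # sampling strata
  weights = ~a_indinus_xw,
  data    = us1
)
```

The same svyby() call as before would then produce estimates and confidence intervals that account for the full design.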

Conclusions on data cleaning tips before visualisation

Here, we covered key data cleaning tips to consider before performing more advanced visualisations and analyses. We went through a checklist of things to look for when cleaning data, as well as a more in-depth discussion of factors, time variables, and weighting.

You can learn more about data cleaning in this post or in our online course. You can also check our complete guide to visualising longitudinal data.

