Longitudinal data can be complex and difficult to understand. Exploring and describing it is an important part of working with this data type, and is part of what’s known as Exploratory Data Analysis (EDA), a crucial step in understanding your data. In this blog post, we will see how to use tables and summary statistics to describe longitudinal data. In R, there are several specialized packages for this, but here we will focus on base R (i.e., the commands that come by default when you install R) and the tidyverse (a popular collection of packages for cleaning and visualizing data) to cover the most common descriptives used with longitudinal data.
Setting up the environment
You can download the data we cleaned in a previous post if you want to follow along. Remember to set up your working directory as previously described. Next, let’s load the packages we need. Here, we will be mainly using the tidyverse package:
library(tidyverse)
With that out of the way, we can load the data. We will be using the us_clean_syn dataset, which we transformed and cleaned in two previous blog posts: here and here. You can download the data from here. The full code used in this post can be found here.
load("./data/us_clean_syn.RData")
This file includes both the data in the wide format and the long format. As we will see later, each format can be useful for different types of tables and summary statistics.
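If you want to double-check what the file loaded into your session, one option is to list the objects in the environment. A minimal sketch; based on how we saved the data, we expect two data frames, the long data (usl) and the wide data (usw):

# List the objects loaded from the .RData file; we expect usl (long format)
# and usw (wide format)
ls()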
Simple tables and descriptives
Let’s start with the basics. We will first look at the structure of the data using the glimpse() function. This will give us an overview of the variables in the long dataset.
glimpse(usl)
## Rows: 204,028
## Columns: 31
## $ pidp        <dbl> 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10, 11, 11…
## $ wave        <chr> "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4"…
## $ age         <dbl+lbl> 28, 28, 28, 28, 80, 80, 80, 80, 60, 60, 60, 60, 42, 42…
## $ hiqual      <dbl+lbl> 1, 1, 1, 1, 9, 9, 9, 9, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, …
## $ single      <dbl+lbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, NA, NA, NA, 0, NA…
## $ fimngrs     <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ sclfsato    <dbl+lbl> -9, 6, 2, 1, 6, 6, -8, 3, 2, NA, NA, NA, -9, NA…
## $ sf12pcs     <dbl> 54.51, 62.98, 56.97, 56.15, 53.93, 46.39, NA, 46.16, 33.18…
## $ sf12mcs     <dbl> 55.73, 36.22, 60.02, 59.04, 46.48, 45.39, NA, 37.02, 46.80…
## $ istrtdaty   <dbl+lbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, …
## $ sf1         <dbl+lbl> 2, 2, 1, 3, 4, 3, 3, 3, 5, NA, NA, NA, 2, NA…
## $ present     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, …
## $ gndr.fct    <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male…
## $ single.fct  <fct> Single, Single, Single, In relationship, In relationship, …
## $ urban.fct   <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban, Urban, Ur…
## $ degree.fct  <fct> Degree, Degree, Degree, Degree, No degree, No degree, No d…
## $ vote6.fct   <fct> Not very, Not at all, Not at all, Not at all, Not at all, …
## $ sati        <dbl> NA, 6, 2, 1, 6, 6, NA, 3, 2, NA, NA, NA, NA, NA, 5, NA, NA…
## $ sati.fct    <fct> NA, Mostly satisfied, Mostly dissatisfied, Completely diss…
## $ fimngrs.cap <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ logincome   <dbl> 8.099818, 8.297170, 8.196070, 6.757514, 6.809039, 6.577861…
## $ sati.ind    <dbl> 3.000000, 3.000000, 3.000000, 3.000000, 5.000000, 5.000000…
## $ sati.dev    <dbl> NA, 3.000000, -1.000000, -2.000000, 1.000000, 1.000000, NA…
## $ sf12pcs.ind <dbl> 57.65250, 57.65250, 57.65250, 57.65250, 48.82667, 48.82667…
## $ sf12pcs.dev <dbl> -3.1425000, 5.3275000, -0.6825000, -1.5025000, 5.1033333, …
## $ sf12mcs.ind <dbl> 52.75250, 52.75250, 52.75250, 52.75250, 42.96333, 42.96333…
## $ sf12mcs.dev <dbl> 2.977500, -16.532500, 7.267500, 6.287500, 3.516667, 2.4266…
## $ waves       <int> 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4…
## $ sati.lag    <dbl> NA, NA, 6, 2, NA, 6, 6, NA, NA, 2, NA, NA, NA, NA, NA, 5, …
## $ sf12pcs.lag <dbl> NA, 54.51, 62.98, 56.97, NA, 53.93, 46.39, NA, NA, 33.18, …
## $ sf12mcs.lag <dbl> NA, 55.73, 36.22, 60.02, NA, 46.48, 45.39, NA, NA, 46.80, …
This list of all the variables in the data shows their type and the first few values. We see that we have 31 variables in the dataset and around 204,000 rows. The wave variable is the time variable, while pidp is the unique identifier for each person.
We can use the count() command to make a simple frequency table. For example, the distribution of the voting variable over all points in time is:
count(usl, vote6.fct)
## # A tibble: 5 × 2
##   vote6.fct      n
##   <fct>      <int>
## 1 Very       14638
## 2 Fairly     49363
## 3 Not very   41082
## 4 Not at all 38036
## 5 <NA>       60909
We can also calculate the proportions by creating a new column using mutate().
count(usl, vote6.fct) |> mutate(prop = n / sum(n))
## # A tibble: 5 × 3
##   vote6.fct      n   prop
##   <fct>      <int>  <dbl>
## 1 Very       14638 0.0717
## 2 Fairly     49363 0.242
## 3 Not very   41082 0.201
## 4 Not at all 38036 0.186
## 5 <NA>       60909 0.299
We introduced the |> (pipe) command in a previous post here.
An alternative approach is to use base R’s table() command:
table(usl$vote6.fct, useNA = "always")
##
##       Very     Fairly   Not very Not at all       <NA>
##      14638      49363      41082      38036      60909
And if we want the proportions, we can use the prop.table() command:
table(usl$vote6.fct, useNA = "always") |> prop.table() |> round(2)
##
##       Very     Fairly   Not very Not at all       <NA>
##       0.07       0.24       0.20       0.19       0.30
Note that we used round(2) to print just two decimal places and make the table easier to read.
We can also quickly summarise the variables in the dataset using the summary() command. Here, we will select only the variables from gndr.fct to fimngrs.cap for brevity:
usl |> select(gndr.fct:fimngrs.cap) |> summary()
##   gndr.fct              single.fct     urban.fct          degree.fct
##  Male  : 92364   In relationship:96690   Rural: 32970   Degree   : 66188
##  Female:111664   Single         :55984   Urban:119657   No degree:137396
##                  NA's           :51354   NA's : 51401   NA's     :   444
##
##
##
##
##        vote6.fct          sati                        sati.fct
##  Very      :14638   Min.   :1.00   Mostly satisfied      :55016
##  Fairly    :49363   1st Qu.:4.00   Somewhat satisfied    :21619
##  Not very  :41082   Median :6.00   Completely satisfied  :14746
##  Not at all:38036   Mean   :5.17   Neither sat nor dissat:11735
##  NA's      :60909   3rd Qu.:6.00   Somewhat dissatisfied : 9657
##                     Max.   :7.00   (Other)               : 9924
##                     NA's   :81331  NA's                  :81331
##   fimngrs.cap
##  Min.   :    0.0
##  1st Qu.:  642.2
##  Median : 1231.7
##  Mean   : 1591.3
##  3rd Qu.: 2084.2
##  Max.   :10000.0
##  NA's   :51519
We see that this command figures out what type of variable we have and shows the appropriate summary statistics. For example, for the gndr.fct variable, we see that it is a factor with two levels: Male and Female. For the fimngrs.cap variable, we see that it is a numeric variable with a mean of about 1,591 and a median of about 1,232.
Grouped summary tables
We can also calculate grouped statistics. In the context of longitudinal data, this is most often used to describe change over time using long data. For example, we can calculate the mean income over time using a combination of the group_by() and summarise() functions:
usl |> group_by(wave) |> summarise(income = mean(fimngrs, na.rm = T))
## # A tibble: 4 × 2
##   wave  income
##   <chr>  <dbl>
## 1 1      1465.
## 2 2      1579.
## 3 3      1685.
## 4 4      1749.
We can do multiple summaries at once. For example, we can calculate the mean income, variance of income, proportion of cases missing on income, mean age, and proportion of singles over time:
usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs, na.rm = T),
            var_income = var(fimngrs, na.rm = T),
            miss_income = mean(is.na(fimngrs)),
            mean_age = mean(age, na.rm = T),
            prop_single = mean(single, na.rm = T))
## # A tibble: 4 × 6
##   wave  mean_income var_income miss_income mean_age prop_single
##   <chr>       <dbl>      <dbl>       <dbl>    <dbl>       <dbl>
## 1 1           1465.   2181203.    0.000823     45.5       0.386
## 2 2           1579.   2182069.    0.250        45.5       0.363
## 3 3           1685.   2404209.    0.351        45.5       0.357
## 4 4           1749.   2481650.    0.407        45.5       0.349
We see that the mean and variance of income are increasing over time, and the proportion of missing values is also increasing (probably due to dropouts in the study). The mean of the age variable does not change over time as we treat it as a time constant (i.e., the age at the start of the study). The proportion of singles is decreasing over time.
We can also calculate similar statistics for different groups. For example, we can calculate the same statistics separately for people with and without degrees at each point in time.
usl |>
  group_by(wave, degree.fct) |>
  summarise(mean_income = mean(fimngrs, na.rm = T),
            var_income = var(fimngrs, na.rm = T),
            miss_income = mean(is.na(fimngrs)))
## # A tibble: 12 × 5
## # Groups:   wave [4]
##    wave  degree.fct mean_income var_income miss_income
##    <chr> <fct>            <dbl>      <dbl>       <dbl>
##  1 1     Degree           2146.   3307885.    0.00127
##  2 1     No degree        1137.   1309982.    0.000611
##  3 1     <NA>             1382.   1721212.    0
##  4 2     Degree           2279.   3453524.    0.238
##  5 2     No degree        1234.   1195328.    0.256
##  6 2     <NA>             1410.   1570990.    0.541
##  7 3     Degree           2394.   3717992.    0.320
##  8 3     No degree        1318.   1331952.    0.366
##  9 3     <NA>             1769.   1714929.    0.477
## 10 4     Degree           2458.   3768458.    0.363
## 11 4     No degree        1367.   1376334.    0.428
## 12 4     <NA>             1737.   1988781.    0.514
We see that there is quite a large difference in the mean income between people with and without a degree. The variance of income is also higher for people with a degree.
If we want to eliminate the cases that are missing on the grouping variable, we can first filter them out:
usl |>
  filter(!is.na(degree.fct)) |>
  group_by(wave, degree.fct) |>
  summarise(mean_income = mean(fimngrs, na.rm = T),
            var_income = var(fimngrs, na.rm = T),
            miss_income = mean(is.na(fimngrs)))
## # A tibble: 8 × 5
## # Groups:   wave [4]
##   wave  degree.fct mean_income var_income miss_income
##   <chr> <fct>            <dbl>      <dbl>       <dbl>
## 1 1     Degree           2146.   3307885.    0.00127
## 2 1     No degree        1137.   1309982.    0.000611
## 3 2     Degree           2279.   3453524.    0.238
## 4 2     No degree        1234.   1195328.    0.256
## 5 3     Degree           2394.   3717992.    0.320
## 6 3     No degree        1318.   1331952.    0.366
## 7 4     Degree           2458.   3768458.    0.363
## 8 4     No degree        1367.   1376334.    0.428
Transition tables and correlation matrices
There are two other types of tables that can be useful for longitudinal data exploration: transition tables and correlation matrices. These show how much people change over time, which is often one of the main reasons we are interested in longitudinal studies.
Correlation matrices
For example, we can calculate the correlation matrix for income over time using the wide data. We first select only the variables of interest (using select()), then compute the correlations (using cor() with pairwise deletion), and round the results to two decimal places (using round()):
usw |>
  select(matches("fimngrs.cap")) |>
  cor(use = "pairwise.complete.obs") |>
  round(2)
##               fimngrs.cap_1 fimngrs.cap_2 fimngrs.cap_3 fimngrs.cap_4
## fimngrs.cap_1          1.00          0.78          0.73          0.70
## fimngrs.cap_2          0.78          1.00          0.62          0.67
## fimngrs.cap_3          0.73          0.62          1.00          0.73
## fimngrs.cap_4          0.70          0.67          0.73          1.00
The correlation between income in wave 1 and wave 2 is 0.78. This is a high positive correlation, implying that people with high income in wave 1 are also likely to have high income in wave 2. As the distance between waves grows, the correlation becomes weaker, a pattern that is common in longitudinal data because more change can happen as time passes.
Transition tables
When we are interested in transitions between the categories of a variable measured at two points in time, we can use transition tables. For example, we can calculate the transition table for the voting variable vote6.fct from wave 1 to wave 2. There are different ways to do this. A simple one is to use the count() function:
usw |> count(vote6.fct_1, vote6.fct_2)
## # A tibble: 25 × 3
##    vote6.fct_1 vote6.fct_2     n
##    <fct>       <fct>       <int>
##  1 Very        Very         2171
##  2 Very        Fairly       1194
##  3 Very        Not very      171
##  4 Very        Not at all     95
##  5 Very        <NA>         1317
##  6 Fairly      Very         1187
##  7 Fairly      Fairly       7565
##  8 Fairly      Not very     2434
##  9 Fairly      Not at all    800
## 10 Fairly      <NA>         4121
## # ℹ 15 more rows
This has the information we need, but it is not very easy to read as it is in the “long format”, with each possible combination presented on a different row. We can use the spread() function to make it more readable:
usw |>
  count(vote6.fct_1, vote6.fct_2) |>
  spread(vote6.fct_2, n)
## # A tibble: 5 × 6
##   vote6.fct_1  Very Fairly `Not very` `Not at all` `<NA>`
##   <fct>       <int>  <int>      <int>        <int>  <int>
## 1 Very         2171   1194        171           95   1317
## 2 Fairly       1187   7565       2434          800   4121
## 3 Not very      195   2592       4804         2569   3808
## 4 Not at all    117    806       2223         5280   4094
## 5 <NA>          154    575        529          590  28768
Note that the first input for the spread() command is the variable that should be spread (“vote6.fct_2” here), and the second is the “value” (in this case “n”).
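As an aside, spread() has since been superseded in the tidyverse by pivot_wider(). If you prefer the newer function, a sketch of the equivalent call would be:

# Same reshaping as the spread() call above, written with the newer
# tidyr verb pivot_wider()
usw |>
  count(vote6.fct_1, vote6.fct_2) |>
  pivot_wider(names_from = vote6.fct_2, values_from = n)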
This is more in line with the way transition tables are typically presented. Here, we see that around 2,200 people said they were “Very” likely to vote in both waves, while 95 people switched from “Very” likely to “Not at all” likely.
Using absolute frequencies (i.e., just the number of cases in each cell) is not always the best way to present transition tables, as it makes it hard to compare transition likelihoods across categories. We can instead calculate the proportion of cases in each cell. Usually, doing this by row (i.e., for each category in wave 1) is the most intuitive to read. We first calculate the total number of cases in each row and then divide the number of cases in each cell by this total, using the group_by() and mutate() functions:
usw |>
  count(vote6.fct_1, vote6.fct_2) |>
  group_by(vote6.fct_1) |>
  mutate(prop = round(n / sum(n), 2)) |>
  select(-n) |>
  spread(vote6.fct_2, prop)
## # A tibble: 5 × 6
## # Groups:   vote6.fct_1 [5]
##   vote6.fct_1  Very Fairly `Not very` `Not at all` `<NA>`
##   <fct>       <dbl>  <dbl>      <dbl>        <dbl>  <dbl>
## 1 Very         0.44   0.24       0.03         0.02   0.27
## 2 Fairly       0.07   0.47       0.15         0.05   0.26
## 3 Not very     0.01   0.19       0.34         0.18   0.27
## 4 Not at all   0.01   0.06       0.18         0.42   0.33
## 5 <NA>         0.01   0.02       0.02         0.02   0.94
We see that around 47% of people who were “Fairly” likely to vote in wave 1 are also “Fairly” likely to vote in wave 2, while around 5% switched to “Not at all” likely. The diagonal shows the proportion of people who stayed in the same category, which is also known as stability, while the off-diagonal elements show the change.
An alternative way to do this is with the table() and prop.table() commands. These functions calculate the proportion of cases in each cell of the table. We can use them like this:
table(usw$vote6.fct_1, usw$vote6.fct_2, useNA = "always") |>
  prop.table(margin = 1) |>
  round(2)
##
##              Very Fairly Not very Not at all <NA>
##   Very       0.44   0.24     0.03       0.02 0.27
##   Fairly     0.07   0.47     0.15       0.05 0.26
##   Not very   0.01   0.19     0.34       0.18 0.27
##   Not at all 0.01   0.06     0.18       0.42 0.33
##   <NA>       0.01   0.02     0.02       0.02 0.94
Note that margin = 1 calculates the proportions by row.
This gives us the same result as before. Which strategy to use is mostly a matter of personal preference.
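If you are mainly interested in stability, one option (a sketch building on the base R commands above) is to extract just the diagonal of the transition table with diag():

# Keep only the "stability" proportions, i.e., the diagonal of the
# row-wise transition table
table(usw$vote6.fct_1, usw$vote6.fct_2, useNA = "always") |>
  prop.table(margin = 1) |>
  diag() |>
  round(2)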
Describing relationships over time
Another type of descriptive analysis that we might want to perform using longitudinal data is examining how relationships between variables change over time.
Let’s imagine we are interested in the relationship between gender and voting behaviour. We can calculate a simple table to explore this over all the waves of the data:
usl |> count(vote6.fct, gndr.fct)
## # A tibble: 10 × 3
##    vote6.fct  gndr.fct     n
##    <fct>      <fct>    <int>
##  1 Very       Male      8858
##  2 Very       Female    5780
##  3 Fairly     Male     23674
##  4 Fairly     Female   25689
##  5 Not very   Male     16497
##  6 Not very   Female   24585
##  7 Not at all Male     13478
##  8 Not at all Female   24558
##  9 <NA>       Male     29857
## 10 <NA>       Female   31052
Let’s use a similar strategy as before: reshape the table to the wide format and use proportions by gender to make it easier to understand:
usl |>
  count(vote6.fct, gndr.fct) |>
  group_by(gndr.fct) |>
  mutate(prop = round(n / sum(n), 2)) |>
  select(-n) |>
  spread(vote6.fct, prop)
## # A tibble: 2 × 6
## # Groups:   gndr.fct [2]
##   gndr.fct  Very Fairly `Not very` `Not at all` `<NA>`
##   <fct>    <dbl>  <dbl>      <dbl>        <dbl>  <dbl>
## 1 Male      0.1    0.26       0.18         0.15   0.32
## 2 Female    0.05   0.23       0.22         0.22   0.28
We see that males are more likely to vote, with around 36% saying they are “Very” or “Fairly” likely to vote compared to around 28% for females.
Let’s now look at how this relationship changes over time. We can do this by adding the wave variable to the count() and group_by() functions:
usl |>
  count(wave, vote6.fct, gndr.fct) |>
  na.omit() |>
  group_by(wave, gndr.fct) |>
  mutate(prop = round(n / sum(n), 2)) |>
  select(-n) |>
  spread(vote6.fct, prop)
## # A tibble: 8 × 6
## # Groups:   wave, gndr.fct [8]
##   wave  gndr.fct  Very Fairly `Not very` `Not at all`
##   <chr> <fct>    <dbl>  <dbl>      <dbl>        <dbl>
## 1 1     Male      0.15   0.38       0.26         0.21
## 2 1     Female    0.07   0.31       0.32         0.3
## 3 2     Male      0.15   0.38       0.26         0.21
## 4 2     Female    0.07   0.33       0.3          0.29
## 5 3     Male      0.13   0.38       0.27         0.22
## 6 3     Female    0.07   0.32       0.3          0.31
## 7 4     Male      0.13   0.38       0.26         0.22
## 8 4     Female    0.07   0.32       0.3          0.31
Note that we used na.omit() to drop the missing category and make the table easier to read. The same goes for the round() command.
Now, we can explore how this relationship changed over time. For example, in wave 1 the difference between men and women in the “Very” likely to vote category is around 8 percentage points, while in wave 4 it is around 6 percentage points. We see that this narrowing is mostly due to a decrease in the likelihood of voting among men.
We can also look at the relationship between a categorical and a continuous variable and how this changes over time. For example, we can calculate the mean satisfaction for each degree level over time (using summarise()). We eliminate the missing category (using na.omit()) and reshape the table to wide (using spread()) to make it easier to read:
usl |>
  group_by(wave, degree.fct) |>
  summarise(mean_sati = mean(sati, na.rm = T) |> round(2)) |>
  na.omit() |>
  spread(degree.fct, mean_sati)
## # A tibble: 4 × 3
## # Groups:   wave [4]
##   wave  Degree `No degree`
##   <chr>  <dbl>       <dbl>
## 1 1       5.34        5.2
## 2 2       5.32        5.16
## 3 3       5.22        5.06
## 4 4       5.15        4.97
We see that people with degrees are more satisfied with their lives than people without degrees. This difference increases slightly over time.
Finally, we can examine the relationship between two continuous variables and how this changes over time. For example, we can calculate the correlation between income and satisfaction over time using some of the commands we covered already:
usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs.cap, na.rm = T),
            mean_sati = mean(sati, na.rm = T),
            cor = cor(fimngrs.cap, sati, use = "pairwise.complete.obs")) |>
  mutate_at(vars(-wave), ~round(., 2))
## # A tibble: 4 × 4
##   wave  mean_income mean_sati   cor
##   <chr>       <dbl>     <dbl> <dbl>
## 1 1           1460.      5.25  0.05
## 2 2           1574.      5.21  0.04
## 3 3           1677.      5.11  0.05
## 4 4           1741.      5.04  0.05
Note that we round all the variables except “wave” using mutate_at().
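mutate_at() still works but has been superseded by across() in newer versions of dplyr. A sketch of the equivalent call using the newer syntax:

# The same rounding step written with across() instead of mutate_at()
usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs.cap, na.rm = T),
            mean_sati = mean(sati, na.rm = T),
            cor = cor(fimngrs.cap, sati, use = "pairwise.complete.obs")) |>
  mutate(across(-wave, ~ round(.x, 2)))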
We see that the correlation between income and satisfaction is positive but very low, around 0.05. This means that people with higher incomes are only slightly more likely to be satisfied with their lives. We also see that this relationship is quite stable over time.
Exporting tables
We covered the main types of tables and summary statistics that are useful for longitudinal data. There are two main strategies for exporting tables from R. The first is to save the tables to a file, such as a .csv or Excel file. We can save a table as an object and write it to a file using the write_csv() function:
tab1 <- usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs.cap, na.rm = T),
            mean_sati = mean(sati, na.rm = T),
            cor = cor(fimngrs.cap, sati, use = "pairwise.complete.obs"))

write_csv(tab1, "usl_summary.csv")
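If you need an Excel file instead, one option (a sketch assuming the writexl package is installed; we do not use it elsewhere in this post) is the write_xlsx() function:

# Write the same table to an Excel file; requires the writexl package
# (an assumption on our part, not loaded earlier in the post)
library(writexl)
write_xlsx(tab1, "usl_summary.xlsx")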
The second strategy is to create dynamic documents that include the tables we want in a nice format. In R, we can create such documents using Rmarkdown or Quarto. The tables we created above can be easily included in these documents. For example, we can use the kable() function from the knitr package to create a table in markdown format. This is useful if you are writing in this format:
library(knitr)

tab2 <- usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs.cap, na.rm = T),
            mean_sati = mean(sati, na.rm = T),
            cor = cor(fimngrs.cap, sati, use = "pairwise.complete.obs")) |>
  mutate_at(vars(-wave), ~round(., 2))

kable(tab2)
| wave | mean_income | mean_sati | cor  |
|:-----|------------:|----------:|-----:|
| 1    |     1459.88 |      5.25 | 0.05 |
| 2    |     1573.74 |      5.21 | 0.04 |
| 3    |     1677.22 |      5.11 | 0.05 |
| 4    |     1741.10 |      5.04 | 0.05 |
The kable() command can adapt to the type of output you want. For example, if you are writing a LaTeX document, you can use the format = "latex" argument to get a table in LaTeX format.
kable(tab2, format = "latex")
\begin{tabular}{l|r|r|r}
\hline
wave & mean\_income & mean\_sati & cor\\
\hline
1 & 1459.88 & 5.25 & 0.05\\
\hline
2 & 1573.74 & 5.21 & 0.04\\
\hline
3 & 1677.22 & 5.11 & 0.05\\
\hline
4 & 1741.10 & 5.04 & 0.05\\
\hline
\end{tabular}
The same with HTML:
kable(tab2, format = "html")
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> wave </th>
   <th style="text-align:right;"> mean_income </th>
   <th style="text-align:right;"> mean_sati </th>
   <th style="text-align:right;"> cor </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 1459.88 </td>
   <td style="text-align:right;"> 5.25 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 1573.74 </td>
   <td style="text-align:right;"> 5.21 </td>
   <td style="text-align:right;"> 0.04 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 1677.22 </td>
   <td style="text-align:right;"> 5.11 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:right;"> 1741.10 </td>
   <td style="text-align:right;"> 5.04 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
 </tbody>
</table>
Conclusions
Longitudinal data can be complex and difficult to understand. Describing this data is an important part of the longitudinal workflow. In this blog post, we covered the most common tables and summary statistics needed for longitudinal data using base R and the tidyverse. We also saw how we can export these tables to files or reports. This is part of the Data Exploration stage of longitudinal data analysis and visualization. You can see some of the main commands we covered in the figure on the right.
There are other specialized packages for creating tables in R, like kableExtra or gt, that can help you create more complex tables, and they work well with dynamic documents. You can find a list of some of these packages here.
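As a quick taste, here is a minimal sketch using gt (assuming the package is installed) to render the summary table we created above:

# Render tab2 as a nicely formatted HTML table with the gt package
# (assumes gt is installed)
library(gt)
gt(tab2)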
Try creating some of these tables with your own data. Are there other types of tables or descriptives we missed? Let us know in the comments below.