Exploring longitudinal data in R: tables and summaries

Longitudinal data can be complex and difficult to understand. Exploring and describing it is an important part of working with this data type. This process is part of what’s known as Exploratory Data Analysis (EDA), a crucial step in understanding your data. In this blog post, we will see how to use tables and summary statistics to describe longitudinal data. In R, there are different specialized packages for this, but here we will focus on base R (i.e., the commands that come by default when you install R) and the tidyverse (a popular collection of packages for cleaning and visualizing data) to cover the most common descriptives used with longitudinal data.

Setting up the environment

You can download the data we cleaned in a previous post if you want to follow along. Remember to set up your working directory as previously described. Next, let’s load the packages we need. Here, we will be mainly using the tidyverse package:

library(tidyverse)

With that out of the way, we can load the data. The data we will be using is the us_clean_syn dataset. We transformed and cleaned this dataset in two previous blog posts: here and here. You can download the data from here. The full code used in the post can be found here.

load("./data/us_clean_syn.RData")

This file includes both the data in the wide format and the long format. As we will see later, each format can be useful for different types of tables and summary statistics.

Simple tables and descriptives

Let’s start with the basics. We will first look at the structure of the data using the glimpse() function. This will give us an overview of the variables in the long dataset.

glimpse(usl)
## Rows: 204,028
## Columns: 31
## $ pidp        <dbl> 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10, 11, 11…
## $ wave        <chr> "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4"…
## $ age         <dbl+lbl> 28, 28, 28, 28, 80, 80, 80, 80, 60, 60, 60, 60, 42, 42…
## $ hiqual      <dbl+lbl> 1, 1, 1, 1, 9, 9, 9, 9, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, …
## $ single      <dbl+lbl>  1,  1,  1,  0,  0,  0,  0,  0,  1, NA, NA, NA,  0, NA…
## $ fimngrs     <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ sclfsato    <dbl+lbl> -9,  6,  2,  1,  6,  6, -8,  3,  2, NA, NA, NA, -9, NA…
## $ sf12pcs     <dbl> 54.51, 62.98, 56.97, 56.15, 53.93, 46.39, NA, 46.16, 33.18…
## $ sf12mcs     <dbl> 55.73, 36.22, 60.02, 59.04, 46.48, 45.39, NA, 37.02, 46.80…
## $ istrtdaty   <dbl+lbl> 2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2009, …
## $ sf1         <dbl+lbl>  2,  2,  1,  3,  4,  3,  3,  3,  5, NA, NA, NA,  2, NA…
## $ present     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, …
## $ gndr.fct    <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male…
## $ single.fct  <fct> Single, Single, Single, In relationship, In relationship, …
## $ urban.fct   <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban, Urban, Ur…
## $ degree.fct  <fct> Degree, Degree, Degree, Degree, No degree, No degree, No d…
## $ vote6.fct   <fct> Not very, Not at all, Not at all, Not at all, Not at all, …
## $ sati        <dbl> NA, 6, 2, 1, 6, 6, NA, 3, 2, NA, NA, NA, NA, NA, 5, NA, NA…
## $ sati.fct    <fct> NA, Mostly satisfied, Mostly dissatisfied, Completely diss…
## $ fimngrs.cap <dbl> 3283.87, 4002.50, 3616.67, 850.50, 896.00, 709.00, 702.00,…
## $ logincome   <dbl> 8.099818, 8.297170, 8.196070, 6.757514, 6.809039, 6.577861…
## $ sati.ind    <dbl> 3.000000, 3.000000, 3.000000, 3.000000, 5.000000, 5.000000…
## $ sati.dev    <dbl> NA, 3.000000, -1.000000, -2.000000, 1.000000, 1.000000, NA…
## $ sf12pcs.ind <dbl> 57.65250, 57.65250, 57.65250, 57.65250, 48.82667, 48.82667…
## $ sf12pcs.dev <dbl> -3.1425000, 5.3275000, -0.6825000, -1.5025000, 5.1033333, …
## $ sf12mcs.ind <dbl> 52.75250, 52.75250, 52.75250, 52.75250, 42.96333, 42.96333…
## $ sf12mcs.dev <dbl> 2.977500, -16.532500, 7.267500, 6.287500, 3.516667, 2.4266…
## $ waves       <int> 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4…
## $ sati.lag    <dbl> NA, NA, 6, 2, NA, 6, 6, NA, NA, 2, NA, NA, NA, NA, NA, 5, …
## $ sf12pcs.lag <dbl> NA, 54.51, 62.98, 56.97, NA, 53.93, 46.39, NA, NA, 33.18, …
## $ sf12mcs.lag <dbl> NA, 55.73, 36.22, 60.02, NA, 46.48, 45.39, NA, NA, 46.80, …

This output lists all the variables in the data, together with their type and first few values. The dataset has 31 variables and around 204,000 rows. The wave variable is the time variable, and pidp is the unique identifier for each person.
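
Before building tables, it can help to confirm the panel structure of the long data. The sketch below uses a small made-up data frame (the values and the name toy are hypothetical, not the real dataset) to count people, waves and rows per person:

```r
library(dplyr)

# Hypothetical toy panel: 3 people observed in 3 waves (not the real data)
toy <- data.frame(
  pidp = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  wave = rep(c("1", "2", "3"), times = 3),
  income = c(1000, 1100, NA, 2000, NA, NA, 1500, 1600, 1700)
)

# Number of people, waves and rows in the long data
structure_check <- toy |>
  summarise(n_people = n_distinct(pidp),
            n_waves = n_distinct(wave),
            n_rows = n())

# Rows per person: in a balanced long file this equals the number of waves
rows_per_person <- count(toy, pidp)

structure_check
rows_per_person
```

If every person has the same number of rows, the long file is balanced, and dropouts show up as missing values rather than missing rows.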

We can use the count() command to create a simple frequency table. For example, the distribution of the voting variable over all points in time is:

count(usl, vote6.fct)
## # A tibble: 5 × 2
##   vote6.fct      n
##   <fct>      <int>
## 1 Very       14638
## 2 Fairly     49363
## 3 Not very   41082
## 4 Not at all 38036
## 5 <NA>       60909

We can also calculate the proportions by creating a new column using mutate().

count(usl, vote6.fct) |> 
  mutate(prop = n / sum(n))
## # A tibble: 5 × 3
##   vote6.fct      n   prop
##   <fct>      <int>  <dbl>
## 1 Very       14638 0.0717
## 2 Fairly     49363 0.242 
## 3 Not very   41082 0.201 
## 4 Not at all 38036 0.186 
## 5 <NA>       60909 0.299

We introduced the |> (pipe) operator in a previous post here.

An alternative approach in base R is the table() command:

table(usl$vote6.fct, useNA = "always")
## 
##       Very     Fairly   Not very Not at all       <NA> 
##      14638      49363      41082      38036      60909

And if we want the proportions, we can use the prop.table() command:

table(usl$vote6.fct, useNA = "always") |> 
  prop.table() |> 
  round(2)
## 
##       Very     Fairly   Not very Not at all       <NA> 
##       0.07       0.24       0.20       0.19       0.30

Note that we used round(2) to print just two decimal places, which makes the table easier to read.
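
The proportions above include the missing category in the denominator. If we want proportions among valid answers only, one option is to filter out the missing values first. This is a minimal sketch on made-up data (the toy data frame and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy responses, including missing values
toy <- data.frame(
  vote = factor(c("Very", "Fairly", "Fairly", "Not very", NA, NA),
                levels = c("Very", "Fairly", "Not very"))
)

# Proportions over all rows (NA kept as its own category)
prop_all <- toy |>
  count(vote) |>
  mutate(prop = n / sum(n))

# Proportions among valid answers only
prop_valid <- toy |>
  filter(!is.na(vote)) |>
  count(vote) |>
  mutate(prop = n / sum(n))

# Base R equivalent: table() drops NA unless useNA is set
prop_valid_base <- prop.table(table(toy$vote))

prop_all
prop_valid
prop_valid_base
```

Which denominator to report depends on the question: the first shows how much data is missing, the second describes the people who answered.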

We can also quickly summarise the variables in the dataset using the summary() command. Here, we will select only the variables from gndr.fct to fimngrs.cap for brevity:

usl |>
  select(gndr.fct:fimngrs.cap) |>
  summary()
##    gndr.fct                single.fct    urban.fct          degree.fct    
##  Male  : 92364   In relationship:96690   Rural: 32970   Degree   : 66188  
##  Female:111664   Single         :55984   Urban:119657   No degree:137396  
##                  NA's           :51354   NA's : 51401   NA's     :   444  
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##       vote6.fct          sati                         sati.fct    
##  Very      :14638   Min.   :1.00    Mostly satisfied      :55016  
##  Fairly    :49363   1st Qu.:4.00    Somewhat satisfied    :21619  
##  Not very  :41082   Median :6.00    Completely satisfied  :14746  
##  Not at all:38036   Mean   :5.17    Neither sat nor dissat:11735  
##  NA's      :60909   3rd Qu.:6.00    Somewhat dissatisfied : 9657  
##                     Max.   :7.00    (Other)               : 9924  
##                     NA's   :81331   NA's                  :81331  
##   fimngrs.cap     
##  Min.   :    0.0  
##  1st Qu.:  642.2  
##  Median : 1231.7  
##  Mean   : 1591.3  
##  3rd Qu.: 2084.2  
##  Max.   :10000.0  
##  NA's   :51519

We see that this command figures out what type of variable we have and shows the appropriate summary statistics. For example, for the gndr.fct variable, we see that it is a factor with two levels: Male and Female. For the fimngrs.cap variable, we see that it is a numeric variable with a mean of about 1591 and a median of about 1232.

Grouped summary tables

We can also calculate grouped statistics. In the context of longitudinal data, this is most often used to describe change over time using long data. For example, we can calculate the mean income over time using a combination of the group_by() and summarise() functions:

usl |>
  group_by(wave) |>
  summarise(income = mean(fimngrs, na.rm = TRUE))
## # A tibble: 4 × 2
##   wave  income
##   <chr>  <dbl>
## 1 1      1465.
## 2 2      1579.
## 3 3      1685.
## 4 4      1749.

We can do multiple summaries at once. For example, we can calculate the mean income, variance of income, proportion of cases missing on income, mean age, and proportion of singles over time:

usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs, na.rm = TRUE),
            var_income = var(fimngrs, na.rm = TRUE),
            miss_income = mean(is.na(fimngrs)),
            mean_age = mean(age, na.rm = TRUE),
            prop_single = mean(single, na.rm = TRUE))
## # A tibble: 4 × 6
##   wave  mean_income var_income miss_income mean_age prop_single
##   <chr>       <dbl>      <dbl>       <dbl>    <dbl>       <dbl>
## 1 1           1465.   2181203.    0.000823     45.5       0.386
## 2 2           1579.   2182069.    0.250        45.5       0.363
## 3 3           1685.   2404209.    0.351        45.5       0.357
## 4 4           1749.   2481650.    0.407        45.5       0.349

We see that the mean and variance of income are increasing over time, and the proportion of missing values is also increasing (probably due to dropouts in the study). The mean of the age variable does not change over time because we treat it as time-constant (i.e., the age at the start of the study). The proportion of singles is decreasing over time.

We can also calculate similar statistics for different groups. For example, we can calculate the same statistics separately for people with and without degrees at each point in time.

usl |>
  group_by(wave, degree.fct) |>
  summarise(mean_income = mean(fimngrs, na.rm = TRUE),
            var_income = var(fimngrs, na.rm = TRUE),
            miss_income = mean(is.na(fimngrs)))
## # A tibble: 12 × 5
## # Groups:   wave [4]
##    wave  degree.fct mean_income var_income miss_income
##    <chr> <fct>            <dbl>      <dbl>       <dbl>
##  1 1     Degree           2146.   3307885.    0.00127 
##  2 1     No degree        1137.   1309982.    0.000611
##  3 1     <NA>             1382.   1721212.    0       
##  4 2     Degree           2279.   3453524.    0.238   
##  5 2     No degree        1234.   1195328.    0.256   
##  6 2     <NA>             1410.   1570990.    0.541   
##  7 3     Degree           2394.   3717992.    0.320   
##  8 3     No degree        1318.   1331952.    0.366   
##  9 3     <NA>             1769.   1714929.    0.477   
## 10 4     Degree           2458.   3768458.    0.363   
## 11 4     No degree        1367.   1376334.    0.428   
## 12 4     <NA>             1737.   1988781.    0.514

We see that there is quite a large difference in the mean income between people with and without a degree. The variance of income is also higher for people with a degree.

If we want to eliminate the cases that are missing on the grouping variable, we can first filter them out:

usl |>
  filter(!is.na(degree.fct)) |>
  group_by(wave, degree.fct) |>
  summarise(mean_income = mean(fimngrs, na.rm = TRUE),
            var_income = var(fimngrs, na.rm = TRUE),
            miss_income = mean(is.na(fimngrs)))
## # A tibble: 8 × 5
## # Groups:   wave [4]
##   wave  degree.fct mean_income var_income miss_income
##   <chr> <fct>            <dbl>      <dbl>       <dbl>
## 1 1     Degree           2146.   3307885.    0.00127 
## 2 1     No degree        1137.   1309982.    0.000611
## 3 2     Degree           2279.   3453524.    0.238   
## 4 2     No degree        1234.   1195328.    0.256   
## 5 3     Degree           2394.   3717992.    0.320   
## 6 3     No degree        1318.   1331952.    0.366   
## 7 4     Degree           2458.   3768458.    0.363   
## 8 4     No degree        1367.   1376334.    0.428
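
Notice that the output above is still grouped by wave (the "# Groups" line). If we plan to keep working with the summary table, it is often convenient to drop the grouping with the .groups argument of summarise(). Here is a small sketch using made-up data (the toy data frame is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two waves, two degree groups
toy <- data.frame(
  wave = rep(c("1", "2"), each = 4),
  degree = rep(c("Degree", "Degree", "No degree", "No degree"), times = 2),
  income = c(2000, 2200, 1000, NA, 2300, NA, 1100, 1200)
)

# Same pattern as above, but the result is returned ungrouped
by_wave_degree <- toy |>
  group_by(wave, degree) |>
  summarise(mean_income = mean(income, na.rm = TRUE),
            miss_income = mean(is.na(income)),
            .groups = "drop")

by_wave_degree
```

With .groups = "drop", later steps such as mutate() operate on the whole table rather than within each wave.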

Transition tables and correlation matrices

There are two other types of tables that can be useful for longitudinal data exploration: transition tables and correlation matrices. Both describe how much people change, or stay the same, between points in time, which is often one of the main reasons we are interested in longitudinal studies.

Correlation matrices

For continuous variables, we can calculate the correlation matrix of income over time using the wide data. We first select only the variables of interest (using select()), then compute the correlations (using cor() with pairwise deletion), and round the results to two decimal places (using round()):

usw |>
  select(matches("fimngrs.cap")) |>
  cor(use = "pairwise.complete.obs") |>
  round(2)
##               fimngrs.cap_1 fimngrs.cap_2 fimngrs.cap_3 fimngrs.cap_4
## fimngrs.cap_1          1.00          0.78          0.73          0.70
## fimngrs.cap_2          0.78          1.00          0.62          0.67
## fimngrs.cap_3          0.73          0.62          1.00          0.73
## fimngrs.cap_4          0.70          0.67          0.73          1.00

The correlation between income in wave 1 and wave 2 is 0.78. This is a high positive correlation, implying that people with high income in wave 1 are also likely to have high income in wave 2. The correlation with wave 1 income weakens as the gap grows (0.73 with wave 3 and 0.70 with wave 4). This pattern is common in longitudinal data, as more change can happen with the passing of time.
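
With pairwise deletion, each correlation can be based on a different number of people, which matters when dropout is high. The sketch below uses a made-up long dataset (all names and values are hypothetical) to reshape to wide, compute pairwise correlations, and count how many complete pairs sit behind each cell:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel in long format, with some dropout
long <- data.frame(
  pidp = rep(1:5, each = 3),
  wave = rep(c("1", "2", "3"), times = 5),
  income = c(1000, 1100, 1200,
             2000, 2100, NA,
             1500, NA, NA,
             3000, 2900, 3100,
             800, 900, 1000)
)

# Reshape to wide: one income column per wave
wide <- long |>
  pivot_wider(names_from = wave, values_from = income,
              names_prefix = "income_")

income_wide <- wide |>
  select(starts_with("income_"))

# Correlations using all people observed in each pair of waves
cor_pairwise <- cor(income_wide, use = "pairwise.complete.obs")

# Number of people contributing to each correlation
valid <- (!is.na(as.matrix(income_wide))) * 1
n_pairs <- crossprod(valid)

round(cor_pairwise, 2)
n_pairs
```

Reporting the pairwise sample sizes next to the correlations makes it clear when later waves rest on far fewer cases.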

Transition tables

When we are interested in how a categorical variable changes between two points in time, we can use transition tables. For example, we can calculate the transition table for the voting variable “vote6.fct” from wave 1 to wave 2. There are different ways to do this. A simple one would be just using the count() function:

usw |>
  count(vote6.fct_1, vote6.fct_2)
## # A tibble: 25 × 3
##    vote6.fct_1 vote6.fct_2     n
##    <fct>       <fct>       <int>
##  1 Very        Very         2171
##  2 Very        Fairly       1194
##  3 Very        Not very      171
##  4 Very        Not at all     95
##  5 Very        <NA>         1317
##  6 Fairly      Very         1187
##  7 Fairly      Fairly       7565
##  8 Fairly      Not very     2434
##  9 Fairly      Not at all    800
## 10 Fairly      <NA>         4121
## # ℹ 15 more rows

This has the information we need but it is not very easy to read as it is in the “long format” with each possible combination presented on a different row. We can use the spread() function to make it more readable:

usw |>
  count(vote6.fct_1, vote6.fct_2) |>
  spread(vote6.fct_2, n)
## # A tibble: 5 × 6
##   vote6.fct_1  Very Fairly `Not very` `Not at all` `<NA>`
##   <fct>       <int>  <int>      <int>        <int>  <int>
## 1 Very         2171   1194        171           95   1317
## 2 Fairly       1187   7565       2434          800   4121
## 3 Not very      195   2592       4804         2569   3808
## 4 Not at all    117    806       2223         5280   4094
## 5 <NA>          154    575        529          590  28768

Note that the first input for the spread() command is the variable that should be spread (“vote6.fct_2” here), and the second is the “value” (in this case “n”).

This is more in line with the way transition tables are typically presented. Here, we see that 2,171 people are very likely to vote in both waves, and 95 people switched from very likely to not at all likely.
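
The tidyr documentation marks spread() as superseded by pivot_wider(), so the same reshaping step can also be written with the newer function. Here is a minimal sketch on made-up data (the toy data frame and column names are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy wide data: one answer per person in each of two waves
wide_toy <- data.frame(
  vote_1 = c("Very", "Very", "Fairly", "Fairly", "Fairly", "Very"),
  vote_2 = c("Very", "Fairly", "Fairly", "Fairly", "Very", "Very")
)

# names_from plays the role of the first spread() input,
# values_from the role of the second
transitions <- wide_toy |>
  count(vote_1, vote_2) |>
  pivot_wider(names_from = vote_2, values_from = n, values_fill = 0)

transitions
```

The values_fill = 0 argument shows combinations that never occur as 0 instead of NA, which is usually what we want in a table of counts.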

Using absolute frequencies (i.e., just the number of cases in each cell) is not always the best way to present transition tables, as it is hard to compare transition likelihoods for different categories. We can instead calculate the proportion of cases in each cell. Usually, doing this by row (i.e., for each category in wave 1) is the most intuitive to read. To do so, we divide the number of cases in each cell by the total number of cases in its row, using the group_by() and mutate() functions:

usw |>
  count(vote6.fct_1, vote6.fct_2) |>
  group_by(vote6.fct_1) |>
  mutate(prop = round(n / sum(n), 2)) |>
  select(-n) |>
  spread(vote6.fct_2, prop)
## # A tibble: 5 × 6
## # Groups:   vote6.fct_1 [5]
##   vote6.fct_1  Very Fairly `Not very` `Not at all` `<NA>`
##   <fct>       <dbl>  <dbl>      <dbl>        <dbl>  <dbl>
## 1 Very         0.44   0.24       0.03         0.02   0.27
## 2 Fairly       0.07   0.47       0.15         0.05   0.26
## 3 Not very     0.01   0.19       0.34         0.18   0.27
## 4 Not at all   0.01   0.06       0.18         0.42   0.33
## 5 <NA>         0.01   0.02       0.02         0.02   0.94

We see that around 47% of people who were “Fairly” likely to vote in wave 1 are also “Fairly” likely to vote in wave 2, while around 5% switched to “Not at all” likely. The diagonal shows the proportion of people who stayed in the same category, which is also known as stability, while the off-diagonal elements show the change.

An alternative way to do this is with the base R commands table() and prop.table(). The table() command counts the cases in each cell, and prop.table() converts those counts to proportions. We can use them like this:

table(usw$vote6.fct_1, usw$vote6.fct_2, useNA = "always") |>
  prop.table(margin = 1) |>
  round(2)
##             
##              Very Fairly Not very Not at all <NA>
##   Very       0.44   0.24     0.03       0.02 0.27
##   Fairly     0.07   0.47     0.15       0.05 0.26
##   Not very   0.01   0.19     0.34       0.18 0.27
##   Not at all 0.01   0.06     0.18       0.42 0.33
##   <NA>       0.01   0.02     0.02       0.02 0.94

Note that margin = 1 calculates the proportions by row.

This gives us the same result as before. Which strategy to use depends mostly on personal preferences.
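
To see what the margin argument does, we can compare row and column proportions on a tiny made-up example (the vectors wave1 and wave2 are hypothetical):

```r
# Hypothetical toy answers for five people in two waves
wave1 <- c("Yes", "Yes", "Yes", "No", "No")
wave2 <- c("Yes", "Yes", "No", "No", "Yes")

counts <- table(wave1, wave2)

# margin = 1: each row sums to 1 (where do people from each wave 1 category go?)
row_props <- prop.table(counts, margin = 1)

# margin = 2: each column sums to 1 (where do people in each wave 2 category come from?)
col_props <- prop.table(counts, margin = 2)

rowSums(row_props)
colSums(col_props)
```

For transition tables, row proportions are usually the ones we want, because they condition on the starting category.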

Describing relationships over time

Another type of descriptive analysis that we might want to perform using longitudinal data is examining how relationships between variables change over time.

Let’s imagine we are interested in the relationship between gender and voting behaviour. We can calculate a simple table to explore this over all the waves of the data:

usl |>
  count(vote6.fct, gndr.fct)
## # A tibble: 10 × 3
##    vote6.fct  gndr.fct     n
##    <fct>      <fct>    <int>
##  1 Very       Male      8858
##  2 Very       Female    5780
##  3 Fairly     Male     23674
##  4 Fairly     Female   25689
##  5 Not very   Male     16497
##  6 Not very   Female   24585
##  7 Not at all Male     13478
##  8 Not at all Female   24558
##  9 <NA>       Male     29857
## 10 <NA>       Female   31052

Let’s use a similar strategy as before and reshape this to the wide format, using proportions by gender to make it easier to understand:

usl |>
  count(vote6.fct, gndr.fct) |>
  group_by(gndr.fct) |>
  mutate(prop = round(n / sum(n), 2)) |>
  select(-n) |>
  spread(vote6.fct, prop)
## # A tibble: 2 × 6
## # Groups:   gndr.fct [2]
##   gndr.fct  Very Fairly `Not very` `Not at all` `<NA>`
##   <fct>    <dbl>  <dbl>      <dbl>        <dbl>  <dbl>
## 1 Male      0.1    0.26       0.18         0.15   0.32
## 2 Female    0.05   0.23       0.22         0.22   0.28

We see that males are more likely to vote, with around 36% saying they are “Very” or “Fairly” likely to vote compared to around 28% for females.

Let’s now look at how this relationship changes over time. We can do this by adding the wave variable to the count() and group_by() functions:

usl |>
  count(wave, vote6.fct, gndr.fct) |>
  na.omit() |>
  group_by(wave, gndr.fct) |>
  mutate(prop = round(n / sum(n), 2)) |>
  select(-n) |>
  spread(vote6.fct, prop)
## # A tibble: 8 × 6
## # Groups:   wave, gndr.fct [8]
##   wave  gndr.fct  Very Fairly `Not very` `Not at all`
##   <chr> <fct>    <dbl>  <dbl>      <dbl>        <dbl>
## 1 1     Male      0.15   0.38       0.26         0.21
## 2 1     Female    0.07   0.31       0.32         0.3 
## 3 2     Male      0.15   0.38       0.26         0.21
## 4 2     Female    0.07   0.33       0.3          0.29
## 5 3     Male      0.13   0.38       0.27         0.22
## 6 3     Female    0.07   0.32       0.3          0.31
## 7 4     Male      0.13   0.38       0.26         0.22
## 8 4     Female    0.07   0.32       0.3          0.31

Note that we used na.omit() to ignore the missing category and make the table easier to read. The same goes for the round() command.

Now, we can explore how this relationship changed over time. For example, in wave 1 the difference between men and women in the “Very” likely to vote category is around 8 percentage points, while in wave 4 it is around 6 percentage points. This is mostly because of a decrease in the share of men who are “Very” likely to vote.

We can also look at the relationship between a categorical and a continuous variable and how this changes over time. For example, we can calculate the mean satisfaction for each degree level over time (using summarise()). We eliminate the missing category (using na.omit()) and reshape the table to wide (using spread()) to make it easier to read:

usl |>
  group_by(wave, degree.fct) |>
  summarise(mean_sati = mean(sati, na.rm = TRUE) |> round(2)) |>
  na.omit() |>
  spread(degree.fct, mean_sati)
## # A tibble: 4 × 3
## # Groups:   wave [4]
##   wave  Degree `No degree`
##   <chr>  <dbl>       <dbl>
## 1 1       5.34        5.2 
## 2 2       5.32        5.16
## 3 3       5.22        5.06
## 4 4       5.15        4.97

We see that people with degrees are more satisfied with their lives than people without degrees. This difference increases slightly over time.

Finally, we can examine the relationship between two continuous variables and how this changes over time. For example, we can calculate the correlation between income and satisfaction over time using some of the commands we covered already:

usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs.cap, na.rm = TRUE),
            mean_sati = mean(sati, na.rm = TRUE),
            cor = cor(fimngrs.cap, sati, 
                      use = "pairwise.complete.obs")) |> 
  mutate_at(vars(-wave), ~round(., 2))
## # A tibble: 4 × 4
##   wave  mean_income mean_sati   cor
##   <chr>       <dbl>     <dbl> <dbl>
## 1 1           1460.      5.25  0.05
## 2 2           1574.      5.21  0.04
## 3 3           1677.      5.11  0.05
## 4 4           1741.      5.04  0.05

Note that we round all the variables except “wave” using mutate_at().

We see that the correlation between income and satisfaction is positive but very weak, around 0.05. This means that higher-income people are only marginally more likely to be satisfied with their life. We also see that this relationship is pretty stable over time.
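
The rounding step with mutate_at() used above can also be written with across(), which the dplyr documentation recommends in place of the older scoped verbs. A minimal sketch on a made-up summary table (the values are hypothetical):

```r
library(dplyr)

# Hypothetical summary table with unrounded values
toy <- data.frame(
  wave = c("1", "2"),
  mean_income = c(1459.8765, 1573.7412),
  cor = c(0.04871, 0.04312)
)

# Round every column except wave to two decimal places
rounded <- toy |>
  mutate(across(-wave, ~ round(.x, 2)))

rounded
```

An alternative selection is across(where(is.numeric), ...), which rounds all numeric columns without naming the ones to skip.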

Exporting tables

We covered the main types of tables and summary statistics that are useful for longitudinal data. There are two main strategies for exporting the tables from R. The first one is to save the tables to a file, such as a .csv or Excel file. We can save a table as an object and write it to a file using the write_csv() function:

tab1 <- usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs.cap, na.rm = TRUE),
            mean_sati = mean(sati, na.rm = TRUE),
            cor = cor(fimngrs.cap, sati, use = "pairwise.complete.obs")) 

write_csv(tab1, "usl_summary.csv")
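
A quick way to check an export is to read the file back in. The sketch below writes a made-up table to a temporary file (the table and path are hypothetical) and reads it back, setting the column types explicitly so that wave stays a character variable:

```r
library(readr)

# Hypothetical summary table
tab <- data.frame(
  wave = c("1", "2"),
  mean_income = c(1459.88, 1573.74)
)

# Write to a temporary file so nothing is left in the working directory
path <- tempfile(fileext = ".csv")
write_csv(tab, path)

# Without col_types, read_csv() would guess that wave is numeric
tab_back <- read_csv(path,
                     col_types = cols(wave = col_character(),
                                      mean_income = col_double()))

tab_back
```

A .csv file does not store variable types, so factors and character codes need to be restored when the file is read again.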

The second strategy is to create dynamic documents that include the tables we want in a nice format. In R, we can create such documents using R Markdown or Quarto. The tables we created above can be easily included in these documents. For example, we can use the kable() function from the knitr package to create a table in markdown format, which is useful if you are writing in markdown:

library(knitr)
tab2 <- usl |>
  group_by(wave) |>
  summarise(mean_income = mean(fimngrs.cap, na.rm = TRUE),
            mean_sati = mean(sati, na.rm = TRUE),
            cor = cor(fimngrs.cap, sati, 
                      use = "pairwise.complete.obs")) |> 
  mutate_at(vars(-wave), ~round(., 2)) 

kable(tab2)
|wave | mean_income| mean_sati|  cor|
|:----|-----------:|---------:|----:|
|1    |     1459.88|      5.25| 0.05|
|2    |     1573.74|      5.21| 0.04|
|3    |     1677.22|      5.11| 0.05|
|4    |     1741.10|      5.04| 0.05|
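
The kable() function also has arguments for rounding and labelling, so some of the formatting can be done at the export step. This is a small sketch on a made-up table (the values, caption and column labels are hypothetical):

```r
library(knitr)

# Hypothetical summary table with unrounded values
tab <- data.frame(
  wave = c("1", "2"),
  mean_income = c(1459.8765, 1573.7412),
  cor = c(0.04871, 0.04312)
)

# digits rounds numeric columns; col.names and caption set the labels
md <- kable(tab,
            format = "pipe",
            digits = 2,
            col.names = c("Wave", "Mean income", "Correlation"),
            caption = "Hypothetical summary by wave")

md
```

Rounding inside kable() keeps the underlying table at full precision, which is handy if the same object is also used for further calculations.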

The kable() command can adapt to the type of output you want. For example, if you are writing a LaTeX document, you can use the format = "latex" argument to get a table in LaTeX format.

kable(tab2, format = "latex")
\begin{tabular}{l|r|r|r}
\hline
wave & mean\_income & mean\_sati & cor\\
\hline
1 & 1459.88 & 5.25 & 0.05\\
\hline
2 & 1573.74 & 5.21 & 0.04\\
\hline
3 & 1677.22 & 5.11 & 0.05\\
\hline
4 & 1741.10 & 5.04 & 0.05\\
\hline
\end{tabular}

The same with HTML:

kable(tab2, format = "html")
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> wave </th>
   <th style="text-align:right;"> mean_income </th>
   <th style="text-align:right;"> mean_sati </th>
   <th style="text-align:right;"> cor </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 1459.88 </td>
   <td style="text-align:right;"> 5.25 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 1573.74 </td>
   <td style="text-align:right;"> 5.21 </td>
   <td style="text-align:right;"> 0.04 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 1677.22 </td>
   <td style="text-align:right;"> 5.11 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:right;"> 1741.10 </td>
   <td style="text-align:right;"> 5.04 </td>
   <td style="text-align:right;"> 0.05 </td>
  </tr>
</tbody>
</table>

[Figure: Data exploration of longitudinal data using R with tables and summaries.]

Conclusions

Longitudinal data can be complex and difficult to understand. Describing this data is an important part of the longitudinal workflow. In this blog post, we covered the most common tables and summary statistics needed for longitudinal data using base R and tidyverse. We also saw how we can export these tables to files or reports. This is part of the Data Exploration stage of longitudinal data analysis and visualization. You can see some of the main commands we covered in the figure on the right.

There are other specialized packages for creating tables in R like kableExtra or gt that can help you create more complex tables, and they work well with dynamic documents. You can find a list of some of these packages here.

Try creating some of these tables with your own data. Are there other types of tables or descriptives we missed? Let us know in the comments below.

