Other Approaches

interlacer was inspired by the haven, labelled, and declared packages. These packages provide similar functionality to interlacer, but are more focused on providing compatibility with missing reason data imported from SPSS, SAS, and Stata.

In this section I discuss some of the particularities of these approaches, and how they compare with interlacer.

(Note: Future versions of interlacer will have the ability to convert haven_labelled and declared types to and from interlaced types.)

haven and labelled

The haven and labelled packages rely on two functions for creating vectors that interlace values and missing reasons: haven::labelled_spss() and haven::tagged_na(). Although they both create haven_labelled vectors, they use very different methods for representing missing values.

“Labelled” missing values (`haven::labelled_spss()`)

When SPSS files are loaded with haven via haven::read_spss(), values and missing reasons are loaded into a single interlaced numeric vector:

library(interlacer, warn.conflicts = FALSE)
library(haven)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

(df_spss <- read_spss(
  interlacer_example("colors.sav"), user_na = TRUE
))
#> Registered S3 methods overwritten by 'readr':
#>   method                    from 
#>   as.data.frame.spec_tbl_df vroom
#>   as_tibble.spec_tbl_df     vroom
#>   format.col_spec           vroom
#>   print.col_spec            vroom
#>   print.collector           vroom
#>   print.date_names          vroom
#>   print.locale              vroom
#>   str.col_spec              vroom
#> # A tibble: 11 × 3
#>    person_id age                favorite_color    
#>    <dbl+lbl> <dbl+lbl>          <dbl+lbl>         
#>  1  1         20                  1 [BLUE]        
#>  2  2        -98 (NA) [REFUSED]   1 [BLUE]        
#>  3  3         21                -98 (NA) [REFUSED]
#>  4  4         30                -97 (NA) [OMITTED]
#>  5  5          1                -99 (NA) [N/A]    
#>  6  6         41                  2 [RED]         
#>  7  7         50                -97 (NA) [OMITTED]
#>  8  8         30                  3 [YELLOW]      
#>  9  9        -98 (NA) [REFUSED] -98 (NA) [REFUSED]
#> 10 10        -97 (NA) [OMITTED]   2 [RED]         
#> 11 11         10                -98 (NA) [REFUSED]

This is not just any numeric vector, though: it is a haven::labelled_spss() numeric vector, with attributes describing its value labels and missing value codes:

attributes(df_spss$favorite_color)
#> $label
#> [1] "Favorite color"
#> 
#> $na_range
#> [1] -Inf    0
#> 
#> $class
#> [1] "haven_labelled_spss" "haven_labelled"      "vctrs_vctr"         
#> [4] "double"             
#> 
#> $format.spss
#> [1] "F8.2"
#> 
#> $labels
#>    BLUE     RED  YELLOW     N/A REFUSED OMITTED 
#>       1       2       3     -99     -98     -97

These attributes adjust the behavior of functions like is.na():

is.na(df_spss$favorite_color)
#>  [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
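
To see which attribute drives this behavior, a small hand-built vector can help. The sketch below uses made-up codes (not the colors.sav data) and assumes haven is installed. Its na_range marks every value from -Inf to 0 as missing, mirroring the attributes shown above:

```r
library(haven)

# Hypothetical codes for illustration; not read from colors.sav
x <- labelled_spss(
  c(1, 2, -98, 3, -97),
  labels = c(BLUE = 1, RED = 2, YELLOW = 3, REFUSED = -98, OMITTED = -97),
  na_range = c(-Inf, 0)
)

# Values that fall inside na_range are reported as missing...
is.na(x)
#> [1] FALSE FALSE  TRUE FALSE  TRUE

# ...but the underlying codes are still stored in the vector
as.numeric(x)
#> [1]   1   2 -98   3 -97
```

The second result is the important one: is.na() only consults the attributes, so the raw codes remain in the data and can leak into arithmetic.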

This makes it easy to check if a value is a missing reason, but you still have to filter out missing reasons before you do any aggregations:

df_spss |>
  mutate(
    age_values = if_else(is.na(age), NA, age),
    favorite_color_missing_reasons = if_else(
      is.na(favorite_color), favorite_color, NA
    )
  ) |>
  summarize(
    mean_age = mean(age_values, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
#> # A tibble: 4 × 3
#>   favorite_color_missing_reasons mean_age     n
#>   <dbl+lbl>                         <dbl> <int>
#> 1  NA                                30.3     5
#> 2 -98 (NA) [REFUSED]                 15.5     3
#> 3 -97 (NA) [OMITTED]                 40       2
#> 4 -99 (NA) [N/A]                      1       1

df_spss |>
  mutate(
    age_next_year = if_else(is.na(age), NA, age + 1),
    .after = person_id
  )
#> # A tibble: 11 × 4
#>    person_id age_next_year age                favorite_color    
#>    <dbl+lbl>         <dbl> <dbl+lbl>          <dbl+lbl>         
#>  1  1                   21  20                  1 [BLUE]        
#>  2  2                   NA -98 (NA) [REFUSED]   1 [BLUE]        
#>  3  3                   22  21                -98 (NA) [REFUSED]
#>  4  4                   31  30                -97 (NA) [OMITTED]
#>  5  5                    2   1                -99 (NA) [N/A]    
#>  6  6                   42  41                  2 [RED]         
#>  7  7                   51  50                -97 (NA) [OMITTED]
#>  8  8                   31  30                  3 [YELLOW]      
#>  9  9                   NA -98 (NA) [REFUSED] -98 (NA) [REFUSED]
#> 10 10                   NA -97 (NA) [OMITTED]   2 [RED]         
#> 11 11                   11  10                -98 (NA) [REFUSED]

It’s a small improvement over working with raw coded values, because you can use is.na(), and your codes get labels, so you don’t have to be constantly looking up codes in your codebook. But it still falls short of interlacer’s functionality for two key reasons:

Reason 1: With interlacer, your value column can be whatever type you want: numeric, character, factor, etc. With labelled missing reasons, values and missing reasons need to be the same type, usually numeric codes. This creates a lot more type gymnastics and potential errors when you’re manipulating them.

Reason 2: Even when the missing values are labelled in the labelled_spss type, aggregations and other math operations are not protected. If you forget to take out your missing values, you get incorrect results / corrupted data:

df_spss |>
  mutate(
    favorite_color_missing_reasons = if_else(
      is.na(favorite_color), favorite_color, NA
    )
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
#> # A tibble: 4 × 3
#>   favorite_color_missing_reasons mean_age     n
#>   <dbl+lbl>                         <dbl> <int>
#> 1  NA                               -20.8     5
#> 2 -98 (NA) [REFUSED]                -22.3     3
#> 3 -97 (NA) [OMITTED]                 40       2
#> 4 -99 (NA) [N/A]                      1       1

df_spss |>
  mutate(
    age_next_year = age + 1,
    .after = person_id
  )
#> # A tibble: 11 × 4
#>    person_id age_next_year age                favorite_color    
#>    <dbl+lbl>         <dbl> <dbl+lbl>          <dbl+lbl>         
#>  1  1                   21  20                  1 [BLUE]        
#>  2  2                  -97 -98 (NA) [REFUSED]   1 [BLUE]        
#>  3  3                   22  21                -98 (NA) [REFUSED]
#>  4  4                   31  30                -97 (NA) [OMITTED]
#>  5  5                    2   1                -99 (NA) [N/A]    
#>  6  6                   42  41                  2 [RED]         
#>  7  7                   51  50                -97 (NA) [OMITTED]
#>  8  8                   31  30                  3 [YELLOW]      
#>  9  9                  -97 -98 (NA) [REFUSED] -98 (NA) [REFUSED]
#> 10 10                  -96 -97 (NA) [OMITTED]   2 [RED]         
#> 11 11                   11  10                -98 (NA) [REFUSED]
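
Within haven itself, one way to guard against this is to convert user-defined missing values to plain NA before doing arithmetic. The sketch below applies haven::zap_missing() to a small made-up vector (the ages and codes are hypothetical). Note that this discards the missing reasons entirely, so it is a safety net rather than a way to keep both channels:

```r
library(haven)

# Hypothetical ages, with -98 (REFUSED) and -97 (OMITTED) as user-defined missings
age <- labelled_spss(
  c(20, -98, 21, 30, -97),
  labels = c(REFUSED = -98, OMITTED = -97),
  na_values = c(-98, -97)
)

# Unprotected: the missing codes leak into the result
mean(as.numeric(age))
#> [1] -24.8

# Protected: zap_missing() turns user-defined missings into regular NA first
age_values <- as.numeric(zap_missing(age))
mean(age_values, na.rm = TRUE)
#> [1] 23.66667
```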

“Tagged” missing values (`haven::tagged_na()`)

For loading Stata and SAS files, haven uses a “tagged missingness” approach to mirror how these values are handled in Stata and SAS:

(df_stata <- read_stata(
  interlacer_example("colors.dta")
))
#> # A tibble: 11 × 3
#>    person_id age             favorite_color 
#>    <dbl+lbl> <dbl+lbl>       <dbl+lbl>      
#>  1  1           20               1 [BLUE]   
#>  2  2        NA(a) [REFUSED]     1 [BLUE]   
#>  3  3           21           NA(a) [REFUSED]
#>  4  4           30           NA(b) [OMITTED]
#>  5  5            1              NA          
#>  6  6           41               2 [RED]    
#>  7  7           50           NA(b) [OMITTED]
#>  8  8           30               3 [YELLOW] 
#>  9  9        NA(a) [REFUSED] NA(a) [REFUSED]
#> 10 10        NA(b) [OMITTED]     2 [RED]    
#> 11 11           10           NA(a) [REFUSED]

This approach is deviously clever. It takes advantage of the way NaN floating point values are stored in memory to make it possible to have different “flavors” of NA values. (For more info on how this is done, check out tagged_na.c in the source code for haven.)

They still all act like regular NA values… but now they can include a single character “tag” (usually a letter from a-z). This means that they work with is.na() AND keep missing reason codes out of aggregations!

is.na(df_stata$age)
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

mean(df_stata$age, na.rm = TRUE)
#> [1] 25.375
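
The tags themselves can be inspected with haven's helper functions. The following is a small sketch on a made-up vector (not the colors.dta data), using haven::tagged_na(), haven::na_tag(), and haven::is_tagged_na():

```r
library(haven)

# A plain double vector with two differently tagged NAs and one regular NA
x <- c(20, tagged_na("a"), 21, tagged_na("b"), NA)

# All three behave as ordinary missing values
is.na(x)
#> [1] FALSE  TRUE FALSE  TRUE  TRUE

# na_tag() recovers the tag (NA for untagged or non-missing elements)
na_tag(x)
#> [1] NA  "a" NA  "b" NA

# is_tagged_na() tests for one specific tag
is_tagged_na(x, "a")
#> [1] FALSE  TRUE FALSE FALSE FALSE

# The tag is carried inside a double, so the vector is still numeric
typeof(x)
#> [1] "double"
```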

Unfortunately, you can’t group by them, because dplyr’s grouping (whether via dplyr::group_by() or the .by argument used below) is not tag-aware. :(

df_stata |>
  mutate(
    favorite_color_missing_reasons = if_else(
      is.na(favorite_color), favorite_color, NA
    )
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
#> # A tibble: 1 × 3
#>   favorite_color_missing_reasons mean_age     n
#>   <dbl+lbl>                         <dbl> <int>
#> 1 NA                                 25.4    11
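
A possible workaround, sketched below on made-up data, is to copy the tag into an ordinary character column with haven::na_tag() and group by that column instead. The data frame and the reason_tag column name are hypothetical:

```r
library(haven)
library(dplyr)

# Hypothetical data: favorite_color holds one value and three tagged NAs
df <- tibble(
  age = c(20, 21, 30, 50),
  favorite_color = c(1, tagged_na("a"), tagged_na("b"), tagged_na("a"))
)

by_reason <- df |>
  # Copy the tag into a plain character column that dplyr can group on
  mutate(reason_tag = na_tag(favorite_color)) |>
  summarize(mean_age = mean(age), n = n(), .by = reason_tag)

by_reason
```

This should yield one row per tag (NA for the non-missing value, then "a" and "b"), at the cost of maintaining a second column by hand and translating single-letter tags back into readable reasons.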

Another limitation of this approach is that it requires value types to be numeric, because the trick of “tagging” the NA values depends on the peculiarities of how floating point values are stored in memory.

declared

The declared package uses the function declared::declared() for constructing interlaced vectors:

library(declared)

(dcl <- declared(c(1, 2, 3, -99, -98), na_values = c(-99, -98)))
#> <declared<numeric>[5]>
#> [1]       1       2       3 NA(-99) NA(-98)
#> Missing values: -99, -98

declared vectors are similar to haven_labelled_spss vectors, except with a critical innovation: they store actual NA values where there are missing values, and then keep track of the missing reasons entirely in the attributes of the object:

# All the missing reason info is tracked in the attributes
attributes(dcl)
#> $na_index
#> -99 -98 
#>   4   5 
#> 
#> $na_values
#> [1] -99 -98
#> 
#> $date
#> [1] FALSE
#> 
#> $class
#> [1] "declared" "numeric"

# The data stored has actual NA values, so it works as you would expect
# with summary stats like `mean()`, etc.
attributes(dcl) <- NULL
dcl
#> [1]  1  2  3 NA NA

This means aggregations work exactly as you would expect!

dcl <- declared(c(1, 2, 3, -99, -98), na_values = c(-99, -98))

sum(dcl, na.rm = TRUE)
#> [1] 6

interlacer

interlacer builds on the ideas of haven, labelled, and declared with the following goals:

1. Be fully generic: Add a missing value channel to any vector type

As mentioned above, haven::labelled_spss() only works with numeric and character types, and haven::tagged_na() only works with numeric types. declared::declared() supports numeric, character and date types.

interlaced types, by contrast, can imbue any vector type with a missing value channel:

interlaced(list(TRUE, FALSE, "reason"), na = "reason")
#> <interlaced<lgl, fct>[3]>
#> [1]  TRUE    FALSE    <reason>
#> NA levels: reason

interlaced(c("2020-01-01", "2020-01-02", "reason"), na = "reason") |>
  map_value_channel(as.Date)
#> <interlaced<date, fct>[3]>
#> [1] 2020-01-01 2020-01-02 <reason>  
#> NA levels: reason


interlaced(c("red", "green", "reason"), na = "reason") |>
  map_value_channel(factor)
#> <interlaced<fct, fct>[3]>
#> [1] red      green    <reason>
#> Levels: green red 
#> NA levels: reason

Like declared vectors, the missing reasons are tracked in the attributes. But unlike declared, missing reasons are stored as an entirely separate channel rather than by tracking their indices:

(int <- interlaced(c(1,2,3, -99, -98), na = c(-99, -98)))
#> <interlaced<dbl, int>[5]>
#> [1]  1     2     3    <-99> <-98>

attributes(int)
#> $na_channel_values
#> [1]  NA  NA  NA -99 -98
#> 
#> $class
#> [1] "interlacer_interlaced" "vctrs_vctr"            "numeric"

attributes(int) <- NULL
int
#> [1]  1  2  3 NA NA

This data structure drives interlacer’s functional API, described in (3) below.

2. Provide functions for reading / writing interlaced CSV files (not just SPSS / SAS / Stata files)

See interlacer::read_interlaced_csv(), etc.
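
A minimal sketch of reading such a file follows. It assumes that the package ships a colors.csv example alongside colors.sav, and that missing reason codes are declared through an na argument. Both are assumptions here, so check ?read_interlaced_csv for the actual interface:

```r
library(interlacer, warn.conflicts = FALSE)

# Assumption: "colors.csv" is available through interlacer_example(), and its
# missing reason codes are REFUSED, OMITTED, and N/A, as in colors.sav
df_csv <- read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A")
)

df_csv
```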

3. Provide a functional API that integrates well into tidy pipelines

interlacer provides functions to facilitate working with the interlaced type as a Result type, a well-understood abstraction in functional programming. The functions na(), map_value_channel(), and map_na_channel() all come from this influence.

The na() function creates an interlaced type by “lifting” a value into the missing reason channel. This approach helps create a safer separation between the value and missing reason channels, because it’s always clear which channel you’re making comparisons on.

For example:

# haven
labelled_spss(c(-99, 1, 2), na_values = -99) == 1 # value channel comparison
#> [1] FALSE  TRUE FALSE
labelled_spss(c(-99, 1, 2), na_values = -99) == -99 # na channel comparison
#> [1]  TRUE FALSE FALSE

# declared
declared(c(-99, 1, 2), na_values = -99) == 1 # value channel comparison
#> [1] FALSE  TRUE FALSE
declared(c(-99, 1, 2), na_values = -99) == -99 # na channel comparison
#> [1]  TRUE FALSE FALSE

# interlacer 
interlaced(c(-99, 1, 2), na = -99) == 1 # value channel comparison
#> [1]    NA  TRUE FALSE
interlaced(c(-99, 1, 2), na = -99) == na(-99) # na channel comparison
#> [1] NA NA NA

Similarly, map_value_channel() and map_na_channel() allow you to safely mutate a particular channel, without touching the values of the other channel. This interface is especially useful in tidy pipelines.
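
As a rough sketch of what this looks like inside a pipeline, the example below uses a made-up data frame and assumes map_value_channel() takes the interlaced vector first and the function second, as in the piped calls earlier in this section:

```r
library(interlacer, warn.conflicts = FALSE)
library(dplyr)

# Hypothetical data: age uses -98 as a missing reason code
df <- tibble(
  person_id = 1:3,
  age = interlaced(c(20, -98, 21), na = -98)
)

df_next <- df |>
  mutate(
    # Only the value channel is transformed; the missing reason channel
    # is not passed to the function
    age_next_year = map_value_channel(age, \(v) v + 1)
  )

df_next
```

Compare this with the haven example above, where age + 1 silently turned -98 into -97. Here the arithmetic is applied to the value channel only.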

Finally, because the interlaced type is based on the vctrs type system, it plays nicely with all the packages in the tidyverse.