interlacer was inspired by the haven, labelled, and declared packages. These packages provide similar functionality to interlacer, but are more focused on providing compatibility with missing reason data imported from SPSS, SAS, and Stata.
In this section I discuss some of the particularities of these approaches, and how they compare with interlacer.
(Note: Future versions of interlacer will have the ability to convert haven_labelled and declared types to and from interlaced types.)
haven and labelled
The haven and labelled packages rely on two functions for creating vectors that interlace values and missing reasons: haven::labelled_spss() and haven::tagged_na(). Although they both create haven_labelled vectors, they use very different methods for representing missing values.
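To make the contrast concrete before looking at real files, here is a minimal sketch that builds one small vector with each function. The data and labels are made up for illustration; only documented haven functions are used.

```r
library(haven)

# labelled_spss(): the missing reason is an ordinary code (-99) that is
# flagged as missing through the vector's attributes
x_spss <- labelled_spss(
  c(1, 2, -99),
  labels = c(BLUE = 1, RED = 2, REFUSED = -99),
  na_values = -99
)

# tagged_na(): the missing reason is a real NA that carries a one-letter tag
x_tagged <- labelled(
  c(1, 2, tagged_na("a")),
  labels = c(BLUE = 1, RED = 2, REFUSED = tagged_na("a"))
)

# Both report the third element as missing...
is.na(x_spss)
is.na(x_tagged)

# ...but the underlying storage differs
unclass(x_spss)[3]  # still the code -99
na_tag(x_tagged)    # the tag of each element (NA where there is no tag)
```

The rest of this section walks through what each representation means in practice.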
“Labelled” missing values (haven::labelled_spss())
When SPSS files are loaded with haven via haven::read_spss(), values and missing reasons are loaded into a single interlaced numeric vector:
library(interlacer, warn.conflicts = FALSE)
library(haven)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
(df_spss <- read_spss(
  interlacer_example("colors.sav"), user_na = TRUE
))
#> # A tibble: 11 × 3
#>    person_id age                favorite_color
#>    <dbl+lbl> <dbl+lbl>          <dbl+lbl>
#>  1  1         20                  1 [BLUE]
#>  2  2        -98 (NA) [REFUSED]   1 [BLUE]
#>  3  3         21                -98 (NA) [REFUSED]
#>  4  4         30                -97 (NA) [OMITTED]
#>  5  5          1                -99 (NA) [N/A]
#>  6  6         41                  2 [RED]
#>  7  7         50                -97 (NA) [OMITTED]
#>  8  8         30                  3 [YELLOW]
#>  9  9        -98 (NA) [REFUSED] -98 (NA) [REFUSED]
#> 10 10        -97 (NA) [OMITTED]   2 [RED]
#> 11 11         10                -98 (NA) [REFUSED]
Not just any numeric vector, though: it is a haven::labelled_spss() numeric vector, with attributes describing its value labels and missing value codes:
attributes(df_spss$favorite_color)
#> $label
#> [1] "Favorite color"
#>
#> $na_range
#> [1] -Inf 0
#>
#> $class
#> [1] "haven_labelled_spss" "haven_labelled"      "vctrs_vctr"
#> [4] "double"
#>
#> $format.spss
#> [1] "F8.2"
#>
#> $labels
#>    BLUE     RED  YELLOW     N/A REFUSED OMITTED
#>       1       2       3     -99     -98     -97
These attributes adjust the behavior of functions like is.na():
is.na(df_spss$favorite_color)
#> [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
This makes it easy to check if a value is a missing reason, but you still have to filter out missing reasons before you do any aggregations:
df_spss |>
  mutate(
    age_values = if_else(is.na(age), NA, age),
    favorite_color_missing_reasons = if_else(
      is.na(favorite_color), favorite_color, NA
    )
  ) |>
  summarize(
    mean_age = mean(age_values, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
#> # A tibble: 4 × 3
#>   favorite_color_missing_reasons mean_age     n
#>   <dbl+lbl>                         <dbl> <int>
#> 1  NA                                30.3     5
#> 2 -98 (NA) [REFUSED]                 15.5     3
#> 3 -97 (NA) [OMITTED]                 40       2
#> 4 -99 (NA) [N/A]                      1       1
df_spss |>
  mutate(
    age_next_year = if_else(is.na(age), NA, age + 1),
    .after = person_id
  )
#> # A tibble: 11 × 4
#>    person_id age_next_year age                favorite_color
#>    <dbl+lbl>         <dbl> <dbl+lbl>          <dbl+lbl>
#>  1  1                   21  20                  1 [BLUE]
#>  2  2                   NA -98 (NA) [REFUSED]   1 [BLUE]
#>  3  3                   22  21                -98 (NA) [REFUSED]
#>  4  4                   31  30                -97 (NA) [OMITTED]
#>  5  5                    2   1                -99 (NA) [N/A]
#>  6  6                   42  41                  2 [RED]
#>  7  7                   51  50                -97 (NA) [OMITTED]
#>  8  8                   31  30                  3 [YELLOW]
#>  9  9                   NA -98 (NA) [REFUSED] -98 (NA) [REFUSED]
#> 10 10                   NA -97 (NA) [OMITTED]   2 [RED]
#> 11 11                   11  10                -98 (NA) [REFUSED]
It’s a little bit of an improvement over working with raw coded values, because you can use is.na(), and your codes get labels, so you don’t have to be constantly looking up codes in your codebook. But it still falls short of interlacer’s functionality for two key reasons:
Reason 1: With interlacer, your value column can be whatever type you want: numeric, character, factor, etc. With labelled missing reasons, values and missing reasons need to be the same type, usually numeric codes. This creates a lot more type gymnastics and potential errors when you’re manipulating them.
Reason 2: Even when the missing values are labelled in the labelled_spss type, aggregations and other math operations are not protected. If you forget to take out your missing values, you get incorrect results / corrupted data:
df_spss |>
  mutate(
    favorite_color_missing_reasons = if_else(
      is.na(favorite_color), favorite_color, NA
    )
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
#> # A tibble: 4 × 3
#>   favorite_color_missing_reasons mean_age     n
#>   <dbl+lbl>                         <dbl> <int>
#> 1  NA                               -20.8     5
#> 2 -98 (NA) [REFUSED]                -22.3     3
#> 3 -97 (NA) [OMITTED]                 40       2
#> 4 -99 (NA) [N/A]                      1       1
df_spss |>
  mutate(
    age_next_year = age + 1,
    .after = person_id
  )
#> # A tibble: 11 × 4
#>    person_id age_next_year age                favorite_color
#>    <dbl+lbl>         <dbl> <dbl+lbl>          <dbl+lbl>
#>  1  1                   21  20                  1 [BLUE]
#>  2  2                  -97 -98 (NA) [REFUSED]   1 [BLUE]
#>  3  3                   22  21                -98 (NA) [REFUSED]
#>  4  4                   31  30                -97 (NA) [OMITTED]
#>  5  5                    2   1                -99 (NA) [N/A]
#>  6  6                   42  41                  2 [RED]
#>  7  7                   51  50                -97 (NA) [OMITTED]
#>  8  8                   31  30                  3 [YELLOW]
#>  9  9                  -97 -98 (NA) [REFUSED] -98 (NA) [REFUSED]
#> 10 10                  -96 -97 (NA) [OMITTED]   2 [RED]
#> 11 11                   11  10                -98 (NA) [REFUSED]
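If you stay within haven, one way to guard an aggregation is to convert the coded missing values to plain NA first with haven::zap_missing(). The sketch below uses a small made-up vector rather than the example file, and it is only one possible workaround, not something interlacer requires.

```r
library(haven)

# A small labelled_spss vector: -98 is a missing reason code, not an age
age <- labelled_spss(
  c(20, -98, 21, 30),
  labels = c(REFUSED = -98),
  na_range = c(-Inf, 0)
)

# Unprotected: the -98 code is averaged in with the real ages
mean(age, na.rm = TRUE)

# Guarded: zap_missing() turns coded missing values into plain NA first
age_zapped <- zap_missing(age)
mean(age_zapped, na.rm = TRUE)
```

The trade-off is that zapping throws the missing reasons away, so you have to keep a separate copy of the column if you still need them.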
“Tagged” missing values (haven::tagged_na())
For loading Stata and SAS files, haven uses a “tagged missingness” approach to mirror how these values are handled in Stata and SAS:
(df_stata <- read_stata(
  interlacer_example("colors.dta")
))
#> # A tibble: 11 × 3
#>    person_id age             favorite_color
#>    <dbl+lbl> <dbl+lbl>       <dbl+lbl>
#>  1  1        20              1 [BLUE]
#>  2  2        NA(a) [REFUSED] 1 [BLUE]
#>  3  3        21              NA(a) [REFUSED]
#>  4  4        30              NA(b) [OMITTED]
#>  5  5         1              NA
#>  6  6        41              2 [RED]
#>  7  7        50              NA(b) [OMITTED]
#>  8  8        30              3 [YELLOW]
#>  9  9        NA(a) [REFUSED] NA(a) [REFUSED]
#> 10 10        NA(b) [OMITTED] 2 [RED]
#> 11 11        10              NA(a) [REFUSED]
This approach is deviously clever. It takes advantage of the way NaN floating point values are stored in memory, to make it possible to have different “flavors” of NA values. (For more info on how this is done, check out tagged_na.c in the source code for haven.)
They still all act like regular NA values… but now they can include a single character “tag” (usually a letter from a-z). This means that they work with is.na() AND will not include missing reason codes in aggregations!
is.na(df_stata$age)
#> [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
mean(df_stata$age, na.rm = TRUE)
#> [1] 25.375
Unfortunately, you can’t group by them, because dplyr::group_by() is not tag-aware. :(
df_stata |>
  mutate(
    favorite_color_missing_reasons = if_else(
      is.na(favorite_color), favorite_color, NA
    )
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
#> # A tibble: 1 × 3
#>   favorite_color_missing_reasons mean_age     n
#>   <dbl+lbl>                         <dbl> <int>
#> 1 NA                                 25.4    11
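A possible workaround, sketched here on a small made-up tibble rather than the example file, is to pull the tags out into an ordinary character column with haven::na_tag() and group by that column instead. The column name favorite_color_tag is just an illustrative choice.

```r
library(haven)
library(dplyr)

# Made-up data: one real value and three tagged NAs with two distinct tags
df <- tibble(
  age = c(20, 21, 30, 50),
  favorite_color = c(1, tagged_na("a"), tagged_na("b"), tagged_na("b"))
)

# na_tag() returns the tag as a character vector (NA where there is no tag),
# which dplyr can group by like any other column
by_tag <- df |>
  mutate(favorite_color_tag = na_tag(favorite_color)) |>
  summarize(
    mean_age = mean(age),
    n = n(),
    .by = favorite_color_tag
  )
by_tag
```

This works, but it means carrying the tags around in a second column by hand, which is the kind of bookkeeping an interlaced type is meant to remove.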
Another limitation of this approach is that it requires value types to be numeric, because the trick of “tagging” the NA values depends on the peculiarities of how floating point values are stored in memory.
declared
The declared package uses the function declared::declared() for constructing interlaced vectors:
library(declared)
(dcl <- declared(c(1, 2, 3, -99, -98), na_values = c(-99, -98)))
#> <declared<numeric>[5]>
#> [1] 1 2 3 NA(-99) NA(-98)
#> Missing values: -99, -98
declared vectors are similar to haven_labelled_spss vectors, except with a critical innovation: they store actual NA values where there are missing values, and then keep track of the missing reasons entirely in the attributes of the object:
# All the missing reason info is tracked in the attributes
attributes(dcl)
#> $na_index
#> -99 -98
#>   4   5
#>
#> $na_values
#> [1] -99 -98
#>
#> $date
#> [1] FALSE
#>
#> $class
#> [1] "declared" "numeric"
# The data stored has actual NA values, so it works as you would expect
# with summary stats like `mean()`, etc.
attributes(dcl) <- NULL
dcl
#> [1] 1 2 3 NA NA
This means aggregations work exactly as you would expect!
interlacer
interlacer builds on the ideas of haven, labelled, and declared with the following goals:
1. Be fully generic: Add a missing value channel to any vector type
As mentioned above, haven::labelled_spss() only works with numeric and character types, and haven::tagged_na() only works with numeric types. declared::declared() supports numeric, character and date types.
interlaced types, by contrast, can imbue any vector type with a missing value channel:
interlaced(list(TRUE, FALSE, "reason"), na = "reason")
#> <interlaced<lgl, fct>[3]>
#> [1] TRUE FALSE <reason>
#> NA levels: reason
interlaced(c("2020-01-01", "2020-01-02", "reason"), na = "reason") |>
  map_value_channel(as.Date)
#> <interlaced<date, fct>[3]>
#> [1] 2020-01-01 2020-01-02 <reason>
#> NA levels: reason
interlaced(c("red", "green", "reason"), na = "reason") |>
  map_value_channel(factor)
#> <interlaced<fct, fct>[3]>
#> [1] red green <reason>
#> Levels: green red
#> NA levels: reason
Like declared vectors, the missing reasons are tracked in the attributes. But unlike declared, missing reasons are stored as an entirely separate channel rather than by tracking their indices:
(int <- interlaced(c(1, 2, 3, -99, -98), na = c(-99, -98)))
#> <interlaced<dbl, int>[5]>
#> [1] 1 2 3 <-99> <-98>
attributes(int)
#> $na_channel_values
#> [1]  NA  NA  NA -99 -98
#>
#> $class
#> [1] "interlacer_interlaced" "vctrs_vctr" "numeric"
attributes(int) <- NULL
int
#> [1] 1 2 3 NA NA
This data structure drives interlacer’s functional API, described in (3) below.
2. Provide functions for reading / writing interlaced CSV files (not just SPSS / SAS / Stata files)
See interlacer::read_interlaced_csv(), etc.
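As a rough sketch of what that looks like, the snippet below writes a tiny made-up CSV to a temporary file and reads it back. It assumes read_interlaced_csv() accepts a file path plus an `na` argument listing the strings to treat as missing reasons; check the function's help page for the exact arguments.

```r
library(interlacer)

# Write a tiny CSV in which missing reasons sit in the same column as values
path <- tempfile(fileext = ".csv")
writeLines(
  c("person_id,age", "1,20", "2,REFUSED", "3,OMITTED"),
  path
)

# Assumption: `na` lists the strings to route into the missing reason channel
df_csv <- read_interlaced_csv(path, na = c("REFUSED", "OMITTED"))
df_csv
```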
3. Provide a functional API that integrates well into tidy pipelines
interlacer provides functions to facilitate working with the interlaced type as a Result type, a well-understood abstraction in functional programming. The functions na(), map_value_channel(), and map_na_channel() all come from this influence.
The na() function creates an interlaced type by “lifting” a value into the missing reason channel. This approach helps create a safer separation between the value and missing reason channels, because it’s always clear which channel you’re making comparisons on.
For example:
# haven
labelled_spss(c(-99, 1, 2), na_values = -99) == 1 # value channel comparison
#> [1] FALSE  TRUE FALSE
labelled_spss(c(-99, 1, 2), na_values = -99) == -99 # na channel comparison
#> [1]  TRUE FALSE FALSE
# declared
declared(c(-99, 1, 2), na_values = -99) == 1 # value channel comparison
#> [1] FALSE  TRUE FALSE
declared(c(-99, 1, 2), na_values = -99) == -99 # na channel comparison
#> [1]  TRUE FALSE FALSE
# interlacer
interlaced(c(-99, 1, 2), na = -99) == 1 # value channel comparison
#> [1]    NA  TRUE FALSE
interlaced(c(-99, 1, 2), na = -99) == na(-99) # na channel comparison
#> [1] NA NA NA
Similarly, map_value_channel() and map_na_channel() allow you to safely mutate a particular channel, without touching the values of the other channel. This interface is especially useful in tidy pipelines.
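Here is a minimal sketch of mapping each channel on its own. The map_value_channel() call follows the usage shown earlier in this section; the map_na_channel() call assumes it takes the same two arguments (the vector, then a function), and the REFUSED label is made up for illustration.

```r
library(interlacer)

x <- interlaced(c(1, 2, 3, -99), na = -99)

# Rescale only the values; the missing reason channel is left alone
x_scaled <- map_value_channel(x, \(v) v * 10)

# Relabel only the missing reasons; the values are left alone
# (assumes map_na_channel() takes the same arguments as map_value_channel())
x_relabelled <- map_na_channel(
  x,
  \(r) factor(r, levels = -99, labels = "REFUSED")
)

x_scaled
x_relabelled
```

Because each function only ever sees one channel, a transformation written for values cannot accidentally be applied to a missing reason code.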
Finally, because the interlaced type is based on the vctrs type system, it plays nicely with all the packages in the tidyverse.