Introduction to interlacer • interlacer

In many datasets, reasons for missing values are interlaced with data as special values or codes. For example, consider the following CSV:

library(readr)
library(interlacer, warn.conflicts = FALSE)

read_file(interlacer_example("colors.csv")) |>
  cat()
#> person_id,age,favorite_color
#> 1,20,BLUE
#> 2,REFUSED,BLUE
#> 3,21,REFUSED
#> 4,30,OMITTED
#> 5,1,N/A
#> 6,41,RED
#> 7,50,OMITTED
#> 8,30,YELLOW
#> 9,REFUSED,REFUSED
#> 10,OMITTED,RED
#> 11,10,REFUSED

This csv data source has three variables: person_id and age, both numeric variables and favorite_color, a character or factor variable. Interlaced in their values are three possible missing reasons: REFUSED, OMITTED, and N/A.

To load the values of this data source, it is an easy call to the venerable readr::read_csv():

(df_simple <- read_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A")
))
#> # A tibble: 11 × 3
#>    person_id   age favorite_color
#>        <dbl> <dbl> <chr>         
#>  1         1    20 BLUE          
#>  2         2    NA BLUE          
#>  3         3    21 NA            
#>  4         4    30 NA            
#>  5         5     1 NA            
#>  6         6    41 RED           
#>  7         7    50 NA            
#>  8         8    30 YELLOW        
#>  9         9    NA NA            
#> 10        10    NA RED           
#> 11        11    10 NA

As you can see, the data were loaded into a dataframe with three columns, and all of the missing reasons were replaced with NA values.

Aggregations with missing reasons

Now, if we were only interested in the values of our source data, this functionality is all we need. But what if we wanted to know why some values were NA? Although that information was encoded in our source data, it was lost when all of the missing reasons were converted into NA values.

For example, consider the favorite_color column. How many respondents REFUSED to give their favorite color? How many people just OMITTED their answer? Was the question N/A for some respondents (e.g. wasn’t on their survey form)? What was the mean respondent age for each of these groups?

Our current dataframe only gets us part way:

library(dplyr, warn.conflicts = FALSE)

df_simple |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color
  ) |>
  arrange(favorite_color)
#> # A tibble: 4 × 3
#>   favorite_color mean_age     n
#>   <chr>             <dbl> <int>
#> 1 BLUE               20       2
#> 2 RED                41       2
#> 3 YELLOW             30       1
#> 4 NA                 22.4     6

As you can see, because we converted all our missing reasons into a single NA, we can only answer these questions about missingness in general, rather than work with the specific reasons stored in our source data.

Unfortunately, if we try load our data with the missing reasons intact, we lose something else: the type information of the values.

(df_with_missing <- read_csv(
  interlacer_example("colors.csv"),
  col_types = cols(.default = "c")
))
#> # A tibble: 11 × 3
#>    person_id age     favorite_color
#>    <chr>     <chr>   <chr>         
#>  1 1         20      BLUE          
#>  2 2         REFUSED BLUE          
#>  3 3         21      REFUSED       
#>  4 4         30      OMITTED       
#>  5 5         1       N/A           
#>  6 6         41      RED           
#>  7 7         50      OMITTED       
#>  8 8         30      YELLOW        
#>  9 9         REFUSED REFUSED       
#> 10 10        OMITTED RED           
#> 11 11        10      REFUSED

Now we have access to our missing reasons, but all the columns are character vectors. This means that in order to do anything with our values, we always have to filter out the missing reasons, and cast the remaining values to our desired type:

reasons <- c("REFUSED", "OMITTED", "N/A")

df_with_missing |>
  mutate(
    age_values = as.numeric(if_else(age %in% reasons, NA, age)),
  ) |>
  summarize(
    mean_age = mean(age_values, na.rm = TRUE),
    n = n(),
    .by = favorite_color
  ) |>
  arrange(favorite_color)
#> # A tibble: 6 × 3
#>   favorite_color mean_age     n
#>   <chr>             <dbl> <int>
#> 1 BLUE               20       2
#> 2 N/A                 1       1
#> 3 OMITTED            40       2
#> 4 RED                41       2
#> 5 REFUSED            15.5     3
#> 6 YELLOW             30       1

This gives us the information we want, but it is cumbersome. Notice how there’s no distinction between favorite color values and missing reasons! Things start to get really complex when different columns have different sets of possible missing reasons. It means you have to do a lot of type conversion gymnastics to switch between value types and missing types.

The interlacer approach

interlacer was built based on the insight that everything becomes much more tidy, simple, and expressive when we explicitly work with values and missing reasons as separate channels of the same variable. interlacer introduces a new interlaced column type that facilitates this. The read_interlaced_* functions in interlacer import data with this new column type.

(df <- read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A")
))
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <dbl,fct> <dbl,fct> <chr,fct>     
#>  1         1        20 BLUE          
#>  2         2 <REFUSED> BLUE          
#>  3         3        21 <REFUSED>     
#>  4         4        30 <OMITTED>     
#>  5         5         1 <N/A>         
#>  6         6        41 RED           
#>  7         7        50 <OMITTED>     
#>  8         8        30 YELLOW        
#>  9         9 <REFUSED> <REFUSED>     
#> 10        10 <OMITTED> RED           
#> 11        11        10 <REFUSED>

As you can see by the column headers, each column loaded is composed of two channels: a value channel, and missing reason channel. Each channel can have its own type. The age column, for example, has double values and factor missing reasons:

df$age
#> <interlaced<dbl, fct>[11]>
#>  [1] 20        <REFUSED> 21        30         1        41        50       
#>  [8] 30        <REFUSED> <OMITTED> 10       
#> NA levels: REFUSED OMITTED N/A

These channels can be explicitly accessed using the value_channel() and na_channel() helper functions:

value_channel(df$age)
#>  [1] 20 NA 21 30  1 41 50 30 NA NA 10
na_channel(df$age)
#>  [1] <NA>    REFUSED <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    REFUSED
#> [10] OMITTED <NA>   
#> Levels: REFUSED OMITTED N/A

These helpers are rarely needed, however, because computations automatically operate on an interlaced column’s value channel, and ignore the missing reasons channel. The following will compute the mean age, without the missing reasons interfering:

mean(df$age, na.rm = TRUE)
#> [1] 25.375

(We could have equivalently used the value_channel() helper to achieve the same result, albeit with more verbosity):

mean(value_channel(df$age), na.rm = TRUE)
#> [1] 25.375

Although missing reasons are excluded in computations, they still treated as unique values. This means that when we group by age we get a breakdown by the unique missing reasons, rather than being lumped into a single NA:

df |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color
  ) |>
  arrange(favorite_color)
#> # A tibble: 4 × 3
#>   favorite_color mean_age     n
#>   <chr,fct>         <dbl> <int>
#> 1 BLUE               20       2
#> 2 RED                41       2
#> 3 YELLOW             30       1
#> 4 <REFUSED>          22.4     6

As you can see, we can generate the same report as we did before, but without needing to do any type gymnastics! Also, the values are neatly distinguished from the missing reasons.

Filtering based on missing reasons

interlaced columns are also helpful when creating samples with inclusion / exclusion criteria based on missing reasons. For example, using our example data, say we wanted to create a sample of respondents that REFUSED to give their age. To indicate that a value should be interpreted as a missing reason, you can use the na() function with the value:

df |>
  filter(age == na("REFUSED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,fct>, age <dbl,fct>, favorite_color <chr,fct>

# na_channel() can also be used to get an equivalent result:
df |>
  filter(na_channel(age) == "REFUSED")
#> # A tibble: 2 × 3
#>   person_id       age favorite_color
#>   <dbl,fct> <dbl,fct> <chr,fct>     
#> 1         2 <REFUSED> BLUE          
#> 2         9 <REFUSED> <REFUSED>

How about people who REFUSED to report their age AND favorite color?

df |>
  filter(age == na("REFUSED") & favorite_color == na("REFUSED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,fct>, age <dbl,fct>, favorite_color <chr,fct>

It’s also possible to combine value conditions with missing reason conditions. For example, this will select everyone who REFUSED to give their favorite color, and was over 20 years old:

df |>
  filter(age > 20 & favorite_color == na("REFUSED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,fct>, age <dbl,fct>, favorite_color <chr,fct>

Mutations

As you might expect, the na() function can be used with values in mutations. The following pipeline will replace the favorite color of respondents with the missing value "REDACTED"

df |>
  mutate(
    favorite_color = na("REDACTED")
  )
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <dbl,fct> <dbl,fct>      <???,fct>
#>  1         1        20     <REDACTED>
#>  2         2 <REFUSED>     <REDACTED>
#>  3         3        21     <REDACTED>
#>  4         4        30     <REDACTED>
#>  5         5         1     <REDACTED>
#>  6         6        41     <REDACTED>
#>  7         7        50     <REDACTED>
#>  8         8        30     <REDACTED>
#>  9         9 <REFUSED>     <REDACTED>
#> 10        10 <OMITTED>     <REDACTED>
#> 11        11        10     <REDACTED>

Conditionals also work exactly as you would expect in mutations. The following will replace the favorite color of respondents with age < 18 with the missing reason "REDACTED_UNDERAGE". Respondents missing an age will be replaced with "REDACTED_MISSING_AGE"

df |>
  mutate(
    favorite_color = if_else(
      age < 18,
      na("REDACTED_UNDERAGE"),
      favorite_color,
      missing = na("REDACTED_MISSING_AGE")
    )
  )
#> # A tibble: 11 × 3
#>    person_id       age favorite_color        
#>    <dbl,fct> <dbl,fct> <chr,fct>             
#>  1         1        20 BLUE                  
#>  2         2 <REFUSED> <REDACTED_MISSING_AGE>
#>  3         3        21 <REFUSED>             
#>  4         4        30 <OMITTED>             
#>  5         5         1 <REDACTED_UNDERAGE>   
#>  6         6        41 RED                   
#>  7         7        50 <OMITTED>             
#>  8         8        30 YELLOW                
#>  9         9 <REFUSED> <REDACTED_MISSING_AGE>
#> 10        10 <OMITTED> <REDACTED_MISSING_AGE>
#> 11        11        10 <REDACTED_UNDERAGE>

The following mutation will create a new column called person_type that will be "CHILD" when age < 18, "ADULT" when age >= 18, and missing with reason "AGE_UNAVAILABLE" when age is missing:

df |>
  mutate(
    person_type = if_else(
      age < 18,
      "CHILD",
      "ADULT",
      missing = na("AGE_UNAVAILABLE")
    ),
  )
#> # A tibble: 11 × 4
#>    person_id       age favorite_color person_type      
#>    <dbl,fct> <dbl,fct> <chr,fct>      <chr,fct>        
#>  1         1        20 BLUE           ADULT            
#>  2         2 <REFUSED> BLUE           <AGE_UNAVAILABLE>
#>  3         3        21 <REFUSED>      ADULT            
#>  4         4        30 <OMITTED>      ADULT            
#>  5         5         1 <N/A>          CHILD            
#>  6         6        41 RED            ADULT            
#>  7         7        50 <OMITTED>      ADULT            
#>  8         8        30 YELLOW         ADULT            
#>  9         9 <REFUSED> <REFUSED>      <AGE_UNAVAILABLE>
#> 10        10 <OMITTED> RED            <AGE_UNAVAILABLE>
#> 11        11        10 <REFUSED>      CHILD

Important note: You must use dplyr::if_else() with interlaced vectors instead of R’s base::ifelse() function, because the base function strips out the missing reason channel due to a fundamental limitation in base R.

Multichannel comparison

Comparison operators (e.g. == and !=) compare value channels, and return NA whenever missing values are involved:

5 == NA
#> [1] NA
5 == na("reason")
#> [1] NA

This means that missing reasons cannot be directly compared:

na("reason1") == na("reason2")
#> [1] NA

To simultaneously compare values and missing reason channels, interlacer provides the multichannel comparison operators %==% and %!=%. These operators always return boolean vectors, and never return missing values. With multichannel comparison, two values are considered equal iff:

they are both values, and the values are equal OR
they are both missing values, and their reasons are the same OR
they are both missing values, and both lack missing reasons

This way, if we convert the previous example to use multichannel comparison, we get:

na("reason1") %==% na("reason2")
#> [1] FALSE

Similarly, comparisons with values return booleans rather than NA,

5 %==% NA
#> [1] FALSE
5 %==% na("reason")
#> [1] FALSE

And, as a consequence, regular NA values are equivalent to interlaced values that do not have a missing reason:

NA %==% na()
#> [1] TRUE

# Alternatively this can be considered an `NA` missing reason:
NA %==% na(NA)
#> [1] TRUE

Missing values that are missing their missing reasons are called “empty values”, and discussed in the next section.

Empty values

Empty values are missing values that lack missing reasons. These values can arise when missing actions are not specified. For example, in the example, empty values are created because the missing = argument in the if_else() statement is not specified:

df |>
  mutate(
    favorite_color = if_else(
      age < 18,
      na("REDACTED_UNDERAGE"),
      favorite_color,
    )
  )
#> # A tibble: 11 × 3
#>    person_id       age favorite_color     
#>    <dbl,fct> <dbl,fct> <chr,fct>          
#>  1         1        20 BLUE               
#>  2         2 <REFUSED> <<NA>>             
#>  3         3        21 <REFUSED>          
#>  4         4        30 <OMITTED>          
#>  5         5         1 <REDACTED_UNDERAGE>
#>  6         6        41 RED                
#>  7         7        50 <OMITTED>          
#>  8         8        30 YELLOW             
#>  9         9 <REFUSED> <<NA>>             
#> 10        10 <OMITTED> <<NA>>             
#> 11        11        10 <REDACTED_UNDERAGE>

As you can see, empty values are displayed as <<NA>>.

Empty values can be detected by using the multichannel comparsion operator:

df |>
  mutate(
    favorite_color = if_else(
      age < 18,
      na("REDACTED_UNDERAGE"),
      favorite_color,
    )
  ) |>
  filter(favorite_color %==% na())
#> # A tibble: 3 × 3
#>   person_id       age favorite_color
#>   <dbl,fct> <dbl,fct> <chr,fct>     
#> 1         2 <REFUSED> <<NA>>        
#> 2         9 <REFUSED> <<NA>>        
#> 3        10 <OMITTED> <<NA>>

(Note that regular equality does not work, because it would return NA for all the comparisons!)

Empty values often occur as the result of joins, because the dplyr::*_join() family of functions do not have a missing = parameter, like dplyr::if_else() does. For example, say we had the following data frame we wanted to join to our sample:

conditions <- tribble(
  ~person_id, ~condition,
  1, "TREATMENT",
  2, "CONTROL",
  3, na("TECHNICAL_ERROR"),
  6, "CONTROL",
  8, "TREATMENT",
)

Because we’re missing condition information for some of the respondents, these show up as empty values when we join the data frame to our sample:

df |>
  left_join(conditions, by = join_by(person_id))
#> # A tibble: 11 × 4
#>    person_id       age favorite_color condition        
#>    <dbl,fct> <dbl,fct> <chr,fct>      <chr,fct>        
#>  1         1        20 BLUE           TREATMENT        
#>  2         2 <REFUSED> BLUE           CONTROL          
#>  3         3        21 <REFUSED>      <TECHNICAL_ERROR>
#>  4         4        30 <OMITTED>      <<NA>>           
#>  5         5         1 <N/A>          <<NA>>           
#>  6         6        41 RED            CONTROL          
#>  7         7        50 <OMITTED>      <<NA>>           
#>  8         8        30 YELLOW         TREATMENT        
#>  9         9 <REFUSED> <REFUSED>      <<NA>>           
#> 10        10 <OMITTED> RED            <<NA>>           
#> 11        11        10 <REFUSED>      <<NA>>

We can remedy this by replacing these empty values after the join:

df |>
  left_join(conditions, by = join_by(person_id)) |>
  mutate(
    condition = if_else(condition %==% na(), na("LEFT_STUDY"), condition),
  )
#> # A tibble: 11 × 4
#>    person_id       age favorite_color condition        
#>    <dbl,fct> <dbl,fct> <chr,fct>      <chr,fct>        
#>  1         1        20 BLUE           TREATMENT        
#>  2         2 <REFUSED> BLUE           CONTROL          
#>  3         3        21 <REFUSED>      <TECHNICAL_ERROR>
#>  4         4        30 <OMITTED>      <LEFT_STUDY>     
#>  5         5         1 <N/A>          <LEFT_STUDY>     
#>  6         6        41 RED            CONTROL          
#>  7         7        50 <OMITTED>      <LEFT_STUDY>     
#>  8         8        30 YELLOW         TREATMENT        
#>  9         9 <REFUSED> <REFUSED>      <LEFT_STUDY>     
#> 10        10 <OMITTED> RED            <LEFT_STUDY>     
#> 11        11        10 <REFUSED>      <LEFT_STUDY>

Writing interlaced files

After you’ve made made changes to your data, you probably want to save them. interlacer provides the write_interlaced_* family of functions for this:

write_interlaced_csv(df, "interlaced_output.csv")

This will combine the value and missing reasons into interlaced character columns, and write the result as a csv. Alternatively, if you want to re-interlace the columns without writing to a file for more control in the writing process, you can use flatten_channels():

flatten_channels(df)
#> # A tibble: 11 × 3
#>    person_id age     favorite_color
#>        <dbl> <chr>   <chr>         
#>  1         1 20      BLUE          
#>  2         2 REFUSED BLUE          
#>  3         3 21      REFUSED       
#>  4         4 30      OMITTED       
#>  5         5 1       N/A           
#>  6         6 41      RED           
#>  7         7 50      OMITTED       
#>  8         8 30      YELLOW        
#>  9         9 REFUSED REFUSED       
#> 10        10 OMITTED RED           
#> 11        11 10      REFUSED

# (it works on single vectors as well)
flatten_channels(df$age)
#>  [1] "20"      "REFUSED" "21"      "30"      "1"       "41"      "50"     
#>  [8] "30"      "REFUSED" "OMITTED" "10"

The value and missing reason channels of data frames with interlaced vectors can similarly be accessed using the value_channel() and na_channel() helper functions:

value_channel(df)
#> # A tibble: 11 × 3
#>    person_id   age favorite_color
#>        <dbl> <dbl> <chr>         
#>  1         1    20 BLUE          
#>  2         2    NA BLUE          
#>  3         3    21 NA            
#>  4         4    30 NA            
#>  5         5     1 NA            
#>  6         6    41 RED           
#>  7         7    50 NA            
#>  8         8    30 YELLOW        
#>  9         9    NA NA            
#> 10        10    NA RED           
#> 11        11    10 NA
na_channel(df)
#> # A tibble: 11 × 3
#>    person_id age     favorite_color
#>    <fct>     <fct>   <fct>         
#>  1 NA        NA      NA            
#>  2 NA        REFUSED NA            
#>  3 NA        NA      REFUSED       
#>  4 NA        NA      OMITTED       
#>  5 NA        NA      N/A           
#>  6 NA        NA      NA            
#>  7 NA        NA      OMITTED       
#>  8 NA        NA      NA            
#>  9 NA        REFUSED REFUSED       
#> 10 NA        OMITTED NA            
#> 11 NA        NA      REFUSED

Next steps

So far, we’ve covered how interlacer’s read_interlaced_* family of functions enabled us to load interlaced columns that contain separate challens for value and missing reasons. The interlaced type enables us to create tidy and type-aware pipelines that can flexibly consider a variable’s value AND missing reasons.

In all the examples in this vignette, column types were automatically detected. To explicitly specify value and missing column types, (and specify individual missing reasons for specific columns), interlacer extends readr’s collector() system. This will be covered in the next vignette, vignette("na-column-types").