In many datasets, reasons for missing values are interlaced with data as special values or codes. For example, consider the following CSV:
library(readr)
library(interlacer, warn.conflicts = FALSE)
read_file(interlacer_example("colors.csv")) |>
cat()
#> person_id,age,favorite_color
#> 1,20,BLUE
#> 2,REFUSED,BLUE
#> 3,21,REFUSED
#> 4,30,OMITTED
#> 5,1,N/A
#> 6,41,RED
#> 7,50,OMITTED
#> 8,30,YELLOW
#> 9,REFUSED,REFUSED
#> 10,OMITTED,RED
#> 11,10,REFUSED
This csv data source has three variables: person_id
and
age
, both numeric variables and
favorite_color
, a character or factor variable. Interlaced
in their values are three possible missing reasons:
REFUSED
, OMITTED
, and N/A
.
To load the values of this data source, it is an easy call to the
venerable readr::read_csv()
:
(df_simple <- read_csv(
interlacer_example("colors.csv"),
na = c("REFUSED", "OMITTED", "N/A")
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl> <dbl> <chr>
#> 1 1 20 BLUE
#> 2 2 NA BLUE
#> 3 3 21 NA
#> 4 4 30 NA
#> 5 5 1 NA
#> 6 6 41 RED
#> 7 7 50 NA
#> 8 8 30 YELLOW
#> 9 9 NA NA
#> 10 10 NA RED
#> 11 11 10 NA
As you can see, the data were loaded into a dataframe with three
columns, and all of the missing reasons were replaced with
NA
values.
Aggregations with missing reasons
Now, if we were only interested in the values of our source
data, this functionality is all we need. But what if we wanted to know
why some values were NA
? Although that information
was encoded in our source data, it was lost when all of the missing
reasons were converted into NA
values.
For example, consider the favorite_color
column. How
many respondents REFUSED
to give their favorite color? How
many people just OMITTED
their answer? Was the question
N/A
for some respondents (e.g. wasn’t on their survey
form)? What was the mean respondent age for each of these groups?
Our current dataframe only gets us part way:
library(dplyr, warn.conflicts = FALSE)
df_simple |>
summarize(
mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
arrange(favorite_color)
#> # A tibble: 4 × 3
#> favorite_color mean_age n
#> <chr> <dbl> <int>
#> 1 BLUE 20 2
#> 2 RED 41 2
#> 3 YELLOW 30 1
#> 4 NA 22.4 6
As you can see, because we converted all our missing reasons into a
single NA
, we can only answer these questions about
missingness in general, rather than work with the specific reasons
stored in our source data.
Unfortunately, if we try load our data with the missing reasons intact, we lose something else: the type information of the values.
(df_with_missing <- read_csv(
interlacer_example("colors.csv"),
col_types = cols(.default = "c")
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <chr> <chr> <chr>
#> 1 1 20 BLUE
#> 2 2 REFUSED BLUE
#> 3 3 21 REFUSED
#> 4 4 30 OMITTED
#> 5 5 1 N/A
#> 6 6 41 RED
#> 7 7 50 OMITTED
#> 8 8 30 YELLOW
#> 9 9 REFUSED REFUSED
#> 10 10 OMITTED RED
#> 11 11 10 REFUSED
Now we have access to our missing reasons, but all the columns are character vectors. This means that in order to do anything with our values, we always have to filter out the missing reasons, and cast the remaining values to our desired type:
reasons <- c("REFUSED", "OMITTED", "N/A")
df_with_missing |>
mutate(
age_values = as.numeric(if_else(age %in% reasons, NA, age)),
) |>
summarize(
mean_age = mean(age_values, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
arrange(favorite_color)
#> # A tibble: 6 × 3
#> favorite_color mean_age n
#> <chr> <dbl> <int>
#> 1 BLUE 20 2
#> 2 N/A 1 1
#> 3 OMITTED 40 2
#> 4 RED 41 2
#> 5 REFUSED 15.5 3
#> 6 YELLOW 30 1
This gives us the information we want, but it is cumbersome. Notice how there’s no distinction between favorite color values and missing reasons! Things start to get really complex when different columns have different sets of possible missing reasons. It means you have to do a lot of type conversion gymnastics to switch between value types and missing types.
The interlacer approach
interlacer was built based on the insight that everything becomes
much more tidy, simple, and expressive when we explicitly work with
values and missing reasons as separate channels of the same
variable. interlacer introduces a new interlaced
column
type that facilitates this. The read_interlaced_*
functions
in interlacer import data with this new column type.
(df <- read_interlaced_csv(
interlacer_example("colors.csv"),
na = c("REFUSED", "OMITTED", "N/A")
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,fct> <dbl,fct> <chr,fct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> BLUE
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <N/A>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <REFUSED>
#> 10 10 <OMITTED> RED
#> 11 11 10 <REFUSED>
As you can see by the column headers, each column loaded is composed
of two channels: a value channel, and missing reason channel. Each
channel can have its own type. The age
column, for example,
has double
values and factor
missing
reasons:
df$age
#> <interlaced<dbl, fct>[11]>
#> [1] 20 <REFUSED> 21 30 1 41 50
#> [8] 30 <REFUSED> <OMITTED> 10
#> NA levels: REFUSED OMITTED N/A
These channels can be explicitly accessed using the
value_channel()
and na_channel()
helper
functions:
value_channel(df$age)
#> [1] 20 NA 21 30 1 41 50 30 NA NA 10
na_channel(df$age)
#> [1] <NA> REFUSED <NA> <NA> <NA> <NA> <NA> <NA> REFUSED
#> [10] OMITTED <NA>
#> Levels: REFUSED OMITTED N/A
These helpers are rarely needed, however, because computations
automatically operate on an interlaced
column’s value
channel, and ignore the missing reasons channel. The following will
compute the mean age, without the missing reasons interfering:
mean(df$age, na.rm = TRUE)
#> [1] 25.375
(We could have equivalently used the value_channel()
helper to achieve the same result, albeit with more verbosity):
mean(value_channel(df$age), na.rm = TRUE)
#> [1] 25.375
Although missing reasons are excluded in computations, they still
treated as unique values. This means that when we group by age we get a
breakdown by the unique missing reasons, rather than being lumped into a
single NA
:
df |>
summarize(
mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
arrange(favorite_color)
#> # A tibble: 4 × 3
#> favorite_color mean_age n
#> <chr,fct> <dbl> <int>
#> 1 BLUE 20 2
#> 2 RED 41 2
#> 3 YELLOW 30 1
#> 4 <REFUSED> 22.4 6
As you can see, we can generate the same report as we did before, but without needing to do any type gymnastics! Also, the values are neatly distinguished from the missing reasons.
Filtering based on missing reasons
interlaced
columns are also helpful when creating
samples with inclusion / exclusion criteria based on missing reasons.
For example, using our example data, say we wanted to create a sample of
respondents that REFUSED
to give their age. To indicate
that a value should be interpreted as a missing reason, you can use the
na()
function with the value:
df |>
filter(age == na("REFUSED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,fct>, age <dbl,fct>, favorite_color <chr,fct>
# na_channel() can also be used to get an equivalent result:
df |>
filter(na_channel(age) == "REFUSED")
#> # A tibble: 2 × 3
#> person_id age favorite_color
#> <dbl,fct> <dbl,fct> <chr,fct>
#> 1 2 <REFUSED> BLUE
#> 2 9 <REFUSED> <REFUSED>
How about people who REFUSED
to report their age AND
favorite color?
df |>
filter(age == na("REFUSED") & favorite_color == na("REFUSED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,fct>, age <dbl,fct>, favorite_color <chr,fct>
It’s also possible to combine value conditions with missing reason
conditions. For example, this will select everyone who
REFUSED
to give their favorite color, and was over 20 years
old:
Mutations
As you might expect, the na()
function can be used with
values in mutations. The following pipeline will replace the favorite
color of respondents with the missing value "REDACTED"
df |>
mutate(
favorite_color = na("REDACTED")
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,fct> <dbl,fct> <???,fct>
#> 1 1 20 <REDACTED>
#> 2 2 <REFUSED> <REDACTED>
#> 3 3 21 <REDACTED>
#> 4 4 30 <REDACTED>
#> 5 5 1 <REDACTED>
#> 6 6 41 <REDACTED>
#> 7 7 50 <REDACTED>
#> 8 8 30 <REDACTED>
#> 9 9 <REFUSED> <REDACTED>
#> 10 10 <OMITTED> <REDACTED>
#> 11 11 10 <REDACTED>
Conditionals also work exactly as you would expect in mutations. The
following will replace the favorite color of respondents with
age < 18
with the missing reason
"REDACTED_UNDERAGE"
. Respondents missing an age will be
replaced with "REDACTED_MISSING_AGE"
df |>
mutate(
favorite_color = if_else(
age < 18,
na("REDACTED_UNDERAGE"),
favorite_color,
missing = na("REDACTED_MISSING_AGE")
)
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,fct> <dbl,fct> <chr,fct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> <REDACTED_MISSING_AGE>
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <REDACTED_UNDERAGE>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <REDACTED_MISSING_AGE>
#> 10 10 <OMITTED> <REDACTED_MISSING_AGE>
#> 11 11 10 <REDACTED_UNDERAGE>
The following mutation will create a new column called
person_type
that will be "CHILD"
when
age < 18
, "ADULT"
when
age >= 18
, and missing with reason
"AGE_UNAVAILABLE"
when age
is missing:
df |>
mutate(
person_type = if_else(
age < 18,
"CHILD",
"ADULT",
missing = na("AGE_UNAVAILABLE")
),
)
#> # A tibble: 11 × 4
#> person_id age favorite_color person_type
#> <dbl,fct> <dbl,fct> <chr,fct> <chr,fct>
#> 1 1 20 BLUE ADULT
#> 2 2 <REFUSED> BLUE <AGE_UNAVAILABLE>
#> 3 3 21 <REFUSED> ADULT
#> 4 4 30 <OMITTED> ADULT
#> 5 5 1 <N/A> CHILD
#> 6 6 41 RED ADULT
#> 7 7 50 <OMITTED> ADULT
#> 8 8 30 YELLOW ADULT
#> 9 9 <REFUSED> <REFUSED> <AGE_UNAVAILABLE>
#> 10 10 <OMITTED> RED <AGE_UNAVAILABLE>
#> 11 11 10 <REFUSED> CHILD
Important note: You must use dplyr::if_else()
with
interlaced
vectors instead of R’s
base::ifelse()
function, because the base function strips
out the missing reason channel due to a fundamental limitation in
base R.
Multichannel comparison
Comparison operators (e.g. ==
and !=
)
compare value channels, and return NA
whenever missing
values are involved:
5 == NA
#> [1] NA
5 == na("reason")
#> [1] NA
This means that missing reasons cannot be directly compared:
To simultaneously compare values and missing reason channels,
interlacer provides the multichannel comparison operators
%==%
and %!=%
. These operators always return
boolean vectors, and never return missing values. With multichannel
comparison, two values are considered equal iff:
- they are both values, and the values are equal OR
- they are both missing values, and their reasons are the same OR
- they are both missing values, and both lack missing reasons
This way, if we convert the previous example to use multichannel comparison, we get:
Similarly, comparisons with values return booleans rather than
NA
,
And, as a consequence, regular NA
values are equivalent
to interlaced values that do not have a missing reason:
NA %==% na()
#> [1] TRUE
# Alternatively this can be considered an `NA` missing reason:
NA %==% na(NA)
#> [1] TRUE
Missing values that are missing their missing reasons are called “empty values”, and discussed in the next section.
Empty values
Empty values are missing values that lack missing reasons. These
values can arise when missing actions are not specified. For example, in
the example, empty values are created because the missing =
argument in the if_else()
statement is not specified:
df |>
mutate(
favorite_color = if_else(
age < 18,
na("REDACTED_UNDERAGE"),
favorite_color,
)
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,fct> <dbl,fct> <chr,fct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> <<NA>>
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <REDACTED_UNDERAGE>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <<NA>>
#> 10 10 <OMITTED> <<NA>>
#> 11 11 10 <REDACTED_UNDERAGE>
As you can see, empty values are displayed as
<<NA>>
.
Empty values can be detected by using the multichannel comparsion operator:
df |>
mutate(
favorite_color = if_else(
age < 18,
na("REDACTED_UNDERAGE"),
favorite_color,
)
) |>
filter(favorite_color %==% na())
#> # A tibble: 3 × 3
#> person_id age favorite_color
#> <dbl,fct> <dbl,fct> <chr,fct>
#> 1 2 <REFUSED> <<NA>>
#> 2 9 <REFUSED> <<NA>>
#> 3 10 <OMITTED> <<NA>>
(Note that regular equality does not work, because it would return
NA
for all the comparisons!)
Empty values often occur as the result of joins, because the
dplyr::*_join()
family of functions do not have a
missing =
parameter, like dplyr::if_else()
does. For example, say we had the following data frame we wanted to join
to our sample:
conditions <- tribble(
~person_id, ~condition,
1, "TREATMENT",
2, "CONTROL",
3, na("TECHNICAL_ERROR"),
6, "CONTROL",
8, "TREATMENT",
)
Because we’re missing condition information for some of the respondents, these show up as empty values when we join the data frame to our sample:
df |>
left_join(conditions, by = join_by(person_id))
#> # A tibble: 11 × 4
#> person_id age favorite_color condition
#> <dbl,fct> <dbl,fct> <chr,fct> <chr,fct>
#> 1 1 20 BLUE TREATMENT
#> 2 2 <REFUSED> BLUE CONTROL
#> 3 3 21 <REFUSED> <TECHNICAL_ERROR>
#> 4 4 30 <OMITTED> <<NA>>
#> 5 5 1 <N/A> <<NA>>
#> 6 6 41 RED CONTROL
#> 7 7 50 <OMITTED> <<NA>>
#> 8 8 30 YELLOW TREATMENT
#> 9 9 <REFUSED> <REFUSED> <<NA>>
#> 10 10 <OMITTED> RED <<NA>>
#> 11 11 10 <REFUSED> <<NA>>
We can remedy this by replacing these empty values after the join:
df |>
left_join(conditions, by = join_by(person_id)) |>
mutate(
condition = if_else(condition %==% na(), na("LEFT_STUDY"), condition),
)
#> # A tibble: 11 × 4
#> person_id age favorite_color condition
#> <dbl,fct> <dbl,fct> <chr,fct> <chr,fct>
#> 1 1 20 BLUE TREATMENT
#> 2 2 <REFUSED> BLUE CONTROL
#> 3 3 21 <REFUSED> <TECHNICAL_ERROR>
#> 4 4 30 <OMITTED> <LEFT_STUDY>
#> 5 5 1 <N/A> <LEFT_STUDY>
#> 6 6 41 RED CONTROL
#> 7 7 50 <OMITTED> <LEFT_STUDY>
#> 8 8 30 YELLOW TREATMENT
#> 9 9 <REFUSED> <REFUSED> <LEFT_STUDY>
#> 10 10 <OMITTED> RED <LEFT_STUDY>
#> 11 11 10 <REFUSED> <LEFT_STUDY>
Writing interlaced files
After you’ve made made changes to your data, you probably want to
save them. interlacer provides the write_interlaced_*
family of functions for this:
write_interlaced_csv(df, "interlaced_output.csv")
This will combine the value and missing reasons into interlaced
character columns, and write the result as a csv. Alternatively, if you
want to re-interlace the columns without writing to a file for more
control in the writing process, you can use
flatten_channels()
:
flatten_channels(df)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl> <chr> <chr>
#> 1 1 20 BLUE
#> 2 2 REFUSED BLUE
#> 3 3 21 REFUSED
#> 4 4 30 OMITTED
#> 5 5 1 N/A
#> 6 6 41 RED
#> 7 7 50 OMITTED
#> 8 8 30 YELLOW
#> 9 9 REFUSED REFUSED
#> 10 10 OMITTED RED
#> 11 11 10 REFUSED
# (it works on single vectors as well)
flatten_channels(df$age)
#> [1] "20" "REFUSED" "21" "30" "1" "41" "50"
#> [8] "30" "REFUSED" "OMITTED" "10"
The value and missing reason channels of data frames with
interlaced
vectors can similarly be accessed using the
value_channel()
and na_channel()
helper
functions:
value_channel(df)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl> <dbl> <chr>
#> 1 1 20 BLUE
#> 2 2 NA BLUE
#> 3 3 21 NA
#> 4 4 30 NA
#> 5 5 1 NA
#> 6 6 41 RED
#> 7 7 50 NA
#> 8 8 30 YELLOW
#> 9 9 NA NA
#> 10 10 NA RED
#> 11 11 10 NA
na_channel(df)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <fct> <fct> <fct>
#> 1 NA NA NA
#> 2 NA REFUSED NA
#> 3 NA NA REFUSED
#> 4 NA NA OMITTED
#> 5 NA NA N/A
#> 6 NA NA NA
#> 7 NA NA OMITTED
#> 8 NA NA NA
#> 9 NA REFUSED REFUSED
#> 10 NA OMITTED NA
#> 11 NA NA REFUSED
Next steps
So far, we’ve covered how interlacer’s read_interlaced_*
family of functions enabled us to load interlaced
columns
that contain separate challens for value and missing reasons. The
interlaced
type enables us to create tidy and type-aware
pipelines that can flexibly consider a variable’s value AND missing
reasons.
In all the examples in this vignette, column types were automatically
detected. To explicitly specify value and missing column types, (and
specify individual missing reasons for specific columns), interlacer
extends readr
’s collector()
system. This will
be covered in the next vignette,
vignette("na-column-types")
.