In addition to interlacing values and missing reasons, many statistical software packages will store categorical values and missing reasons as alphanumeric codes. Working with these files can be a pain because the codes are often arbitrary magic numbers that obfuscate the meaning of your syntax and results.
To facilitate working with such data, interlacer provides a new cfactor type. The cfactor allows you to attach labels to coded data and work with it as a regular R factor. Unlike a regular R factor, however, a cfactor can be converted back into its coded representation at any time (whereas R factor values lose their original codes).
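To see what is lost with a plain factor, here is a minimal base R sketch (no interlacer involved; the codes 10, 20, 30 and the color_codes lookup table are made up for illustration). Once codes are turned into a labelled factor, the integers R stores are level positions, so the original codes can only be recovered from a lookup table you keep around yourself:

```r
# Hypothetical code-to-label lookup table (illustration only)
color_codes <- c(BLUE = 10, RED = 20, YELLOW = 30)
raw <- c(10, 10, 20, 30)

# Label the codes with a regular base R factor
f <- factor(raw, levels = color_codes, labels = names(color_codes))

# The factor stores level positions, not the original codes
positions <- as.integer(f)

# Recovering the codes requires the separately kept lookup table
recovered <- unname(color_codes[as.character(f)])

print(positions)
print(recovered)
```

The point of the sketch is that the lookup table lives outside the factor, so it is easy to lose or let drift out of sync with the data.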
⚠️ ⚠️ ⚠️ WARNING ⚠️ ⚠️ ⚠️
The cfactor type is a highly experimental feature (even compared to the rest of interlacer) and has not been thoroughly tested! I’m sharing it in a super pre-alpha, unstable state to get feedback on it before I invest more time polishing its implementation.
SPSS-style codes
As a motivating example, consider this coded version of the colors.csv example:
library(readr)
library(dplyr, warn.conflicts = FALSE)
library(interlacer, warn.conflicts = FALSE)
read_file(
interlacer_example("colors_coded.csv")
) |>
cat()
#> person_id,age,favorite_color
#> 1,20,1
#> 2,-98,1
#> 3,21,-98
#> 4,30,-97
#> 5,1,-99
#> 6,41,2
#> 7,50,-97
#> 8,30,3
#> 9,-98,-98
#> 10,-97,2
#> 11,10,-98
Where missing reasons are:
-99: N/A
-98: REFUSED
-97: OMITTED
And colors are coded:
1: BLUE
2: RED
3: YELLOW
This style of coding, with positive values representing categorical levels and negative values representing missing values, is a common format used by SPSS.
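As a rough base R sketch of what this sign convention means (this is not how interlacer is implemented, and the variable names are made up for illustration), the favorite_color column above can be split by sign into a value channel and a missing reason channel, each decoded with its own lookup table:

```r
# The favorite_color column from colors_coded.csv
favorite_color <- c(1, 1, -98, -97, -99, 2, -97, 3, -98, 2, -98)

# Lookup tables taken from the coding scheme listed above
value_labels <- c(`1` = "BLUE", `2` = "RED", `3` = "YELLOW")
reason_labels <- c(`-99` = "N/A", `-98` = "REFUSED", `-97` = "OMITTED")

# Negative codes are missing reasons; positive codes are values
is_missing <- favorite_color < 0
value_codes <- ifelse(is_missing, NA, favorite_color)
reason_codes <- ifelse(is_missing, favorite_color, NA)

# Decode each channel separately
values <- unname(value_labels[as.character(value_codes)])
reasons <- unname(reason_labels[as.character(reason_codes)])

print(data.frame(values, reasons))
```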
These data can be loaded as interlaced numeric values as follows:
(df_coded <- read_interlaced_csv(
interlacer_example("colors_coded.csv"),
na = c(-99, -98, -97)
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,int> <dbl,int> <dbl,int>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <-98>
#> 4 4 30 <-97>
#> 5 5 1 <-99>
#> 6 6 41 2
#> 7 7 50 <-97>
#> 8 8 30 3
#> 9 9 <-98> <-98>
#> 10 10 <-97> 2
#> 11 11 10 <-98>
This representation is awkward to work with because the codes are meaningless and obfuscate the significance of any code you write or any results you output. If you wanted to select everyone with a BLUE favorite color, for example, you would write:
df_coded |>
filter(favorite_color == 1)
#> # A tibble: 2 × 3
#> person_id age favorite_color
#> <dbl,int> <dbl,int> <dbl,int>
#> 1 1 20 1
#> 2 2 <-98> 1
Similarly, if you wanted to filter for OMITTED favorite colors, you would write:
df_coded |>
filter(favorite_color == na(-97))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,int>, age <dbl,int>, favorite_color <dbl,int>
To make these data more ergonomic to work with, you can use interlacer’s v_col_cfactor() and na_col_cfactor() collector types to load these values as a cfactor instead, which allows you to associate codes with human-readable labels:
(df_decoded <- read_interlaced_csv(
interlacer_example("colors_coded.csv"),
col_types = x_cols(
favorite_color = v_col_cfactor(codes = c(BLUE = 1, RED = 2, YELLOW = 3)),
),
na = na_col_cfactor(`N/A` = -99, REFUSED = -98, OMITTED = -97)
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,cfct> <cfct,cfct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> BLUE
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <N/A>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <REFUSED>
#> 10 10 <OMITTED> RED
#> 11 11 10 <REFUSED>
Now human-readable labels, instead of the magic codes, can be used when working with the data:
df_decoded |>
filter(favorite_color == "BLUE")
#> # A tibble: 2 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,cfct> <cfct,cfct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> BLUE
df_decoded |>
filter(favorite_color == na("OMITTED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,cfct>, age <dbl,cfct>,
#> # favorite_color <cfct,cfct>
But you can still convert the labels of values or missing reasons back to codes if you wish, using as.codes(). The following will convert the missing reason channel of age and the value channel of favorite_color into their coded representation:
df_decoded |>
mutate(
age = map_na_channel(age, as.codes),
favorite_color = map_value_channel(favorite_color, as.codes)
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,int> <int,cfct>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <N/A>
#> 6 6 41 2
#> 7 7 50 <OMITTED>
#> 8 8 30 3
#> 9 9 <-98> <REFUSED>
#> 10 10 <-97> 2
#> 11 11 10 <REFUSED>
To recode all cfactor channels in a data frame into their coded representation, you can do the following:
df_decoded |>
mutate(
across_value_channels(where_value_channel(is.cfactor), as.codes),
across_na_channels(where_na_channel(is.cfactor), as.codes),
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,int> <dbl,int> <int,int>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <-98>
#> 4 4 30 <-97>
#> 5 5 1 <-99>
#> 6 6 41 2
#> 7 7 50 <-97>
#> 8 8 30 3
#> 9 9 <-98> <-98>
#> 10 10 <-97> 2
#> 11 11 10 <-98>
SAS- and Stata-style codes
Like SPSS, SAS and Stata encode factor levels as numeric values, but instead of representing missing reasons as negative codes, they represent them with character codes:
read_file(
interlacer_example("colors_coded_char.csv")
) |>
cat()
#> person_id,age,favorite_color
#> 1,20,1
#> 2,.a,1
#> 3,21,.a
#> 4,30,.b
#> 5,1,.
#> 6,41,2
#> 7,50,.b
#> 8,30,3
#> 9,.a,.a
#> 10,.b,2
#> 11,10,.a
In this example, favorite_color uses the same value coding scheme as in the previous example, but the missing reason channels are coded as follows:
“.”: N/A
“.a”: REFUSED
“.b”: OMITTED
These data can be easily loaded by interlacer into a cfactor missing reason channel as follows:
read_interlaced_csv(
interlacer_example("colors_coded_char.csv"),
col_types = x_cols(
favorite_color = v_col_cfactor(codes = c(BLUE = 1, RED = 2, YELLOW = 3)),
),
na = c(`N/A` = ".", REFUSED = ".a", OMITTED = ".b"),
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,cfct> <cfct,cfct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> BLUE
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <N/A>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <REFUSED>
#> 10 10 <OMITTED> RED
#> 11 11 10 <REFUSED>
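For intuition only, here is a base R sketch of the same separation done by hand on the age column above (this is not interlacer's implementation, and the variable names are illustrative). Because the missing reason codes are strings, the column has to be read as character first, then split into a numeric value channel and a labelled missing reason channel:

```r
# The age column from colors_coded_char.csv, read as character
raw_age <- c("20", ".a", "21", "30", "1", "41", "50", "30", ".a", ".b", "10")

# Missing reason codes from the list above (label = code)
reason_codes <- c(`N/A` = ".", REFUSED = ".a", OMITTED = ".b")

# Anything matching a reason code is missing; the rest are numeric values
is_missing <- raw_age %in% reason_codes
age <- rep(NA_real_, length(raw_age))
age[!is_missing] <- as.numeric(raw_age[!is_missing])

# Label the missing reasons; non-missing rows get NA
age_reason <- names(reason_codes)[match(raw_age, reason_codes)]

print(data.frame(age, age_reason))
```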
The cfactor type
The cfactor is an extension of base R’s factor type. cfactor vectors are created from numeric or character codes using the cfactor() function:
(example_cfactor <- cfactor(
c(10, 20, 30, 10, 20, 30),
codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
))
#> <cfactor<int+bd96a>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#>
#> Categorical levels:
#> label code
#> LEVEL_A 10
#> LEVEL_B 20
#> LEVEL_C 30
(example_cfactor2 <- cfactor(
c("a", "b", "c", "a", "b", "c"),
codes = c(LEVEL_A = "a", LEVEL_B = "b", LEVEL_C = "c")
))
#> <cfactor<chr+99cda>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#>
#> Categorical levels:
#> label code
#> LEVEL_A a
#> LEVEL_B b
#> LEVEL_C c
cfactor vectors can be used wherever regular base R factor types are used, because they are fully compatible factor types:
is.factor(example_cfactor)
#> [1] TRUE
levels(example_cfactor)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"
is.factor(example_cfactor2)
#> [1] TRUE
levels(example_cfactor2)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"
But unlike a regular factor, a cfactor additionally stores the codes for the factor levels. This means you can convert it back into its coded representation at any time, if desired:
codes(example_cfactor)
#> LEVEL_A LEVEL_B LEVEL_C
#> 10 20 30
as.codes(example_cfactor)
#> [1] 10 20 30 10 20 30
codes(example_cfactor2)
#> LEVEL_A LEVEL_B LEVEL_C
#> "a" "b" "c"
as.codes(example_cfactor2)
#> [1] "a" "b" "c" "a" "b" "c"
IMPORTANT: The as.numeric() and as.integer() functions do not convert a cfactor with numeric codes into its coded representation. Instead, in order to retain full compatibility with the base R factor type, they always return a result coded by the index of each level in the factor:
as.numeric(example_cfactor)
#> [1] 1 2 3 1 2 3
as.numeric(example_cfactor2)
#> [1] 1 2 3 1 2 3
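The relationship between the level indices and the codes can be sketched by hand using only the functions shown above. This assumes that codes() returns an ordinary named vector that can be indexed by label, as its printed output suggests; in practice as.codes() is the function to reach for:

```r
library(interlacer)

x <- cfactor(
  c(10, 20, 30, 10, 20, 30),
  codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
)

# as.integer() gives the position of each element's level
positions <- as.integer(x)

# Map positions to labels, then labels to codes (assumes codes() is a
# plain named vector, which is an assumption about its return type)
labels <- levels(x)[positions]
manual_codes <- unname(codes(x)[labels])

# The manual lookup should agree with as.codes()
print(all(manual_codes == as.codes(x)))
```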
When the levels are changed, the cfactor will drop its codes and degrade into a regular R factor:
cfactor_copy <- example_cfactor
# cfactor_copy is a cfactor and a factor
is.cfactor(cfactor_copy)
#> [1] TRUE
is.factor(cfactor_copy)
#> [1] TRUE
levels(cfactor_copy)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"
codes(cfactor_copy)
#> LEVEL_A LEVEL_B LEVEL_C
#> 10 20 30
# modify the levels of the cfactor as if it were a regular factor
levels(cfactor_copy) <- c("C", "B", "A")
# now cfactor_copy is just a regular factor
is.cfactor(cfactor_copy)
#> [1] FALSE
is.factor(cfactor_copy)
#> [1] TRUE
levels(cfactor_copy)
#> [1] "C" "B" "A"
codes(cfactor_copy)
#> NULL
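Because this degradation happens silently, it can be worth guarding code that relies on the codes. The helper below, codes_or_stop(), is a hypothetical name and not part of interlacer; it is a small sketch built only from is.cfactor() and as.codes() as shown above:

```r
library(interlacer)

# Hypothetical helper: return the codes, or fail loudly if they were dropped
codes_or_stop <- function(x) {
  if (!is.cfactor(x)) {
    stop("x is not a cfactor; its codes may have been dropped", call. = FALSE)
  }
  as.codes(x)
}

x <- cfactor(
  c(10, 20, 30, 10, 20, 30),
  codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
)
ok <- codes_or_stop(x)

# Changing the levels degrades x to a plain factor, so the helper now errors
levels(x) <- c("C", "B", "A")
result <- tryCatch(codes_or_stop(x), error = function(e) conditionMessage(e))
print(result)
```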
Finally, if you have a base R factor or character vector of labels, you can add codes to it via as.cfactor():
as.cfactor(
c("LEVEL_A", "LEVEL_B", "LEVEL_C", "LEVEL_A", "LEVEL_B", "LEVEL_C"),
codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
)
#> <cfactor<int+bd96a>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#>
#> Categorical levels:
#> label code
#> LEVEL_A 10
#> LEVEL_B 20
#> LEVEL_C 30
Re-coding and writing an interlaced data frame
Re-coding and writing an interlaced data frame is as simple as calling as.codes() on all cfactor-type value and missing reason channels, and then calling one of the write_interlaced_*() family of functions:
df_decoded |>
mutate(
across_value_channels(where_value_channel(is.cfactor), as.codes),
across_na_channels(where_na_channel(is.cfactor), as.codes),
) |>
write_interlaced_csv("output.csv")
haven
The haven package has functions for loading native SPSS, SAS, and Stata file formats into special data frames that use column attributes and special values to keep track of value labels and missing reasons. For a complete discussion of how this compares to interlacer’s approach, see vignette("other-approaches").