Coded Data

In addition to interlacing values and missing reasons, many statistical software packages will store categorical values and missing reasons as alphanumeric codes. Working with these files can be a pain because the codes are often arbitrary magic numbers that obfuscate the meaning of your syntax and results.

To facilitate working with such data, interlacer provides a new cfactor type. The cfactor allows you to attach labels to coded data and work with it as a regular R factor. Unlike a regular R factor, however, a cfactor can be converted back into its coded representation at any time (whereas R factor values lose their original codes).
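To see what "losing the original codes" means in practice, here is a minimal base R sketch (it does not use interlacer; `raw`, `code_table`, and `recovered` are hypothetical names chosen for illustration). Once coded values are turned into a regular factor, only the labels and their positions remain, so the codes can be recovered only if you keep a separate lookup table around:

```r
# Hypothetical coded values and the table that maps labels to codes
raw <- c(10, 20, 30, 10)
code_table <- c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)

# A regular factor keeps the labels but not the codes
f <- factor(raw, levels = code_table, labels = names(code_table))

# as.integer() returns level positions, not the original codes
as.integer(f)
#> [1] 1 2 3 1

# Getting the codes back requires the separately stored lookup table
recovered <- unname(code_table[as.character(f)])
recovered
#> [1] 10 20 30 10
```

A cfactor avoids this bookkeeping by carrying the code table along with the factor itself.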

⚠️ ⚠️ ⚠️ WARNING ⚠️ ⚠️ ⚠️

The cfactor type is a highly experimental feature (even compared to the rest of interlacer) and has not been thoroughly tested! I’m sharing it in a super pre-alpha, unstable state to get feedback before I invest more time polishing its implementation.

SPSS-style codes

As a motivating example, consider this coded version of the colors.csv example:

library(readr)
library(dplyr, warn.conflicts = FALSE)
library(interlacer, warn.conflicts = FALSE)

read_file(
  interlacer_example("colors_coded.csv")
) |>
  cat()
#> person_id,age,favorite_color
#> 1,20,1
#> 2,-98,1
#> 3,21,-98
#> 4,30,-97
#> 5,1,-99
#> 6,41,2
#> 7,50,-97
#> 8,30,3
#> 9,-98,-98
#> 10,-97,2
#> 11,10,-98

Where missing reasons are:

-99: N/A

-98: REFUSED

-97: OMITTED

And colors are coded:

1: BLUE

2: RED

3: YELLOW

This style of coding, with positive values representing categorical levels and negative values representing missing values, is a common format used by SPSS.
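To make the convention concrete, the following base R sketch decodes the favorite_color column by hand, using the two code tables listed above. It is only an illustration of the coding scheme, not how interlacer is implemented, and the variable names (`raw`, `color_codes`, `na_codes`, `values`, `reasons`) are made up for this example:

```r
# The favorite_color column from colors_coded.csv, as raw codes
raw <- c(1, 1, -98, -97, -99, 2, -97, 3, -98, 2, -98)

# Positive codes are categorical levels; negative codes are missing reasons
color_codes <- c(BLUE = 1, RED = 2, YELLOW = 3)
na_codes <- c(`N/A` = -99, REFUSED = -98, OMITTED = -97)

# Look each raw code up in both tables; a code that is absent from a table
# yields NA, so every row ends up with either a value or a missing reason
values <- names(color_codes)[match(raw, color_codes)]
reasons <- names(na_codes)[match(raw, na_codes)]

data.frame(raw, values, reasons)
```

Each row has exactly one of `values` or `reasons` filled in, which is the same split that interlacer's value and missing reason channels express.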

These data can be loaded as interlaced numeric values as follows:

(df_coded <- read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  na = c(-99, -98, -97)
))
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <dbl,int> <dbl,int>      <dbl,int>
#>  1         1        20              1
#>  2         2     <-98>              1
#>  3         3        21          <-98>
#>  4         4        30          <-97>
#>  5         5         1          <-99>
#>  6         6        41              2
#>  7         7        50          <-97>
#>  8         8        30              3
#>  9         9     <-98>          <-98>
#> 10        10     <-97>              2
#> 11        11        10          <-98>

This representation is awkward to work with because the codes are meaningless and obfuscate the significance of any code you write or any results you output. If you wanted to select everyone with a BLUE favorite color, for example, you would write:

df_coded |>
  filter(favorite_color == 1)
#> # A tibble: 2 × 3
#>   person_id       age favorite_color
#>   <dbl,int> <dbl,int>      <dbl,int>
#> 1         1        20              1
#> 2         2     <-98>              1

Similarly, if you wanted to filter for OMITTED favorite colors, you would write:

df_coded |>
  filter(favorite_color == na(-97))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,int>, age <dbl,int>, favorite_color <dbl,int>

To make these data more ergonomic to work with, you can use interlacer’s v_col_cfactor() and na_col_cfactor() collector types to load these values as a cfactor instead, which allows you to associate codes with human-readable labels:

(df_decoded <- read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  col_types = x_cols(
    favorite_color = v_col_cfactor(codes = c(BLUE = 1, RED = 2, YELLOW = 3)),
  ),
  na = na_col_cfactor(`N/A` = -99, REFUSED = -98, OMITTED = -97)
))
#> # A tibble: 11 × 3
#>     person_id        age favorite_color
#>    <dbl,cfct> <dbl,cfct> <cfct,cfct>   
#>  1          1         20 BLUE          
#>  2          2  <REFUSED> BLUE          
#>  3          3         21 <REFUSED>     
#>  4          4         30 <OMITTED>     
#>  5          5          1 <N/A>         
#>  6          6         41 RED           
#>  7          7         50 <OMITTED>     
#>  8          8         30 YELLOW        
#>  9          9  <REFUSED> <REFUSED>     
#> 10         10  <OMITTED> RED           
#> 11         11         10 <REFUSED>

Now human-readable labels, instead of the magic codes, can be used when working with the data:

df_decoded |>
  filter(favorite_color == "BLUE")
#> # A tibble: 2 × 3
#>    person_id        age favorite_color
#>   <dbl,cfct> <dbl,cfct> <cfct,cfct>   
#> 1          1         20 BLUE          
#> 2          2  <REFUSED> BLUE

df_decoded |>
  filter(favorite_color == na("OMITTED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,cfct>, age <dbl,cfct>,
#> #   favorite_color <cfct,cfct>

But you can still convert the labels of values or missing reasons back to codes if you wish, using as.codes(). The following will convert the missing reason channel of age and the value channel of favorite_color into their coded representations:

df_decoded |>
  mutate(
    age = map_na_channel(age, as.codes),
    favorite_color = map_value_channel(favorite_color, as.codes)
  )
#> # A tibble: 11 × 3
#>     person_id       age favorite_color
#>    <dbl,cfct> <dbl,int>     <int,cfct>
#>  1          1        20              1
#>  2          2     <-98>              1
#>  3          3        21      <REFUSED>
#>  4          4        30      <OMITTED>
#>  5          5         1          <N/A>
#>  6          6        41              2
#>  7          7        50      <OMITTED>
#>  8          8        30              3
#>  9          9     <-98>      <REFUSED>
#> 10         10     <-97>              2
#> 11         11        10      <REFUSED>

To recode all cfactor channels in a data frame into their coded representation, you can do the following:

df_decoded |>
  mutate(
    across_value_channels(where_value_channel(is.cfactor), as.codes),
    across_na_channels(where_na_channel(is.cfactor), as.codes),
  )
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <dbl,int> <dbl,int>      <int,int>
#>  1         1        20              1
#>  2         2     <-98>              1
#>  3         3        21          <-98>
#>  4         4        30          <-97>
#>  5         5         1          <-99>
#>  6         6        41              2
#>  7         7        50          <-97>
#>  8         8        30              3
#>  9         9     <-98>          <-98>
#> 10        10     <-97>              2
#> 11        11        10          <-98>

SAS- and Stata-style codes

Like SPSS, SAS and Stata encode factor levels as numeric values, but instead of representing missing reasons as negative codes, they represent them as character codes:

read_file(
  interlacer_example("colors_coded_char.csv")
) |>
  cat()
#> person_id,age,favorite_color
#> 1,20,1
#> 2,.a,1
#> 3,21,.a
#> 4,30,.b
#> 5,1,.
#> 6,41,2
#> 7,50,.b
#> 8,30,3
#> 9,.a,.a
#> 10,.b,2
#> 11,10,.a

In this example, favorite_color uses the same value coding scheme as in the previous example, but the missing reason channels are coded as follows:

“.”: N/A

“.a”: REFUSED

“.b”: OMITTED

These data can be easily loaded by interlacer into a cfactor missing reason channel as follows:

read_interlaced_csv(
  interlacer_example("colors_coded_char.csv"),
  col_types = x_cols(
    favorite_color = v_col_cfactor(codes = c(BLUE = 1, RED = 2, YELLOW = 3)),
  ),   
  na = c(`N/A` = ".", REFUSED = ".a", OMITTED = ".b"),
)
#> # A tibble: 11 × 3
#>     person_id        age favorite_color
#>    <dbl,cfct> <dbl,cfct> <cfct,cfct>   
#>  1          1         20 BLUE          
#>  2          2  <REFUSED> BLUE          
#>  3          3         21 <REFUSED>     
#>  4          4         30 <OMITTED>     
#>  5          5          1 <N/A>         
#>  6          6         41 RED           
#>  7          7         50 <OMITTED>     
#>  8          8         30 YELLOW        
#>  9          9  <REFUSED> <REFUSED>     
#> 10         10  <OMITTED> RED           
#> 11         11         10 <REFUSED>
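For comparison, here is a rough base R sketch of what handling these character codes by hand involves for the age column. It does not use interlacer, and the names `raw_age`, `na_codes`, `age`, and `reason` are hypothetical. Because numbers and character codes share a column, the column has to be read as text and then split:

```r
# The age column from colors_coded_char.csv, read as text
raw_age <- c("20", ".a", "21", "30", "1", "41", "50", "30", ".a", ".b", "10")

# Character codes for the missing reasons
na_codes <- c(`N/A` = ".", REFUSED = ".a", OMITTED = ".b")

# Flag the entries that are missing reason codes rather than numbers
is_missing <- raw_age %in% na_codes

# Convert only the non-missing entries to numbers; leave the rest as NA
age <- rep(NA_real_, length(raw_age))
age[!is_missing] <- as.numeric(raw_age[!is_missing])

# Translate the codes into labels; non-missing entries get NA
reason <- names(na_codes)[match(raw_age, na_codes)]

data.frame(raw_age, age, reason)
```

Converting only the non-missing entries avoids the coercion warnings that calling as.numeric() on the whole column would produce.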

The cfactor type

The cfactor is an extension of base R’s factor type. A cfactor is created from numeric or character codes using the cfactor() function:

(example_cfactor <- cfactor(
  c(10, 20, 30, 10, 20, 30),
  codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
))
#> <cfactor<int+bd96a>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#> 
#> Categorical levels:
#>    label code
#>  LEVEL_A   10
#>  LEVEL_B   20
#>  LEVEL_C   30

(example_cfactor2 <- cfactor(
  c("a", "b", "c", "a", "b", "c"),
  codes = c(LEVEL_A = "a", LEVEL_B = "b", LEVEL_C = "c")
))
#> <cfactor<chr+99cda>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#> 
#> Categorical levels:
#>    label code
#>  LEVEL_A    a
#>  LEVEL_B    b
#>  LEVEL_C    c

cfactor vectors can be used wherever regular base R factors are used, because they are fully compatible factor types:

is.factor(example_cfactor)
#> [1] TRUE
levels(example_cfactor)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"

is.factor(example_cfactor2)
#> [1] TRUE
levels(example_cfactor2)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"

But unlike a regular factor, a cfactor additionally stores the codes for the factor levels. This means you can convert it back into its coded representation at any time, if desired:

codes(example_cfactor)
#> LEVEL_A LEVEL_B LEVEL_C 
#>      10      20      30
as.codes(example_cfactor)
#> [1] 10 20 30 10 20 30

codes(example_cfactor2)
#> LEVEL_A LEVEL_B LEVEL_C 
#>     "a"     "b"     "c"
as.codes(example_cfactor2)
#> [1] "a" "b" "c" "a" "b" "c"

IMPORTANT: The as.numeric() and as.integer() functions do not convert a cfactor with numeric codes into its coded representation. Instead, to retain full compatibility with the base R factor type, they always return the index of each element's level in the factor:

as.numeric(example_cfactor)
#> [1] 1 2 3 1 2 3
as.numeric(example_cfactor2)
#> [1] 1 2 3 1 2 3

When the levels are changed, the cfactor will drop its codes and degrade into a regular R factor:

cfactor_copy <- example_cfactor

# cfactor_copy is a cfactor and a factor
is.cfactor(cfactor_copy)
#> [1] TRUE
is.factor(cfactor_copy)
#> [1] TRUE
levels(cfactor_copy)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"
codes(cfactor_copy)
#> LEVEL_A LEVEL_B LEVEL_C 
#>      10      20      30

# modify the levels of the cfactor as if it was a regular factor
levels(cfactor_copy) <- c("C", "B", "A")

# now cfactor_copy is just a regular factor
is.cfactor(cfactor_copy)
#> [1] FALSE
is.factor(cfactor_copy)
#> [1] TRUE
levels(cfactor_copy)
#> [1] "C" "B" "A"
codes(cfactor_copy)
#> NULL
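The base R sketch below illustrates why a code table keyed by label cannot simply survive relabeling. It uses a plain factor and a hand-kept lookup table (`code_table`, `f`, and `lookup_after` are hypothetical names), and whether this is the actual motivation for interlacer's behavior is an assumption on my part:

```r
# A plain factor plus a hand-kept table mapping labels to codes
code_table <- c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
f <- factor(c("LEVEL_A", "LEVEL_B", "LEVEL_C"), levels = names(code_table))

# Before relabeling, every label is found in the table
unname(code_table[as.character(f)])
#> [1] 10 20 30

# Relabel the levels, exactly as in the cfactor example above
levels(f) <- c("C", "B", "A")

# The new labels no longer appear in the table, so every lookup fails
lookup_after <- unname(code_table[as.character(f)])
lookup_after
#> [1] NA NA NA
```

Since the old table says nothing about the new labels, dropping the codes is a safer outcome than keeping a mapping that may no longer be valid.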

Finally, if you have a base R factor or a character vector of labels, you can add codes to it via as.cfactor():

as.cfactor(
  c("LEVEL_A", "LEVEL_B", "LEVEL_C", "LEVEL_A", "LEVEL_B", "LEVEL_C"),
  codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
)
#> <cfactor<int+bd96a>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#> 
#> Categorical levels:
#>    label code
#>  LEVEL_A   10
#>  LEVEL_B   20
#>  LEVEL_C   30

Re-coding and writing an interlaced data frame

Re-coding and writing an interlaced data frame is as simple as calling as.codes() on all cfactor value and missing reason channels, and then calling one of the write_interlaced_*() functions:

df_decoded |>
  mutate(
    across_value_channels(where_value_channel(is.cfactor), as.codes),
    across_na_channels(where_na_channel(is.cfactor), as.codes),
  ) |>
  write_interlaced_csv("output.csv")

haven