Like the readr::read_*() family of functions, read_interlaced_*() will automatically guess column types by default:

library(interlacer, warn.conflicts = FALSE)

(read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A"),
  show_col_types = FALSE
))
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <dbl,fct> <dbl,fct> <chr,fct>     
#>  1         1        20 BLUE          
#>  2         2 <REFUSED> BLUE          
#>  3         3        21 <REFUSED>     
#>  4         4        30 <OMITTED>     
#>  5         5         1 <N/A>         
#>  6         6        41 RED           
#>  7         7        50 <OMITTED>     
#>  8         8        30 YELLOW        
#>  9         9 <REFUSED> <REFUSED>     
#> 10        10 <OMITTED> RED           
#> 11        11        10 <REFUSED>

As with readr, these column type guesses can be overridden by passing readr’s readr::col_*() column specifiers to the col_types parameter:

library(readr)

(read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A"),
  col_types = cols(
    person_id = col_integer(),
    age = col_number(),
    favorite_color = col_factor(levels = c("BLUE", "RED", "YELLOW", "GREEN"))
  )
))
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <int,fct> <dbl,fct> <fct,fct>     
#>  1         1        20 BLUE          
#>  2         2 <REFUSED> BLUE          
#>  3         3        21 <REFUSED>     
#>  4         4        30 <OMITTED>     
#>  5         5         1 <N/A>         
#>  6         6        41 RED           
#>  7         7        50 <OMITTED>     
#>  8         8        30 YELLOW        
#>  9         9 <REFUSED> <REFUSED>     
#> 10        10 <OMITTED> RED           
#> 11        11        10 <REFUSED>
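
For comparison, here is a minimal sketch of the same col_types specification passed to plain readr::read_csv(). The inline data is hypothetical, modeled on the first rows of the output above rather than read from colors.csv. It illustrates that readr applies the same value types but collapses every missing reason into a plain NA:

```r
library(readr)

# Hypothetical inline data modeled on the first three rows shown above.
# I() tells readr to treat the string as literal data, not a file path.
csv_text <- I(
  "person_id,age,favorite_color\n1,20,BLUE\n2,REFUSED,BLUE\n3,21,REFUSED\n"
)

df_plain <- read_csv(
  csv_text,
  na = c("REFUSED", "OMITTED", "N/A"),
  col_types = cols(
    person_id = col_integer(),
    age = col_number(),
    favorite_color = col_factor(levels = c("BLUE", "RED", "YELLOW", "GREEN"))
  )
)

# The value types match the specification, but "REFUSED" is now just NA
is.integer(df_plain$person_id)
is.na(df_plain$age)
is.na(df_plain$favorite_color)
```

With read_interlaced_csv(), the same cells keep their reason (shown as <REFUSED> above) alongside the declared value type.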

NA collector types

In addition to the standard readr::col_* column specification types, interlacer provides the ability to specify missing reasons at the column level, using the na parameter.

This is useful when you have missing reasons that only apply to particular items as opposed to the file as a whole. For example, say we had a measure with the following two items:

  1. What is your current stress level?
     1. Low
     2. Moderate
     3. High
     4. I don’t know
     5. I don’t understand the question
  2. How well do you feel you manage your time and responsibilities today?
     1. Poorly
     2. Fairly well
     3. Well
     4. Very well
     5. Does not apply (Today was a vacation day)
     6. Does not apply (Other reason)

As you can see, both items have two selection choices that should be mapped to missing reasons. These can be specified with the na_cols() function, which works similarly to readr’s cols() function:

(df_stress <- read_interlaced_csv(
  interlacer_example("stress.csv"),
  col_types = cols(
    person_id = col_integer(),
    current_stress = col_factor(
      levels = c("LOW", "MODERATE", "HIGH")
    ),
    time_management = col_factor(
      levels = c("POORLY", "FAIRLY_WELL", "WELL", "VERY_WELL")
    )
  ),
  na = na_cols(
    .default = c("REFUSED", "OMITTED", "N/A"),
    current_stress = c(.default, "DONT_KNOW", "DONT_UNDERSTAND"),
    time_management = c(.default, "NA_VACATION", "NA_OTHER")
  )
))
#> # A tibble: 8 × 3
#>   person_id current_stress    time_management
#>   <int,fct> <fct,fct>         <fct,fct>      
#> 1         1 LOW               VERY_WELL      
#> 2         2 MODERATE          POORLY         
#> 3         3 <DONT_KNOW>       <NA_OTHER>     
#> 4         4 HIGH              POORLY         
#> 5         5 <DONT_UNDERSTAND> <NA_OTHER>     
#> 6         6 LOW               <NA_VACATION>  
#> 7         7 MODERATE          WELL           
#> 8         8 <OMITTED>         FAIRLY_WELL
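
The na_cols() call above can be read as building one set of missing reasons per column. The following base R sketch spells those sets out by hand. It assumes that .default inside na_cols() stands for the vector given as .default, as the example suggests; the names default_reasons and effective_reasons are made up for illustration and are not part of interlacer:

```r
# Assumption: `.default` expands to the vector supplied as `.default =`.
default_reasons <- c("REFUSED", "OMITTED", "N/A")

# Hypothetical hand-written equivalent of the per-column reason sets
effective_reasons <- list(
  person_id       = default_reasons,
  current_stress  = c(default_reasons, "DONT_KNOW", "DONT_UNDERSTAND"),
  time_management = c(default_reasons, "NA_VACATION", "NA_OTHER")
)

# Reasons that apply to current_stress but not to time_management
stress_only <- setdiff(
  effective_reasons$current_stress,
  effective_reasons$time_management
)
stress_only
```

Columns not named in na_cols(), such as person_id here, fall back to the default set.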

Setting a column’s na type to NULL indicates that the column should be loaded as a regular type instead of an interlaced one. The following loads person_id as a regular, non-interlaced column:

read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  na = na_cols(
    .default = c(-99, -98, -97),
    person_id = NULL,
  ),
  show_col_types = FALSE
)
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>        <dbl> <dbl,int>      <dbl,int>
#>  1         1        20              1
#>  2         2     <-98>              1
#>  3         3        21          <-98>
#>  4         4        30          <-97>
#>  5         5         1          <-99>
#>  6         6        41              2
#>  7         7        50          <-97>
#>  8         8        30              3
#>  9         9     <-98>          <-98>
#> 10        10     <-97>              2
#> 11        11        10          <-98>

Next steps

In this vignette we covered how the column types for values and missing reasons can be specified explicitly using collectors. We also illustrated how column-level missing reasons can be specified by creating a missing channel specification with na_cols().

In the final example, we used an example data set with coded values and missing reasons. Coded values are especially common in data sets produced by SPSS, SAS, and Stata. For some recipes for working with coded data like this, check out the next vignette, vignette("coded-data").