Extended Column Types

Like the readr::read_*() family of functions, read_interlaced_*() will automatically guess column types by default:

library(interlacer, warn.conflicts = FALSE)

(read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A")
))
#> Registered S3 methods overwritten by 'readr':
#>   method                    from 
#>   as.data.frame.spec_tbl_df vroom
#>   as_tibble.spec_tbl_df     vroom
#>   format.col_spec           vroom
#>   print.col_spec            vroom
#>   print.collector           vroom
#>   print.date_names          vroom
#>   print.locale              vroom
#>   str.col_spec              vroom
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <dbl,fct> <dbl,fct> <chr,fct>     
#>  1         1        20 BLUE          
#>  2         2 <REFUSED> BLUE          
#>  3         3        21 <REFUSED>     
#>  4         4        30 <OMITTED>     
#>  5         5         1 <N/A>         
#>  6         6        41 RED           
#>  7         7        50 <OMITTED>     
#>  8         8        30 YELLOW        
#>  9         9 <REFUSED> <REFUSED>     
#> 10        10 <OMITTED> RED           
#> 11        11        10 <REFUSED>

As with readr, these column type guess can be overridden using the col_types parameter with a readr::cols() column specification:

library(readr)

(read_interlaced_csv(
  interlacer_example("colors.csv"),
  col_types = cols(
    person_id = col_integer(),
    age = col_number(),
    favorite_color = col_factor(levels = c("BLUE", "RED", "YELLOW", "GREEN"))
  ),
  na = c("REFUSED", "OMITTED", "N/A")
))
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <int,fct> <dbl,fct> <fct,fct>     
#>  1         1        20 BLUE          
#>  2         2 <REFUSED> BLUE          
#>  3         3        21 <REFUSED>     
#>  4         4        30 <OMITTED>     
#>  5         5         1 <N/A>         
#>  6         6        41 RED           
#>  7         7        50 <OMITTED>     
#>  8         8        30 YELLOW        
#>  9         9 <REFUSED> <REFUSED>     
#> 10        10 <OMITTED> RED           
#> 11        11        10 <REFUSED>

`x_cols`: extended `cols` specifications

When you need more fine-grained control over value and missing reason channel types, you can use an x_cols() specification, an extension of readr’s readr::cols() system. With x_cols(), you can control both the value channel and na channel types in the columns of the resulting data frame.

This is useful when you have missing reasons that only apply to particular items as opposed to the file as a whole. For example, say we had a measure with the following two items:

What is your current stress level?

Low

Moderate

High

I don’t know

I don’t understand the question

How well do you feel you manage your time and responsibilities today?

Poorly

Fairly well

Well

Very well

Does not apply (Today was a vacation day)

Does not apply (Other reason)

As you can see, both items have two selection choices that should be mapped to missing reasons. These can be specified with the x_cols() as follows:

(df_stress <- read_interlaced_csv(
  interlacer_example("stress.csv"),
  col_types = x_cols(
    person_id = x_col(
      v_col_integer(),
      na_col_none()
    ),
    current_stress = x_col(
      v_col_factor(levels = c("LOW", "MODERATE", "HIGH")),
      na_col_factor("DONT_KNOW", "DONT_UNDERSTAND")
    ),
    time_management = x_col(
      v_col_factor(levels = c("POORLY", "FAIRLY_WELL", "WELL", "VERY_WELL")),
      na_col_factor("NA_VACATION", "NA_OTHER")
    )
  )
))
#> # A tibble: 8 × 3
#>   person_id current_stress    time_management
#>       <int> <fct,fct>         <fct,fct>      
#> 1         1 LOW               VERY_WELL      
#> 2         2 MODERATE          POORLY         
#> 3         3 <DONT_KNOW>       <NA_OTHER>     
#> 4         4 HIGH              POORLY         
#> 5         5 <DONT_UNDERSTAND> <NA_OTHER>     
#> 6         6 LOW               <NA_VACATION>  
#> 7         7 MODERATE          WELL           
#> 8         8 <DONT_KNOW>       FAIRLY_WELL

Like readr’s readr::cols() function, each named x_cols() describes a column in the resulting data frame. Value and missing reason channel types are declared via calls to v_col_*() and na_col_*() respectively, which are assembled by x_col().

v_col_*() types mirror readr’s readr::col_* column “collectors”. So v_col_double() is equivalent to readr::col_double(), v_col_character() is equivalent to readr::col_character(), etc. See vroom’s documentation for a list of available column types.

na_col_*() collectors allow you to declare missing reason channel type of the loaded column, and the values that should be interpreted as missing reasons. Currently, there are five options:

na_col_default(): Use the collector defined by the na = argument in the read_interlaced_*() function
na_col_none(): Load the column without a missing reason channel.
na_col_factor(): Use a factor missing reason channel. Character arguments passed form the levels of the factor. (e.g. na_col_factor("REFUSED", "OMITTED", "N/A"))
na_col_integer(): Use an integer na channel. Numeric arguments passed are the values to be interpreted as missing values. (e.g. na_col_integer(-99, -98, -97)))
na_col_cfactor(): Use a cfactor na channel. (cfactor types will be covered in the next vignette,vignette("coded-data"))

The following example shows some of these collectors in action. In this example we use a coded version of the colors.csv example data, to demonstrate integer missing reason types:

read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  col_types = x_cols(
    person_id = x_col(v_col_integer(), na_col_none()),
    age = x_col(v_col_double(), na_col_integer(-99, -98, -97)),
    favorite_color = x_col(v_col_integer(), na_col_integer(-99, -98, -97))
  )
)
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>        <int> <dbl,int>      <int,int>
#>  1         1        20              1
#>  2         2     <-98>              1
#>  3         3        21          <-98>
#>  4         4        30          <-97>
#>  5         5         1          <-99>
#>  6         6        41              2
#>  7         7        50          <-97>
#>  8         8        30              3
#>  9         9     <-98>          <-98>
#> 10        10     <-97>              2
#> 11        11        10          <-98>

Shortcuts

Default collector types

Like readr’s cols() function, the x_cols() function accepts a .default argument that specifies a default value collector. The na = argument is similarly used to specify a default missing reason collector to be used when no na_col_*() is specified, or when it is set to na_col_default().

By taking advantage of these defaults, the specification in the last example could have been equivalently written as:

read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  col_types = x_cols(
    .default = v_col_integer(),
    person_id = x_col(v_col_integer(), na_col_none()),
    age = v_col_double(),
  ),
  na = na_col_integer(-99, -98, -97)
)
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>        <int> <dbl,int>      <int,int>
#>  1         1        20              1
#>  2         2     <-98>              1
#>  3         3        21          <-98>
#>  4         4        30          <-97>
#>  5         5         1          <-99>
#>  6         6        41              2
#>  7         7        50          <-97>
#>  8         8        30              3
#>  9         9     <-98>          <-98>
#> 10        10     <-97>              2
#> 11        11        10          <-98>

Concise value and missing reason specifications

Like readr, value collectors can be specified using characters. For example, instead of v_col_integer(), you can use "i". See vroom’s documentation for a complete list of these shortcuts.

Similarly, missing reason collectors can be specified by providing a vector of the missing values; the collector type is inferred via the type of the vector. The conversions are as follows:

na_col_none(): NULL
na_col_factor(): A character vector, e.g. c("REFUSED", "OMITTED", "N/A")
na_col_integer(): A numeric vector, e.g. c(-99, -98, -97))
na_col_cfactor(): A named character or numeric vector, e.g. c(REFUSED = -99, OMITTED = -98, `N/A` = -97)

Using these shortcuts, the previous example could have equivalently been written in more compact form as follows:

read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  col_types = x_cols(
    .default = "i",
    person_id = x_col("i", NULL),
    age = "d",
  ),
  na = c(-99, -98, -97)
)
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>        <int> <dbl,int>      <int,int>
#>  1         1        20              1
#>  2         2     <-98>              1
#>  3         3        21          <-98>
#>  4         4        30          <-97>
#>  5         5         1          <-99>
#>  6         6        41              2
#>  7         7        50          <-97>
#>  8         8        30              3
#>  9         9     <-98>          <-98>
#> 10        10     <-97>              2
#> 11        11        10          <-98>

Next steps

In this vignette we covered how the column types for values and missing reasons can be explicitly specified using collectors. We also illustrated how column-level missing values can be specified by creating extended column type specifications using x_cols().

In the final examples, we used an example data set with coded values and missing reasons. Coded values are especially common in data sets produced by SPSS, SAS, and Stata. interlacer provides a special column type to make working with this sort of data easier: the cfactor type. This will be covered in next vignette, vignette("coded-data").

x_cols: extended cols specifications

Shortcuts

Default collector types

Concise value and missing reason specifications

Next steps

`x_cols`: extended `cols` specifications