Like the readr::read_*()
family of functions,
read_interlaced_*()
will automatically guess column types
by default:
library(interlacer, warn.conflicts = FALSE)
(read_interlaced_csv(
interlacer_example("colors.csv"),
na = c("REFUSED", "OMITTED", "N/A")
))
#> Registered S3 methods overwritten by 'readr':
#> method from
#> as.data.frame.spec_tbl_df vroom
#> as_tibble.spec_tbl_df vroom
#> format.col_spec vroom
#> print.col_spec vroom
#> print.collector vroom
#> print.date_names vroom
#> print.locale vroom
#> str.col_spec vroom
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,fct> <dbl,fct> <chr,fct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> BLUE
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <N/A>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <REFUSED>
#> 10 10 <OMITTED> RED
#> 11 11 10 <REFUSED>
As with readr, these column type guess can be overridden using the
col_types
parameter with a readr::cols()
column specification:
library(readr)
(read_interlaced_csv(
interlacer_example("colors.csv"),
col_types = cols(
person_id = col_integer(),
age = col_number(),
favorite_color = col_factor(levels = c("BLUE", "RED", "YELLOW", "GREEN"))
),
na = c("REFUSED", "OMITTED", "N/A")
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <int,fct> <dbl,fct> <fct,fct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> BLUE
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <N/A>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <REFUSED>
#> 10 10 <OMITTED> RED
#> 11 11 10 <REFUSED>
x_cols
: extended cols
specifications
When you need more fine-grained control over value and missing reason
channel types, you can use an x_cols()
specification, an
extension of readr’s readr::cols()
system. With
x_cols()
, you can control both the value channel and na
channel types in the columns of the resulting data frame.
This is useful when you have missing reasons that only apply to particular items as opposed to the file as a whole. For example, say we had a measure with the following two items:
- What is your current stress level?
- Low
- Moderate
- High
- I don’t know
- I don’t understand the question
- How well do you feel you manage your time and responsibilities today?
- Poorly
- Fairly well
- Well
- Very well
- Does not apply (Today was a vacation day)
- Does not apply (Other reason)
As you can see, both items have two selection choices that should be
mapped to missing reasons. These can be specified with the
x_cols()
as follows:
(df_stress <- read_interlaced_csv(
interlacer_example("stress.csv"),
col_types = x_cols(
person_id = x_col(
v_col_integer(),
na_col_none()
),
current_stress = x_col(
v_col_factor(levels = c("LOW", "MODERATE", "HIGH")),
na_col_factor("DONT_KNOW", "DONT_UNDERSTAND")
),
time_management = x_col(
v_col_factor(levels = c("POORLY", "FAIRLY_WELL", "WELL", "VERY_WELL")),
na_col_factor("NA_VACATION", "NA_OTHER")
)
)
))
#> # A tibble: 8 × 3
#> person_id current_stress time_management
#> <int> <fct,fct> <fct,fct>
#> 1 1 LOW VERY_WELL
#> 2 2 MODERATE POORLY
#> 3 3 <DONT_KNOW> <NA_OTHER>
#> 4 4 HIGH POORLY
#> 5 5 <DONT_UNDERSTAND> <NA_OTHER>
#> 6 6 LOW <NA_VACATION>
#> 7 7 MODERATE WELL
#> 8 8 <DONT_KNOW> FAIRLY_WELL
Like readr’s readr::cols()
function, each named
x_cols()
describes a column in the resulting data frame.
Value and missing reason channel types are declared via calls to
v_col_*()
and na_col_*()
respectively, which
are assembled by x_col()
.
v_col_*()
types mirror readr’s readr::col_*
column “collectors”. So v_col_double()
is equivalent to
readr::col_double()
, v_col_character()
is
equivalent to readr::col_character()
, etc. See vroom’s
documentation for a list of
available column types.
na_col_*()
collectors allow you to declare missing
reason channel type of the loaded column, and the values that should be
interpreted as missing reasons. Currently, there are five options:
na_col_default()
: Use the collector defined by thena =
argument in theread_interlaced_*()
functionna_col_none()
: Load the column without a missing reason channel.na_col_factor()
: Use a factor missing reason channel. Character arguments passed form the levels of the factor. (e.g.na_col_factor("REFUSED", "OMITTED", "N/A")
)na_col_integer()
: Use an integer na channel. Numeric arguments passed are the values to be interpreted as missing values. (e.g.na_col_integer(-99, -98, -97))
)na_col_cfactor()
: Use acfactor
na channel. (cfactor
types will be covered in the next vignette,vignette("coded-data")
)
The following example shows some of these collectors in action. In
this example we use a coded version of the colors.csv
example data, to demonstrate integer missing reason types:
read_interlaced_csv(
interlacer_example("colors_coded.csv"),
col_types = x_cols(
person_id = x_col(v_col_integer(), na_col_none()),
age = x_col(v_col_double(), na_col_integer(-99, -98, -97)),
favorite_color = x_col(v_col_integer(), na_col_integer(-99, -98, -97))
)
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <int> <dbl,int> <int,int>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <-98>
#> 4 4 30 <-97>
#> 5 5 1 <-99>
#> 6 6 41 2
#> 7 7 50 <-97>
#> 8 8 30 3
#> 9 9 <-98> <-98>
#> 10 10 <-97> 2
#> 11 11 10 <-98>
Shortcuts
Default collector types
Like readr’s cols()
function, the x_cols()
function accepts a .default
argument that specifies a
default value collector. The na =
argument is similarly
used to specify a default missing reason collector to be used when no
na_col_*()
is specified, or when it is set to
na_col_default()
.
By taking advantage of these defaults, the specification in the last example could have been equivalently written as:
read_interlaced_csv(
interlacer_example("colors_coded.csv"),
col_types = x_cols(
.default = v_col_integer(),
person_id = x_col(v_col_integer(), na_col_none()),
age = v_col_double(),
),
na = na_col_integer(-99, -98, -97)
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <int> <dbl,int> <int,int>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <-98>
#> 4 4 30 <-97>
#> 5 5 1 <-99>
#> 6 6 41 2
#> 7 7 50 <-97>
#> 8 8 30 3
#> 9 9 <-98> <-98>
#> 10 10 <-97> 2
#> 11 11 10 <-98>
Concise value and missing reason specifications
Like readr, value collectors can be specified using characters. For
example, instead of v_col_integer()
, you can use
"i"
. See vroom’s documentation for a complete
list of these shortcuts.
Similarly, missing reason collectors can be specified by providing a vector of the missing values; the collector type is inferred via the type of the vector. The conversions are as follows:
na_col_none()
:NULL
na_col_factor()
: A character vector, e.g.c("REFUSED", "OMITTED", "N/A")
na_col_integer()
: A numeric vector, e.g.c(-99, -98, -97))
na_col_cfactor()
: A named character or numeric vector, e.g.c(REFUSED = -99, OMITTED = -98, `N/A` = -97)
Using these shortcuts, the previous example could have equivalently been written in more compact form as follows:
read_interlaced_csv(
interlacer_example("colors_coded.csv"),
col_types = x_cols(
.default = "i",
person_id = x_col("i", NULL),
age = "d",
),
na = c(-99, -98, -97)
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <int> <dbl,int> <int,int>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <-98>
#> 4 4 30 <-97>
#> 5 5 1 <-99>
#> 6 6 41 2
#> 7 7 50 <-97>
#> 8 8 30 3
#> 9 9 <-98> <-98>
#> 10 10 <-97> 2
#> 11 11 10 <-98>
Next steps
In this vignette we covered how the column types for values and
missing reasons can be explicitly specified using collectors. We also
illustrated how column-level missing values can be specified by creating
extended column type specifications using x_cols()
.
In the final examples, we used an example data set with coded values
and missing reasons. Coded values are especially common in data sets
produced by SPSS, SAS, and Stata. interlacer provides a special column
type to make working with this sort of data easier: the
cfactor
type. This will be covered in next vignette,
vignette("coded-data")
.