NA Column Types
Like the readr::read_*() family of functions, read_interlaced_*() will automatically guess column types by default:
library(interlacer, warn.conflicts = FALSE)

(read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A"),
  show_col_types = FALSE
))
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <dbl,fct> <dbl,fct> <chr,fct>
#>  1         1        20 BLUE
#>  2         2 <REFUSED> BLUE
#>  3         3        21 <REFUSED>
#>  4         4        30 <OMITTED>
#>  5         5         1 <N/A>
#>  6         6        41 RED
#>  7         7        50 <OMITTED>
#>  8         8        30 YELLOW
#>  9         9 <REFUSED> <REFUSED>
#> 10        10 <OMITTED> RED
#> 11        11        10 <REFUSED>
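To get a feel for why age comes out as a double even though the raw column contains strings such as REFUSED, it can help to try readr's own guesser on a small vector. The sketch below uses readr::guess_parser() on made-up values. It assumes that setting the missing reasons aside before guessing is roughly what happens to the value channel; that is an illustration, not a description of interlacer's internals.

```r
library(readr)

# Made-up raw values resembling the age column above (not read from the file).
raw_age <- c("20", "REFUSED", "21", "30", "OMITTED")

# With the missing reasons treated as ordinary data, the guess is driven
# by the non-numeric strings.
guess_raw <- guess_parser(raw_age)

# With the missing reasons declared as NA values, only the remaining
# values take part in the guess.
guess_clean <- guess_parser(raw_age, na = c("REFUSED", "OMITTED", "N/A"))

print(guess_raw)
print(guess_clean)
```

Comparing the two results shows how declaring missing reasons up front changes the type inferred for the remaining values.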
As with readr, these column type guesses can be overridden using the col_types parameter with readr’s readr::col_*() column specifiers:
library(readr)

(read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A"),
  col_types = cols(
    person_id = col_integer(),
    age = col_number(),
    favorite_color = col_factor(levels = c("BLUE", "RED", "YELLOW", "GREEN"))
  )
))
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>    <int,fct> <dbl,fct> <fct,fct>
#>  1         1        20 BLUE
#>  2         2 <REFUSED> BLUE
#>  3         3        21 <REFUSED>
#>  4         4        30 <OMITTED>
#>  5         5         1 <N/A>
#>  6         6        41 RED
#>  7         7        50 <OMITTED>
#>  8         8        30 YELLOW
#>  9         9 <REFUSED> <REFUSED>
#> 10        10 <OMITTED> RED
#> 11        11        10 <REFUSED>
NA collector types
In addition to the standard readr::col_*() column specification types, interlacer provides the ability to specify missing reasons at the column level, using the na parameter.
This is useful when you have missing reasons that only apply to particular items as opposed to the file as a whole. For example, say we had a measure with the following two items:
- What is your current stress level?
  - Low
  - Moderate
  - High
  - I don’t know
  - I don’t understand the question
- How well do you feel you manage your time and responsibilities today?
  - Poorly
  - Fairly well
  - Well
  - Very well
  - Does not apply (Today was a vacation day)
  - Does not apply (Other reason)
As you can see, both items have two selection choices that should be mapped to missing reasons. These can be specified with the na_cols() function, which works similarly to readr’s cols() function:
(df_stress <- read_interlaced_csv(
  interlacer_example("stress.csv"),
  col_types = cols(
    person_id = col_integer(),
    current_stress = col_factor(
      levels = c("LOW", "MODERATE", "HIGH")
    ),
    time_management = col_factor(
      levels = c("POORLY", "FAIRLY_WELL", "WELL", "VERY_WELL")
    )
  ),
  na = na_cols(
    .default = c("REFUSED", "OMITTED", "N/A"),
    current_stress = c(.default, "DONT_KNOW", "DONT_UNDERSTAND"),
    time_management = c(.default, "NA_VACATION", "NA_OTHER")
  )
))
#> # A tibble: 8 × 3
#>   person_id current_stress    time_management
#>   <int,fct> <fct,fct>         <fct,fct>
#> 1         1 LOW               VERY_WELL
#> 2         2 MODERATE          POORLY
#> 3         3 <DONT_KNOW>       <NA_OTHER>
#> 4         4 HIGH              POORLY
#> 5         5 <DONT_UNDERSTAND> <NA_OTHER>
#> 6         6 LOW               <NA_VACATION>
#> 7         7 MODERATE          WELL
#> 8         8 <OMITTED>         FAIRLY_WELL
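The idea behind a column-level missing reason list can be sketched in base R. The helper below, split_channels(), is hypothetical and not part of interlacer; it only shows how one raw column could be separated into a value channel and a missing reason channel once the file-wide defaults are combined with the item-specific reasons. The sample responses are made up.

```r
# Hypothetical helper (not an interlacer function): split one raw character
# column into a value channel and a missing reason channel.
split_channels <- function(raw, na_reasons) {
  is_missing <- raw %in% na_reasons
  list(
    value = ifelse(is_missing, NA_character_, raw),
    na_reason = ifelse(is_missing, raw, NA_character_)
  )
}

# File-wide defaults plus the reasons that only apply to current_stress,
# mirroring c(.default, "DONT_KNOW", "DONT_UNDERSTAND") above.
default_na <- c("REFUSED", "OMITTED", "N/A")
stress_na <- c(default_na, "DONT_KNOW", "DONT_UNDERSTAND")

# Made-up raw responses for illustration.
raw_stress <- c("LOW", "MODERATE", "DONT_KNOW", "HIGH", "OMITTED")

channels <- split_channels(raw_stress, stress_na)
print(channels$value)
print(channels$na_reason)
```

Each position ends up with either a value or a missing reason, never both, which is the property the interlaced columns in the output above display with the angle-bracket notation.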
Setting a column’s na specification to NULL indicates that the column should be loaded as a regular type instead of an interlaced one. The following loads person_id as a regular, non-interlaced type:
read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  na = na_cols(
    .default = c(-99, -98, -97),
    person_id = NULL
  ),
  show_col_types = FALSE
)
#> # A tibble: 11 × 3
#>    person_id       age favorite_color
#>        <dbl> <dbl,int>      <dbl,int>
#>  1         1        20              1
#>  2         2     <-98>              1
#>  3         3        21          <-98>
#>  4         4        30          <-97>
#>  5         5         1          <-99>
#>  6         6        41              2
#>  7         7        50          <-97>
#>  8         8        30              3
#>  9         9     <-98>          <-98>
#> 10        10     <-97>              2
#> 11        11        10          <-98>
Next steps
In this vignette we covered how the column types for values and missing reasons can be explicitly specified using collectors. We also illustrated how column-level missing reasons can be specified by creating a missing channel specification using na_cols().
In the final example, we used an example data set with coded values and missing reasons. Coded values are especially common in data sets produced by SPSS, SAS, and Stata. For some recipes for working with coded data like this, check out the next vignette, vignette("coded-data").