Main methods
errorlocate has two main functions:
- locate_errors for detecting errors
- replace_errors for replacing faulty values with NA
Let’s start with a simple example:
We have a rule that age cannot be negative:
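Such a rule can be written with validate::validator. The following is a minimal sketch, assuming the rule is `age > 0`, which is the expression shown for r1 in the confront summary further on. Note that this form also rejects an age of exactly 0.

```r
library(validate)
library(errorlocate)

# Assumed form of the rule: age must be strictly positive
rules <- validator(age > 0)
rules
```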
And we have the following data set:
"age, income
-10, 0
15, 2000
25, 3000
NA, 1000
" -> csv
d <- read.csv(textConnection(csv), strip.white = TRUE)
d
#>   age income
#> 1 -10      0
#> 2  15   2000
#> 3  25   3000
#> 4  NA   1000
le <- locate_errors(d, rules)
summary(le)
#> Variable:
#>     name errors missing
#> 1    age      1       1
#> 2 income      0       0
#> Errors per record:
#>   errors records
#> 1      0       3
#> 2      1       1
summary(le) gives an overview of the errors found in
this data set. The complete error listing can be found with:
le$errors
This shows that record 1 has a faulty value for age.
Suppose we expand our rules:
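One way to write the expanded rule set is sketched below. The names r1 and r2 and the expressions are taken from the confront summary that follows; the `if` notation for r2 is an assumption, chosen because it is logically equivalent to the disjunction `income <= 0 | (age > 16)` printed there.

```r
library(validate)

rules <- validator(
  r1 = age > 0,
  # Assumed notation: anyone with a positive income must be older than 16
  r2 = if (income > 0) age > 16
)
names(rules)
```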
With validate::confront we can see that rule
r2 is violated (record 2).
d |>
confront(rules) |>
summary()
#>   name items passes fails nNA error warning               expression
#> 1   r1     4      2     1   1 FALSE   FALSE                  age > 0
#> 2   r2     4      2     1   1 FALSE   FALSE income <= 0 | (age > 16)
What errors will be found by locate_errors?
set.seed(1)
le <- locate_errors(d, rules)
le$errors
#>        age income
#> [1,]  TRUE  FALSE
#> [2,]  TRUE  FALSE
#> [3,] FALSE  FALSE
#> [4,]    NA  FALSE
It now detects that age in observation 2 is also faulty,
since it violates the second rule. Note that we use
set.seed. This is needed because in this example either
age or income can be considered faulty in record 2, and
set.seed ensures that the procedure is reproducible.
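The ambiguity in record 2 can be checked by hand. The following base-R sketch is not how errorlocate works internally; it only tests, for a small hypothetical grid of candidate values, whether changing a single variable can make the record satisfy both rules. The helper `rules_ok` and the `grid` values are assumptions made for illustration.

```r
# Record 2 from the data set above
rec <- list(age = 15, income = 2000)

# The two rules, written as in the confront summary
rules_ok <- function(age, income) age > 0 && (income <= 0 || age > 16)

# Hypothetical candidate values to try for each variable
grid <- list(age = 0:100, income = c(0, 1000, 2000))

# For each variable: can the record be made valid by changing only that one?
fixable <- sapply(names(rec), function(v) {
  any(sapply(grid[[v]], function(val) {
    trial <- rec
    trial[[v]] <- val
    rules_ok(trial$age, trial$income)
  }))
})
fixable
```

Both entries are TRUE: setting age to a value above 16 or setting income to 0 each resolves the record, so with equal weights neither variable is preferred.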
With replace_errors we can remove the errors; the resulting
missing values still need to be imputed.
d_fixed <- replace_errors(d, le)
d_fixed |> confront(rules) |> summary()
#>   name items passes fails nNA error warning               expression
#> 1   r1     4      1     0   3 FALSE   FALSE                  age > 0
#> 2   r2     4      2     0   2 FALSE   FALSE income <= 0 | (age > 16)
Here replace_errors has set all faulty values to NA.
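Conceptually, this replacement amounts to blanking every cell that is flagged TRUE. A minimal base-R sketch of that idea, using the flags copied from the le$errors output above, is given here; it illustrates the effect and is not the package's implementation.

```r
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))

# Error flags copied from the le$errors output (filled column by column)
errors <- matrix(
  c(TRUE, TRUE, FALSE, NA,       # age
    FALSE, FALSE, FALSE, FALSE), # income
  ncol = 2, dimnames = list(NULL, c("age", "income"))
)

# Treat an NA flag (value was already missing) as "nothing to replace"
flag <- !is.na(errors) & errors

d_manual <- d
for (v in colnames(flag)) {
  d_manual[[v]][flag[, v]] <- NA
}
d_manual
```

Only age in records 1 and 2 is blanked; the value that was already missing in record 4 stays missing, which matches the three NA results for r1 in the summary.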
Weights
locate_errors allows weights to be supplied for the
variables. The quality of observed variables often differs.
If we trust age more, for example because it was
retrieved from the official population register, we can give it a higher
weight, so that locate_errors chooses income when it has to decide between the two
(record 2):
set.seed(1) # good practice, see later in this document
weight <- c(age = 2, income = 1)
le <- locate_errors(d, rules, weight)
le$errors
#>        age income
#> [1,]  TRUE  FALSE
#> [2,] FALSE   TRUE
#> [3,] FALSE  FALSE
#> [4,]    NA  FALSE
Weights can be specified in different ways (see also
errorlocate::expand_weights):
- not specifying: all variables will have weight 1
- named vector: all records will have the same set of weights. Unspecified columns will have weight 1.
- named matrix or data.frame, same dimension as the data: specify weights per record.
- Use Inf weights to fixate a variable, so it won’t be changed.
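The vector and per-record forms can be combined, as in the sketch below. It assumes that expand_weights takes the data and a weight specification as its first two arguments and returns a weight table with the dimensions of the data; the rule definitions and the tweak for record 2 are hypothetical and only serve the illustration.

```r
library(validate)
library(errorlocate)

d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
# Assumed rule definitions, matching the expressions in the confront summary
rules <- validator(r1 = age > 0, r2 = if (income > 0) age > 16)

# Named vector: one weight per variable, applied to every record.
# income is not listed, so it should get the default weight 1.
w <- expand_weights(d, c(age = 2))

# Per-record weights: same dimensions as the data.
# Hypothetical tweak: in record 2 only, trust income more than age.
w[2, match("income", names(d))] <- 3

set.seed(1)
le <- locate_errors(d, rules, weight = w)
le$errors
```

Starting from the expanded table and editing single cells keeps the weight object in the shape locate_errors expects, which is less error-prone than building a full matrix by hand.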