| Title: | Locate Errors with Validation Rules |
|---|---|
| Description: | Errors in data can be located and removed using validation rules from package 'validate'. See also Van der Loo and De Jonge (2018) <doi:10.1002/9781118897126>, chapter 7. |
| Authors: | Edwin de Jonge [aut, cre] (ORCID: <https://orcid.org/0000-0002-6580-4718>), Mark van der Loo [aut] |
| Maintainer: | Edwin de Jonge <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1.3 |
| Built: | 2026-06-01 08:47:50 UTC |
| Source: | https://github.com/data-cleaning/errorlocate |
errorlocate helps identify which cells in a record are likely causing rule
violations.
The package is designed to work with validate::validator() rules. While
validate can determine whether a record is valid, errorlocate tries to
identify which fields are most likely erroneous.
This is non-trivial because rules are interdependent: changing one value to satisfy one rule may violate another rule.
errorlocate implements record-level error localization based on the
Fellegi-Holt approach. It translates the localization task into a mixed
integer optimization problem and uses a MIP solver to find a minimum-weight
set of fields to change.
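For linear rules, the optimization problem described above can be written roughly as follows. This is a sketch of the standard big-M formulation, not necessarily the exact problem the package builds. The notation is introduced here for illustration: $x^0$ is the observed record, $w$ the weights, $\delta$ the binary change indicators, $A$ and $b$ the rule coefficients, and $M$ a sufficiently large constant.

```latex
\begin{aligned}
\min_{x,\,\delta}\quad & \sum_{j} w_j\,\delta_j \\
\text{subject to}\quad & A x \le b, \\
& x^0_j - M\delta_j \;\le\; x_j \;\le\; x^0_j + M\delta_j \quad \text{for all } j, \\
& \delta_j \in \{0, 1\}.
\end{aligned}
```

A field $j$ with $\delta_j = 1$ is free to move away from its observed value and is therefore reported as a likely error. Fields with $\delta_j = 0$ stay fixed at $x^0_j$.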
Typical workflow:
Define rules with validate::validator().
Locate likely errors with locate_errors().
Replace flagged cells with missing values using replace_errors().
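The three steps above can be sketched as follows. The rules and the single record are illustrative values (they mirror the examples further down), and `cleaned` is just a name chosen here.

```r
library(validate)     # provides validator() and values()
library(errorlocate)

# 1. Define rules
rules <- validator(
  profit + cost == turnover,
  cost >= 0.6 * turnover,
  turnover >= 0
)

# Illustrative record that violates the first rule
data <- data.frame(profit = 755, cost = 125, turnover = 200)

# 2. Locate likely errors
set.seed(1)
le <- locate_errors(data, rules)
values(le)   # logical matrix: TRUE marks a flagged cell

# 3. Replace flagged cells with NA
cleaned <- replace_errors(data, le)
cleaned
```

Passing the result of locate_errors() to replace_errors() avoids running the localization a second time.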
References:
De Waal, T. (2003). Processing of Erroneous and Unsafe Data. PhD thesis, University of Rotterdam.
Van der Loo, M. and De Jonge, E. (2018). Statistical Data Cleaning with Applications in R. <doi:10.1002/9781118897126>
De Jonge, E. and Van der Loo, M. (2012). Error localization as a mixed-integer program in editrules.
lp_solve and Konis, K. (2011). lpSolveAPI: R Interface for lp_solve version 5.5.2.0. R package version 5.5.2.0-5. http://CRAN.R-project.org/package=lpSolveAPI
Useful links:
Report bugs at https://github.com/data-cleaning/errorlocate/issues
Utility function to add some small positive noise to weights. This is mainly done to randomly choose between solutions of equal weight. Without adding noise to weights, LP solvers may return an identical solution over and over while there are multiple solutions of equal weight. The generated noise is positive to prevent weights from becoming zero or negative.
    add_noise(x, max_delta = NULL, ...)
| Argument | Description |
|---|---|
| x | |
| max_delta | when supplied noise will be drawn from |
| ... | currently not used |
When no max_delta is supplied, add_noise uses the smallest difference between
values of x that is larger than zero, divided by length(x).
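The default bound described above can be reproduced by hand in base R. This is a sketch of the rule as stated, not the package's own code: the names `d`, `min_positive` and `noisy` are introduced here, and drawing the noise from a uniform distribution is an assumption.

```r
x <- c(1, 1, 3, 8)

# Differences between neighbouring sorted values; keep only those > 0
d <- diff(sort(x))
min_positive <- min(d[d > 0])

# Default upper bound for the noise, following the description above
max_delta <- min_positive / length(x)
max_delta

# Illustrative only: draw non-negative noise below that bound and add it
set.seed(123)
noisy <- x + runif(length(x), min = 0, max = max_delta)
noisy
```

Dividing by length(x) means that even the sum of all noise terms stays below the smallest gap between distinct weights, so the noise can break ties without changing which solution has the lowest total weight.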
numeric vector/matrix with noise applied.
    x <- c(1, 1, 3, 8)
    set.seed(123)
    add_noise(x)

    m <- rbind(c(1, 2, 3), c(1, 2, 3))
    set.seed(123)
    add_noise(m, max_delta = 0.05)
ErrorLocalizer can be used as a base class to implement a new error localization algorithm.
The derived class must implement two methods: initialize, which is called
before any error localization is done, and locate, which operates on the data. The
extra parameter ... can be used to supply algorithm-specific parameters.
errorlocation contains the result of error localization.
It stores, per cell, whether a value is flagged as erroneous (TRUE), valid
(FALSE), or missing (NA).
A record-based error is restricted to one observation.
errorlocate, which uses the Fellegi-Holt algorithm, assumes errors are
record-based.
A variable-based error is a flaw in a uni- or multivariate distribution. Correcting it typically requires changing multiple observations or aggregate totals.
The current implementation handles only record-based errors.
Retrieve error locations with validate::values(), which returns a matrix
with the same dimensions as the checked data.frame.
For errors that are purely column-based or dataset-based, errorlocation
returns a matrix with all rows or cells set to TRUE.
validate::values() returns NA for missing values.
$errors: matrix indicating which values are erroneous (TRUE),
missing (NA) or valid (FALSE)
$weight: The total weight per record. A weight of 0 means no errors were detected.
$status: The status of the MIP solver for this record.
$duration: The number of seconds for processing each record.
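A short sketch of reading these fields from a result. The rules and data are made up for illustration; the field names are the ones listed above.

```r
library(validate)
library(errorlocate)

rules <- validator(age >= 0, age <= 120)
data  <- data.frame(age = c(35, -4, NA))

le <- locate_errors(data, rules)

values(le)    # error matrix: TRUE = flagged, FALSE = valid, NA = missing
le$weight     # total weight per record; 0 means no error was detected
le$status     # solver status per record
le$duration   # seconds spent on each record
```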
Other error finding:
errors_removed(),
expand_weights(),
locate_errors(),
replace_errors()
errors_removed retrieves the errors detected by replace_errors()
    errors_removed(x, ...)
| Argument | Description |
|---|---|
| x | |
| ... | not used |
An errorlocation object (see errorlocation-class).
Other error finding:
errorlocation-class,
expand_weights(),
locate_errors(),
replace_errors()
    rules <- validator( profit + cost == turnover
                      , cost - 0.6*turnover >= 0
                      , cost>= 0
                      , turnover >= 0
                      )
    data <- data.frame(profit=755, cost=125, turnover=200)

    data_no_error <- replace_errors(data,rules)

    # faulty data was replaced with NA
    data_no_error

    errors_removed(data_no_error)

    # a bit more control, you can supply the result of locate_errors
    # to replace_errors, which is a good thing, otherwise replace_errors will call
    # locate_errors internally.
    error_locations <- locate_errors(data, rules)
    replace_errors(data, error_locations)
Expands a weight specification into a weight matrix to be used
by locate_errors and replace_errors. Weights allow for "guiding" the
error localization process, so that less reliable values/variables with lower
weight are selected first. See details on the specification.
    expand_weights(dat, weight = NULL, as.data.frame = FALSE, ...)
| Argument | Description |
|---|---|
| dat | |
| weight | weight specification, see details. |
| as.data.frame | if |
| ... | unused |
If the weights need fine-tuning, one approach is to generate a weight
data.frame with expand_weights and adjust it before calling locate_errors()
or replace_errors().
The following specifications for weight are supported:
NULL: generates a weight matrix with 1's
a named numeric, unmentioned columns will have weight 1
an unnamed numeric with a length equal to ncol(dat)
a data.frame with same number of rows as dat
a matrix with same number of rows as dat
Inf, NA weights are interpreted as variables that must not be
changed. Inf weights perform much better than setting a weight
to a large number.
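The fine-tuning scenario can be sketched like this. The data, the rules, and the decision to trust country more in the second record are all invented for illustration.

```r
library(validate)
library(errorlocate)

dat <- data.frame(age = c(49, 23), country = c("NL", "DE"))

# Start from a column-level specification...
w <- expand_weights(dat, c(age = 2, country = 1), as.data.frame = TRUE)

# ...then adjust a single cell: trust 'country' more in the second record
w$country[2] <- 5
w

# The adjusted data.frame can be passed on as the weight specification
rules <- validator(age >= 0, country %in% c("NL", "DE"))
le <- locate_errors(dat, rules, weight = w)
```

Working on the expanded data.frame keeps the per-cell weights visible and editable before any solver is run.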
matrix or data.frame of same dimensions as dat
Other error finding:
errorlocation-class,
errors_removed(),
locate_errors(),
replace_errors()
    dat <- read.csv(text=
    "age,country
     49,     NL
     23,     DE
    ", strip.white=TRUE)

    weight <- c(age = 2, country = 1)
    expand_weights(dat, weight)

    weight <- c(2, 1)
    expand_weights(dat, weight, as.data.frame = TRUE)

    # works too
    weight <- c(country=5)
    expand_weights(dat, weight)

    # specify a per row weight for country
    weight <- data.frame(country=c(1,5))
    expand_weights(dat, weight)

    # country should not be changed!
    weight <- c(country = Inf)
    expand_weights(dat, weight)
Implementation of the Fellegi-Holt algorithm using the ErrorLocalizer base class.
Given a set of validation rules and a dataset the Fellegi-Holt algorithm finds for each record
the smallest (weighted) combination of variables that are erroneous (if any).
Most users do not need this class and can use locate_errors().
errorlocalizer implements Fellegi-Holt using a MIP solver. When the
coefficients of the validation rules or the data values differ greatly in
magnitude, the solver can become numerically unstable, so consider rescaling
the data.
Utility function to inspect the MIP problem for one record.
inspect_mip can be used as a drop-in replacement for locate_errors(), but
it only uses the first record when multiple rows are supplied.
    inspect_mip(data, x, weight, ...)
| Argument | Description |
|---|---|
| data | data to be checked |
| x | validation rules or errorlocalizer object to be used for finding possible errors. |
| weight | |
| ... | optional parameters that are passed to |
This is useful for debugging how one record is translated into a mixed integer
problem, including the generated rules, objective, and LP representation.
See vignette("inspect_mip") for more details.
Other Mixed Integer Problem:
MipRules-class
    rules <- validator(x > 1)
    data <- list(x = 0)
    weight <- c(x = 1)

    mip <- inspect_mip(data, rules)
    print(mip)

    # inspect the LP problem (prior to solving it with lpSolveAPI)
    lp <- mip$to_lp()
    print(lp)

    # for large problems write the LP problem to disk for inspection
    # lpSolveAPI::write.lp(lp, "my_problem.lp")

    # solve the MIP system / find a solution
    res <- mip$execute()
    names(res)

    # lpSolveAPI status of finding a solution
    res$s

    # LP problem after solving (often simplified version of first LP)
    res$lp

    # records that are deemed "faulty"
    res$errors

    # values of variables used in the MIP formulation. Also contains a valid solution
    # for "faulty" variables
    res$values

    # see the derived MIP rules and objective function, used in the construction of
    # the LP problem
    mip$mip_rules()
    mip$objective
Check if rules are categorical
    is_categorical(x, ...)
| Argument | Description |
|---|---|
| x | validator or expression object |
| ... | not used |
logical indicating which rules are purely categorical/logical
errorlocate supports linear,
categorical and conditional rules to be used in finding errors. Other rule types
are ignored during error finding.
Other rule type:
is_conditional(),
is_linear()
    v <- validator( A %in% c("a1", "a2")
                  , B %in% c("b1", "b2")
                  , if (A == "a1") B == "b1"
                  , y > x
                  )

    is_categorical(v)
Check if rules are conditional rules
    is_conditional(rules, ...)
| Argument | Description |
|---|---|
| rules | validator object containing validation rules |
| ... | not used |
logical indicating which rules are conditional
errorlocate supports linear,
categorical and conditional rules to be used in finding errors. Other rule types
are ignored during error finding.
Other rule type:
is_categorical(),
is_linear()
    v <- validator( A %in% c("a1", "a2")
                  , B %in% c("b1", "b2")
                  , if (A == "a1") x > 1 # conditional
                  , if (y > 0) x >= 0 # conditional
                  , if (A == "a1") B == "b1" # categorical
                  )

    is_conditional(v)
Check which rules are linear rules.
    is_linear(x, ...)
| Argument | Description |
|---|---|
| x | |
| ... | not used |
logical indicating which rules are (purely) linear.
errorlocate supports linear,
categorical and conditional rules to be used in finding errors. Other rule types
are ignored during error finding.
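Because unsupported rules are ignored silently, it can help to check a rule set before running the localization. The sketch below combines the three predicates; the rule names r1 to r4 and the nonlinear rule are made up for illustration.

```r
library(validate)
library(errorlocate)

v <- validator(
  r1 = A %in% c("a1", "a2"),    # categorical
  r2 = if (A == "a1") x > 1,    # conditional
  r3 = y > x,                   # linear
  r4 = x * y > 1                # product of two variables: not linear
)

# A rule is usable if at least one of the predicates recognizes it
supported <- is_linear(v) | is_categorical(v) | is_conditional(v)
supported

# Rules that error localization would skip
names(v)[!supported]
```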
Other rule type:
is_categorical(),
is_conditional()
    v <- validator( A %in% c("a1", "a2")
                  , B %in% c("b1", "b2")
                  , if (A == "a1") B == "b1"
                  , y > x
                  , z + 1 < 2*x + 3*y
                  )

    is_linear(v)
Locate fields in a data.frame that are likely erroneous under a set of
validation rules. The method returns an errorlocation object (see
errorlocation-class), computed with localizer x.
    locate_errors(
      data,
      x,
      ...,
      cl = NULL,
      Ncpus = getOption("Ncpus", 1),
      timeout = 60
    )

    ## S4 method for signature 'data.frame,validator'
    locate_errors(
      data,
      x,
      weight = NULL,
      ref = NULL,
      ...,
      cl = NULL,
      Ncpus = getOption("Ncpus", 1),
      timeout = 60
    )

    ## S4 method for signature 'data.frame,ErrorLocalizer'
    locate_errors(
      data,
      x,
      weight = NULL,
      ref = NULL,
      ...,
      cl = NULL,
      Ncpus = getOption("Ncpus", 1),
      timeout = 60
    )
| Argument | Description |
|---|---|
| data | data to be checked |
| x | validation rules or errorlocalizer object to be used for finding possible errors. |
| ... | optional parameters that are passed to |
| cl | optional parallel / cluster. |
| Ncpus | number of nodes to use. See details |
| timeout | maximum number of seconds that the localizer should use per record. |
| weight | |
| ref | |
Use replace_errors() to remove flagged fields, typically by setting them to
NA. Use base::set.seed() beforehand to make calls reproducible.
Use an Inf weight specification to fix variables that should not be
changed.
See expand_weights() for more details.
locate_errors uses lpSolveAPI to formulate and solve a mixed integer
problem. See the vignettes for details.
The solver has many options, see lpSolveAPI::lp.control.options().
Noteworthy options include:
timeout: restricts the time the solver spends on a record (seconds)
break.at.value: set this to minimum weight + 1 to improve speed.
presolve: default in errorlocate is "rows". Set to "none" when you have
solutions where all variables are deemed wrong.
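As a small illustration of the first option, the sketch below lowers the per-record time limit and then checks how the solver fared. Only timeout is shown because it is a named argument in the usage above; the rules and data are illustrative.

```r
library(validate)
library(errorlocate)

rules <- validator(
  profit + cost == turnover,
  cost >= 0.6 * turnover,
  turnover >= 0
)
data <- data.frame(profit = 755, cost = 125, turnover = 200)

set.seed(42)
le <- locate_errors(data, rules, timeout = 10)  # at most 10 seconds per record

le$status     # solver status for each record
le$duration   # time actually spent, in seconds
values(le)
```

Comparing le$duration with the timeout gives a quick indication of whether records are hitting the limit.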
locate_errors can run on multiple cores using package parallel.
The easiest option is setting Ncpus to the number of desired cores.
Alternatively one can create a cluster object (parallel::makeCluster())
and use cl to pass the cluster object.
Or set cl to an integer which results in parallel::mclapply(), which only works
on non-Windows systems.
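The first two options might look like this. The data and rules are illustrative, and two workers is an arbitrary choice.

```r
library(validate)
library(errorlocate)
library(parallel)

rules <- validator(age >= 0, age <= 120)
data  <- data.frame(age = c(35, -4, 130, 60))

# Option 1: let locate_errors start the workers
le1 <- locate_errors(data, rules, Ncpus = 2)

# Option 2: manage the cluster yourself and pass it in via `cl`
cl <- makeCluster(2)
le2 <- locate_errors(data, rules, cl = cl)
stopCluster(cl)

values(le1)
values(le2)
```

Managing the cluster yourself is mainly useful when the same workers are reused across several calls.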
An errorlocation object (see errorlocation-class) describing the errors found.
Other error finding:
errorlocation-class,
errors_removed(),
expand_weights(),
replace_errors()
    rules <- validator( profit + cost == turnover
                      , cost >= 0.6 * turnover # cost should be at least 60% of turnover
                      , turnover >= 0 # cannot be negative.
                      )
    data <- data.frame( profit = 755
                      , cost = 125
                      , turnover = 200
                      )

    # use set.seed to make results reproducible
    set.seed(42)
    le <- locate_errors(data, rules)
    print(le)
    summary(le)

    v_categorical <- validator( branch %in% c("government", "industry")
                              , tax %in% c("none", "VAT")
                              , if (tax == "VAT") branch == "industry"
                              )
    data <- read.csv(text=
    "   branch, tax
    government, VAT
    industry  , VAT
    ", strip.white = TRUE)
    locate_errors(data, v_categorical)$errors

    v_logical <- validator( citizen %in% c(TRUE, FALSE)
                          , voted %in% c(TRUE, FALSE)
                          , if (voted == TRUE) citizen == TRUE
                          )
    data <- data.frame(voted = TRUE, citizen = FALSE)

    set.seed(42)
    locate_errors(data, v_logical, weight=c(2,1))$errors

    # try a conditional rule
    v <- validator( married %in% c(TRUE, FALSE)
                  , if (married==TRUE) age >= 17
                  )
    data <- data.frame( married = TRUE, age = 16)
    set.seed(42)
    locate_errors(data, v, weight=c(married=1, age=2))$errors

    # different weights per row
    data <- read.csv(text=
    "married, age
    TRUE, 16
    TRUE, 14
    ", strip.white = TRUE)

    weight <- read.csv(text=
    "married, age
    1, 2
    2, 1
    ", strip.white = TRUE)

    set.seed(42)
    locate_errors(data, v, weight = weight)$errors

    # fixate / exclude a variable from error localization
    # using an Inf weight
    weight <- c(age = Inf)
    set.seed(42)
    locate_errors(data, v, weight = weight)$errors
Create a MipRules object from validate::validator() rules.
This utility class translates rules into a mixed integer problem.
Most users should use locate_errors(), which handles translation and
execution automatically. MipRules is mainly for advanced users who want to
inspect or customize the optimization setup.
The MipRules class contains the following methods:
$execute() solves the mixed integer problem.
$to_lp() transforms the object into an lp_solve problem object.
$is_infeasible checks whether the current rule system is infeasible.
$set_values() sets observed values and weights (objective function).
Other Mixed Integer Problem:
inspect_mip()
    rules <- validator(x > 1)
    mr <- miprules(rules)
    mr$to_lp()
    mr$set_values(c(x=0), weights=c(x=1))
    mr$execute()
Find erroneous fields using locate_errors() and replace these
fields automatically with NA or a suggestion that is provided by the error detection algorithm.
    replace_errors(
      data,
      x,
      ref = NULL,
      ...,
      cl = NULL,
      Ncpus = getOption("Ncpus", 1),
      value = c("NA", "suggestion")
    )

    ## S4 method for signature 'data.frame,validator'
    replace_errors(
      data,
      x,
      ref = NULL,
      ...,
      cl = NULL,
      Ncpus = getOption("Ncpus", 1),
      value = c("NA", "suggestion")
    )

    ## S4 method for signature 'data.frame,ErrorLocalizer'
    replace_errors(
      data,
      x,
      ref = NULL,
      ...,
      cl = NULL,
      Ncpus = getOption("Ncpus", 1),
      value = c("NA", "suggestion")
    )

    ## S4 method for signature 'data.frame,errorlocation'
    replace_errors(
      data,
      x,
      ref = NULL,
      ...,
      cl = NULL,
      Ncpus = 1,
      value = c("NA", "suggestion")
    )
| Argument | Description |
|---|---|
| data | data to be checked |
| x | |
| ref | optional reference data set |
| ... | these parameters are handed over to |
| cl | optional cluster for parallel execution (see details) |
| Ncpus | number of nodes to use. (see details) |
| value | |
You can also pass the result of locate_errors() to replace_errors.
This is the preferred way when error localization takes a long time and
locate_errors was already called, because otherwise locate_errors is executed again.
The errors that were removed from the data.frame can be retrieved with the function
errors_removed(). For more control over error localization see locate_errors().
replace_errors has the same parallelization options as locate_errors() (see there).
data with erroneous values removed.
In general it is better to replace the erroneous fields with NA and apply a proper
imputation method. Suggested values from the error localization method may introduce an undesired bias.
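The two settings of value can be compared side by side. The sketch below uses illustrative rules and data; `with_na` and `with_suggestion` are names chosen here.

```r
library(validate)
library(errorlocate)

rules <- validator(
  profit + cost == turnover,
  cost >= 0.6 * turnover,
  turnover >= 0
)
data <- data.frame(profit = 755, cost = 125, turnover = 200)

# Default: flagged cells become NA and are left for a later imputation step
set.seed(42)
with_na <- replace_errors(data, rules, value = "NA")
with_na

# Alternative: flagged cells are filled with the value the solver found
set.seed(42)
with_suggestion <- replace_errors(data, rules, value = "suggestion")
with_suggestion
```

The suggested value is only one of possibly many values that satisfy the rules, which is why imputing the NA cells separately is usually the safer route.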
Other error finding:
errorlocation-class,
errors_removed(),
expand_weights(),
locate_errors()
    rules <- validator( profit + cost == turnover
                      , cost - 0.6*turnover >= 0
                      , cost>= 0
                      , turnover >= 0
                      )
    data <- data.frame(profit=755, cost=125, turnover=200)

    data_no_error <- replace_errors(data,rules)

    # faulty data was replaced with NA
    data_no_error

    errors_removed(data_no_error)

    # a bit more control, you can supply the result of locate_errors
    # to replace_errors, which is a good thing, otherwise replace_errors will call
    # locate_errors internally.
    error_locations <- locate_errors(data, rules)
    replace_errors(data, error_locations)
Translate linear rules into an LP problem
    translate_mip_lp(rules, objective = NULL, eps = 0.001, ...)
| Argument | Description |
|---|---|
| rules | MIP rules |
| objective | function |
| eps | accuracy for equality/inequality |
| ... | additional |