Title: | Generate Suggestions for Validation Rules |
---|---|
Description: | Generate suggestions for validation rules from a reference data set, which can be used as a starting point for domain specific rules to be checked with package 'validate'. |
Authors: | Edwin de Jonge [aut, cre], Olav ten Bosch [aut] |
Maintainer: | Edwin de Jonge <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.2 |
Built: | 2024-11-05 05:28:23 UTC |
Source: | https://github.com/data-cleaning/validatesuggest |
A constructed data set useful for detecting conditional dependencies.
car_owner
A data frame with 200 rows and 4 variables. Each row is a person with:
age of person
has a driver license; only persons older than 17 can have a license in this data set
monthly income
only persons with a driver license and a monthly income > 1500 can own a car
NA when there is no car
data("car_owner")
rules <- suggest_cond_rule(car_owner)
rules$rules
Suggests rules using the various suggestion checks. Use the more specific suggest functions for more control.
suggest_rules(
  d,
  vars = names(d),
  domain_check = TRUE,
  range_check = TRUE,
  pos_check = TRUE,
  type_check = TRUE,
  na_check = TRUE,
  unique_check = TRUE,
  ratio_check = TRUE,
  conditional_rule = TRUE
)
suggest_all(
  d,
  vars = names(d),
  domain_check = TRUE,
  range_check = TRUE,
  pos_check = TRUE,
  type_check = TRUE,
  na_check = TRUE,
  unique_check = TRUE,
  ratio_check = TRUE,
  conditional_rule = TRUE
)
write_all_suggestions(
  d,
  vars = names(d),
  file = stdout(),
  domain_check = TRUE,
  range_check = TRUE,
  type_check = TRUE,
  pos_check = TRUE,
  na_check = TRUE,
  unique_check = TRUE,
  ratio_check = TRUE,
  conditional_rule = TRUE
)
d |
|
vars |
|
domain_check |
if |
range_check |
if |
pos_check |
if |
type_check |
if |
na_check |
if |
unique_check |
if |
ratio_check |
if |
conditional_rule |
if |
file |
file to which the checks will be written. |
Returns a validate::validator() object with the suggested rules.
write_all_suggestions writes the rules to file and invisibly returns a named list of ranges for each variable.
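As a usage sketch, the snippet below calls suggest_rules on the car_owner data with two of the checks switched off. It uses only arguments listed in the signature above. The requireNamespace guard and the choice of which checks to disable are assumptions, added so the snippet still runs when the package is not installed.

```r
# Usage sketch: suggest rules for car_owner with some checks disabled.
rules <- NULL
if (requireNamespace("validatesuggest", quietly = TRUE)) {
  data("car_owner", package = "validatesuggest")
  rules <- validatesuggest::suggest_rules(
    car_owner,
    ratio_check  = FALSE,  # skip ratio checks
    unique_check = FALSE   # skip uniqueness checks
  )
  print(rules)
}
```

The result is meant as a starting point: the suggested rules should be reviewed and edited before they are used for validation.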
Fictitious test data set from the European (ESSnet) project on validation 2017.
task2
ID
Age of person
Marital status
Employed or not
Working hours
European (ESSnet) project on validation 2017
Suggest a conditional rule based on an association rule. This function derives conditional rules based on the non-existence of combinations of categories in pairs of variables. For each numerical variable a logical variable is derived that tests for positivity. It generates IF THEN rules based on two variables.
write_cond_rule(d, vars = names(d), file = stdout())
suggest_cond_rule(d, vars = names(d))
d |
|
vars |
|
file |
file to which the checks will be written. |
suggest_cond_rule returns a validate::validator() object with the suggested rules.
write_cond_rule invisibly returns a named list of ranges for each variable.
data(retailers, package="validate")
# will generate check for all columns in retailers that are
# complete.
suggest_na_check(retailers)
data("car_owner")
rules <- suggest_cond_rule(car_owner)
rules$rules
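To make the idea of deriving IF THEN rules from non-existent category combinations concrete, here is a minimal base-R sketch. It is not the package's implementation: the toy data, the column names and the textual rule format are all assumptions for illustration.

```r
# Toy reference data (hypothetical columns): no minor holds a license.
d <- data.frame(
  adult   = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  license = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Cross-tabulate the pair of variables; empty cells are combinations
# of categories that never occur in the reference data.
tab   <- table(d$adult, d$license)
empty <- which(tab == 0, arr.ind = TRUE)

# Write each empty cell as an IF THEN rule (illustrative text format).
rules <- sprintf(
  "if (adult == %s) license != %s",
  rownames(tab)[empty[, 1]],
  colnames(tab)[empty[, 2]]
)
print(rules)
```

Here the only combination that never occurs is a non-adult with a license, so the sketch yields a single rule for that pair.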
Suggest a domain check
write_domain_check(d, vars = names(d), only_positive = TRUE, file = stdout())
suggest_domain_check(d, vars = names(d), only_positive = TRUE)
d |
|
vars |
|
only_positive |
if |
file |
file to which the checks will be written. |
suggest_domain_check returns a validate::validator() object with the suggested rules.
write_domain_check invisibly returns a named list of checks for each variable.
data(SBS2000, package="validate")
suggest_range_check(SBS2000)
# checks the ranges of each variable
suggest_range_check(SBS2000[-1], min=TRUE, max=TRUE)
# checks the ranges of each variable
suggest_range_check(SBS2000, vars=c("turnover", "other.rev"), min=FALSE, max=TRUE)
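One way to picture a domain check is as a rule that restricts a categorical variable to the values seen in the reference data. The base-R sketch below illustrates that reading. The toy data and the `%in%` rule format are assumptions, not the package's actual output.

```r
# Toy reference data (hypothetical columns).
d <- data.frame(
  status = c("single", "married", "single", "widowed"),
  hours  = c(40, 32, 0, 20),
  stringsAsFactors = FALSE
)

# Only character or factor columns get a domain rule in this sketch.
is_cat <- vapply(d, function(x) is.character(x) || is.factor(x), logical(1))

# Collect the observed categories and write them as a membership rule.
rules <- vapply(names(d)[is_cat], function(v) {
  vals <- sort(unique(as.character(d[[v]])))
  sprintf("%s %%in%% c(%s)", v, paste0('"', vals, '"', collapse = ", "))
}, character(1))
print(unname(rules))
```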
Suggest a check for completeness.
write_na_check(d, vars = names(d), file = stdout())
suggest_na_check(d, vars = names(d))
d |
|
vars |
|
file |
file to which the checks will be written. |
suggest_na_check returns a validate::validator() object with the suggested rules.
write_na_check writes the rules to file and invisibly returns a named list of ranges for each variable.
data(retailers, package="validate")
# will generate check for all columns in retailers that are
# complete.
suggest_na_check(retailers)
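As the example comment notes, a completeness check is generated only for columns that are complete in the reference data. A minimal base-R sketch of that selection step follows. The toy data and the `!is.na()` rule text are assumptions, and the package may express the rule differently.

```r
# Toy reference data (hypothetical columns); turnover has a missing value.
d <- data.frame(
  id       = 1:4,
  turnover = c(10, NA, 30, 40),
  staff    = c(1, 2, 3, 4)
)

# Columns without any NA are candidates for a completeness rule.
complete <- names(d)[colSums(is.na(d)) == 0]
rules <- sprintf("!is.na(%s)", complete)
print(rules)
```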
Suggest a positivity check
write_pos_check(d, vars = names(d), only_positive = TRUE, file = stdout())
suggest_pos_check(d, vars = names(d), only_positive = TRUE)
d |
|
vars |
|
only_positive |
if |
file |
file to which the checks will be written. |
suggest_pos_check returns a validate::validator() object with the suggested rules.
write_pos_check writes the rules to file and invisibly returns a named list of checks for each variable.
data(SBS2000, package="validate")
suggest_range_check(SBS2000)
# checks the ranges of each variable
suggest_range_check(SBS2000[-1], min=TRUE, max=TRUE)
# checks the ranges of each variable
suggest_range_check(SBS2000, vars=c("turnover", "other.rev"), min=FALSE, max=TRUE)
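A positivity check can be pictured as follows: if a numeric variable never takes a negative value in the reference data, suggest a rule that keeps it that way. The base-R sketch below shows this reading. The toy data and the choice of `>= 0` (rather than `> 0`) are assumptions, not the package's documented behaviour.

```r
# Toy reference data (hypothetical columns); profit has a negative value.
d <- data.frame(
  turnover = c(10, 0, 35),
  profit   = c(5, -2, 7),
  name     = c("a", "b", "c")
)

# Restrict to numeric columns, then keep those with no negative values.
is_num  <- vapply(d, is.numeric, logical(1))
non_neg <- vapply(d[is_num], function(x) all(x >= 0, na.rm = TRUE), logical(1))

rules <- sprintf("%s >= 0", names(non_neg)[non_neg])
print(rules)
```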
Suggest a range check
write_range_check(d, vars = names(d), min = TRUE, max = FALSE, file = stdout())
suggest_range_check(d, vars = names(d), min = TRUE, max = FALSE)
d |
|
vars |
|
min |
|
max |
|
file |
file to which the checks will be written. |
suggest_range_check returns a validate::validator() object with the suggested rules.
write_range_check writes the rules to file and invisibly returns a named list of ranges for each variable.
data(SBS2000, package="validate")
suggest_range_check(SBS2000)
# checks the ranges of each variable
suggest_range_check(SBS2000[-1], min=TRUE, max=TRUE)
# checks the ranges of each variable
suggest_range_check(SBS2000, vars=c("turnover", "other.rev"), min=FALSE, max=TRUE)
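The role of the min and max switches can be sketched in base R: take the observed range of each numeric variable and emit a lower bound, an upper bound, or both. The helper name `suggest_range_sketch`, the toy data and the rule text are assumptions for illustration. The package's own output may look different.

```r
# Hypothetical helper mimicking the min/max switches of a range check.
suggest_range_sketch <- function(d, min = TRUE, max = FALSE) {
  num   <- names(d)[vapply(d, is.numeric, logical(1))]
  rules <- character(0)
  for (v in num) {
    r <- range(d[[v]], na.rm = TRUE)  # observed minimum and maximum
    if (min) rules <- c(rules, sprintf("%s >= %s", v, format(r[1])))
    if (max) rules <- c(rules, sprintf("%s <= %s", v, format(r[2])))
  }
  rules
}

# Toy reference data (hypothetical columns).
d <- data.frame(age = c(23, 41, 67), hours = c(0, 40, 36))
rules <- suggest_range_sketch(d, min = TRUE, max = TRUE)
print(rules)
```

With the defaults (min = TRUE, max = FALSE) only the lower bounds would be emitted.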
Suggest ratio checks
write_ratio_check(
  d,
  vars = names(d),
  file = stdout(),
  lin_cor = 0.95,
  digits = 2
)
suggest_ratio_check(d, vars = names(d), lin_cor = 0.95, digits = 2)
d |
|
vars |
|
file |
file to which the checks will be written. |
lin_cor |
threshold for abs correlation to be included (details) |
digits |
number of digits for rounding |
suggest_ratio_check returns a validate::validator() object with the suggested rules.
write_ratio_check writes the rules to file and invisibly returns a named list of checks for each variable.
data(SBS2000, package="validate")
# generates upper and lower checks for the
# ratio of two variables if their correlation is
# bigger than `lin_cor`
suggest_ratio_check(SBS2000, lin_cor=0.98)
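The interplay of lin_cor and digits can be illustrated with a small base-R sketch: for every pair of numeric variables whose absolute correlation reaches the threshold, bound their ratio by its observed minimum and maximum, rounded to the given number of digits. The toy data and the rule text are assumptions. This is a reading of the description above, not the package's code.

```r
# Toy reference data (hypothetical columns).
d <- data.frame(
  turnover = c(100, 200, 300, 400),
  costs    = c(50, 100, 150, 250),
  staff    = c(3, 1, 4, 2)
)
lin_cor <- 0.95  # threshold on the absolute correlation
digits  <- 2     # rounding of the ratio bounds

rules <- character(0)
for (p in combn(names(d), 2, simplify = FALSE)) {
  x <- d[[p[1]]]
  y <- d[[p[2]]]
  # Only strongly correlated pairs get a lower and an upper ratio check.
  if (abs(cor(x, y)) >= lin_cor) {
    bounds <- round(range(x / y), digits)
    rules <- c(
      rules,
      sprintf("%s >= %s * %s", p[1], bounds[1], p[2]),
      sprintf("%s <= %s * %s", p[1], bounds[2], p[2])
    )
  }
}
print(rules)
```

Only turnover and costs are correlated strongly enough here, so staff does not appear in any rule.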
Suggest type check
write_type_check(d, vars = names(d), file = stdout())
suggest_type_check(d, vars = names(d))
d |
|
vars |
|
file |
file to which the checks will be written. |
suggest_type_check returns a validate::validator() object with the suggested rules.
write_type_check writes the rules to file and invisibly returns a named list of types for each variable.
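A type check can be thought of as recording the class of each column in the reference data and turning it into a rule. The base-R sketch below shows one way to do that. The toy data, the `type_fun` lookup table and the `is.*()` rule text are assumptions for illustration.

```r
# Toy reference data (hypothetical columns).
d <- data.frame(
  id     = c("a", "b"),
  age    = c(31L, 45L),
  income = c(1200.5, 3100),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from column class to a type-testing function.
type_fun <- c(
  character = "is.character",
  integer   = "is.integer",
  numeric   = "is.numeric",
  logical   = "is.logical",
  factor    = "is.factor"
)

# One rule per column, based on the first class of that column.
rules <- vapply(names(d), function(v) {
  sprintf("%s(%s)", type_fun[[class(d[[v]])[1]]], v)
}, character(1))
print(unname(rules))
```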
Suggest uniqueness checks
write_unique_check(d, vars = names(d), file = stdout(), fraction = 0.95)
suggest_unique_check(d, vars = names(d), fraction = 0.95)
d |
|
vars |
|
file |
file to which the checks will be written. |
fraction |
if values in a column > |
suggest_unique_check returns a validate::validator() object with the suggested rules.
write_unique_check writes the rules to file and invisibly returns a named list of checks for each variable.
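The description of fraction is cut off above, so its exact role is not stated here. One plausible reading, shown in the base-R sketch below, is that a uniqueness rule is suggested when the share of distinct values in a column reaches the threshold. That reading, the toy data and the `all_unique()` rule text are all assumptions.

```r
# Toy reference data (hypothetical columns).
d <- data.frame(
  id    = 1:10,
  group = rep(c("a", "b"), 5)
)
fraction <- 0.95  # assumed: minimum share of distinct values

# Share of distinct values per column.
frac_unique <- vapply(d, function(x) length(unique(x)) / length(x), numeric(1))

# Suggest a uniqueness rule for columns that reach the threshold.
rules <- sprintf("all_unique(%s)", names(d)[frac_unique >= fraction])
print(rules)
```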