dcmodify
Rule-based data processing for data cleaning.
Package version 0.9.0.
Please use citation("dcmodify")
to cite the package.
The dcmodify
package allows users to declare
(conditional) data processing steps without hard-coding them in an R
script.
The motivating use case is where domain experts need to frequently update (fix) values of a data set depending on data conditions. Such updates require (1) selecting the cases where the conditions apply, and (2) updating the values.
We create a dataset where we asked some imaginary respondents about their average water consumption per day.
> water <- data.frame(
+ name = c("Ross", "Robert", "Martin", "Brian", "Simon")
+ , consumption = c(110, 105, 0.15, 95, -100)
+ )
> water
name consumption
1 Ross 110.00
2 Robert 105.00
3 Martin 0.15
4 Brian 95.00
5 Simon -100.00
Here, Martin submitted his water consumption in m3 while the others responded in liters. Such a unit of measure error can be detected and fixed. Simon took the engineering approach, interpreted consumption as a sink, and put a negative sign in front of his answer. Again, this is a data error that can be detected and fixed easily.
If such errors occur frequently, it makes sense to store the
treatment. In dcmodify
this can be solved as follows.
> library(dcmodify)
> # define a rule set (here with one rule)
> rules <- modifier(
+ if ( abs(consumption) <= 1 ) consumption <- 1000*consumption
+ , if ( consumption < 0 ) consumption <- -1 * consumption )
> # apply the ruleset to the data
> out <- modify(water, rules)
> out
name consumption
1 Ross 110
2 Robert 105
3 Martin 150
4 Brian 95
5 Simon 100
In the first step we define a set of conditional data modifying rules of the form:
if (some condition) change somthing
next, using modifier()
, these rules are applied
record-wise.
In dcmodify
conditional data modifying rules are first
class citizens. Modifying rules can be created, deleted, read from or
written to file, filtered, selected, and investigated. And of course,
applied to data.
In particular, the rules
object of the previous example
is basically a list of class modifier
.
> rules
Object of class modifier with 2 elements:
M1:
if (abs(consumption) <= 1) consumption <- 1000 * consumption
M2:
if (consumption < 0) consumption <- -1 * consumption
For example, we may select one rule and apply it to our original data set.
> modify(water, rules[2])
name consumption
1 Ross 110.00
2 Robert 105.00
3 Martin 0.15
4 Brian 95.00
5 Simon 100.00
We can ask which variables are used in the modifying rules (here: only one).
retailers
dataset from the validate package
using data("retailers", package="validate")
.other.rev
to zero if it
is missing, (2) replaces negative other.rev
with the
absolute value.The dcmodify
package supports reading/writing rules from
free text file, yaml
files, or data frames. For example,
consider the contents of the file myrules.txt
(to try the
following code, create such a file yourself).
# myrules.txt
# unit of measure error
if (abs(consumption) <= 1){
consumption <- 1000*consumption
}
# sign error
if (consumption < 0 ){
consumption <- -1 * consumption
}
Reading the rules is done with the .file
argument.
> rules <- modifier(.file="myrules.txt")
> rules
Object of class modifier with 2 elements:
M1:
if (abs(consumption) <= 1) {
consumption <- 1000 * consumption
}
M2:
if (consumption < 0) {
consumption <- -1 * consumption
}
A second way to store rules is in yaml format. This allows one to add
metadata to rules, including a name, label or description. To
demonstrate this, we will write the rules to yaml
file and
print the file contents.
> fn <- tempfile()
> # export rules to yaml format
> export_yaml(rules,file=fn)
> # print file contents
> readLines(fn) |> paste(collapse="\n") |> cat()
rules:
- expr:
- if (abs(consumption) <= 1) {
- ' consumption <- 1000 * consumption'
- '}'
name: M1
label: ''
description: ''
created: 2024-12-11 04:10:15.274011
origin: myrules.txt
meta: []
- expr:
- if (consumption < 0) {
- ' consumption <- -1 * consumption'
- '}'
name: M2
label: ''
description: ''
created: 2024-12-11 04:10:15.274011
origin: myrules.txt
meta: []
Finally, it is possible to read rules from (and export to) data
frame. A rule data frame must at least contain a character
column named rule
; all other columns are considered
metadata.
> d <- data.frame(
+ name = c("U1","S1")
+ , label = c("Unit error", "sign error")
+ )
> d$rule <- c(
+ "if(abs(consumption)<=1) consuption <- 1000 * consumption"
+ ,"if(consumption < 0) consumption <- -1 * consumption"
+ )
> d
name label rule
1 U1 Unit error if(abs(consumption)<=1) consuption <- 1000 * consumption
2 S1 sign error if(consumption < 0) consumption <- -1 * consumption
Reading from data frame is done with the .data
argument.
> myrules <- modifier(.data=d)
> myrules
Object of class modifier with 2 elements:
U1: Unit error
if (abs(consumption) <= 1) consuption <- 1000 * consumption
S1: sign error
if (consumption < 0) consumption <- -1 * consumption
Using the retailers
dataset of the previous exercises,
define a file with rules that
other.rev
with zero (0)See ?retailers
for the meaning of the variables.
It is possible to use data not in the treated dataset in your data
modification rules. This so-called reference data needs to be added in
the call to modify()
.
> dat <- data.frame(x=seq_len(nrow(women)))
> m <- modifier(if (x > 2) x <- ref$height/2)
> out <- modify(dat, m, ref=women)
> head(out)
x
1 1.0
2 2.0
3 30.0
4 30.5
5 31.0
6 31.5
Note that in this form, it is necessary to use ref$
to
refer to the women
dataset in the context of modifying
rules. This can be customized by passing the reference data as a named
list or environment.
> m <- modifier(if (x > 2) x <- women$height/2)
> out <- modify(dat, m, ref=list(women=women))
> head(out,3)
x
1 1
2 2
3 30
Or, equivalently
The package supports rules of the form
if (some condition holds){
change some existing values
}
where the else
clause is optional, and the rules are
executed record-by-record. There may be multiple expressions in each
{}
block, and it is also allowed to have nested
if-else
statements.
dcmodify
allows rule-by-rule change tracking via
integration with the lumberjack
package1. There are many ways of
following what happens to a data file, and we refere to 1 for an overview of the possibilities of
the lumberjack
package. Here we demonstrate how to use the
cellwise logger, which logs all changes cell-by-cell.
> library(lumberjack)
> # create a logger (see ?cellwise)
> lgr <- cellwise$new(key="name")
> # create rules
>
> rules <- modifier(
+ if ( abs(consumption) <= 1 ) consumption <- 1000*consumption
+ , if ( consumption < 0 ) consumption <- -1 * consumption )
> # apply rules, and pass logger object to modify()
> out <- modify(water, rules, logger=lgr)
> # check what happened, by dumping the log and reading in
> # the csv.
> logfile <- tempfile()
> lgr$dump(file=logfile)
Dumped a log at /tmp/RtmpfgtHwT/file3fd33bf5e5b
> read.csv(logfile)
step time srcref expression
1 1 2024-12-11 04:10:15 UTC <modifier>#1-1 consumption <- 1000 * consumption
2 2 2024-12-11 04:10:15 UTC <modifier>#2-2 consumption <- -1 * consumption
key variable old new
1 Martin consumption 0.15 150
2 Simon consumption -100.00 100
1 van der Loo MPJ (2021). Monitoring Data in R with the lumberjack Package. Journal of Statistical Software, 98(1), 1–13. doi:10.18637/jss.v098.i01.