Title: | Unified Framework for Data Quality Control |
---|---|
Description: | An easy framework for setting up a quality control workflow on a dataset. Includes a range of functions for establishing adaptable data quality control. |
Authors: | Luis Garcez [aut, cre, cph] |
Maintainer: | Luis Garcez <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-02-03 05:10:35 UTC |
Source: | https://github.com/luisgarcez11/qualitycontrol |
An example dataset related to amyotrophic lateral sclerosis (ALS).
als_data
A list
subjid | Subject ID |
p1 | ALSFRS-R 1 |
p2 | ALSFRS-R 2 |
p3 | ALSFRS-R 3 |
p4 | ALSFRS-R 4 |
p5 | ALSFRS-R 5 |
p6 | ALSFRS-R 6 |
p7 | ALSFRS-R 7 |
p8 | ALSFRS-R 8 |
p9 | ALSFRS-R 9 |
x1r | ALSFRS-R R1 |
x2r | ALSFRS-R R2 |
x3r | ALSFRS-R R3 |
age_at_baseline | Age at baseline |
age_at_onset | Age at onset |
onset | Region of onset |
baseline_date | Baseline date |
death_date | Death date |
An example dataset containing a Quality Control mapping
als_data_qc_mapping
A list of 3 tibbles.
missing | Table with all the 'missing' tests. |
inconsistencies | Table with all the 'inconsistencies' tests. |
range | Table with all the 'out of range' tests. |
QC dataset using a specific variable mapping
qc_data(data, qc_mapping, output_file = NULL)
data |
A data frame, data frame extension (e.g. a |
qc_mapping |
A list of data frames or data frame extensions (e.g. a |
output_file |
(optional) File path ended in |
A data frame containing all the findings.
qc_data(als_data, als_data_qc_mapping)
read_qc_mapping reads an .xlsx file that contains the QC mapping.
read_qc_mapping(path)
path |
Path of the Excel file to be read. The file should contain 3 tabs named missing, inconsistencies and range. Each tab corresponds to one QC mapping table.
The columns specified above should contain specific values:
|
A list containing all the QC mapping tables
Test if variable values are duplicated
test_duplicated(data, variable)
data |
data to be tested. |
variable |
The variable to be tested. |
A data frame containing all the findings regarding the applied test.
test_duplicated(als_data, 'subjid')
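To make the idea behind this check concrete, the following base R sketch flags every row whose value in a given column occurs more than once. It is not the package's implementation: the helper `find_duplicates` and the toy data frame are hypothetical, and the columns of the data frame returned by test_duplicated may differ.

```r
# Hypothetical toy data; not the als_data object shipped with the package.
toy <- data.frame(
  subjid = c("A01", "A02", "A02", "A03"),
  age    = c(54, 61, 61, 47),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every row whose value in `variable`
# appears more than once (first and later occurrences alike).
find_duplicates <- function(data, variable) {
  values <- data[[variable]]
  is_dup <- duplicated(values) | duplicated(values, fromLast = TRUE)
  data[is_dup, , drop = FALSE]
}

dups <- find_duplicates(toy, "subjid")
print(dups)
```

Combining `duplicated()` with `fromLast = TRUE` reports all copies of a repeated value, which is usually what a reviewer needs when deciding which record to keep.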
Test the inconsistencies between variables on a dataset
test_inconsistencies(data, variable1, variable2, relation)
data |
data to be tested. |
variable1 |
The variable to be tested. |
variable2 |
The variable to be tested. |
relation |
String such as 'greater_than', 'greater_than_or_equal', 'lower_than_or_equal' or 'lower_than'. |
A data frame containing all the findings regarding the applied test.
test_inconsistencies(als_data, 'baseline_date', 'death_date', relation = 'lower_than')
test_inconsistencies(als_data, 'age_at_baseline', 'age_at_onset', relation = 'greater_than')
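As a rough illustration of what a relation check between two variables involves, the base R sketch below returns the rows where the expected relation does not hold. It rests on assumptions: the helper `find_inconsistencies` and the toy data are hypothetical, and it assumes that `relation` describes the expected relation of variable1 to variable2, so that violations are the findings. The package may handle missing values and output columns differently.

```r
# Hypothetical toy data; not the als_data object shipped with the package.
toy <- data.frame(
  subjid        = c("A01", "A02", "A03"),
  baseline_date = as.Date(c("2015-03-01", "2016-07-15", "2018-01-10")),
  death_date    = as.Date(c("2017-05-20", "2016-02-01", NA)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return rows where `variable1 <relation> variable2`
# is violated. Assumes `relation` names the relation that SHOULD hold.
find_inconsistencies <- function(data, variable1, variable2, relation) {
  compare <- switch(relation,
    greater_than          = `>`,
    greater_than_or_equal = `>=`,
    lower_than            = `<`,
    lower_than_or_equal   = `<=`,
    stop("Unknown relation: ", relation)
  )
  holds <- compare(data[[variable1]], data[[variable2]])
  # Rows with a missing value cannot be compared and are not flagged here.
  data[!is.na(holds) & !holds, , drop = FALSE]
}

bad <- find_inconsistencies(toy, "baseline_date", "death_date", "lower_than")
print(bad)
```

In this sketch only subject A02 is returned, because its baseline date falls after its death date; A03 is skipped since its death date is missing.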
Test the variable missingness on a dataset
test_missing(data, variable)
data |
data to be tested. |
variable |
The variable to be tested. |
A data frame containing all the findings regarding the applied test.
test_missing(als_data, 'p8')
test_missing(als_data, 'p1')
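For readers who want to see the underlying idea, a missingness check can be sketched in base R as below. The helper `find_missing` and the toy data are hypothetical and only approximate the concept; the findings table produced by test_missing may be structured differently.

```r
# Hypothetical toy data; not the als_data object shipped with the package.
toy <- data.frame(
  subjid = c("A01", "A02", "A03"),
  p1     = c(4, NA, 3),
  p8     = c(NA, NA, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows where `variable` is NA.
find_missing <- function(data, variable) {
  data[is.na(data[[variable]]), , drop = FALSE]
}

missing_p8 <- find_missing(toy, "p8")
missing_p1 <- find_missing(toy, "p1")
print(missing_p8)
print(missing_p1)
```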
Test the range of a variable on a dataset
test_range(data, variable, type, categories = NULL, lower_value = NULL, upper_value = NULL)
data |
data to be tested. |
variable |
The variable to be tested. |
type |
String such as 'categorical', 'date' or 'numeric' |
categories |
Only to be filled if |
lower_value |
Only to be filled if |
upper_value |
Only to be filled if |
A data frame containing all the findings regarding the applied test.
test_range(als_data, 'onset', categories = c('bulbar', 'respiratory', 'spinal'), type = 'categorical')
test_range(als_data, 'age_at_baseline', lower_value = 20, upper_value = 100, type = 'numeric')
test_range(als_data, 'age_at_onset', lower_value = 20, upper_value = 100, type = 'numeric')
test_range(als_data, 'baseline_date', lower_value = '2000-01-01', upper_value = '2022-01-01', type = 'date')
test_range(als_data, 'death_date', lower_value = '2000-01-01', upper_value = '2022-01-01', type = 'date')
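The three range types can be illustrated with a small base R sketch. Everything here is an assumption made for illustration: the helper `find_out_of_range` and the toy data are hypothetical, bounds are treated as inclusive, and missing values are not flagged. The package's own handling of bounds, missing values and output columns may differ.

```r
# Hypothetical toy data; not the als_data object shipped with the package.
toy <- data.frame(
  subjid          = c("A01", "A02", "A03"),
  onset           = c("bulbar", "spinal", "limb"),
  age_at_baseline = c(54, 17, 63),
  baseline_date   = c("2015-03-01", "1999-12-31", "2018-01-10"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return rows whose value falls outside the allowed
# categories or outside [lower_value, upper_value] (bounds assumed inclusive).
find_out_of_range <- function(data, variable, type, categories = NULL,
                              lower_value = NULL, upper_value = NULL) {
  values <- data[[variable]]
  in_range <- switch(type,
    # Missing values are treated as in range; a missingness check covers them.
    categorical = is.na(values) | values %in% categories,
    numeric     = values >= lower_value & values <= upper_value,
    date        = {
      d <- as.Date(values)
      d >= as.Date(lower_value) & d <= as.Date(upper_value)
    },
    stop("Unknown type: ", type)
  )
  data[!is.na(in_range) & !in_range, , drop = FALSE]
}

bad_onset <- find_out_of_range(toy, "onset", type = "categorical",
                               categories = c("bulbar", "respiratory", "spinal"))
bad_age   <- find_out_of_range(toy, "age_at_baseline", type = "numeric",
                               lower_value = 20, upper_value = 100)
bad_date  <- find_out_of_range(toy, "baseline_date", type = "date",
                               lower_value = "2000-01-01",
                               upper_value = "2022-01-01")
print(bad_onset)
print(bad_age)
print(bad_date)
```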