Spout Coverage is one of the first tools used to check data quality once the datasets and data dictionaries have been assembled. It compares the prepared datasets against the variables defined in the data dictionary. To test the coverage of a dataset, run spout c in a terminal or command prompt; the output is similar to the one below.
informinion:example-data-dictionary michael$ spout c
Parsing: csvs/0.1.0/example-dataset-0.1.0.csv
.........................................................................
Using dataset in csvs/0.1.0/
Generating: index.html
/Users/michael/code/examples/example-data-dictionary/coverage/index.html
To function properly, Spout requires a local copy of the latest dataset saved inside the repository in a 'csvs' folder.
Example: /Users/michael/code/examples/example-data-dictionary/csvs/0.1.0/example-dataset-0.1.0.csv
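The folder layout in the example path can be checked before running the coverage command. The following is a minimal sketch, not part of Spout itself: the function name find_datasets is hypothetical, and it assumes only the csvs/<version>/ layout shown in the example above.

```python
from pathlib import Path


def find_datasets(repo_root, version):
    """Return the CSV files stored under <repo_root>/csvs/<version>/.

    Hypothetical helper: it mirrors the folder layout shown in the
    example path and does not call Spout.
    """
    folder = Path(repo_root) / "csvs" / version
    if not folder.is_dir():
        # No dataset folder exists for this version.
        return []
    # Sort so the result is stable across platforms.
    return sorted(folder.glob("*.csv"))
```

An empty list means there is no dataset folder for that version, or that the folder contains no CSV files.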
Spout uses the CSV dataset in this folder, tries to match it against the variables in the data dictionary of the same version, and generates a coverage report similar to this one, which has been annotated to highlight the report's features:
Features:
VERSION file
The 'Variables Not Found in Any CSV' row represents the number of variables that are in the data dictionary and do not correspond to any variables in the dataset.
The different columns in the report give more detailed information about each variable, such as the type, filename, variable ID, the domain (if applicable), and all of the values for type: choices variables.
The report uses a similar highlighting color scheme to note where domain options do not properly cover the options in the dataset.
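The checks described above can be illustrated with a small sketch. This is not Spout's implementation: the coverage function is hypothetical, and the data dictionary is assumed to be a plain mapping from variable ID to a list of allowed domain values (or None when the variable has no domain).

```python
import csv
import io


def coverage(dictionary, csv_text):
    """Compare data dictionary variables against one CSV dataset.

    dictionary: maps variable ID -> list of allowed domain values,
                or None when the variable has no domain (assumed shape).
    csv_text:   the dataset as CSV text with a header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = set(reader.fieldnames or [])

    # Variables defined in the dictionary with no matching CSV column.
    not_in_csv = sorted(set(dictionary) - columns)
    # CSV columns with no matching variable in the dictionary.
    not_in_dictionary = sorted(columns - set(dictionary))

    # For variables with a domain, collect dataset values the domain omits.
    uncovered = {}
    for variable, domain in dictionary.items():
        if domain is None or variable not in columns:
            continue
        seen = {row[variable] for row in rows if row[variable] != ""}
        missing = sorted(seen - set(domain))
        if missing:
            uncovered[variable] = missing
    return not_in_csv, not_in_dictionary, uncovered


# Illustrative data only; these variables are not from a real project.
example_dictionary = {"age": None, "gender": ["1", "2"], "bmi": None}
example_csv = "age,gender,visit\n34,1,1\n51,3,2\n"
report = coverage(example_dictionary, example_csv)
print(report)
```

In this example, bmi is defined in the dictionary but missing from the CSV, visit appears in the CSV without a dictionary entry, and the value 3 for gender is not covered by its domain.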
Once the coverage report shows that the data dictionary matches the dataset 100%, someone familiar with the project can begin to examine the actual quality of the data in the dataset.