Validation scripts for checking CSV files against expected schemas, column types, and data constraints.
Install required dependencies:
pip install -r requirements.txt
A helper file for loading and manipulating data
validate_data.py validates all CSV files in a folder against their schema definitions.
Usage:
python validate_data.py <folder_path> [--sample-size N]
Examples:
# Validate all CSV files in a folder
python validate_data.py /path/to/csv/folder
# Validate with custom sample size
python validate_data.py /path/to/csv/folder --sample-size 200000
Options:
- folder_path: Path to folder containing CSV files (required)
- --sample-size: Number of rows to sample from each CSV (default: 100000)
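The command-line interface described above could be declared with argparse roughly as follows. This is a minimal sketch, not the script's actual parser: the function name build_parser is hypothetical, and only the positional folder_path, the --sample-size option, and its documented default are taken from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented usage (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Validate all CSV files in a folder against their schema definitions."
    )
    # Required positional argument: the folder that holds the CSV files.
    parser.add_argument("folder_path", help="Path to folder containing CSV files")
    # Optional row cap per file; the default mirrors the Options list above.
    parser.add_argument(
        "--sample-size",
        type=int,
        default=100_000,
        metavar="N",
        help="Number of rows to sample from each CSV",
    )
    return parser


# Parse the second example invocation from this README.
args = build_parser().parse_args(["/path/to/csv/folder", "--sample-size", "200000"])
print(args.folder_path, args.sample_size)
```

argparse exposes --sample-size as args.sample_size, and a missing folder_path makes the parser exit with a usage error.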
Output:
- Validates each CSV file found in the folder
- Provides a summary report showing passed/failed files
- Exits with code 0 if all files pass, code 1 if any fail
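The reporting and exit-code behavior listed above can be illustrated with a short sketch. The function summarize and the shape of its input (a mapping from file name to pass/fail) are assumptions for illustration; the real script's report format is not shown in this README.

```python
def summarize(results: dict[str, bool]) -> int:
    """Print a pass/fail summary and return the exit code (hypothetical helper)."""
    failed = [name for name, ok in results.items() if not ok]
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(f"{len(results) - len(failed)} passed, {len(failed)} failed")
    # Exit code 0 only when every file passed, 1 if any file failed.
    return 1 if failed else 0


# Made-up results for two files; a real run would pass this value to sys.exit().
exit_code = summarize({"africa_rural_urban.csv": True, "subnational_student.csv": False})
```

Returning the code instead of exiting inside the function keeps the summary logic easy to test.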
The scripts validate the following Africa data tables:
- africa_rural_urban
- africa_employed_employment_type
- total_working_population
- africa_education_inactive
- africa_education_student
- africa_education_unemployed
- employed_education_by_sector
- employed_working_poor
- africa_sector_employed
- employed_education
- employed_formality_status
- africa_employed_sector_group_income
- subnational_student
- subnational_unemployed
- subnational_inactive
- subnational_employed
- subnational_employed_working_poor
- subnational_employed_sector_group_income
- subnational_employed_employment_type
The scripts perform the following validations:
- Missing Columns: Checks that all required columns are present
- Column Types: Validates data types (Int64, Float64, String)
- Nullable Constraints: Ensures non-nullable columns don't contain nulls
- Categorical Values: Validates that sector_group contains only: Industry, Agriculture, Services
- Non-empty Strings: Checks that required string columns are not empty
- Integer Validation: Ensures integer columns contain valid integers
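The checks above can be sketched with the standard library alone. Everything here is an assumption made for illustration: the SCHEMA contents, the column names other than sector_group, the function names, and the choice to treat an empty CSV field as null. The project's actual schema definitions and dataframe library are not shown in this README.

```python
import csv
import io

# Hypothetical schema: column name -> (type, nullable). Not the project's real definition.
SCHEMA = {
    "country": ("String", False),
    "sector_group": ("String", False),
    "employed": ("Int64", False),
    "income": ("Float64", True),
}
ALLOWED_SECTOR_GROUPS = {"Industry", "Agriculture", "Services"}
PARSERS = {"Int64": int, "Float64": float}


def parses_as(value: str, dtype: str) -> bool:
    """Return True if the text can be read as the given type (String always passes)."""
    try:
        PARSERS.get(dtype, str)(value)
        return True
    except ValueError:
        return False


def validate_rows(header, rows) -> list[str]:
    """Run the missing-column, null, type, and categorical checks on sampled rows."""
    missing = [col for col in SCHEMA if col not in (header or [])]
    if missing:
        return [f"missing columns: {missing}"]
    errors = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col, (dtype, nullable) in SCHEMA.items():
            value = (row.get(col) or "").strip()
            if value == "":
                # Assumption: an empty field counts as null (and as an empty string).
                if not nullable:
                    errors.append(f"line {line_no}: {col} is null or empty")
                continue
            if not parses_as(value, dtype):
                errors.append(f"line {line_no}: {col} is not a valid {dtype}: {value!r}")
            if col == "sector_group" and value not in ALLOWED_SECTOR_GROUPS:
                errors.append(f"line {line_no}: unexpected sector_group {value!r}")
    return errors


# Made-up sample: the second data row has a bad category and a bad integer.
sample = "country,sector_group,employed,income\nKenya,Services,120,3.5\nGhana,Mining,abc,\n"
reader = csv.DictReader(io.StringIO(sample))
errors = validate_rows(reader.fieldnames, reader)
print("\n".join(errors))
```

Collecting messages in a list, instead of raising on the first problem, lets one pass report every issue in a file.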
- CSV filenames should match table names (e.g., africa_rural_urban.csv)
- Only the first N rows of each file are sampled for validation to conserve memory, where N is set by --sample-size
- Files with unrecognized table names are skipped with a warning
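The filename matching, skipping, and row sampling described in these notes might look like the sketch below. The names KNOWN_TABLES, sample_rows, and load_samples are hypothetical, KNOWN_TABLES holds only two of the table names for brevity, and the warning text is invented. The standard csv module stands in for whatever reader the script really uses.

```python
import csv
import itertools
import tempfile
from pathlib import Path

# Two of the documented table names; the real script presumably knows all of them.
KNOWN_TABLES = {"africa_rural_urban", "subnational_employed"}


def sample_rows(csv_path: Path, sample_size: int) -> list[dict]:
    """Read only the first sample_size data rows so memory use stays bounded."""
    with csv_path.open(newline="") as fh:
        return list(itertools.islice(csv.DictReader(fh), sample_size))


def load_samples(folder: Path, sample_size: int) -> dict[str, list[dict]]:
    """Map each recognized table name to its sampled rows, skipping unknown files."""
    samples = {}
    for csv_path in sorted(folder.glob("*.csv")):
        table = csv_path.stem  # "africa_rural_urban.csv" -> "africa_rural_urban"
        if table not in KNOWN_TABLES:
            print(f"WARNING: skipping {csv_path.name}: unrecognized table name")
            continue
        samples[table] = sample_rows(csv_path, sample_size)
    return samples


# Demo on a temporary folder holding one recognized and one unrecognized file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "africa_rural_urban.csv").write_text("a,b\n1,2\n3,4\n5,6\n")
    (folder / "notes.csv").write_text("x\n1\n")
    samples = load_samples(folder, sample_size=2)

print({table: len(rows) for table, rows in samples.items()})
```

itertools.islice stops reading after the requested number of rows, so the rest of a large file is never parsed.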