ottogroup/koality: Library for data quality monitoring based on DuckDB.

Data Quality Monitoring powered by DuckDB

Koality is a Python library for data quality monitoring (DQM) using DuckDB. It provides configurable checks that validate data in tables and can persist results to database tables for monitoring and alerting.

We would like to thank Norbert Maager, the original inventor of Koality.

Warning

This library is a work in progress!

Breaking changes should be expected until a 1.0 release, so version pinning is recommended.

Documentation

For comprehensive documentation, visit the Koality Documentation.

Core Features

Configurable Checks: Define data quality checks via simple YAML configuration files
DuckDB-Powered: Fast, in-process analytics with DuckDB's in-memory engine
External Database Support: Currently supports Google Cloud BigQuery via DuckDB extensions
Multiple Check Types: Null ratios, regex matching, value sets, duplicates, counts, match rates, outlier detection, and more
Flexible Filtering: Dynamic filtering system with column/value pairs for targeted checks
Result Persistence: Store check results in database tables for historical tracking
CLI Tool: Easy-to-use command-line interface for running checks
Threshold Validation: Compare check results against configurable lower/upper bounds
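
To illustrate the threshold validation listed above, here is a minimal sketch of a bounds comparison. The function name `within_thresholds` is hypothetical and not part of Koality's API, and treating both bounds as inclusive (and a missing bound as "no limit") is an assumption of this sketch.

```python
from typing import Optional


def within_thresholds(value: float,
                      lower: Optional[float] = None,
                      upper: Optional[float] = None) -> bool:
    """Return True if value lies inside the [lower, upper] range.

    Assumptions of this sketch: bounds are inclusive, and a bound
    set to None means "no limit" on that side.
    """
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


# A null ratio of 3% passes a 0..5% window; 25% does not.
print(within_thresholds(0.03, lower=0, upper=0.05))
print(within_thresholds(0.25, lower=0, upper=0.05))
```

A check result that falls outside the window would then be reported as a failure.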

Supported Databases

| Database | Status |
| --- | --- |
| DuckDB (in-memory) | ✅ Fully supported |
| Google Cloud BigQuery | ✅ Fully supported |

Koality uses DuckDB as its query engine. External databases are accessed through DuckDB extensions (e.g., the BigQuery extension for Google Cloud). Other external databases may need custom handling in execute_query.

Note on missing tables/datasets

Koality maps provider-specific "not found" errors (e.g., BigQuery Binder "Not found: Dataset ..." or DuckDB "does not exist") to a unified table_exists metric. When a check's query fails because the target table or dataset is missing, Koality records a table_exists failure for the affected table instead of a generic error. This keeps missing-data diagnostics consistent across providers.
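
The mapping described above can be pictured as a small classifier over error messages. This is only a sketch: the helper `classify_query_error`, the pattern list, and the `generic_error` label are hypothetical, and the two patterns are taken from the example messages quoted in the paragraph rather than from Koality's source.

```python
import re

# Patterns derived from the two example messages quoted in the text.
# Real provider messages may differ, so treat this list as illustrative.
NOT_FOUND_PATTERNS = [
    re.compile(r"Not found: Dataset", re.IGNORECASE),
    re.compile(r"does not exist", re.IGNORECASE),
]


def classify_query_error(message: str) -> str:
    """Map an error message to a metric name (hypothetical helper)."""
    if any(pattern.search(message) for pattern in NOT_FOUND_PATTERNS):
        return "table_exists"
    return "generic_error"


print(classify_query_error("Catalog Error: Table with name orders does not exist!"))
print(classify_query_error("Conversion Error: could not convert string to INT32"))
```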

Available Checks

| Check Type | Description |
| --- | --- |
| `NullRatioCheck` | Share of NULL values in a column |
| `RegexMatchCheck` | Share of values matching a regex pattern |
| `ValuesInSetCheck` | Share of values matching a predefined set |
| `RollingValuesInSetCheck` | Values in set over a rolling time window |
| `DuplicateCheck` | Number of duplicate values in a column |
| `CountCheck` | Row count or distinct value count |
| `AverageCheck` | Average of a column |
| `MaxCheck` | Maximum of a column |
| `MinCheck` | Minimum of a column |
| `MatchRateCheck` | Match rate between two tables after joining |
| `RelCountChangeCheck` | Relative count change vs. historical average |
| `IqrOutlierCheck` | Detect outliers using interquartile range |
| `OccurrenceCheck` | Check value occurrence frequency |
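
To make a few of these metrics concrete, the sketch below computes a null ratio, a duplicate count, and a row count with plain SQL aggregates. It deliberately uses Python's built-in `sqlite3` as a stand-in engine so that it runs without extra dependencies; it does not use DuckDB or Koality, and the exact SQL and metric definitions Koality uses (for instance, how duplicates are counted) may differ. The `orders` table and its columns are made up for the example.

```python
import sqlite3

# sqlite3 is a stand-in here; the aggregates are standard SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a"), (2, None), (3, "b"), (3, "b")],
)

# Share of NULL values in customer_id.
null_ratio = con.execute(
    "SELECT AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) FROM orders"
).fetchone()[0]

# One possible definition of "duplicates": non-NULL rows minus distinct values.
duplicates = con.execute(
    "SELECT COUNT(order_id) - COUNT(DISTINCT order_id) FROM orders"
).fetchone()[0]

# Plain row count.
row_count = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(null_ratio, duplicates, row_count)
con.close()
```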

Installation

pip install koality

Or add to your pyproject.toml:

[project]
dependencies = [
    "koality>=0.1.0",
]

Quick Start

1. Create a configuration file

# koality_config.yaml
name: My Data Quality Checks

# Database connection setup - executed before running checks
database_setup: |
  INSTALL bigquery;
  LOAD bigquery;
  ATTACH 'project=${PROJECT_ID}' AS bq (TYPE bigquery, READ_ONLY);

# Prefix for table references (use attached database name)
database_accessor: bq

defaults:
  result_table: bq.dqm.results
  log_path: dqm_failures.txt
  filters:
    partition_date:
      column: date
      value: yesterday
      type: date

check_bundles:
  - name: null_ratio_checks
    defaults:
      check_type: NullRatioCheck
      table: bq.dataset.orders
      lower_threshold: 0
      upper_threshold: 0.05
    checks:
      - check_column: customer_id
      - check_column: order_date
      - check_column: total_amount

For in-memory DuckDB (local testing or CSV/Parquet files):

database_setup: |
  CREATE TABLE orders AS SELECT * FROM 'data/orders.parquet';
  CREATE TABLE results (check_name VARCHAR, result DOUBLE, timestamp TIMESTAMP);
database_accessor: ""

2. Run checks via CLI

# Pass database setup variables via CLI
koality run --config_path koality_config.yaml -dsv PROJECT_ID=my-gcp-project

# Or via environment variable
DATABASE_SETUP_VARIABLES="PROJECT_ID=my-gcp-project" koality run --config_path koality_config.yaml
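
Conceptually, a `KEY=VALUE` pair such as `PROJECT_ID=my-gcp-project` fills the `${PROJECT_ID}` placeholder in `database_setup`. The sketch below shows one way such a substitution could work using Python's standard `string.Template`; the helper `parse_pairs` is hypothetical, and whether Koality uses `string.Template` internally is an assumption.

```python
from string import Template


def parse_pairs(pairs):
    """Turn ['KEY=VALUE', ...] into a dict (hypothetical helper)."""
    variables = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        variables[key.strip()] = value.strip()
    return variables


database_setup = """INSTALL bigquery;
LOAD bigquery;
ATTACH 'project=${PROJECT_ID}' AS bq (TYPE bigquery, READ_ONLY);"""

variables = parse_pairs(["PROJECT_ID=my-gcp-project"])

# substitute() raises KeyError if a placeholder has no matching variable,
# which surfaces a forgotten variable early.
rendered = Template(database_setup).substitute(variables)
print(rendered)
```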

3. Review results

Results are persisted to your configured result table and failures are logged to the specified log path.

Configuration Hierarchy

Koality uses a hierarchical configuration system where more specific settings override general ones:

defaults: Base settings for all checks (result table, persistence, filters)
check_bundles.defaults: Bundle-level defaults (check type, table, thresholds)
checks: Individual check configurations (specific columns, custom thresholds)
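
The override order of the three levels can be sketched as a left-to-right dictionary merge in which later, more specific layers win. The function `merge_settings` is hypothetical, and merging nested mappings (such as `filters`) recursively is an assumption of this sketch rather than documented Koality behavior.

```python
def merge_settings(*layers: dict) -> dict:
    """Merge dicts left to right; later (more specific) layers win.

    Nested dicts are merged key by key (an assumption of this sketch),
    and the input dicts are never modified.
    """
    merged: dict = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_settings(merged[key], value)
            else:
                merged[key] = value
    return merged


global_defaults = {"result_table": "bq.dqm.results", "log_path": "dqm_failures.txt"}
bundle_defaults = {
    "check_type": "NullRatioCheck",
    "table": "bq.dataset.orders",
    "lower_threshold": 0,
    "upper_threshold": 0.05,
}
check = {"check_column": "customer_id", "upper_threshold": 0.01}

effective = merge_settings(global_defaults, bundle_defaults, check)
print(effective["upper_threshold"])  # the check-level value overrides the bundle default
```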

Filter System

Apply dynamic filters to check specific data subsets using the structured filters syntax:

defaults:
  filters:
    partition_date:
      column: created_at
      value: yesterday
      type: date           # Required for rolling checks; auto-parses date values
    shop_id:
      column: shop_id
      value: SHOP01
      type: identifier     # Marks this as the identifier filter for result grouping
    revenue:
      column: total_revenue
      value: 1000
      operator: ">="       # Supports =, !=, >, >=, <, <=, IN, NOT IN, LIKE, NOT LIKE
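
One way to picture how such filters narrow a check is to turn each complete filter into a SQL condition. The sketch below is illustrative only: `build_where` and `sql_literal` are hypothetical helpers, the default operator `=` is inferred from the example above (where `operator` is omitted), and real code should prefer parameterized queries over string building.

```python
ALLOWED_OPERATORS = {"=", "!=", ">", ">=", "<", "<=", "IN", "NOT IN", "LIKE", "NOT LIKE"}


def sql_literal(value) -> str:
    """Render a Python value as a SQL literal (numbers bare, strings quoted)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    escaped = str(value).replace("'", "''")
    return f"'{escaped}'"


def build_where(filters: dict) -> str:
    """Join all complete filters into one AND-ed condition string."""
    conditions = []
    for spec in filters.values():
        column, value = spec.get("column"), spec.get("value")
        if column is None or value is None:
            continue  # incomplete filter: contributes no condition
        operator = spec.get("operator", "=").upper()
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        if operator in ("IN", "NOT IN"):
            items = value if isinstance(value, (list, tuple)) else [value]
            rhs = "(" + ", ".join(sql_literal(v) for v in items) + ")"
        else:
            rhs = sql_literal(value)
        conditions.append(f"{column} {operator} {rhs}")
    return " AND ".join(conditions)


filters = {
    "shop_id": {"column": "shop_id", "value": "SHOP01", "type": "identifier"},
    "revenue": {"column": "total_revenue", "value": 1000, "operator": ">="},
}
print(build_where(filters))  # shop_id = 'SHOP01' AND total_revenue >= 1000
```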

Identifier filters and naming

Koality supports an identifier filter type which can be used to mark the field that identifies data partitions (e.g., shop, tenant). Use the global identifier_format setting in defaults to control how the identifier appears in result rows:

identifier (default): result column IDENTIFIER contains column=value (e.g., shop_code=EC0601).
filter_name: result column uses the filter name (e.g., SHOP_ID) and contains the value only.
column_name: result column uses the database column name (e.g., SHOP_CODE) and contains the value only.

If an identifier-type filter is defined without a concrete column or value (for example, in global defaults), it is treated as a naming-only hint and is not turned into a WHERE clause. This is useful when you only want to control the result identifier column name (e.g., SHOP_ID) across checks.

Behavior for missing identifier values

When an identifier-type filter is present but its value is missing or explicitly null, Koality substitutes a configurable placeholder for logging and naming (defaults.identifier_placeholder, default: ALL) to avoid None appearing in metric messages. You can override the placeholder at bundle or check level by setting identifier_placeholder in the corresponding defaults.

Additional docs: see docs/identifier_placeholder.md for usage examples and configuration details.
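
The three identifier_format options and the placeholder behavior can be summarized in a short sketch. The function `identifier_result` is hypothetical; upper-casing the column name is inferred from the examples above (SHOP_ID, SHOP_CODE), and combining the placeholder with the `column=value` form is an assumption.

```python
def identifier_result(filter_name, column, value,
                      identifier_format="identifier", placeholder="ALL"):
    """Return (result column name, cell content) for an identifier filter.

    Hypothetical helper: a missing value is replaced by the placeholder
    so that None never shows up in the output.
    """
    shown = placeholder if value is None else str(value)
    if identifier_format == "identifier":
        return "IDENTIFIER", f"{column}={shown}"
    if identifier_format == "filter_name":
        return filter_name.upper(), shown
    if identifier_format == "column_name":
        return column.upper(), shown
    raise ValueError(f"unknown identifier_format: {identifier_format!r}")


print(identifier_result("shop_id", "shop_code", "EC0601"))
print(identifier_result("shop_id", "shop_code", "EC0601", "filter_name"))
print(identifier_result("shop_id", "shop_code", None, "column_name"))
```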

Filter Properties

| Property | Description |
| --- | --- |
| `column` | Database column name to filter on (optional in defaults, required after merge) |
| `value` | Filter value (optional in defaults, required after merge) |
| `type` | `date`, `identifier`, or `other` (default). Only one of each type allowed |
| `operator` | SQL operator: `=`, `!=`, `>`, `>=`, `<`, `<=`, `IN`, `NOT IN`, `LIKE`, `NOT LIKE` |
| `parse_as_date` | If `true`, parse value as date (for non-date-type filters) |
Date Parsing

Koality automatically parses date values when type: date is set:

Relative dates: today, yesterday, tomorrow
ISO dates: 2024-01-15, 20240115
With inline offset: yesterday-2 (2 days before yesterday), today+1 (tomorrow)
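
The accepted date forms can be illustrated with a small parser. This is a sketch of the rules listed above, not Koality's implementation: `parse_date` is a hypothetical name, and the `today` parameter exists only to make the example reproducible.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

RELATIVE_DAYS = {"today": 0, "yesterday": -1, "tomorrow": 1}
RELATIVE_PATTERN = re.compile(r"^(today|yesterday|tomorrow)([+-]\d+)?$")


def parse_date(text: str, today: Optional[date] = None) -> date:
    """Parse a relative keyword (with optional day offset) or an ISO date."""
    today = today or date.today()
    text = text.strip().lower()
    match = RELATIVE_PATTERN.match(text)
    if match:
        # e.g. "yesterday-2" -> base offset -1 plus inline offset -2
        offset = RELATIVE_DAYS[match.group(1)] + int(match.group(2) or 0)
        return today + timedelta(days=offset)
    for fmt in ("%Y-%m-%d", "%Y%m%d"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date value: {text!r}")


reference = date(2024, 1, 15)
print(parse_date("yesterday", reference))    # 2024-01-14
print(parse_date("yesterday-2", reference))  # 2024-01-12
print(parse_date("today+1", reference))      # 2024-01-16
print(parse_date("20240115"))                # 2024-01-15
```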

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests on GitHub.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

