FOCUS (FinOps Open Source Cost and Usage Specification) Validator

Validator resource for checking datasets against the FOCUS specification.

Overview

The FOCUS Validator is a comprehensive Python application designed to validate cloud cost and usage datasets against the FinOps Foundation's FOCUS specification. It provides a robust validation framework that can process large datasets, execute complex validation rules, and generate detailed reports about compliance with FOCUS standards.

Codebase Architecture

The FOCUS Validator follows a modular architecture with clear separation of concerns. The codebase is organized into several key components:

Core Architecture Overview

focus_validator/
├── main.py              # CLI entry point and application orchestration
├── validator.py         # Main validation orchestrator
├── config/              # Configuration files (logging, etc.)
├── config_objects/      # Core validation logic and rule definitions
├── data_loaders/        # Data input handling (CSV, Parquet)
├── outputter/           # Result output formatting (Console, XML, etc.)
├── rules/               # Rule definitions and validation execution
└── utils/               # Utility functions (performance, currency codes)

1. Application Entry Point (main.py)

The main module serves as the CLI interface and application orchestrator:

  • Argument Parsing: Comprehensive command-line interface with support for various validation options
  • Logging Setup: Flexible logging configuration with YAML/INI support and multiple fallback strategies
  • Validation Orchestration: Coordinates the entire validation workflow from data loading to result output
  • Visualization Support: Optional generation of validation dependency graphs using Graphviz
  • Version Management: Handles both local and remote FOCUS specification versions

Key Features:

  • Support for applicability criteria filtering
  • Remote specification downloading from GitHub
  • Performance timing and logging
  • Cross-platform file opening for visualizations

2. Validation Orchestrator (validator.py)

The Validator class is the central coordinator that manages the validation process:

  • Data Loading: Delegates to appropriate data loaders based on file type
  • Rule Loading: Manages FOCUS specification rule loading and version handling
  • Validation Execution: Orchestrates rule execution against loaded datasets
  • Result Management: Coordinates output formatting and result persistence
  • Performance Monitoring: Integrated performance logging using decorators

Key Responsibilities:

  • Version compatibility checking (local vs remote specifications)
  • Applicability criteria processing and filtering
  • Data and rule loading coordination
  • Validation result aggregation and output

3. Config Objects (config_objects/)

This module contains the core validation logic and configuration management:

Rule Definition (rule.py)

  • ModelRule: Pydantic model for FOCUS specification rules
  • ValidationCriteria: Detailed validation requirements and conditions
  • CompositeCheck: Support for complex AND/OR logic in rule dependencies
  • Status Management: Rule lifecycle management (Active, Draft, etc.)

DuckDB Schema Conversion (focus_to_duckdb_converter.py)

The FocusToDuckDBSchemaConverter is one of the most sophisticated components of the validation engine, implementing a comprehensive SQL code generation framework with advanced pattern matching, error recovery, and performance optimization:

Internal Architecture & Core Components:

Check Generator Registry System:

CHECK_GENERATORS: Dict[str, Dict[str, Any]] = {
    "ColumnPresent": {"generator": ColumnPresentCheckGenerator, "factory": lambda args: "ColumnName"},
    "TypeString": {"generator": TypeStringCheckGenerator, "factory": lambda args: "ColumnName"},
    "FormatDateTime": {"generator": FormatDateTimeGenerator, "factory": lambda args: "ColumnName"},
    "CheckModelRule": {"generator": CheckModelRuleGenerator, "factory": lambda args: "ModelRuleId"},
    "AND": {"generator": CompositeANDRuleGenerator, "factory": lambda args: "Items"},
    "OR": {"generator": CompositeORRuleGenerator, "factory": lambda args: "Items"}
    # ... 20+ total generators
}

Generator Architecture Patterns:

  • Abstract Base Class: DuckDBCheckGenerator defines the contract with REQUIRED_KEYS, DEFAULTS, and FREEZE_PARAMS class variables (a minimal sketch of this contract follows the list)
  • Template Method Pattern: Subclasses implement generateSql() for SQL generation and getCheckType() for type identification
  • Factory Pattern: Registry system enables dynamic generator selection based on FOCUS rule function names
  • Parameter Validation: Automatic validation of required parameters using Pydantic-style checking with REQUIRED_KEYS
  • Immutable Parameters: FREEZE_PARAMS creates read-only parameter objects using MappingProxyType for thread safety
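
The real base class lives in focus_to_duckdb_converter.py and is not reproduced here; the following is a minimal sketch of the contract described above, assuming parameters arrive as keyword arguments and are exposed as a mapping (the constructor details are illustrative assumptions):

from abc import ABC, abstractmethod
from types import MappingProxyType


class DuckDBCheckGenerator(ABC):
    REQUIRED_KEYS: frozenset = frozenset()   # parameters every check must receive
    DEFAULTS: dict = {}                      # fallback values merged into params
    FREEZE_PARAMS: bool = True               # expose params as a read-only mapping

    def __init__(self, **params):
        missing = self.REQUIRED_KEYS - params.keys()
        if missing:
            raise ValueError(f"missing required parameters: {sorted(missing)}")
        merged = {**self.DEFAULTS, **params}
        # MappingProxyType provides the read-only, thread-safe parameter view
        self.params = MappingProxyType(merged) if self.FREEZE_PARAMS else merged

    @abstractmethod
    def generateSql(self) -> str: ...

    @abstractmethod
    def getCheckType(self) -> str: ...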

SQL Generation Framework:

Template System Architecture:

  • Dual Template Support: Handles both {table_name} and {{table_name}} patterns for backward compatibility
  • Parameter Substitution: Safe parameter injection using the _lit() method with SQL injection prevention (a minimal sketch follows this list)
  • Dynamic SQL Construction: Runtime SQL assembly based on rule requirements and data types
  • Query Optimization: Template caching and SQL query plan reuse for repeated validations
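
The converter's actual _lit() helper is not shown in this README; as a rough, hedged sketch of the behaviour described above, literal quoting plus dual placeholder normalisation could look like this:

def _lit(value) -> str:
    # Minimal SQL-literal rendering: escape single quotes in strings, NULL for None
    if value is None:
        return "NULL"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def render(template: str, **params) -> str:
    # Accept both {table_name} and {{table_name}} placeholder styles
    return template.replace("{{table_name}}", "{table_name}").format(**params)


# Example
sql = render("SELECT * FROM {{table_name}} WHERE RegionId = " + _lit("eu-west'1"),
             table_name="focus_data")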

build_check() - Advanced Check Construction Pipeline:

Multi-Phase Construction Process (a registry-dispatch sketch follows the workflow):

# Detailed build_check() workflow
1. Rule Type Analysis → Determine generator class from CHECK_GENERATORS registry
2. Applicability Assessment → Evaluate rule inclusion via _should_include_rule()
3. Parameter Extraction → Parse ValidationCriteria.requirement for generator parameters
4. Generator Instantiation → Create generator with validated parameters and context
5. SQL Generation → Execute generateSql() with template substitution
6. Check Object Assembly → Wrap in DuckDBColumnCheck with metadata and execution context
7. Nested Check Handling → Recursively process composite rule children
8. Performance Optimization → Cache compiled checks for reuse
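
build_check() itself is considerably richer than this; the toy sketch below only illustrates the registry-dispatch portion (steps 1, 4, and 5) with an invented, reduced generator class:

class ToyTypeStringGenerator:
    # Stand-in for TypeStringCheckGenerator, reduced to the essentials
    def __init__(self, ColumnName: str):
        self.column_name = ColumnName

    def generateSql(self) -> str:
        return (f"SELECT COUNT(*) AS violations FROM {{table_name}} "
                f"WHERE {self.column_name} IS NOT NULL AND typeof({self.column_name}) != 'VARCHAR'")


TOY_GENERATORS = {"TypeString": ToyTypeStringGenerator}


def build_check(check_name: str, **params) -> str:
    generator_cls = TOY_GENERATORS[check_name]    # step 1: registry lookup
    generator = generator_cls(**params)           # step 4: instantiate with parameters
    return generator.generateSql()                # step 5: SQL generation


print(build_check("TypeString", ColumnName="AvailabilityZone"))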

Technical Implementation Details:

Dynamic Rule Processing:

  • Rule Type Detection: Uses rule.is_dynamic() to identify rules requiring runtime data analysis
  • Skipped Check Generation: Creates SkippedDynamicCheck objects for rules that cannot be statically validated
  • Metadata Preservation: Maintains rule context and reasoning for debugging and reporting

Applicability Filtering Engine:

def _should_include_rule(self, rule, parent_edges=None) -> bool:
    # Hierarchical applicability checking:
    #   1. Check the rule's own ApplicabilityCriteria against validated_applicability_criteria
    #   2. Traverse parent_edges to validate the entire dependency chain
    #   3. Apply AND logic - the rule is included only if every criterion in the chain is satisfied
    #   4. Generate a SkippedNonApplicableCheck for filtered rules
    ...

Composite Rule Architecture:

  • Recursive Descent Parsing: Processes ValidationCriteria.requirement.Items recursively for nested AND/OR structures
  • Child Check Generation: Each composite rule item becomes a separate DuckDBColumnCheck object
  • Logic Handler Assignment: Assigns operator.and_ or operator.or_ functions for result aggregation
  • Dependency Context Propagation: Passes parent results and edge context down to child generators

run_check() - High-Performance Execution Engine:

Multi-Level Execution Strategy:

# Comprehensive execution pipeline
1. Connection Validation → Ensure DuckDB connection is active and prepared
2. Check Type Dispatch → Route to appropriate execution path (skipped/leaf/composite)
3. SQL Template Resolution → Substitute table names and parameters safely
4. Query Execution → Execute with comprehensive error handling and recovery
5. Result Processing → Parse DuckDB results and format for aggregation
6. Error Analysis → Extract column names from error messages for intelligent reporting
7. Nested Aggregation → Process composite rule children and apply logic operators
8. Performance Metrics → Capture execution timing and resource usage

Advanced Error Recovery System:

Missing Column Detection:

import re
from typing import List

def _extract_missing_columns(err_msg: str) -> List[str]:
    # Multi-pattern column extraction from DuckDB error messages
    patterns = [
        r'Column with name ([A-Za-z0-9_"]+) does not exist',
        r'Binder Error: .*? column ([A-Za-z0-9_"]+)',
        r'"([A-Za-z0-9_]+)" not found',
    ]
    matches = {match.strip('"') for pattern in patterns for match in re.findall(pattern, err_msg)}
    # Returns a sorted list of missing column names for precise error reporting
    return sorted(matches)

SQL Error Classification:

  • Syntax Errors: Template substitution failures and malformed SQL detection
  • Schema Mismatches: Column type conflicts and missing table detection
  • Data Validation Errors: Constraint violations and format validation failures
  • Performance Issues: Timeout detection and resource exhaustion handling

Composite Rule Execution Engine:

Upstream Dependency Short-Circuiting:

# Advanced dependency failure handling
if upstream_failure_detected:
    1. Mark all child checks as failed with upstream reason
    2. Preserve failure context and dependency chain information
    3. Skip actual SQL execution to avoid cascading errors
    4. Generate detailed failure report with root cause analysis

Result Aggregation Logic:

  • AND Logic: all(child_results) - requires all children to pass
  • OR Logic: any(child_results) - requires at least one child to pass
  • Failure Propagation: Maintains detailed failure context including which specific children failed
  • Performance Optimization: Early termination for AND (first failure) and OR (first success) operations, as illustrated in the sketch below
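
A minimal sketch of these aggregation semantics (the real run_check() path also carries per-child failure context):

def aggregate(child_results, logic: str) -> bool:
    # child_results: iterable of (rule_id, passed) pairs from executed child checks.
    # Generator expressions give the early termination described above:
    # all() stops at the first failure, any() at the first success.
    if logic == "AND":
        return all(passed for _, passed in child_results)
    if logic == "OR":
        return any(passed for _, passed in child_results)
    raise ValueError(f"unknown composite logic: {logic}")


# An AND composite fails as soon as one child fails
print(aggregate([("BilledCost-C-001-M", True), ("BilledCost-C-002-M", False)], "AND"))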

Advanced SQL Generation Patterns:

Type-Specific Generators:

ColumnPresentCheckGenerator:

WITH col_check AS (
    SELECT COUNT(*) AS found
    FROM information_schema.columns
    WHERE table_name = '{table_name}' AND column_name = '{column_name}'
)
SELECT 
    CASE WHEN found = 0 THEN 1 ELSE 0 END AS violations,
    CASE WHEN found = 0 THEN '{error_message}' END AS error_message
FROM col_check

TypeStringCheckGenerator:

SELECT 
    COUNT(*) AS violations,
    '{column_name} must be string type' AS error_message
FROM {table_name}
WHERE {column_name} IS NOT NULL 
    AND typeof({column_name}) != 'VARCHAR'

FormatDateTimeGenerator:

WITH datetime_violations AS (
    SELECT COUNT(*) AS violations
    FROM {table_name}
    WHERE {column_name} IS NOT NULL 
        AND TRY_STRPTIME({column_name}, '{expected_format}') IS NULL
)
SELECT violations, 
    CASE WHEN violations > 0 
        THEN '{column_name} format violations: ' || violations || ' rows'
    END AS error_message
FROM datetime_violations

Performance & Memory Optimization:

Connection Management:

  • Connection Pooling: Reuses DuckDB connections across multiple validations
  • Memory Monitoring: Tracks memory usage for large dataset processing
  • Query Plan Caching: DuckDB query plan reuse for repeated validation patterns
  • Parallel Execution: Thread-safe generator design for concurrent validation

Algorithmic Efficiency:

  • Lazy Evaluation: SQL queries only executed when results are needed
  • Batch Processing: Groups similar validations for bulk execution
  • Result Streaming: Processes large result sets without loading entire datasets into memory
  • Index Utilization: Generates SQL that leverages DuckDB's columnar indices

Extensibility Framework:

Custom Generator Development:

class CustomCheckGenerator(DuckDBCheckGenerator):
    REQUIRED_KEYS = {"ColumnName", "ThresholdValue"}
    DEFAULTS = {"TolerancePercent": 0.1}
    
    def generateSql(self) -> str:
        # Custom SQL generation logic
        return f"SELECT COUNT(*) FROM {{table_name}} WHERE {self.params.ColumnName} > {self.params.ThresholdValue}"
    
    def getCheckType(self) -> str:
        return "custom_threshold_check"

Registry Integration:

  • Dynamic Registration: New generators can be added to CHECK_GENERATORS at runtime
  • Parameter Validation: Automatic validation using REQUIRED_KEYS and type checking
  • Factory Function: Consistent parameter extraction across all generator types
  • Metadata Preservation: Full context and provenance tracking for custom validations

Rule Dependency Resolution (rule_dependency_resolver.py)

The RuleDependencyResolver is the most sophisticated component in the validation engine, implementing advanced graph algorithms to analyze complex rule interdependencies and optimize execution paths:

Internal Architecture & Data Structures:

The resolver maintains three core data structures for efficient dependency management:

  • dependency_graph: Dict[str, Set[str]] mapping rule_id → {dependent_rule_ids} representing forward edges
  • reverse_graph: Dict[str, List[str]] mapping rule_id → [rules_that_depend_on_this] for backward traversal
  • in_degree: Dict[str, int] tracking prerequisite counts for Kahn's algorithm implementation

buildDependencyGraph() - Advanced Dependency Analysis:

Algorithm Implementation:

# Pseudocode for dependency graph construction
1. Filter rules by target prefix (e.g., "BilledCost*")
2. Recursively collect transitive dependencies using BFS
3. Build bidirectional graph structures for O(1) lookups
4. Calculate in-degree counts for topological processing
5. Validate graph integrity and detect potential cycles

Technical Details:

  • Rule Filtering: Supports prefix-based filtering (e.g., target_rule_prefix="BilledCost") to process subsets of the full rule graph, essential for large FOCUS specifications with 200+ rules
  • Transitive Closure: Uses _collectAllDependencies() with deque-based BFS to recursively discover all child dependencies, ensuring composite rules include their nested components even when they don't match the prefix filter (see the BFS sketch after this list)
  • Graph Construction: Builds forward and reverse adjacency lists simultaneously for O(1) dependency lookups during execution
  • Memory Optimization: Uses defaultdict(set) and defaultdict(list) to minimize memory allocation overhead
  • Cycle Prevention: Maintains processed sets to prevent infinite recursion during dependency discovery
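
A condensed sketch of the deque-based transitive discovery described above (the real _collectAllDependencies() operates on rule objects; plain dicts stand in here):

from collections import deque


def collect_all_dependencies(roots, children_of):
    # children_of: dict mapping rule_id -> list of rule_ids it depends on.
    # BFS from the prefix-filtered roots so nested composite children are
    # included even when they do not match the prefix filter themselves.
    collected, queue = set(roots), deque(roots)
    while queue:
        current = queue.popleft()
        for child in children_of.get(current, []):
            if child not in collected:       # processed-set check prevents rework and cycles
                collected.add(child)
                queue.append(child)
    return collected


edges = {"BilledCost-C-000-M": ["BilledCost-C-001-M", "BilledCost-C-002-M"]}
print(collect_all_dependencies({"BilledCost-C-000-M"}, edges))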

Composite Rule Propagation:

  • Condition Inheritance: Implements _propagate_composite_conditions() to push parent composite rule conditions down to child rules via private attributes
  • CheckModelRule Processing: Analyzes ValidationCriteria.requirement.Items to identify referenced rules and propagate preconditions
  • Runtime Condition Evaluation: Child rules inherit parent conditions and evaluate them dynamically during execution

getTopologicalOrder() - Advanced Graph Algorithms:

Kahn's Algorithm Implementation (a runnable sketch follows the outline):

# Enhanced Kahn's algorithm with cycle detection
1. Initialize in-degree counts from dependency graph
2. Queue all zero-degree nodes (no prerequisites)
3. Process nodes level by level, updating dependent in-degrees
4. Detect cycles when remaining nodes > 0 after processing
5. Handle circular dependencies by appending remaining nodes
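
A self-contained sketch of this step (the resolver works on the dependency_graph and in_degree structures described earlier; plain dicts are used here):

from collections import deque


def topological_order(dependency_graph):
    # dependency_graph: prerequisite rule_id -> set of rule_ids that depend on it.
    in_degree = {node: 0 for node in dependency_graph}
    for dependents in dependency_graph.values():
        for dependent in dependents:
            in_degree[dependent] = in_degree.get(dependent, 0) + 1

    queue = deque(sorted(node for node, deg in in_degree.items() if deg == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for dependent in dependency_graph.get(node, ()):
            in_degree[dependent] -= 1
            if in_degree[dependent] == 0:
                queue.append(dependent)

    if len(order) < len(in_degree):                        # a cycle blocked some nodes
        order.extend(sorted(set(in_degree) - set(order)))  # append them rather than fail
    return order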

Cycle Detection & Analysis:

  • Tarjan's SCC Algorithm: Implements _tarjan_scc() for strongly connected component detection with O(V+E) complexity
  • Cycle Visualization: _export_scc_dot() generates Graphviz DOT files for each strongly connected component, enabling visual debugging of complex cycles
  • Simple Cycle Reconstruction: _find_simple_cycle() uses DFS with path tracking to identify specific circular dependency chains
  • Detailed Cycle Logging: Provides comprehensive cycle analysis including component sizes, adjacency matrices, and example cycle paths

Advanced Debugging & Instrumentation:

Graph Analytics:

  • _log_graph_snapshot(): Captures comprehensive graph metrics (node count, edge count, zero-degree nodes) with sampling
  • _trace_node(): Implements bounded DFS to trace dependency chains up to configurable depth for debugging blocked nodes
  • _dump_blockers(): Analyzes remaining nodes after topological sort to identify specific blocking dependencies

Performance Monitoring:

  • Edge Counting: Tracks total dependency relationships for complexity analysis
  • Zero-Degree Analysis: Identifies entry points (rules with no dependencies) for parallel execution opportunities
  • Blocking Analysis: Detailed reporting of rules that cannot be scheduled due to unsatisfied prerequisites

build_plan_and_schedule() - Execution Plan Optimization:

PlanBuilder Integration:

# Advanced execution planning workflow
1. Create PlanBuilder with filtered relevant rules
2. Build parent-preserving forest from entry points
3. Compile to layered ValidationPlan for parallel execution
4. Apply runtime context and edge predicates
5. Generate deterministic scheduling with tie-breaking

Technical Implementation:

  • Forest Construction: Uses builder.build_forest(roots) to create parent-preserving execution trees that maintain rule relationships
  • Compilation Pipeline: Transforms abstract plan graph into concrete ValidationPlan with index-based node references for O(1) lookups
  • Edge Context Propagation: Maintains EdgeCtx objects throughout planning to preserve dependency reasoning and conditional activation
  • Execution Context: Supports runtime exec_ctx parameter for dynamic rule filtering and conditional execution paths

Memory & Performance Optimizations:

Data Structure Efficiency:

  • Deque-Based Processing: Uses collections.deque for BFS traversals to minimize memory reallocation
  • Set Operations: Leverages Python set operations for O(1) membership testing and efficient union/intersection operations
  • Processed Tracking: Maintains processed sets to prevent redundant work during recursive dependency discovery

Algorithmic Complexity:

  • Graph Construction: O(V + E) where V = rules, E = dependencies
  • Topological Sort: O(V + E) with Kahn's algorithm
  • SCC Detection: O(V + E) with Tarjan's algorithm
  • Memory Usage: O(V + E) for graph storage with minimal overhead

Applicability Criteria Integration:

Dynamic Rule Filtering:

  • Criteria Validation: Supports validated_applicability_criteria parameter for runtime rule inclusion/exclusion
  • Hierarchical Processing: Always includes rules in dependency graph but marks them for potential skipping during execution
  • SkippedNonApplicableCheck Generation: Defers actual filtering to execution phase via converter's applicability checking

Plan Building (plan_builder.py)

  • Execution Planning: Creates layered execution plans optimized for parallel processing
  • Edge Context Management: Tracks why dependencies exist with conditional activation predicates
  • Topological Scheduling: Implements Kahn's algorithm with deterministic tie-breaking for consistent execution order (see the layering sketch after this list)
  • Parent Preservation: Maintains parent-child relationships throughout the planning process for result aggregation
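
The compiled ValidationPlan is richer than this, but the layering idea can be sketched as follows: each layer contains rules whose prerequisites all sit in earlier layers, and sorting within a layer provides the deterministic tie-breaking.

def build_layers(dependency_graph):
    # dependency_graph: prerequisite rule_id -> set of rule_ids that depend on it.
    in_degree = {node: 0 for node in dependency_graph}
    for dependents in dependency_graph.values():
        for dependent in dependents:
            in_degree[dependent] = in_degree.get(dependent, 0) + 1

    layers, ready = [], sorted(node for node, deg in in_degree.items() if deg == 0)
    while ready:
        layers.append(ready)                  # every rule in a layer can run in parallel
        next_ready = set()
        for node in ready:
            for dependent in dependency_graph.get(node, ()):
                in_degree[dependent] -= 1
                if in_degree[dependent] == 0:
                    next_ready.add(dependent)
        ready = sorted(next_ready)            # sorted order = deterministic tie-breaking
    return layers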

JSON Loading (json_loader.py)

  • Specification Parsing: Loads and parses FOCUS JSON rule definitions
  • Version Management: Handles multiple FOCUS specification versions
  • Remote Downloading: Supports fetching specifications from GitHub releases

4. Data Loaders (data_loaders/)

Extensible data loading framework supporting multiple file formats:

Base Data Loader (data_loader.py)

  • Format Detection: Automatic detection of file formats based on extension
  • Performance Monitoring: Integrated loading performance tracking
  • Error Handling: Comprehensive error handling and logging
  • Memory Management: Efficient handling of large datasets

Format-Specific Loaders

  • CSV Loader (csv_data_loader.py): Optimized CSV parsing with configurable options
  • Parquet Loader (parquet_data_loader.py): High-performance Parquet file processing

Key Features:

  • Automatic file type detection
  • Performance monitoring and logging
  • Memory-efficient processing for large datasets
  • Extensible architecture for additional formats

CSV Data Loading with Schema Inference

The CSV data loader uses Polars with infer_schema_length=10000 to analyze the first 10,000 rows for data type detection. This provides a good balance between accuracy and performance for most datasets.

Understanding Schema Inference:

The infer_schema_length parameter determines how many rows Polars examines to infer column data types:

  • Default: 10,000 rows - Suitable for most FOCUS datasets
  • Higher values - More accurate type detection but slower loading
  • Lower values - Faster loading but may miss type patterns

Handling Datasets with Many Null Values:

If your dataset has more than 10,000 rows of null values at the beginning, you may encounter data type inference issues. Here are several solutions:

Option 1: Specify Column Types Explicitly

When loading CSV data programmatically, provide explicit column types:

from focus_validator.data_loaders.csv_data_loader import CSVDataLoader
import polars as pl

# Define column types for problematic columns
column_types = {
    "BilledCost": pl.Float64(),
    "EffectiveCost": pl.Float64(),
    "ListCost": pl.Float64(),
    "BillingPeriodStart": pl.Datetime("us", "UTC"),
    "BillingPeriodEnd": pl.Datetime("us", "UTC"),
    "AccountId": pl.Utf8(),
    "ResourceId": pl.Utf8(),
    # Add other columns as needed
}

loader = CSVDataLoader("your_data.csv", column_types=column_types)
df = loader.load()

Option 2: Clean Your Data

Pre-process your CSV file to move non-null data towards the beginning:

# Sort CSV by a column that has early non-null values
head -1 your_data.csv > header.csv
tail -n +2 your_data.csv | sort -t',' -k1,1 > sorted_data.csv
cat header.csv sorted_data.csv > cleaned_data.csv

Option 3: Increase Schema Inference Length

For very large datasets with sparse data, you can modify the inference length by extending the CSV loader. Create a custom loader:

import polars as pl

# Load with increased inference length
df = pl.read_csv(
    "your_data.csv",
    infer_schema_length=50000,  # Analyze first 50,000 rows
    try_parse_dates=True,
    null_values=["", "NULL", "INVALID", "INVALID_COST", "BAD_DATE"]
)

Option 4: Two-Pass Loading

For extremely challenging datasets, use a two-pass approach:

import polars as pl

# First pass: scan the entire file, then sample rows to understand data patterns
schema_sample = pl.scan_csv("your_data.csv").collect().sample(n=10000)

# Analyze the sample to determine appropriate types
inferred_types = {}
for col in schema_sample.columns:
    non_null_values = schema_sample[col].drop_nulls()
    if len(non_null_values) > 0 and non_null_values.dtype == pl.Utf8:
        # Try to infer whether a string column should be numeric or datetime
        try:
            non_null_values.cast(pl.Float64)
            inferred_types[col] = pl.Float64()
        except Exception:
            try:
                non_null_values.str.strptime(pl.Datetime)
                inferred_types[col] = pl.Datetime("us", "UTC")
            except Exception:
                inferred_types[col] = pl.Utf8()

# Second pass: load with the determined schema
df = pl.read_csv("your_data.csv", schema_overrides=inferred_types)

Common Signs of Schema Inference Issues:

  • Numeric columns loaded as strings (Utf8) when they should be Float64
  • Date columns not parsed as datetime types
  • Validation failures for type-checking rules
  • Unexpected null value counts in validation output

Best Practices for Large Datasets:

  1. Validate Your Data First: Use tools like head, tail, and wc -l to understand your data structure
  2. Check for Empty Rows: Ensure the beginning of your file contains representative data
  3. Use Consistent Null Representations: Stick to standard null values like empty strings or "NULL"
  4. Test with Sample Data: Validate schema inference with a smaller sample before processing full datasets
  5. Monitor Loading Logs: The CSV loader provides detailed logging about type conversion attempts and failures

Performance Considerations:

  • Higher infer_schema_length values increase loading time but improve accuracy
  • For very large files (>1GB), consider processing in chunks or using Parquet format instead (a conversion sketch follows this list)
  • The CSV loader includes automatic fallback mechanisms for problematic columns
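
For the Parquet route, a one-off conversion with Polars' streaming engine keeps memory bounded (file names are placeholders):

import polars as pl

# Stream the CSV through the lazy engine and write Parquet without
# materialising the whole dataset in memory.
pl.scan_csv("your_data.csv", infer_schema_length=10000).sink_parquet("your_data.parquet")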

5. Validation Rules Engine (rules/)

The core validation execution engine:

Specification Rules (spec_rules.py)

  • Rule Loading: Manages loading of FOCUS specification rules from JSON
  • Version Management: Handles multiple FOCUS versions and compatibility
  • Validation Execution: Orchestrates rule execution against datasets using DuckDB
  • Result Aggregation: Collects and organizes validation results
  • Remote Specification Support: Downloads and caches remote specifications

Validation Process:

  1. Load FOCUS specification from JSON files or remote sources
  2. Parse rules and build dependency graph
  3. Convert rules to executable DuckDB SQL queries
  4. Execute validation queries against the dataset (a standalone sketch follows this list)
  5. Aggregate results and generate comprehensive reports
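
The validator drives steps 3-5 through its converter and plan classes; as a standalone illustration, the same style of check can be executed directly with the duckdb Python package (the file name and the check itself are placeholders):

import duckdb

con = duckdb.connect()  # in-memory DuckDB database

# Expose the dataset to SQL under the table name the generated rules expect
con.execute("CREATE TABLE focus_data AS SELECT * FROM read_csv_auto('your_data.csv')")

# A rule-shaped validation query (step 4), followed by result collection (step 5)
violations, error_message = con.execute("""
    SELECT COUNT(*) AS violations,
           CASE WHEN COUNT(*) > 0 THEN 'BilledCost MUST NOT be null.' END AS error_message
    FROM focus_data
    WHERE BilledCost IS NULL
""").fetchone()

print(violations, error_message)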

6. Output Formatters (outputter/)

Flexible output system supporting multiple formats:

Base Outputter (outputter.py)

  • Format Selection: Factory pattern for output format selection (sketched below)
  • Result Processing: Standardized result processing and formatting
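
A minimal sketch of that factory idea (the real outputter classes and their constructors differ; the names below are illustrative):

class ConsoleOutputter:
    def write(self, results):
        for rule_id, passed in results:
            print(f"{rule_id}: {'PASS' if passed else 'FAILED'}")


class JUnitOutputter:
    def write(self, results):
        ...  # render CI/CD-friendly XML here


OUTPUTTERS = {"console": ConsoleOutputter, "unittest": JUnitOutputter}


def get_outputter(output_type: str):
    try:
        return OUTPUTTERS[output_type]()   # factory: CLI output type -> outputter instance
    except KeyError:
        raise ValueError(f"unsupported output type: {output_type}") from None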

Format-Specific Outputters

  • Console Outputter (outputter_console.py): Human-readable terminal output
  • XML/JUnit Outputter (outputter_unittest.py): CI/CD compatible XML reports
  • Validation Graph Outputter (outputter_validation_graph.py): Graphviz visualizations

7. Utility Functions (utils/)

Supporting utilities for specialized functionality:

  • Performance Logging (performance_logging.py): Decorator-based performance monitoring (a generic decorator sketch follows this list)
  • Currency Code Downloads (download_currency_codes.py): Dynamic currency validation support
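
The decorator in performance_logging.py is not reproduced here; a generic timing decorator of the same flavour looks like this:

import functools
import logging
import time

logger = logging.getLogger(__name__)


def log_performance(func):
    # Wrap a function and log how long each call takes.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            logger.info("%s took %.3fs", func.__qualname__, time.perf_counter() - start)
    return wrapper


@log_performance
def load_dataset(path: str):
    ...  # data loading work goes here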

Data Flow Architecture

The validation process follows this high-level data flow:

  1. Input Processing: CLI arguments parsed and configuration loaded
  2. Data Loading: Dataset loaded using appropriate data loader (CSV/Parquet)
  3. Rule Loading: FOCUS specification rules loaded and parsed from JSON
  4. Plan Building: Validation execution plan built considering rule dependencies
  5. SQL Generation: Rules converted to optimized DuckDB SQL queries
  6. Validation Execution: SQL queries executed against dataset using DuckDB engine
  7. Result Aggregation: Validation results collected and organized
  8. Output Generation: Results formatted and output using selected formatter
  9. Optional Visualization: Dependency graph and results visualized using Graphviz

Key Design Principles

  1. Modularity: Clear separation of concerns with pluggable components
  2. Performance: Optimized for large datasets using DuckDB and efficient algorithms
  3. Extensibility: Easy to add new data formats, output formats, and validation rules
  4. Reliability: Comprehensive error handling and logging throughout
  5. Standards Compliance: Full adherence to FOCUS specification requirements
  6. Developer Experience: Rich logging, performance monitoring, and debugging support

Technology Stack

  • Core Language: Python 3.9+ with type hints and modern language features
  • Data Processing: DuckDB for high-performance SQL-based validation
  • Data Formats: Pandas/Polars for CSV/Parquet processing
  • Configuration: Pydantic for type-safe configuration management
  • CLI: argparse for comprehensive command-line interface
  • Visualization: Graphviz for validation dependency graphs
  • Testing: pytest with comprehensive test coverage
  • Code Quality: Black, isort, flake8, mypy for code formatting and quality

Testing Architecture

The FOCUS Validator maintains high code quality through comprehensive testing:

Test Organization

  • Unit Tests: Component-level testing for individual modules
  • Integration Tests: End-to-end validation workflow testing
  • Performance Tests: Large dataset processing validation
  • Configuration Tests: Rule loading and version compatibility testing

Test Coverage Strategy

  • Data Loaders: 100% coverage ensuring reliable data ingestion
  • Config Objects: 97%+ coverage for core validation logic
  • Outputters: 85%+ coverage across all output formats
  • Overall Project: 70% coverage with targeted improvement areas

Quality Assurance Tools

  • pytest: Primary testing framework with fixture support
  • pytest-cov: Coverage reporting and analysis
  • Black: Automated code formatting
  • isort: Import organization and sorting
  • flake8: Code style and complexity checking
  • mypy: Static type checking and validation
  • pre-commit: Git hook integration for quality checks

Development Workflow

Local Development Setup

  1. Environment Management: Poetry-based dependency management
  2. Code Quality: Automated formatting and linting on commit
  3. Performance Monitoring: Built-in performance logging and profiling
  4. Testing: Makefile targets for module-specific and comprehensive testing

Continuous Integration

The project uses GitHub Actions for automated quality assurance:

  • Linting Pipeline: mypy, black, isort, flake8 validation
  • Test Execution: Full test suite with coverage reporting
  • Performance Validation: Memory and execution time monitoring
  • Multi-Platform Testing: Cross-platform compatibility verification

Code Organization Principles

  • Type Safety: Comprehensive type hints throughout codebase
  • Error Handling: Structured exception hierarchy with detailed logging
  • Performance Focus: Optimized for large dataset processing
  • Extensibility: Plugin architecture for new formats and outputs
  • Documentation: Inline documentation and comprehensive README

Environment Setup

Prerequisites

  • Python 3.9+
  • Poetry (Package & Dependency Manager)

Installation

1. Install Poetry

If you haven't installed Poetry yet, you can do it by running:

curl -sSL https://install.python-poetry.org | python3 -

For alternative installation methods or more information about Poetry, please refer to the official documentation.

2. Clone the repository

git clone https://github.com/finopsfoundation/focus-spec-validator.git
cd focus-spec-validator

3. Install dependencies

Using Poetry, you can install the project's dependencies with:

poetry install

Usage

Activate the virtual environment (See: poetry-plugin-shell):

poetry shell

Preferred Usage: Web Output

The recommended way to run FOCUS validations is using the web output type, which generates an interactive HTML report with advanced filtering, rule grouping, and detailed validation insights:

poetry run focus-validator --data-file tests/samples/focus_sample_10000.csv --validate-version 1.2 --applicability-criteria ALL --output-type web --output-destination '10000_sample_validate.html'

This command:

  • Validates a dataset against FOCUS version 1.2
  • Includes all applicable validation rules (--applicability-criteria ALL)
  • Generates an interactive web report saved as 10000_sample_validate.html
  • Provides filtering by rule status (PASS/FAILED/SKIPPED), entity types, and rule functions
  • Shows both entity-centric and rule-centric views of validation results

Additional Usage Options

For help and more options:

focus-validator --help

For console output or other formats:

# Console output (simple text format)
focus-validator --data-file your_data.csv --validate-version 1.2

# XML output for CI/CD integration
focus-validator --data-file your_data.csv --validate-version 1.2 --output-type unittest --output-destination results.xml

Explain Mode

The FOCUS Validator includes a powerful Explain Mode that allows you to inspect validation rules and their underlying SQL logic without executing actual validation or requiring input data. This feature is invaluable for understanding FOCUS specification requirements, debugging validation logic, and learning how rules are implemented.

What is Explain Mode?

Explain Mode generates comprehensive explanations for all validation rules in a FOCUS specification version, including:

  • Rule Metadata: Rule ID, type, check method, and generator information
  • MustSatisfy Requirements: The human-readable requirement that each rule validates
  • SQL Queries: Complete, formatted SQL queries that implement the validation logic
  • Rule Relationships: Hierarchy and dependencies between composite and child rules
  • Condition Logic: Row-level conditions and filters applied during validation

How to Use Explain Mode

Basic Usage

# Explain all rules for FOCUS 1.2 with CostAndUsage dataset
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode

With Applicability Criteria

# Show all rules including those with specific applicability requirements
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --applicability-criteria ALL

# Show rules for specific applicability criteria
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --applicability-criteria "AVAILABILITY_ZONE_SUPPORTED,MULTIPLE_SUB_ACCOUNT_TYPES_SUPPORTED"

View Available Applicability Criteria

# List all applicability criteria for a FOCUS version
focus-validator --validate-version 1.2 --show-applicability-criteria

Explain Mode Output Format

The output is organized alphabetically by rule ID and includes comprehensive information for each rule:

Example Output

=== SQL Explanations for 578 rules ===

📋 AvailabilityZone-C-001-M
   Type: leaf
   Check: type_string
   Generator: TypeStringCheckGenerator
   MustSatisfy: AvailabilityZone MUST be of type String.
   SQL:
     WITH invalid AS (
       SELECT 1
       FROM focus_data
       WHERE AvailabilityZone IS NOT NULL AND typeof(AvailabilityZone) != 'VARCHAR'
     )
     SELECT COUNT(*) AS violations, 
            CASE WHEN COUNT(*) > 0 
                 THEN 'AvailabilityZone MUST be of type VARCHAR (string).' 
            END AS error_message
     FROM invalid

📋 BilledCost-C-000-M
   Type: composite
   Check: composite
   Generator: CompositeANDRuleGenerator
   MustSatisfy: The BilledCost column adheres to the following requirements:
   Children: 7 rules
     - BilledCost-C-000-M -> BilledCost-C-001-M (reference, model_rule_reference)
     - BilledCost-C-000-M -> BilledCost-C-002-M (reference, model_rule_reference)
     - BilledCost-C-000-M -> BilledCost-C-003-M (reference, model_rule_reference)
     - BilledCost-C-000-M -> BilledCost-C-004-M (reference, model_rule_reference)
     - BilledCost-C-000-M -> BilledCost-C-005-C (reference, model_rule_reference)
     - BilledCost-C-000-M -> BilledCost-C-006-M (reference, model_rule_reference)
     - BilledCost-C-000-M -> BilledCost-C-007-M (reference, model_rule_reference)

📋 RegionId-C-000-C
   Type: skipped
   Check: None
   Generator: None due to non-applicable rule
   MustSatisfy: RegionId is RECOMMENDED to be present in a FOCUS dataset when the provider supports deploying resources or services within regions.

Output Components Explained

📋 Rule Header: Rule ID (rules are listed in alphabetical order)

Type:

  • leaf - Individual validation rule with SQL query
  • composite - Rule that combines multiple child rules (AND/OR logic)
  • reference - Rule that references another rule's outcome
  • skipped - Rule that cannot be executed (dynamic or non-applicable)

Check: The validation method used:

  • type_string - Validates column data type is string
  • column_presence - Validates column exists in dataset
  • format_datetime - Validates datetime format compliance
  • composite - Combines multiple child rule results
  • model_rule_reference - References another rule's result

Generator: The code generator that creates the validation logic:

  • TypeStringCheckGenerator - Generates SQL for type validation
  • ColumnPresentCheckGenerator - Generates SQL for column existence
  • CompositeANDRuleGenerator - Combines child rules with AND logic
  • CompositeORRuleGenerator - Combines child rules with OR logic
  • None due to dynamic rule - Rule requires runtime data analysis
  • None due to non-applicable rule - Rule doesn't apply to current criteria

MustSatisfy: Human-readable description of what the rule validates, directly from the FOCUS specification

SQL: Complete, formatted SQL query that implements the validation (for leaf rules only)

Children: For composite rules, shows all child rules with their types and references

Condition: Row-level conditions applied during validation (when present)

Rule Types and Categories

Leaf Rules

Individual validation rules that execute SQL queries against the dataset. These represent the core validation logic.

Examples:

  • Column presence validation
  • Data type checking
  • Format validation (dates, currencies, etc.)
  • Value constraint checking
  • Pattern matching validation

Composite Rules

Rules that combine multiple child rules using logical operators:

  • AND Rules (CompositeANDRuleGenerator): All child rules must pass
  • OR Rules (CompositeORRuleGenerator): At least one child rule must pass

Reference Rules

Rules that mirror the outcome of another rule without executing additional SQL.

Skipped Rules

Rules that cannot be executed for various reasons:

  • Dynamic Rules: Require runtime data analysis to determine validation logic
  • Non-Applicable Rules: Don't apply based on current applicability criteria
  • Missing Dependencies: Reference unavailable components or data

Understanding Applicability Criteria

Some FOCUS rules only apply under specific conditions (e.g., when a provider supports availability zones). The --applicability-criteria option controls which rules are included:

  • Default (none specified): Shows only universally applicable rules
  • ALL: Shows all rules including those with specific requirements
  • Specific criteria: Shows rules for particular provider capabilities

Use --show-applicability-criteria to see available criteria for a FOCUS version.

SQL Query Analysis

The SQL queries in explain mode show exactly how each validation is implemented:

Common SQL Patterns

Column Presence Check:

WITH col_check AS (
    SELECT COUNT(*) AS found
    FROM information_schema.columns
    WHERE table_name = 'focus_data' AND column_name = 'ColumnName'
)
SELECT CASE WHEN found = 0 THEN 1 ELSE 0 END AS violations,
       CASE WHEN found = 0 THEN 'Column "ColumnName" MUST be present in the table.' END AS error_message
FROM col_check

Type Validation:

SELECT COUNT(*) AS violations,
       'ColumnName MUST be of type VARCHAR (string).' AS error_message
FROM focus_data
WHERE ColumnName IS NOT NULL AND typeof(ColumnName) != 'VARCHAR'

Format Validation:

WITH invalid AS (
    SELECT ColumnName::TEXT AS value
    FROM focus_data
    WHERE ColumnName IS NOT NULL 
      AND NOT (ColumnName::TEXT ~ '^[pattern]$')
)
SELECT COUNT(*) AS violations,
       CASE WHEN COUNT(*) > 0 THEN 'Format validation message' END AS error_message
FROM invalid

Use Cases for Explain Mode

1. Learning FOCUS Specification

  • Understand what each rule validates
  • See the relationship between rules
  • Learn validation requirements for each column

2. Debugging Validation Issues

  • Inspect SQL logic for failing rules
  • Understand why certain rules are skipped
  • Analyze composite rule dependencies

3. Implementation Reference

  • Use SQL patterns for custom validation tools
  • Understand FOCUS rule implementation details
  • Reference validation logic for documentation

4. Rule Development

  • Validate new rule implementations
  • Test rule logic without full datasets
  • Debug rule generation and SQL creation

SQL Transpilation for Database Migration

The FOCUS Validator includes a powerful SQL Transpilation feature that allows you to see how DuckDB validation queries would appear in different SQL dialects. This is invaluable for database migration planning, cross-platform compatibility analysis, and understanding how validation logic translates across different database systems.

Using the --transpile Option

The --transpile option can only be used with --explain-mode and allows you to specify a target SQL dialect:

# Transpile validation queries to PostgreSQL
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile postgres

# Transpile to MySQL
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile mysql

# Transpile to BigQuery
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile bigquery

# Transpile to Snowflake
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile snowflake

Supported SQL Dialects

The transpiler supports 18+ SQL dialects including:

  • PostgreSQL: postgres - Enterprise-grade RDBMS
  • MySQL: mysql - Popular open-source database
  • BigQuery: bigquery - Google Cloud data warehouse
  • Snowflake: snowflake - Cloud data platform
  • Microsoft SQL Server: tsql - Enterprise database system
  • Oracle: oracle - Enterprise database platform
  • SQLite: sqlite - Lightweight embedded database
  • Amazon Redshift: redshift - AWS data warehouse
  • Databricks: databricks - Unified analytics platform
  • Apache Spark: spark - Distributed computing engine
  • ClickHouse: clickhouse - Columnar database
  • Teradata: teradata - Enterprise data warehouse
  • Hive: hive - Hadoop data warehouse
  • Presto/Trino: presto - Distributed SQL engine
  • Apache Drill: drill - Schema-free SQL engine
  • StarRocks: starrocks - MPP database
  • DuckDB: duckdb - Analytical database (original format)
  • ANSI SQL: ansi - Standard SQL compliance

Transpilation Output

When using --transpile, the SQL queries in explain mode are automatically converted to the target dialect:

=== SQL Explanations for 578 rules (transpiled to postgres) ===

📋 AvailabilityZone-C-001-M
   Type: leaf
   Check: type_string
   Generator: TypeStringCheckGenerator
   MustSatisfy: AvailabilityZone MUST be of type String.
   SQL (PostgreSQL):
     WITH invalid AS (
       SELECT 1
       FROM focus_data
       WHERE "AvailabilityZone" IS NOT NULL AND pg_typeof("AvailabilityZone") != 'text'
     )
     SELECT COUNT(*) AS violations, 
            CASE WHEN COUNT(*) > 0 
                 THEN 'AvailabilityZone MUST be of type VARCHAR (string).' 
            END AS error_message
     FROM invalid

Key Transpilation Features

Dialect-Specific Syntax: Automatically converts DuckDB syntax to target dialect conventions:

  • Function Names: typeof() → pg_typeof() (PostgreSQL), type() (BigQuery)
  • Type Names: VARCHAR → TEXT (PostgreSQL), STRING (BigQuery)
  • Identifiers: Automatic quoting and escaping based on dialect requirements
  • Date/Time Functions: Converts temporal functions to dialect-specific equivalents
  • String Operations: Adapts regex and string manipulation functions

Error Handling: Robust fallback mechanism for complex queries:

  • Graceful Degradation: Shows original DuckDB SQL if transpilation fails (see the sketch below)
  • Error Logging: Detailed error messages for debugging transpilation issues
  • Partial Success: Successfully transpiled queries are shown even if others fail

Performance Optimization: Efficient transpilation without data processing:

  • Parse-Only Mode: Analyzes SQL structure without executing queries
  • Cached Results: Reuses transpilation results for similar query patterns
  • Fast Execution: No impact on explain mode performance
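
The README does not name the underlying transpilation library; assuming a sqlglot-based approach, the dialect conversion and graceful fallback described above can be sketched as:

import sqlglot


def transpile_or_fallback(duckdb_sql: str, dialect: str) -> str:
    # Convert a DuckDB query to the target dialect; fall back to the
    # original SQL if the statement cannot be transpiled.
    try:
        return sqlglot.transpile(duckdb_sql, read="duckdb", write=dialect, pretty=True)[0]
    except Exception as exc:
        print(f"transpilation to {dialect} failed ({exc}); showing original DuckDB SQL")
        return duckdb_sql


query = "SELECT COUNT(*) FROM focus_data WHERE typeof(BilledCost) != 'DOUBLE'"
print(transpile_or_fallback(query, "postgres"))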

Use Cases for SQL Transpilation

Database Migration Planning

Understand how FOCUS validation logic would need to be adapted when migrating from DuckDB to other database systems:

# See how validations would work in PostgreSQL environment
focus-validator --explain-mode --transpile postgres --filter-rules "BilledCost*"

# Compare BigQuery compatibility for cloud migration
focus-validator --explain-mode --transpile bigquery --filter-rules "EffectiveCost*"

Cross-Platform Compatibility Analysis

Validate that your FOCUS implementation can work across multiple database platforms:

# Check MySQL compatibility for web application backends
focus-validator --explain-mode --transpile mysql

# Verify Snowflake compatibility for data warehouse implementations
focus-validator --explain-mode --transpile snowflake

Multi-Database Validation Implementation

Use transpiled SQL as a reference for implementing FOCUS validation in different database environments:

# Generate SQL for Oracle-based validation systems
focus-validator --explain-mode --transpile oracle > oracle_validation_queries.sql

# Create Spark-compatible validation logic
focus-validator --explain-mode --transpile spark > spark_validation_logic.sql

Educational and Learning Purposes

Learn how SQL patterns differ across database systems:

# Compare how type checking differs between databases
focus-validator --explain-mode --transpile postgres --filter-rules "*-C-001-M" | grep "typeof"
focus-validator --explain-mode --transpile mysql --filter-rules "*-C-001-M" | grep "typeof"

Error Handling and Limitations

Unsupported Dialect Handling: If an invalid dialect is specified, the tool provides helpful feedback:

focus-validator --explain-mode --transpile invalid_dialect
# Error: Unsupported dialect 'invalid_dialect'. Available dialects: postgres, mysql, bigquery, snowflake, ...

Complex Query Limitations: Some advanced DuckDB features may not have direct equivalents in all dialects:

  • Complex window functions
  • Advanced temporal operations
  • Specialized analytical functions
  • Custom DuckDB extensions

When transpilation isn't possible, the original DuckDB SQL is shown with a note about the limitation.

Fallback Behavior: The transpiler prioritizes showing useful information:

  1. Successful Transpilation: Shows converted SQL in target dialect
  2. Partial Failure: Shows original SQL with transpilation error note
  3. Complete Failure: Shows original DuckDB SQL without modification

Combining with Other Explain Mode Features

The --transpile option works seamlessly with all other explain mode features:

# Transpile with applicability criteria
focus-validator --explain-mode --transpile postgres --applicability-criteria ALL

# Transpile specific rule groups
focus-validator --explain-mode --transpile bigquery --filter-rules "BilledCost*,EffectiveCost*"

# Transpile and save to file for analysis
focus-validator --explain-mode --transpile snowflake > snowflake_validation_queries.sql

Performance and Output Management

Explain mode is designed for fast execution since it doesn't process actual data:

  • Fast Execution: No data loading or SQL execution against datasets
  • Complete Coverage: Analyzes all rules in the specification (500+ rules for FOCUS 1.2)
  • Alphabetical Ordering: Predictable rule ordering for easy navigation
  • Detailed Output: Comprehensive information for each rule
  • Transpilation Speed: SQL conversion adds minimal overhead to explain mode execution

The output can be extensive (500+ rules), so consider using shell tools for navigation:

# Search for specific rules with transpilation
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile postgres | grep "BilledCost"

# Page through transpiled output
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile mysql | less

# Save transpiled queries to file for analysis
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile bigquery > bigquery_validation_queries.sql

Running Tests

Basic Testing

Run all tests:

poetry run pytest

Run tests with verbose output:

poetry run pytest -v

Test Coverage

Generate coverage report:

# Terminal coverage report
make coverage

# HTML coverage report (opens in browser)
make coverage-html

# Comprehensive coverage report (HTML + XML + terminal)
make coverage-report

Module-Specific Testing

Run tests for specific modules:

# Data loaders only
make test-data-loaders

# Config objects only  
make test-config-objects

# Output formatters only
make test-outputter

Coverage by Module

# Coverage for data loaders (100% coverage)
make coverage-data-loaders

# Coverage for config objects (97%+ coverage)
make coverage-config-objects

# Coverage for outputters (85%+ coverage)
make coverage-outputter

Current Test Coverage: 70% overall (257 tests passing)

  • Data Loaders: 100% coverage (50 tests)
  • Config Objects: 97%+ coverage (120 tests)
  • Outputters: 85%+ coverage (80 tests)
  • Core Components: 89%+ coverage (7 tests)

See COVERAGE_REPORT.md for detailed coverage analysis and improvement recommendations.

If you are running on a legacy CPU and the tests crash inside the polars library, run the following locally only:

poetry add polars-lts-cpu

This will align the polars execution with your system hardware. It should NOT be committed back into the repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.
