Validator resource for checking datasets against the FOCUS specification.
The FOCUS Validator is a comprehensive Python application designed to validate cloud cost and usage datasets against the FinOps Foundation's FOCUS specification. It provides a robust validation framework that can process large datasets, execute complex validation rules, and generate detailed reports about compliance with FOCUS standards.
The FOCUS Validator follows a modular architecture with clear separation of concerns. The codebase is organized into several key components:
focus_validator/
├── main.py            # CLI entry point and application orchestration
├── validator.py       # Main validation orchestrator
├── config/            # Configuration files (logging, etc.)
├── config_objects/    # Core validation logic and rule definitions
├── data_loaders/      # Data input handling (CSV, Parquet)
├── outputter/         # Result output formatting (Console, XML, etc.)
├── rules/             # Rule definitions and validation execution
└── utils/             # Utility functions (performance, currency codes)
The main module serves as the CLI interface and application orchestrator:
- Argument Parsing: Comprehensive command-line interface with support for various validation options
- Logging Setup: Flexible logging configuration with YAML/INI support and multiple fallback strategies
- Validation Orchestration: Coordinates the entire validation workflow from data loading to result output
- Visualization Support: Optional generation of validation dependency graphs using Graphviz
- Version Management: Handles both local and remote FOCUS specification versions
Key Features:
- Support for applicability criteria filtering
- Remote specification downloading from GitHub
- Performance timing and logging
- Cross-platform file opening for visualizations
The Validator class is the central coordinator that manages the validation process:
- Data Loading: Delegates to appropriate data loaders based on file type
- Rule Loading: Manages FOCUS specification rule loading and version handling
- Validation Execution: Orchestrates rule execution against loaded datasets
- Result Management: Coordinates output formatting and result persistence
- Performance Monitoring: Integrated performance logging using decorators
Key Responsibilities:
- Version compatibility checking (local vs remote specifications)
- Applicability criteria processing and filtering
- Data and rule loading coordination
- Validation result aggregation and output
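To make these responsibilities concrete, here is a minimal, hypothetical sketch of driving the orchestrator programmatically. The constructor arguments and method names are assumptions based on the description above, not the actual API; the CLI shown later in this document is the supported interface:
import logging
# Hypothetical usage sketch; parameter and method names are assumptions.
from focus_validator.validator import Validator

logging.basicConfig(level=logging.INFO)

validator = Validator(
    data_filename="billing_export.csv",   # dataset to validate (assumed parameter name)
    focus_spec_version="1.2",             # local or remote FOCUS version (assumed)
    output_type="console",                # delegate formatting to an outputter (assumed)
)
validator.load()      # data and rule loading coordination (assumed method)
validator.validate()  # rule execution and result aggregation (assumed method)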
This module contains the core validation logic and configuration management:
- ModelRule: Pydantic model for FOCUS specification rules
- ValidationCriteria: Detailed validation requirements and conditions
- CompositeCheck: Support for complex AND/OR logic in rule dependencies
- Status Management: Rule lifecycle management (Active, Draft, etc.)
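As an illustration only, the models described above might be declared roughly as follows in Pydantic; ModelRule, ValidationCriteria, and the Active/Draft statuses come from the text above, while the individual field names are assumptions:
# Illustrative Pydantic sketch; field names beyond the class names above are assumptions.
from enum import Enum
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class RuleStatus(str, Enum):
    ACTIVE = "Active"
    DRAFT = "Draft"

class ValidationCriteria(BaseModel):
    requirement: Dict[str, Any]          # e.g. {"Function": "TypeString", "ColumnName": "BilledCost"}
    condition: Optional[str] = None      # optional row-level condition

class ModelRule(BaseModel):
    rule_id: str
    status: RuleStatus = RuleStatus.ACTIVE
    validation_criteria: ValidationCriteria
    applicability_criteria: List[str] = []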
The FocusToDuckDBSchemaConverter is one of the most sophisticated components of the validation engine, implementing a comprehensive SQL code generation framework with advanced pattern matching, error recovery, and performance optimization:
Internal Architecture & Core Components:
Check Generator Registry System:
CHECK_GENERATORS: Dict[str, Dict[str, Any]] = {
"ColumnPresent": {"generator": ColumnPresentCheckGenerator, "factory": lambda args: "ColumnName"},
"TypeString": {"generator": TypeStringCheckGenerator, "factory": lambda args: "ColumnName"},
"FormatDateTime": {"generator": FormatDateTimeGenerator, "factory": lambda args: "ColumnName"},
"CheckModelRule": {"generator": CheckModelRuleGenerator, "factory": lambda args: "ModelRuleId"},
"AND": {"generator": CompositeANDRuleGenerator, "factory": lambda args: "Items"},
"OR": {"generator": CompositeORRuleGenerator, "factory": lambda args: "Items"}
# ... 20+ total generators
}
Generator Architecture Patterns:
- Abstract Base Class: DuckDBCheckGenerator defines the contract with REQUIRED_KEYS, DEFAULTS, and FREEZE_PARAMS class variables
- Template Method Pattern: Subclasses implement generateSql() for SQL generation and getCheckType() for type identification
- Factory Pattern: Registry system enables dynamic generator selection based on FOCUS rule function names
- Parameter Validation: Automatic validation of required parameters using Pydantic-style checking with REQUIRED_KEYS
- Immutable Parameters: FREEZE_PARAMS creates read-only parameter objects using MappingProxyType for thread safety
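A condensed sketch of the base-class contract just described, reconstructed from the listed class variables and template methods; the real implementation will differ in detail:
# Reconstructed sketch of the contract described above; not the actual source.
from types import MappingProxyType
from typing import Any, Dict

class DuckDBCheckGenerator:
    REQUIRED_KEYS: set = set()      # parameters every subclass instance must receive
    DEFAULTS: Dict[str, Any] = {}   # fallback values merged into the supplied parameters
    FREEZE_PARAMS: bool = True      # expose parameters as a read-only mapping

    def __init__(self, params: Dict[str, Any]):
        missing = self.REQUIRED_KEYS - params.keys()
        if missing:
            raise ValueError(f"Missing required parameters: {sorted(missing)}")
        merged = {**self.DEFAULTS, **params}
        # MappingProxyType provides the read-only view mentioned above
        self.params = MappingProxyType(merged) if self.FREEZE_PARAMS else merged

    def generateSql(self) -> str:    # template method implemented by subclasses
        raise NotImplementedError

    def getCheckType(self) -> str:   # template method implemented by subclasses
        raise NotImplementedError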
SQL Generation Framework:
Template System Architecture:
- Dual Template Support: Handles both {table_name} and {{table_name}} patterns for backward compatibility
- Parameter Substitution: Safe parameter injection using the _lit() method with SQL injection prevention
- Dynamic SQL Construction: Runtime SQL assembly based on rule requirements and data types
- Query Optimization: Template caching and SQL query plan reuse for repeated validations
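A minimal sketch of how the dual-template substitution and _lit() escaping described above could work; the helper bodies shown here are assumptions, not the validator's actual code:
def _lit(value) -> str:
    # Render a Python value as a SQL string literal, doubling single quotes to prevent injection
    return "'" + str(value).replace("'", "''") + "'"

def render(sql_template: str, table_name: str) -> str:
    # Accept both {{table_name}} and {table_name} placeholders for backward compatibility
    return sql_template.replace("{{table_name}}", table_name).replace("{table_name}", table_name)

template = "SELECT COUNT(*) FROM {table_name} WHERE ChargeCategory = " + _lit("Usage")
print(render(template, "focus_data"))
# SELECT COUNT(*) FROM focus_data WHERE ChargeCategory = 'Usage'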
build_check() - Advanced Check Construction Pipeline:
Multi-Phase Construction Process:
# Detailed build_check() workflow
1. Rule Type Analysis → Determine generator class from CHECK_GENERATORS registry
2. Applicability Assessment → Evaluate rule inclusion via _should_include_rule()
3. Parameter Extraction → Parse ValidationCriteria.requirement for generator parameters
4. Generator Instantiation → Create generator with validated parameters and context
5. SQL Generation → Execute generateSql() with template substitution
6. Check Object Assembly → Wrap in DuckDBColumnCheck with metadata and execution context
7. Nested Check Handling → Recursively process composite rule children
8. Performance Optimization → Cache compiled checks for reuse
Technical Implementation Details:
Dynamic Rule Processing:
- Rule Type Detection: Uses rule.is_dynamic() to identify rules requiring runtime data analysis
- Skipped Check Generation: Creates SkippedDynamicCheck objects for rules that cannot be statically validated
- Metadata Preservation: Maintains rule context and reasoning for debugging and reporting
Applicability Filtering Engine:
def _should_include_rule(self, rule, parent_edges=None) -> bool:
# Hierarchical applicability checking
1. Check rule's own ApplicabilityCriteria against validated_applicability_criteria
2. Traverse parent_edges to validate entire dependency chain
3. Apply AND logic - rule included only if all criteria in chain are satisfied
4. Generate SkippedNonApplicableCheck for filtered rules
Composite Rule Architecture:
- Recursive Descent Parsing: Processes ValidationCriteria.requirement.Items recursively for nested AND/OR structures
- Child Check Generation: Each composite rule item becomes a separate DuckDBColumnCheck object
- Logic Handler Assignment: Assigns operator.and_ or operator.or_ functions for result aggregation
- Dependency Context Propagation: Passes parent results and edge context down to child generators
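For illustration, aggregating child results with the assigned operator functions can be done with a simple reduction; this is a sketch of the idea, not the validator's exact code:
import operator
from functools import reduce

def aggregate(child_results, logic):
    # logic is operator.and_ for AND composites, operator.or_ for OR composites
    initial = logic is operator.and_   # AND over no children is True, OR over none is False
    return reduce(logic, child_results, initial)

print(aggregate([True, True, False], operator.and_))  # False: one child failed
print(aggregate([False, True], operator.or_))         # True: at least one child passed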
run_check() - High-Performance Execution Engine:
Multi-Level Execution Strategy:
# Comprehensive execution pipeline
1. Connection Validation → Ensure DuckDB connection is active and prepared
2. Check Type Dispatch → Route to appropriate execution path (skipped/leaf/composite)
3. SQL Template Resolution → Substitute table names and parameters safely
4. Query Execution → Execute with comprehensive error handling and recovery
5. Result Processing → Parse DuckDB results and format for aggregation
6. Error Analysis → Extract column names from error messages for intelligent reporting
7. Nested Aggregation → Process composite rule children and apply logic operators
8. Performance Metrics → Capture execution timing and resource usage
Advanced Error Recovery System:
Missing Column Detection:
import re
from typing import List

def _extract_missing_columns(err_msg: str) -> List[str]:
    # Multi-pattern column extraction from DuckDB error messages
    patterns = [
        r'Column with name ([A-Za-z0-9_"]+) does not exist',
        r'Binder Error: .*? column ([A-Za-z0-9_"]+)',
        r'"([A-Za-z0-9_]+)" not found',
    ]
    columns = set()
    for pattern in patterns:
        columns.update(match.strip('"') for match in re.findall(pattern, err_msg))
    # Returns a sorted list of missing column names for precise error reporting
    return sorted(columns)
SQL Error Classification:
- Syntax Errors: Template substitution failures and malformed SQL detection
- Schema Mismatches: Column type conflicts and missing table detection
- Data Validation Errors: Constraint violations and format validation failures
- Performance Issues: Timeout detection and resource exhaustion handling
Composite Rule Execution Engine:
Upstream Dependency Short-Circuiting:
# Advanced dependency failure handling
if upstream_failure_detected:
    1. Mark all child checks as failed with upstream reason
    2. Preserve failure context and dependency chain information
    3. Skip actual SQL execution to avoid cascading errors
    4. Generate detailed failure report with root cause analysis
Result Aggregation Logic:
- AND Logic: all(child_results) - requires all children to pass
- OR Logic: any(child_results) - requires at least one child to pass
- Failure Propagation: Maintains detailed failure context including which specific children failed
- Performance Optimization: Early termination for AND (first failure) and OR (first success) operations
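A sketch of the short-circuit behaviour described above, assuming a hypothetical per-check run() API; child checks are evaluated lazily so all()/any() stop at the first decisive result:
def run_composite(child_checks, connection, is_and: bool) -> bool:
    # child_checks expose a hypothetical run(connection) -> bool API (assumption)
    results = (check.run(connection) for check in child_checks)  # lazy generator, no upfront execution
    # all() stops at the first False (AND); any() stops at the first True (OR)
    return all(results) if is_and else any(results)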
Advanced SQL Generation Patterns:
Type-Specific Generators:
ColumnPresentCheckGenerator:
WITH col_check AS (
SELECT COUNT(*) AS found
FROM information_schema.columns
WHERE table_name = '{table_name}' AND column_name = '{column_name}'
)
SELECT
CASE WHEN found = 0 THEN 1 ELSE 0 END AS violations,
CASE WHEN found = 0 THEN '{error_message}' END AS error_message
FROM col_check
TypeStringCheckGenerator:
SELECT
COUNT(*) AS violations,
'{column_name} must be string type' AS error_message
FROM {table_name}
WHERE {column_name} IS NOT NULL
  AND typeof({column_name}) != 'VARCHAR'
FormatDateTimeGenerator:
WITH datetime_violations AS (
SELECT COUNT(*) AS violations
FROM {table_name}
WHERE {column_name} IS NOT NULL
AND TRY_STRPTIME({column_name}, '{expected_format}') IS NULL
)
SELECT violations,
CASE WHEN violations > 0
THEN '{column_name} format violations: ' || violations || ' rows'
END AS error_message
FROM datetime_violations
Performance & Memory Optimization:
Connection Management:
- Connection Pooling: Reuses DuckDB connections across multiple validations
- Memory Monitoring: Tracks memory usage for large dataset processing
- Query Plan Caching: DuckDB query plan reuse for repeated validation patterns
- Parallel Execution: Thread-safe generator design for concurrent validation
Algorithmic Efficiency:
- Lazy Evaluation: SQL queries only executed when results are needed
- Batch Processing: Groups similar validations for bulk execution
- Result Streaming: Processes large result sets without loading entire datasets into memory
- Index Utilization: Generates SQL that leverages DuckDB's columnar indices
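A minimal sketch of the connection-reuse idea using DuckDB's Python API; the table name and CSV path are placeholders, and the validator's actual pooling logic is more involved than this:
import duckdb

# One in-memory connection reused for every check (placeholder file and table names).
con = duckdb.connect()
con.execute("CREATE TABLE focus_data AS SELECT * FROM read_csv_auto('billing_export.csv')")

checks = [
    "SELECT COUNT(*) FROM focus_data WHERE BilledCost IS NULL",
    "SELECT COUNT(*) FROM focus_data WHERE typeof(BilledCost) != 'DOUBLE'",
]
for sql in checks:
    violations = con.execute(sql).fetchone()[0]  # same connection, repeated queries
    print(sql, "->", violations, "violations")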
Extensibility Framework:
Custom Generator Development:
class CustomCheckGenerator(DuckDBCheckGenerator):
    REQUIRED_KEYS = {"ColumnName", "ThresholdValue"}
    DEFAULTS = {"TolerancePercent": 0.1}

    def generateSql(self) -> str:
        # Custom SQL generation logic
        return f"SELECT COUNT(*) FROM {{table_name}} WHERE {self.params.ColumnName} > {self.params.ThresholdValue}"

    def getCheckType(self) -> str:
        return "custom_threshold_check"
Registry Integration:
- Dynamic Registration: New generators can be added to CHECK_GENERATORS at runtime
- Parameter Validation: Automatic validation using REQUIRED_KEYS and type checking
- Factory Function: Consistent parameter extraction across all generator types
- Metadata Preservation: Full context and provenance tracking for custom validations
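Building on the CustomCheckGenerator example above, runtime registration could look like the following sketch; the registry name and entry shape mirror the CHECK_GENERATORS excerpt shown earlier and are illustrative only:
# Illustrative only: register the CustomCheckGenerator defined above at runtime,
# assuming access to the converter module's CHECK_GENERATORS dictionary.
CHECK_GENERATORS["CustomThreshold"] = {
    "generator": CustomCheckGenerator,
    "factory": lambda args: "ColumnName",
}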
The RuleDependencyResolver is the most sophisticated component in the validation engine, implementing advanced graph algorithms to analyze complex rule interdependencies and optimize execution paths:
Internal Architecture & Data Structures:
The resolver maintains three core data structures for efficient dependency management:
- dependency_graph: Dict[str, Set[str]] mapping rule_id → {dependent_rule_ids}, representing forward edges
- reverse_graph: Dict[str, List[str]] mapping rule_id → [rules_that_depend_on_this] for backward traversal
- in_degree: Dict[str, int] tracking prerequisite counts for Kahn's algorithm implementation
buildDependencyGraph() - Advanced Dependency Analysis:
Algorithm Implementation:
# Pseudocode for dependency graph construction
1. Filter rules by target prefix (e.g., "BilledCost*")
2. Recursively collect transitive dependencies using BFS
3. Build bidirectional graph structures for O(1) lookups
4. Calculate in-degree counts for topological processing
5. Validate graph integrity and detect potential cycles
Technical Details:
- Rule Filtering: Supports prefix-based filtering (e.g., target_rule_prefix="BilledCost") to process subsets of the full rule graph, essential for large FOCUS specifications with 200+ rules
- Transitive Closure: Uses _collectAllDependencies() with deque-based BFS to recursively discover all child dependencies, ensuring composite rules include their nested components even when they don't match the prefix filter
- Graph Construction: Builds forward and reverse adjacency lists simultaneously for O(1) dependency lookups during execution
- Memory Optimization: Uses defaultdict(set) and defaultdict(list) to minimize memory allocation overhead
- Cycle Prevention: Maintains processed sets to prevent infinite recursion during dependency discovery
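A simplified sketch of the deque-based transitive dependency collection described above; direct_deps is an assumed input shape mapping each rule to its direct prerequisites:
from collections import defaultdict, deque
from typing import Dict, Iterable, Set

def collect_all_dependencies(direct_deps: Dict[str, Iterable[str]],
                             root_ids: Iterable[str]) -> Dict[str, Set[str]]:
    # direct_deps maps rule_id -> its direct prerequisites (assumed input shape)
    dependency_graph: Dict[str, Set[str]] = defaultdict(set)
    processed: Set[str] = set()
    queue = deque(root_ids)
    while queue:
        rule_id = queue.popleft()
        if rule_id in processed:          # processed set prevents infinite loops on cycles
            continue
        processed.add(rule_id)
        for dep_id in direct_deps.get(rule_id, ()):
            dependency_graph[rule_id].add(dep_id)
            queue.append(dep_id)          # discover dependencies transitively via BFS
    return dependency_graph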
Composite Rule Propagation:
- Condition Inheritance: Implements _propagate_composite_conditions() to push parent composite rule conditions down to child rules via private attributes
- CheckModelRule Processing: Analyzes ValidationCriteria.requirement.Items to identify referenced rules and propagate preconditions
- Runtime Condition Evaluation: Child rules inherit parent conditions and evaluate them dynamically during execution
getTopologicalOrder() - Advanced Graph Algorithms:
Kahn's Algorithm Implementation:
# Enhanced Kahn's algorithm with cycle detection
1. Initialize in-degree counts from dependency graph
2. Queue all zero-degree nodes (no prerequisites)
3. Process nodes level by level, updating dependent in-degrees
4. Detect cycles when remaining nodes > 0 after processing
5. Handle circular dependencies by appending remaining nodes
Cycle Detection & Analysis:
- Tarjan's SCC Algorithm: Implements _tarjan_scc() for strongly connected component detection with O(V+E) complexity
- Cycle Visualization: _export_scc_dot() generates Graphviz DOT files for each strongly connected component, enabling visual debugging of complex cycles
- Simple Cycle Reconstruction: _find_simple_cycle() uses DFS with path tracking to identify specific circular dependency chains
- Detailed Cycle Logging: Provides comprehensive cycle analysis including component sizes, adjacency matrices, and example cycle paths
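For reference, a standard Kahn's-algorithm implementation matching the pseudocode above (not the resolver's exact code), including the fallback of appending nodes left blocked by a cycle:
from collections import deque
from typing import Dict, Iterable, List

def topological_order(dependents_of: Dict[str, Iterable[str]],
                      in_degree: Dict[str, int]) -> List[str]:
    # dependents_of[rule] = rules that depend on rule; in_degree[rule] = prerequisite count
    remaining = dict(in_degree)
    queue = deque(rule for rule, degree in remaining.items() if degree == 0)
    order: List[str] = []
    while queue:
        rule = queue.popleft()
        order.append(rule)
        for dependent in dependents_of.get(rule, ()):
            remaining[dependent] -= 1
            if remaining[dependent] == 0:
                queue.append(dependent)
    blocked = [rule for rule, degree in remaining.items() if degree > 0]
    if blocked:  # cycle detected: append remaining nodes so every rule is still scheduled
        order.extend(blocked)
    return order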
Advanced Debugging & Instrumentation:
Graph Analytics:
- _log_graph_snapshot(): Captures comprehensive graph metrics (node count, edge count, zero-degree nodes) with sampling
- _trace_node(): Implements bounded DFS to trace dependency chains up to a configurable depth for debugging blocked nodes
- _dump_blockers(): Analyzes remaining nodes after topological sort to identify specific blocking dependencies
Performance Monitoring:
- Edge Counting: Tracks total dependency relationships for complexity analysis
- Zero-Degree Analysis: Identifies entry points (rules with no dependencies) for parallel execution opportunities
- Blocking Analysis: Detailed reporting of rules that cannot be scheduled due to unsatisfied prerequisites
build_plan_and_schedule() - Execution Plan Optimization:
PlanBuilder Integration:
# Advanced execution planning workflow
1. Create PlanBuilder with filtered relevant rules
2. Build parent-preserving forest from entry points
3. Compile to layered ValidationPlan for parallel execution
4. Apply runtime context and edge predicates
5. Generate deterministic scheduling with tie-breaking
Technical Implementation:
- Forest Construction: Uses builder.build_forest(roots) to create parent-preserving execution trees that maintain rule relationships
- Compilation Pipeline: Transforms the abstract plan graph into a concrete ValidationPlan with index-based node references for O(1) lookups
- Edge Context Propagation: Maintains EdgeCtx objects throughout planning to preserve dependency reasoning and conditional activation
- Execution Context: Supports a runtime exec_ctx parameter for dynamic rule filtering and conditional execution paths
Memory & Performance Optimizations:
Data Structure Efficiency:
- Deque-Based Processing: Uses collections.deque for BFS traversals to minimize memory reallocation
- Set Operations: Leverages Python set operations for O(1) membership testing and efficient union/intersection operations
- Processed Tracking: Maintains processed sets to prevent redundant work during recursive dependency discovery
Algorithmic Complexity:
- Graph Construction: O(V + E) where V = rules, E = dependencies
- Topological Sort: O(V + E) with Kahn's algorithm
- SCC Detection: O(V + E) with Tarjan's algorithm
- Memory Usage: O(V + E) for graph storage with minimal overhead
Applicability Criteria Integration:
Dynamic Rule Filtering:
- Criteria Validation: Supports a validated_applicability_criteria parameter for runtime rule inclusion/exclusion
- Hierarchical Processing: Always includes rules in the dependency graph but marks them for potential skipping during execution
- SkippedNonApplicableCheck Generation: Defers actual filtering to execution phase via converter's applicability checking
- Execution Planning: Creates layered execution plans optimized for parallel processing
- Edge Context Management: Tracks why dependencies exist with conditional activation predicates
- Topological Scheduling: Implements Kahn's algorithm with deterministic tie-breaking for consistent execution order
- Parent Preservation: Maintains parent-child relationships throughout the planning process for result aggregation
- Specification Parsing: Loads and parses FOCUS JSON rule definitions
- Version Management: Handles multiple FOCUS specification versions
- Remote Downloading: Supports fetching specifications from GitHub releases
Extensible data loading framework supporting multiple file formats:
- Format Detection: Automatic detection of file formats based on extension
- Performance Monitoring: Integrated loading performance tracking
- Error Handling: Comprehensive error handling and logging
- Memory Management: Efficient handling of large datasets
- CSV Loader (csv_data_loader.py): Optimized CSV parsing with configurable options
- Parquet Loader (parquet_data_loader.py): High-performance Parquet file processing
Key Features:
- Automatic file type detection
- Performance monitoring and logging
- Memory-efficient processing for large datasets
- Extensible architecture for additional formats
The CSV data loader uses Polars with infer_schema_length=10000 to analyze the first 10,000 rows for data type detection. This provides a good balance between accuracy and performance for most datasets.
Understanding Schema Inference:
The infer_schema_length parameter determines how many rows Polars examines to infer column data types:
- Default: 10,000 rows - Suitable for most FOCUS datasets
- Higher values - More accurate type detection but slower loading
- Lower values - Faster loading but may miss type patterns
Handling Datasets with Many Null Values:
If your dataset has more than 10,000 rows of null values at the beginning, you may encounter data type inference issues. Here are several solutions:
When loading CSV data programmatically, provide explicit column types:
from focus_validator.data_loaders.csv_data_loader import CSVDataLoader
import polars as pl
# Define column types for problematic columns
column_types = {
"BilledCost": pl.Float64(),
"EffectiveCost": pl.Float64(),
"ListCost": pl.Float64(),
"BillingPeriodStart": pl.Datetime("us", "UTC"),
"BillingPeriodEnd": pl.Datetime("us", "UTC"),
"AccountId": pl.Utf8(),
"ResourceId": pl.Utf8(),
# Add other columns as needed
}
loader = CSVDataLoader("your_data.csv", column_types=column_types)
df = loader.load()
Pre-process your CSV file to move non-null data towards the beginning:
# Sort CSV by a column that has early non-null values
head -1 your_data.csv > header.csv
tail -n +2 your_data.csv | sort -t',' -k1,1 > sorted_data.csv
cat header.csv sorted_data.csv > cleaned_data.csv
For very large datasets with sparse data, you can increase the inference length by loading the file directly with Polars:
import polars as pl
# Load with increased inference length
df = pl.read_csv(
"your_data.csv",
infer_schema_length=50000, # Analyze first 50,000 rows
try_parse_dates=True,
null_values=["", "NULL", "INVALID", "INVALID_COST", "BAD_DATE"]
)
For extremely challenging datasets, use a two-pass approach:
import polars as pl

# First pass: read the file and sample rows to understand data patterns
schema_sample = pl.scan_csv("your_data.csv").collect().sample(n=10000)

# Analyze the sample to determine appropriate types
inferred_types = {}
for col in schema_sample.columns:
    non_null_values = schema_sample[col].drop_nulls()
    if len(non_null_values) > 0:
        # Determine the appropriate type based on the sample
        if non_null_values.dtype == pl.Utf8:
            # Try to infer whether the column should be numeric or datetime
            try:
                non_null_values.cast(pl.Float64)
                inferred_types[col] = pl.Float64()
            except Exception:
                try:
                    non_null_values.str.strptime(pl.Datetime)
                    inferred_types[col] = pl.Datetime("us", "UTC")
                except Exception:
                    inferred_types[col] = pl.Utf8()

# Second pass: load with the determined schema
df = pl.read_csv("your_data.csv", schema_overrides=inferred_types)
Common Signs of Schema Inference Issues:
- Numeric columns loaded as strings (Utf8) when they should be Float64
- Date columns not parsed as datetime types
- Validation failures for type-checking rules
- Unexpected null value counts in validation output
Best Practices for Large Datasets:
- Validate Your Data First: Use tools like head, tail, and wc -l to understand your data structure
- Check for Empty Rows: Ensure the beginning of your file contains representative data
- Use Consistent Null Representations: Stick to standard null values like empty strings or "NULL"
- Test with Sample Data: Validate schema inference with a smaller sample before processing full datasets
- Monitor Loading Logs: The CSV loader provides detailed logging about type conversion attempts and failures
Performance Considerations:
- Higher infer_schema_length values increase loading time but improve accuracy
- For very large files (>1GB), consider processing in chunks or using Parquet format instead
- The CSV loader includes automatic fallback mechanisms for problematic columns
The core validation execution engine:
- Rule Loading: Manages loading of FOCUS specification rules from JSON
- Version Management: Handles multiple FOCUS versions and compatibility
- Validation Execution: Orchestrates rule execution against datasets using DuckDB
- Result Aggregation: Collects and organizes validation results
- Remote Specification Support: Downloads and caches remote specifications
Validation Process:
- Load FOCUS specification from JSON files or remote sources
- Parse rules and build dependency graph
- Convert rules to executable DuckDB SQL queries
- Execute validation queries against dataset
- Aggregate results and generate comprehensive reports
Flexible output system supporting multiple formats:
- Format Selection: Factory pattern for output format selection
- Result Processing: Standardized result processing and formatting
- Console Outputter (outputter_console.py): Human-readable terminal output
- XML/JUnit Outputter (outputter_unittest.py): CI/CD compatible XML reports
- Validation Graph Outputter (outputter_validation_graph.py): Graphviz visualizations
Supporting utilities for specialized functionality:
- Performance Logging (performance_logging.py): Decorator-based performance monitoring
- Currency Code Downloads (download_currency_codes.py): Dynamic currency validation support
The validation process follows this high-level data flow:
- Input Processing: CLI arguments parsed and configuration loaded
- Data Loading: Dataset loaded using appropriate data loader (CSV/Parquet)
- Rule Loading: FOCUS specification rules loaded and parsed from JSON
- Plan Building: Validation execution plan built considering rule dependencies
- SQL Generation: Rules converted to optimized DuckDB SQL queries
- Validation Execution: SQL queries executed against dataset using DuckDB engine
- Result Aggregation: Validation results collected and organized
- Output Generation: Results formatted and output using selected formatter
- Optional Visualization: Dependency graph and results visualized using Graphviz
- Modularity: Clear separation of concerns with pluggable components
- Performance: Optimized for large datasets using DuckDB and efficient algorithms
- Extensibility: Easy to add new data formats, output formats, and validation rules
- Reliability: Comprehensive error handling and logging throughout
- Standards Compliance: Full adherence to FOCUS specification requirements
- Developer Experience: Rich logging, performance monitoring, and debugging support
- Core Language: Python 3.9+ with type hints and modern language features
- Data Processing: DuckDB for high-performance SQL-based validation
- Data Formats: Pandas/Polars for CSV/Parquet processing
- Configuration: Pydantic for type-safe configuration management
- CLI: argparse for comprehensive command-line interface
- Visualization: Graphviz for validation dependency graphs
- Testing: pytest with comprehensive test coverage
- Code Quality: Black, isort, flake8, mypy for code formatting and quality
The FOCUS Validator maintains high code quality through comprehensive testing:
- Unit Tests: Component-level testing for individual modules
- Integration Tests: End-to-end validation workflow testing
- Performance Tests: Large dataset processing validation
- Configuration Tests: Rule loading and version compatibility testing
- Data Loaders: 100% coverage ensuring reliable data ingestion
- Config Objects: 97%+ coverage for core validation logic
- Outputters: 85%+ coverage across all output formats
- Overall Project: 70% coverage with targeted improvement areas
- pytest: Primary testing framework with fixture support
- pytest-cov: Coverage reporting and analysis
- Black: Automated code formatting
- isort: Import organization and sorting
- flake8: Code style and complexity checking
- mypy: Static type checking and validation
- pre-commit: Git hook integration for quality checks
- Environment Management: Poetry-based dependency management
- Code Quality: Automated formatting and linting on commit
- Performance Monitoring: Built-in performance logging and profiling
- Testing: Makefile targets for module-specific and comprehensive testing
The project uses GitHub Actions for automated quality assurance:
- Linting Pipeline: mypy, black, isort, flake8 validation
- Test Execution: Full test suite with coverage reporting
- Performance Validation: Memory and execution time monitoring
- Multi-Platform Testing: Cross-platform compatibility verification
- Type Safety: Comprehensive type hints throughout codebase
- Error Handling: Structured exception hierarchy with detailed logging
- Performance Focus: Optimized for large dataset processing
- Extensibility: Plugin architecture for new formats and outputs
- Documentation: Inline documentation and comprehensive README
- Python 3.9+
- Poetry (Package & Dependency Manager)
If you haven't installed Poetry yet, you can do it by running:
curl -sSL https://install.python-poetry.org | python3 -
For alternative installation methods or more information about Poetry, please refer to the official documentation.
git clone https://github.com/finopsfoundation/focus-spec-validator.git
cd focus-spec-validator
Using Poetry, you can install the project's dependencies with:
poetry install
Activate the virtual environment (See: poetry-plugin-shell):
poetry shell
The recommended way to run FOCUS validations is using the web output type, which generates an interactive HTML report with advanced filtering, rule grouping, and detailed validation insights:
poetry run focus-validator --data-file tests/samples/focus_sample_10000.csv --validate-version 1.2 --applicability-criteria ALL --output-type web --output-destination '10000_sample_validate.html'
This command:
- Validates a dataset against FOCUS version 1.2
- Includes all applicable validation rules (--applicability-criteria ALL)
- Generates an interactive web report saved as 10000_sample_validate.html
- Provides filtering by rule status (PASS/FAILED/SKIPPED), entity types, and rule functions
- Shows both entity-centric and rule-centric views of validation results
For help and more options:
focus-validator --help
For console output or other formats:
# Console output (simple text format)
focus-validator --data-file your_data.csv --validate-version 1.2
# XML output for CI/CD integration
focus-validator --data-file your_data.csv --validate-version 1.2 --output-type unittest --output-destination results.xml
The FOCUS Validator includes a powerful Explain Mode that allows you to inspect validation rules and their underlying SQL logic without executing actual validation or requiring input data. This feature is invaluable for understanding FOCUS specification requirements, debugging validation logic, and learning how rules are implemented.
Explain Mode generates comprehensive explanations for all validation rules in a FOCUS specification version, including:
- Rule Metadata: Rule ID, type, check method, and generator information
- MustSatisfy Requirements: The human-readable requirement that each rule validates
- SQL Queries: Complete, formatted SQL queries that implement the validation logic
- Rule Relationships: Hierarchy and dependencies between composite and child rules
- Condition Logic: Row-level conditions and filters applied during validation
# Explain all rules for FOCUS 1.2 with CostAndUsage dataset
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode
# Show all rules including those with specific applicability requirements
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --applicability-criteria ALL
# Show rules for specific applicability criteria
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --applicability-criteria "AVAILABILITY_ZONE_SUPPORTED,MULTIPLE_SUB_ACCOUNT_TYPES_SUPPORTED"
# List all applicability criteria for a FOCUS version
focus-validator --validate-version 1.2 --show-applicability-criteria
The output is organized alphabetically by rule ID and includes comprehensive information for each rule:
=== SQL Explanations for 578 rules ===
AvailabilityZone-C-001-M
Type: leaf
Check: type_string
Generator: TypeStringCheckGenerator
MustSatisfy: AvailabilityZone MUST be of type String.
SQL:
WITH invalid AS (
SELECT 1
FROM focus_data
WHERE AvailabilityZone IS NOT NULL AND typeof(AvailabilityZone) != 'VARCHAR'
)
SELECT COUNT(*) AS violations,
CASE WHEN COUNT(*) > 0
THEN 'AvailabilityZone MUST be of type VARCHAR (string).'
END AS error_message
FROM invalid
BilledCost-C-000-M
Type: composite
Check: composite
Generator: CompositeANDRuleGenerator
MustSatisfy: The BilledCost column adheres to the following requirements:
Children: 7 rules
- BilledCost-C-000-M -> BilledCost-C-001-M (reference, model_rule_reference)
- BilledCost-C-000-M -> BilledCost-C-002-M (reference, model_rule_reference)
- BilledCost-C-000-M -> BilledCost-C-003-M (reference, model_rule_reference)
- BilledCost-C-000-M -> BilledCost-C-004-M (reference, model_rule_reference)
- BilledCost-C-000-M -> BilledCost-C-005-C (reference, model_rule_reference)
- BilledCost-C-000-M -> BilledCost-C-006-M (reference, model_rule_reference)
- BilledCost-C-000-M -> BilledCost-C-007-M (reference, model_rule_reference)
RegionId-C-000-C
Type: skipped
Check: None
Generator: None due to non-applicable rule
MustSatisfy: RegionId is RECOMMENDED to be present in a FOCUS dataset when the provider supports deploying resources or services within regions.
Rule Header: Rule ID in alphabetical order
Type:
- leaf - Individual validation rule with SQL query
- composite - Rule that combines multiple child rules (AND/OR logic)
- reference - Rule that references another rule's outcome
- skipped - Rule that cannot be executed (dynamic or non-applicable)
Check: The validation method used:
- type_string - Validates column data type is string
- column_presence - Validates column exists in dataset
- format_datetime - Validates datetime format compliance
- composite - Combines multiple child rule results
- model_rule_reference - References another rule's result
Generator: The code generator that creates the validation logic:
- TypeStringCheckGenerator - Generates SQL for type validation
- ColumnPresentCheckGenerator - Generates SQL for column existence
- CompositeANDRuleGenerator - Combines child rules with AND logic
- CompositeORRuleGenerator - Combines child rules with OR logic
- None due to dynamic rule - Rule requires runtime data analysis
- None due to non-applicable rule - Rule doesn't apply to current criteria
MustSatisfy: Human-readable description of what the rule validates, directly from the FOCUS specification
SQL: Complete, formatted SQL query that implements the validation (for leaf rules only)
Children: For composite rules, shows all child rules with their types and references
Condition: Row-level conditions applied during validation (when present)
Individual validation rules that execute SQL queries against the dataset. These represent the core validation logic.
Examples:
- Column presence validation
- Data type checking
- Format validation (dates, currencies, etc.)
- Value constraint checking
- Pattern matching validation
Rules that combine multiple child rules using logical operators:
- AND Rules (CompositeANDRuleGenerator): All child rules must pass
- OR Rules (CompositeORRuleGenerator): At least one child rule must pass
Rules that mirror the outcome of another rule without executing additional SQL.
Rules that cannot be executed for various reasons:
- Dynamic Rules: Require runtime data analysis to determine validation logic
- Non-Applicable Rules: Don't apply based on current applicability criteria
- Missing Dependencies: Reference unavailable components or data
Some FOCUS rules only apply under specific conditions (e.g., when a provider supports availability zones). The --applicability-criteria option controls which rules are included:
- Default (none specified): Shows only universally applicable rules
- ALL: Shows all rules including those with specific requirements
- Specific criteria: Shows rules for particular provider capabilities
Use --show-applicability-criteria to see available criteria for a FOCUS version.
The SQL queries in explain mode show exactly how each validation is implemented:
Column Presence Check:
WITH col_check AS (
SELECT COUNT(*) AS found
FROM information_schema.columns
WHERE table_name = 'focus_data' AND column_name = 'ColumnName'
)
SELECT CASE WHEN found = 0 THEN 1 ELSE 0 END AS violations,
CASE WHEN found = 0 THEN 'Column "ColumnName" MUST be present in the table.' END AS error_message
FROM col_check
Type Validation:
SELECT COUNT(*) AS violations,
'ColumnName MUST be of type VARCHAR (string).' AS error_message
FROM focus_data
WHERE ColumnName IS NOT NULL AND typeof(ColumnName) != 'VARCHAR'
Format Validation:
WITH invalid AS (
SELECT ColumnName::TEXT AS value
FROM focus_data
WHERE ColumnName IS NOT NULL
AND NOT (ColumnName::TEXT ~ '^[pattern]$')
)
SELECT COUNT(*) AS violations,
CASE WHEN COUNT(*) > 0 THEN 'Format validation message' END AS error_message
FROM invalid
- Understand what each rule validates
- See the relationship between rules
- Learn validation requirements for each column
- Inspect SQL logic for failing rules
- Understand why certain rules are skipped
- Analyze composite rule dependencies
- Use SQL patterns for custom validation tools
- Understand FOCUS rule implementation details
- Reference validation logic for documentation
- Validate new rule implementations
- Test rule logic without full datasets
- Debug rule generation and SQL creation
The FOCUS Validator includes a powerful SQL Transpilation feature that allows you to see how DuckDB validation queries would appear in different SQL dialects. This is invaluable for database migration planning, cross-platform compatibility analysis, and understanding how validation logic translates across different database systems.
The --transpile option can only be used with --explain-mode and allows you to specify a target SQL dialect:
# Transpile validation queries to PostgreSQL
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile postgres
# Transpile to MySQL
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile mysql
# Transpile to BigQuery
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile bigquery
# Transpile to Snowflake
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile snowflake
The transpiler supports 18+ SQL dialects, including:
- PostgreSQL: postgres - Enterprise-grade RDBMS
- MySQL: mysql - Popular open-source database
- BigQuery: bigquery - Google Cloud data warehouse
- Snowflake: snowflake - Cloud data platform
- Microsoft SQL Server: tsql - Enterprise database system
- Oracle: oracle - Enterprise database platform
- SQLite: sqlite - Lightweight embedded database
- Amazon Redshift: redshift - AWS data warehouse
- Databricks: databricks - Unified analytics platform
- Apache Spark: spark - Distributed computing engine
- ClickHouse: clickhouse - Columnar database
- Teradata: teradata - Enterprise data warehouse
- Hive: hive - Hadoop data warehouse
- Presto/Trino: presto - Distributed SQL engine
- Apache Drill: drill - Schema-free SQL engine
- StarRocks: starrocks - MPP database
- DuckDB: duckdb - Analytical database (original format)
- ANSI SQL: ansi - Standard SQL compliance
When using --transpile, the SQL queries in explain mode are automatically converted to the target dialect:
=== SQL Explanations for 578 rules (transpiled to postgres) ===
AvailabilityZone-C-001-M
Type: leaf
Check: type_string
Generator: TypeStringCheckGenerator
MustSatisfy: AvailabilityZone MUST be of type String.
SQL (PostgreSQL):
WITH invalid AS (
SELECT 1
FROM focus_data
WHERE "AvailabilityZone" IS NOT NULL AND pg_typeof("AvailabilityZone") != 'text'
)
SELECT COUNT(*) AS violations,
CASE WHEN COUNT(*) > 0
THEN 'AvailabilityZone MUST be of type VARCHAR (string).'
END AS error_message
FROM invalid
Dialect-Specific Syntax: Automatically converts DuckDB syntax to target dialect conventions:
- Function Names: typeof() → pg_typeof() (PostgreSQL), type() (BigQuery)
- Type Names: VARCHAR → TEXT (PostgreSQL), STRING (BigQuery)
- Identifiers: Automatic quoting and escaping based on dialect requirements
- Date/Time Functions: Converts temporal functions to dialect-specific equivalents
- String Operations: Adapts regex and string manipulation functions
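Dialect conversion of the kind listed above is typically performed with a SQL transpilation library such as sqlglot; whether the validator uses sqlglot internally is an assumption, but the following sketch shows the general idea:
# Illustrative only: convert a DuckDB-style check query to PostgreSQL syntax
# using sqlglot (an assumption about tooling, not a statement about the validator).
import sqlglot

duckdb_sql = (
    "SELECT COUNT(*) AS violations "
    "FROM focus_data "
    "WHERE BilledCost IS NOT NULL AND typeof(BilledCost) != 'VARCHAR'"
)
# read/write name the source and target dialects
postgres_sql = sqlglot.transpile(duckdb_sql, read="duckdb", write="postgres")[0]
print(postgres_sql)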
Error Handling: Robust fallback mechanism for complex queries:
- Graceful Degradation: Shows original DuckDB SQL if transpilation fails
- Error Logging: Detailed error messages for debugging transpilation issues
- Partial Success: Successfully transpiled queries are shown even if others fail
Performance Optimization: Efficient transpilation without data processing:
- Parse-Only Mode: Analyzes SQL structure without executing queries
- Cached Results: Reuses transpilation results for similar query patterns
- Fast Execution: No impact on explain mode performance
Understand how FOCUS validation logic would need to be adapted when migrating from DuckDB to other database systems:
# See how validations would work in PostgreSQL environment
focus-validator --explain-mode --transpile postgres --filter-rules "BilledCost*"
# Compare BigQuery compatibility for cloud migration
focus-validator --explain-mode --transpile bigquery --filter-rules "EffectiveCost*"
Validate that your FOCUS implementation can work across multiple database platforms:
# Check MySQL compatibility for web application backends
focus-validator --explain-mode --transpile mysql
# Verify Snowflake compatibility for data warehouse implementations
focus-validator --explain-mode --transpile snowflake
Use transpiled SQL as a reference for implementing FOCUS validation in different database environments:
# Generate SQL for Oracle-based validation systems
focus-validator --explain-mode --transpile oracle > oracle_validation_queries.sql
# Create Spark-compatible validation logic
focus-validator --explain-mode --transpile spark > spark_validation_logic.sql
Learn how SQL patterns differ across database systems:
# Compare how type checking differs between databases
focus-validator --explain-mode --transpile postgres --filter-rules "*-C-001-M" | grep "typeof"
focus-validator --explain-mode --transpile mysql --filter-rules "*-C-001-M" | grep "typeof"
Unsupported Dialect Handling: If an invalid dialect is specified, the tool provides helpful feedback:
focus-validator --explain-mode --transpile invalid_dialect
# Error: Unsupported dialect 'invalid_dialect'. Available dialects: postgres, mysql, bigquery, snowflake, ...
Complex Query Limitations: Some advanced DuckDB features may not have direct equivalents in all dialects:
- Complex window functions
- Advanced temporal operations
- Specialized analytical functions
- Custom DuckDB extensions
When transpilation isn't possible, the original DuckDB SQL is shown with a note about the limitation.
Fallback Behavior: The transpiler prioritizes showing useful information:
- Successful Transpilation: Shows converted SQL in target dialect
- Partial Failure: Shows original SQL with transpilation error note
- Complete Failure: Shows original DuckDB SQL without modification
The --transpile option works seamlessly with all other explain mode features:
# Transpile with applicability criteria
focus-validator --explain-mode --transpile postgres --applicability-criteria ALL
# Transpile specific rule groups
focus-validator --explain-mode --transpile bigquery --filter-rules "BilledCost*,EffectiveCost*"
# Transpile and save to file for analysis
focus-validator --explain-mode --transpile snowflake > snowflake_validation_queries.sql
Explain mode is designed for fast execution since it doesn't process actual data:
- Fast Execution: No data loading or SQL execution against datasets
- Complete Coverage: Analyzes all rules in the specification (500+ rules for FOCUS 1.2)
- Alphabetical Ordering: Predictable rule ordering for easy navigation
- Detailed Output: Comprehensive information for each rule
- Transpilation Speed: SQL conversion adds minimal overhead to explain mode execution
The output can be extensive (500+ rules), so consider using shell tools for navigation:
# Search for specific rules with transpilation
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile postgres | grep "BilledCost"
# Page through transpiled output
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile mysql | less
# Save transpiled queries to file for analysis
focus-validator --validate-version 1.2 --focus-dataset CostAndUsage --explain-mode --transpile bigquery > bigquery_validation_queries.sql
Run all tests:
poetry run pytest
Run tests with verbose output:
poetry run pytest -v
Generate coverage report:
# Terminal coverage report
make coverage
# HTML coverage report (opens in browser)
make coverage-html
# Comprehensive coverage report (HTML + XML + terminal)
make coverage-report
Run tests for specific modules:
# Data loaders only
make test-data-loaders
# Config objects only
make test-config-objects
# Output formatters only
make test-outputter
# Coverage for data loaders (100% coverage)
make coverage-data-loaders
# Coverage for config objects (97%+ coverage)
make coverage-config-objects
# Coverage for outputters (85%+ coverage)
make coverage-outputter
- Data Loaders: 100% coverage (50 tests)
- Config Objects: 97%+ coverage (120 tests)
- Outputters: 85%+ coverage (80 tests)
- Core Components: 89%+ coverage (7 tests)
See COVERAGE_REPORT.md for detailed coverage analysis and improvement recommendations.
If running on legacy CPUs and the tests crash on the polars library, run the following locally only:
poetry add polars-lts-cpu
This will align the polars execution with your system hardware. It should NOT be committed back into the repository.
This project is licensed under the MIT License - see the LICENSE file for details.