tablediff-arrow

Fast, file-based diffs for Parquet/CSV/Arrow (local or S3) with keyed comparisons, per-column tolerances, and HTML/CSV reports - built on Apache Arrow.

Features

Fast: Built on Apache Arrow for high-performance data processing
Multiple Formats: Support for Parquet, CSV, and Arrow IPC files
S3 Support: Read files directly from S3 (optional)
Keyed Comparisons: Compare tables using one or more key columns
Numeric Tolerances: Configure absolute and relative tolerances for numeric columns
Rich Reports: Generate HTML and CSV reports with detailed differences
Python 3.10+: Modern Python with type hints and clean APIs
Well Tested: Comprehensive test suite with high coverage

Installation

Install from PyPI (recommended):

pip install tablediff-arrow

For S3 support:

pip install "tablediff-arrow[s3]"

For development (from source):

pip install -e ".[dev]"

Quick Start

Command Line Interface

Compare two Parquet files using id as the key column:

tablediff left.parquet right.parquet -k id

Compare with numeric tolerance:

tablediff left.csv right.csv -k id -t amount:0.01

Generate an HTML report:

tablediff left.parquet right.parquet -k id -o report.html

Compare S3 files:

tablediff s3://bucket/left.parquet s3://bucket/right.parquet -k id --s3

Python API

from tablediff_arrow import TableDiff

# Create a differ with key columns and tolerances
differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01},  # Absolute tolerance
    relative_tolerance={'price': 0.001}  # Relative tolerance (0.1%)
)

# Compare files
result = differ.compare_files('left.parquet', 'right.parquet')

# Print summary
print(result.summary())

# Check if there are differences
if result.has_differences:
    print(f"Found {result.changed_rows} changed rows")
    print(f"Found {result.left_only_rows} rows only in left")
    print(f"Found {result.right_only_rows} rows only in right")

# Generate reports
from tablediff_arrow.reports import generate_html_report, generate_csv_report

generate_html_report(result, 'report.html')
generate_csv_report(result, 'output_dir/', prefix='diff')

Usage Examples

Multiple Key Columns

Compare tables using composite keys:

tablediff left.parquet right.parquet -k year -k month -k product

from tablediff_arrow import TableDiff

differ = TableDiff(key_columns=['year', 'month', 'product'])
result = differ.compare_files('left.parquet', 'right.parquet')
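To make the idea of a composite key concrete, here is a minimal pure-Python sketch of keyed matching. It is not tablediff-arrow's implementation: the `keyed_diff` helper and the sample rows are hypothetical, and it only illustrates how rows can be paired by a tuple of key values and then sorted into changed, left-only, and right-only groups.

```python
from typing import Any

Row = dict[str, Any]


def keyed_diff(left: list[Row], right: list[Row], key_columns: list[str]):
    """Hypothetical helper: classify rows by composite key."""
    # Index each side by a tuple of its key-column values.
    left_by_key = {tuple(r[k] for k in key_columns): r for r in left}
    right_by_key = {tuple(r[k] for k in key_columns): r for r in right}

    left_only = sorted(left_by_key.keys() - right_by_key.keys())
    right_only = sorted(right_by_key.keys() - left_by_key.keys())
    # A key present on both sides counts as changed when the rows differ.
    changed = sorted(
        key
        for key in left_by_key.keys() & right_by_key.keys()
        if left_by_key[key] != right_by_key[key]
    )
    return changed, left_only, right_only


# Illustrative data only.
left = [
    {"year": 2024, "month": 1, "product": "A", "units": 10},
    {"year": 2024, "month": 2, "product": "A", "units": 12},
]
right = [
    {"year": 2024, "month": 1, "product": "A", "units": 11},
    {"year": 2024, "month": 3, "product": "A", "units": 9},
]
changed, left_only, right_only = keyed_diff(left, right, ["year", "month", "product"])
print(changed, left_only, right_only)
```

The sketch assumes each key combination appears at most once per side; how tablediff-arrow treats duplicate keys is not covered here.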

Numeric Tolerances

Use absolute tolerance for monetary values:

tablediff left.csv right.csv -k id -t amount:0.01 -t balance:0.001

Use relative tolerance for percentages:

tablediff left.csv right.csv -k id -r rate:0.001 -r score:0.01

from tablediff_arrow import TableDiff

differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01, 'balance': 0.001},
    relative_tolerance={'rate': 0.001, 'score': 0.01}
)
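The difference between the two tolerance kinds can be illustrated with the standard library's `math.isclose`. This is only a sketch of the usual meaning of absolute and relative tolerance; the exact rule tablediff-arrow applies (for example, which side the relative tolerance is scaled by) is an assumption here, not something this README states.

```python
import math

# Absolute tolerance: the gap itself must be small.
abs_ok = math.isclose(100.0, 100.005, rel_tol=0.0, abs_tol=0.01)
abs_bad = math.isclose(100.0, 100.5, rel_tol=0.0, abs_tol=0.01)

# Relative tolerance: the gap must be small compared with the values.
# math.isclose scales rel_tol by the larger magnitude of the two inputs.
rel_ok = math.isclose(2000.0, 2001.0, rel_tol=0.001, abs_tol=0.0)
rel_bad = math.isclose(2000.0, 2010.0, rel_tol=0.001, abs_tol=0.0)

print(abs_ok, abs_bad, rel_ok, rel_bad)
```

Absolute tolerance suits columns with a fixed precision, such as currency amounts, while relative tolerance suits columns whose magnitude varies widely.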

Working with PyArrow Tables

import pyarrow as pa
from tablediff_arrow import TableDiff

# Create tables directly
left = pa.table({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.table({'id': [1, 2, 3], 'value': [10, 21, 30]})

# Compare
differ = TableDiff(key_columns=['id'])
result = differ.compare_tables(left, right)

print(result.summary())

S3 Files

import s3fs
from tablediff_arrow import TableDiff

# Create S3 filesystem
fs = s3fs.S3FileSystem()

# Compare S3 files
differ = TableDiff(key_columns=['id'])
result = differ.compare_files(
    's3://my-bucket/left.parquet',
    's3://my-bucket/right.parquet',
    filesystem=fs
)

CLI Options

Usage: tablediff [OPTIONS] LEFT RIGHT

  Compare two tables and generate diff reports.

Arguments:
  LEFT   Path to the left/source table file (local or s3://)
  RIGHT  Path to the right/target table file (local or s3://)

Options:
  -k, --key TEXT              Key column(s) for comparison (required, can be
                              specified multiple times)
  -t, --tolerance TEXT        Absolute tolerance for numeric columns
                              (format: column:value)
  -r, --relative-tolerance TEXT
                              Relative tolerance for numeric columns
                              (format: column:value)
  --left-format [parquet|csv|arrow]
                              Format of the left file
  --right-format [parquet|csv|arrow]
                              Format of the right file
  -o, --output TEXT           Output file path for HTML report
  --csv-output PATH           Output directory for CSV reports
  --s3                        Enable S3 filesystem support
  --help                      Show this message and exit.
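The `column:value` format accepted by `-t` and `-r` maps naturally onto the dictionaries passed to `TableDiff`. The following sketch shows one way such specs could be parsed. `parse_tolerance` and `collect_tolerances` are hypothetical helpers for illustration, not functions exported by tablediff-arrow.

```python
def parse_tolerance(spec: str) -> tuple[str, float]:
    """Split a 'column:value' spec into (column, value)."""
    # Split on the last colon so a column name may itself contain colons.
    column, sep, value = spec.rpartition(":")
    if not sep or not column:
        raise ValueError(f"expected column:value, got {spec!r}")
    return column, float(value)


def collect_tolerances(specs: list[str]) -> dict[str, float]:
    """Turn repeated specs into a column -> tolerance mapping."""
    return dict(parse_tolerance(spec) for spec in specs)


tolerances = collect_tolerances(["amount:0.01", "balance:0.001"])
print(tolerances)
```

A mapping built this way has the same shape as the `tolerance` and `relative_tolerance` arguments shown in the Python API examples.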

Output Reports

HTML Report

The HTML report provides an interactive view of differences:

Summary statistics (matched, changed, added, removed rows)
Color-coded differences table
Separate sections for left-only and right-only rows
Change counts per column

CSV Reports

CSV output generates multiple files:

{prefix}_summary.csv: Summary statistics
{prefix}_changes.csv: Detailed changes with old and new values
{prefix}_left_only.csv: Rows only in the left table
{prefix}_right_only.csv: Rows only in the right table
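When scripting around the CSV output, it can help to compute the expected file names up front. The sketch below derives them from the naming pattern listed above. `expected_csv_reports` is a hypothetical helper, and the directory and prefix values are placeholders.

```python
from pathlib import Path

# Suffixes taken from the file list above.
REPORT_SUFFIXES = ("summary", "changes", "left_only", "right_only")


def expected_csv_reports(output_dir: str, prefix: str) -> dict[str, Path]:
    """Map each report kind to the path it is expected to have."""
    return {
        suffix: Path(output_dir) / f"{prefix}_{suffix}.csv"
        for suffix in REPORT_SUFFIXES
    }


paths = expected_csv_reports("output_dir", prefix="diff")
# List any report that has not been written yet.
missing = [path.name for path in paths.values() if not path.exists()]
print(missing)
```

Whether every file is written when a category is empty is not stated here, so a missing file should not be read as an error without checking.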

Development

Setup

# Clone the repository
git clone https://github.com/psmman/tablediff-arrow.git
cd tablediff-arrow

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=tablediff_arrow --cov-report=html

# Run specific test file
pytest tests/test_compare.py

Code Quality

# Format code
black src tests

# Lint
ruff check src tests

# Type check
mypy src

Pre-commit Hooks

The project uses pre-commit hooks to ensure code quality:

trailing-whitespace: Remove trailing whitespace
end-of-file-fixer: Ensure files end with a newline
check-yaml/json/toml: Validate config files
black: Format Python code
ruff: Lint Python code
mypy: Type checking

Requirements

Python 3.10 or higher
pyarrow >= 14.0.0
pandas >= 2.0.0
click >= 8.0.0
jinja2 >= 3.0.0
s3fs >= 2023.0.0 (optional, for S3 support)

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
