Fast, file-based diffs for Parquet/CSV/Arrow (local or S3) with keyed comparisons, per-column tolerances, and HTML/CSV reports - built on Apache Arrow.
- Fast: Built on Apache Arrow for high-performance data processing
- Multiple Formats: Support for Parquet, CSV, and Arrow IPC files
- S3 Support: Read files directly from S3 (optional)
- Keyed Comparisons: Compare tables using one or more key columns
- Numeric Tolerances: Configure absolute and relative tolerances for numeric columns
- Rich Reports: Generate HTML and CSV reports with detailed differences
- Python 3.10+: Modern Python with type hints and clean APIs
- Well Tested: Comprehensive test suite with high coverage
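
To make the difference between the two tolerance kinds concrete, here is a small standard-library sketch. It assumes an absolute tolerance bounds the raw difference and a relative tolerance bounds the difference as a fraction of the larger magnitude. The README does not state the exact formula tablediff-arrow applies, and `values_match` is a hypothetical helper, not part of the package.

```python
import math


def values_match(left: float, right: float,
                 abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Hypothetical helper: True when two numbers agree within either tolerance."""
    # math.isclose accepts the pair when
    # abs(left - right) <= max(rel_tol * max(abs(left), abs(right)), abs_tol)
    return math.isclose(left, right, rel_tol=rel_tol, abs_tol=abs_tol)


print(values_match(100.00, 100.005, abs_tol=0.01))  # difference 0.005, inside 0.01
print(values_match(100.00, 100.05, abs_tol=0.01))   # difference 0.05, outside 0.01
print(values_match(200.0, 200.1, rel_tol=0.001))    # about 0.05% apart, inside 0.1%
```

An absolute tolerance suits values with a fixed scale such as currency amounts, while a relative tolerance scales with the magnitude of the values being compared.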
Install from PyPI (recommended):

pip install tablediff-arrow

For S3 support (the quotes keep shells such as zsh from expanding the brackets):

pip install "tablediff-arrow[s3]"

For development (from source):

pip install -e ".[dev]"

Compare two Parquet files using id as the key column:

tablediff left.parquet right.parquet -k id

Compare with numeric tolerance:

tablediff left.csv right.csv -k id -t amount:0.01

Generate an HTML report:

tablediff left.parquet right.parquet -k id -o report.html

Compare S3 files:

tablediff s3://bucket/left.parquet s3://bucket/right.parquet -k id --s3

from tablediff_arrow import TableDiff

# Create a differ with key columns and tolerances
differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01},           # Absolute tolerance
    relative_tolerance={'price': 0.001},  # Relative tolerance (0.1%)
)

# Compare files
result = differ.compare_files('left.parquet', 'right.parquet')

# Print summary
print(result.summary())

# Check if there are differences
if result.has_differences:
    print(f"Found {result.changed_rows} changed rows")
    print(f"Found {result.left_only_rows} rows only in left")
    print(f"Found {result.right_only_rows} rows only in right")

# Generate reports
from tablediff_arrow.reports import generate_html_report, generate_csv_report

generate_html_report(result, 'report.html')
generate_csv_report(result, 'output_dir/', prefix='diff')

Compare tables using composite keys:

tablediff left.parquet right.parquet -k year -k month -k product

differ = TableDiff(key_columns=['year', 'month', 'product'])
result = differ.compare_files('left.parquet', 'right.parquet')

Use absolute tolerance for monetary values:

tablediff left.csv right.csv -k id -t amount:0.01 -t balance:0.001

Use relative tolerance for percentages:

tablediff left.csv right.csv -k id -r rate:0.001 -r score:0.01

differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01, 'balance': 0.001},
    relative_tolerance={'rate': 0.001, 'score': 0.01},
)

import pyarrow as pa
from tablediff_arrow import TableDiff

# Create tables directly
left = pa.table({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.table({'id': [1, 2, 3], 'value': [10, 21, 30]})

# Compare
differ = TableDiff(key_columns=['id'])
result = differ.compare_tables(left, right)
print(result.summary())

import s3fs
from tablediff_arrow import TableDiff

# Create S3 filesystem
fs = s3fs.S3FileSystem()

# Compare S3 files
differ = TableDiff(key_columns=['id'])
result = differ.compare_files(
    's3://my-bucket/left.parquet',
    's3://my-bucket/right.parquet',
    filesystem=fs,
)

Usage: tablediff [OPTIONS] LEFT RIGHT

  Compare two tables and generate diff reports.

Arguments:
  LEFT   Path to the left/source table file (local or s3://)
  RIGHT  Path to the right/target table file (local or s3://)

Options:
  -k, --key TEXT                  Key column(s) for comparison (required, can be
                                  specified multiple times)
  -t, --tolerance TEXT            Absolute tolerance for numeric columns
                                  (format: column:value)
  -r, --relative-tolerance        Relative tolerance for numeric columns
                                  (format: column:value)
  --left-format [parquet|csv|arrow]
                                  Format of the left file
  --right-format [parquet|csv|arrow]
                                  Format of the right file
  -o, --output TEXT               Output file path for HTML report
  --csv-output PATH               Output directory for CSV reports
  --s3                            Enable S3 filesystem support
  --help                          Show this message and exit.

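
The `column:value` format accepted by the tolerance options maps naturally onto the dictionaries the Python API takes. A minimal sketch of that conversion follows. `parse_tolerances` is a hypothetical helper written for illustration, and the real CLI may parse and validate these values differently.

```python
def parse_tolerances(specs: list[str]) -> dict[str, float]:
    """Hypothetical parser for repeated 'column:value' option values."""
    tolerances: dict[str, float] = {}
    for spec in specs:
        # Split on the last colon so a column name containing ':' still works.
        column, sep, value = spec.rpartition(":")
        if not sep or not column:
            raise ValueError(f"expected column:value, got {spec!r}")
        tolerances[column] = float(value)  # raises ValueError for non-numeric text
    return tolerances


print(parse_tolerances(["amount:0.01", "balance:0.001"]))
# {'amount': 0.01, 'balance': 0.001}
```

The resulting dictionary has the same shape as the `tolerance` and `relative_tolerance` arguments shown in the Python examples.
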
The HTML report provides an interactive view of differences:
- Summary statistics (matched, changed, added, removed rows)
- Color-coded differences table
- Separate sections for left-only and right-only rows
- Change counts per column
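
The summary categories follow from matching rows on their key columns. The pure-Python sketch below illustrates the idea only. It assumes rows are dictionaries with unique keys on each side, `classify_rows` is a hypothetical name, and tablediff-arrow itself works on Arrow tables rather than Python dicts. Right-only and left-only rows presumably correspond to the "added" and "removed" counts in the report.

```python
def classify_rows(left, right, key_columns):
    """Hypothetical illustration: count rows per category after matching on keys."""
    def index(rows):
        # Map each row's key tuple to the row itself (assumes unique keys).
        return {tuple(row[k] for k in key_columns): row for row in rows}

    left_by_key, right_by_key = index(left), index(right)
    shared = left_by_key.keys() & right_by_key.keys()
    changed = {k for k in shared if left_by_key[k] != right_by_key[k]}
    return {
        "matched": len(shared) - len(changed),
        "changed": len(changed),
        "left_only": len(left_by_key.keys() - right_by_key.keys()),
        "right_only": len(right_by_key.keys() - left_by_key.keys()),
    }


left = [{"id": 1, "value": 10}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]
right = [{"id": 1, "value": 10}, {"id": 2, "value": 21}, {"id": 4, "value": 40}]
print(classify_rows(left, right, ["id"]))
# {'matched': 1, 'changed': 1, 'left_only': 1, 'right_only': 1}
```
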
CSV output generates multiple files:
- {prefix}_summary.csv: Summary statistics
- {prefix}_changes.csv: Detailed changes with old and new values
- {prefix}_left_only.csv: Rows only in the left table
- {prefix}_right_only.csv: Rows only in the right table

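
If a script needs to pick up these files after a run, the naming pattern above is enough to locate them. A small sketch, assuming the files are written directly into the output directory; `csv_report_paths` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def csv_report_paths(output_dir: str, prefix: str) -> dict[str, Path]:
    """Hypothetical helper: expected CSV report paths for a given prefix."""
    names = ("summary", "changes", "left_only", "right_only")
    return {name: Path(output_dir) / f"{prefix}_{name}.csv" for name in names}


# Report which of the expected files are present after a run.
for name, path in csv_report_paths("output_dir", "diff").items():
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```
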
# Clone the repository
git clone https://github.com/psmman/tablediff-arrow.git
cd tablediff-arrow

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run all tests
pytest

# Run with coverage
pytest --cov=tablediff_arrow --cov-report=html

# Run specific test file
pytest tests/test_compare.py

# Format code
black src tests

# Lint
ruff check src tests

# Type check
mypy src

The project uses pre-commit hooks to ensure code quality:
- trailing-whitespace: Remove trailing whitespace
- end-of-file-fixer: Ensure files end with a newline
- check-yaml/json/toml: Validate config files
- black: Format Python code
- ruff: Lint Python code
- mypy: Type checking
- Python 3.10 or higher
- pyarrow >= 14.0.0
- pandas >= 2.0.0
- click >= 8.0.0
- jinja2 >= 3.0.0
- s3fs >= 2023.0.0 (optional, for S3 support)
MIT License - see LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.