Build complete tablediff-arrow Python package with CLI, tests, and CI workflows #1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
This PR implements a complete, production-ready Python package
tablediff-arrowfor fast, file-based diffs of Parquet/CSV/Arrow data. The package supports local and S3 files, provides keyed comparisons with numeric tolerances, and generates HTML/CSV reports—all built on Apache Arrow for high performance.What's Included
Core Library (
src/tablediff_arrow/)Implemented a full-featured data comparison library with four main modules:
loader.py- Handles loading Parquet, CSV, and Arrow IPC files from local filesystem or S3compare.py- Performs keyed table comparisons with configurable absolute and relative numeric tolerancesreports.py- Generates styled HTML reports and structured CSV reports showing differencescli.py- Provides a command-line interface with rich options for file comparisonKey Features
Fast Performance: Built on Apache Arrow and PyArrow for efficient data processing of large datasets.
Multiple Format Support: Automatically detects and loads Parquet (
.parquet,.pq), CSV (.csv), and Arrow IPC (.arrow,.feather) files.S3 Integration: Optional S3 support via
s3fsallows comparing files directly from S3 buckets without downloading.Flexible Comparisons:
amount:0.01for ±$0.01)rate:0.001for ±0.1%)Rich Reports:
Command-Line Interface
Python API
Testing & Quality Assurance
Comprehensive Test Suite: 24 tests covering all modules with 86% code coverage
Code Quality: All code passes strict quality checks
CI/CD Infrastructure
GitHub Actions Workflow (
.github/workflows/ci.yml):Pre-commit Configuration (
.pre-commit-config.yaml):Documentation
Comprehensive documentation included:
Package Configuration
Modern Python packaging with
pyproject.toml:tablediffCLI commandVerification
All functionality has been tested and verified:
The package is production-ready and follows Python best practices for packaging, testing, documentation, and CI/CD.
Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.