A Rust-Based, Multi-Tenant, Iceberg-Compatible Lakehouse Catalog
Pangolin is a high-performance catalog designed for modern lakehouse architectures. It supports Git-style branching, multi-tenancy, and federated catalogs, and it can track any lakehouse asset type.
A pangolin is a strong metaphor for a data lakehouse catalog because its defining traits align closely with the core responsibilities of a catalog.
First, a pangolin is covered in layered scales. Each scale is distinct but part of a coherent whole. A lakehouse catalog works the same way. It organizes many independent assets—tables, views, files, models, and metadata—into a single, structured system. Each asset has its own schema, properties, and lineage, yet all are discoverable through one catalog.
Second, pangolins are defensive by design. They protect what matters by curling into a secure form. A catalog plays a similar role in governance. It enforces access controls, tracks ownership, and provides guardrails around sensitive data. Rather than blocking access outright, it enables safe and intentional use.
Third, pangolins are precise and deliberate. They move carefully and use strong claws to uncover food hidden beneath the surface. A lakehouse catalog does the same for data. It helps users uncover datasets buried across object storage, warehouses, and streams, exposing meaning through metadata, classification, and search.
Finally, pangolins are rare and specialized. They exist for a specific purpose and excel at it. A data lakehouse catalog is not a generic system. It is a purpose-built layer focused on clarity, trust, and navigation across complex data environments.
- Rust 1.92+
- Docker (optional, for MinIO)
cd pangolin
cargo run --bin pangolin_api

See Quick Start Guide for detailed setup and example curl commands.
- Multi-Tenancy: Full tenant isolation with dedicated namespaces and warehouses.
- Iceberg REST Catalog: 100% compliant with the Apache Iceberg REST spec.
- Git-like Branching: Branch, tag, and merge catalogs for safe experimentation.
- 3-Way Merging: Intelligent conflict detection with manual and automatic resolution strategies.
- Federated Catalogs: Connect to external Iceberg catalogs as a transparent proxy.
- Service Users: API key authentication for CI/CD, ETL, and automated pipelines.
- Advanced Audit Logging: Comprehensive tracking of 40+ actions across 19 resource types.
- Multi-Cloud Storage: Native support for AWS S3, Azure Blob, and Google Cloud Storage.
- Credential Vending: Securely vends AWS STS, Azure SAS, and GCP downscoped credentials.
- Multiple Backends: Metadata persistence via PostgreSQL, MongoDB, SQLite, or In-Memory.
- Management UI: Modern SvelteKit-based interface for Admins and Data Explorers.
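
To make the 3-way merging feature listed above more concrete, here is a minimal sketch of the general technique in Python. It is not Pangolin's implementation: the `Branch` model (asset name mapped to a version pointer), the function name, and the rule of keeping the target's version on conflict are all assumptions chosen for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical model: a branch maps an asset name to a version pointer
# (for example, a snapshot id or metadata location).
Branch = Dict[str, Optional[str]]


def three_way_merge(base: Branch, source: Branch, target: Branch) -> Tuple[Branch, Set[str]]:
    """Merge `source` into `target`, using `base` as the common ancestor.

    Returns the merged mapping and the set of asset names in conflict.
    """
    merged: Branch = {}
    conflicts: Set[str] = set()
    for name in sorted(set(base) | set(source) | set(target)):
        b, s, t = base.get(name), source.get(name), target.get(name)
        if s == t:        # both sides agree (same change, or both deleted)
            winner = s
        elif s == b:      # only the target changed this asset
            winner = t
        elif t == b:      # only the source changed this asset
            winner = s
        else:             # both changed it differently: conflict
            conflicts.add(name)
            winner = t    # assumption: keep the target's version until resolved
        if winner is not None:
            merged[name] = winner
    return merged, conflicts


base = {"sales": "v1", "users": "v1"}
dev = {"sales": "v2", "users": "v2", "events": "v1"}
main = {"sales": "v1", "users": "v3"}

merged, conflicts = three_way_merge(base, dev, main)
print(merged)     # {'events': 'v1', 'sales': 'v2', 'users': 'v3'}
print(conflicts)  # {'users'}
```

In this sketch, `sales` and `events` merge automatically because only one side touched them, while `users` is flagged because both branches moved it away from the common ancestor. A manual or automatic resolution strategy would then decide which version wins.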
- Installation & Setup - Get running in 5 minutes.
- Auth Modes - Understanding Auth vs No-Auth and OAuth.
- Service Users - API keys for programmatic access.
- Multi-Tenancy - Understanding isolation.
- User Scopes - Roles: Root, Tenant Admin, and Tenant User.
- Configuration - Server configuration options.
- Environment Variables - Complete reference for all environment variables (DATABASE_URL, storage backends, S3/MinIO, authentication, etc.)
- Warehouses - Managing S3, Azure, and GCS storage.
- Credential Vending - Secure direct-to-storage access.
- Catalogs - Creating Local and Federated catalogs.
- Backend Storage - Metadata persistence with Postgres, Mongo, or SQLite.
- Branching & Versioning - Git-style workflows and auto-add nuances.
- Permissions & RBAC - Asset-level access and cascading grants.
- IAM Roles - Cloud provider integration.
- Business Metadata - Tags, search, and data discovery.
- Audit Logging - Security tracking across all tools.
- Maintenance - Snapshots, orphan files, and storage optimization.
- CLI Reference - Full guide for pangolin-admin and pangolin-user.
- API Reference - Iceberg REST and Pangolin Management APIs.
- Management UI - Visual administration and data discovery.
- Python Client - Official Python library (pypangolin).
- Client Setup - Connecting PyIceberg, Spark, and Trino.
- Deployment - Production deployment, Docker, Kubernetes, HA setup.
- Scalability - Horizontal scaling, database optimization, caching.
- Security - Authentication, encryption, audit logging, compliance.
- Permissions Management - RBAC patterns, least privilege, access control.
- Branch Management - Git-like workflows, merge strategies, conflict resolution.
- Business Metadata - Metadata strategy, governance, data classification.
- Apache Iceberg - Table design, partitioning, schema evolution, performance.
- Generic Assets - Managing ML models, files, media, and artifacts.
Current Version: Alpha
Production-Ready Features:
- ✅ Iceberg REST Catalog API (100% Compliant)
- ✅ Multi-Tenancy & Tenant Isolation
- ✅ Git-like Branching & Tagging
- ✅ Advanced Audit Logging (UI/CLI/API)
- ✅ Service Users & API Keys
- ✅ PostgreSQL, MongoDB, and SQLite Backends
- ✅ Multi-Cloud Storage (S3, Azure, GCS)
- ✅ Management UI for Admins & Explorers
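
For scripted setups, the catalog-creation call shown in the curl example below can also be built with Python's standard library. This is a minimal sketch: the endpoint path, port, and JSON field names are copied from that curl example, while the helper name `build_create_catalog_request` is hypothetical and the explicit `Content-Type` header is an assumption about what the server expects for a JSON body.

```python
import json
import urllib.request


def build_create_catalog_request(base_url: str, token: str, name: str,
                                 warehouse_name: str,
                                 storage_location: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a catalog."""
    payload = {
        "name": name,
        "warehouse_name": warehouse_name,
        "storage_location": storage_location,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/catalogs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_create_catalog_request(
    "http://localhost:8080",
    "your-jwt-token",
    "production",
    "main_s3",
    "s3://my-bucket/warehouse",
)
print(req.get_method(), req.full_url)
# To send it against a running server: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the example runnable without a live server and makes the payload easy to inspect or log before use.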
curl -X POST http://localhost:8080/api/v1/catalogs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production",
    "warehouse_name": "main_s3",
    "storage_location": "s3://my-bucket/warehouse"
  }'

pangolin-user create-branch dev --from main --catalog production

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pangolin",
    **{
        "uri": "http://localhost:8080",
        "warehouse": "production",
        "token": "your-jwt-token",
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
    }
)

# Load a table on the 'dev' branch
table = catalog.load_table("analytics.sales@dev")
df = table.scan().to_pandas()

MIT License - see LICENSE file for details.
- Documentation: See docs/ directory.
- Issues: GitHub Issues.
- Discussions: GitHub Discussions.
