
Floe Runtime

The Open Platform for building Data Platforms

Ship faster. Stay compliant. Scale to Data Mesh.

License: Apache 2.0 • Python: 3.10+ • Version: 0.1.0

Quick Start • Features • Documentation • Contributing


What is floe?

floe is an open platform for building internal data platforms.

Platform teams choose their stack from 12 plugin types:

  • Compute: DuckDB, Snowflake, Databricks, Spark, BigQuery
  • Orchestrator: Dagster, Airflow 3.x
  • Catalog: Polaris, AWS Glue, Unity Catalog
  • Observability: Split into TelemetryBackend (Jaeger, Datadog) + LineageBackend (Marquez, Atlan)
  • [... 7 more plugin types, listed in full under Features]

Data teams get opinionated workflows:

  • ✅ 30 lines replace 300+ lines of boilerplate
  • ✅ Same config works everywhere (dev/staging/prod parity)
  • ✅ Standards enforced automatically (compile-time validation)
  • ✅ Full composability (swap DuckDB → Snowflake without pipeline changes)

Batteries included. Fully customizable. Production-ready.


The Problem

Platform engineers supporting 50+ data teams face:

  • Integration hell: Stitching together 15+ tools that don't talk to each other
  • Exception management: Every team has a "unicorn use case" that breaks your framework
  • RBAC sprawl: Managing 1200+ credentials across teams, environments, services
  • Security whack-a-mole: Someone always finds a way to hardcode production secrets

Data engineers shipping data products face the mirror image: boilerplate to copy for every pipeline, governance reviews that stall releases, and requirements that live as tribal knowledge.

Result: Governance blocks teams instead of enabling them.


The Solution

For platform teams:

  • Get a pre-integrated stack (DuckDB + Dagster + Polaris + dbt tested together)
  • Say "yes" to edge cases with plugin architecture (add Spark? Swap ComputePlugin. Need Kafka? Add IngestionPlugin)
  • Automatic credential vending (SecretReference pattern, manage 1 OAuth config instead of 1200 secrets; sketched below)
  • Enforce at compile-time (violations caught before deployment, not in production)
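
A hedged sketch of the SecretReference idea (key names here are assumptions, not floe's actual schema): configs carry a pointer to a secret, never the value, and credentials are vended at deploy time.

# manifest.yaml (platform tier) -- hypothetical keys
compute:
  approved:
    - name: snowflake
      connection:
        account: acme-prod
        user: etl_service
        password:
          # pointer only: the value is vended at deploy time, never stored in git
          secret_ref: vault://data-platform/snowflake/etl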

For data teams:

  • Governance = automatic (compile checks replace meetings)
  • Get capabilities instantly (platform adds plugin, you use it immediately)
  • Escape hatches built-in (plugin system extensible for your unicorn use case)
  • Requirements explicit (minimum_test_coverage: 80 in manifest.yaml, not tribal knowledge)

If it compiles, it's compliant.


How It Works

1. Platform Team Chooses Stack (Once)

Composable architecture: Mix and match from 12 plugin types

# manifest.yaml (50 lines supports 200 pipelines)
compute:
  approved:
    - name: duckdb      # Cost-effective analytics
    - name: spark       # Heavy processing
    - name: snowflake   # Enterprise warehouse
  default: duckdb       # Used when transform doesn't specify
orchestrator: dagster   # Or: airflow
catalog: polaris        # Or: glue, unity-catalog

governance:
  naming_pattern: medallion        # bronze/silver/gold layers
  minimum_test_coverage: 80        # Explicit, not ambiguous
  block_on_failure: true           # Enforced, not suggested

2. Data Teams Write Business Logic (Always)

Declarative config: Same across all 50 teams. Select compute per-step from approved list.

# floe.yaml (30 lines replace 300 lines of boilerplate)
name: customer-analytics
version: "0.1.0"

transforms:
  - type: dbt
    path: ./dbt/staging
    compute: spark      # Heavy processing on Spark

  - type: dbt
    path: ./dbt/marts
    compute: duckdb     # Analytics on DuckDB

schedule:
  cron: "0 6 * * *"

3. floe Generates Everything Else

Compilation phase (2 seconds, catches violations before deployment):

$ floe compile

[1/3] Loading platform policies
      ✓ Platform: acme-data-platform v1.2.3

[2/3] Validating pipeline
      ✓ Naming: bronze_customers (compliant)
      ✓ Test coverage: 85% (>80% required)

[3/3] Generating artifacts
      ✓ Dagster assets (Python)
      ✓ dbt profiles (YAML)
      ✓ Kubernetes manifests (YAML)
      ✓ Credentials (vended automatically)

Compilation SUCCESS - ready to deploy

What's auto-generated:

  • ✅ Database connection configs (dbt profiles.yml; see the sketch below)
  • ✅ Orchestration code (Dagster assets or Airflow DAGs)
  • ✅ Kubernetes manifests (Jobs, Services, ConfigMaps)
  • ✅ Environment-specific settings (dev/staging/prod)
  • ✅ Credential vending (SecretReference pattern, no hardcoded secrets)
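
As a sketch of the first item above, a generated dbt profiles.yml for a duckdb step might come out like this (illustrative; the real generated file may differ):

# profiles.yml (generated)
customer_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb                                # dbt-duckdb adapter
      path: /data/dev/customer_analytics.duckdb   # embedded database file
      threads: 4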

Same floe.yaml works across dev, staging, production.


Features

🔌 Composable by Design

Choose from 12 plugin types. Swap implementations without breaking pipelines.

Multi-compute pipelines: Platform teams approve N compute targets. Data engineers select per-step from the approved list. Different steps can use different engines:

# manifest.yaml (Platform Team)
compute:
  approved:
    - name: spark       # Heavy processing
    - name: duckdb      # Cost-effective analytics
    - name: snowflake   # Enterprise warehouse
  default: duckdb

# floe.yaml (Data Engineers)
transforms:
  - type: dbt
    path: models/staging/
    compute: spark      # Process 10TB raw data

  - type: dbt
    path: models/marts/
    compute: duckdb     # Build metrics on 100GB result

Environment parity preserved: Each step uses the SAME compute across dev/staging/prod. No "works in dev, fails in prod" surprises.

Real-world swap scenarios:

  • DuckDB (embedded, cost-effective) ↔ Snowflake (managed, elastic)
  • Dagster (asset-centric) ↔ Airflow 3.x (DAG-based)
  • Jaeger (self-hosted) ↔ Datadog (managed SaaS)

Plugin types: Compute, Orchestrator, Catalog, Storage, TelemetryBackend, LineageBackend, DBT, SemanticLayer, Ingestion, DataQuality, Secrets, Identity

📝 Declarative Configuration

Two-tier YAML. Platform team defines infrastructure. Data teams define logic.

No code generation anxiety: Compiled artifacts are checked into git. Diff them. Review them. Trust them.

✅ Compile-Time Validation

Catch errors before deployment. No runtime surprises.

Example:

$ floe compile
[FAIL] 'stg_payments' violates naming convention
       Expected: bronze_*, silver_*, gold_*

[FAIL] 'gold_revenue' missing required tests
       Required: [unique_pk, not_null_pk, documentation]

Compilation FAILED - fix violations before deployment

Not documentation governance. Computational governance.
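
The rules behind those failures would live in the platform manifest. A hedged extension of the earlier governance block (required_tests is an assumed field name, not confirmed by the schema):

governance:
  naming_pattern: medallion      # bronze_* / silver_* / gold_*
  required_tests:                # assumed field name
    - unique_pk
    - not_null_pk
    - documentation
  block_on_failure: true         # fail the compile, don't just warn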

🔐 Security by Default

Layer boundaries enforce separation:

  • Credentials in platform config → Data teams cannot access
  • Automatic vending with SecretReference → No hardcoded secrets possible
  • Layer architecture → Data teams cannot override platform policies
  • Type-safe schemas → Catch errors at compile-time

Result: Manage 1 OAuth config instead of 1200 credentials.

⚡ Environment Parity

Same pipeline config works everywhere:

| Environment | Platform Config | Pipeline Config |
|-------------|-----------------|-----------------|
| Dev | DuckDB (local cluster) | floe.yaml (no changes) |
| Staging | DuckDB (shared cluster) | floe.yaml (no changes) |
| Prod | DuckDB (production cluster) | floe.yaml (no changes) |

Or swap to Snowflake, Databricks, or Spark—the pipeline config stays identical.

Result: No "works on my machine" issues. No config drift. What you test is what you deploy.
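
The manifest, not the pipeline, carries the environment differences. A hypothetical layout (file names and storage keys are assumptions):

# manifests/dev.yaml
catalog: polaris
storage:
  bucket: s3://acme-dev-lake     # infrastructure details differ per environment...

# manifests/prod.yaml
catalog: polaris
storage:
  bucket: s3://acme-prod-lake    # ...while floe.yaml stays byte-identical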

🌐 Data Mesh Ready

Federated ownership with computational governance:

  • Enterprise policies → Domain constraints → Data products (three-tier hierarchy)
  • Data contracts as code (ODCS standard, auto-validated; sketched below)
  • Compile-time + runtime enforcement (not meetings)
  • Domain teams have autonomy within guardrails

Scale from single platform to federated Data Mesh without rebuilding.
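
A hedged sketch of a contract in the ODCS style (field names reflect one reading of the spec; consult the ODCS standard for the exact schema):

# datacontract.yaml -- ODCS-style, illustrative only
apiVersion: v3.0.0
kind: DataContract
id: customer-analytics.gold_revenue
status: active
domain: customer-analytics
schema:
  - name: gold_revenue
    logicalType: object
    properties:
      - name: order_id
        logicalType: string
        required: true
        unique: true             # backs the unique_pk / not_null_pk checks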


Architecture

Four-Layer Enforcement Model

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'16px'}}}%%
flowchart TB
    L4["<b>Layer 4: DATA</b><br/>Ephemeral Jobs<br/><br/>Owner: Data Engineers<br/>• Write SQL transforms<br/>• Define schedules<br/>• INHERIT platform constraints"]

    L3["<b>Layer 3: SERVICES</b><br/>Long-lived Infrastructure<br/><br/>Owner: Platform Engineers<br/>• Orchestrator, Catalog<br/>• Observability services<br/>• Always running, health probes"]

    L2["<b>Layer 2: CONFIGURATION</b><br/>Immutable Policies<br/><br/>Owner: Platform Engineers<br/>• Plugin selection<br/>• Governance rules<br/>• ENFORCED at compile-time"]

    L1["<b>Layer 1: FOUNDATION</b><br/>Framework Code<br/><br/>Owner: floe Maintainers<br/>• Schemas, validation engine<br/>• Distributed via PyPI + Helm"]

    L4 -->|Connects to| L3
    L3 -->|Configured by| L2
    L2 -->|Built on| L1

    classDef dataLayer fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff
    classDef serviceLayer fill:#F5A623,stroke:#D68910,stroke-width:3px,color:#fff
    classDef configLayer fill:#9013FE,stroke:#6B0FBF,stroke-width:3px,color:#fff
    classDef foundationLayer fill:#50E3C2,stroke:#2EB8A0,stroke-width:3px,color:#fff

    class L4 dataLayer
    class L3 serviceLayer
    class L2 configLayer
    class L1 foundationLayer

Key principle: Configuration flows downward only. Data teams cannot weaken platform policies.
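
As an illustration (hypothetical keys, not floe's actual error behavior), a pipeline config that tries to relax a platform policy fails compilation, because governance keys belong to manifest.yaml:

# floe.yaml -- this would FAIL at compile time
name: customer-analytics
version: "0.1.0"
governance:                      # governance is platform-tier (manifest.yaml);
  minimum_test_coverage: 50      # data teams cannot set or weaken it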

Two-Tier Configuration

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'16px'}}}%%
flowchart LR
    PM["<b>manifest.yaml</b><br/><br/>Platform Engineers<br/><br/>Infrastructure<br/>Credentials<br/>Governance policies"]

    FL["<b>floe.yaml</b><br/><br/>Data Engineers<br/><br/>Pipeline logic<br/>Transforms<br/>Schedules"]

    PM -->|Resolves to| FL

    classDef platformConfig fill:#F5A623,stroke:#D68910,stroke-width:3px,color:#fff
    classDef dataConfig fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff

    class PM platformConfig
    class FL dataConfig
| File | Audience | Contains |
|------|----------|----------|
| manifest.yaml | Platform Engineers | Infrastructure, credentials, governance policies |
| floe.yaml | Data Engineers | Pipeline logic, transforms, schedules |

Benefit: Data teams never see credentials or infrastructure details. Platform team controls standards centrally.


Built on the Shoulders of Giants

floe provides batteries-included OSS defaults that run on any Kubernetes cluster: DuckDB for compute, Dagster for orchestration, Polaris for cataloging, dbt for transforms, plus Jaeger for telemetry and Marquez for lineage.

Not "integration hell": Pre-configured, tested together, deployable with one command. Or swap any component for your cloud service of choice.


Documentation


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Code Standards

  • Type safety: All code must pass mypy --strict
  • Formatting: Black (100 char), enforced by ruff
  • Testing: >80% coverage, 100% requirement traceability
  • Security: No hardcoded secrets, Pydantic validation
  • Architecture: Respect layer boundaries
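
The README does not say how these standards are wired up; one common arrangement is pre-commit hooks, sketched here as an assumption:

# .pre-commit-config.yaml -- assumed setup, not part of this repo's documented tooling
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9                  # pin to a current release
    hooks:
      - id: ruff                 # lint
      - id: ruff-format          # Black-compatible formatting (100-char lines set in pyproject)
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2
    hooks:
      - id: mypy
        args: [--strict]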

Roadmap

Current (v0.1.0 - Pre-Alpha):

  • Four-layer architecture
  • Two-tier configuration
  • Kubernetes-native deployment
  • Compile-time validation

Next (v0.2.0 - Alpha):

  • Complete K8s-native testing
  • Plugin ecosystem docs
  • CLI command suite
  • External plugin support

Future (v1.0.0 - Production):

  • Data Mesh extensions
  • OCI registry integration
  • Multi-environment workflows

License

Apache License 2.0 - See LICENSE for details.


Community


Built with ❤️ by the floe community
