A unified framework to industrialize data ingestion, transformation and pipeline execution on AWS using Terraform, from infrastructure provisioning to runtime execution, designed as a reusable and standalone data platform.

AWS Data Platform Framework

I. Project Overview

This project is an AWS-based data lake platform designed to facilitate data ingestion, storage, transformation, and governance at scale. It provides:

  • A Python SDK (datalake_sdk) for interacting with the data lake, enabling data ingestion with multiple modes (overwrite, append, upsert)
  • Terraform infrastructure-as-code modules for provisioning AWS resources organized into domains and pipelines
  • Support for both native Python (Pandas) and Spark (EMR Serverless) processing environments
  • Apache Iceberg table format for advanced data lake capabilities (ACID transactions, schema evolution, time travel)
  • AWS Lake Formation integration for fine-grained access control and data governance
  • An AI agent ("Datalfred") for natural language interaction with the data lake
  • Automated orchestration using AWS Step Functions

The platform is intended for data engineers, data scientists, and developers who need to build scalable, governed data pipelines on AWS.

For detailed information about the datalake_sdk Python package, refer to the datalake_sdk README.

II. Architecture / Design

High-Level Components

The architecture is organized around three main layers:

  1. SDK Layer (datalake_sdk):

    • Python library providing abstractions for data ingestion and processing
    • CLI tool for manual data operations
    • Wrappers for Spark and native Python environments
    • AI agent (Datalfred) for conversational data lake interaction
  2. Infrastructure Layer (Terraform modules):

    • Domain Factory: Provisions core AWS infrastructure per domain (S3 buckets, Glue databases, Lake Formation, Athena workgroups, IAM roles)
    • Pipeline Factory: Creates data pipelines with orchestrated tasks (ECS/EMR tasks, Step Functions, CloudWatch logs)
  3. Execution Layer:

    • ECS Fargate tasks: Lightweight Python data processing
    • EMR Serverless: Spark-based distributed processing
    • Step Functions: Orchestration and workflow management

Data Flow

  1. Data is ingested via the datalake_sdk CLI or programmatically through Python code
  2. Tasks run in containerized environments (ECS or EMR) defined by Terraform
  3. Data is written to S3 in Iceberg format with metadata in AWS Glue Data Catalog
  4. Lake Formation manages permissions on databases and tables
  5. Athena provides SQL query access to the data
  6. Step Functions orchestrate multi-step pipelines with dependency management

Key Design Patterns

  • Domain-Driven Design: Resources are grouped by business domain
  • Infrastructure as Code: All AWS resources defined in Terraform
  • Schema-on-Read: Table schemas are inferred from data at ingestion time
  • Separation of Concerns: Data storage (S3), metadata (Glue), access control (Lake Formation), and orchestration (Step Functions) are decoupled
  • Multi-Stage Support: Terraform workspaces allow dev/uat/prod isolation

Organizational Conventions

This platform adheres to organizational technical conventions:

  • CI/CD Platform: GitLab CI is used for continuous integration and deployment (.gitlab-ci.yml). GitHub is a read-only mirror.
  • AWS Naming Convention: Resources follow the pattern {project_name}_{domain_name}_{stage_name}_resource_name
  • Stage Name Derivation (see the sketch after this list):
    • In GitLab CI: derived from the Git branch name ($CI_COMMIT_REF_SLUG)
    • Locally: derived from the active Terraform workspace
  • AWS Region: Default region is eu-west-1 (Ireland)
  • Terraform Backend: Backend configuration is provided at initialization time via runtime parameters:
    terraform init \
      -backend-config="bucket=$TERRAFORM_BACKEND_BUCKET" \
      -backend-config="dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB"
  • Cost Allocation Tags: All resources are tagged with project_name, domain_name, and stage_name for FinOps tracking
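
As a sketch of the stage-name derivation described above ($CI_COMMIT_REF_SLUG is the standard GitLab CI variable; the STAGE_NAME variable name is illustrative):

# Derive the stage name following the platform conventions
if [ -n "$CI_COMMIT_REF_SLUG" ]; then
  # GitLab CI: the stage name follows the Git branch slug
  STAGE_NAME="$CI_COMMIT_REF_SLUG"
else
  # Local execution: the stage name follows the active Terraform workspace
  STAGE_NAME="$(terraform workspace show)"
fi
echo "Deploying stage: $STAGE_NAME"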

III. Prerequisites

Required Tools

  • AWS Account with administrative access or appropriate IAM permissions
  • Terraform, with the AWS provider version >= 5.60.0 and < 6.14.0
  • Python ~3.13
  • Poetry (for local SDK development and installation)
  • Docker (for building container images and local task execution)
  • AWS CLI configured with credentials
  • Git access to the GitLab repository

AWS Services Used

  • Storage & Catalog: S3, Glue Data Catalog
  • Governance & Security: Lake Formation, IAM
  • Compute: ECS (Fargate), EMR Serverless
  • Orchestration: Step Functions, EventBridge
  • Querying: Athena
  • Monitoring: CloudWatch
  • Container Registry: ECR
  • AI/ML: Bedrock (for Datalfred agent)
  • Package Management: CodeArtifact
  • Notifications: Secrets Manager (for Slack integration)

Infrastructure Prerequisites

  • Terraform Backend: S3 bucket and DynamoDB table for state storage (must be created beforehand)
  • VPC: A VPC tagged with Name: {project_name}_network_platform_prod containing public and/or private subnets
  • NAT Gateway: Required if using private subnets (use_public_subnets=false)

IV. Installation / Setup

A. Install datalake_sdk from AWS CodeArtifact

  1. Configure AWS credentials with CodeArtifact read access:
export CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token \
  --domain $CODEARTIFACT_DOMAIN_NAME \
  --domain-owner $AWS_ACCOUNT_ID \
  --query authorizationToken \
  --output text)
  2. Configure pip to use CodeArtifact:
pip config set site.index-url https://aws:$CODEARTIFACT_AUTH_TOKEN@$CODEARTIFACT_DOMAIN_NAME-$AWS_ACCOUNT_ID.d.codeartifact.$AWS_REGION.amazonaws.com/pypi/$CODEARTIFACT_REPOSITORY_NAME/simple/

pip config set site.extra-index-url https://pypi.python.org/simple/
  3. Install the SDK:
pip install datalake-sdk
datalake_sdk --help
  4. (Optional) Install with AI agent support:
pip install datalake-sdk[agent]

B. Install datalake_sdk from Source

  1. Clone the repository:
git clone ${REPO_URL}
cd datalake/datalake_sdk
  2. Install dependencies:
poetry install
  3. Option 1 - Install globally:
poetry build
pip install dist/*.whl
datalake_sdk --help
  4. Option 2 - Run via Poetry:
poetry run datalake_sdk --help

For complete SDK documentation, see datalake_sdk/README.md.

C. Deploy Infrastructure

1. Initialize Terraform Backend

Ensure you have an S3 bucket and DynamoDB table for Terraform state management.
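
If the backend does not exist yet, it can be created once with the AWS CLI. The bucket and table names below are placeholders; the LockID key schema is what the Terraform S3 backend expects for DynamoDB state locking:

# One-time creation of the Terraform state backend (names are placeholders)
aws s3api create-bucket \
  --bucket my-terraform-state-bucket \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1

aws s3api put-bucket-versioning \
  --bucket my-terraform-state-bucket \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name my-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST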

2. Create a Domain

Create a main.tf file using the domain_factory module:

module "domain" {
  source                        = "./domain_factory"
  project_name                  = "my_project"
  domain_name                   = "my_domain"
  stage_name                    = "dev"
  git_repository                = "${REPO_URL}"
  datalake_admin_principal_arns = ["arn:aws:iam::123456789012:role/AdminRole"]
  failure_notification_receivers = ["user@example.com"]
}

3. Deploy the Domain

terraform init \
  -backend-config="bucket=$TERRAFORM_BACKEND_BUCKET" \
  -backend-config="dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB"

terraform workspace new dev
terraform apply

4. Create Pipelines

Use the pipeline_factory module to create data pipelines (see Section VI.B for configuration details).
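
A minimal sketch of a pipeline_factory call, combining the variables documented in Section VI.B; the domain output attribute, file paths, and table names are illustrative assumptions:

module "my_pipeline" {
  source        = "./pipeline_factory"
  pipeline_name = "my_pipeline"
  domain_object = module.domain.domain  # assumed output attribute; see domain_factory/outputs.tf

  orchestration_configuration_template_file_path = "./my_pipeline/orchestration_configuration.tftpl.json"

  tasks_configuration = {
    "my_task" : {
      "type"          : "python"
      "path"          : "./my_pipeline/my_task/"
      "infra_type"    : "ECS"
      "infra_config"  : { "cpu" : "512", "memory" : "1024" }
      "input_tables"  : []
      "output_tables" : {
        "my_database.my_table" : {
          "ingestion_mode" : "overwrite"
          "upsert_keys"    : []
          "partition_keys" : []
        }
      }
    }
  }

  trigger = {
    "type" : "none"
    "argument" : "none"
  }
}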

V. Usage

A. CLI - Ingest Data

Ingest a CSV file into the data lake:

datalake_sdk \
  --project-name poc \
  --domain-name my_tests \
  --stage-name prd \
  ingest \
  --database-name my_database \
  --table-name my_table \
  --input-file-path ./file.csv \
  --ingestion-mode upsert \
  --upsert-keys "column_1/column_2" \
  --partition-keys "column_3/column_4" \
  --csv-delimiter ";"

Note: CSV files must include headers.

B. Programmatic - Ingest Data with Python

from datalake_sdk.native_python_processing_wrapper import NativePythonProcessingWrapper

wrapper = NativePythonProcessingWrapper(
    project_name="poc",
    domain_name="my_tests",
    stage_name="prd",
    output_tables={
        "my_database.my_table": {
            "upsert_keys": ["column_1", "column_2"],
            "partition_keys": ["column_3"],
            "ingestion_mode": "upsert"
        }
    }
)

dataframe = wrapper.read_input_dataset("./file.csv", csv_delimiter=";")
wrapper.ingest("my_database.my_table", dataframe)

For Spark environments, replace NativePythonProcessingWrapper with SparkProcessingWrapper.
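
A sketch of the Spark variant, assuming SparkProcessingWrapper exposes the same constructor and methods as the native wrapper (see datalake_sdk/README.md for the authoritative API):

from datalake_sdk.spark_processing_wrapper import SparkProcessingWrapper  # assumed module path

wrapper = SparkProcessingWrapper(
    project_name="poc",
    domain_name="my_tests",
    stage_name="prd",
    output_tables={
        "my_database.my_table": {
            "upsert_keys": ["column_1", "column_2"],
            "partition_keys": ["column_3"],
            "ingestion_mode": "upsert"
        }
    }
)

# Here the dataframe is a Spark DataFrame rather than a Pandas one.
dataframe = wrapper.read_input_dataset("./file.csv", csv_delimiter=";")
wrapper.ingest("my_database.my_table", dataframe)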

C. Delete a Table

datalake_sdk \
  --project-name poc \
  --domain-name my_tests \
  --stage-name prd \
  delete_table \
  --database-name my_database \
  --table-name my_table

D. Query Data with Athena

Use the AWS Athena console or CLI to query Iceberg tables:

SELECT * FROM dev_my_database.my_table WHERE column_3 = 'value';
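
The same query can be submitted from the AWS CLI; the workgroup name below assumes the {project_name}_{domain_name}_{stage_name} pattern from Section VI.A:

aws athena start-query-execution \
  --work-group poc_my_tests_dev \
  --query-string "SELECT * FROM dev_my_database.my_table WHERE column_3 = 'value'"

aws athena get-query-results --query-execution-id <execution-id-returned-above>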

E. AI Agent - Datalfred

Interact with the data lake using natural language (requires datalake-sdk[agent]):

datalake_sdk \
  --project-name poc \
  --domain-name my_tests \
  --stage-name prd \
  datalfred \
  --model-size large

Datalfred can:

  • Query data using natural language
  • Investigate pipeline failures
  • Analyze code and configurations

For more information, see datalake_sdk/README.md - Datalfred Agent.

F. Ingestion Modes

  • overwrite: Replaces all existing table data
  • append: Adds new rows without modifying existing data (may create duplicates)
  • upsert: Updates existing rows or inserts new ones based on upsert keys

For detailed explanations and examples, see datalake_sdk/README.md - Ingestion Modes.
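
As a conceptual illustration of upsert semantics using plain Pandas (not the SDK): rows whose upsert keys match an existing row are replaced, the rest are appended.

import pandas as pd

existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
incoming = pd.DataFrame({"id": [2, 3], "value": ["B", "c"]})

# Upsert on "id": matching rows take the incoming version, new ids are appended
upserted = (
    pd.concat([existing, incoming])
    .drop_duplicates(subset=["id"], keep="last")
    .sort_values("id")
    .reset_index(drop=True)
)
print(upserted)  # id=1 keeps "a", id=2 becomes "B", id=3 adds "c"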

G. Local Task Execution

The platform allows you to execute task code in a local Dockerized environment that is identical to the AWS task execution environment. This is particularly useful for developing new tasks or debugging existing ones.

You can run either:

  • ECS tasks (native Python with Pandas)
  • EMR Serverless tasks (PySpark)

The Docker image can be:

  • A sandbox image (intermediate base image)
  • A task-specific image (containing the final Python/PySpark code)

Prerequisites

  • Docker must be running locally
  • The Docker image must be available:
    • If built locally, it's already available
    • If from ECR, you must authenticate and pull the image

1. Authenticate to ECR

Assuming AWS credentials are configured:

aws ecr get-login-password --region ${ECR_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com

2. Run an ECS Task (Native Python)

This launches a Jupyter Notebook environment for native Python tasks:

docker run \
  -e AWS_PROFILE=${AWS_CREDENTIALS_PROFILE} \
  --mount type=bind,source=$HOME/.aws/,target=/root/.aws/ \
  -p 8888:8888 \
  ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \
  jupyter notebook --ip="0.0.0.0" --no-browser --allow-root

The command will output the Jupyter Notebook URL. Copy and paste it into your browser.

3. Run an EMR Serverless Task (PySpark)

This launches a Jupyter Notebook with PySpark configured:

export CREDENTIALS=$(aws configure export-credentials)
mkdir -p logs  # To access generated Spark logs

docker run -d \
  -e AWS_ACCESS_KEY_ID=$(echo $CREDENTIALS | jq -r '.AccessKeyId') \
  -e AWS_SECRET_ACCESS_KEY=$(echo $CREDENTIALS | jq -r '.SecretAccessKey') \
  -e AWS_SESSION_TOKEN=$(echo $CREDENTIALS | jq -r '.SessionToken // ""') \
  -e AWS_REGION=${AWS_REGION} \
  -e AWS_DEFAULT_REGION=${AWS_REGION} \
  --mount type=bind,source=$(pwd)/logs,target=/var/log/spark/user/ \
  -p 8888:8888 \
  -e PYSPARK_DRIVER_PYTHON=jupyter \
  -e PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="0.0.0.0" --no-browser' \
  ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \
  pyspark --master local \
  --conf spark.hadoop.fs.s3a.endpoint=s3.${AWS_REGION}.amazonaws.com \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

cat logs/stderr

The Jupyter Notebook URL will be printed in the logs/stderr file. Copy and paste it into your browser.

Notes

  • AWS Credentials: The ECS example mounts ~/.aws/ to use your local AWS profile. The EMR example exports credentials as environment variables.
  • Port Mapping: Both examples expose port 8888 for Jupyter Notebook access.
  • Spark Configuration: The EMR example configures Spark to use S3 and AWS Glue Data Catalog.
  • Logs Directory: For EMR tasks, Spark logs are written to the local logs/ directory for debugging.

VI. Infrastructure

A. Domain Factory

The domain_factory Terraform module provisions foundational infrastructure for a data domain.

Key Resources

  • S3 Buckets:

    • {project_name}-{domain_name}-{stage_name}-data: Stores Iceberg table data with versioning and intelligent tiering
    • {project_name}-{domain_name}-{stage_name}-technical: Stores logs, temporary files, and Athena query results
  • Glue Database: Domain-scoped catalog for tables ({stage_prefix}{domain_name})

  • Lake Formation:

    • Registers S3 data location
    • Manages database and table permissions
    • Supports cross-account data sharing
  • Athena Workgroup: Query execution environment ({project_name}_{domain_name}_{stage_name})

  • IAM Roles: Task execution roles with least-privilege permissions

  • Security Groups: Network isolation for processing tasks

  • CodeArtifact Repository: Private Python package hosting for the SDK

  • ECS/EMR Sandbox: Pre-built base images for task execution

  • Lambda (Failsafe Shutdown): Monitors and terminates long-running tasks

  • Bedrock Inference Profile: AI model access for Datalfred (model sizes: small, medium, large)

  • EMR Studio: Interactive development environment for Spark jobs

Key Variables

Variable | Type | Description | Default
project_name | string | Project identifier | Required
domain_name | string | Domain name | Required
stage_name | string | Environment (dev, uat, prod, etc.) | Required
git_repository | string | GitLab repository URL | Required
datalake_admin_principal_arns | list(string) | IAM principals with full data access | []
use_public_subnets | bool | Use public vs. private subnets | true
database_description | string | Description of the domain database | ""
skip_emr_serverless_sandbox_creation | bool | Skip EMR sandbox image creation | true
failure_notification_receivers | list(string) | Email addresses for failure alerts | Required

Outputs

The module exports a domain object containing all necessary information for pipeline creation (see domain_factory/outputs.tf).

B. Pipeline Factory

The pipeline_factory Terraform module provisions data pipelines with orchestrated tasks.

Key Resources

  • Step Functions State Machine: Workflow orchestration with task dependencies

  • ECS or EMR Tasks: Containerized data processing

    • ECS: Fargate tasks for lightweight Python jobs
    • EMR: Serverless Spark for large-scale processing
  • Glue Database (optional): Pipeline-scoped catalog ({stage_prefix}{pipeline_name})

  • CloudWatch Logs: Task execution logs with 30-day retention

  • EventBridge Scheduler: Schedule-based or event-driven triggers

  • IAM Roles: Task-specific permissions (data access, Lake Formation, S3)

  • ECR Repositories: Docker image storage per task

  • Failure Notifications: CloudWatch Events trigger notifications on task failures

Key Variables

Variable | Type | Description | Default
pipeline_name | string | Pipeline identifier | Required
tasks_configuration | map(object) | Task definitions (see below) | Required
trigger | object | Pipeline trigger configuration | {"type": "none", "argument": "none"}
orchestration_configuration_template_file_path | string | Step Functions template path | Required
domain_object | object | Output from domain_factory | Required
failure_notification_receivers | list(string) | Email addresses for failure alerts | []
skip_pipeline_database_creation | bool | Skip pipeline database creation | false

Task Configuration Structure

tasks_configuration = {
  "task_name" : {
    "type" : "python" | "sql"
    "path" : "./relative/path/to/task/code"
    "infra_type" : "ECS" | "EMRServerless"
    "infra_config" : {
      "cpu" : "512"        # ECS only: CPU units
      "memory" : "1024"    # ECS only: Memory in MB
    }
    "input_tables" : ["db.table1", "db.table2"]
    "output_tables" : {
      "db.output_table" : {
        "ingestion_mode" : "overwrite" | "append" | "upsert"
        "upsert_keys" : ["id"]
        "partition_keys" : ["date"]
      }
    }
    "additional_parameters" : {
      "param_key" : "static_value"
      "dynamic_param.$" : "$.trigger_param"  # Reference trigger input
    }
    "additional_rebuild_trigger" : {}  # Force image rebuild
    "additional_permissions" : "<IAM policy JSON>"  # Extra IAM permissions
  }
}

Trigger Configuration

Schedule-based (cron):

trigger = {
  "type" : "schedule"
  "argument" : "cron(15 1 * * ? *)"
  "parameters" : jsonencode({
    "key" : "value"
  })
}

Manual execution only:

trigger = {
  "type" : "none"
  "argument" : "none"
}

C. Terraform Modules

The pipeline_factory/modules directory contains three submodules:

1. ecs_factory

Provisions ECS Fargate tasks:

  • Task definition with environment variables
  • IAM roles for task execution and data access
  • ECR repository and Docker image build
  • CloudWatch log groups

2. emr_factory

Provisions EMR Serverless applications:

  • EMR application with Spark runtime
  • IAM roles for job execution and data access
  • ECR repository and Docker image build (Spark-compatible)
  • S3 paths for Spark logs

3. build_and_upload_image_to_ecr

Automates Docker image management:

  • Copies task code and dependencies
  • Builds Docker image using sandbox base image
  • Pushes image to ECR
  • Supports rebuild triggers for code changes

D. Deployment Workflow

  1. Domain Deployment: Terraform provisions domain infrastructure (S3, Glue, Lake Formation, IAM, etc.)

  2. Pipeline Deployment: Terraform provisions pipeline infrastructure

    • Creates Step Functions state machine
    • Builds Docker images for each task
    • Pushes images to ECR
    • Creates ECS task definitions or EMR applications
  3. Task Execution:

    • EventBridge scheduler or a manual trigger starts the Step Functions execution (see the CLI example after this list)
    • Step Functions orchestrates task execution based on orchestration template
    • ECS/EMR tasks run with environment variables set by Terraform
    • Tasks use datalake_sdk to read/write data
  4. Data Ingestion:

    • Tasks transform data using Pandas or Spark
    • SDK ingests data to S3 in Iceberg format
    • Glue Catalog metadata is updated
    • Lake Formation permissions are enforced
  5. Monitoring & Notifications:

    • CloudWatch logs capture task execution
    • Failsafe Lambda monitors task duration
    • CloudWatch Events trigger email notifications on failures
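
A pipeline without a schedule can be started manually with the AWS CLI; the state machine ARN below is a placeholder that follows the platform naming convention, and the input keys are whatever the orchestration template expects:

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-west-1:123456789012:stateMachine:my_project_my_domain_dev_my_pipeline \
  --input '{"trigger_param": "value"}'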

VII. Configuration

A. Environment Variables

These variables are set automatically by the infrastructure; custom values can be added through a task's additional_parameters configuration:

Variable | Description | Set By
PROJECT_NAME | Project identifier | Terraform
DOMAIN_NAME | Domain name | Terraform
STAGE_NAME | Environment name | Terraform
PIPELINE_NAME | Pipeline name | Terraform
TASK_NAME | Task name | Terraform
INPUT_TABLES | JSON-encoded list of input tables | Terraform
OUTPUT_TABLES | JSON-encoded dict of output table configs | Terraform
IS_SQL_JOB | Whether task executes SQL (true/false) | Terraform
TASK_ADDITIONAL_PARAMETERS_* | Custom parameters from Terraform | Terraform
step_function_task_token | Step Functions callback token | Step Functions
step_function_execution_arn | Step Functions execution ARN | Step Functions
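
A minimal sketch of how task code might read these variables at runtime; the custom-parameter variable name is illustrative:

import json
import os

# Core identifiers injected by Terraform
project_name = os.environ["PROJECT_NAME"]
stage_name = os.environ["STAGE_NAME"]

# JSON-encoded table configuration
input_tables = json.loads(os.environ["INPUT_TABLES"])    # e.g. ["db.input_table"]
output_tables = json.loads(os.environ["OUTPUT_TABLES"])  # e.g. {"db.output_table": {...}}

# A custom parameter declared under additional_parameters (name is illustrative)
my_param = os.environ.get("TASK_ADDITIONAL_PARAMETERS_MY_PARAM")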

B. Task Configuration

Example task configuration in Terraform:

tasks_configuration = {
  "my_task" : {
    "type" : "python",
    "path" : "./my_task/",
    "infra_type" : "ECS",
    "infra_config" : {
      "cpu" : "512",
      "memory" : "1024"
    },
    "input_tables" : ["db.input_table"],
    "output_tables" : {
      "db.output_table" : {
        "ingestion_mode" : "upsert",
        "upsert_keys" : ["id"],
        "partition_keys" : ["date"]
      }
    },
    "additional_parameters" : {
      "my_param.$" : "$.trigger_param",  # Dynamic from trigger
      "static_param" : "value"
    },
    "additional_permissions" : data.aws_iam_policy_document.my_policy.json
  }
}

C. Table Metadata

Place YAML files in code/tables_configuration/ to document tables:

# code/tables_configuration/my_database.my_table.yaml
description: "Customer dimension table"
schema:
  customer_id:
    description: "Unique customer identifier"
  customer_name:
    description: "Full name of the customer"

D. Triggers

Schedule: Cron-based execution

trigger = {
  "type" : "schedule"
  "argument" : "cron(15 1 * * ? *)"
  "parameters" : jsonencode({"key": "value"})
}

None: Manual execution only

trigger = {
  "type" : "none"
  "argument" : "none"
}

VIII. Project Structure

datalake/
├── datalake_sdk/              # Python SDK and CLI
├── domain_factory/            # Terraform module for domain infrastructure
├── pipeline_factory/          # Terraform module for pipeline infrastructure
│   └── modules/
│       ├── ecs_factory/       # ECS task provisioning
│       ├── emr_factory/       # EMR Serverless provisioning
│       └── build_and_upload_image_to_ecr/  # Docker build and push
├── test/                      # Integration tests and examples
├── doc_resources/             # Documentation resources
├── .gitlab-ci.yml             # GitLab CI pipeline configuration
├── .github/workflows/         # GitHub Actions (semantic-release)
├── LICENSE                    # Creative Commons Attribution-NonCommercial 4.0
└── README.md                  # This file

A. datalake_sdk

Purpose: Provides a unified interface for data lake operations.

The datalake_sdk is a comprehensive Python package for interacting with the data lake. It includes:

  • CLI: Command-line interface for ingestion, table deletion, and AI agent interaction
  • Processing Wrappers: Abstract base class and implementations for Pandas and Spark
  • Datalfred Agent: AI-powered assistant for natural language data lake interaction

For complete documentation, see datalake_sdk/README.md.

Key Files:

  • main.py: CLI entry point with subcommands
  • base_processing_wrapper.py: Abstract base class
  • native_python_processing_wrapper.py: Pandas implementation
  • spark_processing_wrapper.py: Spark implementation
  • ingestion.py: CLI ingestion command
  • delete_table.py: CLI delete command
  • datalfred_agent/: AI agent modules

Dependencies (from pyproject.toml):

  • Core: boto3, click, awswrangler, pyyaml, tqdm, slack-sdk
  • Optional: strands-agents, strands-agents-tools, strands-agents-builder (for Datalfred)

Version: 5.7.11 (automatically detected by domain_factory)

B. domain_factory

Purpose: Terraform module to provision AWS resources for a data domain.

Key Files:

  • s3_data.tf, s3_technical.tf: S3 bucket definitions
  • glue_database.tf: Glue Data Catalog database
  • lakeformation.tf: Lake Formation registration and permissions
  • athena_workgroup.tf: Athena workgroup configuration
  • ecs_cluster_sandbox.tf: ECS base image and cluster
  • emr_serverless_application_sandbox.tf: EMR Serverless base image
  • codeartifact_repository.tf: Private package repository
  • lambda_failsafe_shutdown.tf: Task timeout enforcement
  • bedrock_inference_profile.tf: AI model access
  • code_datalake_sdk.tf: Packages and publishes SDK to CodeArtifact
  • variables.tf: Input variables
  • outputs.tf: Exported domain configuration
  • locals.tf: Local variables (environment naming, SDK version extraction)

Outputs: Exports domain configuration consumed by pipeline_factory.

C. pipeline_factory

Purpose: Terraform module to create data pipelines with orchestrated tasks.

Key Files:

  • step_function.tf: AWS Step Functions state machine
  • ecs_tasks.tf: ECS task module invocations
  • emr_tasks.tf: EMR Serverless application module invocations
  • event_bridge_scheduler.tf: Pipeline trigger configuration
  • cloudwatch_event_task_failed.tf: Failure notification setup
  • cloudwatch_event_failsafe_shutdown.tf: Failsafe Lambda trigger
  • glue_database.tf: Pipeline-scoped database (optional)
  • variables.tf: Input variables
  • outputs.tf: Pipeline outputs
  • locals.tf: Local variables (environment naming)

Modules:

  • ecs_factory/: Provisions ECS Fargate tasks
  • emr_factory/: Provisions EMR Serverless applications
  • build_and_upload_image_to_ecr/: Builds and uploads Docker images

D. test

Purpose: Integration tests and example pipeline implementation.

Key Files:

  • domain.tf: Test domain deployment
  • pipeline.tf: Test pipeline with multiple task types
  • variables.tf: Test-specific variable definitions
  • integration_tests_pipeline/: Test tasks
    • test_write/: Python task for data generation
    • test_native_sql_entrypoint/: Native SQL task
    • test_spark_sql_entrypoint/: Spark SQL task
    • check_and_clean/: Validation and cleanup task
    • orchestration_configuration.tftpl.json: Step Functions orchestration
  • utils/: Test utilities
    • run_integration_tests.py: Test execution script
    • pipeline_utils/: Test library for dependency validation

Variable Handling:

The test configuration uses a different variable format for convenience:

Variable | Type in domain_factory | Type in test | Transformation
datalake_admin_principal_arns | list(string) | string (comma-separated role names) | Split by comma, look up ARNs via data.aws_iam_role, pass as list
failure_notification_receivers | list(string) | string (comma-separated emails) | Split by comma in module call

Example test variable usage:

# test/domain.tf
data "aws_iam_role" "datalake_admins" {
  for_each = toset(split(",", var.datalake_admin_principal_arns))
  name = each.value
}

module "domain" {
  # ...
  datalake_admin_principal_arns = values(data.aws_iam_role.datalake_admins)[*].arn
  failure_notification_receivers = split(",", var.failure_notification_receivers)
}

CI/CD: Integration tests run automatically in GitLab CI (run_integration_tests stage).

IX. Limitations / Assumptions

  1. AWS-Only: This platform is tightly coupled to AWS services and cannot be deployed on other cloud providers without significant refactoring.

  2. Python 3.13: The SDK and processing tasks require Python ~3.13. Older Python versions are not supported.

  3. Iceberg Format: All tables are stored in Apache Iceberg format. Direct Parquet or other formats are not supported for managed tables.

  4. Region: Infrastructure is deployed in a single AWS region (default: eu-west-1). Cross-region replication is not implemented.

  5. Terraform State Backend: Assumes an existing S3 bucket and DynamoDB table for Terraform state management. These must be created manually before deployment.

  6. Naming Conventions: Resource names follow the pattern {project_name}_{domain_name}_{stage_name}. Non-prod stages prefix database names (e.g., dev_my_database). Production (stage_name = "prod") databases have no prefix.

  7. Lake Formation Permissions: The platform assumes AWS Lake Formation is the primary access control mechanism. IAM-only setups are not fully supported.

  8. CSV Ingestion: CSV files must include headers for schema inference.

  9. Upsert Key Uniqueness: Upsert keys must guarantee row uniqueness in the ingested dataset. Violations will cause ingestion failure.

  10. Concurrency: Iceberg commit conflicts (e.g., simultaneous writes) are mitigated with retries (up to 30 retries with 2-10 minute waits), but high-concurrency scenarios may require tuning.

  11. Failsafe Shutdown: The failsafe Lambda function monitors task durations but does not enforce hard limits on EMR Serverless jobs.

  12. Datalfred Agent: The AI agent requires AWS Bedrock inference profiles to be pre-configured in the domain. Model sizes are fixed (small, medium, large).

  13. GitLab Primary: GitLab is the source of truth for CI/CD. GitHub is a read-only mirror. GitHub Actions are only used for semantic-release on the prod branch.

  14. Subnet Configuration: Tasks run in public subnets by default (use_public_subnets=true). Private subnets require a NAT Gateway for internet access (not provisioned by this platform).

  15. Integration Tests: The test/ folder contains integration tests that create and delete tables. These tests assume administrative permissions and should not be run in production environments.

  16. ECS Task Limits: ECS tasks are constrained by Fargate CPU/memory limits (max 4 vCPU, 30 GB RAM). Larger workloads require EMR Serverless.

  17. SQL Tasks: SQL entry point tasks (type: "sql") are limited to single output tables and use a main.sql file. Multi-table SQL tasks are not supported.

  18. Workspace Isolation: Terraform workspaces are used for environment isolation. The stage name is derived from:

    • GitLab CI: Git branch name ($CI_COMMIT_REF_SLUG)
    • Local execution: Active Terraform workspace (use terraform workspace select <stage>)
  19. Athena Costs: Query costs are not monitored or capped by the platform. Users should implement AWS Budgets or Cost Anomaly Detection separately.

  20. VPC Dependency: The domain factory expects a VPC tagged with Name: {project_name}_network_platform_prod containing appropriately tagged subnets (Tier: Public or Tier: Private).

  21. EMR Sandbox Creation: By default, skip_emr_serverless_sandbox_creation=true to reduce deployment time. Set to false if large-scale Spark processing is required.

  22. CodeArtifact Publishing: The domain factory automatically builds and publishes the datalake_sdk to CodeArtifact during deployment. The version is extracted from datalake_sdk/pyproject.toml.

  23. Semantic Versioning: Releases are managed via semantic-release on GitHub (.releaserc.json). Conventional commit messages are required for automated versioning.

  24. Local AWS Credentials: Terraform executed locally uses the default AWS credentials configured on the machine. Verify the active AWS account before applying changes.

  25. Local Task Execution: Docker must be running and the task image must be available locally (either built locally or pulled from ECR after authentication). AWS credentials are required for accessing S3 and Glue.
