- I. Project Overview
- II. Architecture / Design
- III. Prerequisites
- IV. Installation / Setup
- V. Usage
- VI. Infrastructure
- VII. Configuration
- VIII. Project Structure
- IX. Limitations / Assumptions
This project is an AWS-based data lake platform designed to facilitate data ingestion, storage, transformation, and governance at scale. It provides:
- A Python SDK (`datalake_sdk`) for interacting with the data lake, enabling data ingestion with multiple modes (overwrite, append, upsert)
- Terraform infrastructure-as-code modules for provisioning AWS resources organized into domains and pipelines
- Support for both native Python (Pandas) and Spark (EMR Serverless) processing environments
- Apache Iceberg table format for advanced data lake capabilities (ACID transactions, schema evolution, time travel)
- AWS Lake Formation integration for fine-grained access control and data governance
- An AI agent ("Datalfred") for natural language interaction with the data lake
- Automated orchestration using AWS Step Functions
The platform is intended for data engineers, data scientists, and developers who need to build scalable, governed data pipelines on AWS.
For detailed information about the datalake_sdk Python package, refer to the datalake_sdk README.
The architecture is organized around three main layers:
- SDK Layer (`datalake_sdk`):
  - Python library providing abstractions for data ingestion and processing
  - CLI tool for manual data operations
  - Wrappers for Spark and native Python environments
  - AI agent (Datalfred) for conversational data lake interaction
- Infrastructure Layer (Terraform modules):
  - Domain Factory: Provisions core AWS infrastructure per domain (S3 buckets, Glue databases, Lake Formation, Athena workgroups, IAM roles)
  - Pipeline Factory: Creates data pipelines with orchestrated tasks (ECS/EMR tasks, Step Functions, CloudWatch logs)
- Execution Layer:
  - ECS Fargate tasks: Lightweight Python data processing
  - EMR Serverless: Spark-based distributed processing
  - Step Functions: Orchestration and workflow management
- Data is ingested via the `datalake_sdk` CLI or programmatically through Python code
- Tasks run in containerized environments (ECS or EMR) defined by Terraform
- Data is written to S3 in Iceberg format with metadata in AWS Glue Data Catalog
- Lake Formation manages permissions on databases and tables
- Athena provides SQL query access to the data
- Step Functions orchestrate multi-step pipelines with dependency management
- Domain-Driven Design: Resources are grouped by business domain
- Infrastructure as Code: All AWS resources defined in Terraform
- Schema-on-Read: Table schemas are inferred from data at ingestion time
- Separation of Concerns: Data storage (S3), metadata (Glue), access control (Lake Formation), and orchestration (Step Functions) are decoupled
- Multi-Stage Support: Terraform workspaces allow dev/uat/prod isolation
This platform adheres to organizational technical conventions:
- CI/CD Platform: GitLab CI is used for continuous integration and deployment (`.gitlab-ci.yml`). GitHub is a read-only mirror.
- AWS Naming Convention: Resources follow the pattern `{project_name}_{domain_name}_{stage_name}_resource_name` (illustrated in the sketch after this list)
- Stage Name Derivation:
  - In GitLab CI: derived from the Git branch name (`$CI_COMMIT_REF_SLUG`)
  - Locally: derived from the active Terraform workspace
- AWS Region: Default region is `eu-west-1` (Ireland)
- Terraform Backend: Backend configuration is provided at initialization time via runtime parameters:

  ```bash
  terraform init \
    -backend-config="bucket=$TERRAFORM_BACKEND_BUCKET" \
    -backend-config="dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB"
  ```

- Cost Allocation Tags: All resources are tagged with `project_name`, `domain_name`, and `stage_name` for FinOps tracking
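To make the naming and stage-derivation rules concrete, here is a minimal Python sketch. The helper functions are illustrative, not part of the platform; it assumes either `CI_COMMIT_REF_SLUG` is set (GitLab CI) or the `terraform` CLI is available locally.

```python
import os
import subprocess

def derive_stage_name() -> str:
    """Prefer the GitLab CI branch slug; fall back to the active Terraform workspace locally."""
    ci_slug = os.environ.get("CI_COMMIT_REF_SLUG")
    if ci_slug:
        return ci_slug
    result = subprocess.run(
        ["terraform", "workspace", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def resource_name(project_name: str, domain_name: str, resource: str) -> str:
    """Compose {project_name}_{domain_name}_{stage_name}_resource_name."""
    return f"{project_name}_{domain_name}_{derive_stage_name()}_{resource}"

# e.g. on a "dev" workspace: poc_my_tests_dev_athena_workgroup
print(resource_name("poc", "my_tests", "athena_workgroup"))
```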
- AWS Account with administrative access or appropriate IAM permissions
- Terraform with the AWS provider >= 5.60.0, < 6.14.0
- Python ~3.13
- Poetry (for local SDK development and installation)
- Docker (for building container images and local task execution)
- AWS CLI configured with credentials
- Git access to the GitLab repository
- Storage & Catalog: S3, Glue Data Catalog
- Governance & Security: Lake Formation, IAM
- Compute: ECS (Fargate), EMR Serverless
- Orchestration: Step Functions, EventBridge
- Querying: Athena
- Monitoring: CloudWatch
- Container Registry: ECR
- AI/ML: Bedrock (for Datalfred agent)
- Package Management: CodeArtifact
- Notifications: Secrets Manager (for Slack integration)
- Terraform Backend: S3 bucket and DynamoDB table for state storage (must be created beforehand)
- VPC: A VPC tagged with `Name: {project_name}_network_platform_prod` containing public and/or private subnets
- NAT Gateway: Required if using private subnets (`use_public_subnets = false`)
- Configure AWS credentials with CodeArtifact read access:

  ```bash
  export CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token \
    --domain $CODEARTIFACT_DOMAIN_NAME \
    --domain-owner $AWS_ACCOUNT_ID \
    --query authorizationToken \
    --output text)
  ```

- Configure pip to use CodeArtifact:

  ```bash
  pip config set site.index-url https://aws:$CODEARTIFACT_AUTH_TOKEN@$CODEARTIFACT_DOMAIN_NAME-$AWS_ACCOUNT_ID.d.codeartifact.$AWS_REGION.amazonaws.com/pypi/$CODEARTIFACT_REPOSITORY_NAME/simple/
  pip config set site.extra-index-url https://pypi.python.org/simple/
  ```

- Install the SDK:

  ```bash
  pip install datalake-sdk
  datalake_sdk --help
  ```

- (Optional) Install with AI agent support:

  ```bash
  pip install datalake-sdk[agent]
  ```

- Clone the repository:

  ```bash
  git clone ${REPO_URL}
  cd datalake/datalake_sdk
  ```

- Install dependencies:

  ```bash
  poetry install
  ```

- Option 1 - Install globally:

  ```bash
  poetry build
  pip install dist/*.whl
  datalake_sdk --help
  ```

- Option 2 - Run via Poetry:

  ```bash
  poetry run datalake_sdk --help
  ```

For complete SDK documentation, see datalake_sdk/README.md.
Ensure you have an S3 bucket and DynamoDB table for Terraform state management.
Create a main.tf file using the domain_factory module:
module "domain" {
source = "./domain_factory"
project_name = "my_project"
domain_name = "my_domain"
stage_name = "dev"
git_repository = "${REPO_URL}"
datalake_admin_principal_arns = ["arn:aws:iam::123456789012:role/AdminRole"]
failure_notification_receivers = ["user@example.com"]
}terraform init \
-backend-config="bucket=$TERRAFORM_BACKEND_BUCKET" \
-backend-config="dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB"
terraform workspace new dev
terraform applyUse the pipeline_factory module to create data pipelines (see Section VI.B for configuration details).
Ingest a CSV file into the data lake:
```bash
datalake_sdk \
  --project-name poc \
  --domain-name my_tests \
  --stage-name prd \
  ingest \
  --database-name my_database \
  --table-name my_table \
  --input-file-path ./file.csv \
  --ingestion-mode upsert \
  --upsert-keys "column_1/column_2" \
  --partition-keys "column_3/column_4" \
  --csv-delimiter ";"
```

Note: CSV files must include headers.
```python
from datalake_sdk.native_python_processing_wrapper import NativePythonProcessingWrapper

wrapper = NativePythonProcessingWrapper(
    project_name="poc",
    domain_name="my_tests",
    stage_name="prd",
    output_tables={
        "my_database.my_table": {
            "upsert_keys": ["column_1", "column_2"],
            "partition_keys": ["column_3"],
            "ingestion_mode": "upsert"
        }
    }
)

dataframe = wrapper.read_input_dataset("./file.csv", csv_delimiter=";")
wrapper.ingest("my_database.my_table", dataframe)
```

For Spark environments, replace NativePythonProcessingWrapper with SparkProcessingWrapper.
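A minimal Spark sketch follows, assuming `SparkProcessingWrapper` mirrors the native wrapper's constructor and methods and is importable from the module listed under Project Structure (`spark_processing_wrapper.py`); check datalake_sdk/README.md for the authoritative API.

```python
from datalake_sdk.spark_processing_wrapper import SparkProcessingWrapper

# Assumed to accept the same arguments as NativePythonProcessingWrapper above.
wrapper = SparkProcessingWrapper(
    project_name="poc",
    domain_name="my_tests",
    stage_name="prd",
    output_tables={
        "my_database.my_table": {
            "upsert_keys": ["column_1", "column_2"],
            "partition_keys": ["column_3"],
            "ingestion_mode": "upsert"
        }
    }
)

# Presumably returns a Spark DataFrame instead of a Pandas one.
dataframe = wrapper.read_input_dataset("./file.csv", csv_delimiter=";")
wrapper.ingest("my_database.my_table", dataframe)
```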
```bash
datalake_sdk \
  --project-name poc \
  --domain-name my_tests \
  --stage-name prd \
  delete_table \
  --database-name my_database \
  --table-name my_table
```

Use the AWS Athena console or CLI to query Iceberg tables:

```sql
SELECT * FROM dev_my_database.my_table WHERE column_3 = 'value';
```
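The same query can also be run programmatically. A hedged boto3 sketch, assuming the workgroup name follows the `{project_name}_{domain_name}_{stage_name}` convention described above; adapt the names to your deployment.

```python
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Workgroup name assumed from the naming convention; result location is managed by the workgroup.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM dev_my_database.my_table WHERE column_3 = 'value' LIMIT 10",
    WorkGroup="poc_my_tests_dev",
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```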
Interact with the data lake using natural language (requires `datalake-sdk[agent]`):

```bash
datalake_sdk \
  --project-name poc \
  --domain-name my_tests \
  --stage-name prd \
  datalfred \
  --model-size large
```

Datalfred can:
- Query data using natural language
- Investigate pipeline failures
- Analyze code and configurations
For more information, see datalake_sdk/README.md - Datalfred Agent.
- overwrite: Replaces all existing table data
- append: Adds new rows without modifying existing data (may create duplicates)
- upsert: Updates existing rows or inserts new ones based on upsert keys
For detailed explanations and examples, see datalake_sdk/README.md - Ingestion Modes.
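To make the row-level semantics concrete, here is a small Pandas illustration of the three modes. This is illustrative only; the SDK applies these modes to Iceberg tables, not to in-memory Pandas frames.

```python
import pandas as pd

existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
incoming = pd.DataFrame({"id": [2, 3], "value": ["B", "c"]})

# overwrite: the table content is replaced entirely by the incoming data
overwrite_result = incoming

# append: all incoming rows are added; id=2 now appears twice
append_result = pd.concat([existing, incoming], ignore_index=True)

# upsert on key "id": incoming rows replace matching keys, new keys are inserted
upsert_result = (
    pd.concat([existing, incoming])
    .drop_duplicates(subset=["id"], keep="last")
    .sort_values("id", ignore_index=True)
)

print(upsert_result)  # id=1 -> "a", id=2 -> "B", id=3 -> "c"
```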
The platform allows you to execute task code in a local Dockerized environment that is identical to the AWS task execution environment. This is particularly useful for developing new tasks or debugging existing ones.
You can run either:
- ECS tasks (native Python with Pandas)
- EMR Serverless tasks (PySpark)
The Docker image can be:
- A sandbox image (intermediate base image)
- A task-specific image (containing the final Python/PySpark code)
- Docker must be running locally
- The Docker image must be available:
- If built locally, it's already available
- If from ECR, you must authenticate and pull the image
Assuming AWS credentials are configured:
```bash
aws ecr get-login-password --region ${ECR_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com
```

This launches a Jupyter Notebook environment for native Python tasks:
```bash
docker run \
  -e AWS_PROFILE=${AWS_CREDENTIALS_PROFILE} \
  --mount type=bind,source=$HOME/.aws/,target=/root/.aws/ \
  -p 8888:8888 \
  ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \
  jupyter notebook --ip="0.0.0.0" --no-browser --allow-root
```

The command will output the Jupyter Notebook URL. Copy and paste it into your browser.
This launches a Jupyter Notebook with PySpark configured:
```bash
export CREDENTIALS=$(aws configure export-credentials)
mkdir -p logs  # To access generated Spark logs

docker run -d \
  -e AWS_ACCESS_KEY_ID=$(echo $CREDENTIALS | jq -r '.AccessKeyId') \
  -e AWS_SECRET_ACCESS_KEY=$(echo $CREDENTIALS | jq -r '.SecretAccessKey') \
  -e AWS_SESSION_TOKEN=$(echo $CREDENTIALS | jq -r '.SessionToken // ""') \
  -e AWS_REGION=${AWS_REGION} \
  -e AWS_DEFAULT_REGION=${AWS_REGION} \
  --mount type=bind,source=$(pwd)/logs,target=/var/log/spark/user/ \
  -p 8888:8888 \
  -e PYSPARK_DRIVER_PYTHON=jupyter \
  -e PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="0.0.0.0" --no-browser' \
  ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \
  pyspark --master local \
    --conf spark.hadoop.fs.s3a.endpoint=s3.${AWS_REGION}.amazonaws.com \
    --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

cat logs/stderr
```

The Jupyter Notebook URL will be printed in the logs/stderr file. Copy and paste it into your browser.
- AWS Credentials: The ECS example mounts `~/.aws/` to use your local AWS profile. The EMR example exports credentials as environment variables.
- Port Mapping: Both examples expose port 8888 for Jupyter Notebook access.
- Spark Configuration: The EMR example configures Spark to use S3 and the AWS Glue Data Catalog.
- Logs Directory: For EMR tasks, Spark logs are written to the local `logs/` directory for debugging.
The domain_factory Terraform module provisions foundational infrastructure for a data domain.
- S3 Buckets:
  - `{project_name}-{domain_name}-{stage_name}-data`: Stores Iceberg table data with versioning and intelligent tiering
  - `{project_name}-{domain_name}-{stage_name}-technical`: Stores logs, temporary files, and Athena query results
- Glue Database: Domain-scoped catalog for tables (`{stage_prefix}{domain_name}`)
- Lake Formation:
  - Registers S3 data location
  - Manages database and table permissions
  - Supports cross-account data sharing
- Athena Workgroup: Query execution environment (`{project_name}_{domain_name}_{stage_name}`)
- IAM Roles: Task execution roles with least-privilege permissions
- Security Groups: Network isolation for processing tasks
- CodeArtifact Repository: Private Python package hosting for the SDK
- ECS/EMR Sandbox: Pre-built base images for task execution
- Lambda (Failsafe Shutdown): Monitors and terminates long-running tasks
- Bedrock Inference Profile: AI model access for Datalfred (model sizes: `small`, `medium`, `large`)
- EMR Studio: Interactive development environment for Spark jobs
| Variable | Type | Description | Default |
|---|---|---|---|
| `project_name` | string | Project identifier | Required |
| `domain_name` | string | Domain name | Required |
| `stage_name` | string | Environment (dev, uat, prod, etc.) | Required |
| `git_repository` | string | GitLab repository URL | Required |
| `datalake_admin_principal_arns` | list(string) | IAM principals with full data access | `[]` |
| `use_public_subnets` | bool | Use public vs. private subnets | `true` |
| `database_description` | string | Description of the domain database | `""` |
| `skip_emr_serverless_sandbox_creation` | bool | Skip EMR sandbox image creation | `true` |
| `failure_notification_receivers` | list(string) | Email addresses for failure alerts | Required |
The module exports a domain object containing all necessary information for pipeline creation (see domain_factory/outputs.tf).
The pipeline_factory Terraform module provisions data pipelines with orchestrated tasks.
- Step Functions State Machine: Workflow orchestration with task dependencies
- ECS or EMR Tasks: Containerized data processing
  - ECS: Fargate tasks for lightweight Python jobs
  - EMR: Serverless Spark for large-scale processing
- Glue Database (optional): Pipeline-scoped catalog (`{stage_prefix}{pipeline_name}`)
- CloudWatch Logs: Task execution logs with 30-day retention
- EventBridge Scheduler: Schedule-based or event-driven triggers
- IAM Roles: Task-specific permissions (data access, Lake Formation, S3)
- ECR Repositories: Docker image storage per task
- Failure Notifications: CloudWatch Events trigger notifications on task failures
| Variable | Type | Description | Default |
|---|---|---|---|
| `pipeline_name` | string | Pipeline identifier | Required |
| `tasks_configuration` | map(object) | Task definitions (see below) | Required |
| `trigger` | object | Pipeline trigger configuration | `{"type": "none", "argument": "none"}` |
| `orchestration_configuration_template_file_path` | string | Step Functions template path | Required |
| `domain_object` | object | Output from domain_factory | Required |
| `failure_notification_receivers` | list(string) | Email addresses for failure alerts | `[]` |
| `skip_pipeline_database_creation` | bool | Skip pipeline database creation | `false` |
```hcl
tasks_configuration = {
  "task_name" : {
    "type"       : "python" | "sql"
    "path"       : "./relative/path/to/task/code"
    "infra_type" : "ECS" | "EMRServerless"
    "infra_config" : {
      "cpu"    : "512"   # ECS only: CPU units
      "memory" : "1024"  # ECS only: Memory in MB
    }
    "input_tables" : ["db.table1", "db.table2"]
    "output_tables" : {
      "db.output_table" : {
        "ingestion_mode" : "overwrite" | "append" | "upsert"
        "upsert_keys"    : ["id"]
        "partition_keys" : ["date"]
      }
    }
    "additional_parameters" : {
      "param_key"       : "static_value"
      "dynamic_param.$" : "$.trigger_param"  # Reference trigger input
    }
    "additional_rebuild_trigger" : {}                   # Force image rebuild
    "additional_permissions"     : "<IAM policy JSON>"  # Extra IAM permissions
  }
}
```

Schedule-based (cron):

```hcl
trigger = {
  "type"     : "schedule"
  "argument" : "cron(15 1 * * ? *)"
  "parameters" : jsonencode({
    "key" : "value"
  })
}
```

Manual execution only:

```hcl
trigger = {
  "type"     : "none"
  "argument" : "none"
}
```

The pipeline_factory/modules directory contains three submodules:
Provisions ECS Fargate tasks:
- Task definition with environment variables
- IAM roles for task execution and data access
- ECR repository and Docker image build
- CloudWatch log groups
Provisions EMR Serverless applications:
- EMR application with Spark runtime
- IAM roles for job execution and data access
- ECR repository and Docker image build (Spark-compatible)
- S3 paths for Spark logs
Automates Docker image management:
- Copies task code and dependencies
- Builds Docker image using sandbox base image
- Pushes image to ECR
- Supports rebuild triggers for code changes
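A rebuild trigger ultimately comes down to detecting that a task's code changed. The sketch below shows the general idea by fingerprinting a task directory; this is an illustration only, and the module's actual Terraform mechanism and file selection are not reproduced here.

```python
import hashlib
from pathlib import Path

def directory_fingerprint(task_path: str) -> str:
    """Hash all files under a task directory; a changed fingerprint means the image must be rebuilt."""
    digest = hashlib.sha256()
    for file in sorted(Path(task_path).rglob("*")):
        if file.is_file():
            digest.update(str(file.relative_to(task_path)).encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()

# Compare the current fingerprint with the one recorded at the previous build.
print(directory_fingerprint("./my_task/"))
```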
- Domain Deployment: Terraform provisions domain infrastructure (S3, Glue, Lake Formation, IAM, etc.)
- Pipeline Deployment: Terraform provisions pipeline infrastructure
  - Creates Step Functions state machine
  - Builds Docker images for each task
  - Pushes images to ECR
  - Creates ECS task definitions or EMR applications
- Task Execution:
  - EventBridge scheduler or manual trigger starts Step Functions execution (a manual-start sketch follows this list)
  - Step Functions orchestrates task execution based on the orchestration template
  - ECS/EMR tasks run with environment variables set by Terraform
  - Tasks use `datalake_sdk` to read/write data
- Data Ingestion:
  - Tasks transform data using Pandas or Spark
  - SDK ingests data to S3 in Iceberg format
  - Glue Catalog metadata is updated
  - Lake Formation permissions are enforced
- Monitoring & Notifications:
  - CloudWatch logs capture task execution
  - Failsafe Lambda monitors task duration
  - CloudWatch Events trigger email notifications on failures
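For pipelines whose trigger type is `none`, an execution can be started by hand. A hedged boto3 sketch follows; the state machine ARN and input keys are placeholders, and the input shape must match the parameters referenced by your tasks' `additional_parameters`.

```python
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="eu-west-1")

# Placeholder ARN; the real name follows the platform's naming convention.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:poc_my_tests_dev_my_pipeline",
    input=json.dumps({"trigger_param": "2024-01-01"}),  # consumed via "dynamic_param.$" : "$.trigger_param"
)
print(response["executionArn"])
```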
Set automatically by the infrastructure; users can access them via `task_additional_parameters`:
| Variable | Description | Set By |
|---|---|---|
| `PROJECT_NAME` | Project identifier | Terraform |
| `DOMAIN_NAME` | Domain name | Terraform |
| `STAGE_NAME` | Environment name | Terraform |
| `PIPELINE_NAME` | Pipeline name | Terraform |
| `TASK_NAME` | Task name | Terraform |
| `INPUT_TABLES` | JSON-encoded list of input tables | Terraform |
| `OUTPUT_TABLES` | JSON-encoded dict of output table configs | Terraform |
| `IS_SQL_JOB` | Whether the task executes SQL (`true`/`false`) | Terraform |
| `TASK_ADDITIONAL_PARAMETERS_*` | Custom parameters from Terraform | Terraform |
| `step_function_task_token` | Step Functions callback token | Step Functions |
| `step_function_execution_arn` | Step Functions execution ARN | Step Functions |
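Inside a running task, these variables can be read straight from the environment. A minimal sketch, assuming `INPUT_TABLES`/`OUTPUT_TABLES` deserialize to the structures described in the table (the SDK wrappers may already expose this for you):

```python
import json
import os

project_name = os.environ["PROJECT_NAME"]
stage_name = os.environ["STAGE_NAME"]

# JSON-encoded by Terraform (assumed shapes: list of names, dict of table configurations).
input_tables = json.loads(os.environ.get("INPUT_TABLES", "[]"))
output_tables = json.loads(os.environ.get("OUTPUT_TABLES", "{}"))

# Custom parameters arrive as individual TASK_ADDITIONAL_PARAMETERS_* variables.
additional_parameters = {
    key.removeprefix("TASK_ADDITIONAL_PARAMETERS_"): value
    for key, value in os.environ.items()
    if key.startswith("TASK_ADDITIONAL_PARAMETERS_")
}

print(project_name, stage_name, input_tables, list(output_tables), additional_parameters)
```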
Example task configuration in Terraform:
```hcl
tasks_configuration = {
  "my_task" : {
    "type"       : "python",
    "path"       : "./my_task/",
    "infra_type" : "ECS",
    "infra_config" : {
      "cpu"    : "512",
      "memory" : "1024"
    },
    "input_tables" : ["db.input_table"],
    "output_tables" : {
      "db.output_table" : {
        "ingestion_mode" : "upsert",
        "upsert_keys"    : ["id"],
        "partition_keys" : ["date"]
      }
    },
    "additional_parameters" : {
      "my_param.$"   : "$.trigger_param",  # Dynamic from trigger
      "static_param" : "value"
    },
    "additional_permissions" : data.aws_iam_policy_document.my_policy.json
  }
}
```

Place YAML files in `code/tables_configuration/` to document tables:
```yaml
# code/tables_configuration/my_database.my_table.yaml
description: "Customer dimension table"
schema:
  customer_id:
    description: "Unique customer identifier"
  customer_name:
    description: "Full name of the customer"
```
Schedule: Cron-based execution

```hcl
trigger = {
  "type"     : "schedule"
  "argument" : "cron(15 1 * * ? *)"
  "parameters" : jsonencode({"key": "value"})
}
```

None: Manual execution only

```hcl
trigger = {
  "type"     : "none"
  "argument" : "none"
}
```

```
datalake/
├── datalake_sdk/ # Python SDK and CLI
├── domain_factory/ # Terraform module for domain infrastructure
├── pipeline_factory/ # Terraform module for pipeline infrastructure
│ └── modules/
│ ├── ecs_factory/ # ECS task provisioning
│ ├── emr_factory/ # EMR Serverless provisioning
│ └── build_and_upload_image_to_ecr/ # Docker build and push
├── test/ # Integration tests and examples
├── doc_resources/ # Documentation resources
├── .gitlab-ci.yml # GitLab CI pipeline configuration
├── .github/workflows/ # GitHub Actions (semantic-release)
├── LICENSE # Creative Commons Attribution-NonCommercial 4.0
└── README.md # This file
```
Purpose: Provides a unified interface for data lake operations.
The datalake_sdk is a comprehensive Python package for interacting with the data lake. It includes:
- CLI: Command-line interface for ingestion, table deletion, and AI agent interaction
- Processing Wrappers: Abstract base class and implementations for Pandas and Spark
- Datalfred Agent: AI-powered assistant for natural language data lake interaction
For complete documentation, see datalake_sdk/README.md.
Key Files:
- `main.py`: CLI entry point with subcommands
- `base_processing_wrapper.py`: Abstract base class
- `native_python_processing_wrapper.py`: Pandas implementation
- `spark_processing_wrapper.py`: Spark implementation
- `ingestion.py`: CLI ingestion command
- `delete_table.py`: CLI delete command
- `datalfred_agent/`: AI agent modules
Dependencies (from pyproject.toml):
- Core: `boto3`, `click`, `awswrangler`, `pyyaml`, `tqdm`, `slack-sdk`
- Optional: `strands-agents`, `strands-agents-tools`, `strands-agents-builder` (for Datalfred)
Version: 5.7.11 (automatically detected by domain_factory)
Purpose: Terraform module to provision AWS resources for a data domain.
Key Files:
- `s3_data.tf`, `s3_technical.tf`: S3 bucket definitions
- `glue_database.tf`: Glue Data Catalog database
- `lakeformation.tf`: Lake Formation registration and permissions
- `athena_workgroup.tf`: Athena workgroup configuration
- `ecs_cluster_sandbox.tf`: ECS base image and cluster
- `emr_serverless_application_sandbox.tf`: EMR Serverless base image
- `codeartifact_repository.tf`: Private package repository
- `lambda_failsafe_shutdown.tf`: Task timeout enforcement
- `bedrock_inference_profile.tf`: AI model access
- `code_datalake_sdk.tf`: Packages and publishes the SDK to CodeArtifact
- `variables.tf`: Input variables
- `outputs.tf`: Exported domain configuration
- `locals.tf`: Local variables (environment naming, SDK version extraction)
Outputs: Exports domain configuration consumed by pipeline_factory.
Purpose: Terraform module to create data pipelines with orchestrated tasks.
Key Files:
- `step_function.tf`: AWS Step Functions state machine
- `ecs_tasks.tf`: ECS task module invocations
- `emr_tasks.tf`: EMR Serverless application module invocations
- `event_bridge_scheduler.tf`: Pipeline trigger configuration
- `cloudwatch_event_task_failed.tf`: Failure notification setup
- `cloudwatch_event_failsafe_shutdown.tf`: Failsafe Lambda trigger
- `glue_database.tf`: Pipeline-scoped database (optional)
- `variables.tf`: Input variables
- `outputs.tf`: Pipeline outputs
- `locals.tf`: Local variables (environment naming)
Modules:
- `ecs_factory/`: Provisions ECS Fargate tasks
- `emr_factory/`: Provisions EMR Serverless applications
- `build_and_upload_image_to_ecr/`: Builds and uploads Docker images
Purpose: Integration tests and example pipeline implementation.
Key Files:
- `domain.tf`: Test domain deployment
- `pipeline.tf`: Test pipeline with multiple task types
- `variables.tf`: Test-specific variable definitions
- `integration_tests_pipeline/`: Test tasks
  - `test_write/`: Python task for data generation
  - `test_native_sql_entrypoint/`: Native SQL task
  - `test_spark_sql_entrypoint/`: Spark SQL task
  - `check_and_clean/`: Validation and cleanup task
  - `orchestration_configuration.tftpl.json`: Step Functions orchestration
- `utils/`: Test utilities
  - `run_integration_tests.py`: Test execution script
  - `pipeline_utils/`: Test library for dependency validation
Variable Handling:
The test configuration uses a different variable format for convenience:
| Variable | Type in domain_factory | Type in test | Transformation |
|---|---|---|---|
| `datalake_admin_principal_arns` | list(string) | string (comma-separated role names) | Split by comma, look up ARNs via `data.aws_iam_role`, pass as list |
| `failure_notification_receivers` | list(string) | string (comma-separated emails) | Split by comma in module call |
Example test variable usage:
```hcl
# test/domain.tf
data "aws_iam_role" "datalake_admins" {
  for_each = toset(split(",", var.datalake_admin_principal_arns))
  name     = each.value
}

module "domain" {
  # ...
  datalake_admin_principal_arns  = values(data.aws_iam_role.datalake_admins)[*].arn
  failure_notification_receivers = split(",", var.failure_notification_receivers)
}
```

CI/CD: Integration tests run automatically in GitLab CI (run_integration_tests stage).
- AWS-Only: This platform is tightly coupled to AWS services and cannot be deployed on other cloud providers without significant refactoring.
- Python 3.13: The SDK and processing tasks require Python ~3.13. Older Python versions are not supported.
- Iceberg Format: All tables are stored in Apache Iceberg format. Direct Parquet or other formats are not supported for managed tables.
- Region: Infrastructure is deployed in a single AWS region (default: `eu-west-1`). Cross-region replication is not implemented.
- Terraform State Backend: Assumes an existing S3 bucket and DynamoDB table for Terraform state management. These must be created manually before deployment.
- Naming Conventions: Resource names follow the pattern `{project_name}_{domain_name}_{stage_name}`. Non-prod stages prefix database names (e.g., `dev_my_database`). Production (`stage_name = "prod"`) databases have no prefix.
- Lake Formation Permissions: The platform assumes AWS Lake Formation is the primary access control mechanism. IAM-only setups are not fully supported.
- CSV Ingestion: CSV files must include headers for schema inference.
- Upsert Key Uniqueness: Upsert keys must guarantee row uniqueness in the ingested dataset. Violations will cause ingestion failure.
- Concurrency: Iceberg commit conflicts (e.g., simultaneous writes) are mitigated with retries (up to 30 retries with 2-10 minute waits), but high-concurrency scenarios may require tuning (an illustrative retry sketch follows this list).
- Failsafe Shutdown: The failsafe Lambda function monitors task durations but does not enforce hard limits on EMR Serverless jobs.
- Datalfred Agent: The AI agent requires AWS Bedrock inference profiles to be pre-configured in the domain. Model sizes are fixed (`small`, `medium`, `large`).
- GitLab Primary: GitLab is the source of truth for CI/CD. GitHub is a read-only mirror. GitHub Actions are only used for semantic-release on the `prod` branch.
- Subnet Configuration: Tasks run in public subnets by default (`use_public_subnets = true`). Private subnets require a NAT Gateway for internet access (not provisioned by this platform).
- Integration Tests: The `test/` folder contains integration tests that create and delete tables. These tests assume administrative permissions and should not be run in production environments.
- ECS Task Limits: ECS tasks are constrained by Fargate CPU/memory limits (max 4 vCPU, 30 GB RAM). Larger workloads require EMR Serverless.
- SQL Tasks: SQL entry point tasks (`type: "sql"`) are limited to single output tables and use a `main.sql` file. Multi-table SQL tasks are not supported.
- Workspace Isolation: Terraform workspaces are used for environment isolation. The stage name is derived from:
  - GitLab CI: Git branch name (`$CI_COMMIT_REF_SLUG`)
  - Local execution: Active Terraform workspace (use `terraform workspace select <stage>`)
- Athena Costs: Query costs are not monitored or capped by the platform. Users should implement AWS Budgets or Cost Anomaly Detection separately.
- VPC Dependency: The domain factory expects a VPC tagged with `Name: {project_name}_network_platform_prod` containing appropriately tagged subnets (`Tier: Public` or `Tier: Private`).
- EMR Sandbox Creation: By default, `skip_emr_serverless_sandbox_creation = true` to reduce deployment time. Set it to `false` if large-scale Spark processing is required.
- CodeArtifact Publishing: The domain factory automatically builds and publishes the `datalake_sdk` to CodeArtifact during deployment. The version is extracted from `datalake_sdk/pyproject.toml`.
- Semantic Versioning: Releases are managed via semantic-release on GitHub (`.releaserc.json`). Conventional commit messages are required for automated versioning.
- Local AWS Credentials: Terraform executed locally uses the default AWS credentials configured on the machine. Verify the active AWS account before applying changes.
- Local Task Execution: Docker must be running and the task image must be available locally (either built locally or pulled from ECR after authentication). AWS credentials are required for accessing S3 and Glue.
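As referenced in the Concurrency item above, here is an illustrative sketch of the retry-on-commit-conflict pattern. The exception type, wait strategy, and jitter below are assumptions, not the SDK's actual implementation.

```python
import random
import time

MAX_RETRIES = 30
MIN_WAIT_SECONDS = 2 * 60   # 2 minutes
MAX_WAIT_SECONDS = 10 * 60  # 10 minutes

class CommitConflictError(Exception):
    """Stand-in for an Iceberg commit-conflict exception."""

def commit_with_retries(commit):
    """Retry a write callable when concurrent writers cause commit conflicts."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return commit()
        except CommitConflictError:
            if attempt == MAX_RETRIES:
                raise
            # Wait between 2 and 10 minutes before retrying, as described above.
            time.sleep(random.uniform(MIN_WAIT_SECONDS, MAX_WAIT_SECONDS))
```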