- I. Project Overview
- II. Architecture / Design
- III. Prerequisites
- IV. Installation / Setup
- V. Usage
- VI. Infrastructure
- VII. Configuration
- VIII. Project Structure
- IX. Limitations / Assumptions
Datalfred is an AI-powered chatbot designed to assist users in interacting with a data lake platform on AWS. The chatbot provides capabilities for querying data, investigating AWS infrastructure issues, and managing ingestion workflows.
The system is deployed as an AWS Lambda function accessible via Slack, using AWS Bedrock for LLM inference and integrating with various AWS services including Athena, Glue Data Catalog, Step Functions, CloudWatch, and ECS/EMR for data operations.
The system follows a multi-agent architecture built on the strands framework:
- Main Agent (`main_agent.py`)
  - Orchestrates the overall conversation flow
  - Routes user requests to specialized sub-agents
  - Manages conversation history using a sliding window approach (see the sketch after this list)
  - Tracks token usage and calculates costs
- Sub-Agents
  - Data Analyst Agent (`data_analyst.py`): Queries data from the Glue Data Catalog and executes SQL queries via Athena
  - Run Guy Agent (`run_guy.py`): Investigates AWS infrastructure issues, monitors ingestion jobs (Step Functions, ECS, EMR), and can redrive failed executions
- Slack Integration (`slack.py`)
  - Validates Slack webhook signatures to prevent unauthorized access
  - Sends and receives messages from Slack channels
  - Manages threaded conversations
- Lambda Entrypoint (`lambda_entrypoint.py`)
  - Receives Slack events via a Lambda function URL
  - Implements a timeout failsafe mechanism to prevent Lambda execution timeouts
  - Authorizes users based on Slack user IDs
- Failsafe Lambda (`chatbot_failsafe/main.py`)
  - Automatically disables the main chatbot Lambda if too many authentication failures are detected (>100 in 1 hour)
  - Triggered by a CloudWatch alarm monitoring signature validation failures
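The sliding-window behaviour of the main agent can be pictured with a minimal sketch like the one below. It is illustrative only: the real logic lives in `main_agent.py` and the strands framework, and only the window size of 20 messages comes from this document.

```python
from collections import deque

MAX_HISTORY = 20  # documented sliding-window size


class ConversationHistory:
    """Keeps only the most recent messages; older ones fall out of the window."""

    def __init__(self, max_messages: int = MAX_HISTORY):
        self._messages = deque(maxlen=max_messages)  # deque drops the oldest item automatically

    def add(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def as_list(self) -> list[dict]:
        return list(self._messages)


history = ConversationHistory()
history.add("user", "What tables are in the analytics database?")
history.add("assistant", "The analytics database contains ...")
```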
The infrastructure is defined using Terraform and includes:
- AWS Bedrock Inference Profiles: Three profiles (large, medium, small) using different Claude and Amazon Nova models
- Lambda Function: Containerized Python application deployed via ECR
- S3 Bucket: Stores Athena query results and conversation session data
- Athena Workgroup: Configured for SQL query execution
- CloudWatch: Logs and alarms for monitoring and failsafe triggering
- IAM Roles & Policies: Fine-grained permissions for Lambda execution
The chatbot uses the strands-agents framework for agent orchestration and AWS services for data access and infrastructure management.
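As an orientation only, the "agents as tools" pattern used here can be sketched with the strands-agents `Agent` and `@tool` APIs. The prompts, routing, and behaviour below are illustrative assumptions, not the project's actual definitions in `main_agent.py` and `sub_agents/`.

```python
from strands import Agent, tool


@tool
def data_analyst(question: str) -> str:
    """Answer data questions by querying the Glue Data Catalog / Athena (illustrative)."""
    analyst = Agent(system_prompt="You are a data analyst for the data lake.")
    return str(analyst(question))


# The main agent treats each sub-agent as a tool and routes requests to it.
main_agent = Agent(
    system_prompt="You are Datalfred. Route data questions to the data_analyst tool.",
    tools=[data_analyst],
)

main_agent("Which tables exist in the analytics database?")
```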
- Python: 3.13
- Poetry: For dependency management (version 2.1.4 in Dockerfile)
- AWS Account: With permissions to deploy Lambda, Bedrock, S3, Athena, IAM, CloudWatch, and Glue resources
- Terraform: For infrastructure deployment (backend configured for S3 with DynamoDB state locking)
- Docker: For building the Lambda container image
- AWS CLI: Configured with appropriate credentials
- Slack Workspace: With administrator access to create and configure a Slack app
- AWS Bedrock Access: Models must be enabled in your AWS account (Claude Sonnet 4.5, Claude Haiku 3, Amazon Nova Pro)
1. Clone the repository

      git clone <repository-url>
      cd chatbot
2. Install Python dependencies

      cd code
      poetry install --with agent
3. Configure AWS credentials

   Ensure your AWS CLI is configured with credentials for the target AWS account:

      aws configure
4. Set up required AWS Secrets

   Create a Secrets Manager secret with the following structure (a scripted alternative is sketched after this list):

      {
        "token": "xoxb-your-slack-bot-token",
        "signing_secret": "your-slack-signing-secret",
        "slack_channel_id": "C01234567"
      }

   The secret name should follow the pattern `{project_name}_slack_alerting_prod`.
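If you prefer to script this step, the same secret can be created with boto3. This is a sketch: the secret name assumes `project_name = "poc"`, and the region matches the default documented later in this README.

```python
import json

import boto3

secretsmanager = boto3.client("secretsmanager", region_name="eu-west-1")

# Secret name follows the {project_name}_slack_alerting_prod pattern.
secretsmanager.create_secret(
    Name="poc_slack_alerting_prod",
    SecretString=json.dumps(
        {
            "token": "xoxb-your-slack-bot-token",
            "signing_secret": "your-slack-signing-secret",
            "slack_channel_id": "C01234567",
        }
    ),
)
```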
1. Navigate to the infrastructure directory

      cd iac
2. Create a `terraform.tfvars` file

      project_name                   = "poc"
      git_repository                 = "your-git-repo-url"
      failure_notification_receivers = "email1@example.com,email2@example.com"
      authorized_slack_users         = "U01234567,U89ABCDEF"
      role_to_assume_arn             = "arn:aws:iam::123456789012:role/deployment-role"
3. Initialize Terraform with the backend configuration

      terraform init \
        -backend-config="bucket=$TERRAFORM_BACKEND_BUCKET" \
        -backend-config="dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB"
4. Select or create a workspace (this determines the stage/environment)

      terraform workspace new prod
      # or
      terraform workspace select prod
5. Deploy the infrastructure

      terraform plan
      terraform apply

   This will:

   - Build and push the Docker image to ECR
   - Create the Lambda function with the container image
   - Set up the Bedrock inference profiles
   - Configure S3, Athena, CloudWatch, and IAM resources
   - Deploy the failsafe Lambda and CloudWatch alarm
6. Retrieve the Lambda Function URL

   After deployment, get the Lambda function URL from the Terraform outputs:

      terraform output

   You will need this URL in the next step to configure your Slack app (an API-based alternative is sketched below).
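If the Terraform outputs are not at hand, the function URL can also be read through the AWS API. This is a sketch: the function name shown is a hypothetical one following the project's naming convention, so check the Terraform outputs or the AWS console for the real name.

```python
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-1")

# FunctionName is an assumption; substitute the name created by lambda_chatbot.tf.
response = lambda_client.get_function_url_config(FunctionName="poc_chatbot_prod_chatbot")
print(response["FunctionUrl"])
```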
The chatbot requires a Slack app to be created and configured in your Slack workspace. Follow these steps:
1. Create a new Slack app at api.slack.com/apps
2. Use the following app manifest, replacing the placeholder values:

   - `$APPLICATION_NAME`: Choose a display name for your bot (e.g., "Datalfred")
   - `$AWS_LAMBDA_FUNCTION_URL`: Use the Lambda function URL from the Terraform output above

      {
        "display_information": { "name": "$APPLICATION_NAME" },
        "features": {
          "app_home": { "home_tab_enabled": false, "messages_tab_enabled": true, "messages_tab_read_only_enabled": false },
          "bot_user": { "display_name": "$APPLICATION_NAME", "always_online": false }
        },
        "oauth_config": {
          "scopes": {
            "bot": ["app_mentions:read", "chat:write", "chat:write.customize", "commands", "im:read", "im:write", "incoming-webhook", "reactions:read", "reactions:write", "im:history", "mpim:history"]
          }
        },
        "settings": {
          "event_subscriptions": {
            "request_url": "$AWS_LAMBDA_FUNCTION_URL",
            "user_events": ["message.app_home"],
            "bot_events": ["app_mention", "message.im"]
          },
          "org_deploy_enabled": false,
          "socket_mode_enabled": false,
          "token_rotation_enabled": false
        }
      }
3. Install the app to your Slack workspace

   After creating the app with the manifest, click "Install to Workspace" and authorize the requested permissions.
4. Copy the credentials to AWS Secrets Manager

   From the Slack app settings, retrieve:

   - Bot User OAuth Token (from "OAuth & Permissions"; starts with `xoxb-`)
   - Signing Secret (from "Basic Information" → "App Credentials")

   Update your AWS Secrets Manager secret (created in step 4 of the local setup steps above) with these values.
5. Verify the configuration

   Send a direct message to your bot in Slack or mention it in a channel. If configured correctly, the bot should respond (or tell you that you are not in the authorized users list). The signature check the bot performs on every request is sketched below.
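For reference, Slack request-signature validation follows Slack's standard v0 signing scheme. The sketch below is a generic implementation of that scheme, not the exact code in `slack.py`.

```python
import hashlib
import hmac
import time


def is_valid_slack_signature(signing_secret: str, timestamp: str, body: str, signature: str) -> bool:
    """Verify a request against Slack's v0 signing scheme."""
    # Reject stale requests to limit replay attacks.
    if abs(time.time() - int(timestamp)) > 60 * 5:
        return False
    basestring = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), basestring, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# timestamp comes from the X-Slack-Request-Timestamp header,
# signature from the X-Slack-Signature header.
```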
The chatbot can be run interactively from the command line:
    poetry run chatbot --project-name <project_name> --stage-name <stage> --model-size <size>

Options:

- `-p, --project-name` (required): Name of the project
- `-s, --stage-name` (optional, default: `prod`): Environment name
- `-m, --model-size` (optional, default: `large`): Model size (`large`, `medium`, `small`)
- `-d, --print-sub-agent-debug` (optional, flag): Print debug output from sub-agents
- `-id, --session-id` (optional): Session ID for conversation persistence
- `-up, --user-prompt` (optional): Single prompt for one-shot execution instead of interactive mode
Examples:

Interactive mode:

    poetry run chatbot -p poc -s prod -m large

One-shot query:

    poetry run chatbot -p poc -s prod -m medium -up "Show me tables in the analytics database"

With session persistence:

    poetry run chatbot -p poc -s prod -id my-session-123

Users can interact with Datalfred by mentioning the bot in a Slack channel or sending it a direct message. The bot will:
- Validate the request signature
- Check if the user is authorized (via the `AUTHORIZED_SLACK_USERS` environment variable)
- Process the question using the main agent and sub-agents
- Reply in the same Slack thread
Slack User Authorization: Only Slack users whose IDs are listed in the `authorized_slack_users` Terraform variable can use the bot (the check itself is sketched below).
Finding Slack User IDs: To find a user's Slack ID, click on their profile in Slack, then click the three dots (More) → "Copy member ID".
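The authorization check amounts to a membership test against the comma-separated allow-list. A minimal sketch, not the exact code in `lambda_entrypoint.py`:

```python
import os


def is_authorized(slack_user_id: str) -> bool:
    """Return True if the user ID is in the AUTHORIZED_SLACK_USERS allow-list."""
    raw = os.environ.get("AUTHORIZED_SLACK_USERS", "")
    allowed = {user_id.strip() for user_id in raw.split(",") if user_id.strip()}
    return slack_user_id in allowed


print(is_authorized("U01234567"))
```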
Capabilities:

- Query Data: Ask questions about data in the data lake (Glue Catalog, Athena queries)
  - Example: "What tables are in the customer database?"
  - Example: "Show me the last 10 records from the sales table"
- Investigate Infrastructure: Check the status of ingestion jobs, Step Functions, ECS tasks, and CloudWatch logs
  - Example: "What's the status of the latest ingestion run?"
  - Example: "Show me errors in the data-pipeline CloudWatch logs"
- Redrive Executions: Restart failed Step Function executions (only when explicitly requested)
  - Example: "Redrive the failed execution for pipeline X"
The chatbot tracks token usage and provides cost estimates after each conversation. It will also suggest using a smaller model size if costs exceed expectations.
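The cost estimate is a tokens-times-price calculation per model. The sketch below uses placeholder prices; the chatbot hardcodes its own per-model values, and actual Bedrock pricing should be checked separately.

```python
# Hypothetical per-1K-token prices in USD, keyed by model size (placeholders only).
PRICING = {
    "large": {"input": 0.003, "output": 0.015},
    "medium": {"input": 0.00025, "output": 0.00125},
    "small": {"input": 0.0008, "output": 0.0032},
}


def estimate_cost(model_size: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate conversation cost from token counts and per-1K-token prices."""
    prices = PRICING[model_size]
    return (input_tokens / 1000) * prices["input"] + (output_tokens / 1000) * prices["output"]


print(f"${estimate_cost('large', 12_000, 1_500):.4f}")
```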
The infrastructure is organized into the following modules:
- Bedrock Inference Profiles (`bedrock_inference_profile.tf`)
  - Creates three inference profiles for different model sizes
  - Large: Claude Sonnet 4.5
  - Medium: Claude Haiku 3
  - Small: Amazon Nova Pro
- Lambda Function (`lambda_chatbot.tf`)
  - Container-based Lambda function (900-second timeout, 520 MB memory)
  - Uses a Terraform module to build and push Docker images to ECR
  - Automatically rebuilds when code changes are detected (via file hash triggers)
  - Exposes a Lambda function URL for Slack webhook integration
- S3 Bucket (`s3.tf`)
  - Stores Athena query results (7-day lifecycle)
  - Stores conversation session data with versioning enabled
  - Intelligent-Tiering for cost optimization
  - Server-side encryption (AES256)
- Athena Workgroup (`athena_workgroup.tf`)
  - Configured with an output location in S3
  - Enforces the workgroup configuration
- Failsafe Lambda (`lambda_chatbot_failsafe.tf`)
  - Monitors CloudWatch logs for authentication failures
  - Triggers a CloudWatch alarm if >100 signature mismatches occur in 1 hour
  - Automatically sets the main Lambda's concurrency to 0 (disabling it) when triggered (see the sketch after this list)
  - Sends email notifications to configured recipients
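Disabling the chatbot amounts to reserving zero concurrency for its Lambda function. A minimal sketch of the kind of call the failsafe makes; the function name is an assumption following the project's naming convention.

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserved concurrency of 0 blocks all new invocations of the chatbot Lambda.
lambda_client.put_function_concurrency(
    FunctionName="poc_chatbot_prod_chatbot",  # hypothetical name; use the real one from Terraform
    ReservedConcurrentExecutions=0,
)
```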
- GitLab CI: Pipelines are defined in `.gitlab-ci.yml` using shared templates from a central repository
- Stages: `init`, `format`, `security`, `deploy`, `mirror_to_github`
- Environment Selection: Determined by the Git branch name in CI, or by the Terraform workspace locally
- Naming Convention: Resources follow the `{project_name}_{domain_name}_{stage_name}_<resource_name>` pattern
- Backend: Terraform state is stored in S3 with DynamoDB locking (configured at `terraform init` time)
The following environment variables are configured for the Lambda function:
- `PROJECT_NAME`: Project identifier (e.g., `poc`)
- `DOMAIN_NAME`: Domain/component name (hardcoded to `chatbot`)
- `STAGE_NAME`: Environment name (e.g., `prod`, `dev`)
- `SLACK_SECRET_ARN`: ARN of the Secrets Manager secret containing Slack credentials
- `AUTHORIZED_SLACK_USERS`: Comma-separated list of Slack user IDs authorized to use the bot
Variables defined in `variables.tf`:

- `project_name`: Name of the project
- `git_repository`: Git repository URL
- `failure_notification_receivers`: Comma-separated email addresses for failure alerts
- `authorized_slack_users`: Comma-separated Slack user IDs
- `role_to_assume_arn`: (Optional) IAM role ARN for Terraform to assume during deployment
The Slack app requires the following OAuth scopes (configured via app manifest):
Bot Token Scopes:
- `app_mentions:read`: Detect when the bot is mentioned
- `chat:write`: Send messages as the bot
- `chat:write.customize`: Customize message appearance
- `commands`: Support slash commands (if implemented)
- `im:read`, `im:write`: Read and send direct messages
- `im:history`, `mpim:history`: Access message history in DMs
- `incoming-webhook`: Post messages to channels
- `reactions:read`, `reactions:write`: Read and add reactions
Event Subscriptions:
- `app_mention`: Triggered when the bot is mentioned in a channel
- `message.im`: Triggered when a direct message is sent to the bot
- `message.app_home`: Triggered when a message is sent in the app home
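When the request URL is first saved in the Slack app settings, Slack also sends a one-off url_verification challenge that the endpoint must echo back. The sketch below shows a generic dispatch over these event types; it is not the project's actual handler in `lambda_entrypoint.py`.

```python
import json


def handle_slack_event(body: str) -> dict:
    """Minimal dispatch over the Slack event types the app subscribes to (illustrative)."""
    payload = json.loads(body)

    # Sent once by Slack when the request URL is configured; must be echoed back.
    if payload.get("type") == "url_verification":
        return {"statusCode": 200, "body": payload["challenge"]}

    event = payload.get("event", {})
    if event.get("type") == "app_mention":
        pass  # bot was mentioned in a channel
    elif event.get("type") == "message" and event.get("channel_type") == "im":
        pass  # direct message to the bot
    return {"statusCode": 200, "body": ""}
```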
- Terraform Workspace: Determines the `stage_name` for local deployments
- AWS Region: The default region is `eu-west-1` (configured in `terraform.tf`)
- Model Size: Can be set via the CLI (`--model-size`) to control the cost/performance tradeoff
- Sliding Window Size: 20 messages maximum in conversation history
- Session Storage: Persisted in S3 for continued conversations when a session ID is provided (see the sketch below)
- Lambda Timeout: 900 seconds (15 minutes)
- Failsafe Trigger: 100 signature validation failures in 1 hour
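Session persistence can be pictured as reading and writing a JSON document keyed by the session ID. A hedged sketch using boto3; the bucket name and key layout are assumptions, not the project's actual scheme.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "poc-chatbot-prod-sessions"  # hypothetical; the real bucket is created by s3.tf


def save_session(session_id: str, messages: list[dict]) -> None:
    """Persist the conversation history for a session."""
    s3.put_object(
        Bucket=BUCKET,
        Key=f"sessions/{session_id}.json",
        Body=json.dumps(messages).encode("utf-8"),
    )


def load_session(session_id: str) -> list[dict]:
    """Load a previous conversation, or start fresh if none exists."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"sessions/{session_id}.json")
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return []
```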
Located in the code/ directory:
code/
├── chatbot/ # Main application package
│ ├── __init__.py
│ ├── main_agent.py # Main orchestration agent
│ ├── lambda_entrypoint.py # AWS Lambda handler
│ ├── slack.py # Slack integration (webhooks, signatures)
│ └── sub_agents/ # Specialized agents
│ ├── data_analyst.py # Queries Glue/Athena
│ └── run_guy.py # AWS infrastructure investigation
├── chatbot_failsafe/ # Emergency shutoff Lambda
│ └── main.py
├── pyproject.toml # Poetry dependencies
└── Dockerfile # Lambda container image definition
Key Files:

- `main_agent.py`: Entry point for the chatbot; orchestrates sub-agents, manages conversation state, and calculates costs
- `lambda_entrypoint.py`: AWS Lambda handler; processes Slack events and implements the timeout failsafe
- `slack.py`: Handles Slack signature validation, message sending, and event filtering
- `sub_agents/`: Each sub-agent is a specialized tool with its own system prompt and capabilities
Located in the iac/ directory:
iac/
├── terraform.tf # Provider and backend configuration
├── locals.tf # Local variables (domain_name, stage_name)
├── variables.tf # Input variables
├── data.tf # Data sources (AWS account, region, secrets)
├── lambda_chatbot.tf # Main Lambda function and IAM
├── lambda_chatbot_failsafe.tf # Failsafe Lambda and CloudWatch alarm
├── bedrock_inference_profile.tf # Bedrock model configurations
├── s3.tf # S3 bucket for Athena and sessions
├── athena_workgroup.tf # Athena workgroup configuration
└── outputs.tf # Terraform outputs
Key Files:

- `lambda_chatbot.tf`: Defines the main Lambda function, builds Docker images via a reusable Terraform module, and manages IAM permissions
- `bedrock_inference_profile.tf`: Creates the three Bedrock inference profiles for different model sizes
- `lambda_chatbot_failsafe.tf`: Implements the security failsafe mechanism with CloudWatch alarms
- AWS Region: Infrastructure is deployed in `eu-west-1` (Ireland) by default.
- GitLab CI Dependency: CI/CD pipelines rely on GitLab CI templates that are not present in GitHub mirrors. GitHub should be considered read-only.
- Bedrock Model Availability: The chatbot assumes that the required Bedrock models (Claude Sonnet 4.5, Claude Haiku 3, Amazon Nova Pro) are enabled in the AWS account and region.
- Slack App Configuration: The Slack app must be created and configured manually using the provided manifest. The Lambda function URL must be available before configuring the Slack app's event subscription endpoint.
- Slack Secrets: The chatbot expects a Secrets Manager secret named `{project_name}_slack_alerting_prod` with specific fields (`token`, `signing_secret`, `slack_channel_id`).
- Session Persistence: Conversation history is only persisted when a `session_id` is provided. In Slack mode, the Slack user ID is used as the session ID.
- Lambda Timeout: The Lambda function has a 15-minute timeout. Long-running queries or operations may be interrupted by the timeout failsafe mechanism (triggered when less than 3 minutes remain).
- Cost Tracking: Token usage and cost calculations are approximations based on hardcoded pricing for specific models. Actual costs may vary.
- Terraform Backend: The backend configuration (S3 bucket and DynamoDB table) must be provided at `terraform init` time and is not hardcoded.
- Failsafe Threshold: The security failsafe is triggered after 100 failed signature validations in 1 hour. This threshold is hardcoded and may need adjustment based on usage patterns.
- Tool Error Handling: Sub-agents are instructed not to retry failed tool calls (except for SQL syntax errors) to prevent infinite loops and reduce costs.
- Read-Only AWS Operations: The "Run Guy" agent is restricted to read-only AWS operations, with the exception of redriving failed Step Function executions when explicitly requested by authorized users.
- Workspace-Based Environment Selection: When running Terraform locally, the environment (stage) is determined by the active Terraform workspace. The default workspace results in `stage_name=default`, which may not be intended for production use.