A Kubernetes GPU workload simulator using KWOK (Kubernetes WithOut Kubelet) and KAI Scheduler. This project provides a lightweight, efficient way to simulate GPU workloads in a Kubernetes environment without requiring actual GPU hardware.
```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        API[API Server]

        subgraph "Control Plane"
            KAI[KAI Scheduler]
            KWOK[KWOK Controller]
            style KAI fill:#f9f,stroke:#333,stroke-width:2px
            style KWOK fill:#bbf,stroke:#333,stroke-width:2px
        end

        subgraph "Simulated Resources"
            N1[GPU Node 1]
            N2[GPU Node 2]
            N3[GPU Node 3]
            style N1 fill:#bfb,stroke:#333,stroke-width:2px
            style N2 fill:#bfb,stroke:#333,stroke-width:2px
            style N3 fill:#bfb,stroke:#333,stroke-width:2px
        end

        subgraph "Workloads"
            J1[GPU Jobs]
            D1[GPU Deployments]
            style J1 fill:#fbb,stroke:#333,stroke-width:2px
            style D1 fill:#fbb,stroke:#333,stroke-width:2px
        end

        API --> KAI
        API --> KWOK
        KWOK --> N1
        KWOK --> N2
        KWOK --> N3
        KAI --> J1
        KAI --> D1
    end

    subgraph "Configuration"
        Q[Queue Definitions]
        S[Stage Definitions]
        N[Node Definitions]
        style Q fill:#ff9,stroke:#333,stroke-width:2px
        style S fill:#ff9,stroke:#333,stroke-width:2px
        style N fill:#ff9,stroke:#333,stroke-width:2px
    end

    Q --> API
    S --> API
    N --> API
```
- Control Plane
  - KAI Scheduler: Manages GPU workload scheduling and resource allocation
  - KWOK Controller: Simulates node and pod behaviors
- Simulated Resources
  - GPU Nodes: Virtual nodes with simulated GPU resources
  - Resource Metrics: Simulated GPU metrics and states
- Workloads
  - GPU Jobs: Batch processing and training jobs
  - GPU Deployments: Long-running GPU services
- Configuration
  - Queue Definitions: Resource quotas and scheduling policies (an example queue is sketched after this list)
  - Stage Definitions: Pod lifecycle stages
  - Node Definitions: GPU node specifications
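
The queue definitions control how the KAI Scheduler divides simulated GPU capacity between workloads. As a point of reference, a queue might be defined roughly as follows. This is an illustrative sketch based on the KAI Scheduler's published `Queue` resource, not a copy of `deployments/kai-scheduler/queues.yaml`; the exact API version, field names, and quota values are assumptions and may differ in this repository.

```yaml
# Illustrative sketch only -- see deployments/kai-scheduler/queues.yaml for the
# queue definitions this project actually applies.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu:
      quota: -1            # -1 = no quota enforced
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default     # queues can be nested under a parent queue
  resources:
    gpu:
      quota: 8             # hypothetical quota of 8 GPUs for the test queue
      limit: 8
      overQuotaWeight: 1
```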
This simulator allows you to:
- Test GPU workload scheduling behaviors
- Simulate various GPU job patterns and deployments
- Validate GPU resource allocation strategies
- Test KAI Scheduler configurations
- Simulate GPU node behaviors
```
.
├── deployments/
│   ├── kai-scheduler/        # KAI Scheduler configurations
│   │   ├── nodes.yaml
│   │   ├── pod-stages.yaml
│   │   └── queues.yaml
│   └── workloads/            # Production workload configurations
├── docs/
│   └── images/               # Documentation images
├── examples/
│   ├── gpu-jobs/             # Example GPU job configurations
│   └── gpu-deployments/      # Example GPU deployment configurations
└── README.md
```
Before installation, ensure you have:
- A running Kubernetes cluster (v1.20+)
- Helm CLI installed
- `kubectl` configured to access your cluster
- `curl` and `jq` installed for fetching latest releases
- NVIDIA GPUs available in your cluster nodes
First, add the NVIDIA Helm repository:
```bash
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

Install the GPU Operator:
```bash
# Install GPU Operator with custom configuration
helm install --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.9.2 \
    --set "driver.enabled=false" \
    --set "validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION" \
    --set-string "validator.driver.env[0].value=true"
```

Verify the installation:
```bash
# Check GPU operator pods
kubectl get pods -n gpu-operator

# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu=true
```

Next, prepare the environment variables for the KWOK installation:
```bash
# KWOK repository
KWOK_REPO=kubernetes-sigs/kwok

# Get latest release
KWOK_LATEST_RELEASE=$(curl "https://api.github.com/repos/${KWOK_REPO}/releases/latest" | jq -r '.tag_name')
```

Deploy KWOK and set up custom resource definitions (CRDs):
```bash
# Deploy KWOK controller and CRDs
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/kwok.yaml"

# Apply default stage configurations for pod/node emulation
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/stage-fast.yaml"

# (Optional) Apply resource usage metrics configuration
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/metrics-usage.yaml"
```

Add the NVIDIA Helm repository and install KAI Scheduler:
```bash
# Add NVIDIA Helm repository
helm repo add nvidia-k8s https://helm.ngc.nvidia.com/nvidia/k8s
helm repo update

# Install KAI Scheduler
helm upgrade -i kai-scheduler nvidia-k8s/kai-scheduler \
    -n kai-scheduler \
    --create-namespace \
    --set "global.registry=nvcr.io/nvidia/k8s"
```

Apply the base configurations for nodes, pod stages, and queues:
```bash
kubectl apply -f deployments/kai-scheduler/nodes.yaml
kubectl apply -f deployments/kai-scheduler/pod-stages.yaml
kubectl apply -f deployments/kai-scheduler/queues.yaml
```

Verify the configurations:
```bash
# View simulated nodes
kubectl get nodes
kubectl describe nodes

# View pod stages
kubectl get stages
kubectl describe stage gpu-pod-create
kubectl describe stage gpu-pod-ready
kubectl describe stage gpu-pod-complete

# View queues
kubectl get queues
kubectl describe queue default
kubectl describe queue test

# Monitor queue status
kubectl get queue default -o jsonpath='{.status}' | jq
kubectl get queue test -o jsonpath='{.status}' | jq
```

The repository includes example configurations for both GPU jobs and deployments:
- Single GPU Jobs:

  ```bash
  kubectl apply -f examples/gpu-jobs/jobs.yaml

  # Monitor the jobs
  kubectl get jobs
  kubectl get pods

  # You should see output like this:
  NAME                     READY   STATUS      RESTARTS   AGE
  gpu-test-1-ttzfz         0/1     Completed   0          21s
  gpu-test-2-qw574         0/1     Completed   0          21s
  gpu-test-3-7hlcg         0/1     Completed   0          21s
  gpu-test-4-66mzb         0/1     Completed   0          21s
  gpu-test-large-1-x5sns   0/1     Completed   0          21s
  gpu-test-large-2-khnf5   0/1     Completed   0          21s
  ```

  The `Completed` status indicates that the jobs have successfully finished their execution. Each job runs a container that sleeps for a random duration between 1-10 seconds and then completes. The `0/1` in the `READY` column is normal for completed jobs, as they are no longer running.
- GPU Deployments:

  ```bash
  kubectl apply -f examples/gpu-deployments/deployments.yaml

  # Monitor the deployments
  kubectl get deployments
  kubectl get pods -l app=gpu-deployment-1
  kubectl get pods -l app=gpu-deployment-with-init
  ```

Monitor your GPU workloads:
```bash
# Watch all pods with their status
kubectl get pods -w

# Watch specific job pods
kubectl get pods -l job-name=gpu-test-1
kubectl get pods -l job-name=gpu-test-large-1

# Check pod details including events
kubectl describe pod <pod-name>

# Check GPU allocation
kubectl describe nodes | grep nvidia.com/gpu
```

- GPU Test Jobs: Basic jobs requesting 1-2 GPUs with varying sleep durations
- GPU Deployments: Long-running services with GPU requirements
- Init Container Examples: Deployments with GPU-enabled init containers
The simulator includes several example configurations to demonstrate different GPU workload patterns:
`examples/gpu-jobs/jobs.yaml` contains simple GPU job configurations for testing (one such job is sketched after this list):

- `gpu-test-1` to `gpu-test-4`: Single GPU jobs with random completion times
- `gpu-test-large-1` and `gpu-test-large-2`: Jobs requiring 2 GPUs each
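
A job of this kind might look roughly like the sketch below. Everything here is illustrative rather than the file's actual content: the scheduler name, the KWOK toleration, and the sleep command are assumptions about how the simulator is wired together, and the authoritative manifests are in `examples/gpu-jobs/jobs.yaml`.

```yaml
# Illustrative sketch of a single-GPU test job -- not the actual contents of
# examples/gpu-jobs/jobs.yaml.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-test-1
spec:
  template:
    metadata:
      labels:
        runai/queue: test              # queue label used for scheduling
    spec:
      schedulerName: kai-scheduler     # assumed scheduler name for KAI Scheduler
      restartPolicy: Never
      tolerations:
        - key: kwok.x-k8s.io/node      # tolerate the KWOK fake-node taint
          operator: Exists
          effect: NoSchedule
      containers:
        - name: gpu-test
          image: nvidia/cuda:11.0-base
          # Sleep for 1-10 seconds, then exit successfully
          command: ["bash", "-c", "sleep $((RANDOM % 10 + 1))"]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
```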
`examples/gpu-jobs/complex-scenarios.yaml` demonstrates various GPU workload patterns:

- Training Job (`gpu-a100-training`)
  - Uses 2 NVIDIA A100 GPUs
  - Higher resource allocation (8 CPU cores, 32GB memory)
  - Runs in the research queue
  - Random completion time between 10-30 seconds
- Inference Job (`gpu-v100-inference`)
  - Uses 1 NVIDIA V100 GPU
  - Medium resource allocation (4 CPU cores, 16GB memory)
  - Runs in the production queue
  - Random completion time between 5-20 seconds
- Batch Processing (`gpu-t4-batch`)
  - Uses 1 NVIDIA T4 GPU
  - Lower resource allocation (2 CPU cores, 8GB memory)
  - Zone-specific scheduling (zone-a)
  - Runs in the test queue
  - Random completion time between 1-11 seconds
- Multi-Node Training (`gpu-multi-node`)
  - Distributed job with 4 parallel pods
  - Each pod uses 1 A100 GPU
  - Pod anti-affinity ensures distribution across nodes (see the sketch after this list)
  - Medium resource allocation per pod (4 CPU cores, 16GB memory)
  - Random completion time between 20-50 seconds
  - Runs in the research queue
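
The node distribution in the multi-node scenario comes from combining Job parallelism with pod anti-affinity. The sketch below shows that pattern in isolation; the scheduler name, label names, and sleep command are illustrative assumptions, not the contents of `examples/gpu-jobs/complex-scenarios.yaml`.

```yaml
# Illustrative parallelism + anti-affinity pattern for the multi-node scenario.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-multi-node
spec:
  parallelism: 4                       # run four pods concurrently
  completions: 4
  template:
    metadata:
      labels:
        app: gpu-multi-node            # label the anti-affinity rule matches on
        runai/queue: research
    spec:
      schedulerName: kai-scheduler     # assumed scheduler name
      restartPolicy: Never
      tolerations:
        - key: kwok.x-k8s.io/node
          operator: Exists
          effect: NoSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: gpu-multi-node
              topologyKey: kubernetes.io/hostname   # at most one pod per node
      containers:
        - name: trainer
          image: nvidia/cuda:11.0-base
          command: ["bash", "-c", "sleep $((RANDOM % 31 + 20))"]  # 20-50 seconds
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
```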
`examples/gpu-jobs/deployments.yaml` contains persistent GPU workloads:

- Interactive Session (`gpu-interactive`, sketched below)
  - Long-running deployment with 1 replica
  - Uses 1 NVIDIA V100 GPU
  - Medium resource allocation (4 CPU cores, 16GB memory)
  - Runs in the research queue
  - Continuous operation with 30-second sleep intervals
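
Because this workload never completes, it is expressed as a Deployment rather than a Job. A rough sketch follows; the scheduler name, toleration, and command are assumptions, and the real manifest is in `examples/gpu-jobs/deployments.yaml`.

```yaml
# Illustrative sketch of a persistent GPU workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-interactive
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-interactive
  template:
    metadata:
      labels:
        app: gpu-interactive
        runai/queue: research
    spec:
      schedulerName: kai-scheduler     # assumed scheduler name
      tolerations:
        - key: kwok.x-k8s.io/node
          operator: Exists
          effect: NoSchedule
      containers:
        - name: interactive
          image: nvidia/cuda:11.0-base
          # Stay alive indefinitely, waking every 30 seconds
          command: ["bash", "-c", "while true; do sleep 30; done"]
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
```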
All GPU configurations include:
- KWOK node tolerations for simulated environment
- Specific GPU type selectors
- Resource limits matching requests
- Base CUDA image (`nvidia/cuda:11.0-base`)
- Queue labels for scheduling (`runai/queue`)
Apply the configurations:
```bash
# Apply basic jobs
kubectl apply -f examples/gpu-jobs/jobs.yaml

# Apply complex scenarios
kubectl apply -f examples/gpu-jobs/complex-scenarios.yaml

# Apply long-running deployments
kubectl apply -f examples/gpu-jobs/deployments.yaml

# Monitor pods
kubectl get pods -w
```

These configurations work with the simulated GPU nodes defined in `deployments/kai-scheduler/nodes.yaml`, which includes (one such node definition is sketched after the list):
- A100 nodes with 4 GPUs each (80GB GPU memory)
- V100 nodes with 4 GPUs each (32GB GPU memory)
- T4 nodes with 4 GPUs each (16GB GPU memory)
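
With KWOK, each of these is an ordinary Node object carrying the KWOK fake-node annotation and simulated GPU capacity. The sketch below follows the fake-node pattern from the KWOK documentation; the GPU-type label and the CPU/memory figures are placeholders, and the definitions this project actually uses are in `deployments/kai-scheduler/nodes.yaml`.

```yaml
# Illustrative sketch of a simulated A100 node, following the KWOK fake-node pattern.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a100-1
  annotations:
    kwok.x-k8s.io/node: fake           # marks the node as managed by KWOK
  labels:
    type: kwok
    nvidia.com/gpu.product: A100       # placeholder GPU-type label; match your selectors
spec:
  taints:
    - key: kwok.x-k8s.io/node
      value: fake
      effect: NoSchedule               # workloads must tolerate this taint
status:
  allocatable:
    cpu: "64"                          # placeholder CPU/memory figures
    memory: 512Gi
    nvidia.com/gpu: "4"                # four simulated GPUs
    pods: "110"
  capacity:
    cpu: "64"
    memory: 512Gi
    nvidia.com/gpu: "4"
    pods: "110"
```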
Common issues and their solutions:
- Pods Stuck in Pending: Check node GPU availability and queue configurations
- Resource Quotas: Verify queue resource limits and usage
- Node Selection: Ensure proper node labels and tolerations
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.