A Kubernetes GPU workload simulator using KWOK (Kubernetes WithOut Kubelet) and KAI Scheduler. This project provides a lightweight, efficient way to simulate GPU workloads in a Kubernetes environment without requiring actual GPU hardware.
```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        API[API Server]

        subgraph "Control Plane"
            KAI[KAI Scheduler]
            KWOK[KWOK Controller]
            style KAI fill:#f9f,stroke:#333,stroke-width:2px
            style KWOK fill:#bbf,stroke:#333,stroke-width:2px
        end

        subgraph "Simulated Resources"
            N1[GPU Node 1]
            N2[GPU Node 2]
            N3[GPU Node 3]
            style N1 fill:#bfb,stroke:#333,stroke-width:2px
            style N2 fill:#bfb,stroke:#333,stroke-width:2px
            style N3 fill:#bfb,stroke:#333,stroke-width:2px
        end

        subgraph "Workloads"
            J1[GPU Jobs]
            D1[GPU Deployments]
            style J1 fill:#fbb,stroke:#333,stroke-width:2px
            style D1 fill:#fbb,stroke:#333,stroke-width:2px
        end

        API --> KAI
        API --> KWOK
        KWOK --> N1
        KWOK --> N2
        KWOK --> N3
        KAI --> J1
        KAI --> D1
    end

    subgraph "Configuration"
        Q[Queue Definitions]
        S[Stage Definitions]
        N[Node Definitions]
        style Q fill:#ff9,stroke:#333,stroke-width:2px
        style S fill:#ff9,stroke:#333,stroke-width:2px
        style N fill:#ff9,stroke:#333,stroke-width:2px
    end

    Q --> API
    S --> API
    N --> API
```
- Control Plane
  - KAI Scheduler: Manages GPU workload scheduling and resource allocation
  - KWOK Controller: Simulates node and pod behaviors
- Simulated Resources
  - GPU Nodes: Virtual nodes with simulated GPU resources
  - Resource Metrics: Simulated GPU metrics and states
- Workloads
  - GPU Jobs: Batch processing and training jobs
  - GPU Deployments: Long-running GPU services
- Configuration
  - Queue Definitions: Resource quotas and scheduling policies (an example queue is sketched after this list)
  - Stage Definitions: Pod lifecycle stages
  - Node Definitions: GPU node specifications
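
The queue definitions control how the KAI Scheduler divides simulated GPU capacity between workloads. As a point of reference, a queue might be defined roughly as follows. This is an illustrative sketch based on the KAI Scheduler's published `Queue` resource, not a copy of `deployments/kai-scheduler/queues.yaml`; the exact API version, field names, and quota values are assumptions and may differ in this repository.

```yaml
# Illustrative sketch only -- see deployments/kai-scheduler/queues.yaml for the
# queue definitions this project actually applies.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu:
      quota: -1            # -1 = no quota enforced
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default     # queues can be nested under a parent queue
  resources:
    gpu:
      quota: 8             # hypothetical quota of 8 GPUs for the test queue
      limit: 8
      overQuotaWeight: 1
```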
This simulator allows you to:
- Test GPU workload scheduling behaviors
- Simulate various GPU job patterns and deployments
- Validate GPU resource allocation strategies
- Test KAI Scheduler configurations
- Simulate GPU node behaviors
```
.
├── deployments/
│   ├── kai-scheduler/        # KAI Scheduler configurations
│   │   ├── nodes.yaml
│   │   ├── pod-stages.yaml
│   │   └── queues.yaml
│   └── workloads/            # Production workload configurations
├── docs/
│   └── images/               # Documentation images
├── examples/
│   ├── gpu-jobs/             # Example GPU job configurations
│   └── gpu-deployments/      # Example GPU deployment configurations
└── README.md
```
Before installation, ensure you have:
- A running Kubernetes cluster (v1.20+)
- Helm CLI installed
- `kubectl` configured to access your cluster
- `curl` and `jq` installed for fetching latest releases
- NVIDIA GPUs available in your cluster nodes
First, add the NVIDIA Helm repository:
```bash
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

Install the GPU Operator:
```bash
# Install GPU Operator with custom configuration
helm install --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.9.2 \
    --set "driver.enabled=false" \
    --set "validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION" \
    --set-string "validator.driver.env[0].value=true"
```

Verify the installation:
```bash
# Check GPU operator pods
kubectl get pods -n gpu-operator

# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu=true
```

Next, prepare the environment variables for the KWOK installation:
```bash
# KWOK repository
KWOK_REPO=kubernetes-sigs/kwok

# Get latest release
KWOK_LATEST_RELEASE=$(curl "https://api.github.com/repos/${KWOK_REPO}/releases/latest" | jq -r '.tag_name')
```

Deploy KWOK and set up custom resource definitions (CRDs):
```bash
# Deploy KWOK controller and CRDs
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/kwok.yaml"

# Apply default stage configurations for pod/node emulation
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/stage-fast.yaml"

# (Optional) Apply resource usage metrics configuration
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/metrics-usage.yaml"
```

Add the NVIDIA Helm repository and install KAI Scheduler:
```bash
# Add NVIDIA Helm repository
helm repo add nvidia-k8s https://helm.ngc.nvidia.com/nvidia/k8s
helm repo update

# Install KAI Scheduler
helm upgrade -i kai-scheduler nvidia-k8s/kai-scheduler \
    -n kai-scheduler \
    --create-namespace \
    --set "global.registry=nvcr.io/nvidia/k8s"
```

Apply the base configurations for nodes, pod stages, and queues:
```bash
kubectl apply -f deployments/kai-scheduler/nodes.yaml
kubectl apply -f deployments/kai-scheduler/pod-stages.yaml
kubectl apply -f deployments/kai-scheduler/queues.yaml
```

Verify the configurations:
```bash
# View simulated nodes
kubectl get nodes
kubectl describe nodes

# View pod stages
kubectl get stages
kubectl describe stage gpu-pod-create
kubectl describe stage gpu-pod-ready
kubectl describe stage gpu-pod-complete

# View queues
kubectl get queues
kubectl describe queue default
kubectl describe queue test

# Monitor queue status
kubectl get queue default -o jsonpath='{.status}' | jq
kubectl get queue test -o jsonpath='{.status}' | jq
```

The repository includes example configurations for both GPU jobs and deployments:
- Single GPU Jobs:

  ```bash
  kubectl apply -f examples/gpu-jobs/jobs.yaml

  # Monitor the jobs
  kubectl get jobs
  kubectl get pods

  # You should see output like this:
  NAME                     READY   STATUS      RESTARTS   AGE
  gpu-test-1-ttzfz         0/1     Completed   0          21s
  gpu-test-2-qw574         0/1     Completed   0          21s
  gpu-test-3-7hlcg         0/1     Completed   0          21s
  gpu-test-4-66mzb         0/1     Completed   0          21s
  gpu-test-large-1-x5sns   0/1     Completed   0          21s
  gpu-test-large-2-khnf5   0/1     Completed   0          21s
  ```

  The `Completed` status indicates that the jobs have successfully finished their execution. Each job runs a container that sleeps for a random duration between 1-10 seconds and then completes. The `0/1` in the `READY` column is normal for completed jobs, as they are no longer running.
- GPU Deployments:

  ```bash
  kubectl apply -f examples/gpu-deployments/deployments.yaml

  # Monitor the deployments
  kubectl get deployments
  kubectl get pods -l app=gpu-deployment-1
  kubectl get pods -l app=gpu-deployment-with-init
  ```

Monitor your GPU workloads:
```bash
# Watch all pods with their status
kubectl get pods -w

# Watch specific job pods
kubectl get pods -l job-name=gpu-test-1
kubectl get pods -l job-name=gpu-test-large-1

# Check pod details including events
kubectl describe pod <pod-name>

# Check GPU allocation
kubectl describe nodes | grep nvidia.com/gpu
```

- GPU Test Jobs: Basic jobs requesting 1-2 GPUs with varying sleep durations
- GPU Deployments: Long-running services with GPU requirements
- Init Container Examples: Deployments with GPU-enabled init containers
The simulator includes several example configurations to demonstrate different GPU workload patterns:
`examples/gpu-jobs/jobs.yaml` contains simple GPU job configurations for testing (one such job is sketched after this list):

- `gpu-test-1` to `gpu-test-4`: Single GPU jobs with random completion times
- `gpu-test-large-1` and `gpu-test-large-2`: Jobs requiring 2 GPUs each
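
A job of this kind might look roughly like the sketch below. Everything here is illustrative rather than the file's actual content: the scheduler name, the KWOK toleration, and the sleep command are assumptions about how the simulator is wired together, and the authoritative manifests are in `examples/gpu-jobs/jobs.yaml`.

```yaml
# Illustrative sketch of a single-GPU test job -- not the actual contents of
# examples/gpu-jobs/jobs.yaml.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-test-1
spec:
  template:
    metadata:
      labels:
        runai/queue: test              # queue label used for scheduling
    spec:
      schedulerName: kai-scheduler     # assumed scheduler name for KAI Scheduler
      restartPolicy: Never
      tolerations:
        - key: kwok.x-k8s.io/node      # tolerate the KWOK fake-node taint
          operator: Exists
          effect: NoSchedule
      containers:
        - name: gpu-test
          image: nvidia/cuda:11.0-base
          # Sleep for 1-10 seconds, then exit successfully
          command: ["bash", "-c", "sleep $((RANDOM % 10 + 1))"]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
```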
`examples/gpu-jobs/complex-scenarios.yaml` demonstrates various GPU workload patterns:

- Training Job (`gpu-a100-training`)
  - Uses 2 NVIDIA A100 GPUs
  - Higher resource allocation (8 CPU cores, 32GB memory)
  - Runs in the research queue
  - Random completion time between 10-30 seconds
- Inference Job (`gpu-v100-inference`)
  - Uses 1 NVIDIA V100 GPU
  - Medium resource allocation (4 CPU cores, 16GB memory)
  - Runs in the production queue
  - Random completion time between 5-20 seconds
- Batch Processing (`gpu-t4-batch`)
  - Uses 1 NVIDIA T4 GPU
  - Lower resource allocation (2 CPU cores, 8GB memory)
  - Zone-specific scheduling (zone-a)
  - Runs in the test queue
  - Random completion time between 1-11 seconds
- Multi-Node Training (`gpu-multi-node`)
  - Distributed job with 4 parallel pods
  - Each pod uses 1 A100 GPU
  - Pod anti-affinity ensures distribution across nodes (see the sketch after this list)
  - Medium resource allocation per pod (4 CPU cores, 16GB memory)
  - Random completion time between 20-50 seconds
  - Runs in the research queue
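
The node distribution in the multi-node scenario comes from combining Job parallelism with pod anti-affinity. The sketch below shows that pattern in isolation; the scheduler name, label names, and sleep command are illustrative assumptions, not the contents of `examples/gpu-jobs/complex-scenarios.yaml`.

```yaml
# Illustrative parallelism + anti-affinity pattern for the multi-node scenario.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-multi-node
spec:
  parallelism: 4                       # run four pods concurrently
  completions: 4
  template:
    metadata:
      labels:
        app: gpu-multi-node            # label the anti-affinity rule matches on
        runai/queue: research
    spec:
      schedulerName: kai-scheduler     # assumed scheduler name
      restartPolicy: Never
      tolerations:
        - key: kwok.x-k8s.io/node
          operator: Exists
          effect: NoSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: gpu-multi-node
              topologyKey: kubernetes.io/hostname   # at most one pod per node
      containers:
        - name: trainer
          image: nvidia/cuda:11.0-base
          command: ["bash", "-c", "sleep $((RANDOM % 31 + 20))"]  # 20-50 seconds
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
```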
`examples/gpu-jobs/deployments.yaml` contains persistent GPU workloads:

- Interactive Session (`gpu-interactive`, sketched below)
  - Long-running deployment with 1 replica
  - Uses 1 NVIDIA V100 GPU
  - Medium resource allocation (4 CPU cores, 16GB memory)
  - Runs in the research queue
  - Continuous operation with 30-second sleep intervals
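
Because this workload never completes, it is expressed as a Deployment rather than a Job. A rough sketch follows; the scheduler name, toleration, and command are assumptions, and the real manifest is in `examples/gpu-jobs/deployments.yaml`.

```yaml
# Illustrative sketch of a persistent GPU workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-interactive
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-interactive
  template:
    metadata:
      labels:
        app: gpu-interactive
        runai/queue: research
    spec:
      schedulerName: kai-scheduler     # assumed scheduler name
      tolerations:
        - key: kwok.x-k8s.io/node
          operator: Exists
          effect: NoSchedule
      containers:
        - name: interactive
          image: nvidia/cuda:11.0-base
          # Stay alive indefinitely, waking every 30 seconds
          command: ["bash", "-c", "while true; do sleep 30; done"]
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi
```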
All GPU configurations include:
- KWOK node tolerations for simulated environment
- Specific GPU type selectors
- Resource limits matching requests
- Base CUDA image (`nvidia/cuda:11.0-base`)
- Queue labels for scheduling (`runai/queue`)
Apply the configurations:
```bash
# Apply basic jobs
kubectl apply -f examples/gpu-jobs/jobs.yaml

# Apply complex scenarios
kubectl apply -f examples/gpu-jobs/complex-scenarios.yaml

# Apply long-running deployments
kubectl apply -f examples/gpu-jobs/deployments.yaml

# Monitor pods
kubectl get pods -w
```

These configurations work with the simulated GPU nodes defined in `deployments/kai-scheduler/nodes.yaml`, which includes (one such node definition is sketched after the list):
- A100 nodes with 4 GPUs each (80GB GPU memory)
- V100 nodes with 4 GPUs each (32GB GPU memory)
- T4 nodes with 4 GPUs each (16GB GPU memory)
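
With KWOK, each of these is an ordinary Node object carrying the KWOK fake-node annotation and simulated GPU capacity. The sketch below follows the fake-node pattern from the KWOK documentation; the GPU-type label and the CPU/memory figures are placeholders, and the definitions this project actually uses are in `deployments/kai-scheduler/nodes.yaml`.

```yaml
# Illustrative sketch of a simulated A100 node, following the KWOK fake-node pattern.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a100-1
  annotations:
    kwok.x-k8s.io/node: fake           # marks the node as managed by KWOK
  labels:
    type: kwok
    nvidia.com/gpu.product: A100       # placeholder GPU-type label; match your selectors
spec:
  taints:
    - key: kwok.x-k8s.io/node
      value: fake
      effect: NoSchedule               # workloads must tolerate this taint
status:
  allocatable:
    cpu: "64"                          # placeholder CPU/memory figures
    memory: 512Gi
    nvidia.com/gpu: "4"                # four simulated GPUs
    pods: "110"
  capacity:
    cpu: "64"
    memory: 512Gi
    nvidia.com/gpu: "4"
    pods: "110"
```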
Common issues and their solutions:
- Pods Stuck in Pending: Check node GPU availability and queue configurations
- Resource Quotas: Verify queue resource limits and usage
- Node Selection: Ensure proper node labels and tolerations
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.