KWOK GPU Simulator

A Kubernetes GPU workload simulator using KWOK (Kubernetes WithOut Kubelet) and KAI Scheduler. This project provides a lightweight, efficient way to simulate GPU workloads in a Kubernetes environment without requiring actual GPU hardware.

Architecture

The following Mermaid diagram shows how the pieces fit together:

graph TB
    subgraph "Kubernetes Cluster"
        API[API Server]
        
        subgraph "Control Plane"
            KAI[KAI Scheduler]
            KWOK[KWOK Controller]
            style KAI fill:#f9f,stroke:#333,stroke-width:2px
            style KWOK fill:#bbf,stroke:#333,stroke-width:2px
        end
        
        subgraph "Simulated Resources"
            N1[GPU Node 1]
            N2[GPU Node 2]
            N3[GPU Node 3]
            style N1 fill:#bfb,stroke:#333,stroke-width:2px
            style N2 fill:#bfb,stroke:#333,stroke-width:2px
            style N3 fill:#bfb,stroke:#333,stroke-width:2px
        end
        
        subgraph "Workloads"
            J1[GPU Jobs]
            D1[GPU Deployments]
            style J1 fill:#fbb,stroke:#333,stroke-width:2px
            style D1 fill:#fbb,stroke:#333,stroke-width:2px
        end
        
        API --> KAI
        API --> KWOK
        KWOK --> N1
        KWOK --> N2
        KWOK --> N3
        KAI --> J1
        KAI --> D1
    end

    subgraph "Configuration"
        Q[Queue Definitions]
        S[Stage Definitions]
        N[Node Definitions]
        style Q fill:#ff9,stroke:#333,stroke-width:2px
        style S fill:#ff9,stroke:#333,stroke-width:2px
        style N fill:#ff9,stroke:#333,stroke-width:2px
    end
    
    Q --> API
    S --> API
    N --> API

Components

  1. Control Plane

    • KAI Scheduler: Manages GPU workload scheduling and resource allocation
    • KWOK Controller: Simulates node and pod behaviors
  2. Simulated Resources

    • GPU Nodes: Virtual nodes with simulated GPU resources
    • Resource Metrics: Simulated GPU metrics and states
  3. Workloads

    • GPU Jobs: Batch processing and training jobs
    • GPU Deployments: Long-running GPU services
  4. Configuration

    • Queue Definitions: Resource quotas and scheduling policies (see the sketch after this list)
    • Stage Definitions: Pod lifecycle stages
    • Node Definitions: GPU node specifications
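For illustration, queues are plain Kubernetes resources. The sketch below assumes the scheduling.run.ai/v2 API group and the quota/limit/overQuotaWeight fields used by recent KAI Scheduler releases; the definitions this project actually applies live in deployments/kai-scheduler/queues.yaml and take precedence:

apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default    # assumes a two-level hierarchy under a top-level "default" queue
  resources:
    gpu:
      quota: -1           # -1 = no quota enforced; set a number to cap GPU use
      limit: -1
      overQuotaWeight: 1
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1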

Overview

This simulator allows you to:

  • Test GPU workload scheduling behaviors
  • Simulate various GPU job patterns and deployments
  • Validate GPU resource allocation strategies
  • Test KAI Scheduler configurations
  • Simulate GPU node behaviors

Repository Structure

.
├── deployments/
│   ├── kai-scheduler/    # KAI Scheduler configurations
│   │   ├── nodes.yaml
│   │   ├── pod-stages.yaml
│   │   └── queues.yaml
│   └── workloads/        # Production workload configurations
├── docs/
│   └── images/          # Documentation images
├── examples/
│   ├── gpu-jobs/        # Example GPU job configurations
│   └── gpu-deployments/ # Example GPU deployment configurations
└── README.md

Prerequisites

Before installation, ensure you have:

  • A running Kubernetes cluster (v1.20+)
  • Helm CLI installed
  • kubectl configured to access your cluster
  • curl and jq installed for fetching latest releases
  • No physical NVIDIA GPUs are required; GPU capacity is simulated on KWOK-managed nodes

Installation

1. Install NVIDIA GPU Operator

First, add the NVIDIA Helm repository:

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Install the GPU Operator. The host driver is disabled (driver.enabled=false) since no physical GPUs are involved:

# Install GPU Operator with custom configuration
helm install --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v24.9.2 \
  --set "driver.enabled=false" \
  --set "validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION" \
  --set-string "validator.driver.env[0].value=true"

Verify the installation:

# Check GPU operator pods
kubectl get pods -n gpu-operator

# If your cluster has real GPU nodes, the operator labels them
# (the simulated GPU nodes are created later, in step 4)
kubectl get nodes -l nvidia.com/gpu.present=true

2. Install KWOK

First, prepare the environment variables:

# KWOK repository
KWOK_REPO=kubernetes-sigs/kwok
# Get the latest release tag
KWOK_LATEST_RELEASE=$(curl -s "https://api.github.com/repos/${KWOK_REPO}/releases/latest" | jq -r '.tag_name')

Deploy KWOK and set up custom resource definitions (CRDs):

# Deploy KWOK controller and CRDs
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/kwok.yaml"

# Apply default stage configurations for pod/node emulation
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/stage-fast.yaml"

# (Optional) Apply resource usage metrics configuration
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/metrics-usage.yaml"

3. Install KAI Scheduler

Add the NVIDIA Helm repository and install KAI Scheduler:

# Add NVIDIA Helm repository
helm repo add nvidia-k8s https://helm.ngc.nvidia.com/nvidia/k8s
helm repo update

# Install KAI Scheduler
helm upgrade -i kai-scheduler nvidia-k8s/kai-scheduler \
  -n kai-scheduler \
  --create-namespace \
  --set "global.registry=nvcr.io/nvidia/k8s"

4. Apply Base Configurations

Apply the base configurations for nodes, pod stages, and queues:

kubectl apply -f deployments/kai-scheduler/nodes.yaml
kubectl apply -f deployments/kai-scheduler/pod-stages.yaml
kubectl apply -f deployments/kai-scheduler/queues.yaml

Verify the configurations:

# View simulated nodes
kubectl get nodes
kubectl describe nodes

# View pod stages
kubectl get stages
kubectl describe stage gpu-pod-create
kubectl describe stage gpu-pod-ready
kubectl describe stage gpu-pod-complete

# View queues
kubectl get queues
kubectl describe queue default
kubectl describe queue test

# Monitor queue status
kubectl get queue default -o json | jq '.status'
kubectl get queue test -o json | jq '.status'
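For reference, KWOK stages are what drive pod lifecycle transitions on the simulated nodes. A completion stage might look roughly like this (a hedged sketch against the kwok.x-k8s.io/v1alpha1 Stage API; the project's actual definitions are in deployments/kai-scheduler/pod-stages.yaml and may differ):

apiVersion: kwok.x-k8s.io/v1alpha1
kind: Stage
metadata:
  name: gpu-pod-complete
spec:
  resourceRef:
    apiGroup: v1
    kind: Pod
  selector:
    matchExpressions:
      - key: '.status.phase'
        operator: 'In'
        values:
          - 'Running'
  delay:
    durationMilliseconds: 1000          # earliest transition after 1s
    jitterDurationMilliseconds: 10000   # randomized upper bound, simulating variable job length
  next:
    statusTemplate: |
      phase: Succeeded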

Usage

Running GPU Jobs

The repository includes example configurations for both GPU jobs and deployments:

  1. Single GPU Jobs:
kubectl apply -f examples/gpu-jobs/jobs.yaml

# Monitor the jobs
kubectl get jobs
kubectl get pods

# You should see output like this:
NAME                     READY   STATUS      RESTARTS   AGE
gpu-test-1-ttzfz         0/1     Completed   0          21s
gpu-test-2-qw574         0/1     Completed   0          21s
gpu-test-3-7hlcg         0/1     Completed   0          21s
gpu-test-4-66mzb         0/1     Completed   0          21s
gpu-test-large-1-x5sns   0/1     Completed   0          21s
gpu-test-large-2-khnf5   0/1     Completed   0          21s

The Completed status indicates that the jobs finished successfully. Each job runs a container that sleeps for a random duration between 1 and 10 seconds and then exits; a sketch of one such job follows. The 0/1 in the READY column is expected for completed jobs, since their containers are no longer running.
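Each job in examples/gpu-jobs/jobs.yaml follows roughly this shape (a hedged sketch; the names, queue labels, and tolerations in the actual manifests take precedence):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-test-1
spec:
  template:
    metadata:
      labels:
        runai/queue: test                 # queue label read by KAI Scheduler
    spec:
      schedulerName: kai-scheduler        # hand the pod to KAI instead of the default scheduler
      tolerations:
        - key: kwok.x-k8s.io/node         # allow placement on simulated KWOK nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: gpu-test
          image: nvidia/cuda:11.0-base
          command: ["bash", "-c", "sleep $((RANDOM % 10 + 1))"]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1           # limits match requests
      restartPolicy: Never

Note that on KWOK nodes no container actually runs; the pod's progression to Completed is simulated by the stage definitions applied earlier.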

  2. GPU Deployments:
kubectl apply -f examples/gpu-deployments/deployments.yaml

# Monitor the deployments
kubectl get deployments
kubectl get pods -l app=gpu-deployment-1
kubectl get pods -l app=gpu-deployment-with-init

Monitoring

Monitor your GPU workloads:

# Watch all pods with their status
kubectl get pods -w

# Watch specific job pods
kubectl get pods -l job-name=gpu-test-1
kubectl get pods -l job-name=gpu-test-large-1

# Check pod details including events
kubectl describe pod <pod-name>

# Check GPU allocation
kubectl describe nodes | grep nvidia.com/gpu

Example Workloads

  1. GPU Test Jobs: Basic jobs requesting 1-2 GPUs with varying sleep durations
  2. GPU Deployments: Long-running services with GPU requirements
  3. Init Container Examples: Deployments with GPU-enabled init containers

GPU Job Configurations

The simulator includes several example configurations to demonstrate different GPU workload patterns:

Basic GPU Jobs (examples/gpu-jobs/jobs.yaml)

Contains simple GPU job configurations for testing:

  • gpu-test-1 to gpu-test-4: Single GPU jobs with random completion times
  • gpu-test-large-1 and gpu-test-large-2: Jobs requiring 2 GPUs each

Complex GPU Scenarios (examples/gpu-jobs/complex-scenarios.yaml)

Demonstrates various GPU workload patterns:

  1. Training Job (gpu-a100-training)

    • Uses 2 NVIDIA A100 GPUs
    • Higher resource allocation (8 CPU cores, 32GB memory)
    • Runs in the research queue
    • Random completion time between 10 and 30 seconds
  2. Inference Job (gpu-v100-inference)

    • Uses 1 NVIDIA V100 GPU
    • Medium resource allocation (4 CPU cores, 16GB memory)
    • Runs in the production queue
    • Random completion time between 5 and 20 seconds
  3. Batch Processing (gpu-t4-batch)

    • Uses 1 NVIDIA T4 GPU
    • Lower resource allocation (2 CPU cores, 8GB memory)
    • Zone-specific scheduling (zone-a)
    • Runs in the test queue
    • Random completion time between 1 and 11 seconds
  4. Multi-Node Training (gpu-multi-node)

    • Distributed job with 4 parallel pods
    • Each pod uses 1 A100 GPU
    • Pod anti-affinity ensures distribution across nodes (sketched after this list)
    • Medium resource allocation per pod (4 CPU cores, 16GB memory)
    • Random completion time between 20 and 50 seconds
    • Runs in the research queue
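The anti-affinity pattern in the multi-node job looks roughly like the following (a sketch: the app label name is an assumption for illustration; examples/gpu-jobs/complex-scenarios.yaml is authoritative):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-multi-node
spec:
  parallelism: 4                          # four pods run concurrently
  completions: 4
  template:
    metadata:
      labels:
        app: gpu-multi-node               # label the anti-affinity rule matches on
        runai/queue: research
    spec:
      schedulerName: kai-scheduler
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname     # at most one such pod per node
              labelSelector:
                matchLabels:
                  app: gpu-multi-node
      containers:
        - name: trainer
          image: nvidia/cuda:11.0-base
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never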

Long-Running GPU Deployments (examples/gpu-jobs/deployments.yaml)

Contains persistent GPU workloads:

  1. Interactive Session (gpu-interactive)
    • Long-running deployment with 1 replica
    • Uses 1 NVIDIA V100 GPU
    • Medium resource allocation (4 CPU cores, 16GB memory)
    • Runs in research queue
    • Continuous operation with 30-second sleep intervals (see the fragment below)
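The container portion of such a deployment might look like this (a fragment for illustration; see examples/gpu-jobs/deployments.yaml for the full manifest):

containers:
  - name: session
    image: nvidia/cuda:11.0-base
    command: ["bash", "-c", "while true; do sleep 30; done"]   # loops forever so the pod stays Running
    resources:
      limits:
        nvidia.com/gpu: 1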

Common Features

All GPU configurations include the following, shown together in the fragment after this list:

  • KWOK node tolerations for simulated environment
  • Specific GPU type selectors
  • Resource limits matching requests
  • Base CUDA image (nvidia/cuda:11.0-base)
  • Queue labels for scheduling (runai/queue)
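Combined, these common elements appear in each pod template roughly as follows (the gpu-type label key is a hypothetical stand-in; use whatever key the node definitions actually set):

metadata:
  labels:
    runai/queue: test               # queue label for KAI Scheduler
spec:
  tolerations:
    - key: kwok.x-k8s.io/node       # tolerate the taint on simulated KWOK nodes
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    gpu-type: a100                  # hypothetical GPU type selector
  containers:
    - name: workload
      image: nvidia/cuda:11.0-base  # base CUDA image
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1         # limits match requests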

Applying the Configurations

Apply the configurations:

# Apply basic jobs
kubectl apply -f examples/gpu-jobs/jobs.yaml

# Apply complex scenarios
kubectl apply -f examples/gpu-jobs/complex-scenarios.yaml

# Apply long-running deployments
kubectl apply -f examples/gpu-jobs/deployments.yaml

# Monitor pods
kubectl get pods -w

These configurations work with the simulated GPU nodes defined in deployments/kai-scheduler/nodes.yaml, which includes the following (one node is sketched after the list):

  • A100 nodes with 4 GPUs each (80GB GPU memory)
  • V100 nodes with 4 GPUs each (32GB GPU memory)
  • T4 nodes with 4 GPUs each (16GB GPU memory)
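A simulated A100 node is declared roughly like this (a sketch: the kwok.x-k8s.io annotation and taint are what mark the node as KWOK-managed, while the label keys, capacity figures, and the way GPU memory is advertised are illustrative assumptions, so defer to nodes.yaml):

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a100-1
  annotations:
    kwok.x-k8s.io/node: fake          # managed by the KWOK controller
  labels:
    type: kwok
    gpu-type: a100                    # hypothetical label matched by workload selectors
    nvidia.com/gpu.memory: "81920"    # 80GB in MiB, if advertised via this label
spec:
  taints:
    - key: kwok.x-k8s.io/node         # keeps ordinary workloads off simulated nodes
      value: fake
      effect: NoSchedule
status:
  allocatable:
    cpu: "64"                         # illustrative capacity
    memory: 512Gi
    nvidia.com/gpu: "4"               # four simulated GPUs
  capacity:
    cpu: "64"
    memory: 512Gi
    nvidia.com/gpu: "4"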

Troubleshooting

Common issues and their solutions:

  1. Pods Stuck in Pending: Check simulated-node GPU capacity, queue quotas, and the pod's scheduling events (kubectl describe pod)
  2. Resource Quotas: Verify queue resource limits and current usage against what the workload requests
  3. Node Selection: Ensure workloads carry the right node selectors and the KWOK node tolerations

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
