GPU Scheduler

A command-line utility built on top of nvitop that monitors NVIDIA GPUs and automatically executes queued commands when specified GPUs become idle.

✨ Features

  • 🔍 GPU Monitoring: Real-time GPU status display with live updates
  • 📋 Task Queue: Persistent queue for commands waiting on specific GPUs
  • ⚙️ Background Scheduler: Daemon process that continues running after TUI exit
  • 🎯 Auto-Execution: Automatically launches tasks when GPUs become idle
  • 💾 Data Persistence: All data saved to disk, survives TUI/terminal restarts
  • 🎨 Interactive TUI: Beautiful terminal UI for task management
  • 🔔 Notifications: Email, webhook, or stdout notifications
  • 🔒 Single Instance: Prevents multiple scheduler conflicts

🚀 Quick Start

Installation

Install globally to use gst anywhere (Recommended):

# Using uv
uv tool install .

# Using pipx
pipx install .

Or install in the current virtual environment:

# Using pip
pip install .
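
After installing, you can confirm that both entry points (gst for the TUI, gpus for the CLI) are on your PATH; this quick check uses only a standard shell built-in:

# Verify the installed entry points resolve
command -v gst gpus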

5-Minute Tutorial

1. Launch TUI

gst

2. Submit a task

  • Select "📝 Submit"
  • Enter GPU indices: 0
  • Enter command: python train.py --epochs 3
  • Confirm

3. Start scheduler

  • Select "⚙️ Start Scheduler"
  • Exit TUI (scheduler keeps running!)

4. Check status anytime

gst  # Reopen TUI
# Select "📊 View Status" to see running tasks

That's it! Your tasks will execute automatically when GPUs become idle. 🎉

Command-Line Quick Reference

# Monitor GPU
gpus monitor --watch

# Submit task
gpus submit --gpu 0 -- python train.py

# View queue
gpus queue

# Start scheduler (background)
nohup gpus scheduler > scheduler.log 2>&1 &

📖 Documentation

New to GPU Scheduler? Start here:

Need detailed reference?

Want to understand internals?

Developer resources:

💡 Key Concepts

Non-Interactive Shell Environment

⚠️ Important: Tasks run in non-interactive shells, so your .bashrc is not sourced and no conda/virtualenv is activated automatically.

Solution: use an absolute interpreter path or a wrapper script:

# ✅ Recommended
gpus submit --gpu 0 -- /path/to/.venv/bin/python train.py

# ✅ Or use wrapper script
cat > run_task.sh << 'EOF'
#!/bin/bash
cd /path/to/project
source .venv/bin/activate
python train.py
EOF

gpus submit --gpu 0 -- bash run_task.sh
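
The same absolute-path approach works for a conda environment: point the command at that environment's own python binary. The path below is a placeholder for illustration, not something created by this tool:

# ✅ Conda env via its absolute interpreter path (adjust the placeholder path)
gpus submit --gpu 0 -- /path/to/miniconda3/envs/myenv/bin/python train.py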

See FAQ - Virtual Environments for details.

Background Scheduler

The scheduler runs independently as a daemon process:

# Start via TUI
gst → Start Scheduler → Exit

# Or via CLI (recommended for production)
nohup gpus scheduler --log-dir ./logs > scheduler.log 2>&1 &

# Scheduler survives:
# ✅ TUI exit
# ✅ Terminal close  
# ✅ SSH disconnect

See FAQ - Background Running for production setup.
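
If you prefer to manage the daemon by hand, the following is a minimal sketch using generic process tools; the pgrep/pkill match pattern is an assumption about the process command line, not part of the gpus CLI:

# Is the scheduler still alive after an SSH reconnect?
pgrep -af "gpus scheduler"

# Follow its log (default data directory, see Data Persistence below)
tail -f ~/.local/share/gpu_scheduler/scheduler.log

# Ask it to exit gracefully (SIGTERM); assumes nothing else matches this pattern
pkill -TERM -f "gpus scheduler"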

Data Persistence

All data is stored in ~/.local/share/gpu_scheduler/:

~/.local/share/gpu_scheduler/
├── queue.json              # Task queue
├── task_history.json       # Execution history
├── scheduler_state.json    # Running tasks
└── scheduler.log           # Logs

Guarantees:

  • ✅ Tasks persist across TUI sessions
  • ✅ Queue survives system restarts
  • ✅ Atomic operations prevent corruption
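
Because the state is plain JSON plus a log file, it can be inspected or backed up with standard tools; this sketch does not depend on any particular schema inside the files:

# Pretty-print the current queue (read-only)
python -m json.tool ~/.local/share/gpu_scheduler/queue.json

# Snapshot the whole data directory before experimenting
cp -r ~/.local/share/gpu_scheduler ~/gpu_scheduler_backup_$(date +%Y%m%d)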

🎯 Common Use Cases

Multi-Task Training

# Submit multiple training jobs
gpus submit --gpu 0 -- python train_model_a.py
gpus submit --gpu 0 -- python train_model_b.py
gpus submit --gpu 1 -- python train_model_c.py

# Start scheduler
gst → Start Scheduler → Exit

# Jobs execute automatically in order

Multi-GPU Task

# Task requires 2 GPUs
gpus submit --gpu 0 1 -- python large_model.py

# The task will wait until BOTH GPUs are idle

Task Retry

# Enable automatic retry (3 attempts, exponential backoff)
gpus submit --gpu 0 --retry -- python flaky_job.py

# Custom retry
gpus submit \
  --gpu 0 \
  --retry \
  --max-retries 5 \
  --retry-delay 60.0 \
  -- python train.py
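
For intuition only: if the delay doubles after each failed attempt (the multiplier is not stated in this README, so treat it as an assumption), the waits for the custom example above would be:

# Hypothetical doubling backoff starting from --retry-delay 60.0
for i in 0 1 2 3 4; do
  echo "wait before retry $((i + 1)): $((60 * 2 ** i))s"
done
# -> 60s, 120s, 240s, 480s, 960s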

Stop Running Task

# Via TUI
gst → Stop Task → Select task → Confirm

# Via CLI
gpus queue --stop <task_id>

# Graceful termination:
# 1. SIGTERM sent (5s timeout)
# 2. SIGKILL if needed
# 3. GPU resources released
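
The same escalation can be reproduced by hand for a stray process; this is a generic shell sketch of the behaviour described above, not the scheduler's own code, and the PID is a placeholder:

PID=12345                                         # placeholder: replace with the stray process ID
kill -TERM "$PID"                                 # polite shutdown request
sleep 5                                           # the 5 s grace period described above
kill -0 "$PID" 2>/dev/null && kill -KILL "$PID"   # force-kill only if it is still running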

🧪 Testing

# Quick GPU test (allocate 2GB for 60s)
python tests/tools/gpu_test.py --memory 2048 --duration 60

# Interactive test scenarios
bash tests/tools/run_scenarios.sh

# Run automated tests
pytest tests/integration/

See tests/README.md for a comprehensive testing guide.

🆘 Troubleshooting

Tasks not executing?

  1. Check scheduler is running: ps aux | grep "gpus scheduler"
  2. Check GPU status: gpus monitor
  3. View logs: tail -f ~/.local/share/gpu_scheduler/scheduler.log

Virtual environment issues?

More help:

🤝 Contributing

Contributions welcome! Please check:

📄 License

See LICENSE file for details.

🙏 Acknowledgments

Built on top of the excellent nvitop library.
