Llama 8B Memory Optimization on RTX 4090

A comprehensive benchmarking suite for testing different quantization techniques and memory optimization strategies for running Llama 8B models on NVIDIA RTX 4090 GPUs.

Overview

This project provides reproducible experiments comparing various approaches to running 8B parameter language models efficiently on consumer hardware, specifically targeting the 24GB VRAM limitation of the RTX 4090.

Features

  • Multiple Quantization Methods: FP16 baseline, BitsAndBytes 4-bit, AWQ, GPTQ
  • Automated Benchmarking: Complete suite with performance and memory monitoring
  • Configuration Management: Flexible vLLM configuration profiles
  • Real-time Monitoring: GPU memory, temperature, and system resource tracking
  • Comparative Analysis: Automated report generation with charts and statistics
  • Blog Post Template: Ready-to-publish analysis framework

Project Structure

vllm-benchmark/
├── experiments/           # Individual quantization experiments
│   ├── base_fp16.py      # FP16 baseline experiment
│   ├── bnb_4bit.py       # BitsAndBytes 4-bit quantization
│   ├── awq_quantized.py  # AWQ quantization experiment
│   └── gptq_quantized.py # GPTQ quantization experiment
├── benchmarks/           # Benchmarking and monitoring tools
│   ├── benchmark_runner.py    # Automated benchmark suite
│   └── system_monitor.py      # Real-time system monitoring
├── configs/              # Configuration files and management
│   ├── vllm_configs.yaml      # vLLM configuration profiles
│   └── experiment_config.py   # Configuration loader utilities
├── results/              # Generated results and reports (created during runs)
├── main.py               # Entry point
└── README.md

Quick Start

Prerequisites

  • NVIDIA RTX 4090 with 24GB VRAM
  • CUDA toolkit installed
  • Python 3.11+
  • UV package manager

Installation

# Clone and navigate to the project
git clone https://github.com/ermolushka/vllm-benchmark.git
cd vllm-benchmark

# Install dependencies using UV
uv sync

# Activate the virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

Running Experiments

Run All Experiments (Recommended)

python benchmarks/benchmark_runner.py

This will:

  1. Run all quantization experiments sequentially
  2. Generate comparison reports and charts
  3. Save results in the results/ directory

Run Individual Experiments

# FP16 baseline
python experiments/base_fp16.py

# BitsAndBytes 4-bit quantization
python experiments/bnb_4bit.py

# AWQ quantization
python experiments/awq_quantized.py

# GPTQ quantization
python experiments/gptq_quantized.py

Monitor System Resources

# Run system monitoring demo
python benchmarks/system_monitor.py

Test Configuration Management

# View available configurations
python configs/experiment_config.py

Configuration Profiles

The project includes optimized vLLM configuration profiles for different use cases:

  • base_fp16: Standard FP16 baseline for maximum quality
  • memory_optimized: Aggressive memory saving with quality trade-offs
  • throughput_optimized: Optimized for maximum tokens/second
  • balanced: Good balance of memory usage and performance
  • long_context: Optimized for extended context sequences
  • production: Stable settings for production deployment
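
For illustration only, a profile entry in configs/vllm_configs.yaml might look like the sketch below; the exact keys and values shipped with this repository may differ, but gpu_memory_utilization, max_model_len, max_num_seqs, and enforce_eager are the knobs referenced throughout this README.

# Hypothetical sketch of one profile; see configs/vllm_configs.yaml for the real values
balanced:
  gpu_memory_utilization: 0.9   # fraction of the 24GB VRAM vLLM may claim
  max_model_len: 4096           # maximum context length in tokens
  max_num_seqs: 128             # maximum concurrent sequences per batch
  enforce_eager: false          # keep CUDA graphs enabled for speed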

Example usage:

from vllm import LLM

from configs.experiment_config import ConfigManager

config_manager = ConfigManager()
kwargs = config_manager.create_vllm_kwargs('balanced', 'meta-llama/Meta-Llama-3.1-8B-Instruct')
llm = LLM(**kwargs)

Quantization Methods Tested

1. FP16 Baseline

  • Standard 16-bit floating point precision
  • Maximum quality reference point
  • ~16GB memory requirement for model weights
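
As a back-of-the-envelope check: 8 × 10⁹ parameters × 2 bytes per FP16 weight ≈ 16 GB, before the KV cache and activations claim additional VRAM.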

2. BitsAndBytes 4-bit

  • Accessible through Transformers library
  • NF4 quantization with double quantization
  • Excellent quality preservation
  • Easy to implement and experiment with
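
A minimal sketch of NF4 quantization with double quantization through the Transformers API (experiments/bnb_4bit.py may differ in its details):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 weights with double quantization, computing activations in FP16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)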

3. AWQ (Activation-aware Weight Quantization)

  • Hardware-optimized for inference speed
  • Considers activation patterns during quantization
  • Excellent vLLM integration
  • Designed for production deployment
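
A sketch of loading a pre-quantized AWQ checkpoint through vLLM; the checkpoint name below is a placeholder, and experiments/awq_quantized.py may point at a different one.

from vllm import LLM, SamplingParams

# Any AWQ-quantized Llama 3.1 8B checkpoint works here; the name below is illustrative
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Explain AWQ in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)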

4. GPTQ

  • Established post-training quantization method
  • Good compression with quality trade-offs
  • Broad framework support
  • Mature implementation
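
The vLLM call mirrors the AWQ sketch above: pass a GPTQ-quantized checkpoint and quantization="gptq" (recent vLLM releases can also infer the method from the checkpoint's quantization config).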

Results and Analysis

After running experiments, results are automatically saved in the results/ directory:

  • Individual Results: {method}_{timestamp}.json files with detailed metrics
  • Comparison Data: comparison_{timestamp}.json with side-by-side analysis
  • Charts: comparison_chart_{timestamp}.png with visualization
  • Monitoring Data: System resource monitoring logs

Key Metrics Tracked

  • Memory Usage: GPU and system memory consumption
  • Performance: Tokens per second, model loading time
  • Quality: Inference outputs for comparison
  • System Resources: GPU utilization, temperature, power consumption
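
As a rough illustration of how such metrics can be polled (benchmarks/system_monitor.py may take a different approach), a simple NVML sampling loop looks like this:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the RTX 4090

for _ in range(5):  # take a few samples
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMP_GPU)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
    print(f"mem {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB | "
          f"util {util.gpu}% | {temp} C | {power_w:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()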

Hardware Requirements

Minimum Requirements

  • NVIDIA RTX 4090 (24GB VRAM)
  • 32GB system RAM
  • CUDA-compatible driver

Recommended Setup

  • RTX 4090 with good cooling
  • 64GB+ system RAM for comfortable experimentation
  • Fast NVMe SSD for model storage
  • Latest CUDA toolkit and drivers

Troubleshooting

Common Issues

Out of Memory Errors

# Try reducing these parameters in configs/vllm_configs.yaml
gpu_memory_utilization: 0.85  # Reduce from 0.9
max_model_len: 2048          # Reduce from 4096
max_num_seqs: 64             # Reduce from 128

Model Loading Failures

  • Ensure you have sufficient disk space for model downloads
  • Check internet connectivity for Hugging Face model downloads
  • Verify CUDA installation with nvidia-smi

Slow Performance

  • Enable CUDA graphs: enforce_eager: false
  • Increase batch size: max_num_seqs: 256
  • Check thermal throttling with monitoring tools

Memory Optimization Tips

  1. Start Conservative: Begin with lower memory utilization settings
  2. Monitor Resources: Use the system monitor to understand actual usage
  3. Adjust Gradually: Increase utilization incrementally
  4. Use Swap Space: Configure swap for memory pressure relief
  5. CPU Offloading: Offload part of the weights to CPU RAM under extreme memory constraints (see the sketch below)
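
For the last two tips, vLLM exposes swap_space (CPU swap in GiB per GPU) and cpu_offload_gb (gigabytes of weights kept in CPU RAM). A hedged sketch, assuming a recent vLLM release where both parameters are available:

from vllm import LLM

# Conservative settings for a tight 24GB budget; tune upward while monitoring
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.85,  # tip 1: start conservative
    max_model_len=2048,
    swap_space=8,                 # tip 4: 8 GiB of CPU swap for preempted sequences
    cpu_offload_gb=4,             # tip 5: push ~4 GB of weights to system RAM
)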

Contributing

We welcome contributions! Here are ways to help:

  • Additional Quantization Methods: Implement new quantization techniques
  • Hardware Variations: Test on different GPU configurations
  • Quality Metrics: Improve model quality assessment methods
  • Optimization Techniques: Add new memory optimization strategies

Citation

If you use this benchmarking suite in your research or blog posts, please consider citing:

@misc{llama8b-memory-optimization,
  title={Memory Optimization Deep Dive: Running 8B Models on RTX 4090},
  author={Alexey Ermolaev},
  year={2025},
  url={https://github.com/ermolushka/vllm-benchmark}
}

For detailed analysis and insights, see the complete blog post.
