A comprehensive benchmarking suite for testing different quantization techniques and memory optimization strategies for running Llama 8B models on NVIDIA RTX 4090 GPUs.
This project provides reproducible experiments comparing various approaches to running 8B parameter language models efficiently on consumer hardware, specifically targeting the 24GB VRAM limitation of the RTX 4090.
- Multiple Quantization Methods: FP16 baseline, BitsAndBytes 4-bit, AWQ, GPTQ
- Automated Benchmarking: Complete suite with performance and memory monitoring
- Configuration Management: Flexible vLLM configuration profiles
- Real-time Monitoring: GPU memory, temperature, and system resource tracking
- Comparative Analysis: Automated report generation with charts and statistics
- Blog Post Template: Ready-to-publish analysis framework
```
vllm-benchmark/
├── experiments/              # Individual quantization experiments
│   ├── base_fp16.py          # FP16 baseline experiment
│   ├── bnb_4bit.py           # BitsAndBytes 4-bit quantization
│   ├── awq_quantized.py      # AWQ quantization experiment
│   └── gptq_quantized.py     # GPTQ quantization experiment
├── benchmarks/               # Benchmarking and monitoring tools
│   ├── benchmark_runner.py   # Automated benchmark suite
│   └── system_monitor.py     # Real-time system monitoring
├── configs/                  # Configuration files and management
│   ├── vllm_configs.yaml     # vLLM configuration profiles
│   └── experiment_config.py  # Configuration loader utilities
├── results/                  # Generated results and reports (created during runs)
├── main.py                   # Entry point
└── README.md
```
- NVIDIA RTX 4090 with 24GB VRAM
- CUDA toolkit installed
- Python 3.11+
- UV package manager
```bash
# Clone and navigate to the project
cd llama-8b-memory-optimization

# Install dependencies using UV
uv sync

# Activate the virtual environment
source .venv/bin/activate   # Linux/Mac
# or
.venv\Scripts\activate      # Windows
```

Run the complete benchmark suite:

```bash
python benchmarks/benchmark_runner.py
```

This will:
- Run all quantization experiments sequentially
- Generate comparison reports and charts
- Save results in the `results/` directory
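Before kicking off the full suite, it can be worth confirming that PyTorch actually sees the RTX 4090 and its 24GB of VRAM. The snippet below is a generic sanity check using standard PyTorch calls; it is not a script shipped with this repository:

```python
# Quick GPU sanity check (generic, not part of this repo)
import torch

assert torch.cuda.is_available(), "CUDA not available - check driver and toolkit installation"

props = torch.cuda.get_device_properties(0)
print(f"GPU:  {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")  # ~24 GB expected on an RTX 4090
print(f"CUDA: {torch.version.cuda}")
```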
You can also run each experiment and tool individually:

```bash
# FP16 baseline
python experiments/base_fp16.py

# BitsAndBytes 4-bit quantization
python experiments/bnb_4bit.py

# AWQ quantization
python experiments/awq_quantized.py

# GPTQ quantization
python experiments/gptq_quantized.py

# Run system monitoring demo
python benchmarks/system_monitor.py

# View available configurations
python configs/experiment_config.py
```

The project includes optimized vLLM configuration profiles for different use cases:
- `base_fp16`: Standard FP16 baseline for maximum quality
- `memory_optimized`: Aggressive memory saving with quality trade-offs
- `throughput_optimized`: Optimized for maximum tokens per second
- `balanced`: Good balance of memory usage and performance
- `long_context`: Optimized for extended context sequences
- `production`: Stable settings for production deployment
Example usage:
```python
from vllm import LLM
from configs.experiment_config import ConfigManager

config_manager = ConfigManager()
kwargs = config_manager.create_vllm_kwargs('balanced', 'meta-llama/Meta-Llama-3.1-8B-Instruct')
llm = LLM(**kwargs)
```

The quantization methods covered by the experiments:

FP16 Baseline (`experiments/base_fp16.py`):
- Standard 16-bit floating point precision
- Maximum quality reference point
- ~16GB memory requirement for model weights
- Accessible through Transformers library
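The baseline can be loaded through the Transformers library or, like the rest of this project, through vLLM with the dtype pinned to FP16. A minimal sketch; the actual `experiments/base_fp16.py` may set additional options:

```python
from vllm import LLM, SamplingParams

# Minimal FP16 baseline sketch - keep weights and activations in half precision.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="float16",
    gpu_memory_utilization=0.90,   # leave a little headroom on the 24GB card
    max_model_len=4096,
)

outputs = llm.generate(["Explain the KV cache in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```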
BitsAndBytes 4-bit (`experiments/bnb_4bit.py`):
- NF4 quantization with double quantization
- Excellent quality preservation
- Easy to implement and experiment with
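NF4 with double quantization is typically configured through the Transformers `BitsAndBytesConfig`. A minimal sketch; the settings in `experiments/bnb_4bit.py` may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization with double quantization, as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("The RTX 4090 has", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```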
AWQ (`experiments/awq_quantized.py`):
- Hardware-optimized for inference speed
- Considers activation patterns during quantization
- Excellent vLLM integration
- Designed for production deployment
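vLLM loads pre-quantized AWQ checkpoints directly via its `quantization` argument. A sketch; the checkpoint id below is an example from the Hugging Face Hub, not necessarily the one used by `experiments/awq_quantized.py`:

```python
from vllm import LLM

# AWQ is post-training quantization, so the checkpoint must already contain AWQ weights.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # example AWQ checkpoint
    quantization="awq",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)
```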
GPTQ (`experiments/gptq_quantized.py`):
- Established post-training quantization method
- Good compression with quality trade-offs
- Broad framework support
- Mature implementation
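GPTQ checkpoints load the same way; only the `quantization` flag and the checkpoint change. Again a sketch with an example checkpoint id, not necessarily the one used in `experiments/gptq_quantized.py`:

```python
from vllm import LLM

# Load a GPTQ-quantized checkpoint (post-training quantized weights).
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",  # example GPTQ checkpoint
    quantization="gptq",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)
```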
After running experiments, results are automatically saved in the results/ directory:
- Individual Results: `{method}_{timestamp}.json` files with detailed metrics
- Comparison Data: `comparison_{timestamp}.json` with side-by-side analysis
- Charts: `comparison_chart_{timestamp}.png` with visualization
- Monitoring Data: System resource monitoring logs
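Because the per-run results are plain JSON, they are easy to post-process outside the built-in report generator. A hedged sketch that only assumes the files are JSON dictionaries; the exact keys depend on what `benchmark_runner.py` writes, so inspect a file to see the real schema:

```python
import json
from pathlib import Path

# Print a quick summary of every result file produced so far.
for path in sorted(Path("results").glob("*_*.json")):
    with path.open() as f:
        data = json.load(f)
    print(f"{path.name}: {len(data)} top-level fields -> {sorted(data)[:5]} ...")
```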
Key metrics captured for each run:
- Memory Usage: GPU and system memory consumption
- Performance: Tokens per second, model loading time
- Quality: Inference outputs for comparison
- System Resources: GPU utilization, temperature, power consumption
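GPU-side metrics like these are typically read through NVML. The sketch below queries them with the `pynvml` bindings and is independent of the project's `system_monitor.py`, which may be implemented differently:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts

print(f"VRAM used:   {mem.used / 1024**3:.2f} / {mem.total / 1024**3:.2f} GB")
print(f"GPU util:    {util.gpu}%")
print(f"Temperature: {temp} C")
print(f"Power draw:  {power / 1000:.0f} W")

pynvml.nvmlShutdown()
```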
- NVIDIA RTX 4090 (24GB VRAM)
- 32GB system RAM
- CUDA-compatible driver
Recommended setup:
- RTX 4090 with good cooling
- 64GB+ system RAM for comfortable experimentation
- Fast NVMe SSD for model storage
- Latest CUDA toolkit and drivers
If you hit CUDA out-of-memory errors, try reducing these parameters in `configs/vllm_configs.yaml`:

```yaml
gpu_memory_utilization: 0.85  # Reduce from 0.9
max_model_len: 2048           # Reduce from 4096
max_num_seqs: 64              # Reduce from 128
```

If model loading fails:
- Ensure you have sufficient disk space for model downloads
- Check internet connectivity for Hugging Face model downloads
- Verify CUDA installation with `nvidia-smi`

If inference is slower than expected:
- Enable CUDA graphs: `enforce_eager: false`
- Increase batch size: `max_num_seqs: 256`
- Check thermal throttling with monitoring tools
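These tuning knobs can also be applied programmatically by overriding the profile kwargs from `ConfigManager` before constructing the engine. A sketch, assuming `create_vllm_kwargs` returns a plain dict as the usage example above suggests:

```python
from vllm import LLM
from configs.experiment_config import ConfigManager

config_manager = ConfigManager()
kwargs = config_manager.create_vllm_kwargs('throughput_optimized',
                                           'meta-llama/Meta-Llama-3.1-8B-Instruct')

# Apply the tips above: keep CUDA graphs enabled and raise the batch size.
kwargs.update({
    "enforce_eager": False,   # allow CUDA graph capture
    "max_num_seqs": 256,      # larger batches for higher throughput
})

llm = LLM(**kwargs)
```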
General memory optimization tips (a configuration sketch follows the list):
- Start Conservative: Begin with lower memory utilization settings
- Monitor Resources: Use the system monitor to understand actual usage
- Adjust Gradually: Increase utilization incrementally
- Use Swap Space: Configure swap for memory pressure relief
- CPU Offloading: Use CPU offloading for extreme memory constraints
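For the last two tips, vLLM exposes `swap_space` (CPU swap for preempted sequences) and, in newer releases, `cpu_offload_gb` (offloading part of the weights to CPU RAM). A conservative sketch; check that your installed vLLM version supports these arguments before copying it:

```python
from vllm import LLM

# Conservative settings for a tight memory budget on the 24GB card.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.80,  # start conservative, raise gradually while monitoring
    max_model_len=2048,
    swap_space=8,                 # GiB of CPU swap space for preempted requests
    cpu_offload_gb=4,             # offload part of the weights to CPU RAM (newer vLLM only)
)
```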
We welcome contributions! Here are ways to help:
- Additional Quantization Methods: Implement new quantization techniques
- Hardware Variations: Test on different GPU configurations
- Quality Metrics: Improve model quality assessment methods
- Optimization Techniques: Add new memory optimization strategies
If you use this benchmarking suite in your research or blog posts, please consider citing:
```bibtex
@misc{llama8b-memory-optimization,
  title={Memory Optimization Deep Dive: Running 8B Models on RTX 4090},
  author={Alexey Ermolaev},
  year={2025},
  url={https://github.com/ermolushka/vllm-benchmark}
}
```

For detailed analysis and insights, see the complete blog post.