A comprehensive benchmarking suite for testing different quantization techniques and memory optimization strategies for running Llama 8B models on NVIDIA RTX 4090 GPUs.
This project provides reproducible experiments comparing various approaches to running 8B parameter language models efficiently on consumer hardware, specifically targeting the 24GB VRAM limitation of the RTX 4090.
- Multiple Quantization Methods: FP16 baseline, BitsAndBytes 4-bit, AWQ, GPTQ
- Automated Benchmarking: Complete suite with performance and memory monitoring
- Configuration Management: Flexible vLLM configuration profiles
- Real-time Monitoring: GPU memory, temperature, and system resource tracking
- Comparative Analysis: Automated report generation with charts and statistics
- Blog Post Template: Ready-to-publish analysis framework
```
vllm-benchmark/
├── experiments/              # Individual quantization experiments
│   ├── base_fp16.py          # FP16 baseline experiment
│   ├── bnb_4bit.py           # BitsAndBytes 4-bit quantization
│   ├── awq_quantized.py      # AWQ quantization experiment
│   └── gptq_quantized.py     # GPTQ quantization experiment
├── benchmarks/               # Benchmarking and monitoring tools
│   ├── benchmark_runner.py   # Automated benchmark suite
│   └── system_monitor.py     # Real-time system monitoring
├── configs/                  # Configuration files and management
│   ├── vllm_configs.yaml     # vLLM configuration profiles
│   └── experiment_config.py  # Configuration loader utilities
├── results/                  # Generated results and reports (created during runs)
├── main.py                   # Entry point
└── README.md
```
- NVIDIA RTX 4090 with 24GB VRAM
- CUDA toolkit installed
- Python 3.11+
- UV package manager
```bash
# Clone and navigate to the project
cd llama-8b-memory-optimization

# Install dependencies using UV
uv sync

# Activate the virtual environment
source .venv/bin/activate   # Linux/Mac
# or
.venv\Scripts\activate      # Windows
```

Run the complete benchmark suite:

```bash
python benchmarks/benchmark_runner.py
```

This will:
- Run all quantization experiments sequentially
- Generate comparison reports and charts
- Save results in the `results/` directory
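Before kicking off the full suite, it can be worth confirming that PyTorch actually sees the RTX 4090 and its 24GB of VRAM. The snippet below is a generic sanity check using standard PyTorch calls; it is not a script shipped with this repository:

```python
# Quick GPU sanity check (generic, not part of this repo)
import torch

assert torch.cuda.is_available(), "CUDA not available - check driver and toolkit installation"

props = torch.cuda.get_device_properties(0)
print(f"GPU:  {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")  # ~24 GB expected on an RTX 4090
print(f"CUDA: {torch.version.cuda}")
```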
You can also run each experiment and tool individually:

```bash
# FP16 baseline
python experiments/base_fp16.py

# BitsAndBytes 4-bit quantization
python experiments/bnb_4bit.py

# AWQ quantization
python experiments/awq_quantized.py

# GPTQ quantization
python experiments/gptq_quantized.py

# Run system monitoring demo
python benchmarks/system_monitor.py

# View available configurations
python configs/experiment_config.py
```

The project includes optimized vLLM configuration profiles for different use cases:
- `base_fp16`: Standard FP16 baseline for maximum quality
- `memory_optimized`: Aggressive memory saving with quality trade-offs
- `throughput_optimized`: Optimized for maximum tokens per second
- `balanced`: Good balance of memory usage and performance
- `long_context`: Optimized for extended context sequences
- `production`: Stable settings for production deployment
Example usage:
```python
from vllm import LLM
from configs.experiment_config import ConfigManager

config_manager = ConfigManager()
kwargs = config_manager.create_vllm_kwargs('balanced', 'meta-llama/Meta-Llama-3.1-8B-Instruct')
llm = LLM(**kwargs)
```

The quantization methods covered by the experiments:

FP16 Baseline (`experiments/base_fp16.py`):
- Standard 16-bit floating point precision
- Maximum quality reference point
- ~16GB memory requirement for model weights
- Accessible through Transformers library
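The baseline can be loaded through the Transformers library or, like the rest of this project, through vLLM with the dtype pinned to FP16. A minimal sketch; the actual `experiments/base_fp16.py` may set additional options:

```python
from vllm import LLM, SamplingParams

# Minimal FP16 baseline sketch - keep weights and activations in half precision.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="float16",
    gpu_memory_utilization=0.90,   # leave a little headroom on the 24GB card
    max_model_len=4096,
)

outputs = llm.generate(["Explain the KV cache in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```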
BitsAndBytes 4-bit (`experiments/bnb_4bit.py`):
- NF4 quantization with double quantization
- Excellent quality preservation
- Easy to implement and experiment with
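NF4 with double quantization is typically configured through the Transformers `BitsAndBytesConfig`. A minimal sketch; the settings in `experiments/bnb_4bit.py` may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization with double quantization, as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("The RTX 4090 has", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```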
AWQ (`experiments/awq_quantized.py`):
- Hardware-optimized for inference speed
- Considers activation patterns during quantization
- Excellent vLLM integration
- Designed for production deployment
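vLLM loads pre-quantized AWQ checkpoints directly via its `quantization` argument. A sketch; the checkpoint id below is an example from the Hugging Face Hub, not necessarily the one used by `experiments/awq_quantized.py`:

```python
from vllm import LLM

# AWQ is post-training quantization, so the checkpoint must already contain AWQ weights.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # example AWQ checkpoint
    quantization="awq",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)
```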
GPTQ (`experiments/gptq_quantized.py`):
- Established post-training quantization method
- Good compression with quality trade-offs
- Broad framework support
- Mature implementation
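GPTQ checkpoints load the same way; only the `quantization` flag and the checkpoint change. Again a sketch with an example checkpoint id, not necessarily the one used in `experiments/gptq_quantized.py`:

```python
from vllm import LLM

# Load a GPTQ-quantized checkpoint (post-training quantized weights).
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",  # example GPTQ checkpoint
    quantization="gptq",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)
```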
After running experiments, results are automatically saved in the results/ directory:
- Individual Results: `{method}_{timestamp}.json` files with detailed metrics
- Comparison Data: `comparison_{timestamp}.json` with side-by-side analysis
- Charts: `comparison_chart_{timestamp}.png` with visualization
- Monitoring Data: System resource monitoring logs
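Because the per-run results are plain JSON, they are easy to post-process outside the built-in report generator. A hedged sketch that only assumes the files are JSON dictionaries; the exact keys depend on what `benchmark_runner.py` writes, so inspect a file to see the real schema:

```python
import json
from pathlib import Path

# Print a quick summary of every result file produced so far.
for path in sorted(Path("results").glob("*_*.json")):
    with path.open() as f:
        data = json.load(f)
    print(f"{path.name}: {len(data)} top-level fields -> {sorted(data)[:5]} ...")
```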
Key metrics captured for each run:
- Memory Usage: GPU and system memory consumption
- Performance: Tokens per second, model loading time
- Quality: Inference outputs for comparison
- System Resources: GPU utilization, temperature, power consumption
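GPU-side metrics like these are typically read through NVML. The sketch below queries them with the `pynvml` bindings and is independent of the project's `system_monitor.py`, which may be implemented differently:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts

print(f"VRAM used:   {mem.used / 1024**3:.2f} / {mem.total / 1024**3:.2f} GB")
print(f"GPU util:    {util.gpu}%")
print(f"Temperature: {temp} C")
print(f"Power draw:  {power / 1000:.0f} W")

pynvml.nvmlShutdown()
```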
- NVIDIA RTX 4090 (24GB VRAM)
- 32GB system RAM
- CUDA-compatible driver
Recommended setup:
- RTX 4090 with good cooling
- 64GB+ system RAM for comfortable experimentation
- Fast NVMe SSD for model storage
- Latest CUDA toolkit and drivers
If you hit CUDA out-of-memory errors, try reducing these parameters in `configs/vllm_configs.yaml`:

```yaml
gpu_memory_utilization: 0.85  # Reduce from 0.9
max_model_len: 2048           # Reduce from 4096
max_num_seqs: 64              # Reduce from 128
```

If model loading fails:
- Ensure you have sufficient disk space for model downloads
- Check internet connectivity for Hugging Face model downloads
- Verify CUDA installation with `nvidia-smi`

If inference is slower than expected:
- Enable CUDA graphs: `enforce_eager: false`
- Increase batch size: `max_num_seqs: 256`
- Check thermal throttling with monitoring tools
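These tuning knobs can also be applied programmatically by overriding the profile kwargs from `ConfigManager` before constructing the engine. A sketch, assuming `create_vllm_kwargs` returns a plain dict as the usage example above suggests:

```python
from vllm import LLM
from configs.experiment_config import ConfigManager

config_manager = ConfigManager()
kwargs = config_manager.create_vllm_kwargs('throughput_optimized',
                                           'meta-llama/Meta-Llama-3.1-8B-Instruct')

# Apply the tips above: keep CUDA graphs enabled and raise the batch size.
kwargs.update({
    "enforce_eager": False,   # allow CUDA graph capture
    "max_num_seqs": 256,      # larger batches for higher throughput
})

llm = LLM(**kwargs)
```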
General memory optimization tips (a configuration sketch follows the list):
- Start Conservative: Begin with lower memory utilization settings
- Monitor Resources: Use the system monitor to understand actual usage
- Adjust Gradually: Increase utilization incrementally
- Use Swap Space: Configure swap for memory pressure relief
- CPU Offloading: Use CPU offloading for extreme memory constraints
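For the last two tips, vLLM exposes `swap_space` (CPU swap for preempted sequences) and, in newer releases, `cpu_offload_gb` (offloading part of the weights to CPU RAM). A conservative sketch; check that your installed vLLM version supports these arguments before copying it:

```python
from vllm import LLM

# Conservative settings for a tight memory budget on the 24GB card.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.80,  # start conservative, raise gradually while monitoring
    max_model_len=2048,
    swap_space=8,                 # GiB of CPU swap space for preempted requests
    cpu_offload_gb=4,             # offload part of the weights to CPU RAM (newer vLLM only)
)
```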
We welcome contributions! Here are ways to help:
- Additional Quantization Methods: Implement new quantization techniques
- Hardware Variations: Test on different GPU configurations
- Quality Metrics: Improve model quality assessment methods
- Optimization Techniques: Add new memory optimization strategies
If you use this benchmarking suite in your research or blog posts, please consider citing:
```bibtex
@misc{llama8b-memory-optimization,
  title={Memory Optimization Deep Dive: Running 8B Models on RTX 4090},
  author={Alexey Ermolaev},
  year={2025},
  url={https://github.com/ermolushka/vllm-benchmark}
}
```

For detailed analysis and insights, see the complete blog post.