
IRL Alignment Auditor


A production-ready pipeline for auditing and refining LLM objectives using Inverse Reinforcement Learning (IRL) with Variational Inference. This package provides a clean, organized implementation of Bayesian IRL methods for learning reward models from human preferences.

Paper: The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives

🚀 Quick Setup

1. Install Dependencies

# Clone the repository
git clone https://github.com/Matthieu6/IRL-Alignment-Auditor.git
cd IRL-Alignment-Auditor

# Install Python dependencies
pip install -r requirements.txt

2. Hugging Face Token Setup

Important: You need a Hugging Face token to download models and save results to the Hub.

  1. Get your token from: https://huggingface.co/settings/tokens
  2. Login using the CLI:
huggingface-cli login

Or set the environment variable:

export HUGGING_FACE_HUB_TOKEN="your_token_here"
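
Alternatively, you can authenticate from Python with the huggingface_hub library. A minimal sketch (the token value is a placeholder):

# Programmatic login via huggingface_hub
from huggingface_hub import login

# Pass your token explicitly, or call login() with no arguments to be prompted interactively
login(token="your_token_here")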

3. Verify Installation

# Test the installation
python -c "import irl_pipeline; print('Installation successful!')"

📓 Google Colab Example

Try it now: the example notebook can be opened directly in Google Colab.

The example notebook demonstrates:

  • Complete pipeline execution with Llama-3.2-1B (including IRL+RLHF)
  • Custom model configurations
  • Individual component usage
  • Troubleshooting tips

Note: The complete IRL+RLHF pipeline is only available for Llama-3.2-1B. Other models support dataset generation and IRL training only.

🎯 Main Variables to Change

Model Configuration

The pipeline uses two main models that you can easily change:

Toxic Model (generates toxic content):

  • EleutherAI/pythia-70m (small)
  • EleutherAI/pythia-410m (medium)
  • EleutherAI/gpt-neo-125m (small)
  • meta-llama/Llama-3.2-1B (default, large)

Non-Toxic Model (generates detoxified content):

  • ajagota71/pythia-70m-s-nlp-detox-checkpoint-epoch-100
  • ajagota71/pythia-410m-s-nlp-detox-checkpoint-epoch-100
  • ajagota71/gpt-neo-125m-s-nlp-detox-checkpoint-epoch-100
  • ajagota71/llama-3-2-1b-rlhf-kl-p5-target-2p5-lr-3e-6-checkpoint-epoch-100 (default)

⚠️ RLHF Limitation: The complete IRL+RLHF pipeline (using learned reward models) is currently only supported for the Llama-3.2-1B model. Other models can be used for dataset generation and IRL training, but RLHF will use standard training without the learned IRL reward function.
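
To sanity-check a toxic/detoxified model pair before launching the full pipeline, you can load both with the transformers library. A minimal sketch: the model names come from the lists above, while the prompt and generation settings are illustrative and not the pipeline's defaults.

# Load a toxic/detoxified pair and compare a single greedy completion
from transformers import AutoModelForCausalLM, AutoTokenizer

toxic_name = "EleutherAI/pythia-70m"
detox_name = "ajagota71/pythia-70m-s-nlp-detox-checkpoint-epoch-100"

# The detox checkpoint is assumed to share the base model's tokenizer
tokenizer = AutoTokenizer.from_pretrained(toxic_name)
toxic_model = AutoModelForCausalLM.from_pretrained(toxic_name)
detox_model = AutoModelForCausalLM.from_pretrained(detox_name)

inputs = tokenizer("I can't believe you would", return_tensors="pt")
for label, model in [("toxic", toxic_model), ("detoxified", detox_model)]:
    output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(label, "->", tokenizer.decode(output[0], skip_special_tokens=True))

The Llama-3.2-1B pair loads the same way, but needs correspondingly more memory (and a GPU for reasonable speed).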

Other Key Variables

  • train_samples: Number of training samples (default: 3100)
  • test_samples: Number of test samples (default: 600)
  • toxicity_threshold: Threshold for filtering toxic prompts (default: 0.9)
  • sort_threshold: Threshold for sorting outputs (default: 0.7)

πŸƒβ€β™‚οΈ How to Run

Complete Pipeline (Recommended for first run)

# Run everything: dataset generation + IRL training + analysis (default: llama-1B)
./run_pipeline.sh complete

Individual Components

1. Generate Datasets Only

# Generate datasets with default models
./run_pipeline.sh generate

# Generate with custom models
./run_pipeline.sh generate \
  toxic_model=meta-llama/Llama-3.2-1B \
  non_toxic_model=ajagota71/llama-3-2-1b-rlhf-kl-p5-target-2p5-lr-3e-6-checkpoint-epoch-100

2. Train IRL Model Only

# Train IRL model (requires existing datasets)
./run_pipeline.sh train

# Train with custom parameters
./run_pipeline.sh train \
  training.n_steps=5000 \
  training.learning_rate=0.02

3. Run RLHF Training

# Run RLHF with default model
./run_pipeline.sh rlhf

# Run RLHF with specific model
./run_pipeline.sh rlhf \
  rlhf_config.model=llama_3_2_1b

# Available RLHF models:
# - smolLM_135m
# - smolLM_360m  
# - llama_3_2_1b

⚠️ Important: RLHF using the IRL reward model is currently only available for the Llama-3.2-1B model. Other models use standard RLHF training without the learned IRL reward function.

4. Analyze Spurious Features

# Run spurious features analysis (requires trained model)
./run_pipeline.sh analyze

5. Evaluate RLHF Models

# Evaluate a trained RLHF model (trained_model_root is the local path or Hugging Face name of the trained model)
./run_pipeline.sh evaluate \
  evaluate_rlhf.model.trained_model_root=user/model-name

📊 Outputs

Generated Datasets

  • datasets/*_samples_original.json: Original model outputs
  • datasets/*_samples_detoxified.json: Detoxified model outputs
  • datasets/sorted_toxic_dataset_*.json: Sorted toxic samples
  • datasets/sorted_non_toxic_dataset_*.json: Sorted non-toxic samples
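
The generated files are plain JSON, so they can be inspected directly. The sketch below only assumes each file contains a list of records; print one record's keys to see the actual schema rather than relying on guessed field names.

# Peek at the generated dataset files (schema is discovered, not hard-coded)
import json
from pathlib import Path

for file in sorted(Path("datasets").glob("*_samples_original.json")):
    with open(file) as f:
        records = json.load(f)
    print(f"{file.name}: {len(records)} records")
    if isinstance(records, list) and records:
        print("example keys:", list(records[0].keys()))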

Training Results

  • outputs/re_irl/{timestamp}/: Training outputs directory
      • round_{i}/: Results for each training round
      • summary.json: Summary of all rounds
      • Various plots and visualizations

🔧 Configuration Files

The pipeline uses Hydra for configuration management:

  • configs/full_pipeline.yaml: Main pipeline configuration
  • configs/dataset.yaml: Dataset generation settings
  • configs/re_irl_config.yaml: IRL training parameters
  • configs/rlhf_config.yaml: RLHF training parameters
  • configs/rlhf/: Model-specific RLHF configurations
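
Because the pipeline is configured with Hydra, the same YAML files can be composed and overridden from Python, which is handy in notebooks. A minimal sketch (assumes Hydra >= 1.2; the override keys mirror the command-line examples above, but check the YAML files for the exact structure):

# Compose the IRL training config with programmatic overrides
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="re_irl_config",
        overrides=["training.n_steps=5000", "training.learning_rate=0.02"],
    )

print(OmegaConf.to_yaml(cfg))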

🧪 Compatible Models

Small Models (Fast, Good for Testing)

  • EleutherAI/pythia-70m
  • EleutherAI/gpt-neo-125m
  • HuggingFaceTB/SmolLM-135M

Medium Models (Balanced)

  • EleutherAI/pythia-410m
  • HuggingFaceTB/SmolLM-360M

Large Models (Best Performance)

  • meta-llama/Llama-3.2-1B ⭐ (Recommended for IRL+RLHF pipeline)

🚨 Troubleshooting

Common Issues

CUDA Out of Memory:

# Use smaller models
./run_pipeline.sh complete toxic_model=EleutherAI/pythia-70m

Missing Dependencies:

pip install -r requirements.txt

Permission Errors:

chmod +x run_pipeline.sh

🔬 Research Background

This implementation is based on:

  • Inverse Reinforcement Learning: Learning reward functions from human preferences
  • Variational Inference: Efficient Bayesian inference for large-scale problems
  • Bradley-Terry Model: Probabilistic ranking model for pairwise preferences
  • Text Detoxification: Reducing toxicity in language model outputs
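
For intuition, the Bradley-Terry model treats a pairwise preference as a logistic function of a reward difference: P(a preferred over b) = sigmoid(r(a) - r(b)). The sketch below shows the standard negative log-likelihood used to fit a point-estimate reward model from preference pairs; it is illustrative only, not this repository's exact objective, which instead maintains a variational Bayesian posterior over the reward model.

# Bradley-Terry negative log-likelihood for pairwise preferences
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_preferred: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_preferred - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Toy rewards for four preference pairs (e.g. non-toxic vs. toxic completions)
r_preferred = torch.tensor([1.2, 0.3, 2.0, 0.8])
r_rejected = torch.tensor([0.1, 0.5, 1.1, -0.4])
print(bradley_terry_loss(r_preferred, r_rejected))  # smaller when preferred rewards dominate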

📄 Citation

If you use this code in your research, please cite:

@article{bou2025alignment,
  title={The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives},
  author={Bou, Matthieu and Patel, Nyal and Jagota, Arjun and Krishna, Satyapriya and Parbhoo, Sonali},
  journal={arXiv preprint arXiv:2510.06096},
  year={2025},
  url={https://arxiv.org/abs/2510.06096}
}

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
