A production-ready pipeline for auditing and refining LLM objectives using Inverse Reinforcement Learning (IRL) with Variational Inference. This package provides a clean, organized implementation of Bayesian IRL methods for learning reward models from human preferences.
Paper: The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
```bash
# Clone the repository
git clone https://github.com/Matthieu6/IRL-Alignment-Auditor.git
cd IRL-Alignment-Auditor

# Install Python dependencies
pip install -r requirements.txt
```

Important: You need a Hugging Face token to download models and save results to the Hub.
- Get your token from: https://huggingface.co/settings/tokens
- Login using the CLI:

```bash
huggingface-cli login
```

Or set the environment variable:

```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"
```

```bash
# Test the installation
python -c "import irl_pipeline; print('Installation successful!')"
```

The example notebook demonstrates:
- Complete pipeline execution with Llama-3.2-1B (including IRL+RLHF)
- Custom model configurations
- Individual component usage
- Troubleshooting tips
Note: The complete IRL+RLHF pipeline is only available for Llama-3.2-1B. Other models support dataset generation and IRL training only.
The pipeline uses two main models that you can easily change:
Toxic Model (generates toxic content):
- EleutherAI/pythia-70m (small)
- EleutherAI/pythia-410m (medium)
- EleutherAI/gpt-neo-125m (small)
- meta-llama/Llama-3.2-1B (default, large)
Non-Toxic Model (generates detoxified content):
- ajagota71/pythia-70m-s-nlp-detox-checkpoint-epoch-100
- ajagota71/pythia-410m-s-nlp-detox-checkpoint-epoch-100
- ajagota71/gpt-neo-125m-s-nlp-detox-checkpoint-epoch-100
- ajagota71/llama-3-2-1b-rlhf-kl-p5-target-2p5-lr-3e-6-checkpoint-epoch-100 (default)
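Optionally, you can pre-download a toxic/non-toxic pair before running the pipeline; this also confirms that your token has access to gated repositories such as meta-llama/Llama-3.2-1B. This is a convenience step, not something the pipeline requires, and it uses the standard huggingface-cli download command rather than anything defined in this repository:

```bash
# Optional: pre-fetch the default model pair into the local Hugging Face cache
huggingface-cli download meta-llama/Llama-3.2-1B
huggingface-cli download ajagota71/llama-3-2-1b-rlhf-kl-p5-target-2p5-lr-3e-6-checkpoint-epoch-100
```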
Dataset generation is controlled by the following parameters (see the override sketch after this list):
- train_samples: Number of training samples (default: 3100)
- test_samples: Number of test samples (default: 600)
- toxicity_threshold: Threshold for filtering toxic prompts (default: 0.9)
- sort_threshold: Threshold for sorting outputs (default: 0.7)
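A minimal sketch of overriding these from the command line, assuming they are exposed as Hydra overrides in the same way toxic_model and training.n_steps are in the examples below; check configs/dataset.yaml for the exact key names and any required prefix:

```bash
# Hypothetical parameter overrides for a quick, smaller run; verify key names in configs/dataset.yaml
./run_pipeline.sh generate \
  train_samples=1000 \
  test_samples=200 \
  toxicity_threshold=0.9 \
  sort_threshold=0.7
```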
```bash
# Run everything: dataset generation + IRL training + analysis (default: llama-1B)
./run_pipeline.sh complete
```

```bash
# Generate datasets with default models
./run_pipeline.sh generate

# Generate with custom models
./run_pipeline.sh generate \
  toxic_model=meta-llama/Llama-3.2-1B \
  non_toxic_model=ajagota71/llama-3-2-1b-rlhf-kl-p5-target-2p5-lr-3e-6-checkpoint-epoch-100
```

```bash
# Train IRL model (requires existing datasets)
./run_pipeline.sh train

# Train with custom parameters
./run_pipeline.sh train \
  training.n_steps=5000 \
  training.learning_rate=0.02
```
```bash
# Run RLHF with default model
./run_pipeline.sh rlhf

# Run RLHF with specific model
./run_pipeline.sh rlhf \
  rlhf_config.model=llama_3_2_1b

# Available RLHF models:
# - smolLM_135m
# - smolLM_360m
# - llama_3_2_1b
```

```bash
# Run spurious features analysis (requires trained model)
./run_pipeline.sh analyze
```

```bash
# Evaluate trained RLHF model (trained_model_root is the local path or Hugging Face name of the trained model)
./run_pipeline.sh evaluate \
  evaluate_rlhf.model.trained_model_root=user/model-name
```

Generated dataset files:
- datasets/*_samples_original.json: Original model outputs
- datasets/*_samples_detoxified.json: Detoxified model outputs
- datasets/sorted_toxic_dataset_*.json: Sorted toxic samples
- datasets/sorted_non_toxic_dataset_*.json: Sorted non-toxic samples
- outputs/re_irl/{timestamp}/: Training outputs directory
  - round_{i}/: Results for each training round
  - summary.json: Summary of all rounds
  - Various plots and visualizations
The pipeline uses Hydra for configuration management:
- configs/full_pipeline.yaml: Main pipeline configuration
- configs/dataset.yaml: Dataset generation settings
- configs/re_irl_config.yaml: IRL training parameters
- configs/rlhf_config.yaml: RLHF training parameters
- configs/rlhf/: Model-specific RLHF configurations
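Because the entry points are Hydra apps, standard Hydra command-line features should also work, assuming run_pipeline.sh forwards extra arguments to the underlying application (the key=value overrides shown above suggest it does). The flags below are generic Hydra flags, not flags defined by this repository, so treat this as a sketch:

```bash
# Print the fully composed configuration without training (generic Hydra flag)
./run_pipeline.sh train --cfg job

# Sweep over two learning rates in one call (generic Hydra multirun)
./run_pipeline.sh train --multirun training.learning_rate=0.01,0.02
```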
Supported models, grouped by size:

Small models:
- EleutherAI/pythia-70m
- EleutherAI/gpt-neo-125m
- HuggingFaceTB/SmolLM-135M

Medium models:
- EleutherAI/pythia-410m
- HuggingFaceTB/SmolLM-360M

Large models:
- meta-llama/Llama-3.2-1B (Recommended for IRL+RLHF pipeline)
CUDA Out of Memory:

```bash
# Use smaller models
./run_pipeline.sh complete toxic_model=EleutherAI/pythia-70m
```

Missing Dependencies:

```bash
pip install -r requirements.txt
```

Permission Errors:

```bash
chmod +x run_pipeline.sh
```

This implementation is based on:
- Inverse Reinforcement Learning: Learning reward functions from human preferences
- Variational Inference: Efficient Bayesian inference for large-scale problems
- Bradley-Terry Model: Probabilistic ranking model for pairwise preferences (see the formula after this list)
- Text Detoxification: Reducing toxicity in language model outputs
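As a quick refresher (the notation here is illustrative, not taken from the codebase), the Bradley-Terry model turns reward differences into preference probabilities. For a prompt x with a preferred completion y⁺ and a rejected completion y⁻ (in this pipeline, roughly the detoxified vs. toxic generations), a reward model r_θ assigns:

$$
P(y^{+} \succ y^{-} \mid x) = \frac{\exp\big(r_\theta(x, y^{+})\big)}{\exp\big(r_\theta(x, y^{+})\big) + \exp\big(r_\theta(x, y^{-})\big)} = \sigma\big(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\big)
$$

Maximizing the log-likelihood of observed preferences under this model is what lets a reward function be recovered from pairwise comparisons, and the Bayesian/variational treatment places a posterior over r_θ rather than a single point estimate.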
If you use this code in your research, please cite:
```bibtex
@article{bou2025alignment,
  title={The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives},
  author={Bou, Matthieu and Patel, Nyal and Jagota, Arjun and Krishna, Satyapriya and Parbhoo, Sonali},
  journal={arXiv preprint arXiv:2510.06096},
  year={2025},
  url={https://arxiv.org/abs/2510.06096}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.