This repository contains a set of tools for reinforcement learning with LLMs in verifiable environments.
A PyPI release is coming once a couple more features are added; for now, clone the repo and run:
git clone https://github.com/willccbb/verifiers.git
cd verifiers
uv sync
uv pip install flash-attn --no-build-isolation
source .venv/bin/activate
accelerate launch --config_file configs/zero3.yaml --num_processes [N-1] verifiers/examples/gsm8k_calculator.py

# Example: MathEnv on a system with two RTX 4090s, using DeepSpeed with CPU offloading
accelerate launch --config_file configs/zero3_cpu_offload.yaml --num_processes=2 verifiers/examples/gsm8k_simple.py
# Example: ToolEnv on a system with two RTX 4090s, using DeepSpeed with CPU offloading
accelerate launch --config_file configs/zero3_cpu_offload.yaml --num_processes=2 verifiers/examples/gsm8k_calculator.py
# Example: ToolEnv (Phi-4) on a system with two RTX 4090s, using DeepSpeed with CPU offloading
accelerate launch --config_file configs/zero3_cpu_offload.yaml --num_processes=2 verifiers/examples/gsm8k_tool_phi4.py
Ensure your wandb and huggingface-cli logins are set up (or set report_to=None in training_args).
Tested with Python 3.11 and this image. If you encounter version issues, please confirm that you can run basic TRL training in your environment before opening an issue. flash-attn and liger-kernel are used for performance. Recommended usage is via accelerate with DeepSpeed ZeRO 3 (example config), but torchrun works in my tests as well. You should really be using uv (curl -LsSf https://astral.sh/uv/install.sh | sh); I don't have the bandwidth to help debug version issues if you're using pip, sorry.
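If you do hit dependency problems, a quick version check can save a round trip. This is just a suggested snippet (not part of the repo), using only standard-library metadata lookups plus torch:

# check_env.py -- optional sanity check for the core dependencies (hypothetical helper, not part of the repo)
from importlib.metadata import version, PackageNotFoundError

import torch

for pkg in ("trl", "transformers", "accelerate", "deepspeed", "flash-attn", "liger-kernel"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

print("CUDA available:", torch.cuda.is_available())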
# script.py
import verifiers as vf
from verifiers.tools import calculator
from verifiers.prompts import SEARCH_FEW_SHOT
model_name = "Qwen/Qwen2.5-7B-Instruct"
model, tokenizer = vf.get_model_and_tokenizer(model_name)
vf_env = vf.ToolEnv(
    dataset="gsm8k",
    few_shot=SEARCH_FEW_SHOT[0],
    tools=[calculator],
    max_steps=3
)
trainer = vf.GRPOEnvTrainer(
    model=model,
    processing_class=tokenizer,
    env=vf_env,
    reward_funcs=vf_env.get_rubric(),
    args=vf.get_default_grpo_config(run_name="gsm8k", num_gpus=2),
    train_dataset=vf_env.get_dataset(),
)
trainer.train()

See the examples directory for additional usage.
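The tools passed to ToolEnv are plain Python functions like the bundled calculator. If you want to try your own, here is a rough sketch (hypothetical, not from the repo), assuming the tool's docstring and type hints are what describe it to the model, so keep them short and accurate:

# hypothetical custom tool -- a sketch, not part of the library
def word_count(text: str) -> str:
    """Counts the whitespace-separated words in a piece of text.

    Args:
        text: The text to count words in.

    Returns:
        The word count as a string, e.g. "7".
    """
    return str(len(text.split()))

# hypothetical usage -- pass it alongside (or instead of) the built-in calculator:
# vf_env = vf.ToolEnv(dataset="gsm8k", tools=[calculator, word_count], max_steps=3)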
To create your own multi-step environment, inherit from MultiStepEnv and implement:
def get_dataset(self, **kwargs: Any) -> Dataset:
    pass

def get_rubric(self, **kwargs: Any) -> List[RewardFunc]:
    pass

def is_completed(self, messages: List[Dict[str, str]], **kwargs: Any) -> bool:
    pass

def env_response(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, str]:
    pass
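As a concrete sketch of what an implementation might look like: the class name, import path, and dataset below are made up for illustration, and the reward function assumes the TRL-style signature (completions, **kwargs) -> List[float]. A toy environment that nudges the model to double-check itself once could look roughly like this:

from typing import Any, Dict, List

from datasets import Dataset
from verifiers import MultiStepEnv  # assumed import path


class ToyRetryEnv(MultiStepEnv):
    """Toy sketch: the model answers, gets one "are you sure?" nudge, then stops."""

    def get_dataset(self, **kwargs: Any) -> Dataset:
        # assumes a "prompt" column of chat messages, as in the TRL/GRPO convention
        return Dataset.from_list([
            {"prompt": [{"role": "user", "content": "What is 6 * 7?"}]},
        ])

    def get_rubric(self, **kwargs: Any) -> List:
        # assumes TRL-style reward functions: (completions, **kwargs) -> List[float]
        def contains_answer(completions, **kwargs) -> List[float]:
            return [1.0 if "42" in c[-1]["content"] else 0.0 for c in completions]
        return [contains_answer]

    def is_completed(self, messages: List[Dict[str, str]], **kwargs: Any) -> bool:
        # stop after the model has answered twice (initial answer + one revision)
        return sum(m["role"] == "assistant" for m in messages) >= 2

    def env_response(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, str]:
        # the environment's turn: ask the model to double-check before answering again
        return {"role": "user", "content": "Are you sure? Check your work and answer again."}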
To launch training:

Accelerate:

accelerate launch --config_file /path/to/deepspeed_zero3.yaml --num_processes [N-1] script.py

Torchrun:
torchrun --nproc_per_node=[N-1] script.py

Current features:
- Environments: SimpleEnv, MathEnv, DoubleCheckEnv, CodeEnv, ToolEnv
- Multi-step execution in CodeEnv and ToolEnv
- Dataset formatting + XML parsers
- Basic rubrics for math/code correctness + formatting
- Defaults for GRPO, model, tokenizer, etc.
There are a number of features we're planning to support in the near future:
- Integrated evals
- TextArena games
- LLM judges
- Claude-generated rubrics
- A range of other environments (suggestions welcome!)
- PPO
- Potential interoperability with other RL libraries (veRL, OpenRLHF, open-instruct, oat, etc.)
Community contributions are appreciated and encouraged!
If you use this code in your research, please cite:
@article{brown2025verifiers,
  title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
  author={Brown, William},
  year={2025}
}