This repository contains the code to reproduce all results of the paper *Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization*.
Clone the repository and install the requirements from `requirements.txt`. We recommend using a virtual environment. An example installation:

```bash
git clone https://github.com/Sebastian-Griesbach/SEE.git
cd SEE
python -m venv env
source env/bin/activate
pip install -r requirements.txt
```

This was tested with Python 3.12.3.
Specific results can be reproduced by executing `run_experiment.py`:

```bash
python see/run_experiment.py
```

Alternatively, run it directly from your IDE.
To recreate any of the experiments in the paper, open `run_experiment.py`, scroll to the bottom, and modify:

```python
if __name__ == "__main__":
    setup_and_run_experiment(
        environment_name="Pendulum-v1",
        reward_setting="adverse",
        method="SAC+SEE",
    )
```

You can choose between:
- `environment_name`:
  `"Pendulum-v1"`, `"LocalOptimumCar-v0"`, `"HalfCheetah-v5"`, `"Ant-v5"`, `"Hopper-v5"`, `"Swimmer-v5"`, `"FetchPickAndPlace-v4"`, `"LargePointMaze-v3"`
- `reward_setting`:
  `"dense"`, `"sparse"`, `"adverse"`
- `method`:
  `"SAC"`, `"TD3"`, `"SAC+SEE"`, `"TD3+SEE"`, `"SAC+SEE w/o maximum update"`, `"SAC+SEE w/o conditioning"`, `"SAC+SEE w/o mixing"`, `"TD3+SEE w/o maximum update"`, `"TD3+SEE w/o conditioning"`, `"TD3+SEE w/o mixing"`
All other settings are set automatically to match the experiments in the paper. The code is optimized for readability and is fully documented with docstrings.
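Since `setup_and_run_experiment` takes these three arguments, several configurations can also be run back to back. A minimal sketch, assuming the function can be imported as a module (the file lives at `see/run_experiment.py`, so the import path may need adjusting; the combinations below are illustrative):

```python
# Illustrative sweep over a few of the configurations listed above.
# setup_and_run_experiment and its parameters come from run_experiment.py;
# importing it like this is an assumption about the package layout.
from run_experiment import setup_and_run_experiment

for method in ["SAC", "SAC+SEE"]:
    for reward_setting in ["dense", "sparse", "adverse"]:
        setup_and_run_experiment(
            environment_name="Pendulum-v1",
            reward_setting=reward_setting,
            method=method,
        )
```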
wandb is used for logging, which requires an account. If you want to change this, there are three occurrences of wandb calls in `run_experiment.py` that you can replace with an alternative; wandb is not used outside of this file.
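As one option, the wandb calls could be swapped for a small file-based logger. A minimal sketch, assuming the call sites log flat dictionaries of metrics (the class name, file path, and example metrics below are hypothetical):

```python
import csv


class CSVLogger:
    """Hypothetical drop-in for metric logging; appends one CSV row per log call."""

    def __init__(self, path: str = "metrics.csv"):
        self.path = path
        self.fieldnames = None

    def log(self, metrics: dict) -> None:
        # Fix the column order on the first call, then append rows.
        first_call = self.fieldnames is None
        if first_call:
            self.fieldnames = list(metrics.keys())
        with open(self.path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.fieldnames)
            if first_call:
                writer.writeheader()
            writer.writerow(metrics)


logger = CSVLogger()
logger.log({"step": 0, "episode_return": -1500.0})
```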
Most environments are included in the `environments` folder. The Gymnasium-Robotics environments, however, are based on a fork of that repository which alters the reward functions of the relevant environments.
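For reference, the listed environments follow the standard Gymnasium interface. A minimal sketch of instantiating one of them, assuming the custom environments in the `environments` folder register their IDs with Gymnasium on import (the module name is inferred from the folder name):

```python
import gymnasium as gym

# Assumption: importing the environments package registers custom IDs
# such as "LocalOptimumCar-v0" with Gymnasium.
import environments  # noqa: F401

env = gym.make("Pendulum-v1")  # a standard Gymnasium environment from the list
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```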
SAC+SEE and TD3+SEE are implemented using the Athlete API. For the SAC and TD3 baselines, the Athlete implementations are used directly.
If you use this code, please cite:

```bibtex
@article{griesbach2025learning,
    title={Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization},
    author={Griesbach, Sebastian and D'Eramo, Carlo},
    journal={Reinforcement Learning Journal},
    volume={6},
    pages={1140--1157},
    year={2025}
}
```