(Published in the Reinforcement Learning Journal (RLC 2025))
This repository contains the code to reproduce the experiments presented in our paper, Deep Reinforcement Learning with Gradient Eligibility Traces, which was accepted at RLC 2025.
Create a virtual environment and install the requirements, e.g., jax, mujoco, gymnasium (see the requirements file), then install the package:
$ pip install -e .
The gtd_algos/exp_configs folder contains the default configuration files with the hyperparameter settings used to generate the results in our experiments.
Note on the config files: each config has both exp_seed and env_seed. The exp_seed is used for randomization within the agent, such as action selection, and env_seed is used to seed the environment. We think that keeping separate seeds for the agent and the environment is good practice, as it ensures the two sources of randomness stay decoupled.
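As a minimal sketch (the two seed keys are named above; the values here are placeholders, not the repository's defaults):

```yaml
# Illustrative only: exp_seed and env_seed are described in this README,
# but these values are placeholders.
exp_seed: 0   # randomness inside the agent, e.g., action selection
env_seed: 0   # randomness inside the environment
```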
We have two forward-view algorithms: PPO and Gradient PPO.
For Gradient PPO, you need to run:
python gtd_algos/src/experiments/gym_exps/gradient_ppo_main.py --config_file=gtd_algos/exp_configs/mujoco_gradient_ppo.yaml
The default values in mujoco_gradient_ppo.yaml will run one seed on the Ant-v4 environment; you can easily change the seed or env_name value to try out different seeds and environments, as in the sketch below.
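For example, trying a different MuJoCo task and seed might look like the following edit (a hypothetical excerpt of mujoco_gradient_ppo.yaml; HalfCheetah-v4 is just an example value):

```yaml
env_name: HalfCheetah-v4  # example replacement for the default Ant-v4
env_seed: 1
exp_seed: 1
```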
Different variations of the Gradient PPO update can also be configured from the config file:
- To get the TDRC($\lambda$) update (the default): set `gradient_correction` to True and `reg_coeff` to a value greater than 0.0. The `reg_coeff` is the $\beta$ coefficient in the paper.
- To get the TDC($\lambda$) update: set `gradient_correction` to True and `reg_coeff` to 0.0.
- To get the GTD2($\lambda$) update: set `gradient_correction` to False.
Note that there is an additional config parameter, `is_correction`. Setting it to True uses the equations in Appendix A for importance sampling; in our experiments we set it to False by default.
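Putting the three switches together, a hypothetical excerpt corresponding to the TDRC($\lambda$) default might read as follows (key names follow this README; the exact default values live in the config file):

```yaml
gradient_correction: true  # false would give the GTD2(lambda) update instead
reg_coeff: 1.0             # the beta coefficient; 0.0 gives TDC(lambda)
is_correction: false       # true uses the Appendix A importance-sampling equations
```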
For PPO, you need to run:
python gtd_algos/src/experiments/gym_exps/ppo_main.py --config_file=gtd_algos/exp_configs/mujoco_ppo.yaml
We have three backward-view algorithms: our QRC($\lambda$), Q($\lambda$), and StreamQ.
For QRC($\lambda$), run:
python gtd_algos/src/experiments/gym_exps/qrc_main.py --config_file=gtd_algos/exp_configs/minatar_qrclambda.yaml
As before, to get the other variations of the gradient updates, the config file can be edited as follows:
- To get the QRC($\lambda$) update (the default): set `gradient_correction` to True and `reg_coeff` to a value greater than 0.0 (1.0 by default).
- To get the QC($\lambda$) update: set `gradient_correction` to True and `reg_coeff` to 0.0.
- To get the GQ2($\lambda$) update: set `gradient_correction` to False.
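For instance, switching to the GQ2($\lambda$) update is a one-line change to minatar_qrclambda.yaml (a sketch; the key name is from this README):

```yaml
gradient_correction: false  # disables the gradient correction, giving GQ2(lambda)
```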
For Q($\lambda$), run:
python gtd_algos/src/experiments/gym_exps/q_lambda_main.py --config_file=gtd_algos/exp_configs/minatar_qlambda.yaml
For StreamQ, we include the PyTorch implementation provided by Elsayed et al. (2024), along with our own JAX implementation, which we verified reproduces the same results while also being about 2x faster. For the JAX version, run:
python gtd_algos/src/experiments/gym_exps/streamq_main.py --config_file=gtd_algos/exp_configs/minatar_streamq_jax.yaml
For the PyTorch version, run:
python gtd_algos/src/experiments/gym_exps/streamq_pytorch.py
Please cite our work if you find it useful:
@inproceedings{elelimy2025deep,
title={Deep Reinforcement Learning with Gradient Eligibility Traces},
author={Esraa Elelimy and Brett Daley and Andrew Patterson and Marlos C. Machado and Adam White and Martha White},
booktitle={Reinforcement Learning Conference},
year={2025},
}