This repository hosts the code release for the paper "Orchestrated Value Mapping for Reinforcement Learning", published at ICLR 2022. This work was done by Mehdi Fatemi (Microsoft Research) and Arash Tavakoli (Max Planck Institute for Intelligent Systems).
We release a flexible framework, built upon Dopamine (Castro et al., 2018), for building and orchestrating various mappings over different reward decomposition schemes. This enables the research community to easily explore the design space that our theory opens up and investigate new convergent families of algorithms.
The code has been developed by Arash Tavakoli.
If you make use of our work, please cite it as follows:
@inproceedings{Fatemi2022Orchestrated,
  title={Orchestrated Value Mapping for Reinforcement Learning},
  author={Mehdi Fatemi and Arash Tavakoli},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=c87d0TS4yX}
}
We install the required packages within a virtual environment.
Create a virtual environment using conda via:
conda create --name maprl-env python=3.8
conda activate maprl-env
Atari benchmark. To set up the Atari suite, please follow the steps outlined here.
Install Dopamine. Install a compatible version of Dopamine with pip:
pip install dopamine-rl==3.1.10
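As an optional sanity check, you can verify that Dopamine is importable from within the environment:
python -c "import dopamine; print(dopamine.__file__)"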
To easily experiment within our framework, install it from source and modify the code directly:
git clone https://github.com/microsoft/orchestrated-value-mapping.git
cd orchestrated-value-mapping
pip install -e .
Change to the workspace directory:
cd map_rl
To train a LogDQN agent, similar to that introduced by van Seijen, Fatemi & Tavakoli (2019), run the following command:
python -um map_rl.train \
--base_dir=/tmp/log_dqn \
--gin_files='configs/map_dqn.gin' \
--gin_bindings='MapDQNAgent.map_func_id="[log,log]"' \
--gin_bindings='MapDQNAgent.rew_decomp_id="polar"' &
Here, polar refers to the reward decomposition scheme described in Equation 13 of Fatemi & Tavakoli (2022) (which has two reward channels), and [log,log] results in a logarithmic mapping for each of the two reward channels.
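For intuition, the polar scheme routes the positive and negative parts of the reward to separate channels that sum back to the original reward. The following is a minimal illustrative sketch in NumPy only; it is not the repository's implementation (see Equation 13 of the paper and map_dqn_agent.py for the exact form):
import numpy as np

def polar_decomposition(reward):
    # Illustrative only: split the reward by sign into two non-negative
    # channels such that reward = positive_channel - negative_channel.
    positive_channel = np.maximum(reward, 0.0)
    negative_channel = np.maximum(-reward, 0.0)
    return positive_channel, negative_channel

# Example: a reward of -1.5 goes entirely to the negative channel.
print(polar_decomposition(np.array([-1.5, 0.0, 2.0])))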
Train a LogLinDQN agent, similar to that described by Fatemi & Tavakoli (2022), using:
python -um map_rl.train \
--base_dir=/tmp/loglin_dqn \
--gin_files='configs/map_dqn.gin' \
--gin_bindings='MapDQNAgent.map_func_id="[loglin,loglin]"' \
--gin_bindings='MapDQNAgent.rew_decomp_id="polar"' &
To instantiate a custom agent, simply set a mapping function for each reward channel and choose a reward decomposition scheme. For instance, the following setting
MapDQNAgent.map_func_id="[log,identity]"
MapDQNAgent.rew_decomp_id="polar"
results in a logarithmic mapping for the positive-reward channel and the identity mapping (same as in DQN) for the negative-reward channel.
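Such a setting is passed to the trainer in the same way as above; for example (the base directory below is just an illustrative choice):
python -um map_rl.train \
--base_dir=/tmp/log_identity_dqn \
--gin_files='configs/map_dqn.gin' \
--gin_bindings='MapDQNAgent.map_func_id="[log,identity]"' \
--gin_bindings='MapDQNAgent.rew_decomp_id="polar"' &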
To use more complex reward decomposition schemes, such as Configurations 1 and 2 from Fatemi & Tavakoli (2022), you can use, for example:
MapDQNAgent.map_func_id="[identity,identity,log,log,loglin,loglin]"
MapDQNAgent.rew_decomp_id="config_1"
To instantiate an ensemble of two learners, each using a polar reward decomposition, use the following syntax:
MapDQNAgent.map_func_id="[loglin,loglin,log,log]"
MapDQNAgent.rew_decomp_id="two_ensemble_polar"
To implement custom mapping functions and reward decomposition schemes, we suggest drawing on the insights from Fatemi & Tavakoli (2022) and following the format of the existing methods in map_dqn_agent.py.
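As a starting point, a custom mapping is essentially a forward function paired with its inverse, applied per reward channel. The sketch below is hypothetical: the function names and the (forward, inverse) interface are assumptions, so mirror the existing entries in map_dqn_agent.py rather than using this verbatim.
import numpy as np

def sqrt_map(x):
    # Hypothetical forward mapping: signed square root, which compresses
    # large magnitudes while preserving sign.
    return np.sign(x) * np.sqrt(np.abs(x))

def sqrt_map_inverse(y):
    # Inverse of the mapping above, used to recover values on the
    # original scale.
    return np.sign(y) * np.square(y)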