Skip to content

Latest commit

 

History

History
162 lines (117 loc) · 7.44 KB

README.md

File metadata and controls

162 lines (117 loc) · 7.44 KB

Pacman-RL

Implement some reinforcement learning algorithms, test and visualize on Pacman under OpenAI's Gym environment.

Requirements

  • Python 3.6+
  • gym
  • matplotlib
  • tensorflow
  • keras
  • mujoco_py (if you want to save replay)
  • torch
  • torchvision

Run

  • Run python run.py --controller MC train for training using Monte-Carlo control. The weight file will be saved as weights/mc.h5.
  • Run python run.py --controller MC --render --show_plot --evaluate_episodes 10 evaluate for evaluation using Monte-Carlo control. It will render the Pacman environment and show the dynamic Q-value and reward plot at the same time.
Full usage: run.py [-h]
              [--controller {MC,Sarsa,Sarsa_lambda,Q_learning,REINFORCE,ActorCritic,A3C,PPO}]
              [--render] [--save_replay] [--save_plot] [--show_plot]
              [--num_episodes NUM_EPISODES] [--batch_size BATCH_SIZE]
              [--eva_interval EVA_INTERVAL]
              [--evaluate_episodes EVALUATE_EPISODES] [--lr LR]
              [--epsilon EPSILON] [--gamma GAMMA] [--lam LAM] [--forward]
              [--max_workers MAX_WORKERS] [--t_max T_MAX]
              {train,evaluate}

positional arguments:
  {train,evaluate}      what to do

optional arguments:
  -h, --help            show this help message and exit
  --controller {MC,Sarsa,Sarsa_lambda,Q_learning,REINFORCE,ActorCritic,A3C,PPO}
                        choose an algorithm (controller)
  --render              set to render the env when evaluate
  --save_replay         set to save replay
  --save_plot           set to save Q-value plot when evaluate
  --show_plot           set to show Q-value plot when evaluate
  --num_episodes NUM_EPISODES
                        set to run how many episodes
  --batch_size BATCH_SIZE
                        set the batch size
  --eva_interval EVA_INTERVAL
                        set how many episodes evaluate once
  --evaluate_episodes EVALUATE_EPISODES
                        set evaluate how many episodes
  --lr LR               set learning rate
  --epsilon EPSILON     set epsilon when use epsilon-greedy
  --gamma GAMMA         set reward decay rate
  --lam LAM             set lambda if use sarsa(lambda) algorithm
  --forward             set to use forward-view sarsa(lambda)
  --rawpixels           set to use raw pixels as input (only valid to PPO)
  --max_workers MAX_WORKERS
                        set max workers to train
  --t_max T_MAX         set simulate how many timesteps until update param

sample1

sample2

Reinforcement Learning Algorithms

Monte-Carlo Control

  • Policy evaluation

  • Policy improvement: 𝜀-greedy with 𝜀 decay

  • Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)

learning curve

Sarsa(0)

  • Policy evaluation
  • Policy improvement: 𝜀-greedy with 𝜀 decay
  • Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)

learning curve

Sarsa(𝝀)

Forward-view

  • Policy evaluation
  • Policy improvement: 𝜀-greedy with 𝜀 decay
  • Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)

Backward-view

  • Policy evaluation
    • Accumulating eligibility trace:
  • Policy improvement: 𝜀-greedy with 𝜀 decay
  • Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)

learning curve

Q-learning

  • Policy evaluation
  • Policy improvement: 𝜀-greedy with 𝜀 decay
  • Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)

learning curve

REINFORCE

Monte-Carlo policy gradient

  • Use return Gt to estimate :
  • Policy function approximation: Softmax policy with a fc layer

Note: You shold pick a very small lr to train a decent model, e.g. lr = 0.00001 learning curve

Advantage Actor-Critic

  • Actor

    • Softmax policy with a fc layer
    • Use advantage function to estimate : , where
  • Critic

    • TD policy evaluation
    • Value function approximation: a fully connected layer (input layer and output layer with no hidden layer)

learning curve

Asynchronous Advantage Actor-Critic (A3C)

a3c

a3c

Trust Region Policy Optimization (TRPO)

trpo Note: Running with OpenAI Spinning Up, TRPO is not implemented in this repo.

Proximal Policy Optimization (PPO)

algo

Run with:

python run.py --controller PPO --max_worker 6 --gamma 0.99 --evaluate_episodes 50 --batch_size 20 --epsilon 0.2 --lam 0.97 --eva_interval 100 train

ppo