Gridworld

Download report here

This repo contains implementations and a comparison of Sarsa(λ) (with accumulating or replacing traces) and Q-learning. Your agent will be placed in an environment, and will try to reach a goal state by taking actions. Its objective is to maximise expected cumulative reward.
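As a point of reference, the following is a minimal tabular sketch of the two update rules being compared, with ε-greedy exploration assumed; it is a generic illustration, not the exact code in agent.py.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon, num_actions=4):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: the TD target bootstraps on the greedy value of the next state.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam,
                        trace="replacing"):
    # On-policy: the TD error uses the action actually taken in the next state.
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    if trace == "accumulating":
        E[s, a] += 1           # accumulating trace
    else:
        E[s, a] = 1            # replacing trace
    Q += alpha * delta * E     # credit all recently visited state-action pairs
    E *= gamma * lam           # decay eligibility traces
```

In a tabular implementation, Q and E would be N × 4 arrays indexed by state and action, with E reset to zero at the start of every episode.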

Environment

The environment, which can be modeled by an MDP, is a grid world with a fixed start state and a fixed goal state (see figure). Some states in this world are obstacles, and thus unreachable. At the beginning of each episode, the agent is placed at the start state. The agent must move to the goal by taking the actions: up, down, left, and right. An episode terminates when the agent reaches the goal or when the maximum episode length is reached.

  • If the agent tries to move into an obstacle, it stays put.
  • The environment is slippery, i.e. the agent will move in a random direction with some small probability, irrespective of the action taken.
  • Reaching the goal gives a reward of +100.
  • Reaching any other state gives a reward of -1.

image
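For concreteness, below is a minimal sketch of one environment step under these rules; the real server's state numbering, slip handling and obfuscation may differ, and the grid encoding here is purely illustrative.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(grid, pos, action, slip=0.02, rng=random):
    """grid[r][c] is '#' for an obstacle, 'G' for the goal, '.' otherwise."""
    if rng.random() < slip:
        # Slippery: with small probability move in a random direction,
        # irrespective of the action taken.
        action = rng.choice(list(ACTIONS))
    dr, dc = ACTIONS[action]
    r, c = pos[0] + dr, pos[1] + dc
    side = len(grid)
    if not (0 <= r < side and 0 <= c < side) or grid[r][c] == '#':
        r, c = pos  # moving into an obstacle (or off the grid) leaves the agent in place
    if grid[r][c] == 'G':
        return (r, c), 100, True   # goal reached: reward +100, episode ends
    return (r, c), -1, False       # any other transition: reward -1
```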

Note:

  • The algorithms must work for grids of arbitrary size. In these experiments, however, the grid is fixed to be of size 32 × 32.
  • The states are numbered 0 to N-1, where N is the total number of states. However, the numbers assigned to the states are permuted, so cells i and i + 1 need not be neighbours on the grid. The permutation can be turned off for debugging.
  • A random grid world can be generated using the --instance parameter of the server. The experiments in this report use gridworld instances 0 and 1.

Code

server

The server directory contains the code to simulate the MDP, in other words the "environment". The server waits for a client (the "agent") to connect to it and start taking actions in sequence. Initially, by requesting "info" from the server, the client can obtain the number of states in the task as well as the current state of the agent. Thereafter, for each action, the server generates a reward and a next state based on the current state and the chosen action, which are communicated back to the agent. Agent-server communication happens over a TCP connection. The server is written in Python 3 and can be started by calling python3 server.py with the appropriate parameters.

After a complete run, the server prints (to stdout) the per-episode reward obtained by the client agent, along with the number of episodes completed.

Parameters to server.py

--ip
--port
--side Side length of the square grid
--instance Instance number of the gridworld. The instance number fixes the start state, goal, and obstacles
--maxlength Maximum number of time steps in an episode. Set it to 1000 for your experiments
--randomseed Seed for RNG. This determines the permutation for state labels, and the actual environment dynamics
--numepisodes Number of episodes to run
--slip Probability that the agent slips. Fixed at 0.02 for evaluation; can be set to 0 for debugging
--noobfuscate Turns off obfuscation (for debugging only)
--quiet Suppresses detailed output (makes the code run a little faster)
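As a usage example, a run with the settings used in these experiments might be started along the following lines; the IP address, port, seed and episode count here are placeholders rather than values prescribed by the assignment.

```
python3 server.py --ip 127.0.0.1 --port 8000 --side 32 --instance 0 \
    --maxlength 1000 --randomseed 0 --numepisodes 500 --slip 0.02
```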

client

The client directory is provided as an example of what your submission must achieve. The client is the agent, which implements the learning algorithms; client.py handles the networking. The agent provided in agent.py merely samples actions in a round-robin fashion: you will have to implement more efficient learning algorithms.

Parameters to client.py

--ip
--port
--algorithm The learning algorithm to be used. {`random`, `sarsa`, `qlearning`}
--gamma Discount factor
--lambda Value of λ used by the Sarsa(λ) algorithm
--randomseed Seed for RNG. Your agent must use this seed
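A matching client invocation might look like the following; the IP, port and seed are placeholders, while gamma 1 and λ = 0.8 correspond to the values used for MDP instance 0 in the experiments below.

```
python3 client.py --ip 127.0.0.1 --port 8000 --algorithm sarsa \
    --gamma 1 --lambda 0.8 --randomseed 0
```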

Results

Expected cumulative reward against episode number for Q-learning and Sarsa(λ=0.8) for MDP instance 0

image

From the experiments carried out (described below), the optimal value of λ was found to be 0.8 for MDP instance 0. The optimal alpha was 0.8 and the optimal epsilon was 0.2 for both instances. The figure above shows the expected cumulative reward (for gamma 1) against the number of episodes for the SARSA and Q Learning algorithms on MDP instance 0.

Expected cumulative reward against episode number for Q-learning and Sarsa(λ=0.85) for MDP instance 1

image

Likewise, the optimal value of λ was found to be 0.85 for MDP instance 1. The optimal alpha was 0.8 and the optimal epsilon was 0.2 for both algorithms. The figure above shows the expected cumulative reward (for gamma 1) against the number of episodes for the SARSA and Q Learning algorithms on MDP instance 1.

Expected cumulative reward over the first 500 episodes of training for Sarsa(λ) against λ for MDP instance 0

image

To tune λ once alpha was fixed at 0.8 and epsilon at 0.2, the expected cumulative reward over the first 500 episodes of SARSA was plotted for both MDP instances. The figure above shows the variation of the expected cumulative reward with λ for MDP instance 0. The values of λ were sampled more sparsely the further they lie from the observed maximum. The optimal value of λ was found to be around 0.8 for MDP instance 0; this value is used in all the following experiments, including the first graph in the Results section.

Expected cumulative reward over the first 500 episodes of training for Sarsa(λ) against λ for MDP instance 1

image

A similar procedure was followed for MDP instance 1 to find the optimal value of λ, as shown in the figure above. The optimal value of λ was found to be around 0.85.

Tuning

Tuning of epsilon for Q Learning on MDP instance 0

image image
Figure 1 Figure 2

For tuning epsilon, alpha was set to 0.1. Epsilon was then varied from 0.1 to 0.9 in steps of 0.1, and for each value the expected cumulative reward was plotted against the episode number on the same axes for comparison.

Due to noise, however, the optimal epsilon is not clearly distinguishable, so the plots for clearly suboptimal epsilons were removed. The plots for epsilons 0.1, 0.2 and 0.3 are nearly identical. The individual plots for each epsilon (all plots not included in the report are available at the link at the end of this report) were therefore checked, and a moderate value of 0.2 was chosen for epsilon.

Tuning of epsilon for Q Learning on MDP instance 1

image image
Figure 3 Figure 4

The same procedure was followed for the rest of the epsilon-tuning experiments across MDP instances 0 and 1. A suitable epsilon was found to be 0.2.

Tuning of epsilon for SARSA on MDP instance 0

image image
Figure 5 Figure 6

Applying the same procedure while keeping alpha at 0.1 and λ at 0, 0.2 was found to be a suitable epsilon. The trace type was kept as replacing; it is compared with accumulating traces at the end.

Tuning of epsilon for SARSA on MDP instance 1

image image
Figure 7 Figure 8

With alpha kept at 0.2, λ at 0 and the trace set to replacing, epsilon was found to be 0.2 using the procedure described above.

Since the optimal value of epsilon is not clearly distinguishable and lies around 0.2 in each case, 0.2 was set as the default epsilon in client.py. All parameters tuned hereafter are likewise set as defaults in client.py.

Tuning of alpha for Q Learning on MDP instance 0

image image
Figure 9 Figure 10
image
Figure 11

For tuning alpha, epsilon was set to the value found in the previous section, i.e. 0.2. Alpha was varied from 0.1 to 0.9 in steps of 0.1, and for each value the expected cumulative reward was plotted against the episode number on the same axes for comparison. To see the trend more clearly, the plots for clearly suboptimal values of alpha were removed; the remaining plots nearly coincided. The individual plots for each alpha (all plots not included in the report are available at the link at the end of this report) were then checked, and 0.8 was chosen as a suitable value of alpha.

Tuning of alpha for Q Learning on MDP instance 1

image image
Figure 12 Figure 13
image
Figure 14

The same procedure was followed for MDP instance 1 with epsilon 0.2; alpha was again chosen to be 0.8.

Tuning of alpha for SARSA on MDP instance 0

image image
Figure 15 Figure 16
image
Figure 17

A similar procedure was applied for SARSA. Epsilon was kept at 0.2, λ at 0, and the trace was left at the default of replacing. A suitable value of alpha turned out to be 0.8.

Tuning of alpha for SARSA on MDP instance 1

image image
Figure 18 Figure 19
image
Figure 20

The same procedure with the same parameters as above was applied here as well, again giving alpha 0.8.

λ tuning of SARSA on MDP instance 0

image image
Figure 21 Figure 22

λ was tuned as required by the assignment, by plotting the expected cumulative reward over the first 500 episodes against λ. For more insight, the trend of the expected cumulative reward over the same 500 episodes was also plotted against the episode number for each λ, as shown in Figure 22. The maximum expected cumulative reward for MDP instance 0 occurred at λ = 0.8, which was selected as the optimal value of λ.

λ tuning of SARSA on MDP instance 1

image image
Figure 23 Figure 24

Following the same procedure as in the previous subsection, the optimal value of λ for MDP instance 1 was found to be 0.85 from Figure 23.

Comparison

Comparison between variants of Q Learning on the basis of annealing

An experiment was carried out to study the effect of annealing epsilon in the Q Learning algorithm. Figures 25 and 26 show the expected cumulative reward against the episode number with and without annealing, for MDP instances 0 and 1 respectively. In MDP instance 0, annealing improves performance slightly, but no such conclusion can be drawn for MDP instance 1. Hence, annealing was not used.

image image
Figure 25 Figure 26
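The exact annealing schedule used in these runs is not specified in the report; a typical choice decays ε with the episode index, as in the hedged sketch below (the constants are illustrative only).

```python
def annealed_epsilon(episode, eps_start=0.2, eps_min=0.01, decay=0.995):
    # Exponential decay of the exploration rate with the episode index.
    # The schedule (and constants) actually used in the experiments may differ.
    return max(eps_min, eps_start * decay ** episode)
```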

Comparison between variants of SARSA on the basis of annealing and accumulating traces

Unlike the above case, SARSA shows a significant improvement in MDP instance 0 when both annealing and accumulating traces are used, but the improvement in MDP instance 1 is not significant. The variant with neither annealing nor accumulating traces lies between the two extremes. Since this agent is to be tested on random instances of the MDP, annealing and accumulating traces were not used.

image image
Figure 27 Figure 28

Observations

  • With proper tuning, SARSA(λ) performs better than Q Learning on the same MDP. SARSA converges more quickly than Q Learning, but leaves some margin for improvement with respect to Q Learning in MDP instance 0.
  • As λ is increased, the expected cumulative reward increases up to a certain λ and then decreases, falling off further as λ approaches 1. This optimal λ changes if the learning rate alpha changes.
  • The noise in the plots after 50 trials is more pronounced in MDP instance 1, which makes tuning difficult. The number of trials for MDP instance 1 should therefore be more than 50.

References