(Incorrect) experience-replay MC control algorithm. #62
Here's my first implementation of the experience-replay-style Monte Carlo algorithm for optimizing MDPs.
However, this implementation is not correct because there's no feedback mechanism to improve the policy between traces. This means that the algorithm evaluates the MDP on a uniform random policy rather than the optimal policy, which is why it arrives at a value of ≈150 rather than ≈170.
In the TD(0) code, we didn't have this problem because each step's update target bootstraps off the best possible action from the successor state. This step lets us learn the optimal Q function over time.
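For reference, here's a rough sketch of what I mean by that kind of update (the function and variable names below are illustrative, not the exact code in this repo):

```python
# Rough sketch only -- illustrative names, not the exact code in this repo.
# Q-learning-style TD(0) update: the bootstrap target uses the best action
# available from the successor state, so the Q estimates drift toward the
# optimal values even when the step came from an exploratory policy.
# (Terminal-state handling is omitted to keep the sketch short.)
def td0_control_update(q, step, actions, alpha, gamma):
    state, action, reward, next_state = step
    best_next = max(q[(next_state, a)] for a in actions(next_state))
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
```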
In the previous version of the Monte Carlo `evaluate_mrp` algorithm, we explicitly updated the policy between each episode. However, this only works because we're taking the `mdp` as an input.
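Roughly, that between-episode improvement step needs something like this (a simplified sketch with illustrative names, not the old code itself), and it depends on being able to ask the `mdp` for the available actions:

```python
import random

# Rough sketch only -- assumes a tabular q and an mdp exposing actions(state);
# illustrative names, not the code from the previous version.
def improved_policy(q, mdp, epsilon):
    def policy(state):
        choices = list(mdp.actions(state))
        if random.random() < epsilon:
            return random.choice(choices)  # keep exploring
        return max(choices, key=lambda a: q[(state, a)])  # exploit current Q
    return policy
```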
In this PR, I tried changing the interface so that we take traces instead. How can this work for control problems if there's no way to impact the control of the system?
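For contrast, with a traces-only interface the best we can do is roughly this kind of pure evaluation (again an illustrative sketch, not the code in this PR), which just averages returns under whatever policy generated the traces:

```python
from collections import defaultdict

# Rough sketch only -- not the code in this PR. Each trace is a finished
# episode [(state, reward), ...] produced by some fixed behavior policy.
# Nothing computed here feeds back into how future traces are generated,
# which is exactly the missing control loop.
def mc_evaluate_from_traces(traces, gamma):
    total_return = defaultdict(float)
    visits = defaultdict(int)
    for trace in traces:
        g = 0.0
        for state, reward in reversed(trace):  # accumulate returns backwards
            g = reward + gamma * g
            total_return[state] += g
            visits[state] += 1
    return {s: total_return[s] / visits[s] for s in total_return}
```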
I think we'll need to figure out a bidirectional interface here—whether it's taking an MDP object or something else—before we can use Monte Carlo for control problems. Not sure we could make TD(N)/TD(λ)/etc work without that either.