(Incorrect) experience-replay MC control algorithm. #62

Open · wants to merge 1 commit into base: master
Conversation

TikhonJelvis (Owner)

Here's my first implementation of the experience-replay-style Monte Carlo algorithm for optimizing MDPs.

However, this implementation is not correct because there's no feedback mechanism to improve the policy between traces. This means that the algorithm evaluates the MDP *on a uniform random policy* rather than the optimal policy, which is why it arrives at a value of ≈150 rather than ≈170.

In the TD(0) code, we didn't have this problem because the reward for each step is calculated based on the best possible action from that state:

```python
next_reward = max(
    q((transition.next_state, a))
    for a in actions(transition.next_state)
)
```

This step lets us learn the *optimal* Q function over time.
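
For reference, here's roughly the shape of the full update that `next_reward` feeds into. This is only a tabular sketch with `q` as a plain dict; the snippet above calls `q` as a function, so the real representation is different, and `γ` and `α` here are assumed hyperparameter names, not the repo's:

```python
# Tabular sketch of the TD(0)-style update described above, not the repo's
# exact code: q is a plain dict from (state, action) to estimated value,
# γ is an assumed discount factor and α an assumed step size.
def td0_update(q, transition, actions, γ=0.9, α=0.1):
    # Value of the best action available from the next state.
    next_reward = max(
        q[(transition.next_state, a)]
        for a in actions(transition.next_state)
    )
    # One-step bootstrapped target: immediate reward plus the discounted
    # value of the best next action; move the current estimate toward it.
    target = transition.reward + γ * next_reward
    sa = (transition.state, transition.action)
    q[sa] += α * (target - q[sa])
```

The `max` over actions is what makes the targets point toward the greedy policy even when the transitions themselves come from an exploratory one, which is why the Q estimates can improve without changing how the traces are generated.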

In the previous version of the Monte Carlo `evaluate_mrp` algorithm, we explicitly updated the policy between each episode:

```python
while True:
    trace = mdp.simulate_actions(states, p)
    ...
    p = markov_decision_process.policy_from_q(q, mdp, ϵ)
```
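
The feedback lives in that last line: `policy_from_q(q, mdp, ϵ)` rebuilds a policy from the current `q`, so the next call to `simulate_actions` already reflects what we've learned. Conceptually the per-state choice it presumably makes is an ε-greedy rule; the helper below is hypothetical and tabular, not the repo's implementation:

```python
import random

# Hypothetical ε-greedy action choice, sketched with q as a plain dict;
# this is the idea behind deriving a policy from q, not the repo's code.
def epsilon_greedy_action(q, state, actions, ϵ):
    choices = list(actions(state))
    if random.random() < ϵ:
        # Explore: with probability ε pick an action uniformly at random,
        # so every state-action pair keeps getting visited.
        return random.choice(choices)
    # Exploit: otherwise pick the action with the best current q estimate.
    return max(choices, key=lambda a: q.get((state, a), 0.0))
```

Because later episodes are simulated with this improved policy, the returns we average drift toward the value of a near-optimal policy instead of the uniform random one.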

However, this only works because we're taking the `mdp` as an input. In this PR, I tried changing the interface so that we took traces instead:

```python
def evaluate_mdp(
        traces: Iterable[Iterable[mdp.TransitionStep[S, A]]],
        ...
)
```
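
With only finished traces to work from, the most the algorithm can do is average the observed returns, and that estimates Q for whatever fixed policy happened to generate the traces. Here's a rough tabular sketch of that limitation (hypothetical names, not the PR's actual `evaluate_mdp` body; γ is an assumed discount factor):

```python
# Hypothetical sketch, not the PR's code: every-visit Monte Carlo averaging
# of returns over a fixed collection of traces. Nothing here can change
# which actions show up in later traces.
def q_from_traces(traces, γ=0.9):
    q, counts = {}, {}
    for trace in traces:
        episode = list(trace)
        g = 0.0
        # Walk the episode backwards, accumulating discounted returns.
        for step in reversed(episode):
            g = step.reward + γ * g
            sa = (step.state, step.action)
            counts[sa] = counts.get(sa, 0) + 1
            old = q.get(sa, 0.0)
            q[sa] = old + (g - old) / counts[sa]  # running average of returns
    return q
```

If the traces came from a uniform random policy, the resulting `q` (and hence the ≈150 value) describes that random policy, no matter how many traces we feed in.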

How can this interface work for *control* problems if there's no way to influence the actions the system takes?

I think we'll need to figure out a bidirectional interface here—whether it's taking an MDP object or something else—before we can use Monte Carlo for control problems. Not sure we could make TD(N)/TD(λ)/etc work without that either.
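
One possible shape for that bidirectional interface, purely as a sketch and not a concrete proposal: instead of a finished `Iterable` of traces, the caller passes a function that simulates a single episode under a given policy, so the improvement step can steer how the next trace is generated. Every name below is hypothetical:

```python
from typing import Callable, Iterator

# Hypothetical bidirectional interface: simulate_episode(p) yields the
# transition steps of one episode generated under policy p, and
# policy_from_q(q, ϵ) builds an improved ε-greedy policy from q.
def mc_control(
    simulate_episode: Callable,
    initial_policy,
    policy_from_q: Callable,
    γ: float = 0.9,
    ϵ: float = 0.1,
) -> Iterator[dict]:
    p = initial_policy
    q: dict = {}
    counts: dict = {}
    while True:
        episode = list(simulate_episode(p))
        # Same running-average return update as the evaluation-only sketch.
        g = 0.0
        for step in reversed(episode):
            g = step.reward + γ * g
            sa = (step.state, step.action)
            counts[sa] = counts.get(sa, 0) + 1
            q[sa] = q.get(sa, 0.0) + (g - q.get(sa, 0.0)) / counts[sa]
        # Feedback: the next episode is simulated under the improved policy.
        p = policy_from_q(q, ϵ)
        yield q
```

The same callback shape would presumably work for TD(N) or TD(λ) as well, since they also need fresh transitions generated under the policy they're currently improving.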