(Incorrect) experience-replay MC control algorithm. #62
Here's my first implementation of the experience-replay-style Monte Carlo algorithm for optimizing MDPs.
However, this implementation is not correct because there's no feedback mechanism to improve the policy between traces. This means that the algorithm evaluates the MDP on a uniform random policy rather than the optimal policy, which is why it arrives at a value of ≈150 rather than ≈170.
In the TD(0) code, we didn't have this problem because each step's update target bootstraps off the best possible action from the successor state. This step lets us learn the optimal Q function over time.
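For reference, here's a rough sketch of what I mean by that kind of update (the function and variable names below are illustrative, not the exact code in this repo):

```python
# Rough sketch only -- illustrative names, not the exact code in this repo.
# Q-learning-style TD(0) update: the bootstrap target uses the best action
# available from the successor state, so the Q estimates drift toward the
# optimal values even when the step came from an exploratory policy.
# (Terminal-state handling is omitted to keep the sketch short.)
def td0_control_update(q, step, actions, alpha, gamma):
    state, action, reward, next_state = step
    best_next = max(q[(next_state, a)] for a in actions(next_state))
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
```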
In the previous version of the Monte Carlo `evaluate_mrp` algorithm, we explicitly updated the policy between each episode. However, this only works because we're taking the `mdp` as an input.
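Roughly, that between-episode improvement step needs something like this (a simplified sketch with illustrative names, not the old code itself), and it depends on being able to ask the `mdp` for the available actions:

```python
import random

# Rough sketch only -- assumes a tabular q and an mdp exposing actions(state);
# illustrative names, not the code from the previous version.
def improved_policy(q, mdp, epsilon):
    def policy(state):
        choices = list(mdp.actions(state))
        if random.random() < epsilon:
            return random.choice(choices)  # keep exploring
        return max(choices, key=lambda a: q[(state, a)])  # exploit current Q
    return policy
```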
In this PR, I tried changing the interface so that we take traces instead. How can this work for control problems if there's no way to impact the control of the system?
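For contrast, with a traces-only interface the best we can do is roughly this kind of pure evaluation (again an illustrative sketch, not the code in this PR), which just averages returns under whatever policy generated the traces:

```python
from collections import defaultdict

# Rough sketch only -- not the code in this PR. Each trace is a finished
# episode [(state, reward), ...] produced by some fixed behavior policy.
# Nothing computed here feeds back into how future traces are generated,
# which is exactly the missing control loop.
def mc_evaluate_from_traces(traces, gamma):
    total_return = defaultdict(float)
    visits = defaultdict(int)
    for trace in traces:
        g = 0.0
        for state, reward in reversed(trace):  # accumulate returns backwards
            g = reward + gamma * g
            total_return[state] += g
            visits[state] += 1
    return {s: total_return[s] / visits[s] for s in total_return}
```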
I think we'll need to figure out a bidirectional interface here—whether it's taking an MDP object or something else—before we can use Monte Carlo for control problems. Not sure we could make TD(N)/TD(λ)/etc work without that either.