Docs-fix: Pistonball #1245
base: master
Conversation
…ion because the local-reward was deprecated. Modify some aspects of the documentation for improved clarity & skimming.
Hi @anordin95, indeed the local reward was removed in #486, but the code was never updated further than this TODO comment. I can merge the PR as it stands (once the CI checks have passed), but if you'd like to also modify the environment code to completely remove the local reward, do let me know, and I will merge it once you're done with the code update. Up to you, just let me know, and thank you for your contribution! :) [Edit]: the connect_four tutorials CI tests have been removed from the latest version, so you can ignore CI workflow failures there.
+ # A rough, first-order prediction (i.e. velocity-only) of the ball's next position.
+ # The physics environment may bounce the ball off the wall in the next time-step
+ # without us first registering that win-condition.
+ ball_predicted_next_pos = (
+     ball_curr_pos +
+     self.ball.velocity[0] * self.dt
+ )
- if ball_next_x <= self.wall_width + 1:
+ # Include a single-pixel fudge-factor for the approximation.
+ if ball_predicted_next_pos <= self.wall_width + 1:
Note this is my guess at the purpose of this logic! Please let me know if it looks wrong.
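For readers skimming the thread, here is a minimal, standalone sketch of what the prediction above is doing. It is not the environment's actual code; the constants (wall_width, dt, the velocity) are made-up values for illustration only.

wall_width = 40          # hypothetical left-wall thickness, in pixels
dt = 1.0 / 20.0          # hypothetical physics time-step, in seconds
ball_curr_pos = 41.5     # hypothetical current x-position of the ball's left edge
ball_x_velocity = -60.0  # hypothetical x-velocity; negative means moving toward the left wall

# First-order (velocity-only) prediction of the ball's next x-position.
ball_predicted_next_pos = ball_curr_pos + ball_x_velocity * dt

# Register the win one step early, with a single-pixel fudge-factor, so the physics
# engine cannot bounce the ball off the wall before the win-condition is noticed.
if ball_predicted_next_pos <= wall_width + 1:
    print("Win condition: ball is predicted to reach the left wall next step.")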
Thanks for doing this!
This looks good to me at first glance. Let me know when you're ready for a full review, and I'll tag two contributors who have (way) more experience with pistonball than me to get their feedback.
Keep up the good work and thank you for helping improve PettingZoo!
Right on! :)
I think it's ready for a full review now.
P.S. Is there a particular reason the current distance-based reward function, which pays out at every update step, was chosen over, say, a reward of 1 upon touching the left wall and zero at all other times?
@dm-ackerman @benblack769 @RyanNavillus I saw in the git blame that you made extensive contributions to the pistonball environment; if you have the time, would you mind taking a look at this PR? @jkterry1 since you made the original removal of the local reward in PR #486, I'm also tagging you in case you want/have the time to take a look at it.
Left a few minor comments. I think my changes to the code were mostly a refactor: I just replaced all of the magic numbers with named variables so that the code was more readable. I think the global reward already didn't work the way we intended back then, or wasn't useful for training, so we partially removed it and never fixed it. I do think it's a good feature to include, but if you're just trying to make the docs consistent, I think it's fine to remove for now and maybe consider adding it back correctly in the future. Overall this PR looks good to me.
- self.lastX = int(self.ball.position[0] - self.ball_radius)
- self.distance = self.lastX - self.wall_width
+ self.ball_prev_pos = self._get_ball_position()
+ self.distance_to_wall_at_game_start = self.ball_prev_pos - self.wall_width
Suggested change:
- self.distance_to_wall_at_game_start = self.ball_prev_pos - self.wall_width
+ self.initial_wall_distance = self.ball_prev_pos - self.wall_width
Maybe a bit more concise?
Mhm that is more concise. Though, it may be harder for a reader to differentiate whether "initial" refers to the start of a time-step or the start of the game.
Edit: the variable is also only used once and computed a few hundred lines away, so I think precise context is helpful here.
reward = (
    -1
    * (ball_curr_pos - self.ball_prev_pos)
    * (100 / self.distance_to_wall_at_game_start)
)
Suggested change:
- reward = (
-     -1
-     * (ball_curr_pos - self.ball_prev_pos)
-     * (100 / self.distance_to_wall_at_game_start)
- )
+ reward = (self.ball_prev_pos - ball_curr_pos) * (100 / self.initial_wall_distance)
I think this is still more readable, as long as you keep your comment about how the x axis increases left to right. Another option is this, which makes it clear that 100 is an arbitrary scalar.
Suggested change:
- reward = (
-     -1
-     * (ball_curr_pos - self.ball_prev_pos)
-     * (100 / self.distance_to_wall_at_game_start)
- )
+ reward = 100 * ((self.ball_prev_pos - ball_curr_pos) / self.initial_wall_distance)
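For what it's worth, a quick standalone check (my own sketch with made-up numbers, not repository code) showing that the original expression and both suggested rewrites compute the same value:

ball_prev_pos = 300.0          # hypothetical ball x-position on the previous step
ball_curr_pos = 297.5          # hypothetical ball x-position on the current step
initial_wall_distance = 500.0  # i.e. distance_to_wall_at_game_start

original = -1 * (ball_curr_pos - ball_prev_pos) * (100 / initial_wall_distance)
option_1 = (ball_prev_pos - ball_curr_pos) * (100 / initial_wall_distance)
option_2 = 100 * ((ball_prev_pos - ball_curr_pos) / initial_wall_distance)

assert abs(original - option_1) < 1e-12 and abs(original - option_2) < 1e-12
print(original)  # 0.5: a positive reward because the ball moved toward the wall

So the choice between them is purely about readability, not behaviour.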
I think there are two changes here.
The first is compressing the multi-line expression to one line. I think your current suggestion would still end up being re-wrapped by the linter, though onto fewer lines. For reference, I tried it locally and saw the linter update it like so:
reward = 100 * (
    (self.ball_prev_pos - ball_curr_pos) / self.initial_wall_distance
)
The second is explicitly including the -1 versus flipping the order of current - previous. I think a delta is generally assumed to be current minus previous for any given quantity (e.g. delta-x, delta-t). I'm a bit wary of modifying that typical, intuitive ordering for the sake of packing an extra bit of logic in to save a line of code.
I think we want this reward function to be as easy to learn as possible, since pistonball is sort of the cartpole of pettingzoo and is used in most tutorials (unless things have changed since I last worked on it). Sparse rewards are harder to learn than dense rewards. That's also why I think it makes sense to include the 100x reward multiplier in the code, because that reward scale probably works with the default hyperparameters for most PPO implementations.
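To make the trade-off being discussed concrete, here is a small illustrative sketch of the two schemes (my own sketch; the function names and thresholds are assumptions, not the environment's actual API):

def dense_reward(ball_prev_pos, ball_curr_pos, initial_wall_distance):
    # Reward every time-step in proportion to progress made toward the left wall,
    # scaled so the rewards accumulated over an episode are on the order of 100.
    return 100 * (ball_prev_pos - ball_curr_pos) / initial_wall_distance

def sparse_reward(ball_curr_pos, wall_width):
    # Reward only the single time-step on which the ball touches the left wall.
    return 1.0 if ball_curr_pos <= wall_width else 0.0

The dense version gives the pistons a learning signal on every step, which is why it is generally easier to optimize than the sparse version.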
Ah, right on. Thanks!
Description
Remove mentions of the local-reward for Pistonball because the code no longer supports it. I'm guessing the functionality was removed earlier but the documentation was not updated accordingly. See below.
PettingZoo/pettingzoo/butterfly/pistonball/pistonball.py, lines 239 to 240 (at d6bd110)
Besides that comment, you can run the environment and see that the rewards are constant across all pistons, which shouldn't be possible if there were a local reward component.
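For example, something along these lines can be used to check (a rough sketch; the pistonball_v6 version suffix and the 5-tuple step return assume a recent PettingZoo release and may need adjusting for older versions):

from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.parallel_env()
env.reset(seed=42)

for _ in range(10):
    # Sample a random action for every piston and step the environment.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    # With no local-reward component, every piston receives the same reward,
    # so this set should always contain exactly one value.
    print(set(rewards.values()))

env.close()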
Also, re-arrange and edit some of the phrasing for clarity and easier skimming.
Type of change

Checklist:

- relevant [note: docs-only change]
- I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
- I have run pytest -v and no errors are present.
- I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
- not-relevant