
Fix bug in SB3 tutorial ActionMask #1203

Merged: 2 commits into Farama-Foundation:master from sb3_fix, May 3, 2024

dm-ackerman (Contributor)

Description

SB3ActionMaskWrapper.step() is intended to be compatible with Gymnasium's
interface, where step() returns (observation, reward, termination, truncation, info).
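For reference, that Gymnasium contract can be written out as a named 5-tuple. `StepResult` below is a hypothetical helper for illustration only, not part of Gymnasium or of this tutorial; it just names the fields in the order step() returns them.

```python
from typing import Any, NamedTuple


class StepResult(NamedTuple):
    """Hypothetical container naming the 5-tuple a Gymnasium-style step() returns."""

    observation: Any  # what the policy sees before choosing its next action
    reward: float     # reward for the action that was just taken
    terminated: bool  # episode ended (e.g. win, loss, or draw)
    truncated: bool   # episode cut off (e.g. time limit)
    info: dict        # auxiliary diagnostics


result = StepResult(observation=[0, 1], reward=1.0, terminated=True, truncated=False, info={})
# A NamedTuple unpacks exactly like the plain tuple step() returns.
obs, reward, terminated, truncated, info = result
```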

This was implemented using the last() function, but last() returns the values for
the current agent (the one whose turn it is now), not for the agent that just acted,
which is what Gymnasium's step() would return.
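The mismatch can be reproduced with a toy turn-based environment that mimics the AEC pattern used by PettingZoo: calling step() hands the turn to the next agent, and last() reports values for whoever's turn it is now. `ToyTurnEnv` and its rules are hypothetical, and it skips PettingZoo's cumulative-reward bookkeeping; this is a minimal sketch, not the tutorial's code.

```python
class ToyTurnEnv:
    """Hypothetical two-player, turn-based env mimicking the AEC pattern.

    Whoever plays action 1 wins immediately (reward +1, opponent -1).
    """

    def __init__(self):
        self.agents = ["a", "b"]
        self.agent_selection = "a"  # agent whose turn it is
        self.rewards = {ag: 0 for ag in self.agents}
        self.terminations = {ag: False for ag in self.agents}
        self.truncations = {ag: False for ag in self.agents}
        self.infos = {ag: {} for ag in self.agents}

    def observe(self, agent):
        # Trivial observation that just records which agent it is for.
        return f"obs-for-{agent}"

    def step(self, action):
        actor = self.agent_selection
        other = "b" if actor == "a" else "a"
        if action == 1:
            # Winning move: zero-sum rewards, episode over for both agents.
            self.rewards[actor], self.rewards[other] = 1, -1
            self.terminations = {ag: True for ag in self.agents}
        # As in AEC envs, the turn passes to the next agent after stepping.
        self.agent_selection = other

    def last(self):
        # Values for the agent whose turn it is *now*, not the one that acted.
        ag = self.agent_selection
        return (self.observe(ag), self.rewards[ag], self.terminations[ag],
                self.truncations[ag], self.infos[ag])


env = ToyTurnEnv()
env.step(1)  # agent "a" makes the winning move
obs, reward, term, trunc, info = env.last()
print(env.agent_selection, reward)  # b -1
```

After agent "a" plays the winning move, last() reports agent "b"'s reward of -1, which a wrapper built on last() would hand back as the result of "a"'s action.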

Among other things, this means each agent is trained on its opponent's reward, which encourages bad play.

The function now returns the reward, termination, truncation, and info values for the
agent that just acted. It still returns the observation for the next agent, since that
observation is used to determine the next action.
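A minimal sketch of the corrected pattern, built on a compact hypothetical toy environment: remember the acting agent before stepping, then return the reward, termination, truncation, and info for that agent, and the observation for the next one. `ToyTurnEnv` and `GymStyleWrapper` are illustrative names, not the actual SB3ActionMaskWrapper; the real wrapper also deals with action masks and builds on a PettingZoo environment, neither of which is modeled here.

```python
class ToyTurnEnv:
    """Hypothetical two-player turn-based env; playing action 1 wins for the mover."""

    def __init__(self):
        self.agents = ["a", "b"]
        self.agent_selection = "a"
        self.rewards = {ag: 0 for ag in self.agents}
        self.terminations = {ag: False for ag in self.agents}
        self.truncations = {ag: False for ag in self.agents}
        self.infos = {ag: {} for ag in self.agents}

    def observe(self, agent):
        return f"obs-for-{agent}"

    def step(self, action):
        actor = self.agent_selection
        other = "b" if actor == "a" else "a"
        if action == 1:  # winning move: zero-sum rewards, episode ends
            self.rewards[actor], self.rewards[other] = 1, -1
            self.terminations = {ag: True for ag in self.agents}
        self.agent_selection = other  # turn passes to the next agent


class GymStyleWrapper(ToyTurnEnv):
    """Hypothetical wrapper exposing a Gymnasium-style step()."""

    def step(self, action):
        actor = self.agent_selection  # remember who is acting before the turn passes
        super().step(action)
        next_agent = self.agent_selection
        return (
            self.observe(next_agent),  # observation used to pick the next action
            self.rewards[actor],       # everything else describes the actor's move
            self.terminations[actor],
            self.truncations[actor],
            self.infos[actor],
        )


env = GymStyleWrapper()
obs, reward, terminated, truncated, info = env.step(1)
print(obs, reward, terminated)  # obs-for-b 1 True
```

The winning agent now receives its own +1, while the observation still belongs to the agent that moves next.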

Fixes #1147

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have run pytest -v and no errors are present.
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Some shifting of items from medium to easy with new changes
elliottower (Contributor)

Good catch, cheers

elliottower merged commit 38e2520 into Farama-Foundation:master on May 3, 2024
46 checks passed
dm-ackerman deleted the sb3_fix branch on May 3, 2024 at 21:42
Linked issue: [Bug Report] SB3 Connect four tutorial does not train properly.