
Typo? #3

Open
Khev opened this issue Feb 25, 2019 · 3 comments

Comments

@Khev

Khev commented Feb 25, 2019

Hi there, thanks for sharing your code. I think there's an error on line 280 of main.py:

critic_loss = self.critic.fit([obs], [reward], batch_size=BATCH_SIZE, shuffle=True, epochs=EPOCHS, verbose=False)

Shouldn't the critic be fitting to the discounted_returns instead of the rewards? That is, the line should read:

critic_loss = self.critic.fit([obs], [discounted_returns], batch_size=BATCH_SIZE, shuffle=True, epochs=EPOCHS, verbose=False)

@LuEE-C
Owner

LuEE-C commented Feb 25, 2019

On line 204 we call self.transform_reward(), which transforms the contents of the reward array into the discounted rewards, so by the time the critic is fit the array already holds the discounted returns. Hope that clarifies.
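
For reference, a transform of this kind usually replaces each entry of the reward array with the discounted sum of the rewards from that step onward. A minimal sketch of what that looks like, assuming a GAMMA discount factor (the names and body below are illustrative, not the repository's actual implementation):

```python
import numpy as np

GAMMA = 0.99  # hypothetical discount factor


def transform_reward(reward):
    """Replace each per-step reward with the discounted return from that step onward."""
    running = 0.0
    for t in reversed(range(len(reward))):
        running = reward[t] + GAMMA * running
        reward[t] = running
    return reward


# Example: rewards [1, 1, 1] become returns [1 + 0.99*(1 + 0.99*1), 1 + 0.99*1, 1]
print(transform_reward(np.array([1.0, 1.0, 1.0])))  # [2.9701, 1.99, 1.0]
```

With an in-place transform like this, fitting the critic on [reward] afterward is equivalent to fitting it on the discounted returns.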

@Khev
Author

Khev commented Feb 25, 2019

Ah ya, that makes sense. Thanks!

@Khev
Author

Khev commented Feb 25, 2019

Also, I noticed you didn't use target networks for the critic. Did you observe any instability in the learning as a result? Just curious!
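
For readers unfamiliar with the term: a target network is a slowly updated copy of the critic whose predictions are used as regression targets, which can reduce instability from the critic chasing its own moving estimates. A minimal sketch of a Polyak (soft) update for Keras models, assuming a hypothetical TAU coefficient (illustrative only, not code from this repository):

```python
TAU = 0.005  # hypothetical soft-update coefficient


def soft_update(critic, target_critic, tau=TAU):
    """Move the target critic's weights a small step toward the online critic's weights."""
    online_weights = critic.get_weights()
    target_weights = target_critic.get_weights()
    new_weights = [tau * w + (1.0 - tau) * tw
                   for w, tw in zip(online_weights, target_weights)]
    target_critic.set_weights(new_weights)
```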
