
Typo? #3

Open
Khev opened this issue Feb 25, 2019 · 3 comments

Comments

@Khev

Khev commented Feb 25, 2019

Hi there, thanks for sharing your code. I think there's an error on line 280 of main.py:

critic_loss = self.critic.fit([obs], [reward], batch_size=BATCH_SIZE, shuffle=True, epochs=EPOCHS, verbose=False)

Shouldn't the critic be fitting to the discounted_returns instead of the rewards? That is, the line should read:

critic_loss = self.critic.fit([obs], [discounted_returns], batch_size=BATCH_SIZE, shuffle=True, epochs=EPOCHS, verbose=False)

@LuEE-C
Owner

LuEE-C commented Feb 25, 2019

On line 204 we call self.transform_reward(), which transforms the contents of the reward array into the discounted rewards, so by the time the critic is fit the array already holds the discounted returns. Hope that clarifies.
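
For reference, a transform of this kind usually replaces each entry of the reward array with the discounted sum of the rewards from that step onward. A minimal sketch of what that looks like, assuming a GAMMA discount factor (the names and body below are illustrative, not the repository's actual implementation):

```python
import numpy as np

GAMMA = 0.99  # hypothetical discount factor


def transform_reward(reward):
    """Replace each per-step reward with the discounted return from that step onward."""
    running = 0.0
    for t in reversed(range(len(reward))):
        running = reward[t] + GAMMA * running
        reward[t] = running
    return reward


# Example: rewards [1, 1, 1] become returns [1 + 0.99*(1 + 0.99*1), 1 + 0.99*1, 1]
print(transform_reward(np.array([1.0, 1.0, 1.0])))  # [2.9701, 1.99, 1.0]
```

With an in-place transform like this, fitting the critic on [reward] afterward is equivalent to fitting it on the discounted returns.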

@Khev
Author

Khev commented Feb 25, 2019

Ah ya, that makes sense. Thanks!

@Khev
Author

Khev commented Feb 25, 2019

Also, I noticed you didn't use target networks for the critic. Did you observe any instability in the learning as a result? Just curious!
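
For readers unfamiliar with the term: a target network is a slowly updated copy of the critic whose predictions are used as regression targets, which can reduce instability from the critic chasing its own moving estimates. A minimal sketch of a Polyak (soft) update for Keras models, assuming a hypothetical TAU coefficient (illustrative only, not code from this repository):

```python
TAU = 0.005  # hypothetical soft-update coefficient


def soft_update(critic, target_critic, tau=TAU):
    """Move the target critic's weights a small step toward the online critic's weights."""
    online_weights = critic.get_weights()
    target_weights = target_critic.get_weights()
    new_weights = [tau * w + (1.0 - tau) * tw
                   for w, tw in zip(online_weights, target_weights)]
    target_critic.set_weights(new_weights)
```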
