
cross-entropy loss with negative reward/advantage resulting in nan values #90

nutpen85 opened this issue Apr 30, 2021 · 2 comments

@nutpen85

Hi again. I finally found some time to continue with your book. This time I ran into a problem in chapters 10 and 12, where you have the policy agent and the actor-critic agent (the same problem occurs for both). After calling the train function, the model fit starts as expected. However, after some steps the loss becomes negative, then more and more negative, and finally nan. So, why does that happen?

I think it's because of the cross-entropy loss in combination with negative rewards/advantages. Those were introduced to punish bad moves and to lower their probabilities. Now, when the predicted probability of such a move is really small (e.g. 1e-10), the log in the cross-entropy turns it into a huge value. This is then multiplied by the negative label, resulting in a huge negative loss. Technically, the direction of the update is fine. However, as soon as the loss reaches nan, the model becomes useless, because you can't optimize any further.
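
To make that concrete, here is a tiny numpy sketch of the single loss term (the numbers are made up, not taken from the book's code):

    import numpy as np

    # Policy-gradient setup: the advantage is folded into the label, so the
    # per-sample loss for the chosen move is  -advantage * log(p_move).
    p_move = 1e-10       # predicted probability of the chosen move
    advantage = -1.0     # negative reward/advantage for a losing move

    print(-advantage * np.log(p_move))  # about -23: large and negative
    print(-advantage * np.log(0.0))     # -inf once the probability underflows,
                                        # after which the gradients turn into nan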

I don't know much about the theory of using softmax and cross-entropy loss and negative rewards. So, probably I'm simply missing something. Does anyone have an idea?

@macfergus
Copy link
Collaborator

Hi @nutpen85, I've run into the same problem and eventually found a (partial) solution. It didn't make it into the book because I only learned about it recently.

Your diagnosis of the root cause is right: the exp in the softmax function is prone to extreme values. Here's what I ended up doing:

  1. Remove the softmax activation from your action layer.

    Now your output will return logits instead of probabilities. When you are selecting a move, you apply the softmax function at that point to turn the logits into probabilities.

  2. Change your loss function from categorical_crossentropy to CategoricalCrossentropy(from_logits=True).

    This makes Keras treat your output as logits and apply the softmax itself when calculating the loss and its gradient, so you can train exactly as before. I think the theory here is that the combined softmax-plus-cross-entropy gradient is less prone to extreme values than taking the log of the raw softmax output. (A sketch putting all three steps together follows this list.)

  3. Add a small amount of activity regularization to your policy layer.

    Like output = Dense(num_moves, activity_regularizer=L2(0.01))(prev_layer). The regularization keeps the logit layer from drifting too far from 0, which helps prevent extreme values: essentially, the gradient gets an extra pull back toward zero as the logit values move farther away from it. If you still get extreme values, you can increase the 0.01 to something bigger.

    There's a second benefit, which is that it keeps your policy from getting fixated on a single move too early in training (i.e., the regularization preserves a little exploration).
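
Putting the three steps together, here's a rough Keras sketch (layer names, sizes, and the input shape are just placeholders, not the actual model from the book):

    from tensorflow.keras import Model
    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras.losses import CategoricalCrossentropy
    from tensorflow.keras.regularizers import L2

    num_moves = 19 * 19                              # placeholder move count
    board_input = Input(shape=(361,), name='board')  # placeholder encoding

    hidden = Dense(512, activation='relu')(board_input)

    # Steps 1 and 3: no softmax on the policy head, plus a small L2 activity
    # regularizer that keeps the logits from drifting too far from zero.
    policy_logits = Dense(
        num_moves,
        activity_regularizer=L2(0.01),
        name='policy_logits')(hidden)

    model = Model(inputs=board_input, outputs=policy_logits)

    # Step 2: the loss applies the softmax internally, so training behaves as
    # before, but the softmax + log are computed in a numerically stable way.
    model.compile(
        optimizer='sgd',
        loss=CategoricalCrossentropy(from_logits=True))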

Let me know if this helps. Here are a couple of examples from an unrelated project:

Policy output with regularization and no activation: https://github.com/macfergus/rlbridge/blob/master/rlbridge/bots/conv/model.py#L76
Moving softmax into the crossentropy function: https://github.com/macfergus/rlbridge/blob/master/rlbridge/bots/conv/model.py#L98
Selecting actions from the unactivated logit output: https://github.com/macfergus/rlbridge/blob/master/rlbridge/bots/conv/bot.py#L22
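
For reference, selecting a move from the unactivated logits looks roughly like this (a simplified sketch, not the exact code behind those links):

    import numpy as np

    def select_move(model, encoded_board):
        # The model outputs raw logits, so apply the softmax here.
        logits = model.predict(encoded_board[np.newaxis])[0]
        # Subtract the max before exponentiating for numerical stability.
        exp_logits = np.exp(logits - np.max(logits))
        probs = exp_logits / np.sum(exp_logits)
        # Sample a move index according to the policy probabilities.
        return np.random.choice(len(probs), p=probs)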

@nutpen85
Author

nutpen85 commented May 2, 2021

Hi @macfergus
Thank you very much. This seems to work like a charm and I learned something new again. I'm very curious how strong the bot will become with this. Thanks.
