Why should we use Temperature in softmax?

One reason to use temperature is to change the output distribution computed by your neural net. It is applied to the logits vector according to this equation:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $T$ is the temperature parameter.
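As a rough illustration, here is a minimal NumPy sketch of this formula (the function name and the max-subtraction for numerical stability are my own additions, not part of the formula itself):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # T = 1 recovers the ordinary softmax; T > 1 softens, T < 1 sharpens.
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()              # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()
```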

What this does is change the final probabilities. You can choose T to be any positive value (the higher the T, the ‘softer’ the distribution will be; if T is 1, the output is the same as your normal softmax output). By ‘softer’ I mean that the model will basically be less confident about its prediction. As T gets closer to 0, the distribution gets ‘harder’.

a) Sample ‘hard’ softmax probs : [0.01,0.01,0.98]

b) Sample ‘soft’ softmax probs : [0.2,0.2,0.6]

‘a’ is a ‘harder’ distribution. Your model is very confident about its predictions. However, in many cases you don’t want that. For example, if you are using an RNN to generate text, you sample from the output distribution and use the sampled word as your output token (and the next input). If your model is extremely confident, it may produce very repetitive and uninteresting text: most of the probability mass is concentrated in a few tokens, so the sampling procedure keeps selecting the same small set of words over and over again. To give other words a chance of being sampled as well, you can plug in the temperature variable and produce more diverse text.
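Here is a sketch of how that might look in a sampling loop, reusing the softmax_with_temperature helper from above (the logits are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits, T=1.0):
    # Sample a token index from the temperature-scaled output distribution.
    probs = softmax_with_temperature(logits, T)
    return rng.choice(len(probs), p=probs)

# Hypothetical logits a model might produce for the next token.
logits = [2.0, 1.0, 0.2, -1.0]
print(sample_next_token(logits, T=0.5))   # low T: almost always picks the top token
print(sample_next_token(logits, T=1.5))   # high T: lower-ranked tokens get sampled more often
```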

With regards to why higher temperatures lead to softer distributions, that has to do with the exponential function. Dividing by a temperature T > 1 penalizes bigger logits more than smaller logits: a logit z loses z(1 − 1/T) in the exponent, so the bigger the logit, the bigger the penalty. And since the exponential function grows so fast, a larger reduction in the exponent translates into a much larger shrinkage (% wise) of the exponentiated term.

Here’s what I mean,

exp(6) ~ 403
exp(3) ~ 20

Now let’s ‘penalize’ these terms with a temperature of, say, 1.5:

exp(6/1.5) ~ 54
exp(3/1.5) ~ 7.4

You can see that in % terms, the bigger the term is, the more it shrinks when the temperature is used to penalize it: exp(6) drops to about 13% of its original value, while exp(3) only drops to about 37%. When the bigger logits shrink more than the smaller logits, more of the probability mass computed by the softmax is assigned to the smaller logits.
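To check these numbers end to end, here is a small NumPy sketch that computes the softmax of the logits [6, 3] at T = 1 and T = 1.5:

```python
import numpy as np

logits = np.array([6.0, 3.0])

for T in (1.0, 1.5):
    exps = np.exp(logits / T)
    probs = exps / exps.sum()
    print(f"T={T}: exp values ~ {exps.round(1)}, softmax ~ {probs.round(3)}")

# T=1.0: exp values ~ [403.4  20.1], softmax ~ [0.953 0.047]
# T=1.5: exp values ~ [ 54.6   7.4], softmax ~ [0.881 0.119]
```

The larger logit’s share drops from about 95% to 88%, while the smaller logit’s share more than doubles, which is exactly the ‘softening’ described above.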
