We explicitly need to call `zero_grad()` because, after `loss.backward()` (which computes the gradients), we use `optimizer.step()` to perform the gradient descent update. More specifically, the gradients are not zeroed automatically because these two operations, `loss.backward()` and `optimizer.step()`, are decoupled, and `optimizer.step()` requires the just-computed gradients.
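A minimal sketch of this pattern (the model, optimizer, loss, and toy data below are illustrative placeholders, not from the original text):

```python
import torch

# Hypothetical setup: a tiny linear model with SGD and MSE loss.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for x, y in [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(3)]:
    optimizer.zero_grad()   # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()         # compute gradients for this batch
    optimizer.step()        # update parameters using the just-computed gradients
```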
In addition, we sometimes need to accumulate gradients across several batches; to do that, we can simply call `backward()` multiple times and call `optimizer.step()` once, as in the sketch below.
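A sketch of that accumulation pattern, reusing the placeholder names from the previous snippet (the window size `accumulation_steps` and the loss scaling are hypothetical choices):

```python
batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]
accumulation_steps = 4  # hypothetical: treat every 4 small batches as one effective large batch

optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the summed gradient matches one big batch
    loss.backward()                                   # gradients add into each param.grad across calls
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update using the accumulated gradients
        optimizer.zero_grad()  # reset before the next accumulation window
```

Note that each `backward()` call adds into the `.grad` buffers; this accumulating behavior is exactly why `zero_grad()` is needed when accumulation is not wanted.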