Gradient Accumulation
A useful technique for simulating a larger batch size when GPU memory is limited.
```python
num_accum = 2  # number of mini-batches to accumulate before each optimizer step

optimizer.zero_grad()
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        outputs = net(inputs)
        # Scale the loss so the accumulated gradients average over num_accum batches
        loss = criterion(outputs, labels) / num_accum
        loss.backward()  # gradients add up in the parameters' .grad buffers
        # Update and reset only every num_accum mini-batches
        if (i + 1) % num_accum == 0:
            optimizer.step()
            optimizer.zero_grad()
```
- Model parameters are only updated every num_accum iterations.
- The criterion output is divided by num_accum for normalization.
- My guess: since the gradients from num_accum backward passes are summed before a single optimizer step, dividing each loss by num_accum turns that sum into an average, so the update matches what one batch num_accum times larger would produce (see the sketch after this list).
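To check this reasoning, here is a minimal sketch (my own, not from the training code above) that compares the accumulated, scaled gradient against the gradient of the mean loss over the full batch; the toy linear model, `loss_fn`, and tensor shapes are made up purely for illustration.

```python
import torch

torch.manual_seed(0)
num_accum = 2
w = torch.randn(4, requires_grad=True)   # toy "model": a single weight vector
x = torch.randn(2 * num_accum, 4)        # full batch, split into num_accum chunks below
y = torch.randn(2 * num_accum)

def loss_fn(w, x, y):
    # simple MSE loss for the toy linear model
    return ((x @ w - y) ** 2).mean()

# Reference: gradient of the mean loss over the whole batch at once.
full_loss = loss_fn(w, x, y)
full_grad, = torch.autograd.grad(full_loss, w)

# Accumulation: backward() on each chunk's loss divided by num_accum; gradients sum in w.grad.
w.grad = None
for x_chunk, y_chunk in zip(x.chunk(num_accum), y.chunk(num_accum)):
    (loss_fn(w, x_chunk, y_chunk) / num_accum).backward()

print(torch.allclose(w.grad, full_grad))  # True: the scaled sum equals the full-batch gradient
```

In other words, the division by num_accum is what makes the accumulated update equivalent to training with one larger batch rather than num_accum times a larger effective learning rate.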