GPUtil
- Similar to nvidia-smi.
- Prints GPU utilization and memory stats to the console (call it in a loop for continuous monitoring).
```
!pip install GPUtil
```

```python
import GPUtil
GPUtil.showUtilization()
```

Tensor accumulation
Tensor variables placed on the GPU occupy GPU memory.
If these variables accumulate in a loop, GPU memory will be exhausted quickly.
e.g.,
```python
total_loss = 0
for i in range(10):
    optim.zero_grad()
    output = model(input)
    loss = criterion(output)
    loss.backward()
    optim.step()
    total_loss += loss  # here!!! `loss` is a tensor, so its computation graph is retained
```

For tensors that merely accumulate a value, are used only once, or hold a simple scalar, convert them to native Python objects (e.g., with `.item()`) whenever possible.
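A minimal runnable sketch of the fix (the toy model, optimizer, and loss here are placeholders for illustration): `.item()` returns a plain Python float, so each iteration's graph can be freed instead of accumulating.

```python
import torch

model = torch.nn.Linear(4, 1)                  # toy model, for illustration only
optim = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = lambda out: out.pow(2).mean()      # dummy loss for the sketch

total_loss = 0.0                               # plain Python float, not a tensor
for i in range(10):
    optim.zero_grad()
    output = model(torch.randn(8, 4))
    loss = criterion(output)
    loss.backward()
    optim.step()
    total_loss += loss.item()                  # .item() detaches the scalar from
                                               # the graph, so nothing is retained
```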
Out of Memory (OOM)
- Try batch size = 1 first and experiment while monitoring memory.
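One way to automate that experiment is a helper that halves the batch size until a forward pass fits. This is a hypothetical helper (the name `try_batch_sizes` and its signature are my own, not a library API); CUDA OOM surfaces as a `RuntimeError` whose message contains "out of memory".

```python
import torch

def try_batch_sizes(model, data, start=64):
    """Halve the batch size until a forward pass fits in memory.
    Illustrative helper, not part of any library."""
    bs = start
    while bs >= 1:
        try:
            with torch.no_grad():
                model(data[:bs])
            return bs                        # this batch size fits
        except RuntimeError as e:            # CUDA OOM raises RuntimeError
            if "out of memory" not in str(e):
                raise                        # re-raise unrelated errors
            torch.cuda.empty_cache()         # release cached blocks before retrying
            bs //= 2
    return None                              # even batch size 1 does not fit

model = torch.nn.Linear(4, 1)
data = torch.randn(64, 4)
bs = try_batch_sizes(model, data, start=64)
print("largest batch size that fits:", bs)
```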
torch.no_grad()
Always use it during inference. Without it, the computation graph for the backward pass is built and held in memory just as during training, even though you never call backward().
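A minimal inference sketch (toy model assumed for illustration): inside the `no_grad` context no graph is built, so activations needed for backward are never stored.

```python
import torch

model = torch.nn.Linear(4, 2)
model.eval()                      # also disables dropout / batch-norm updates

x = torch.randn(8, 4)
with torch.no_grad():             # no autograd graph is recorded here
    out = model(x)                # out.requires_grad is False
```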
Model size
For example, recurrent models such as LSTMs store activations for every time step during training, so they consume quite a bit of memory; factor in the model's own footprint too.
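A quick way to gauge that footprint is to sum the byte sizes of parameters and buffers. The helper below is my own illustrative sketch (activations and optimizer state add more on top of this number):

```python
import torch

def model_memory_mb(model):
    """Rough parameter-memory footprint in MB (parameters + buffers only).
    Illustrative helper, not a library function."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / 1024**2

lstm = torch.nn.LSTM(input_size=512, hidden_size=512, num_layers=2)
print(f"{model_memory_mb(lstm):.1f} MB of parameters")
```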
Tensor dtype
Tensors default to 32-bit floats; switching to 16-bit (float16 or bfloat16) roughly halves memory use.
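Two common ways to do this, sketched below: autocast runs selected ops in reduced precision (on GPU you would typically use `device_type="cuda"` with `torch.float16`; this CPU/bfloat16 variant is used here only so the sketch runs anywhere), and `.half()` converts the weights outright, which halves parameter memory but can underflow.

```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

# Mixed precision: matmuls inside the context run in reduced precision
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                # out.dtype is torch.bfloat16

# Or convert the weights themselves; riskier numerically
half_model = torch.nn.Linear(4, 2).half()
```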