To be continued…
Grad Cache
Overview
Grad Cache is a technique that allows contrastive learning with in-batch negatives to use large batches, playing a role analogous to gradient accumulation.
In conventional training setups, the loss is computed per sample, so gradients can simply be accumulated across micro-batches. With in-batch negatives for contrastive learning, however — as in DPR or MRC models — the loss is computed batch-wise, creating dependencies among samples within a batch: each sample's loss depends on every other sample serving as a negative. Standard gradient accumulation therefore fails here, because each small sub-batch would only see the few negatives inside itself, not the negatives of the full large batch.
Grad Cache implements an approach analogous to gradient accumulation for contrastive learning, making it possible to achieve large effective batch sizes even on a single GPU.
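The idea can be sketched in three stages: (1) encode all sub-batches without gradients to build a representation "cache", (2) compute the batch-wise contrastive loss on the cached representations and take its gradient with respect to them, and (3) re-encode each sub-batch with gradients enabled, backpropagating the cached representation gradients into the encoder. Below is a minimal PyTorch sketch of this scheme; the names `grad_cache_step`, `contrastive_loss`, and `chunk_size` are illustrative, not the actual Grad Cache library API.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, p):
    # In-batch negatives: scores[i][j] = q_i . p_j, positives on the diagonal.
    scores = q @ p.T
    labels = torch.arange(q.size(0))
    return F.cross_entropy(scores, labels)

def grad_cache_step(encoder, queries, passages, chunk_size):
    # Stage 1: gradient-free forward over sub-batches to build the cache.
    # Peak memory is bounded by chunk_size, not the full batch.
    with torch.no_grad():
        q_reps = torch.cat([encoder(c) for c in queries.split(chunk_size)])
        p_reps = torch.cat([encoder(c) for c in passages.split(chunk_size)])
    # Stage 2: batch-wise loss on the cached representations; backward here
    # yields gradients w.r.t. the representations (the "gradient cache").
    q_reps.requires_grad_()
    p_reps.requires_grad_()
    loss = contrastive_loss(q_reps, p_reps)
    loss.backward()
    # Stage 3: re-encode each sub-batch WITH gradients and backprop the
    # cached representation gradients into the encoder parameters.
    for inputs, grads in [(queries, q_reps.grad), (passages, p_reps.grad)]:
        for chunk, g in zip(inputs.split(chunk_size), grads.split(chunk_size)):
            encoder(chunk).backward(g)
    return loss.detach()
```

By the chain rule, the parameter gradients accumulated in stage 3 are identical to those of a single full-batch forward/backward, so the effective batch size for in-batch negatives is the full batch even though only `chunk_size` samples are ever encoded with gradients at once.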

In Text and Code Embeddings by Contrastive Pre-Training, the batch size is scaled up to 12,288. A batch of that size is nearly impossible to handle with hardware alone, so Grad Cache is used to secure large batch sizes for contrastive learning.