node
A term used interchangeably with system; in distributed training, one node usually means one machine (which may hold several GPUs).
model parallelization
 Model parallelization was already used in AlexNet, which split the network across two GPUs.
 Good model parallelism requires structuring the computation as a pipeline so that the GPUs run simultaneously instead of waiting on each other, as shown in the figure.
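As a concrete sketch (hypothetical two-stage model; it falls back to CPU when two GPUs are not available, and a real pipelined version would additionally split each batch into micro-batches so both stages stay busy):

```python
import torch
import torch.nn as nn

# Assumption: use "cuda:0"/"cuda:1" when two GPUs exist, else CPU so the
# sketch still runs anywhere.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

class TwoStageNet(nn.Module):
    """Model parallelism: the first half lives on dev0, the second on dev1."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(16, 32).to(dev0)
        self.stage2 = nn.Linear(32, 10).to(dev1)

    def forward(self, x):
        # Activations are copied between devices at the stage boundary.
        h = torch.relu(self.stage1(x.to(dev0)))
        return self.stage2(h.to(dev1))

out = TwoStageNet()(torch.randn(4, 16))
```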
data parallelization

- GPU1 splits the mini-batch and scatters a chunk to each GPU
- Each GPU runs its forward pass independently
- GPU1 gathers the forward outputs and computes the loss
- GPU1 scatters the gradient information back to each GPU
- Each GPU computes its gradients independently
- GPU1 gathers the gradients and averages them
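The steps above can be sketched by hand on CPU (a toy model for illustration; real DataParallel uses CUDA replicas and scatter/gather kernels instead of deepcopy):

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
data, target = torch.randn(8, 4), torch.randn(8, 2)

# Step 1: "GPU1" splits the mini-batch into one chunk per (simulated) device.
chunks = list(zip(data.chunk(2), target.chunk(2)))

# Step 2: replicate the model onto each device.
replicas = [copy.deepcopy(model) for _ in chunks]

# Steps 3-5: each replica runs forward and backward on its own chunk.
loss_fn = nn.MSELoss()
for replica, (x, y) in zip(replicas, chunks):
    loss_fn(replica(x), y).backward()

# Step 6: gradients are gathered on "GPU1" and averaged.
for name, param in model.named_parameters():
    grads = [dict(r.named_parameters())[name].grad for r in replicas]
    param.grad = torch.stack(grads).mean(dim=0)
```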
DataParallel in PyTorch
- Implements the approach described above in a single wrapper
- Simply scatters the data across GPUs and averages the results on one GPU
- GPU usage is unbalanced (GPU1 also holds the gathered outputs and gradients), so the batch size must be reduced to fit
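Using it is a one-line change (a minimal sketch; the device fallback is only so the snippet also runs on a CPU-only machine, where DataParallel simply falls through to the wrapped module):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 4).to(device)

# One line: DataParallel replicates the model across all visible GPUs on
# every forward pass and gathers the outputs back on device 0 ("GPU1").
dp_model = nn.DataParallel(model)

out = dp_model(torch.randn(6, 8).to(device))
```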
DistributedDataParallel in PyTorch
Each GPU gets its own CPU process, and every process computes its own gradients independently; the gradients are then averaged across processes with an all-reduce, so no single GPU acts as a coordinator bottleneck.
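A minimal single-process sketch of wrapping a model in DDP (assumptions: the gloo backend and a file-based rendezvous so it runs on CPU; real multi-GPU training launches one process per GPU, usually with the nccl backend):

```python
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# File-based rendezvous so this single-process example needs no network setup.
rendezvous = tempfile.NamedTemporaryFile(delete=False)
dist.init_process_group(backend="gloo",
                        init_method=f"file://{rendezvous.name}",
                        rank=0, world_size=1)

# Each process wraps its own model copy; backward() then all-reduces the
# gradients so every process ends up with the same averaged gradients.
model = DDP(nn.Linear(8, 4))
out = model(torch.randn(6, 8))

dist.destroy_process_group()
```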
- sampler: an object that determines how data is sampled in the DataLoader; DistributedSampler (provided by torch) gives each process a disjoint shard of the dataset.

train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
shuffle = False  # the sampler handles shuffling, so the DataLoader must not
pin_memory = True
train_loader = torch.utils.data.DataLoader(train_data, batch_size=20, shuffle=shuffle, pin_memory=pin_memory, num_workers=4, sampler=train_sampler)

- num_workers: number of data-loading worker processes. A common rule of thumb is 4x the number of GPUs.
- pin_memory: normally data sits in pageable memory and must be pinned before it can be copied to the GPU; this option allocates it in pinned (page-locked) memory directly, speeding up host-to-device transfer.
def main():
    ngpus_per_node = torch.cuda.device_count()
    world_size = ngpus_per_node
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, ))

Create a worker function and pass it to spawn, which launches one copy per process — much like Python's map.
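A fuller skeleton of the spawn pattern (main_worker's body is a hypothetical outline, and the extra world_size argument is an addition to match the worker's signature; actually calling main() is left to a launch script on a GPU machine):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, world_size):
    # spawn passes the process index (0..nprocs-1) as the first argument;
    # on a single node it doubles as both the GPU id and the global rank.
    torch.cuda.set_device(gpu)
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",
                            rank=gpu, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...

def main():
    ngpus_per_node = torch.cuda.device_count()
    world_size = ngpus_per_node  # single node: one process per GPU
    mp.spawn(main_worker, nprocs=ngpus_per_node,
             args=(ngpus_per_node, world_size))
```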