node
A term used interchangeably with system; in distributed training, one node usually means one machine (which may hold several GPUs).
model parallelization
 Model parallelization was already used in AlexNet, which split the network across two GPUs.
 Good model parallelism requires structuring the computation as a pipeline so that the GPUs run simultaneously instead of waiting on each other, as shown in the figure.
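As a concrete sketch (hypothetical two-stage model; it falls back to CPU when two GPUs are not available, and a real pipelined version would additionally split each batch into micro-batches so both stages stay busy):

```python
import torch
import torch.nn as nn

# Assumption: use "cuda:0"/"cuda:1" when two GPUs exist, else CPU so the
# sketch still runs anywhere.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

class TwoStageNet(nn.Module):
    """Model parallelism: the first half lives on dev0, the second on dev1."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(16, 32).to(dev0)
        self.stage2 = nn.Linear(32, 10).to(dev1)

    def forward(self, x):
        # Activations are copied between devices at the stage boundary.
        h = torch.relu(self.stage1(x.to(dev0)))
        return self.stage2(h.to(dev1))

out = TwoStageNet()(torch.randn(4, 16))
```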
data parallelization

- GPU1 splits the mini-batch and scatters a chunk to each GPU
- Each GPU runs its forward pass independently
- GPU1 gathers the forward outputs and computes the loss
- GPU1 scatters the gradient information back to each GPU
- Each GPU computes its gradients independently
- GPU1 gathers the gradients and averages them
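The steps above can be sketched by hand on CPU (a toy model for illustration; real DataParallel uses CUDA replicas and scatter/gather kernels instead of deepcopy):

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
data, target = torch.randn(8, 4), torch.randn(8, 2)

# Step 1: "GPU1" splits the mini-batch into one chunk per (simulated) device.
chunks = list(zip(data.chunk(2), target.chunk(2)))

# Step 2: replicate the model onto each device.
replicas = [copy.deepcopy(model) for _ in chunks]

# Steps 3-5: each replica runs forward and backward on its own chunk.
loss_fn = nn.MSELoss()
for replica, (x, y) in zip(replicas, chunks):
    loss_fn(replica(x), y).backward()

# Step 6: gradients are gathered on "GPU1" and averaged.
for name, param in model.named_parameters():
    grads = [dict(r.named_parameters())[name].grad for r in replicas]
    param.grad = torch.stack(grads).mean(dim=0)
```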
DataParallel in PyTorch
- Implements the approach described above in a single wrapper
- Simply scatters the data across GPUs and averages the results on one GPU
- GPU usage is unbalanced (GPU1 also holds the gathered outputs and gradients), so the batch size must be reduced to fit
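Using it is a one-line change (a minimal sketch; the device fallback is only so the snippet also runs on a CPU-only machine, where DataParallel simply falls through to the wrapped module):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 4).to(device)

# One line: DataParallel replicates the model across all visible GPUs on
# every forward pass and gathers the outputs back on device 0 ("GPU1").
dp_model = nn.DataParallel(model)

out = dp_model(torch.randn(6, 8).to(device))
```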
DistributedDataParallel in PyTorch
Each GPU gets its own CPU process, and every process computes its own gradients independently; the gradients are then averaged across processes with an all-reduce, so no single GPU acts as a coordinator bottleneck.
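A minimal single-process sketch of wrapping a model in DDP (assumptions: the gloo backend and a file-based rendezvous so it runs on CPU; real multi-GPU training launches one process per GPU, usually with the nccl backend):

```python
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# File-based rendezvous so this single-process example needs no network setup.
rendezvous = tempfile.NamedTemporaryFile(delete=False)
dist.init_process_group(backend="gloo",
                        init_method=f"file://{rendezvous.name}",
                        rank=0, world_size=1)

# Each process wraps its own model copy; backward() then all-reduces the
# gradients so every process ends up with the same averaged gradients.
model = DDP(nn.Linear(8, 4))
out = model(torch.randn(6, 8))

dist.destroy_process_group()
```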
- sampler: an object that determines how data is sampled in the DataLoader; DistributedSampler (provided by torch) gives each process a disjoint shard of the dataset.

train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
shuffle = False  # the sampler handles shuffling, so the DataLoader must not
pin_memory = True
train_loader = torch.utils.data.DataLoader(train_data, batch_size=20, shuffle=shuffle, pin_memory=pin_memory, num_workers=4, sampler=train_sampler)

- num_workers: number of data-loading worker processes. A common rule of thumb is 4x the number of GPUs.
- pin_memory: normally data sits in pageable memory and must be pinned before it can be copied to the GPU; this option allocates it in pinned (page-locked) memory directly, speeding up host-to-device transfer.
def main():
    ngpus_per_node = torch.cuda.device_count()
    world_size = ngpus_per_node
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, ))

Create a worker function and pass it to spawn, which launches one copy per process — much like Python's map.
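A fuller skeleton of the spawn pattern (main_worker's body is a hypothetical outline, and the extra world_size argument is an addition to match the worker's signature; actually calling main() is left to a launch script on a GPU machine):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, world_size):
    # spawn passes the process index (0..nprocs-1) as the first argument;
    # on a single node it doubles as both the GPU id and the global rank.
    torch.cuda.set_device(gpu)
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",
                            rank=gpu, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...

def main():
    ngpus_per_node = torch.cuda.device_count()
    world_size = ngpus_per_node  # single node: one process per GPU
    mp.spawn(main_worker, nprocs=ngpus_per_node,
             args=(ngpus_per_node, world_size))
```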