The material in this repo demonstrates multi-GPU training using PyTorch. Part 1 covers how to optimize single-GPU training. It then shows the code changes needed to enable multi-GPU training with the data-parallel and model-parallel approaches.
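As a preview of the data-parallel approach, the sketch below shows the core code changes that DistributedDataParallel requires: initializing the process group, moving the model to the local GPU, and wrapping it in DDP. The model, data, and file name `ddp_sketch.py` are placeholders for illustration, not the repo's actual examples.

```python
# ddp_sketch.py -- minimal sketch of data-parallel training with DistributedDataParallel.
# The model and data are placeholders; the repo's examples use their own.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(10, 1).to(device)      # placeholder model
    model = DDP(model, device_ids=[local_rank])    # gradients are synced across GPUs

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(10):                         # placeholder training loop
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                            # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this is launched with one process per GPU, for example `torchrun --nproc_per_node=2 ddp_sketch.py`.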
If this is your first time using Conda, follow the CARC guide on building a Conda environment: https://www.carc.usc.edu/user-guides/data-science/building-conda-environment
$ ssh <YourNetID>@discovery.usc.edu # VPN required if off-campus
$ salloc --partition=gpu --gres=gpu:1 --cpus-per-task=8 --mem=32GB --time=1:00:00
$ mamba create --name torch-env
$ mamba activate torch-env
$ mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
$ mamba install line_profiler --channel conda-forge
$ git clone https://github.com/uschpc/multi_gpu_training.git
$ cd multi_gpu_training
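The `line_profiler` package installed above is used for the single-GPU profiling step in Part 1. The hypothetical example below (the file name `train_step.py` and its contents are placeholders, not the repo's code) shows the usual workflow: decorate the function you want to profile with `@profile` and run the script through `kernprof` to get line-by-line timings.

```python
# train_step.py -- hypothetical example of profiling one function with line_profiler.
# Run with:  kernprof -l -v train_step.py
# kernprof injects the @profile decorator at run time; the fallback below
# lets the script also run as plain "python train_step.py".
import torch

try:
    profile                          # defined by kernprof when profiling
except NameError:
    def profile(func):               # no-op fallback for normal runs
        return func

@profile
def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    out = model(x)                   # per-line timings show where time is spent
    loss = loss_fn(out, y)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = torch.nn.Linear(100, 10)             # placeholder model and data
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(100):
        x = torch.randn(64, 100)
        y = torch.randint(0, 10, (64,))
        train_step(model, optimizer, loss_fn, x, y)
```

Running `kernprof -l -v train_step.py` prints the timing of each line inside `train_step`; without kernprof the script still runs normally because of the no-op fallback.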