To get a first feel for DistributedDataParallel (DDP) in PyTorch, we implement a single-machine, 2-GPU distributed training job for a linear model. This post mainly follows the introduction in the PyTorch tutorials.
The code is put together step by step below.
PyTorch's distributed communication module is torch.distributed.
The initialization code for this example is:
# set rendezvous environment variables
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['NCCL_DEBUG'] = "INFO"
# create default process group
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
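The example above uses the NCCL backend, which requires GPUs. To try the same initialization on a machine without GPUs, the CPU-only "gloo" backend can be substituted; this is a minimal single-process sketch (port 29501 is an arbitrary free port chosen for illustration):

```python
import os
import torch.distributed as dist

# Same rendezvous settings as the NCCL example, but with the CPU-only
# "gloo" backend so it runs without any GPU.
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29501'
dist.init_process_group(backend="gloo", rank=0, world_size=1)

rank = dist.get_rank()              # 0 in this single-process group
world_size = dist.get_world_size()  # 1
dist.destroy_process_group()
```

Swapping the backend string is the only change needed; the rank/world_size rendezvous logic is identical across backends.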
The local model and the DDP-wrapped model are created as follows:
# create local model
model = nn.Linear(10, 10).to(rank)
# construct DDP model
ddp_model = DDP(model, device_ids=[rank])
# define loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
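Before adding the DDP wrapper, the same model/loss/optimizer step can be sanity-checked in a single CPU process. This sketch mirrors the setup above with no distributed machinery at all:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Single-process, CPU-only version of the same setup, without DDP.
model = nn.Linear(10, 10)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

# One training step on random data.
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 10)
loss = loss_fn(outputs, labels)
loss.backward()   # fills model.weight.grad
optimizer.step()  # applies the SGD update
```

DDP changes none of these calls; it only hooks into backward() to synchronize gradients across processes.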
The forward and backward passes must go through ddp_model; this is what makes the computation actually run in a distributed fashion:
# forward pass
outputs = ddp_model(torch.randn(20, 10).to(rank))
labels = torch.randn(20, 10).to(rank)
# backward pass
loss_fn(outputs, labels).backward()
# update parameters
optimizer.step()
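During backward, DDP all-reduces each gradient bucket across ranks and divides by the world size, so every rank steps with the averaged gradient. A minimal hand-rolled version of that averaging (using a single-process gloo group here, so numerically the all-reduce leaves the values unchanged) might look like:

```python
import os
import torch
import torch.distributed as dist

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29503'
dist.init_process_group(backend="gloo", rank=0, world_size=1)

grad = torch.tensor([2.0, 4.0])
# What DDP does for each gradient bucket in backward:
# sum across all ranks, then divide by the world size to average.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()
dist.destroy_process_group()
```

Because every rank ends up with the same averaged gradient and started from the same parameters, the replicas stay in sync after each optimizer.step().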
A distributed job made up of two processes is launched with:
def main():
    worker_size = 2
    mp.spawn(run_worker,
             args=(worker_size,),
             nprocs=worker_size,
             join=True)
The complete program (linear-ddp.py):
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    os.environ['NCCL_DEBUG'] = "INFO"
    # create default process group
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()

def main():
    worker_size = 2
    mp.spawn(run_worker,
             args=(worker_size,),
             nprocs=worker_size,
             join=True)

if __name__ == "__main__":
    main()
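As an alternative to mp.spawn, PyTorch also ships the torchrun launcher, which starts one process per GPU and exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK for each of them. A sketch of the env-driven setup (setup_from_env is a hypothetical helper name, not part of the original script):

```python
# Launched as: torchrun --nproc_per_node=2 linear-ddp.py
import os
import torch.distributed as dist

def setup_from_env(backend="nccl"):
    # torchrun exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK,
    # so init_process_group can read everything from the environment
    # (the default init_method is "env://").
    dist.init_process_group(backend=backend)
    return int(os.environ["LOCAL_RANK"])
```

With torchrun there is no mp.spawn; main() simply calls setup_from_env() and uses the returned local rank as the device index.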
Running the code produces the following output:
root@g48r13:/workspace/DDP# python linear-ddp.py
g48r13:350:350 [0] NCCL INFO Bootstrap : Using [0]bond0:11.139.84.88<0>
g48r13:350:350 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
g48r13:350:350 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
g48r13:350:350 [0] NCCL INFO NET/Socket : Using [0]bond0:11.139.84.88<0>
g48r13:350:350 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
g48r13:351:351 [1] NCCL INFO Bootstrap : Using [0]bond0:11.139.84.88<0>
g48r13:351:351 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
g48r13:351:351 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
g48r13:351:351 [1] NCCL INFO NET/Socket : Using [0]bond0:11.139.84.88<0>
g48r13:351:351 [1] NCCL INFO Using network Socket
g48r13:350:366 [0] NCCL INFO Channel 00/02 : 0 1
g48r13:351:367 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
g48r13:350:366 [0] NCCL INFO Channel 01/02 : 0 1
g48r13:351:367 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
g48r13:351:367 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff
g48r13:350:366 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
g48r13:350:366 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
g48r13:350:366 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
g48r13:351:367 [1] NCCL INFO Channel 00 : 1[5000] -> 0[4000] via P2P/IPC
g48r13:350:366 [0] NCCL INFO Channel 00 : 0[4000] -> 1[5000] via P2P/IPC
g48r13:351:367 [1] NCCL INFO Channel 01 : 1[5000] -> 0[4000] via P2P/IPC
g48r13:350:366 [0] NCCL INFO Channel 01 : 0[4000] -> 1[5000] via P2P/IPC
g48r13:351:367 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
g48r13:351:367 [1] NCCL INFO comm 0x7fb0b4001060 rank 1 nranks 2 cudaDev 1 busId 5000 - Init COMPLETE
g48r13:350:366 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
g48r13:350:366 [0] NCCL INFO comm 0x7fc7a8001060 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
g48r13:350:350 [0] NCCL INFO Launch mode Parallel
Page updated: 2024-05-20