Re-construct code for supporting DDP framework #5

@RuizhouLiu

Hi,

I tried to refactor your code to use DDP for faster multi-GPU training, but I ran into the following error during the discriminator's backward pass.

File "/data2/liurzh/Decent_parallel/train.py", line 243, in <module>
    main()
  File "/data2/liurzh/Decent_parallel/train.py", line 109, in main
    model.optimize_parameters()   # calculate loss functions, get gradients, update network weights
  File "/data2/liurzh/Decent_parallel/models/decent_gan_model.py", line 126, in optimize_parameters
    self.loss_D = self.compute_D_loss()
  File "/data2/liurzh/Decent_parallel/models/decent_gan_model.py", line 162, in compute_D_loss
    pred_fake = self.netD(fake)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data2/liurzh/Decent_parallel/models/networks.py", line 1486, in forward
    return self.model(input)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/container.py", line 250, in forward
    input = module(input)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data2/liurzh/Decent_parallel/models/networks.py", line 61, in forward
    return F.conv2d(self.pad(inp), self.filt, stride=self.stride, groups=inp.shape[1])
 (Triggered internally at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data2/liurzh/Decent_parallel/train.py", line 243, in <module>
[rank0]:     main()
[rank0]:   File "/data2/liurzh/Decent_parallel/train.py", line 109, in main
[rank0]:     model.optimize_parameters()   # calculate loss functions, get gradients, update network weights
[rank0]:   File "/data2/liurzh/Decent_parallel/models/decent_gan_model.py", line 127, in optimize_parameters
[rank0]:     self.loss_D.backward()
[rank0]:   File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/data3/liurzh/miniconda3/envs/dhc/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 1, 3, 3]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
[rank0]:[W529 08:32:03.448938579 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0529 08:32:09.141000 1181402 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1186218) of binary: /data3/liurzh/miniconda3/envs/dhc/bin/python

I am using torch 2.6, CUDA 12.4, and Python 3.10. How can I solve this issue?
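
For reference, the same class of error can be reproduced without DDP. The sketch below is a minimal illustration, not the repository's code: `filt` is a hypothetical stand-in for the fixed `[256, 1, 3, 3]` kernel used in `networks.py` line 61, and the in-place `copy_` stands in for whatever write changes that tensor between the forward and backward passes. One possible cause (an assumption, not confirmed in this thread) is that DDP re-broadcasts module buffers in place on every forward call when `broadcast_buffers=True`, so calling `self.netD` twice before `backward()` would bump the version of the saved kernel.

```python
import torch
import torch.nn.functional as F


def reproduce_inplace_error() -> str:
    """Return the autograd error raised when a saved tensor is modified in place."""
    # Hypothetical stand-in for a fixed blur kernel held as a buffer (no gradient).
    filt = torch.ones(4, 1, 3, 3) / 9.0
    x = torch.randn(2, 4, 8, 8, requires_grad=True)

    # Depthwise convolution: autograd saves `filt` to compute the gradient w.r.t. x.
    y = F.conv2d(x, filt, groups=x.shape[1])

    # Any in-place write bumps the version counter, even if the values are unchanged.
    filt.copy_(filt.clone())

    try:
        y.sum().backward()
    except RuntimeError as err:
        return str(err)
    return ""


message = reproduce_inplace_error()
print(message)
```

If buffer broadcasting is indeed the cause here, wrapping the discriminator with `DistributedDataParallel(..., broadcast_buffers=False)` might be worth trying, since the kernel is constant and should already be identical on every rank.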
