
Distributed training error #7

@ruida


Hi,
I allocated 64 GB of RAM and 4 A100 GPUs:

```
#SBATCH --time=72:00:00
#SBATCH --mem=64g
#SBATCH --job-name="ifseg"
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=BEGIN,END,ALL

sh run_scripts/IFSeg/coco_unseen.sh
```

Here is the distributed training error message. Any input? Thanks.

--Ruida

```
single-machine distributed training is initialized.
/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
```
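As the FutureWarning above says, `torchrun` sets `LOCAL_RANK` in each worker's environment instead of passing a `--local_rank` argument. A minimal sketch of the recommended pattern (the function name `get_local_rank` is my own, not from the IFSeg code):

```python
import os

def get_local_rank() -> int:
    # torchrun sets LOCAL_RANK in each worker's environment; the legacy
    # torch.distributed.launch only does so when run with --use_env.
    # Default to 0 so single-process runs still work.
    return int(os.environ.get("LOCAL_RANK", 0))
```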
```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 2012686) of binary: /gpfs/gsfs12/users/me/conda/envs/ifseg/bin/python3
Traceback (most recent call last):
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/gsfs12/users/me/conda/envs/ifseg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
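The per-rank report below shows `error_file: <N/A>`, so elastic gives no detail beyond the signal. `torch.distributed.elastic` can record each worker's failure to an error file if the worker entry point is wrapped with the `@record` decorator. A minimal sketch (`main` here is a stand-in for the actual IFSeg entry function, not the project's real code):

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record  # on failure, writes error info to the file elastic expects
def main():
    ...  # training loop would go here

if __name__ == "__main__":
    main()
```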

[1]:
time : 2023-12-23_05:09:30
host : localhost
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 2012687)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2012687
[2]:
time : 2023-12-23_05:09:30
host : localhost
rank : 2 (local_rank: 2)
exitcode : -11 (pid: 2012688)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2012688
[3]:
time : 2023-12-23_05:09:30
host : localhost
rank : 3 (local_rank: 3)
exitcode : -11 (pid: 2012689)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2012689

Root Cause (first observed failure):
[0]:
time : 2023-12-23_05:09:30
host : localhost
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 2012686)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2012686
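Since all four ranks die with Signal 11 (SIGSEGV) and the elastic report carries no Python traceback, one generic way to see where the crash happens on the Python side is to enable `faulthandler` near the top of the training entry point. This is a standard-library sketch, not part of the IFSeg scripts; it prints the Python stack of every thread to stderr when a fatal signal arrives, even if the segfault itself originates in native code:

```python
import faulthandler
import sys

# Dump the Python tracebacks of all threads to stderr if the process
# receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.stderr, all_threads=True)
```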
