Hello,
I was following the Step-by-Step tutorial and tried to build from the source code.
Single-machine training with DMLC_NUM_WORKER=1 and multiple GPUs runs fine (up to 8 GPUs), but distributed training fails when the only change I make is
DMLC_NUM_WORKER=1
to
DMLC_NUM_WORKER=2
I launched everything on a single node, and this node has 8 GPUs in it.
It gives the following error:
src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
munmap_chunk(): invalid pointer
Aborted (core dumped)
I launched the roles in the following order on the same node: Worker -> Server -> Worker -> Scheduler
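The launch sequence can be sketched as follows (the script names here are hypothetical; the actual contents of each per-role script are shown below):

```shell
# Sketch of the launch order used on the single node. The names
# worker0.sh / server.sh / worker1.sh / scheduler.sh are placeholders
# for the per-role scripts listed below.
order="worker0 server worker1 scheduler"
for role in $order; do
    echo "launching $role"
    # bash "$role.sh" &   # real launch omitted; scripts are machine-specific
done
```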
Bash Script for launching Worker-0 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Worker-1 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Server is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
Bash Script for launching Scheduler is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
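Incidentally, the failing log line says Transport=IPC, which (as I understand it) BytePS selects when BYTEPS_ENABLE_IPC=1 and the peer process is on the same host. A quick sanity check would be to disable IPC on every role and see whether the crash persists; a sketch, changing only one variable from the scripts above:

```shell
# Sanity-check sketch: disable BytePS's shared-memory IPC path so the
# RDMA transport is used even between co-located processes.
export BYTEPS_ENABLE_IPC=0        # was 1 in all four scripts above
export DMLC_ENABLE_RDMA=ibverbs   # unchanged
echo "BYTEPS_ENABLE_IPC=$BYTEPS_ENABLE_IPC"
# bpslaunch ...                   # rest of each script unchanged
```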
Can you please help me solve this error? Thank you.