Hello,
I was following the Step-by-Step tutorial and tried to build from the source code.
Single-machine training with DMLC_NUM_WORKER=1 and multiple GPUs runs fine (up to 8 GPUs), but distributed training fails when the only change I make is
DMLC_NUM_WORKER=1
to
DMLC_NUM_WORKER=2
I launched everything on a single node, and this node has 8 GPUs in it.
It gives the following error:
src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
munmap_chunk(): invalid pointer
Aborted (core dumped)
I launched the roles in the following order on the same node: Worker -> Server -> Worker -> Scheduler
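The launch sequence can be sketched as follows (the script names here are hypothetical; the actual contents of each per-role script are shown below):

```shell
# Sketch of the launch order used on the single node. The names
# worker0.sh / server.sh / worker1.sh / scheduler.sh are placeholders
# for the per-role scripts listed below.
order="worker0 server worker1 scheduler"
for role in $order; do
    echo "launching $role"
    # bash "$role.sh" &   # real launch omitted; scripts are machine-specific
done
```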
Bash Script for launching Worker-0 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Worker-1 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Server is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
Bash Script for launching Scheduler is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
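Incidentally, the failing log line says Transport=IPC, which (as I understand it) BytePS selects when BYTEPS_ENABLE_IPC=1 and the peer process is on the same host. A quick sanity check would be to disable IPC on every role and see whether the crash persists; a sketch, changing only one variable from the scripts above:

```shell
# Sanity-check sketch: disable BytePS's shared-memory IPC path so the
# RDMA transport is used even between co-located processes.
export BYTEPS_ENABLE_IPC=0        # was 1 in all four scripts above
export DMLC_ENABLE_RDMA=ibverbs   # unchanged
echo "BYTEPS_ENABLE_IPC=$BYTEPS_ENABLE_IPC"
# bpslaunch ...                   # rest of each script unchanged
```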
Can you please help me solve this error? Thank you.