
ENH: Display model name in process #1891

Merged
4 commits merged into xorbitsai:main from the model_load_opt branch on Nov 5, 2024

Conversation

@frostyplanet frostyplanet commented Jul 18, 2024

Display the model name in the process name to make management and debugging easier. This PR also includes some logging improvements.

  1. Retry up to 3 times when model loading fails with a CUDA busy error; this error can be encountered on an AWS instance that has just been launched.

  2. ModelActor.__repr__: add replica_model_uid so the generated wrapper-function log identifies the model.

  3. Previously the worker used forkserver to spawn model processes, so `ps auxf` showed:

root     1047138  1.6  0.0 41306296 643932 ?     Sl   Jul18  30:22      \_ /opt/conda/bin/python /opt/conda/bin/xinference-worker --supervisor-addr 127.0.0.1:9099 --worker-port 9010 --host 192.168.0.32 --log-lev
root     1048111  0.0  0.0  13704 10708 ?        S    Jul18   0:00          \_ /opt/conda/bin/python -c from multiprocessing.resource_tracker import main;main(87)                                                 
root     1048112  0.0  0.0  14500 11784 ?        S    Jul18   0:00          \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(87, 89, ['__main__'], **{'sys_path': ['/opt/conda/lib/pyt
root     1048113  0.4  0.2 31870884 1782516 ?    Sl   Jul18   8:03              \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(87, 89, ['__main__'], **{'sys_path': ['/opt/conda/lib
root     1048269  0.9  1.0 193959100 6789368 ?   Sl   Jul18  18:08              \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(87, 89, ['__main__'], **{'sys_path': ['/opt/conda/lib
root     1048346  2.4  0.0 1834028 437944 ?      Sl   Jul18  45:47              |   \_ /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-18_12-25-06_49660
root     1048472  0.0  0.0 394544 72864 ?        Sl   Jul18   1:19              |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/sess
root     1048473  0.0  0.0 398528 93600 ?        Sl   Jul18   1:48              |   \_ /opt/conda/bin/python /opt/conda/lib/python3.10/site-packages/ray/dashboard/dashboard.py --host=127.0.0.1 --port=8265 --port
root     1048525  2.4  0.0 122161128 69092 ?     Sl   Jul18  45:12              |   \_ /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-18_12-2
root     1048601  0.0  0.0 1798280 84468 ?       Sl   Jul18   0:16              |   |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-address=192.168.0.32 -
root     1048603  0.0  0.0 1779412 66892 ?       Sl   Jul18   0:13              |   |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/_private/runtime_env/agent/main.py --node-ip-addres

We could not distinguish a model by its process name. (Sometimes we need to debug the network connections or resource usage of a particular model.)
This PR adds the optional dependency setproctitle and renames the model process to Model: {replica_model_uid}.

Because not every model type has self._model.model_uid and self._model.repl_id, I added a replica_model_uid argument to ModelActor. After the change, `ps auxf` shows:

root     1915511  8.1  0.1 42043624 672568 ?     Sl   03:06   2:19      \_ /opt/conda/bin/python /opt/conda/bin/xinference-worker --supervisor-addr 127.0.0.1:9099 --worker-port 9010 --host 192.168.0.32 --log-lev
root     1915615  0.0  0.0  13704 10640 ?        S    03:06   0:00          \_ /opt/conda/bin/python -c from multiprocessing.resource_tracker import main;main(89)                                                 
root     1915616  0.0  0.0  14332 11636 ?        S    03:06   0:00          \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(89, 91, ['__main__'], **{'sys_path': ['/opt/conda/lib/pyt
root     1915617  7.3  1.1 191508928 7813660 ?   Sl   03:06   2:04              \_ Model: qwen2-instruct:72-1-1                                                                                    
root     1915856  2.6  0.0 1748516 417652 ?      Sl   03:06   0:44              |   \_ /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-21_03-06-41_30925
root     1916060  0.1  0.0 382252 69496 ?        Sl   03:06   0:01              |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/sess
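
A minimal sketch of how the optional setproctitle rename can work; the helper name below is illustrative, not the actual PR diff. setproctitle is a real PyPI package, but since it is an optional dependency here the import is guarded:

```python
def set_model_process_title(replica_model_uid: str) -> None:
    # Hypothetical helper illustrating the rename; without setproctitle
    # installed this becomes a no-op and the default title is kept.
    try:
        from setproctitle import setproctitle
    except ImportError:
        return
    # After this call, `ps auxf` shows e.g. "Model: qwen2-instruct:72-1-1"
    # instead of "python -c from multiprocessing.forkserver import main ...".
    setproctitle(f"Model: {replica_model_uid}")
```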

@XprobeBot XprobeBot added enhancement New feature or request gpu labels Jul 18, 2024
@XprobeBot XprobeBot added this to the v0.13.2 milestone Jul 18, 2024
@frostyplanet frostyplanet force-pushed the model_load_opt branch 8 times, most recently from f7b1127 to e57454e on July 20, 2024 at 19:40
@XprobeBot XprobeBot modified the milestones: v0.13.2, v0.13.4 Jul 26, 2024
@frostyplanet frostyplanet changed the title from "ENH: Model loading optimization" to "ENH: Display model name in process" Jul 29, 2024
@XprobeBot XprobeBot modified the milestones: v0.14, v0.15 Sep 3, 2024
@frostyplanet frostyplanet force-pushed the model_load_opt branch 5 times, most recently from b5bb0b6 to 530ede9 on September 5, 2024 at 10:01
qinxuye commented Sep 8, 2024

Please add the new requirements to requirements.txt and requirements_cpu.txt under https://github.com/xorbitsai/inference/tree/main/xinference/deploy/docker.

@frostyplanet frostyplanet (author) replied:
@qinxuye done

@XprobeBot XprobeBot modified the milestones: v0.15, v0.16 Oct 30, 2024
qinxuye commented Nov 5, 2024

There is a conflict again; please resolve it.

I think this PR is helpful when we want to see which model a process is running.

The error might be encountered on a newly launched AWS instance:

File "/opt/inference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/opt/inference/xinference/model/image/stable_diffusion/core.py", line 61, in load
    self._model = move_model_to_available_device(self._model)
  File "/opt/inference/xinference/device_utils.py", line 56, in move_model_to_available_device
    return model.to(device)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 418, in to
    module.to(device, dtype)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

RuntimeError: [address=172.31.25.185:39061, pid=44] CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
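
For reference, a minimal sketch of the retry-on-busy behavior described in the PR; the function and constant names here are illustrative assumptions, not the actual diff:

```python
import logging
import time

logger = logging.getLogger(__name__)

MAX_LOAD_ATTEMPTS = 3  # the PR description says loading is retried 3 times


def load_model_with_retry(model):
    """Retry model.load() when CUDA reports the device as busy.

    On a freshly launched AWS instance the GPU may briefly report
    "CUDA-capable device(s) is/are busy or unavailable"; retrying after
    a short sleep usually succeeds.
    """
    for attempt in range(1, MAX_LOAD_ATTEMPTS + 1):
        try:
            return model.load()
        except RuntimeError as err:
            # Re-raise anything that is not the transient busy error,
            # or if we have exhausted our attempts.
            if "busy or unavailable" not in str(err) or attempt == MAX_LOAD_ATTEMPTS:
                raise
            logger.warning(
                "CUDA device busy while loading model (attempt %d/%d), retrying",
                attempt,
                MAX_LOAD_ATTEMPTS,
            )
            time.sleep(attempt)  # simple backoff before the next attempt
```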
`ps auxf` will show "Model: XXX"
instead of "python -c from multiprocessing.forkserver import main".

ModelActor gains a replica_model_uid argument because the attributes on _model are not uniform across model types.
This also makes it easier to distinguish the log_async DEBUG output of different models.

A chat() log will look like:
2024-07-14 10:18:23,974 xinference.core.model 2168589 DEBUG    Enter wrapped_func, args: (ModelActor(qwen1.5-chat:1_8), 'aaaa', None,
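
A hedged sketch of how a __repr__ built from replica_model_uid produces that log line; this skeleton is illustrative, not the actual class:

```python
class ModelActor:
    # Illustrative skeleton; the real class has many more responsibilities.
    def __init__(self, replica_model_uid: str):
        # Passed in explicitly because not every model type exposes
        # self._model.model_uid / self._model.repl_id uniformly.
        self._replica_model_uid = replica_model_uid

    def __repr__(self) -> str:
        # Rendered into DEBUG lines such as:
        #   Enter wrapped_func, args: (ModelActor(qwen1.5-chat:1_8), ...)
        return f"ModelActor({self._replica_model_uid})"
```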
@qinxuye qinxuye left a comment:
LGTM

@qinxuye qinxuye merged commit 325626f into xorbitsai:main Nov 5, 2024
12 of 13 checks passed