
ENH: Display model name in process #1891

Merged
4 commits merged into xorbitsai:main from the model_load_opt branch on Nov 5, 2024

Conversation

@frostyplanet frostyplanet commented Jul 18, 2024

Display the model name in the process name to make management and debugging easier. This PR also includes some logging improvements.

  1. Retry up to 3 times when model loading fails with a CUDA busy error; this error can be encountered on an AWS instance that has just been launched.

  2. ModelActor.__repr__: add replica_model_uid so the generated wrapper-function log identifies the model.

  3. Previously the worker used forkserver to spawn model processes, so `ps auxf` showed:

root     1047138  1.6  0.0 41306296 643932 ?     Sl   Jul18  30:22      \_ /opt/conda/bin/python /opt/conda/bin/xinference-worker --supervisor-addr 127.0.0.1:9099 --worker-port 9010 --host 192.168.0.32 --log-lev
root     1048111  0.0  0.0  13704 10708 ?        S    Jul18   0:00          \_ /opt/conda/bin/python -c from multiprocessing.resource_tracker import main;main(87)                                                 
root     1048112  0.0  0.0  14500 11784 ?        S    Jul18   0:00          \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(87, 89, ['__main__'], **{'sys_path': ['/opt/conda/lib/pyt
root     1048113  0.4  0.2 31870884 1782516 ?    Sl   Jul18   8:03              \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(87, 89, ['__main__'], **{'sys_path': ['/opt/conda/lib
root     1048269  0.9  1.0 193959100 6789368 ?   Sl   Jul18  18:08              \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(87, 89, ['__main__'], **{'sys_path': ['/opt/conda/lib
root     1048346  2.4  0.0 1834028 437944 ?      Sl   Jul18  45:47              |   \_ /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-18_12-25-06_49660
root     1048472  0.0  0.0 394544 72864 ?        Sl   Jul18   1:19              |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/sess
root     1048473  0.0  0.0 398528 93600 ?        Sl   Jul18   1:48              |   \_ /opt/conda/bin/python /opt/conda/lib/python3.10/site-packages/ray/dashboard/dashboard.py --host=127.0.0.1 --port=8265 --port
root     1048525  2.4  0.0 122161128 69092 ?     Sl   Jul18  45:12              |   \_ /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-18_12-2
root     1048601  0.0  0.0 1798280 84468 ?       Sl   Jul18   0:16              |   |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-address=192.168.0.32 -
root     1048603  0.0  0.0 1779412 66892 ?       Sl   Jul18   0:13              |   |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/_private/runtime_env/agent/main.py --node-ip-addres

We could not distinguish a model by its process name. (Sometimes we need to debug the network connections or resource usage of a particular model.)
This PR adds the optional dependency setproctitle and renames the model process to Model: {replica_model_uid}.

Because not every model type has self._model.model_uid and self._model.repl_id, I added a replica_model_uid argument to ModelActor. After the change, `ps auxf` shows:

root     1915511  8.1  0.1 42043624 672568 ?     Sl   03:06   2:19      \_ /opt/conda/bin/python /opt/conda/bin/xinference-worker --supervisor-addr 127.0.0.1:9099 --worker-port 9010 --host 192.168.0.32 --log-lev
root     1915615  0.0  0.0  13704 10640 ?        S    03:06   0:00          \_ /opt/conda/bin/python -c from multiprocessing.resource_tracker import main;main(89)                                                 
root     1915616  0.0  0.0  14332 11636 ?        S    03:06   0:00          \_ /opt/conda/bin/python -c from multiprocessing.forkserver import main; main(89, 91, ['__main__'], **{'sys_path': ['/opt/conda/lib/pyt
root     1915617  7.3  1.1 191508928 7813660 ?   Sl   03:06   2:04              \_ Model: qwen2-instruct:72-1-1                                                                                    
root     1915856  2.6  0.0 1748516 417652 ?      Sl   03:06   0:44              |   \_ /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-21_03-06-41_30925
root     1916060  0.1  0.0 382252 69496 ?        Sl   03:06   0:01              |   \_ /opt/conda/bin/python -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/sess
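
A minimal sketch of how the optional setproctitle rename can work; the helper name below is illustrative, not the actual PR diff. setproctitle is a real PyPI package, but since it is an optional dependency here the import is guarded:

```python
def set_model_process_title(replica_model_uid: str) -> None:
    # Hypothetical helper illustrating the rename; without setproctitle
    # installed this becomes a no-op and the default title is kept.
    try:
        from setproctitle import setproctitle
    except ImportError:
        return
    # After this call, `ps auxf` shows e.g. "Model: qwen2-instruct:72-1-1"
    # instead of "python -c from multiprocessing.forkserver import main ...".
    setproctitle(f"Model: {replica_model_uid}")
```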

@XprobeBot XprobeBot added enhancement New feature or request gpu labels Jul 18, 2024
@XprobeBot XprobeBot added this to the v0.13.2 milestone Jul 18, 2024
@frostyplanet frostyplanet force-pushed the model_load_opt branch 8 times, most recently from f7b1127 to e57454e on July 20, 2024 at 19:40
@XprobeBot XprobeBot modified the milestones: v0.13.2, v0.13.4 Jul 26, 2024
@frostyplanet frostyplanet changed the title from "ENH: Model loading optimization" to "ENH: Display model name in process" Jul 29, 2024
@XprobeBot XprobeBot modified the milestones: v0.14, v0.15 Sep 3, 2024
@frostyplanet frostyplanet force-pushed the model_load_opt branch 5 times, most recently from b5bb0b6 to 530ede9 on September 5, 2024 at 10:01
qinxuye commented Sep 8, 2024

Please add the new requirements to requirements.txt and requirements_cpu.txt under https://github.com/xorbitsai/inference/tree/main/xinference/deploy/docker.

@frostyplanet frostyplanet (author) replied:
@qinxuye done

@XprobeBot XprobeBot modified the milestones: v0.15, v0.16 Oct 30, 2024
qinxuye commented Nov 5, 2024

There is a conflict again; please resolve it.

I think this PR is helpful when we want to see which model a process is running.

The error might be encountered on a newly launched AWS instance:

File "/opt/inference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/opt/inference/xinference/model/image/stable_diffusion/core.py", line 61, in load
    self._model = move_model_to_available_device(self._model)
  File "/opt/inference/xinference/device_utils.py", line 56, in move_model_to_available_device
    return model.to(device)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 418, in to
    module.to(device, dtype)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

RuntimeError: [address=172.31.25.185:39061, pid=44] CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
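
For reference, a minimal sketch of the retry-on-busy behavior described in the PR; the function and constant names here are illustrative assumptions, not the actual diff:

```python
import logging
import time

logger = logging.getLogger(__name__)

MAX_LOAD_ATTEMPTS = 3  # the PR description says loading is retried 3 times


def load_model_with_retry(model):
    """Retry model.load() when CUDA reports the device as busy.

    On a freshly launched AWS instance the GPU may briefly report
    "CUDA-capable device(s) is/are busy or unavailable"; retrying after
    a short sleep usually succeeds.
    """
    for attempt in range(1, MAX_LOAD_ATTEMPTS + 1):
        try:
            return model.load()
        except RuntimeError as err:
            # Re-raise anything that is not the transient busy error,
            # or if we have exhausted our attempts.
            if "busy or unavailable" not in str(err) or attempt == MAX_LOAD_ATTEMPTS:
                raise
            logger.warning(
                "CUDA device busy while loading model (attempt %d/%d), retrying",
                attempt,
                MAX_LOAD_ATTEMPTS,
            )
            time.sleep(attempt)  # simple backoff before the next attempt
```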
`ps auxf` will show "Model: XXX"
instead of "python -c from multiprocessing.forkserver import main".

ModelActor gains a replica_model_uid argument because the attributes on _model are not uniform across model types.
This also makes it easier to distinguish the log_async DEBUG output of different models.

A chat() log will look like:
2024-07-14 10:18:23,974 xinference.core.model 2168589 DEBUG    Enter wrapped_func, args: (ModelActor(qwen1.5-chat:1_8), 'aaaa', None,
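
A hedged sketch of how a __repr__ built from replica_model_uid produces that log line; this skeleton is illustrative, not the actual class:

```python
class ModelActor:
    # Illustrative skeleton; the real class has many more responsibilities.
    def __init__(self, replica_model_uid: str):
        # Passed in explicitly because not every model type exposes
        # self._model.model_uid / self._model.repl_id uniformly.
        self._replica_model_uid = replica_model_uid

    def __repr__(self) -> str:
        # Rendered into DEBUG lines such as:
        #   Enter wrapped_func, args: (ModelActor(qwen1.5-chat:1_8), ...)
        return f"ModelActor({self._replica_model_uid})"
```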
@qinxuye qinxuye left a comment:
LGTM

@qinxuye qinxuye merged commit 325626f into xorbitsai:main Nov 5, 2024
12 of 13 checks passed