
xinference 0.15.0 does not automatically enable flash_attn #2287

Closed
1 of 3 tasks
LaureatePoet opened this issue Sep 12, 2024 · 3 comments · Fixed by LaureatePoet/inference#1 or #2289

@LaureatePoet
Contributor

System Info

  • Python 3.8.17
  • torch 2.0.0

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

xinference 0.15.0

The command used to start Xinference

nohup xinference-local --host 0.0.0.0 --port 9997 > ./xinfer.log 2>&1 &

Reproduction

For example, with Qwen2-VL, flash_attention_2 is not enabled after the model is launched.

Current temporary workaround: go into the xinference pip package directory.

# Edit the model's load parameters
vim model/llm/transformers/qwen2_vl.py
# Modify line 63 to:
self.model_path, torch_dtype="bfloat16", device_map=device, attn_implementation="flash_attention_2", trust_remote_code=True

or

sed -i '63s|self.model_path, device_map=device, trust_remote_code=True|self.model_path, torch_dtype="bfloat16", device_map=device, attn_implementation="flash_attention_2", trust_remote_code=True|' model/llm/transformers/qwen2_vl.py
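For context, a minimal standalone sketch of what the patched call amounts to (the model path and device_map value below are placeholders, and the exact class used in qwen2_vl.py may differ):

# Sketch of the patched load, assuming transformers' Qwen2VLForConditionalGeneration.
# Model path and device_map are placeholders, not xinference's actual values.
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # placeholder model path
    torch_dtype="bfloat16",
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)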

Expected behavior

Detect whether the flash_attn package is present in the environment; if it is, load the model with torch_dtype="bfloat16", attn_implementation="flash_attention_2".
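A minimal sketch of the requested detection logic (the helper name and the commented load site are illustrative, not xinference's actual code):

# Sketch: enable flash_attention_2 only when the flash_attn package is installed.
# The helper name and kwargs layout are assumptions for illustration.
import importlib.util

def flash_attn_kwargs():
    kwargs = {}
    if importlib.util.find_spec("flash_attn") is not None:
        kwargs["torch_dtype"] = "bfloat16"
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs

# At the load site, e.g.:
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     model_path, device_map=device, trust_remote_code=True, **flash_attn_kwargs()
# )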

@XprobeBot added this to the v0.15 milestone Sep 12, 2024
@qinxuye
Contributor

qinxuye commented Sep 12, 2024

Would you be interested in submitting a PR to support this?

@LaureatePoet
Contributor Author

Possibly interested, but I'm not sure where to start. I haven't figured out what trust_remote_code is being passed here, or whether torch_dtype="bfloat16", attn_implementation="flash_attention_2" can be passed through.

@qinxuye
Contributor

qinxuye commented Sep 12, 2024

That's fine. We can start by supporting these two parameters; once there's a PR, we can look at compatibility and related issues.
