flash_attn

System Info / 系統信息

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息
xinference 0.15.0

The command used to start Xinference / 用以启动 xinference 的命令
nohup xinference-local --host 0.0.0.0 --port 9997 > ./xinfer.log 2>&1 &

Reproduction / 复现过程
For example, with Qwen2-VL: after the model is launched, flash_attention_2 is not enabled.

Current temporary workaround: go into the xinference pip package directory, then either

# Edit the model's load arguments
vim model/llm/transformers/qwen2_vl.py
# Change line 63 to:
self.model_path, torch_dtype="bfloat16", device_map=device, attn_implementation="flash_attention_2", trust_remote_code=True

or

sed -i '63s|self.model_path, device_map=device, trust_remote_code=True|self.model_path, torch_dtype="bfloat16", device_map=device, attn_implementation="flash_attention_2", trust_remote_code=True|' model/llm/transformers/qwen2_vl.py
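To confirm the edit took effect, one sanity check outside xinference is to load the model directly with transformers using the same arguments and inspect which attention backend was selected; the local model path below is a placeholder.

```python
# Sanity check outside xinference: load Qwen2-VL with the same arguments the
# patched line 63 passes, then print the attention implementation that
# transformers actually selected.
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/path/to/Qwen2-VL-7B-Instruct",       # placeholder model path
    torch_dtype="bfloat16",
    device_map="cuda",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# _attn_implementation is a private config attribute; expect "flash_attention_2"
print(model.config._attn_implementation)
```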
Expected behavior / 期待表现
Detect whether the flash_attn package is present in the environment; if it is, enable torch_dtype="bfloat16", attn_implementation="flash_attention_2" when loading the model.
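A minimal sketch of that detection logic (illustrative only, not xinference's actual code); the helper name and the place it would be called from are assumptions:

```python
# Illustrative sketch: choose model-loading kwargs based on whether flash_attn
# is importable. Not xinference's actual implementation.
import importlib.util


def build_load_kwargs(device: str) -> dict:
    kwargs = {"device_map": device, "trust_remote_code": True}
    if importlib.util.find_spec("flash_attn") is not None:
        # flash_attn is installed: opt into the FlashAttention-2 kernels
        kwargs["torch_dtype"] = "bfloat16"
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


# Hypothetical usage inside the model's load():
#   model = Qwen2VLForConditionalGeneration.from_pretrained(
#       self.model_path, **build_load_kwargs(device))
```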
Interested in submitting a PR to support this?
Possibly interested, but I am not sure where to start. I have not yet figured out what trust_remote_code is passing here, or whether torch_dtype="bfloat16", attn_implementation="flash_attention_2" can be passed through as well.
That is fine. We can start by just adding support for these two parameters; once there is a PR we can look at compatibility and related issues.
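A rough sketch of what such a PR could look like, assuming the model class has access to user-supplied load kwargs (the attribute names self._kwargs, self._device, and self.model_path are illustrative, not necessarily xinference's real internals):

```python
# Illustrative only: inside the Qwen2-VL model class, forward the two parameters
# to from_pretrained when the caller supplies them. Attribute names are
# assumptions, not xinference's actual API.
from transformers import Qwen2VLForConditionalGeneration


class Qwen2VLModelSketch:
    def load(self):
        kwargs = {"device_map": self._device, "trust_remote_code": True}
        for key in ("torch_dtype", "attn_implementation"):
            if key in self._kwargs:
                # pass through only the parameters the user explicitly set
                kwargs[key] = self._kwargs[key]
        self._model = Qwen2VLForConditionalGeneration.from_pretrained(
            self.model_path, **kwargs
        )
```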