New versions of magic-pdf do not work on H20 #1293

Closed
wzl0329 opened this issue Dec 14, 2024 · 8 comments
Labels
question Further information is requested

Comments

wzl0329 commented Dec 14, 2024

Description of the bug

On an H20, magic-pdf 0.9.x and 0.10.x crash with the error shown below. In the same environment, magic-pdf==0.8.1 works without problems.

2024-12-11 14:34:18.129 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 0.33
2024-12-11 14:34:18.175 | INFO | magic_pdf.model.pdf_extract_kit:call:192 - mfd time: 0.04
2024-12-11 14:34:18.176 | INFO | magic_pdf.model.pdf_extract_kit:call:199 - formula nums: 0, mfr time: 0.0
2024-12-11 14:34:19.285 | INFO | magic_pdf.model.pdf_extract_kit:call:230 - ocr time: 1.11
2024-12-11 14:34:19.286 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 2, page total time: 1.48-----
2024-12-11 14:34:19.672 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:178 - gc time: 0.39
2024-12-11 14:34:19.673 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:182 - doc analyze time: 12.32, speed: 0.24 pages/second


C++ Traceback (most recent call last):

0 at::_ops::linear::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&)
1 at::native::linear(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&)
2 at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)
3 at::_ops::addmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)


Error Message Summary:

FatalError: Erroneous arithmetic operation is detected by the operating system.
[TimeInfo: *** Aborted at 1733898861 (unix time) try "date -d @1733898861" if you are using GNU date ***]
[SignalInfo: *** SIGFPE (@0x7522f18fd914) received by PID 41 (TID 0x752433404480) from PID 18446744073467320596 ***]

Floating point exception (core dumped)
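
Because a SIGFPE terminates the whole Python process, one way to keep a batch job alive while investigating is to run each conversion in a child process and inspect its exit status. The following is a minimal sketch under that assumption; it is not part of magic-pdf, `describe_exit` and `run_isolated` are hypothetical helper names, and the child command is a stand-in that sends SIGFPE to itself rather than a real magic-pdf invocation.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def run_isolated(cmd: list[str]) -> str:
    """Run cmd in a child process so a fatal signal cannot take down the caller."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Stand-in for the real workload: a child that sends SIGFPE to itself.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGFPE)"
    print(run_isolated([sys.executable, "-c", crash]))
```

In real use, the stand-in command would be replaced by whatever command performs the conversion; the parent can then log the failing input and continue with the next file.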

How to reproduce the bug

The situation is similar to #908, but we tested several paddle builds (CPU and GPU) and all of them failed.
Test results:

| GPU | Host nvcc -V | Docker image (nvidia/cuda) | magic-pdf | paddlepaddle | paddlepaddle-gpu | Result |
|---|---|---|---|---|---|---|
| H20 | V12.2.91 | 12.2.2-devel-ubuntu22.04 | 0.8.1 | 3.0.0b1 | | Runs |
| H20 | V12.2.91 | 12.2.2-devel-ubuntu22.04 | 0.9.0 | 2.6.2 | | Floating point exception (core dumped) |
| H20 | V12.2.91 | 12.2.2-devel-ubuntu22.04 | 0.9.0 | 3.0.0b1 | | Floating point exception (core dumped) |
| H20 | V12.2.91 | 12.2.2-devel-ubuntu22.04 | 0.10.5 | 3.0.0b1 | | Floating point exception (core dumped) |
| H20 | V12.2.91 | 12.3.2-cudnn9-devel-ubuntu22.04 | 0.10.5 | 2.6.2 | | Floating point exception (core dumped) |
| H20 | V12.2.91 | 12.3.2-cudnn9-devel-ubuntu22.04 | 0.10.5 | 3.0.0b1 | | Floating point exception (core dumped) |
| H20 | V12.2.91 | 12.3.2-cudnn9-devel-ubuntu22.04 | 0.10.5 | | 2.6.2 | Floating point exception (core dumped) |
| H20 | V12.2.91 | 12.3.2-cudnn9-devel-ubuntu22.04 | 0.10.5 | | 3.0.0b1 | Incompatible with torch 2.3.1 |
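
When comparing environments like the ones in this test matrix, it can help to record the installed package versions programmatically instead of by hand. The sketch below uses only the standard library; the package list is an assumption based on the names that appear in this thread, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the test matrix, plus the torch-related
# packages mentioned elsewhere in this thread.
PACKAGES = [
    "magic-pdf",
    "paddlepaddle",
    "paddlepaddle-gpu",
    "torch",
    "torchvision",
    "unimernet",
]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:18} {ver or 'not installed'}")
```

Running this inside each container gives one row of the matrix without relying on memory of what was installed.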

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.10.x

Device mode

cuda

wzl0329 added the bug (Something isn't working) label on Dec 14, 2024
wzl0329 closed this as completed on Dec 18, 2024
wzl0329 reopened this on Dec 18, 2024
wzl0329 (Author) commented Dec 18, 2024

@myhloli Hello, is there any conclusion on this issue? Could you provide a set of environment versions that works on H20?

myhloli (Collaborator) commented Dec 18, 2024

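Sorry, we do not have any H-series GPUs to test on. For now, the only suggestion is to follow the case in #558 and try a paddlepaddle-gpu build for a higher CUDA version. If compatibility problems remain, uninstall both paddlepaddle and paddlepaddle-gpu, then reinstall paddlepaddle and run inference on the CPU.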
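
The CPU fallback described above amounts to two pip commands. The sketch below builds them and, by default, only prints them; `cpu_fallback_commands` and `apply` are hypothetical helper names, no paddlepaddle version is pinned, and whether the resulting CPU build actually works on a given machine is not something this sketch can guarantee.

```python
import subprocess
import sys


def cpu_fallback_commands(python=sys.executable):
    """Build the pip commands for switching Paddle to the CPU-only package."""
    pip = [python, "-m", "pip"]
    return [
        # Remove both variants so that only one Paddle build remains afterwards.
        pip + ["uninstall", "-y", "paddlepaddle", "paddlepaddle-gpu"],
        # Reinstall the CPU-only package; no version is pinned here on purpose.
        pip + ["install", "paddlepaddle"],
    ]


def apply(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(cpu_fallback_commands())  # dry run: prints the commands only
```

Calling `apply(cpu_fallback_commands(), dry_run=False)` would execute the commands in the current interpreter's environment.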

wzl0329 (Author) commented Dec 18, 2024

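We started with the paddle CPU build and only tried paddlepaddle-gpu after the CPU build failed. The paddlepaddle-gpu build for a higher CUDA version (CUDA 12.3) is incompatible with torch 2.3.1, and other libraries depend on torch 2.3.1, so the higher-CUDA paddlepaddle-gpu cannot be installed. In short, we tried the paddle CPU build and the builds for several CUDA versions, and none of them worked.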

wzl0329 (Author) commented Dec 18, 2024

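> We started with the paddle CPU build and only tried paddlepaddle-gpu after the CPU build failed. The paddlepaddle-gpu build for a higher CUDA version (CUDA 12.3) is incompatible with torch 2.3.1, and other libraries depend on torch 2.3.1, so the higher-CUDA paddlepaddle-gpu cannot be installed. In short, we tried the paddle CPU build and the builds for several CUDA versions, and none of them worked.

#558 (comment): someone in the comments of #558 has run into this problem as well.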

myhloli (Collaborator) commented Dec 18, 2024

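The CPU build should not be incompatible. According to user feedback, when the CPU build fails it is usually because the CPU does not support the AVX/AVX2 instruction sets, so you could check that as well.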
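
On Linux, the AVX/AVX2 check suggested above can be done by reading the feature flags in /proc/cpuinfo. This is a minimal sketch assuming an x86 Linux host (inside a container, /proc/cpuinfo still reflects the host CPU); `parse_cpu_flags` and `missing_avx` are hypothetical helper names.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 Linux kernels list features under a "flags" key.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_avx(flags):
    """Return the subset of {'avx', 'avx2'} that the CPU does not report."""
    return {"avx", "avx2"} - set(flags)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_avx(parse_cpu_flags(cpuinfo.read_text()))
        print("missing:", ", ".join(sorted(missing)) if missing else "none")
    else:
        print("/proc/cpuinfo not found; this check only applies to Linux")
```

An empty result means both instruction sets are reported, in which case the crash likely has a different cause.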

myhloli (Collaborator) commented Dec 18, 2024

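> We started with the paddle CPU build and only tried paddlepaddle-gpu after the CPU build failed. The paddlepaddle-gpu build for a higher CUDA version (CUDA 12.3) is incompatible with torch 2.3.1, and other libraries depend on torch 2.3.1, so the higher-CUDA paddlepaddle-gpu cannot be installed. In short, we tried the paddle CPU build and the builds for several CUDA versions, and none of them worked.

You can update unimernet to 0.2.2, which removed the dependency on torchtext, so torch can then be updated beyond 2.3.1. If other packages complain about their torch version constraints, you can ignore that for now and force-update torch manually (torchvision must be updated to a matching version at the same time).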

wzl0329 (Author) commented Dec 18, 2024

OK, thanks for the suggestion. We will give it a try.

wzl0329 (Author) commented Dec 28, 2024

After updating the CUDA version on the host (H20) to 12.4, the Floating point exception problem was resolved.
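
To check whether a host is at or above the CUDA version reported as working here, the release number can be parsed from the output of `nvcc -V`. This is a minimal sketch assuming the usual "release X.Y" wording in that output; `parse_nvcc_release` and `host_cuda_release` are hypothetical helper names, and nvcc reports the installed toolkit, which may differ from what the driver supports.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc -V` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_cuda_release():
    """Run `nvcc -V` if it is on PATH and parse the reported release."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    output = subprocess.run([nvcc, "-V"], capture_output=True, text=True).stdout
    return parse_nvcc_release(output)


if __name__ == "__main__":
    release = host_cuda_release()
    if release is None:
        print("nvcc not found or output not recognized")
    else:
        # 12.4 is the version the author reports as working on H20.
        status = "at or above" if release >= (12, 4) else "below"
        print(f"CUDA toolkit {release[0]}.{release[1]} is {status} 12.4")
```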

wzl0329 closed this as completed on Dec 28, 2024
dt-yy added the question (Further information is requested) label and removed the bug (Something isn't working) label on Jan 3, 2025