I don't know of any other OCR software that needs DPI information. Ideally, Tesseract should work without it, too. Code contributions toward this goal are welcome, but they must of course not introduce any regressions.
In fact, since version 3.05 we have found that we get the best results when the DPI is set to 200. We use the C API through the DLL, so we do it this way:
1. Read the image with Leptonica's pixRead
2. Optionally do the necessary image processing
3. Force the resolution to 200 with pixSetResolution (no matter what the original DPI is: 96, 100, 300, etc.)
4. Do the recognition via TessBaseAPIGetHOCRText or via Iterators
In recent versions, the DPI has less influence on the result, but in some cases the 200 DPI trick still helps, so we continue to use it.
Current Behavior
With --dpi 300, Tesseract produces mediocre results (it misses a couple of words) for the attached sample image:
sample_crop.zip
Full command line:
tesseract.exe -l rus --dpi 300 sample_crop.png crop300dpi
Expected Behavior
With --dpi 299, Tesseract produces much better results for the same image:
Full command line:
tesseract.exe -l rus --dpi 299 sample_crop.png crop299dpi
I would expect it to perform at a similar level with --dpi 300.
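To make the comparison easier to reproduce, here is a minimal Python sketch that reruns the same command line at several DPI values and lists the words that one run recognizes and another misses. The helper names (build_command, words_missing, run_sweep) are hypothetical, and it assumes that a tesseract executable is on the PATH and that plain-text output is written to the output base plus ".txt". The word comparison is a crude set difference, not a proper accuracy metric.

```python
import subprocess
from pathlib import Path


def build_command(image, out_base, dpi, lang="rus", exe="tesseract"):
    """Return the argument list for one Tesseract run at a forced DPI.

    Mirrors the command lines in this report, e.g.
    tesseract -l rus --dpi 300 sample_crop.png crop300dpi
    """
    return [exe, "-l", lang, "--dpi", str(dpi), str(image), str(out_base)]


def words_missing(reference_text, candidate_text):
    """List words found in the reference output but absent from the candidate."""
    candidate_words = set(candidate_text.split())
    return [w for w in reference_text.split() if w not in candidate_words]


def run_sweep(image, dpis=(200, 299, 300), lang="rus", exe="tesseract"):
    """Run Tesseract once per DPI value and return {dpi: recognized text}."""
    results = {}
    for dpi in dpis:
        out_base = Path(f"crop{dpi}dpi")
        subprocess.run(build_command(image, out_base, dpi, lang, exe), check=True)
        # Assumption: with no output config given, Tesseract writes <out_base>.txt.
        results[dpi] = out_base.with_suffix(".txt").read_text(encoding="utf-8")
    return results
```

For example, after texts = run_sweep("sample_crop.png"), calling words_missing(texts[299], texts[300]) should list the words that are recognized at 299 but lost at 300. The value 200 is included only because it is the setting suggested in the comments.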
Suggested Fix
No response
tesseract -v
tesseract v5.4.0.20240606
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6
UB Mannheim build (setup) with default tessdata files
Operating System
Windows 10
Other Operating System
No response
uname -a
No response
Compiler
No response
CPU
i7-4700MQ
Virtualization / Containers
No response
Other Information
No response