Severe OCR quality deterioration with "--dpi 300" versus "--dpi 299" #4321

brainsucker-na · 2024-10-06T16:23:14Z

Current Behavior

With --dpi 300, Tesseract produces mediocre results for the attached sample image (it misses several words at the start of the line):

с or бухгалтерей право второй подписи документов для проведения расчетов по банковским и иным
счетам клиента;

Full command line:
tesseract.exe -l rus --dpi 300 sample_crop.png crop300dpi

Expected Behavior

With --dpi 299, Tesseract produces much better results for the same image:

Сведения о главном бухгалтере/лице, имеющем право второй подписи документов для проведения расчетов по банковским и иным
счетам клиента;

Full command line:
tesseract.exe -l rus --dpi 299 sample_crop.png crop299dpi

I would expect it to perform at a similar level with --dpi 300.
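To check whether 300 is an isolated outlier or part of a wider pattern, the two commands above can be generalized into a small DPI sweep. The following is only a sketch: the helper names (build_command, run_sweep, word_counts) and the sweep_<dpi>dpi output names are made up for illustration, it assumes a tesseract executable on PATH with the rus traineddata installed, and the word count is a crude proxy for quality, not a real accuracy metric.

```python
import subprocess
from pathlib import Path


def build_command(image, out_base, dpi, lang="rus", exe="tesseract"):
    """Return the argv list for one run, mirroring the command lines above."""
    return [exe, "-l", lang, "--dpi", str(dpi), str(image), str(out_base)]


def run_sweep(image, dpis, lang="rus", exe="tesseract"):
    """Run Tesseract once per DPI value and return {dpi: recognized text}."""
    results = {}
    for dpi in dpis:
        # Hypothetical output base name; Tesseract appends ".txt" to it.
        out_base = f"sweep_{dpi}dpi"
        subprocess.run(build_command(image, out_base, dpi, lang, exe), check=True)
        results[dpi] = Path(out_base + ".txt").read_text(encoding="utf-8")
    return results


def word_counts(results):
    """Crude quality proxy: number of whitespace-separated words per DPI."""
    return {dpi: len(text.split()) for dpi, text in results.items()}
```

For example, word_counts(run_sweep("sample_crop.png", range(295, 306))) gives one word count per DPI value from 295 to 305, which should make a sudden drop at 300 easy to spot.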

Suggested Fix

No response

tesseract -v

tesseract v5.4.0.20240606
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6

UB Mannheim build (setup) with default tessdata files

Operating System

Windows 10

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

i7-4700MQ

Virtualization / Containers

No response

Other Information

No response


stweil · 2024-10-08T13:13:00Z

I don't know of any other OCR software that needs DPI information. Ideally, Tesseract should work without it, too. Code contributions toward this goal are welcome, but they must of course not introduce any regressions.

KopatychMobile · 2024-11-15T18:36:46Z

In fact, since version 3.05 we have found that the best results come with the DPI set to 200. We use the C API through the DLL, so we do it this way:

1. Read the image with Leptonica's pixRead.
2. Optionally do any necessary image processing.
3. Force the resolution to 200 with pixSetResolution, no matter what the original DPI is (96, 100, 300, etc.).
4. Do the recognition via TessBaseAPIGetHOCRText or via iterators.

In recent versions the DPI has less influence on the result, but in some cases the 200 DPI trick still works, so we continue to use it.
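For readers who want to try the same workflow without writing C, the steps above can be sketched through Python's ctypes against the same C functions. This is an untested illustration, not our production code: the helper names (find_first_library, hocr_at_forced_dpi), the library names passed to find_library, and the default language "eng" are all assumptions and will likely need adjusting for a given platform or build.

```python
import ctypes
import ctypes.util
import os


def find_first_library(candidates):
    """Return the first shared library that ctypes can locate, else None."""
    for name in candidates:
        path = ctypes.util.find_library(name)
        if path:
            return path
    return None


def hocr_at_forced_dpi(image_path, lang="eng", dpi=200, datapath=None):
    """Return hOCR text for image_path with its resolution forced to dpi."""
    # Assumed library names; they differ between platforms and builds.
    lept_path = find_first_library(["leptonica", "lept"])
    tess_path = find_first_library(["tesseract"])
    if lept_path is None or tess_path is None:
        raise OSError("Leptonica or Tesseract shared library not found")
    lept, tess = ctypes.CDLL(lept_path), ctypes.CDLL(tess_path)

    # Declare C signatures so that pointers are not truncated to int.
    vp, cp, ci = ctypes.c_void_p, ctypes.c_char_p, ctypes.c_int
    lept.pixRead.restype, lept.pixRead.argtypes = vp, [cp]
    lept.pixSetResolution.restype, lept.pixSetResolution.argtypes = ci, [vp, ci, ci]
    lept.pixDestroy.restype, lept.pixDestroy.argtypes = None, [ctypes.POINTER(vp)]
    tess.TessBaseAPICreate.restype, tess.TessBaseAPICreate.argtypes = vp, []
    tess.TessBaseAPIInit3.restype, tess.TessBaseAPIInit3.argtypes = ci, [vp, cp, cp]
    tess.TessBaseAPISetImage2.restype, tess.TessBaseAPISetImage2.argtypes = None, [vp, vp]
    tess.TessBaseAPIGetHOCRText.restype, tess.TessBaseAPIGetHOCRText.argtypes = vp, [vp, ci]
    tess.TessDeleteText.restype, tess.TessDeleteText.argtypes = None, [vp]
    tess.TessBaseAPIEnd.restype, tess.TessBaseAPIEnd.argtypes = None, [vp]
    tess.TessBaseAPIDelete.restype, tess.TessBaseAPIDelete.argtypes = None, [vp]

    # Step 1: read the image with Leptonica.
    pix = lept.pixRead(os.fsencode(image_path))
    if not pix:
        raise OSError(f"pixRead failed for {image_path!r}")
    # Step 2 (optional image processing) is omitted in this sketch.
    # Step 3: overwrite the resolution metadata, whatever the original DPI.
    lept.pixSetResolution(pix, dpi, dpi)

    api = tess.TessBaseAPICreate()
    try:
        data = os.fsencode(datapath) if datapath else None
        if tess.TessBaseAPIInit3(api, data, lang.encode("ascii")) != 0:
            raise RuntimeError("TessBaseAPIInit3 failed")
        tess.TessBaseAPISetImage2(api, pix)
        # Step 4: recognize and fetch hOCR for page 0.
        ptr = tess.TessBaseAPIGetHOCRText(api, 0)
        if not ptr:
            raise RuntimeError("TessBaseAPIGetHOCRText returned NULL")
        try:
            return ctypes.string_at(ptr).decode("utf-8")
        finally:
            tess.TessDeleteText(ptr)
    finally:
        tess.TessBaseAPIEnd(api)
        tess.TessBaseAPIDelete(api)
        pix_ref = ctypes.c_void_p(pix)
        lept.pixDestroy(ctypes.byref(pix_ref))
```

Calling hocr_at_forced_dpi("sample_crop.png", lang="rus") would apply the trick to the sample from this issue, provided the libraries and the rus traineddata can be found.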

stweil added the accuracy label Oct 8, 2024
