Severe OCR quality deterioration with "--dpi 300" versus "--dpi 299" #4321

Open
brainsucker-na opened this issue Oct 6, 2024 · 2 comments

brainsucker-na commented Oct 6, 2024

Current Behavior

With --dpi 300, Tesseract produces the following mediocre result (it misses several words at the start of the line) for the attached sample image:

с or бухгалтерей право второй подписи документов для проведения расчетов по банковским и иным
счетам клиента;

sample_crop
sample_crop.zip

Full command line:
tesseract.exe -l rus --dpi 300 sample_crop.png crop300dpi

Expected Behavior

With --dpi 299, Tesseract produces a much better result for the same image:

Сведения о главном бухгалтере/лице, имеющем право второй подписи документов для проведения расчетов по банковским и иным
счетам клиента;

Full command line:
tesseract.exe -l rus --dpi 299 sample_crop.png crop299dpi

I would expect it to perform at a similar level with --dpi 300.
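To narrow down where the quality drops, the two command lines above can be generalized into a small sweep over nearby --dpi values. The following is only a sketch: it assumes `tesseract` is on the PATH, that the special output base `stdout` is used to capture the text, and that `difflib` similarity against the 299 dpi output is a reasonable proxy for quality. The helper names (`build_command`, `run_ocr`, `similarity`, `sweep`) are made up for illustration.

```python
import difflib
import subprocess


def build_command(image, dpi, lang="rus", exe="tesseract"):
    """Build the same command line as in the report, but writing to stdout."""
    return [exe, "-l", lang, "--dpi", str(dpi), image, "stdout"]


def run_ocr(image, dpi, lang="rus", exe="tesseract"):
    """Run Tesseract once and return the recognized text (assumes UTF-8 output)."""
    result = subprocess.run(
        build_command(image, dpi, lang, exe),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def similarity(a, b):
    """Similarity ratio in [0, 1] between two OCR outputs."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def sweep(image, dpis, reference_dpi, lang="rus", exe="tesseract"):
    """Compare the output at each DPI against the output at reference_dpi."""
    reference = run_ocr(image, reference_dpi, lang, exe)
    return {
        dpi: similarity(reference, run_ocr(image, dpi, lang, exe))
        for dpi in dpis
    }
```

Calling `sweep("sample_crop.png", range(290, 311), reference_dpi=299)` would then show whether 300 is an isolated outlier or the start of a range of values that behave differently.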

tesseract -v

tesseract v5.4.0.20240606
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6

UB Mannheim build (setup) with default tessdata files

Operating System

Windows 10

CPU

i7-4700MQ

stweil added the accuracy label Oct 8, 2024

stweil (Member) commented Oct 8, 2024

I don't know of any other OCR software that needs DPI information. Ideally, Tesseract should work without it, too. Code contributions toward this goal are welcome, but they must of course not introduce regressions.

KopatychMobile commented

In fact, since version 3.05 we have found that the best results come with the DPI set to 200. We use the C API through the DLL, so we do it this way:

  1. Read the image with Leptonica's pixRead
  2. Optionally do any necessary image processing
  3. Force the resolution to 200 with pixSetResolution (no matter what the original DPI is: 96, 100, 300, etc.)
  4. Run recognition via TessBaseAPIGetHOCRText or via iterators

In recent versions the DPI has less influence on the result, but in some cases the 200 DPI trick still works, so we continue using it.
