HOCR output always sets `textangle 180` and omits baseline info if Tesseract is compiled with `--disable-legacy` #4010
robertknight added a commit to robertknight/tesseract-wasm that referenced this issue (Jan 30, 2023):
hOCR output was missing `baseline` information for `ocr_line` entries and incorrectly reporting every line as being upside-down. This was happening due to tesseract-ocr/tesseract#4010. Work around this issue by handling missing rotation information better in `PageIterator::Orientation`, by assuming the page is facing up.
#3997 seems related to this issue.
tesseract/src/ccmain/pagesegmain.cpp Lines 335 to 406 in 67841aa
I think we can fix the issue by enabling some parts of the code in this block instead of disabling the whole block of code when the legacy engine is disabled.
amitdo added a commit to amitdo/tesseract that referenced this issue (Mar 28, 2023):
Enable some code blocks that were wrongly disabled when the legacy engine is disabled at compile time.
Basic Information
tesseract 5.3.0-19-ga3b9ac, compiled with `--disable-legacy`
Operating System
macOS 13 Ventura
Compiler
clang 14.0
Current Behavior
When Tesseract is compiled with `--disable-legacy`, hOCR output reports each line as being upside-down (`textangle 180`) and omits baseline information.

Steps to reproduce:

In the generated `output.hocr` file, `ocr_line` entries look like this:

Expected Behavior
If orientation information isn't available I'd expect the image to always be treated as if it were page-up. So entries should look like this:
Suggested Fix
Internally, it looks like the issue is that:
1. `ColumnFinder::text_rotation_` is initialized to a null vector. When the legacy engine is disabled, the `ColumnFinder::CorrectOrientation` function does not get called, and so this vector remains null.
2. `PageIterator::Orientation` does not handle this case correctly, as it converts this null vector to `ORIENTATION_PAGE_DOWN` (tesseract/src/ccmain/pageiterator.cpp, line 585 in a3b9acf).
3. As a result, the hOCR output reports `textangle 180` and omits baseline info.

Some fixes I tested locally were to change the initialization of `ColumnFinder::text_rotation_` to be the same as the `norotation` value in `ColumnFinder::CorrectOrientation`, or to change the logic in `PageIterator::Orientation` to handle null rotation vectors by mapping them to `ORIENTATION_PAGE_UP`. I'm not sure which approach is preferred, but I'm happy to submit a PR for either.
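
To illustrate why a null rotation vector ends up classified as page-down, and how either proposed fix avoids that, here is a simplified, self-contained model. It is not the Tesseract source: `Vec2`, `Rotate`, `Classify`, `ClassifyWithGuard` and the `Orientation` enum are hypothetical stand-ins, and the classification logic is only an assumption about the general shape of the check in `PageIterator::Orientation`.

```cpp
// Simplified model of the failure described above.
// All names here are hypothetical; this is NOT Tesseract's actual code.

enum class Orientation { PageUp, PageRight, PageDown, PageLeft };

struct Vec2 {
  float x;
  float y;
};

// Rotate v by a rotation expressed as (cos, sin), i.e. a complex multiply.
// The identity rotation is (1, 0); a null vector (0, 0) is not a rotation.
inline Vec2 Rotate(Vec2 v, Vec2 rot) {
  return {v.x * rot.x - v.y * rot.y, v.y * rot.x + v.x * rot.y};
}

inline bool IsNull(Vec2 v) { return v.x == 0.0f && v.y == 0.0f; }

// Naive classification: rotate the "up" vector and inspect the result.
// A null rotation collapses "up" to (0, 0), which falls through to PageDown
// because the y > 0 test fails.
inline Orientation Classify(Vec2 rot) {
  Vec2 up = Rotate({0.0f, 1.0f}, rot);
  if (up.x == 0.0f) {
    return up.y > 0.0f ? Orientation::PageUp : Orientation::PageDown;
  }
  return up.x > 0.0f ? Orientation::PageRight : Orientation::PageLeft;
}

// Second suggested fix: treat a null rotation as "no rotation information"
// and report the page as facing up.
inline Orientation ClassifyWithGuard(Vec2 rot) {
  if (IsNull(rot)) {
    return Orientation::PageUp;
  }
  return Classify(rot);
}
```

In this model, the first suggested fix corresponds to passing the identity rotation `{1.0f, 0.0f}` instead of a null vector, which `Classify` already reports as page-up. The second corresponds to `ClassifyWithGuard`, which leaves the initialization alone and handles the null case at the point of use.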