Support training without lstmf files #4215

Merged (1 commit) on Apr 24, 2024

Member

stweil commented Mar 25, 2024

The list of training or validation files may now contain .png files instead of .lstmf files, so it is no longer necessary to spend time and disk space creating .box and .lstmf files for training.

stweil marked this pull request as draft March 25, 2024 14:40
stweil force-pushed the training branch 2 times, most recently from a4a22c0 to 8c1a3bc April 4, 2024 09:16
stweil added this to the 5.4.0 milestone Apr 10, 2024
stweil marked this pull request as ready for review April 16, 2024 17:29
stweil marked this pull request as draft April 16, 2024 17:32
stweil marked this pull request as ready for review April 19, 2024 19:20
stweil requested review from zdenop, egorpugin and amitdo April 19, 2024 19:20
@@ -546,6 +547,31 @@ bool DocumentData::ReCachePages() {
delete page;
}
pages_.clear();
#if !defined(TESSERACT_IMAGEDATA_AS_PIX)
auto name_size = document_name_.size();
if (name_size > 4 && document_name_.substr(name_size - 4) == ".png") {
Contributor

Why not use std::filesystem::path, as in

if (filePath.extension() == ".box") {

Contributor

Path refactoring should be done as a more general procedure. It is OK to keep std::string in very local code.

Member Author

So would you prefer using std::filesystem::path::extension() here or not?

Contributor

It's OK as is. fs::paths can be refactored for 6.0 with API changes/breakage.

Contributor

@egorpugin: document_name_ is not part of the Tesseract API. It is a private object used only in imagedata.cpp, so I do not think this would break anything or change the API.

@stweil: I am fine with the current PR code. I just think we should adopt C++17 features step by step.
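
For readers following the thread, here is a minimal sketch of the two alternatives under discussion: the string suffix test from the diff and the `std::filesystem::path::extension()` form suggested in review. The function names `HasPngSuffix` and `HasPngExtension` are hypothetical and are not part of the Tesseract code base.

```cpp
#include <filesystem>
#include <string>

// Suffix test on a plain std::string, mirroring the check in the diff.
// (Hypothetical helper name, used only for illustration.)
bool HasPngSuffix(const std::string &name) {
  const auto size = name.size();
  return size > 4 && name.substr(size - 4) == ".png";
}

// The same test expressed with std::filesystem::path (C++17).
// (Hypothetical helper name, used only for illustration.)
bool HasPngExtension(const std::string &name) {
  return std::filesystem::path(name).extension() == ".png";
}
```

Both variants are case-sensitive, and both reject a file named just `.png`: the string version requires more than four characters, and `extension()` treats a leading dot with no further dot as part of the stem.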

stweil merged commit 549b876 into tesseract-ocr:main Apr 24, 2024
7 checks passed
stweil deleted the training branch April 24, 2024 08:35

yaofuzhou commented May 24, 2024

> The list of the training or validation files may now contain .png files instead of .lstmf files, so it is no longer necessary to spend time and disc space for creating .box and .lstmf files for a training.

@stweil

Hi - In my understanding of Tesseract before your merge, list.train and list.eval list the .lstmf files used in training and evaluation, where each .lstmf file is the product of an image and its .box file. Without the .lstmf files, how should I tell the Tesseract training program which files to use in the training process? Thanks!

Member Author

stweil commented May 24, 2024

You still use list.train and list.eval, but they can now include .png files instead of .lstmf files.
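
A minimal sketch of how such a list could be generated, assuming the list file holds one file path per line and that all line images sit in a single directory. The helper names `CollectPngFiles` and `WriteListFile` are hypothetical; this is not how tesstrain itself builds its lists.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Collect the paths of all .png files directly inside `dir`.
// The result is sorted so that repeated runs produce the same list.
// (Hypothetical helper, for illustration only.)
std::vector<std::string> CollectPngFiles(const fs::path &dir) {
  std::vector<std::string> files;
  for (const auto &entry : fs::directory_iterator(dir)) {
    if (entry.is_regular_file() && entry.path().extension() == ".png") {
      files.push_back(entry.path().string());
    }
  }
  std::sort(files.begin(), files.end());
  return files;
}

// Write one path per line to `list_path` (assumed list file layout).
// Returns false if the file could not be written.
bool WriteListFile(const fs::path &list_path,
                   const std::vector<std::string> &files) {
  std::ofstream out(list_path);
  for (const auto &file : files) {
    out << file << '\n';
  }
  out.flush();
  return static_cast<bool>(out);
}
```

Calling `WriteListFile("list.train", CollectPngFiles(dir))` for some directory `dir` would then produce a list containing only .png entries. Splitting the files between list.train and list.eval is left out of the sketch.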

yaofuzhou commented May 24, 2024

> You still use list.train and list.eval, but they can now include .png files instead of .lstmf files.

Sorry for the silly question, but may I assume that the corresponding .box files will be automatically fetched?

I ask because from time to time I get the error

Compute CTC targets failed for xyz.lstmf

I was wondering whether this is due to insufficient LSTM cells to accommodate an entire training image, and whether training without the .lstmf files would make any difference. Thanks!

Member Author

stweil commented May 24, 2024

Neither .box nor .lstmf files are used if you provide only .png files in the lists.

Collaborator

amitdo commented May 24, 2024

I assume the text line within the PNG image should not have any padding, is that right?

Member Author

stweil commented May 24, 2024

Our line images are from real prints and typically have random padding. As tesstrain uses PSM=13 by default, the .lstmf files would also contain line images with the same padding.
