Support training without lstmf files #4215

stweil · 2024-03-25T14:40:29Z

The list of training or validation files may now contain .png files instead of .lstmf files, so it is no longer necessary to spend time and disk space creating .box and .lstmf files for training.

src/ccstruct/imagedata.cpp

Signed-off-by: Stefan Weil <[email protected]>


zdenop · 2024-04-19T20:25:22Z

src/ccstruct/imagedata.cpp

@@ -546,6 +547,31 @@ bool DocumentData::ReCachePages() {
    delete page;
  }
  pages_.clear();
+#if !defined(TESSERACT_IMAGEDATA_AS_PIX)
+  auto name_size = document_name_.size();
+  if (name_size > 4 && document_name_.substr(name_size - 4) == ".png") {


Why not use std::filesystem::path, as in

tesseract/src/training/unicharset_extractor.cpp

Line 74 in b217cfc

if (filePath.extension() == ".box") {

egorpugin · 2024-04-19

Path refactoring should be done as a more general procedure. It is OK to keep std::string in very local code.

stweil · 2024-04-20

So would you prefer using std::filesystem::path::extension() here or not?

egorpugin · 2024-04-20

It's OK as is. fs::paths can be refactored for 6.0 with API changes/breakage.

zdenop · 2024-04-20

@egorpugin: document_name_ is not part of the Tesseract API. It is a private object used only in imagedata.cpp, so I do not think it will break anything or change the API.

@stweil: I am fine with the current PR code; I just think we should adopt C++17 features step by step.
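For readers comparing the two options discussed in this review thread, here is a minimal sketch that puts them side by side. The helper names `HasPngSuffix` and `HasPngExtension` are hypothetical and exist only for illustration; the first mirrors the condition in the reviewed hunk, the second uses `std::filesystem::path::extension()` as suggested.

```cpp
#include <filesystem>
#include <string>

// Suffix test in the style of the reviewed hunk: compare the last four
// characters of the name with ".png". (Hypothetical helper name.)
bool HasPngSuffix(const std::string &name) {
  auto name_size = name.size();
  return name_size > 4 && name.substr(name_size - 4) == ".png";
}

// Alternative raised in the review: let std::filesystem::path (C++17)
// extract the extension. (Hypothetical helper name.)
bool HasPngExtension(const std::string &name) {
  return std::filesystem::path(name).extension() == ".png";
}
```

Both checks are case-sensitive and agree on ordinary names such as `line.png` or `line.lstmf`. They can differ on edge cases: for a name whose final component is just `.png` (for example `data/.png`), the suffix test matches, while `extension()` treats the leading dot as part of the file name and returns an empty extension.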

yaofuzhou · 2024-05-24T04:45:37Z

> The list of training or validation files may now contain .png files instead of .lstmf files, so it is no longer necessary to spend time and disk space creating .box and .lstmf files for training.

@stweil

Hi - In my understanding of Tesseract before your merge, list.train and list.eval list the .lstmf files used in training and evaluation, where each .lstmf file is the product of an image and its .box file. Without the .lstmf files, how should I tell the Tesseract training program which files to use in the training process? Thanks!

stweil · 2024-05-24T05:00:14Z

You still use list.train and list.eval, but they can now include .png files instead of .lstmf files.
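As a rough illustration of the reply above, the following sketch collects the .png files in a directory and writes them to a list file. It assumes the list holds one file name per line; the function names `CollectPngFiles` and `WriteList` are hypothetical and not part of Tesseract or tesstrain. The thread does not spell out how the transcription for each image is located, so the sketch covers only the list itself.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Collect the .png files directly inside `dir`, sorted for a stable order.
// (Hypothetical helper, for illustration only.)
std::vector<std::string> CollectPngFiles(const fs::path &dir) {
  std::vector<std::string> names;
  for (const auto &entry : fs::directory_iterator(dir)) {
    if (entry.is_regular_file() && entry.path().extension() == ".png") {
      names.push_back(entry.path().string());
    }
  }
  std::sort(names.begin(), names.end());
  return names;
}

// Write one file name per line, the layout assumed here for
// list.train and list.eval. Returns false if writing failed.
bool WriteList(const fs::path &list_file,
               const std::vector<std::string> &names) {
  std::ofstream out(list_file);
  for (const auto &name : names) {
    out << name << '\n';
  }
  out.flush();
  return static_cast<bool>(out);
}
```

A caller could, for example, pass the result of `CollectPngFiles` for a ground-truth directory to `WriteList` with a target of `list.train`; how the files are split between training and evaluation is left open here.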

yaofuzhou · 2024-05-24T05:01:59Z

> You still use list.train and list.eval, but they can now include .png files instead of .lstmf files.

Sorry for the silly question, but may I assume that the corresponding .box files will be automatically fetched?

I ask because, from time to time, I get the error

Compute CTC targets failed for xyz.lstmf

I was wondering whether this is due to insufficient LSTM cells to accommodate an entire training image, and whether training without the .lstmf files would make any difference. Thanks!

stweil · 2024-05-24T05:18:44Z

Neither .box nor .lstmf files are used if you provide only .png files in the lists.

amitdo · 2024-05-24T09:16:50Z

I assume the textline within the png image should not have any padding, is that right?

stweil · 2024-05-24T09:21:40Z

Our line images are from real prints and typically have random padding. As tesstrain uses PSM=13 by default, the .lstmf files would also contain line images with the same padding.

stweil marked this pull request as draft March 25, 2024 14:40

stweil force-pushed the training branch from d8192f0 to 1a6224a Compare March 25, 2024 14:48


stweil force-pushed the training branch 2 times, most recently from a4a22c0 to 8c1a3bc Compare April 4, 2024 09:16

stweil mentioned this pull request Apr 10, 2024

make lists -j32 doesn't seem to be honoring the thread count. (Also happens when calling make training -j32) tesseract-ocr/tesstrain#382

Open

stweil added this to the 5.4.0 milestone Apr 10, 2024

stweil marked this pull request as ready for review April 16, 2024 17:29

stweil marked this pull request as draft April 16, 2024 17:32


Support training without lstmf files

b217cfc

Signed-off-by: Stefan Weil <[email protected]>

stweil force-pushed the training branch from 8c1a3bc to b217cfc Compare April 19, 2024 19:19

stweil marked this pull request as ready for review April 19, 2024 19:20

stweil requested review from zdenop, egorpugin and amitdo April 19, 2024 19:20

egorpugin approved these changes Apr 19, 2024


zdenop reviewed Apr 19, 2024


stweil merged commit 549b876 into tesseract-ocr:main Apr 24, 2024
7 checks passed

stweil deleted the training branch April 24, 2024 08:35

yaofuzhou mentioned this pull request May 23, 2024

"Compute CTC targets failed for xyz.lstmf!" for custom NET_SPECs tesseract-ocr/tesstrain#390

Closed

BrewTestBot mentioned this pull request Jun 6, 2024

tesseract 5.4.0 Homebrew/homebrew-core#173888

Merged

1 task

josef821 mentioned this pull request Jul 9, 2024

problem with train without lstmf #4282

Closed


