Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run LSTM recognition in multiple threads #4275

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jkarthic
Copy link

Init time option lstm_num_threads should be used to set the number of LSTM threads. This will ensure that word recognition can run independently in multiple threads, thus effectively utilizing multi-core processors.

Following are my test results for a sample screenshot.
CPU : Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
OS : WIndows
Compiler : MSVC 19.38.33130.0 (Installed from Visual Studio 2022)
Model: eng.traineddata from tessfast
PSM: 6

Total time taken for Recognize API call, Built without OpenMP
With lstm_num_threads=1, total time taken = 3.95 seconds
With lstm_num_threads=4, total time taken = 1.4 seconds

On the other hand, here are the numbers with OpenMP
OMP_THREAD_LIMIT not set, total time taken = 3.59 seconds
OMP_THREAD_LIMIT=4, total time taken = 3.57 seconds
OMP_THREAD_LIMIT=1, total time taken = 4.19 seconds

As we can observe, this branch with lstm_num_threads set as 4, performs way better than the openmp multithreading supported currently. Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Init time option lstm_num_threads should be used to set the number of LSTM threads
@@ -477,7 +481,10 @@ Tesseract::~Tesseract() {
for (auto *lang : sub_langs_) {
delete lang;
}
delete lstm_recognizer_;
for (int i = 0; i < lstm_recognizers_.size(); ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i should use the same data type as the return value of size().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (auto &&r : lstm_recognizers_) delete r;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review comments. I have addressed this now.

@jkarthic jkarthic requested review from stweil and egorpugin June 27, 2024 16:38
lstm_recognizer_ = new LSTMRecognizer(language_data_path_prefix.c_str());
ASSERT_HOST(lstm_recognizer_->Load(this->params(), lstm_use_matrix ? language : "", mgr));
for (int i = 0; i < lstm_num_threads; ++i) {
lstm_recognizers_.push_back(new LSTMRecognizer(language_data_path_prefix.c_str()));
Copy link
Contributor

@egorpugin egorpugin Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for me.
But maybe it is possible to switch to unique_ptrs here?

Upd.:
Since this is new code std::vector<LSTMRecognizer *> lstm_recognizers_;, could you try to use uniq ptrs here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just following the existing design pattern of tesseract of using LSTMRecognizer * and performing new and delete for allocation and deallocation.
As per the C++ best practice, we should ideally use shared_ptrs everywhere for accessing LSTMRecognizer, to avoid any dangling ptrs. But that will involve a lot of refactoring, deviating away from the main purpose of this PR. Hence I just followed the existing tesseract practice of using LSTMRecognizer *.

auto segment_start = words->begin() + segment_size;
for (int i = 1; i < lstm_num_threads; ++i) {
auto segment_end = (i == lstm_num_threads - 1) ? words->end() : segment_start + segment_size;
futures.push_back(std::async(std::launch::async, &Tesseract::RecogWordsSegment,
Copy link
Contributor

@egorpugin egorpugin Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about static thread pool here?
If code is called multiple times, how is std::async performance compared to static thread pool?
Considering async will create a new thread each time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question.
As stweil has clarified in his latest comment, this multiple lstm threads is not meant for mass production. This feature is probably meant for consumer-end devices running a single page OCR once in a while, but with least possible latency. In such cases thread creation/deletion is not a big overhead. But if we really come across use-cases where thread creation/deletion overhead becomes significant, we could look at replacing this with a thread pool at that point.

Copy link
Member

@stweil stweil Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default case (lstm_num_threads == 1) must not create a new thread.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the default case (lstm_num_threads == 1) does not create a new thread. As you see, the loop starts with i = 1, and hence does not execute when lstm_num_threads == 1.

@stweil
Copy link
Member

stweil commented Jun 27, 2024

Many thanks for this nice contribution.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways? If a command line argument is desired (like in the case of --dpi), I think that there might be more user friendly variants. Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

Maybe we could also extend the command line syntax to have --PARAMETER VALUE as an alternative for -c PARAMETER=VALUE for any Tesseract parameter.

@stweil
Copy link
Member

stweil commented Jun 27, 2024

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

@jkarthic
Copy link
Author

Many thanks for this nice contribution.

And many thanks to you for reviewing this patiently.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways?

This lstm_num_threads is a init time parameter. The LSTMRecognizer instances are created during init. Setting this new parameter with -c lstm-num-threads=N will not work, as it is setting the variable after the init is done.

Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

When I tested tesseract with a psm of 3(which is the default for tesseract.exe), page segmentation was taking significantly more time than the actual LSTM recognition. For example, in one of my tests, page segmentation was taking ~7 seconds, and lstm was taking ~3 seconds, taking the total to ~10 seconds. Users running with default psm parameter should not expect that the entire 10 seconds will be run in multiple threads. In this case, the major part of ~7 seconds will run single threaded and only a minor part of ~3 seconds will be multi threaded. Hence I thought adding a longer name is setting the user expectation right, that only a portion of tesseract will be running multithreaded.
Also there are other numthreads variables related to OpenMP, inside the code which were named generically such as kNumThreads, __num_threads and num_threads. Naming this as lstm_num_threads also differentiates this as a seperate variable, not to be confused with OpenMP num threads.

@jkarthic
Copy link
Author

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

Totally agreed. This is meant for latency-sensitive real-time applications, with ocr probably running in the consumer's device itself.

Changed the WERD_RES linked link to use shared pointers instead of raw pointers.
This is needed so that even if one thread deletes a WERD_RES object, other thread's which needs to iterate thru them can still access it safely.
In terms of LSTM processing, only one threads processes one WERD_RES. This change is needed as all the threads can iterate thru due to single linked list data structure.
@jkarthic
Copy link
Author

jkarthic commented Jul 5, 2024

@stweil
I observed a crash issue in the earlier code due to WERD_RES objects freed by one thread was used by another thread for iterating thru the WERD_RES singly linked list.
To fix the above above issue, I have modified WERD_RES linked list to use shared pointer instead of raw pointers, so that lifetime of the objects are managed automatically. I have also added mutex protections around the PAGE_RES_IT functions that modify this list in order to avoid race conditions.
Please take a look at the modifications whenever you get some time for this.

Copy link
Contributor

@egorpugin egorpugin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it is much much worse.

Remove std::async.

@egorpugin
Copy link
Contributor

I suggest to use previous version as base.

@jkarthic
Copy link
Author

jkarthic commented Jul 5, 2024

Now it is much much worse.

@egorpugin I am not sure, if I understand your comment here. Could you please elaborate what is "much much worse"?

@egorpugin
Copy link
Contributor

egorpugin commented Jul 5, 2024

@egorpugin I am not sure, if I understand your comment here. Could you please elaborate what is "much much worse"?

  • More complex code.
  • A lot of sync.
  • Much harder to review.
  • Most likely a 'no go' in current state.

You need to provide a very detailed description of:

  1. algorithm. How it works? Is it possible to sync less?
  2. changes in files. I see new types, mutex locks in some existing functions. See example how this can be described from gcc commit messages, e.g. gcc-mirror/gcc@5185274 (bottom part of the commit message)

@burningfireplace
Copy link

I tested this PR on my Mac (M1 chip). I have a few observations to share:

  1. The latest commit (d1eed6a) does not compile successfully on my system. I encountered multiple errors related to ELIST_ITERATOR_T, BLOBNBOX_IT, and other classes. It seems some changes may be missing or incomplete in this commit.

  2. The initial commit (6a2e239) does compile and produces a working binary. Using the new --lstm-num-threads option, I was able to achieve a ~2x speedup (compared to 5.4.1 + OpenMP) by finding the optimal N value for my system.

  3. However, I noticed an issue with language/script recognition when using multiple languages. For example:

    • Using -l "eng+ukr" tries to recognize Cyrillic chars as Latin chars
    • Using -l "ukr+eng" vice-versa: incorrectly recognizes Cyrillic in places with Latin chars

This behavior differs from Tesseract 5.4.1, which produces correct output for the same inputs in both cases.

I hope this feedback is helpful.

@jkarthic
Copy link
Author

@burningfireplace

Thanks for trying it out and providing a detailed analysis. Here is my reply.

  1. I was using windows while developing this. Never tested this on Mac, as mac had apple's inbuilt OCR Vision API, which in my experience performs better than tesseract. But that's no reason for breaking the build on Mac. I will try to fix the compile issues on Mac, when I get some time.
  2. Nice to know, this is improving speed on Mac as well.
  3. The initial commit had a few race condition bugs that might be causing this. Once compile issues are resolved, I would encourage you to try the latest commit once again. Ideally the latest commit should provide identical output as the single threaded version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants