How to use Tesseract in a multi-threaded environment? #4281

Open
kinghelong opened this issue Jul 9, 2024 · 4 comments

Comments

kinghelong commented Jul 9, 2024

Current Behavior

#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#pragma comment(lib, "tesseract54.lib")  // MSVC only: link the Tesseract import library

std::mutex io_mutex;  // serializes console output only, not Tesseract calls

// Each thread creates, uses, and destroys its own TessBaseAPI instance.
void performOCR(const std::string& imagePath, int threadId) {
    // Stack object: the destructor calls End() on every return path.
    tesseract::TessBaseAPI api;
    if (api.Init(nullptr, "chi_sim") != 0) {
        std::lock_guard<std::mutex> lock(io_mutex);
        std::cerr << "Could not initialize tesseract for thread " << threadId << std::endl;
        return;
    }

    Pix* image = pixRead(imagePath.c_str());
    if (!image) {
        std::lock_guard<std::mutex> lock(io_mutex);
        std::cerr << "Could not open input image for thread " << threadId << std::endl;
        return;
    }

    api.SetImage(image);
    // GetUTF8Text() returns a new[]-allocated buffer, or nullptr on failure.
    // Streaming a null char* is undefined behavior, so check before printing.
    std::unique_ptr<char[]> outText(api.GetUTF8Text());

    {
        std::lock_guard<std::mutex> lock(io_mutex);
        if (outText) {
            std::cout << "Thread " << threadId << " OCR output: " << std::endl << outText.get() << std::endl;
        } else {
            std::cerr << "Recognition failed for thread " << threadId << std::endl;
        }
    }

    pixDestroy(&image);
    api.End();
}

int main() {
    const int numThreads = 5;
    const std::string imagePath = "H:\\1.png";

    // All threads read the same image file, each through its own API instance.
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back(performOCR, imagePath, i);
    }

    for (auto& th : threads) {
        th.join();
    }

    return 0;
}

This is the sample code.
I am integrating Tesseract OCR into a multithreaded application to perform real-time text recognition from dynamically changing screens. However, I'm encountering several issues related to multithreading:

Exception Handling: Intermittently, the application crashes with access violations or segmentation faults when attempting to interact with Tesseract API functions from multiple threads simultaneously.

Thread Synchronization: Despite using mutexes to synchronize access to Tesseract API calls, I observe occasional data corruption or deadlock situations, particularly when multiple threads concurrently attempt to initialize or interact with Tesseract instances.

Resource Management: There are concerns regarding memory management and resource leaks when multiple OCR tasks are spawned and terminated rapidly in response to screen changes. This includes potential issues with cleanup of Tesseract resources after OCR tasks complete.

Performance Impact: The performance of Tesseract OCR appears to degrade under heavy multithreaded load, leading to increased latency in text recognition or failure to accurately capture screen content changes.

Debugging Output: Debugging the application reveals sporadic errors related to memory access violations or invalid API state transitions, especially when multiple OCR tasks are active concurrently.

I have attempted to implement thread-safe practices such as mutexes and careful resource allocation, but these issues persist. I am seeking guidance on best practices for integrating Tesseract OCR effectively in a multithreaded environment, ensuring stable performance and reliable text recognition across dynamic screen updates.

Expected Behavior

In the multithreaded application integrating Tesseract OCR, the following expected behaviors are anticipated:

Thread Safety: Tesseract OCR operations should be robustly thread-safe, allowing multiple threads to concurrently capture screen content, process bitmap data, and perform text recognition without encountering crashes or resource conflicts.

Real-Time Text Recognition: The application should accurately extract text from dynamically changing screen content in real-time, leveraging Tesseract's capabilities to handle varied fonts, sizes, and languages commonly encountered in screen-based applications.

Performance Optimization: Efficient utilization of system resources to ensure minimal latency in OCR processing, even under heavy concurrent workload scenarios. This includes optimizing memory usage and processing efficiency to maintain responsive performance.

Error Handling: Effective error detection and recovery mechanisms should be in place to gracefully handle exceptional conditions such as image data corruption, API initialization failures, or temporary unavailability of OCR resources.

Scalability: The application should scale seamlessly with the number of concurrent OCR tasks, supporting parallel processing of screen regions and ensuring that OCR results are consistently accurate and reliable.

Resource Management: Proper cleanup and release of resources after OCR tasks complete, ensuring that memory leaks or resource exhaustion issues are minimized, even during rapid task creation and termination cycles.

By achieving these expected behaviors, the integration of Tesseract OCR into a multithreaded environment should enable robust, responsive, and reliable text recognition capabilities across diverse screen-based applications.

Suggested Fix

To address the challenges observed with Tesseract OCR in a multithreaded environment, the following approaches are recommended:

Thread-Safe Initialization: Ensure that Tesseract API initialization (TessBaseAPI::Init) and resource allocation are performed in a thread-safe manner. Consider using mutex locks or synchronization mechanisms to prevent concurrent access issues during initialization.

Scoped API Usage: Utilize Tesseract API functions (SetImage, GetUTF8Text, etc.) within scoped regions to limit their visibility and prevent simultaneous access from multiple threads. This helps in managing concurrent OCR tasks more effectively.

Resource Isolation: Implement strategies to isolate OCR resources per thread or task. For example, allocate separate instances of TessBaseAPI or other necessary objects for each thread to avoid contention over shared resources.

Error Handling and Recovery: Enhance error handling routines to gracefully manage exceptions and recover from OCR failures. Implement retry mechanisms or fallback strategies to retry OCR operations upon transient errors or resource unavailability.

Performance Optimization: Optimize OCR processing by reducing unnecessary resource allocations and minimizing data copying between threads. Utilize efficient memory management techniques and leverage asynchronous processing where applicable to enhance overall system performance.

Testing and Validation: Conduct rigorous testing in diverse multithreaded scenarios to validate the reliability and stability of Tesseract OCR integration. Use stress testing to simulate high concurrent loads and identify potential bottlenecks or performance degradation points.

By implementing these suggested fixes, the application should enhance its robustness and performance when utilizing Tesseract OCR in a multithreaded environment, ensuring smooth operation and accurate text recognition across varying workload conditions.
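The "Resource Isolation" point above can be sketched without linking Tesseract at all. This is only an illustration under stated assumptions: `FakeEngine` and `recognizeAll` are hypothetical names, and `FakeEngine` stands in for a real engine object such as `tesseract::TessBaseAPI`. Each worker constructs its own engine and writes only to its own result slots, so neither the engine nor the results need a mutex.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an OCR engine such as tesseract::TessBaseAPI.
// It is NOT part of Tesseract; it only lets the sketch compile on its own.
struct FakeEngine {
    std::string recognize(const std::string& image) const {
        return "text:" + image;
    }
};

// Starts numThreads workers. Every worker constructs its own Engine, so no
// engine object is shared between threads. Worker t handles items
// t, t + numThreads, t + 2 * numThreads, ... and writes only to those slots
// of the pre-sized result vector, so the results need no lock either.
template <typename Engine>
std::vector<std::string> recognizeAll(const std::vector<std::string>& images,
                                      std::size_t numThreads) {
    std::vector<std::string> results(images.size());
    if (numThreads == 0) {
        numThreads = 1;
    }
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < numThreads; ++t) {
        workers.emplace_back([&images, &results, t, numThreads] {
            Engine engine;  // private to this thread
            for (std::size_t i = t; i < images.size(); i += numThreads) {
                results[i] = engine.recognize(images[i]);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return results;
}
```

With a real engine, the worker body would initialize the instance once and reuse it for every image assigned to that thread, which also avoids repeated setup cost when tasks arrive rapidly.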

tesseract -v

No response

Operating System

Windows 11

Other Operating System

No response

uname -a

No response

Compiler

Visual C++ 2022 00482-10000-00261-AA603
C++14

CPU

AMD Ryzen 5 5600G

Virtualization / Containers

No response

Other Information

No response

Member

stweil commented Jul 9, 2024

Please fix the sample code in your report. It should be possible to understand and use it without wasting time on guessing.

Did you know that the Tesseract development is entirely driven by a small number of volunteers? Feel free to fix any issue when you think it's necessary.

Collaborator

amitdo commented Aug 14, 2024

Regarding performance, you should disable OpenMP, either at compile time or at runtime.
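For the runtime route, one option is the standard OpenMP environment variable `OMP_THREAD_LIMIT`, set to `1` in the environment that launches the process. Setting it from inside the program may be too late, because some OpenMP runtimes read their environment when the library is loaded. The sketch below therefore only checks the variable at startup so the application can warn when it is missing; the function names are hypothetical and this is not a Tesseract API.

```cpp
#include <cstdlib>
#include <cstring>

// Returns true when the given value is exactly "1".
// A null pointer (variable not set) counts as "not limited".
bool isSingleThreadLimit(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Returns true when OMP_THREAD_LIMIT is set to "1" in this process's
// environment, i.e. the launcher asked OpenMP to stay single-threaded.
bool openMPLimitedToOneThread() {
    return isSingleThreadLimit(std::getenv("OMP_THREAD_LIMIT"));
}
```

An application that runs its own OCR worker threads could call `openMPLimitedToOneThread()` once at startup and log a warning when it returns false.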

Collaborator

amitdo commented Aug 14, 2024

https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html#v301

Thread-safety! Moved all critical global and static variables to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads.

Collaborator

amitdo commented Aug 14, 2024

tesseract/src/tesseract.cpp

Lines 675 to 678 in 577e8a8

// Call GlobalDawgCache here to create the global DawgCache object before
// the TessBaseAPI object. This fixes the order of destructor calls:
// first TessBaseAPI must be destructed, DawgCache must be the last object.
tesseract::Dict::GlobalDawgCache();

tesseract/src/dict/dict.cpp

Lines 172 to 177 in 215b023

DawgCache *Dict::GlobalDawgCache() {
  // This global cache (a singleton) will outlive every Tesseract instance
  // (even those that someone else might declare as global static variables).
  static DawgCache cache;
  return &cache;
}

Currently, the API does not expose this static variable.
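The pattern in `Dict::GlobalDawgCache` is a function-local static. A minimal standalone sketch of that pattern follows, with a hypothetical `Cache` type standing in for `DawgCache` and hypothetical function names. Since C++11 the initialization of such a static is thread-safe, so every thread receives the same object. That guarantee covers only the initialization; whether later use of the shared object is synchronized depends on the class itself.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a process-wide cache type; not Tesseract code.
struct Cache {};

// Function-local static: constructed once, on first call, and shared by
// every caller for the rest of the process lifetime.
Cache* globalCache() {
    static Cache cache;
    return &cache;
}

// Starts numThreads threads and records the cache address each one sees.
// Each thread writes only to its own slot, so no lock is needed.
std::vector<Cache*> cacheAddressPerThread(std::size_t numThreads) {
    std::vector<Cache*> seen(numThreads, nullptr);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < numThreads; ++i) {
        threads.emplace_back([&seen, i] { seen[i] = globalCache(); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return seen;
}
```

The point of the sketch is that per-thread engine instances can still share one object behind the scenes, which is worth keeping in mind when reasoning about isolation between threads.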

3 participants