Balinese Script OCR #152

Open
gindrawan opened this issue Mar 17, 2020 · 26 comments

Comments

@gindrawan

Hi,
I want to develop OCR for Balinese script (https://en.wikipedia.org/wiki/Balinese_script) using Tesseract 4.0 and the jTessBoxEditor 2.2.1 tool (which still does not support LSTM?).

Two fonts are involved (see the attachment):

  1. Bali Simbar Dwijendra (glyph shapes quite close to ancient Balinese glyphs; the most popular font in Bali, but non-Unicode)
  2. Noto Serif Balinese (quite modern glyphs, and already using the Balinese Unicode block)

I want to accommodate both fonts, with priority given to Bali Simbar Dwijendra. Sorry, I am new to Tesseract; how do I get started?

Thank you very much for your kind attention.

Best regards, Indra

bali-simbar-dj-noto-serif-balinese.zip

@stweil
Member

stweil commented Mar 18, 2020

Hi @gindrawan, with jTessBoxEditor you will get a recognition model which uses the old legacy recognizer, but not the LSTM one.

For training LSTM, you need a large amount of ground truth data, that is, pairs of line images and text files with the corresponding text. You can use images generated by rendering the text with a Balinese font, and you can also use scans from Balinese publications (books, newspapers, ...), where you have to extract the lines and transcribe the text. Ideally both kinds of images are available.

@Shreeshrii
Contributor

Are there any converters from Bali Simbar Dwijendra to Unicode?

@gindrawan
Author

Are there any converters from Bali Simbar Dwijendra to Unicode?

As far as I know, there is no such converter. I found the Vimala font, whose glyph shapes are quite close to those of Bali Simbar Dwijendra, as I mentioned in #126.

@gindrawan
Author

Hi @Shreeshrii ,

Based on your changes to the tesseract code base in

tesseract-ocr/tesseract@b34cf9d#diff-eaafd22a79065f5b8d28318d482e650d

if I insert, let's say, "Noto Sans Balinese" or "Vimala" at line 608 (after ""Prada" "), what else is critical for me to add or change to make the training work? (Of course, I am also looking at the other 2 files that you changed.)

tesseract-ocr/tesseract@0eb7be1#diff-eaafd22a79065f5b8d28318d482e650d
tesseract-ocr/tesseract@7957288#diff-eaafd22a79065f5b8d28318d482e650d

I'm still in "trial and error" mode, learning the process from your https://github.com/Shreeshrii/tessdata_jav_java. Before I go down that route, some light from you would be useful. Thanks...

@Shreeshrii
Contributor

Shreeshrii commented Mar 24, 2020 via email

@gindrawan
Author

jav_java was done more than a year ago. Now there is also the possibility of training from line images. See the tesseract-ocr/tesstrain repo. Please wait for a day or two. I am in the process of setting up something for Balinese that you can then extend with your training text. It is possible that no changes will be required in the tesseract codebase. It will also be useful if you can create ground truth transcriptions in Unicode for at least 5 scanned page images from books, which can be used for validating the training. You can also create a few hundred line images with transcriptions for fine-tuning the traineddata created with synthetic images.


Thank you @Shreeshrii
Here are scanned page images from books (from a quick Internet search), in various image types and sizes.
I am still preparing the synthetic images (in Noto Sans/Serif Balinese and Vimala); I hope to post them today or tomorrow.

Another thing: if the trained data is generated successfully, will it be compatible with Tesseract4Android (https://github.com/adaptech-cz/Tesseract4Android)? It requires trained data files from https://github.com/tesseract-ocr/tessdata/tree/4.0.0

balinese-script-images-v1.zip

@Shreeshrii
Contributor

Shreeshrii commented Mar 24, 2020 via email

@gindrawan
Author

Just images are not enough. What is needed is the correct (ground truth) text in Unicode for each of those images. So the files should be 001.png and 001.gt.txt: the same basename, with .gt.txt holding the Unicode text for each image. For a work in progress, see https://github.com/Shreeshrii/tesstrain-bali/tree/master/test. I need the correct text for the images so that it can be compared with the OCRed text to verify accuracy on actual images.
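The naming convention described above (001.png next to 001.gt.txt) is easy to check mechanically before training. The following is only a minimal sketch: the function name `find_unpaired` and the list of accepted image suffixes are assumptions made for illustration, not part of tesstrain.

```python
from pathlib import Path


def find_unpaired(gt_dir, image_suffixes=(".png", ".tif")):
    """Return (images lacking a .gt.txt, .gt.txt files lacking an image).

    Both results are sorted lists of basenames, e.g. "001".
    """
    gt_dir = Path(gt_dir)
    # Basenames of all image files, e.g. "001" for "001.png".
    images = {
        p.stem for p in gt_dir.iterdir() if p.suffix.lower() in image_suffixes
    }
    # Basenames of all transcriptions, e.g. "001" for "001.gt.txt".
    texts = {
        p.name[: -len(".gt.txt")]
        for p in gt_dir.iterdir()
        if p.name.endswith(".gt.txt")
    }
    return sorted(images - texts), sorted(texts - images)
```

Calling `find_unpaired("test")` on a ground truth folder returns two lists; both should be empty before the data is used for validation.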


Sorry, I forgot about the text files. They may need more time.
OK then, I am still preparing the synthetic images; I think those will be ready faster.
One question: how many words are needed per line?

@gindrawan
Author

This is a small image and text file pair using Noto Serif Balinese; I took them from https://en.wikipedia.org/wiki/Balinese_script. I hope they can be used for now.
small-pair-image-text.zip

@gindrawan
Author

Oh, I forgot. Does the image need its box file, or only the Unicode text?

@Shreeshrii
Contributor

Shreeshrii commented Mar 24, 2020 via email

@gindrawan
Author

Hi @Shreeshrii,

It seems I need more time to prepare the training data (1-2 more days).

Meanwhile, I just realized that there are two kinds of training data: page images (https://github.com/topherseance/javanese-aksara-training-text) and line images.

Based on your previous answer, it seems you prefer line images? What about page images?

Preparing line images takes more effort in my case, because each page image needs to be converted to several line images. But if the training result is sufficiently better, that's OK.

Attached is a sample of my page image with its ground truth text. Is that OK before I proceed further to line images?

ban.notoserifbalinese.gt_001.zip

@Shreeshrii
Contributor

Are you preparing synthetic data using fonts or using actual images similar to what needs to be recognised later?

@Shreeshrii
Contributor

https://github.com/Shreeshrii/tesstrain-bali/tree/master/langdata

I had done a training run with 4-5 fonts.

@gindrawan
Author

Are you preparing synthetic data using fonts or using actual images similar to what needs to be recognised later?

I am preparing about 5 thousand words (the remaining 29 thousand or so are still having their Unicode verified) for synthetic data using Noto Serif Balinese. I just downloaded the latest font, updated 3 days ago (https://github.com/googlefonts/noto-fonts/tree/master/phaseIII_only/unhinted/ttf/NotoSerifBalinese). It is somehow more up to date than Noto Sans Balinese.

Those 5 thousand words have already been turned into 101 page images, each containing 12 lines of training text, with about 5-10 words per line. I need a little more time to finalize them. Going to line images will need extra time.

After that I will move on to Vimala, with the same Unicode text as for Noto Serif Balinese. Vimala is more likely to be needed for recognition of actual images.

The font most needed for recognition of actual images, Bali Simbar Dwijendra (BSD), we plan to do later: since it uses non-Balinese Unicode, preparing the training data takes more time and effort. Actually, if BSD is involved, the Balinese script recognition app would have 2 options for post-processing: Unicode and non-Unicode (I imagine a radio button to select before recognition).

@Shreeshrii
Contributor

Shreeshrii commented Mar 26, 2020

Generation of synthetic data is not an issue. It is actually quite easy to generate page images or line images given a training text and a set of fonts.

See https://github.com/Shreeshrii/tesstrain-bali/tree/master/gt/bali-Vimala
which has line images and their ground truth generated from random Sanskrit text (https://github.com/Shreeshrii/tesstrain-bali/blob/master/langdata/bali.training_text) converted to Balinese script. It does not show up correctly in my web browser, but it is OK when I apply the Vimala font in Notepad++.

LSTM training works on line images, so it is better to do line images. But this can be done easily by a computer.
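As a rough illustration of the text side of that conversion, a page transcription can be split into one ground truth file per line. This is only a sketch: the function name `write_line_ground_truth` and the `<basename>_NNN.gt.txt` naming are assumptions, and the matching line images still have to be cropped or rendered separately under the same basenames.

```python
from pathlib import Path


def write_line_ground_truth(page_text, out_dir, basename):
    """Write each non-empty line of a page transcription to its own .gt.txt.

    Returns the list of files written, in line order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop blank lines and surrounding whitespace; one text line per file.
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    paths = []
    for i, line in enumerate(lines, start=1):
        path = out_dir / f"{basename}_{i:03d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```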

It seems to me that you are just taking a word list and generating text lines and images from that. Instead you should actually be using sentences and paragraphs and phrases along with punctuation similar to the pages that need to be recognized.

The most needed for actual images recognition, Bali Simbar Dwijendra (BSD) we plan later since using non-balinese unicode,

If there was any script which maps from BSD to Unicode then it can probably be handled programmatically. Otherwise you should take a page in BSD and transcribe it in Unicode.

When I asked for page images for testing, I meant some sample actual images (in BSD) .

I am generating images in five fonts:
Kadiri
Noto Sans Balinese
Noto Serif Balinese
Pustaka Bali
Vimala

However, if only Vimala is required, it will probably be faster to get convergence.
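One way to script the rendering for the fonts listed above is to build one `text2image` command per font. This is a sketch under assumptions, not the setup actually used in tesstrain-bali: the helper name, the output naming scheme and the directory layout are made up for illustration, and only the `--text`, `--outputbase`, `--font` and `--fonts_dir` options of Tesseract's `text2image` tool are used.

```python
FONTS = [
    "Kadiri",
    "Noto Sans Balinese",
    "Noto Serif Balinese",
    "Pustaka Bali",
    "Vimala",
]


def text2image_command(training_text, font, fonts_dir, output_dir):
    """Build the argument list for rendering one training text in one font."""
    # Hypothetical naming scheme: bali.<Font_Name>.exp0
    outputbase = f"{output_dir}/bali.{font.replace(' ', '_')}.exp0"
    return [
        "text2image",
        f"--text={training_text}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


# Usage (requires the Tesseract training tools and the fonts to be installed):
#   import subprocess
#   for font in FONTS:
#       subprocess.run(
#           text2image_command("langdata/bali.training_text", font, "fonts", "out"),
#           check=True,
#       )
```

Passing the arguments as a list avoids shell quoting problems with font names that contain spaces.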

@gindrawan
Author

It's OK, I think, to include all of those fonts. Kadiri, Pustaka Bali, and Vimala seem to mimic different styles of ancient glyphs. Moreover, Vimala was developed with the BSD style as a reference. Noto Sans Balinese and Noto Serif Balinese do not seem to differ much from each other. I don't know why Google released both of them.

@gindrawan
Author

If there was any script which maps from BSD to Unicode then it can probably be handled programmatically. Otherwise you should take a page in BSD and transcribe it in Unicode.

@Shreeshrii, I just made a mapping from BSD to Balinese Unicode; perhaps it will be useful.
bsdcode.2.balineseunicode.txt

@Shreeshrii
Contributor

Shreeshrii commented Mar 27, 2020

Is http://www.unicode.org/udhr/d/udhr_ban.html in BSD?

I did a simple substitution using sed to convert the text from there to Unicode using the mapping you suggested. I don't think it is correct. B is not converted, also some signs don't seem right. I don't know the language to verify.

bsd2unicode.sed.txt

udhr.unicode.txt
udhr.latn.txt
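A table-driven substitution like the sed approach can also be sketched in Python, with the advantage that characters the table does not cover (such as the unconverted "B" mentioned above) are reported explicitly. The three mapping entries below are illustrative assumptions inferred from the "smi" example given later in this thread; the real table is in the attached mapping file and is not reproduced here.

```python
import re

# Illustrative entries only (assumed, not taken from the attached mapping file).
EXAMPLE_MAPPING = {
    "s": "\u1b32",  # BALINESE LETTER SA
    "m": "\u1b2b",  # BALINESE LETTER MA
    "i": "\u1b36",  # BALINESE VOWEL SIGN ULU
}


def build_converter(mapping):
    """Return convert(text) -> (converted_text, sorted unmapped characters)."""
    if not mapping:
        raise ValueError("mapping must not be empty")
    # Try longer keys first so multi-character codes win over their prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def convert(text):
        converted = pattern.sub(lambda m: mapping[m.group(0)], text)
        # Whatever is left after deleting every match was not covered.
        leftovers = pattern.sub("", text)
        unmapped = sorted({ch for ch in leftovers if not ch.isspace()})
        return converted, unmapped

    return convert
```

Unlike a chain of sed expressions, the single compiled pattern replaces each position once, so the output of one rule is never rewritten by a later rule.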

@gindrawan
Author

Is http://www.unicode.org/udhr/d/udhr_ban.html in BSD?

It is in Balinese Latin (just as Javanese Latin uses the conventional name "java" and Javanese script uses "java-jav"). From there we can convert it to many Balinese script fonts (BSD, Vimala, Noto Serif Balinese, etc.), but it needs some rule-based text preprocessing first.
For example:
The first word, "Sami", on the second line must be converted for

  1. BSD (non-Unicode) to "smi" (see the LibreOffice ODT illustration file in the attachment)
  2. other Balinese fonts (Unicode) to "ᬲᬫᬶ"

For the reverse process (Balinese script to Balinese Latin), I actually don't know how to make this work in Tesseract, as illustrated in the attachment.
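The difference between the two encodings in the "Sami" example can be checked in code: BSD text such as "smi" consists of Latin code points, while Unicode Balinese text lies in the Balinese block (U+1B00 to U+1B7F). A minimal sketch of such a check for ground truth text follows; the function name is an assumption made for illustration.

```python
BALINESE_BLOCK = range(0x1B00, 0x1B80)  # Unicode Balinese block, U+1B00..U+1B7F


def non_balinese_chars(text):
    """Return the sorted non-space characters that fall outside the Balinese block."""
    return sorted(
        {ch for ch in text if not ch.isspace() and ord(ch) not in BALINESE_BLOCK}
    )
```

An empty result for a .gt.txt file suggests the transcription is in Balinese Unicode; a result such as `['i', 'm', 's']` suggests BSD codes or Latin text slipped in. Mixed-script pages that legitimately contain Latin digits or punctuation would need a wider allowed set.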

sami.zip

@gindrawan
Author

Oh, for Balinese Script to Balinese Latin at the illustration file
"the input" means "the image input"

@gindrawan
Author

This is my LibreOffice screenshot. You must install the Bali Simbar Dwijendra font on your Linux OS.

libre-bsd

@Shreeshrii
Contributor

The way tesseract (lstm version) works, the image will be recognised as Unicode text which will render correctly with Unicode Balinese fonts. So, both Vimala and Noto fonts should be able to render the same output.

@gindrawan
Author

I did a simple substitution using sed to convert the text from there to Unicode using the mapping you suggested. I don't think it is correct. B is not converted, also some signs don't seem right. I don't know the language to verify.

Hi @Shreeshrii, I just improved the BSD code to Unicode mapping at https://github.com/gindrawan/balinese-bsdcode-2-unicode based on your sed file (in the attachment I marked entries with status OK, REV, and ADDED; not all of the added mappings are included there, see the link).

I have tested bali1.traineddata from https://github.com/Shreeshrii/tesstrain-bali/tree/master/data on a simple BSD word image, but the result is still not right (the file is in the attachment, with gt text files in both BSD code and Unicode for checking). Perhaps that is because it has not yet been trained on BSD.

Regarding udhr.latn.txt, if you want to transliterate it to BSD-style Balinese script, you can try this Android app (still not perfect, though): https://play.google.com/store/apps/details?id=id.ac.undiksha.aksarabalisd&hl=en

bsd2unicode.sed.txt
bakta.zip

@Shreeshrii
Contributor

Shreeshrii commented Apr 1, 2020 via email

@gindrawan
Author

I just made them, but the set is still small since generating them is quite manual.
https://github.com/gindrawan/balinese-script-training
I am thinking about how to speed it up...

How will you train tesseract with such data?
I guess you will feed it the generated images (from the related BSD gt text files) and map them to the NSB gt text files.


3 participants