Support Tesseract "script" language options (such as Cyrillic.traineddata) #15

Open
RasimKhusaenov opened this issue Nov 4, 2024 · 5 comments


@RasimKhusaenov

Thank you for your project! Should we expect support for Tesseract script models? As far as I understand, we can currently pass any language code through to Tesseract, but not every language that Tesseract covers is available as a language model. For example, many languages written in Cyrillic can be recognized by Tesseract, but only if you pass a script as the argument rather than a language. Thank you!

@Balearica
Contributor

I'm not sure if I understand your question. If you're asking whether languages that use Cyrillic script are supported, both Russian and Ukrainian should already be supported, and we can add support for additional languages that use Cyrillic script as requested. You can set these languages using the language codes rus or ukr when setting the langs argument (see the docs). You can test Russian and Ukrainian support with individual documents using the officially supported GUI at scribeocr.com.

@RasimKhusaenov
Author

Thank you so much for taking the time to help with my question! I apologize for the confusion in my initial message—I realize now that I may not have explained the issue clearly and might have led you in the wrong direction. I really appreciate your support and all the work you’ve put into this project.

Let me clarify: After some research, I discovered that it’s possible to load custom .traineddata files, including specific scripts from the Tesseract tessdata repository, using Tesseract.js. For example:

Tesseract.recognize('path/to/image.png', 'Cyrillic', {
  langPath: 'path/to/script/folder' // the folder that contains Cyrillic.traineddata.gz, not the file itself
}).then(({ data: { text } }) => {
  console.log(text);
});

However, it seems that scribe.js does not currently support a langPath parameter for custom traineddata file locations. The langPath option appears to be commented out in js/worker/generalWorker.js at line 86, which prevents loading custom scripts.

Is there any plan to add support for a custom langPath argument in scribe.js? This would allow for greater flexibility when working with specific scripts or custom language files.
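
As far as I know, Tesseract.js treats langPath as the folder (or URL prefix) that holds the .traineddata files and appends the language or script name itself. A minimal sketch of that joining step follows. The helper name resolveTraineddataPath is hypothetical and is not part of Tesseract.js or Scribe.js; it only illustrates why the option should point at a folder rather than at a file.

```javascript
// Hypothetical helper, for illustration only (not Tesseract.js or Scribe.js code).
// Combines a base folder, a language or script name, and an optional .gz suffix.
function resolveTraineddataPath(langPath, lang, gzip = true) {
  // Drop trailing slashes so the join never produces "//".
  const base = langPath.replace(/\/+$/, '');
  const suffix = gzip ? '.gz' : '';
  return `${base}/${lang}.traineddata${suffix}`;
}

console.log(resolveTraineddataPath('path/to/script/folder/', 'Cyrillic'));
// prints "path/to/script/folder/Cyrillic.traineddata.gz"
```

Under that assumption, passing the full file path as langPath would make the library look for a file nested under a path that already ends in .traineddata.gz.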

@Balearica
Contributor

> Is there any plan to add support for a custom langPath argument in scribe.js? This would allow for greater flexibility when working with specific scripts or custom language files.

Can you explain what you are looking to accomplish with this? Specifically, are you trying to use custom .traineddata files with scripts that are already supported (currently only Latin and Cyrillic), or are you trying to add support for new scripts (e.g. Arabic, Devanagari)? I'm not opposed to exposing the langPath argument, but I want to confirm that it would solve an actual problem.

For context, if I simply exposed the langPath argument, users could provide their own training data, but it would only work properly with scripts that are already supported. For example, if you trained a custom English LSTM model and have your own .traineddata file, then exposing langPath would allow you to use it.

However, exposing langPath would not allow individual users to use new scripts because, unlike Tesseract.js, Scribe.js requires additional code and resources to support each script. For example, the font resources Scribe.js uses do not contain Devanagari characters, so adding support for Hindi (as requested by #16) will take more work than simply adding a new .traineddata file.

@RasimKhusaenov
Author

The purpose of customizing langPath is to enable the use of the script/Cyrillic, as I need Tesseract to recognize not only Russian and Ukrainian characters. Specifically, it should recognize the characters of the Tatar language (ә, җ, ң, ө, ү, һ).
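
To make the gap concrete, here is a small sketch that lists the Cyrillic letters in a string that belong to neither the Russian nor the Ukrainian alphabet. The function name uncoveredCyrillic and the sample text are illustrative assumptions. It only compares alphabets and says nothing about what a particular .traineddata file was actually trained on.

```javascript
// Lowercase letters of the Russian and Ukrainian alphabets.
const RUSSIAN = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя';
const UKRAINIAN = 'абвгґдеєжзиіїйклмнопрстуфхцчшщьюя';
const covered = new Set(RUSSIAN + UKRAINIAN);

// Return the distinct Cyrillic characters in `text` that neither alphabet contains,
// in order of first appearance.
function uncoveredCyrillic(text) {
  const found = new Set();
  for (const ch of text.toLowerCase()) {
    // The Unicode property escape matches any character in the Cyrillic script.
    if (/^\p{Script=Cyrillic}$/u.test(ch) && !covered.has(ch)) {
      found.add(ch);
    }
  }
  return [...found];
}

// Illustrative sample containing the Tatar letters mentioned above.
console.log(uncoveredCyrillic('Татар теле: ә, җ, ң, ө, ү, һ'));
```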

@Balearica
Contributor

Thanks for explaining. I was not familiar with the "script" .traineddata files. Unfortunately, it does not look like Scribe.js would work well with these language files. The (default) Scribe.js accurate mode requires .traineddata files that contain data for both the Legacy and LSTM models. Per the Tesseract documentation, the Cyrillic.traineddata and tat.traineddata files only contain data for the LSTM model. Therefore, somebody would need to train a Legacy model for Tatar before the language could be well supported in Scribe.js.
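
The constraint can be sketched as a simple check. The table and the function name usableInAccurateMode are hypothetical and are not Scribe.js code. The two LSTM-only entries restate what is said above, and the 'example-both' entry is made up purely for contrast.

```javascript
// Which model types each .traineddata file carries.
// 'Cyrillic' and 'tat' follow the comment above; 'example-both' is a made-up entry.
const MODELS = {
  Cyrillic: { legacy: false, lstm: true },
  tat: { legacy: false, lstm: true },
  'example-both': { legacy: true, lstm: true }, // hypothetical file with both models
};

// Hypothetical check: accurate mode needs both Legacy and LSTM data in the file.
function usableInAccurateMode(name) {
  const model = MODELS[name];
  return Boolean(model && model.legacy && model.lstm);
}

console.log(usableInAccurateMode('Cyrillic'));     // false
console.log(usableInAccurateMode('example-both')); // true
```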

@Balearica changed the title from "Scripts (like Cyrillic) support" to 'Support Tesseract "script" language options (such as Cyrillic.traineddata)' on Jan 7, 2025