Make TessdataManager able to save archive using LibArchive #4187

Open
wants to merge 1 commit into
base: main

sadra-barikbin
Contributor

  • Made TessdataManager able to save an archive using LibArchive.
  • Added a -t option to combine_tessdata to transform a proprietary .traineddata file into an archive file.

@sadra-barikbin
Contributor Author

@stweil, shall I add a test?

ASSERT_HOST(is_loaded_);
std::vector<char> data;
Serialize(&data);
if (writer == nullptr) {
#if defined(HAVE_LIBARCHIVE)
return SaveArchiveFile(filename);
Member

With this change, TessdataManager::SaveFile will always write traineddata files in ZIP format, which is incompatible with Tesseract binaries that were built without LibArchive. I'm afraid that would cause problems for a lot of people.

Contributor

I thought libarchive can deduce archive types.

Member

@stweil stweil Feb 8, 2024

Yes, it can, but the current code uses archive_write_set_format_zip instead of archive_write_set_format_by_name, so it will always write a ZIP file. And of course libarchive cannot write the proprietary traineddata format.

@stweil
Member

stweil commented Feb 7, 2024

Added -t option to combine_tessdata to transform proprietary .traineddata to archive file.

I'd prefer a more general solution which allows different target formats. In addition it should allow writing to a different output file and use long options. So the syntax might look like this:

Usage: combine_tessdata --convert [--format TARGET_FORMAT] INFILE [OUTFILE]

TARGET_FORMAT would default to zip, but should allow any typical file extension of archive files and also traineddata for conversions into the proprietary Tesseract format.
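The proposed command line could be parsed roughly as follows. This is only a minimal sketch under stated assumptions: `ConvertOptions` and `ParseConvertArgs` are hypothetical names that do not exist in combine_tessdata, and the sketch accepts `--format` anywhere after `--convert`, which is slightly more lenient than the usage line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical option bundle for the proposed --convert mode.
struct ConvertOptions {
  std::string format = "zip";  // default target format suggested above
  std::string infile;          // required INFILE
  std::string outfile;         // optional OUTFILE, empty if not given
};

// Parses the arguments that follow the program name.
// Returns false (and leaves *out untouched) if they do not match
//   --convert [--format TARGET_FORMAT] INFILE [OUTFILE]
bool ParseConvertArgs(const std::vector<std::string> &args, ConvertOptions *out) {
  assert(out != nullptr);
  if (args.empty() || args[0] != "--convert") {
    return false;
  }
  ConvertOptions opts;
  std::vector<std::string> positional;
  for (std::size_t i = 1; i < args.size(); ++i) {
    if (args[i] == "--format") {
      if (i + 1 >= args.size()) {
        return false;  // --format requires a value
      }
      opts.format = args[++i];
    } else {
      positional.push_back(args[i]);
    }
  }
  if (positional.empty() || positional.size() > 2) {
    return false;  // INFILE is mandatory, OUTFILE is optional
  }
  opts.infile = positional[0];
  if (positional.size() == 2) {
    opts.outfile = positional[1];
  }
  *out = opts;
  return true;
}
```

What to do when OUTFILE is missing (for example, deriving a name from INFILE and the format) is left open here, as it is in the proposal.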

@egorpugin
Contributor

Isn't automatic format selection simpler?
Libarchive writes files based on their extension.
If we write .zip - it will be .zip.
.tar.gz - tar.gz
.tar.xz or lz - ...
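To illustrate the idea of picking the target format from the output filename, here is a small standard-library sketch. It does not call LibArchive at all; `DeduceTargetFormat` and its list of suffixes are assumptions made for illustration, not existing Tesseract or LibArchive code.

```cpp
#include <cassert>
#include <string>

// Returns true if `name` ends with `suffix`.
static bool EndsWith(const std::string &name, const std::string &suffix) {
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: maps an output filename to a format label.
// Returns an empty string if the extension is not recognized.
std::string DeduceTargetFormat(const std::string &filename) {
  // Illustrative list only; a real implementation would cover more formats.
  static const char *const kKnownSuffixes[] = {".traineddata", ".tar.gz",
                                               ".tar.xz", ".zip"};
  for (const char *suffix : kKnownSuffixes) {
    if (EndsWith(filename, suffix)) {
      return std::string(suffix + 1);  // drop the leading dot
    }
  }
  return "";
}
```

With this sketch, `eng.zip` maps to `zip` and `eng.tar.gz` to `tar.gz`, while a name such as `eng.traineddata` maps to the proprietary format, so the extension alone cannot request an archive that keeps the `.traineddata` name.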

@stweil
Member

stweil commented Feb 8, 2024

That's right, and this feature of LibArchive would also be used to implement my suggested solution.

If we implement support for combine_tessdata --convert eng.traineddata eng.zip, that would create a filename which is currently unsupported by tesseract, so an additional renaming step, mv eng.zip eng.traineddata, would be required. Should we extend the Tesseract code to support more extensions than the current .traineddata, so that -l eng.zip would work? Or can you think of another solution? Maybe eng.traineddata.zip?

@zdenop
Contributor

zdenop commented Feb 8, 2024

Is there also an intention for tesseract to read such converted data?

If yes, then please be careful about changing the extension: it will break a lot of workflows that look for available/installed languages (AFAIR also GetAvailableLanguagesAsVector, a.k.a. tesseract --list-langs).

Nowadays it is quite common to use a private file extension instead of one indicating that the file is an archive (e.g. xlsx and odt are zip archives).

On the other hand: if the file extension is not changed and tesseract is built without libarchive support, the error handling has to be improved to explain why tesseract is not able to read the traineddata...
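One way to get a clearer error in that situation would be to check the first bytes of the file for a ZIP signature. The sketch below is only an assumption about how this could look: `LooksLikeZip` and `DescribeLoadFailure` are hypothetical helpers, and only ZIP is detected (signatures `PK\x03\x04` for a local file header and `PK\x05\x06` for an empty archive), not other archive formats.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if the buffer starts with a ZIP signature:
// "PK\x03\x04" (local file header) or "PK\x05\x06" (empty archive).
bool LooksLikeZip(const std::vector<char> &data) {
  if (data.size() < 4) {
    return false;
  }
  if (data[0] != 'P' || data[1] != 'K') {
    return false;
  }
  return (data[2] == 3 && data[3] == 4) || (data[2] == 5 && data[3] == 6);
}

// Hypothetical helper: builds a message for a failed traineddata load.
// `have_libarchive` stands in for a compile-time check such as
// HAVE_LIBARCHIVE in the real code.
std::string DescribeLoadFailure(const std::string &path,
                                const std::vector<char> &data,
                                bool have_libarchive) {
  if (!have_libarchive && LooksLikeZip(data)) {
    return path + ": file looks like a ZIP archive, but this build has no "
                  "libarchive support";
  }
  return path + ": could not read traineddata";
}
```

The check costs four byte comparisons on the data that has already been read, so it would only matter on the failure path.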

@sadra-barikbin
Contributor Author

sadra-barikbin commented Mar 4, 2024

I thought it had already been decided to implement this feature. Now it seems to be debatable. My own requirement, i.e. inspecting the config file in the .traineddata file and possibly overwriting it, is fulfilled by the sequence of combine_tessdata -e and combine_tessdata -o. I don't see a strong case for having this feature either. Instead, I could work on a PageXML renderer if you agree.
