Fix handling of UTF-8 filenames #85

Open
wants to merge 2 commits into
base: master


@cerisola cerisola commented Nov 2, 2021

Fixes #84.

Each commit message explains the purpose of its changes, so hopefully everything is clear enough.
The changes are mostly based on how Python's "zipfile" module handles UTF-8, which in my experience works fine across platforms.
I have tested the changes on both Linux and macOS. On both, the original tests pass and Unicode filenames are handled correctly by the "native" zip tools: on Linux I tested with Python's zipfile, GNOME File Roller, and Info-ZIP's unzip; on macOS I tested with Python's zipfile, the native macOS graphical unarchiver, and Info-ZIP's unzip. The major platform I haven't been able to test is Windows (I lack access to a Windows computer with Julia). That should probably be done before merging to make sure Unicode is handled correctly there too.

Following the ZIP Format Specification, when using UTF-8 encoded strings
for the filenames, it is recommended to set bit 11 of the general
purpose bit flag (see section 4.4.4 and APPENDIX D of the ZIP
specification v6.3.9). This change is required to produce zip files
containing Unicode filenames that are compatible with other zip tools
and libraries (such as Python's zipfile module).
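
For reference, the behavior this commit mirrors can be observed directly in Python's zipfile module. The following is only a minimal sketch in Python, not the Julia change in this PR; `filename_flags` and `UTF8_FLAG` are names made up for the illustration. It writes two empty entries to an in-memory archive and reads back the general purpose bit flag of each.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag (1 << 11)


def filename_flags(names):
    """Write an empty entry for each name into an in-memory zip and
    return {name: general purpose flag bits} as read back from it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: info.flag_bits for info in zf.infolist()}


flags = filename_flags(["ascii.txt", "héllo.txt"])
for name, bits in flags.items():
    # zipfile sets bit 11 only when the name is not pure ASCII
    print(name, "UTF-8 flag set:", bool(bits & UTF8_FLAG))
```

Python sets the flag only for names that cannot be encoded as ASCII, so archives with plain ASCII names are unchanged.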
Set the "version made by" field depending on the creating OS (currently
either Windows or generic Unix) and add default file attributes for the
files when being created on *nix. These changes are needed for proper
UTF-8 support since some zip tools assume that if the creation OS is
MS-DOS (the value being set before this commit) then the non-ASCII
characters are encoded using Code Page 437 and not UTF-8 (even though we
already set the UTF-8 encoding flag, these tools seem to ignore it for
MS-DOS). This is for example the case for the "unzip" tool by Info-ZIP
found on most Linux distros.
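
To illustrate what a tool produces when it assumes Code Page 437 for a name that was actually stored as UTF-8, here is a small Python sketch. The filename is an arbitrary example, and this only demonstrates the decoding mismatch, not the internals of any particular unzip tool.

```python
name = "café.txt"

# Bytes as they would be stored in the archive when the name is UTF-8
utf8_bytes = name.encode("utf-8")

# What a reader shows if it decodes those bytes as Code Page 437
garbled = utf8_bytes.decode("cp437")
print(garbled)  # caf├⌐.txt

# The bytes themselves are intact: reversing the wrong decoding recovers the name
recovered = garbled.encode("cp437").decode("utf-8")
print(recovered == name)  # True
```

The two-byte UTF-8 sequence for "é" is shown as two unrelated CP437 characters, which is the kind of mangled name described above.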
Owner

@fhs fhs left a comment

Can we always set "version made by" to Unix, even when we're creating the zip file on Windows? I'd rather not create different zip files depending on which OS we're running on. When we're reading a zip file, it's a different story -- we need to be able to read any kind of zip file. However, we're creating a zip file here, so we're allowed to lie.
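
A rough sketch of that idea using Python's zipfile (not the Julia code in this PR): the host system is forced to Unix regardless of the platform the script runs on. `write_as_unix` is a hypothetical helper, and the `0o100644` default mode is an assumption chosen for illustration, not necessarily the attributes this PR sets.

```python
import io
import zipfile

HOST_UNIX = 3  # host system code for UNIX in the "version made by" field


def write_as_unix(name, data, mode=0o100644):
    """Return the bytes of a one-entry zip that always claims to have
    been made on Unix. `mode` is an assumed default (regular file, rw-r--r--)."""
    info = zipfile.ZipInfo(name)
    info.create_system = HOST_UNIX   # same value on Windows, Linux, and macOS
    info.external_attr = mode << 16  # Unix mode lives in the high 16 bits
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, data)
    return buf.getvalue()


archive = write_as_unix("naïve.txt", b"hello")
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    entry = zf.infolist()[0]
print(entry.create_system, oct(entry.external_attr >> 16))
```

With the host system fixed this way, the archive metadata no longer depends on the platform that created it, which is the property asked for above.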

This change needs some unit tests. See test/runtests.jl.

It looks like we're running the unit tests on Windows on AppVeyor, so as long as the tests pass there, I'll be happy.

@davibarreira

Facing this issue right now. :(

Successfully merging this pull request may close these issues.

Issue when creating zip file with Unicode filenames inside