Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TrecDocs: .Z and .z files are different. #189

Open
ArthurCamara opened this issue May 5, 2022 · 7 comments
Open

TrecDocs: .Z and .z files are different. #189

ArthurCamara opened this issue May 5, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@ArthurCamara
Copy link
Contributor

Describe the bug
I've stumbled on this before, and it seems like the same issue happens here. .z and .Z files are not always equivalent, but TrecDocs treat them like so by calling .lower() on the suffix of the Path object:

def _docs_iter(self, path):
if Path(path).is_file():
path_suffix = Path(path).suffix.lower()
if path_suffix == '.gz':
with gzip.open(path, 'rb') as f:
yield from self._parser(f)
elif path_suffix in ['.z', '.0z', '.1z', '.2z']:
# unix "compress" command encoding
unlzw3 = ir_datasets.lazy_libs.unlzw3()
with io.BytesIO(unlzw3.unlzw(path)) as f:
yield from self._parser(f)

.Z files are created by calling the Unix command compress:
(from the man page:

Compress reduces the size of the named files using adaptive Lempel-Ziv coding. Whenever possible, each file is replaced by one with the extension .Z (...)

while .z files are created by using gzip:

gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, _z or .Z (...)

Note that gunzip can decompress BOTH formats, in theory, but, it seems like unlzw3 can only read the first (.Z)

There are some Disks45 distributions (mine, for instance) that are compressed with .z (i.e. using gunzip with option -S .z):

-S .suf --suffix .suf
When compressing, use suffix .suf instead of .gz. Any non-empty suffix can be given, but suffixes other than .z and .gz should be avoided to avoid confusion when files are transferred to other systems.

Affected dataset(s)

All that used TrecDocs, but Disks45 more likely.

To Reproduce
Trying to read documents with a .z compressed files results in this:

TypeError: string argument without an encoding

Additional context
Error is trigged on this line:

with io.BytesIO(unlzw3.unlzw(path)) as f:

@ArthurCamara ArthurCamara added the bug Something isn't working label May 5, 2022
@seanmacavaney
Copy link
Collaborator

seanmacavaney commented May 5, 2022

Thanks @ArthurCamara! Along with #188, yet even more distribution formats for trec disks 4 & 5... I'm starting to wonder if anybody actually has the same copy :-).

Looks like they should be easy enough fixes, though. Since I don't have a copy in these formats myself, would you be able to test a patch if I make a PR with fixes?

@ArthurCamara
Copy link
Contributor Author

@seanmacavaney Of course! I think #188 should be REALLY straight forward to fix.
We first got into this kind of problem with Robust during OSIRRC 2019. Everyone had a different version of it.

@seanmacavaney
Copy link
Collaborator

Awesome, thanks.

Hopefully we're getting close to all formats.

@seanmacavaney
Copy link
Collaborator

Moving this here from a closed issue so it's not lost:

Less important, and it can be an issue with my version of the dataset: filenames and folders can be lowercase (i.e. disks45/corpus/fbis/fb496247), so the glob would not find these

seanmacavaney added a commit that referenced this issue May 5, 2022
improved tests

regarding #189
@seanmacavaney
Copy link
Collaborator

...filenames and folders can be lowercase (i.e. disks45/corpus/fbis/fb496247), so the glob would not find these...

Still mulling over how I want to handle this. I don't think it's as easy as just adding the lowercase versions of the globs into path_globs, because on case insensitive file systems it'll probably pick up each source file twice. Could be a new argument? Or just detect if the globs don't match anything and then try a lowercase version of the glob? Hmmm...

@seanmacavaney
Copy link
Collaborator

@ArthurCamara -- hope it's okay, but I cannot priotize this right now.

@ArthurCamara
Copy link
Contributor Author

Of course. I don't think you should worry about it, as I said in #191

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants