TrecDocs: .Z and .z files are different. #189
Comments
Thanks @ArthurCamara! Along with #188, yet even more distribution formats for trec disks 4 & 5... I'm starting to wonder if anybody actually has the same copy :-). Looks like they should be easy enough fixes, though. Since I don't have a copy in these formats myself, would you be able to test a patch if I make a PR with fixes?
@seanmacavaney Of course! I think #188 should be REALLY straightforward to fix.
Awesome, thanks. Hopefully we're getting close to all formats.
Moving this here from a closed issue so it's not lost:
Still mulling over how I want to handle this. I don't think it's as easy as just adding the lowercase versions of the globs into path_globs, because on case-insensitive file systems it'll probably pick up each source file twice. Could be a new argument? Or just detect if the globs don't match anything and then try a lowercase version of the glob? Hmmm...
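The second idea above (retry with a lowercase glob only when the original matches nothing) could be sketched roughly as follows. This is a hypothetical illustration, not the ir_datasets implementation: the function name `glob_with_lowercase_fallback` and its arguments are assumptions made for the example.

```python
from pathlib import Path


def glob_with_lowercase_fallback(root, patterns):
    """Hypothetical helper: expand each glob under `root`, retrying with a
    lowercased pattern only when the original pattern matches nothing."""
    root = Path(root)
    found = set()
    for pattern in patterns:
        matches = list(root.glob(pattern))
        # Fall back only when nothing matched, so a case-insensitive file
        # system never yields the same source file through both variants.
        if not matches and pattern.lower() != pattern:
            matches = list(root.glob(pattern.lower()))
        found.update(matches)
    # Sort for a deterministic order.
    return sorted(found)
```

Because the lowercase pattern is tried only after an empty result, each file is picked up at most once per pattern, which sidesteps the double-matching concern without a new argument.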
@ArthurCamara -- hope it's okay, but I cannot prioritize this right now.
Of course. I don't think you should worry about it, as I said in #191
Describe the bug
I've stumbled on this before, and it seems like the same issue happens here. .z and .Z files are not always equivalent, but TrecDocs treats them as equivalent by calling .lower() on the suffix of the Path object:
ir_datasets/ir_datasets/formats/trec.py
Lines 127 to 137 in 27317b2
.Z files are created by calling the Unix command compress:
(from the man page:
while .z files are created by using gzip:
Note that gunzip can, in theory, decompress both formats, but it seems that unlzw3 can only read the first (.Z).
Some Disks45 distributions (mine, for instance) are compressed with the .z suffix (i.e. using gzip with the option -S .z):
Affected dataset(s)
All datasets that use TrecDocs, but most likely Disks45.
To Reproduce
Trying to read documents from a .z-compressed file results in this:
Additional context
The error is triggered on this line:
ir_datasets/ir_datasets/formats/trec.py
Line 136 in 27317b2
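One way to avoid relying on the suffix at all is to inspect the first two bytes of the file: gzip streams start with 1f 8b, while files written by compress (LZW) start with 1f 9d. The sketch below is only an illustration under that assumption. `sniff_compression` and `decompress` are hypothetical names, and the LZW decoder is passed in by the caller (for example a function such as unlzw3's unlzw, assumed here to accept raw bytes) rather than imported.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # gzip header, regardless of the .z / .gz suffix
LZW_MAGIC = b"\x1f\x9d"   # header written by the Unix compress command


def sniff_compression(data: bytes) -> str:
    """Classify a blob by its magic bytes instead of its file suffix."""
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if data[:2] == LZW_MAGIC:
        return "compress"
    return "unknown"


def decompress(data: bytes, lzw_decoder=None) -> bytes:
    """Hypothetical dispatcher: choose the decoder from the content."""
    kind = sniff_compression(data)
    if kind == "gzip":
        return gzip.decompress(data)
    if kind == "compress":
        if lzw_decoder is None:
            raise ValueError("LZW data found but no decoder was supplied")
        # Assumption: the supplied decoder takes bytes and returns bytes.
        return lzw_decoder(data)
    # Unknown header: assume the data is not compressed and pass it through.
    return data
```

With a check like this, a gzip-compressed .z file and an LZW-compressed .Z file would each reach a decoder that can read them, whatever the case of the suffix.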