Add filename in error message when ReadError or similar occur #7280
Comments
Hi Elisa, please share the error traceback here, and if you manage to find the location in the …
I don't have access to it, but it should be this exception, which happens while loading a dataset. If one of the downloaded files is corrupted, the for loop will not yield correctly, and the error will come from, say, in the case of tar files, this iterable, which has no explicit error handling that leaves clues as to which file failed. I only know the tar case, but I suspect the same issue exists for other file types too.
I think better error handling for this tar iterable would already be useful, maybe a simple try/except in … (rough sketch below).
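For illustration only, a minimal sketch of what such a try/except could look like. The function name and signature are made up here; the real iterable inside datasets differs, but the idea is just to re-raise the error with the archive path attached:

```python
import tarfile
from typing import IO, Iterator, Tuple


def iter_tar(filepath: str) -> Iterator[Tuple[str, IO[bytes]]]:
    """Yield (member_name, file_object) pairs from a tar archive, re-raising
    read errors with the archive path so corrupted downloads are easy to spot.
    Illustrative sketch only; not the actual `datasets` implementation."""
    try:
        with tarfile.open(filepath) as tar:
            for member in tar:
                if not member.isfile():
                    continue
                fileobj = tar.extractfile(member)
                if fileobj is not None:
                    yield member.name, fileobj
    except tarfile.ReadError as err:
        # Keep the original error but add the file that caused it.
        raise tarfile.ReadError(f"{err} (while reading {filepath!r})") from err
```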
I think not just from higher calls like the …
so maybe there should be better error messages in each dataset builder definition? E.g. in https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/webdataset/webdataset.py for webdataset TAR archives (see the sketch after this comment).
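As an illustration of that idea (not the real webdataset builder, whose generator has a different signature and yields actual examples), a builder-level loop could wrap each shard so a read failure reports which shard it came from:

```python
import tarfile
from typing import Iterable, Iterator, Tuple


def generate_examples(tar_paths: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    # Hypothetical builder-level generator: `tar_paths` stands in for the
    # downloaded shard paths a real builder receives from the download manager.
    for tar_path in tar_paths:
        try:
            with tarfile.open(tar_path) as tar:
                for member in tar:
                    if member.isfile():
                        yield f"{tar_path}/{member.name}", {"file_name": member.name}
        except (tarfile.ReadError, EOFError) as err:
            # Attach the shard path so the user knows which download to redo.
            raise RuntimeError(
                f"Failed to read examples from shard {tar_path!r}; "
                "the archive may be corrupted or only partially downloaded."
            ) from err
```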
Please update error messages to include relevant information for debugging when loading datasets with load_dataset() that may have a few corrupted files.

Whenever downloading a full dataset, some files might be corrupted (either at the source or during download). However, the errors often only let me know it was a tar file if tarfile.ReadError appears in the traceback, and I imagine it is similar for other file types. This makes it really hard to debug which file is corrupted, and when dealing with very large datasets, it shouldn't be necessary to force-download everything again.