Add filename in error message when ReadError or similar occur #7280

Open
elisa-aleman opened this issue Nov 7, 2024 · 5 comments

@elisa-aleman

Please update the error messages raised when loading datasets with load_dataset() so that they include information useful for debugging, such as the name of the file that failed, since a dataset may contain a few corrupted files.

When downloading a full dataset, some files might be corrupted, either at the source or during the download.

However, the errors often only tell me that the failing file was a tar file, and only because tarfile.ReadError appears in the traceback. I imagine the same is true for other file types.

This makes it really hard to work out which file is corrupted, and with very large datasets it shouldn't be necessary to force a re-download of everything.

@lhoestq
Member

lhoestq commented Nov 18, 2024

Hi Elisa, please share the error traceback here, and if you manage to find the location in the datasets code where the error occurs, feel free to open a PR to add the necessary logging / improve the error message.

@elisa-aleman
Author

> please share the error traceback

I don't have access to it, but it should occur at this exception, which is raised while a dataset is being loaded. If one of the downloaded files is corrupted, the for loop will not yield correctly and the error will come from, in the case of tar files, this iterable, which has no explicit error handling that would leave a clue as to which file failed.

I only know the case for tar files, but I suspect this issue could affect other file types too.

@lhoestq
Member

lhoestq commented Nov 18, 2024

I think having better error handling for this tar iterable would already be useful. Maybe a simple try/except in _iter_from_urlpath that checks for tarfile.ReadError and raises an error that mentions the urlpath?
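
A minimal sketch of that try/except idea, using only the standard library. The helper name `iter_tar_members` and its signature are hypothetical and are not the actual `datasets` code; it only illustrates catching tarfile.ReadError and re-raising it with the urlpath in the message.

```python
import io
import tarfile


def iter_tar_members(fileobj, urlpath):
    """Yield (name, content) for each regular file in a tar stream.

    Hypothetical helper for illustration, not the `datasets` implementation.
    `urlpath` is only used to make the error message point at the archive.
    """
    try:
        # "r|*" reads the archive as a forward-only stream, whatever the compression.
        with tarfile.open(fileobj=fileobj, mode="r|*") as stream:
            for member in stream:
                if member.isfile():
                    yield member.name, stream.extractfile(member).read()
    except tarfile.ReadError as err:
        # Re-raise the same exception type, adding which archive failed.
        raise tarfile.ReadError(
            f"Failed to read tar archive at {urlpath!r}: {err}"
        ) from err
```

Because this is a generator, the error surfaces when the caller iterates, and the original exception stays attached as `__cause__`, so the traceback still shows where in tarfile the read failed.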

@elisa-aleman
Author

I think the best place would be not only higher-level calls such as _iter_from_urlpath but wherever a file is actually opened, since the traceback would then lead straight to that point.
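
One way this could look at the point where a file is opened is a small context manager wrapped around each open/read site. This is only a sketch under assumptions: `FileReadError` and `reading` are hypothetical names that do not exist in `datasets`, and the set of caught exceptions is a guess at what a corrupted file might raise.

```python
import contextlib
import tarfile


class FileReadError(Exception):
    """Hypothetical error type that records which file failed to load."""

    def __init__(self, path, original):
        super().__init__(f"Could not read {path!r}: {original}")
        self.path = path


@contextlib.contextmanager
def reading(path):
    """Attach `path` to read errors raised inside the `with` block."""
    try:
        yield
    except (tarfile.ReadError, OSError, EOFError) as err:
        raise FileReadError(path, err) from err


def list_tar_names(path):
    # Example open site: any failure here now names the offending file.
    with reading(path):
        with tarfile.open(path) as archive:
            return archive.getnames()
```

Placing the wrapper at the open site means the failing path is reported no matter which higher-level caller triggered the read.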

@lhoestq
Member

lhoestq commented Nov 20, 2024

So maybe there should be better error messages in each dataset builder definition? E.g. in https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/webdataset/webdataset.py for webdataset TAR archives.
