Add filename in error message when ReadError or similar occur #7280

elisa-aleman · 2024-11-07T06:00:53Z

Please update error messages to include relevant information for debugging when loading datasets with load_dataset() that may have a few corrupted files.

When downloading a full dataset, some files might be corrupted (either at the source or during the download).

However, the errors often only tell me that the failing file was a tar file, and only because tarfile.ReadError appears in the traceback. I imagine the same is true for other file types.

This makes it really hard to debug which file is corrupted, and when dealing with very large datasets, it shouldn't be necessary to force download everything again.


lhoestq · 2024-11-18T11:40:23Z

Hi Elisa, please share the error traceback here, and if you manage to find the location in the datasets code where the error occurs, feel free to open a PR to add the necessary logging / improve the error message.

elisa-aleman · 2024-11-18T14:19:57Z

> please share the error traceback

I don't have access to it, but it should be raised around this exception, which happens while loading a dataset. If one of the downloaded files is corrupted, the for loop will not yield correctly, and the error will come from, in the case of tar files, this iterable, which has no explicit error handling that leaves clues as to which file failed.

I only know the case for tar files, but I suspect this issue could occur with other file types too.

lhoestq · 2024-11-18T15:07:32Z

I think having better error handling for this tar iterable would already be useful. Maybe a simple try/except in _iter_from_urlpath that checks for tarfile.ReadError and raises an error that mentions the urlpath?
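For illustration, the try/except idea above could look roughly like the sketch below. It is not the actual datasets code: iter_tar_members is a hypothetical stand-in for the tar iterable, and it assumes a local file path rather than a remote urlpath. The only point is that tarfile.ReadError gets re-raised with the path in the message.

```python
import tarfile
from typing import Iterator, Tuple


def iter_tar_members(urlpath: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, content) pairs from a local tar archive.

    Hypothetical helper, not part of the datasets library.
    """
    try:
        # "r:*" lets tarfile detect the compression transparently.
        with tarfile.open(urlpath, mode="r:*") as archive:
            for member in archive:
                if not member.isfile():
                    continue  # skip directories, links, etc.
                extracted = archive.extractfile(member)
                if extracted is None:
                    continue
                yield member.name, extracted.read()
    except tarfile.ReadError as err:
        # Re-raise the same exception type, but name the offending file.
        # "from err" keeps the original traceback attached as the cause.
        raise tarfile.ReadError(
            f"Failed to read tar archive {urlpath!r}: {err}"
        ) from err
```

Re-raising the same exception type keeps any existing `except tarfile.ReadError` handlers working, while the message now points at the file to re-download.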

elisa-aleman · 2024-11-19T05:03:01Z

I think the best place is not only higher-level calls like _iter_from_urlpath, but directly wherever a file is opened, since the traceback would then lead straight to it.
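One way to do this at every open site without repeating the try/except is a small context manager, sketched below. Everything here is an assumption for illustration: FileReadError and report_path_on_failure are hypothetical names, and the default set of caught exceptions is just a guess at what corrupted files tend to raise.

```python
import contextlib
import tarfile
from typing import Iterator, Tuple, Type


class FileReadError(Exception):
    """Hypothetical error type that carries the path of the failing file."""

    def __init__(self, path: str, original: BaseException):
        self.path = path
        super().__init__(
            f"Could not read {path!r}: {type(original).__name__}: {original}"
        )


@contextlib.contextmanager
def report_path_on_failure(
    path: str,
    errors: Tuple[Type[BaseException], ...] = (tarfile.ReadError, OSError, EOFError),
) -> Iterator[None]:
    """Wrap a block that opens or reads `path` so failures mention the path."""
    try:
        yield
    except errors as err:
        # Chain the original exception so its traceback is still shown.
        raise FileReadError(path, err) from err
```

Usage would be something like `with report_path_on_failure(path): archive = tarfile.open(path)`, placed right where the file is opened, so both the traceback and the message identify the file.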

lhoestq · 2024-11-20T13:23:10Z

So maybe there should be better error messages in each dataset builder definition? E.g. in https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/webdataset/webdataset.py for webdataset TAR archives.
