Fix data file module inference #7132

HennerM · 2024-08-29T13:48:16Z

I saved a dataset with two splits to disk with DatasetDict.save_to_disk. The train is bigger and ended up in 10 shards, whereas the test split only resulted in 1 split.

Now when trying to load the dataset, an error is raised that not all splits have the same data format:

ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}

This is not expected because both splits are saved as arrow files.

I did some debugging and found that this is the case because the list of data_files includes a state.json file.

Now this means for train split I get 10 ".arrow" and 1 ".json" file. Since datasets picks based on the most common extension this is correctly inferred as "arrow". In the test split, there is 1 .arrow and 1 .json file. Given the function description:

It picks the module based on the most common file extension.
In case of a draw ".parquet" is the favorite, and then alphabetical order.

This is not quite true though, because in a tie the extensions are actually based on reverse-alphabetical order:

for (ext, _), _ in sorted(extensions_counter.items(), key=sort_key, *reverse=True*):

Which thus leads to the module wrongly inferred as "json", whereas it should be "arrow", matching the train split.

I first thought about adding "state.json" in the list of excluded files for the inference: https://github.com/huggingface/datasets/blob/main/src/datasets/load.py#L513. However, I think from digging into the code it looks like the right thing to do is to exclude it in the list of data_files to start with, because it is more of a metadata than a data file.

state.json is not a data_file. Not excluding it can lead to an error when trying to load a dataset with a split that has only one shard. Excluding state.json should be fine because it is excluded later on anyway when the format is inferred as e.g. "arrow"

lhoestq · 2024-09-02T14:07:05Z

Hi ! datasets saved using save_to_disk should be loaded with load_from_disk ;)

HennerM · 2024-09-02T16:54:25Z

It is convienient to just pass in a path to a local dataset or one from the hub and use the same function to load it. Is it not possible to get this fix merged in to allow this?

lhoestq · 2024-09-02T19:52:12Z

We can modify save_to_disk to write the dataset in a structure supported by the Hub in this case, it's kind of a legacy function anyway

Fix data file module inference

7ca2f0c

state.json is not a data_file. Not excluding it can lead to an error when trying to load a dataset with a split that has only one shard. Excluding state.json should be fine because it is excluded later on anyway when the format is inferred as e.g. "arrow"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix data file module inference #7132

Fix data file module inference #7132

HennerM commented Aug 29, 2024

lhoestq commented Sep 2, 2024

HennerM commented Sep 2, 2024

lhoestq commented Sep 2, 2024

Fix data file module inference #7132

Are you sure you want to change the base?

Fix data file module inference #7132

Conversation

HennerM commented Aug 29, 2024

lhoestq commented Sep 2, 2024

HennerM commented Sep 2, 2024

lhoestq commented Sep 2, 2024