`load_dataset` fails to load dataset saved by `save_to_disk` #7018

sliedes · 2024-07-01T12:19:19Z

Describe the bug

This code fails to load the dataset it just saved:

from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

dataset = load_dataset("yelp_review_full")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.save_to_disk("dataset")

tokenized_datasets = load_dataset("dataset/")  # raises ValueError

It raises ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}.

I believe this bug is caused by the logic that infers the dataset format: for each split, it picks the most common file extension. However, a small split can fit in a single .arrow file and still have two JSON metadata files next to it, so its format is inferred as JSON:

$ ls -l dataset/test
-rw-r--r-- 1 sliedes sliedes 191498784 Jul  1 13:55 data-00000-of-00001.arrow
-rw-r--r-- 1 sliedes sliedes      1730 Jul  1 13:55 dataset_info.json
-rw-r--r-- 1 sliedes sliedes       249 Jul  1 13:55 state.json
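
To illustrate the suspected failure mode, here is a minimal sketch of majority-vote format inference. It is not the actual `datasets` implementation: `EXTENSION_TO_FORMAT` and `infer_format_by_majority` are hypothetical names, and the `train` file list is made up to represent a split large enough to be sharded. Only the `test` listing comes from the directory shown above.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-format table, for illustration only.
EXTENSION_TO_FORMAT = {".arrow": "arrow", ".json": "json"}


def infer_format_by_majority(filenames):
    """Return the format whose extension occurs most often, or None."""
    counts = Counter(
        EXTENSION_TO_FORMAT[PurePosixPath(name).suffix]
        for name in filenames
        if PurePosixPath(name).suffix in EXTENSION_TO_FORMAT
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# File names from the `ls -l dataset/test` listing above.
test_split = ["data-00000-of-00001.arrow", "dataset_info.json", "state.json"]

# Assumed layout of a larger, sharded split (not taken from the report).
train_split = [
    "data-00000-of-00003.arrow",
    "data-00001-of-00003.arrow",
    "data-00002-of-00003.arrow",
    "dataset_info.json",
    "state.json",
]

print(infer_format_by_majority(test_split))   # json
print(infer_format_by_majority(train_split))  # arrow
```

Under this assumption, the two metadata files outvote the single data shard in `test`, while the sharded `train` split is still seen as Arrow, which would match the mismatch reported in the error message.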

Steps to reproduce the bug

Execute the code above.

Expected behavior

The dataset is loaded successfully.

Environment info

datasets version: 2.20.0
Platform: Linux-6.9.3-arch1-1-x86_64-with-glibc2.39
Python version: 3.12.4
huggingface_hub version: 0.23.4
PyArrow version: 16.1.0
Pandas version: 2.2.2
fsspec version: 2024.5.0

happyTonakai · 2024-07-23T09:09:16Z

In my case the error was:

ValueError: You are trying to load a dataset that was saved using `save_to_disk`. Please use `load_from_disk` instead.

Did you try load_from_disk?

ManuelFay · 2024-08-05T09:21:54Z

More generally, is there any reason there is no API consistency between save_to_disk and push_to_hub?

It would be nice to be able to save_to_disk, upload the result manually to the Hub, and then load_dataset it (which works in some situations but not all).

kfarivar · 2024-12-03T11:17:44Z

I have the exact same problem!

kfarivar · 2024-12-03T11:26:16Z

load_from_disk managed to load the dataset, but the bug with load_dataset needs to be fixed.
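
As a possible workaround until `load_dataset` handles this case, the sketch below picks the loader based on the metadata files found in the directory. The helper names `looks_like_save_to_disk` and `load_saved_or_regular` are hypothetical. The check relies on the `state.json` and `dataset_info.json` files seen in the listing above, plus the assumption that a saved `DatasetDict` has a top-level `dataset_dict.json`. It is a heuristic, not how the library itself decides.

```python
import os


def looks_like_save_to_disk(path):
    """Heuristic: True if `path` contains the metadata files of a save_to_disk directory."""
    # Assumption: a saved DatasetDict keeps a dataset_dict.json at its top level.
    if os.path.isfile(os.path.join(path, "dataset_dict.json")):
        return True
    # A single saved Dataset (or one split directory) has both of these files.
    return all(
        os.path.isfile(os.path.join(path, name))
        for name in ("state.json", "dataset_info.json")
    )


def load_saved_or_regular(path):
    """Use load_from_disk for save_to_disk directories, load_dataset otherwise."""
    # Imported lazily so the check above can be used without `datasets` installed.
    from datasets import load_dataset, load_from_disk

    if looks_like_save_to_disk(path):
        return load_from_disk(path)
    return load_dataset(path)
```

With the reproduction above, `load_saved_or_regular("dataset")` would be expected to take the `load_from_disk` branch.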
