Faulty datasets.exceptions.ExpectedMoreSplitsError #7282

Open
meg-huggingface opened this issue Nov 7, 2024 · 0 comments

Describe the bug

I am trying to download only the 'validation' split of my dataset, but the call fails with datasets.exceptions.ExpectedMoreSplitsError.
This appears to be the same undesired behavior as reported in #6939, but triggered with data_files rather than data_dir.

Here is the Traceback:

Traceback (most recent call last):
  File "/home/user/app/app.py", line 12, in <module>
    ds = load_dataset('datacomp/imagenet-1k-random0.0', token=GATED_IMAGENET, data_files={'validation': 'data/val*'}, split='validation', trust_remote_code=True)
  File "/usr/local/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/site-packages/datasets/builder.py", line 1018, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 68, in verify_splits
    raise ExpectedMoreSplitsError(str(set(expected_splits) - set(recorded_splits)))
datasets.exceptions.ExpectedMoreSplitsError: {'train', 'test'}
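For context, the last frame of the traceback raises on the set difference between the expected and the recorded splits. The following is a minimal standalone sketch of that check, not the library's actual code: the names verify_splits_sketch and the stand-in exception class are hypothetical, and the expected split names are inferred from the error message above.

```python
class ExpectedMoreSplitsError(Exception):
    """Stand-in for datasets.exceptions.ExpectedMoreSplitsError (hypothetical)."""


def verify_splits_sketch(expected_splits, recorded_splits):
    """Raise if any expected split was not actually prepared.

    Mirrors the set difference visible in the traceback; the real
    verify_splits may perform additional checks not shown here.
    """
    missing = set(expected_splits) - set(recorded_splits)
    if missing:
        raise ExpectedMoreSplitsError(str(missing))


# Assumed: the dataset's stored metadata lists all three splits,
# while data_files={'validation': ...} only produces one of them.
expected = ["train", "validation", "test"]
recorded = ["validation"]

try:
    verify_splits_sketch(expected, recorded)
except ExpectedMoreSplitsError as err:
    print("missing splits:", err)
```

Under this reading, the error is raised because the splits are compared against the full set recorded in the dataset's metadata, even though only 'validation' was requested.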

Note: I am using the data_files argument only to specify that I want just the 'validation' split. Without data_files, the whole dataset is downloaded even when split='validation' is specified, as described here: https://discuss.huggingface.co/t/how-can-i-download-a-specific-split-of-a-dataset/79027
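A possible workaround, not verified against this dataset: load_dataset accepts a verification_mode argument, and passing "no_checks" is intended to skip the split verification step shown in the traceback. The sketch below keeps the arguments from this report and adds only that one; the helper names validation_only_kwargs and load_validation_only are hypothetical.

```python
def validation_only_kwargs(token):
    """Keyword arguments for load_dataset, as used in this report.

    Only verification_mode is new; whether it avoids the error for
    this dataset is an assumption, not something confirmed here.
    """
    return {
        "path": "datacomp/imagenet-1k-random0.0",
        "data_files": {"validation": "data/val*"},
        "split": "validation",
        "token": token,
        "trust_remote_code": True,
        "verification_mode": "no_checks",  # assumed to skip verify_splits
    }


def load_validation_only(token):
    # Imported lazily so the helper above can be inspected without datasets installed.
    from datasets import load_dataset

    return load_dataset(**validation_only_kwargs(token))
```

Usage would be ds = load_validation_only(GATED_IMAGENET). Note that this only silences the check; it does not address why the check fails when data_files restricts the splits.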

Steps to reproduce the bug

  1. Create a Space with the default blank 'gradio' SDK https://huggingface.co/new-space
  2. Create a file 'app.py' that loads a dataset to only extract a 'validation' split:

from datasets import load_dataset
ds = load_dataset('datacomp/imagenet-1k-random0.0', token=GATED_IMAGENET, data_files={'validation': 'data/val*'}, split='validation', trust_remote_code=True)

Expected behavior

Only the 'validation' split is downloaded and loaded, without raising an error.

Environment info

Default environment for creating a new Space. Relevant to this bug, that is:

FROM docker.io/library/python:3.10@sha256:fd0fa50d997eb56ce560c6e5ca6a1f5cf8fdff87572a16ac07fb1f5ca01eb608

--> RUN pip install --no-cache-dir pip==22.3.1 && pip install --no-cache-dir datasets "huggingface-hub>=0.19" "hf-transfer>=0.1.4" "protobuf<4" "click<8.1"