
OSError: Not enough disk space. #2972

Closed
qqaatw opened this issue Sep 27, 2021 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@qqaatw
Contributor

qqaatw commented Sep 27, 2021

Describe the bug

I'm trying to download the natural_questions dataset, and I've specified a cache_dir that is located on a mounted disk with enough space. However, the disk-space check still runs against the root / disk and reports insufficient space.

The file system layout is as follows. The root / has 115G of disk space available, and sda1 is mounted at /mnt with 1.2T available:

/
/mnt/sda1/path/to/args.dataset_cache_dir 

Steps to reproduce the bug

import os

from datasets import DownloadConfig, load_dataset

download_config = DownloadConfig(
    cache_dir=os.path.abspath(args.dataset_cache_dir),
    resume_download=True,
)
dataset = load_dataset("natural_questions", download_config=download_config)

Expected results

Can download the dataset without an error.

Actual results

The following error is raised:

OSError: Not enough disk space. Needed: 134.92 GiB (download: 41.97 GiB, generated: 92.95 GiB, post-processed: Unknown size)

Environment info

  • datasets version: 1.9.0
  • Platform: Ubuntu 18.04
  • Python version: 3.8.10
  • PyArrow version:
@qqaatw added the bug Something isn't working label Sep 27, 2021
@qqaatw
Contributor Author

qqaatw commented Sep 27, 2021

Maybe we can change the disk-space calculation API from shutil.disk_usage to os.statvfs on UNIX-like systems, which provides correct results:

import os

statvfs = os.statvfs('path')
avail_space_bytes = statvfs.f_frsize * statvfs.f_bavail
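The suggestion above can be sketched as a small helper (a sketch only; the function name is mine, and os.statvfs is POSIX-only, so shutil.disk_usage would still be needed on Windows). Both calls report free space for whichever filesystem contains the path they are given, so the key is pointing them at the actual cache directory:

```python
import os
import shutil

def available_bytes(path="."):
    """Free space, in bytes, on the filesystem containing `path`."""
    st = os.statvfs(path)
    # f_frsize: fragment size; f_bavail: free blocks for unprivileged users
    return st.f_frsize * st.f_bavail

# For comparison, shutil.disk_usage returns a (total, used, free) named tuple
# for the same filesystem:
free_via_statvfs = available_bytes(".")
free_via_shutil = shutil.disk_usage(".").free
```

Whether the check passes therefore depends entirely on which path is handed to it, which is why a check run against the default cache location on / can fail even when the mounted target has room.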

@albertvillanova
Member

albertvillanova commented Sep 27, 2021

Hi @qqaatw, thanks for reporting.

Could you please try:

dataset = load_dataset("natural_questions", cache_dir=os.path.abspath(args.dataset_cache_dir))

@qqaatw
Contributor Author

qqaatw commented Sep 27, 2021

@albertvillanova it works! Thanks for your suggestion. Is that a bug in DownloadConfig?

@albertvillanova
Member

DownloadConfig only sets the location to download the files. On the other hand, cache_dir sets the location for both downloading and caching the data. You can find more information here: https://huggingface.co/docs/datasets/loading_datasets.html#cache-directory
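To make the distinction concrete, here is a minimal sketch (the wrapper function name is mine, not part of the datasets API): passing cache_dir directly to load_dataset routes both the downloaded files and the generated Arrow files to that directory, so the disk-space check runs against the intended disk rather than the default cache on /.

```python
import os

def load_nq_with_shared_cache(cache_dir: str):
    """Sketch: route downloads *and* generated data (and hence the
    disk-space check) to cache_dir, instead of only the downloads as
    DownloadConfig(cache_dir=...) would."""
    from datasets import load_dataset  # lazy import; requires `datasets`

    return load_dataset("natural_questions", cache_dir=os.path.abspath(cache_dir))
```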

@davidtan-tw

davidtan-tw commented Aug 29, 2022

I encountered the same error when running ds = load_dataset('food101') in a Docker container: OSError: Not enough disk space. Needed: 9.43 GiB (download: 4.65 GiB, generated: 4.77 GiB, post-processed: Unknown size)

In case anyone encountered the same issue, this was my fix:

# start the container, mounting the project directory onto /app so the
# code and data in the project directory are available in the container
docker run -it --rm -v $(pwd):/app my-demo:latest bash

# inside the container: set cache_dir to the absolute path of a directory
# (e.g. /app/data) that is mounted from the host (macOS in my case)
ds = load_dataset('food101', cache_dir="/app/data")

# this assumes a ./data directory exists in the project folder; if not,
# create it or point cache_dir at any other existing directory

Thanks @albertvillanova for posting the fix above :-)

@sankexin

sankexin commented Dec 4, 2024

import datasets

# Last-resort workaround: monkeypatch the disk-space check to always pass.
# Only safe when you are certain the target disk has enough room.
datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory='.': True
