
Not enough disk space (Needed: Unknown size) when caching on a cluster #1785

Closed
olinguyen opened this issue Jan 27, 2021 · 9 comments

@olinguyen
Contributor

olinguyen commented Jan 27, 2021

I'm running some experiments where I'm caching datasets on a cluster and accessing them from multiple compute nodes. However, I get an error when loading a cached dataset from the shared disk.

The exact error thrown:

>>> load_dataset(dataset, cache_dir="/path/to/cluster/shared/path")
OSError: Not enough disk space. Needed: Unknown size (download: Unknown size, generated: Unknown size, post-processed: Unknown size)

utils.has_sufficient_disk_space fails on each job because of how the cluster system is designed: disk_usage(".").free can't report the free space on the cluster's shared disk (see the sketch after the snippet below).

This is exactly where the error gets thrown:
https://github.com/huggingface/datasets/blob/master/src/datasets/builder.py#L502

if not utils.has_sufficient_disk_space(self.info.size_in_bytes or 0, directory=self._cache_dir_root):
    raise IOError(
        "Not enough disk space. Needed: {} (download: {}, generated: {}, post-processed: {})".format(
            utils.size_str(self.info.size_in_bytes or 0),
            utils.size_str(self.info.download_size or 0),
            utils.size_str(self.info.dataset_size or 0),
            utils.size_str(self.info.post_processing_size or 0),
        )
    )
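
For reference, utils.has_sufficient_disk_space is roughly the following (a sketch of what the helper does, not the exact library source): it compares shutil.disk_usage(directory).free against the estimated dataset size, so a filesystem that reports free=0 always fails the check, even when space is actually available.

import os
from shutil import disk_usage

# Rough sketch of the datasets disk-space helper (not the exact source):
# returns True only if the reported free space exceeds the estimated need.
def has_sufficient_disk_space(needed_bytes, directory="."):
    try:
        free_bytes = disk_usage(os.path.abspath(directory)).free
    except OSError:
        # If free space can't be determined at all, skip the check
        return True
    return needed_bytes < free_bytes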

What would be a good way to work around this? My current fix is to manually comment out that check, but that is not ideal.
Would it be possible to pass a flag to skip the disk space check?

@olinguyen olinguyen changed the title from "Not enough disk space (Needed: Unknown size) on cluster" to "Not enough disk space (Needed: Unknown size) when caching on a cluster" on Jan 28, 2021
@lhoestq
Member

lhoestq commented Jan 28, 2021

Hi!

What do you mean exactly by "disk_usage(".").free can't compute on the cluster's shared disk"?
Does it return 0?

@olinguyen
Contributor Author

Yes, that's right. It reports 0 free space even though there is space available. I suspect it might have to do with permissions on the shared disk.

>>> disk_usage(".")
usage(total=999999, used=999999, free=0)
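
For what it's worth, shutil.disk_usage is backed by os.statvfs on POSIX systems, so you can check what the mount itself reports directly (the path below is a placeholder):

import os

# disk_usage's "free" field is f_bavail * f_frsize on POSIX; some network or
# containerized mounts report f_bavail as 0 to unprivileged users.
st = os.statvfs("/path/to/cluster/shared/path")
print("free bytes reported by the mount:", st.f_bavail * st.f_frsize)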

@lhoestq
Member

lhoestq commented Jan 28, 2021

That's an interesting behavior...
Do you know any other way to get the free space that works in your case?
Also, if it's a permission issue, could you try fixing the permissions and let us know if that helps?

@olinguyen
Contributor Author

I think it's an issue on the cluster's end (unclear exactly why, maybe something with Docker containers?), so I'll close the issue.

@philippnoah

Were you able to figure it out?

@olinguyen
Contributor Author

@philippnoah I fixed it with a small hack: I patched has_sufficient_disk_space to always return True. You can do that with a monkeypatch after importing datasets, without having to modify the installed datasets package.

@sahutkarsh

@olinguyen Thanks for the suggestion. It works, but I had to edit builder.py in the installed package. Can you please explain how you were able to do this with an import?

@nitsanluke

I was able to patch the builder code in my notebook before the load_dataset call, and it works:

import datasets

# Replace the helper the builder calls so the disk-space check always passes
datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory='.': True
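
For completeness, a self-contained sketch of the workaround applied before loading; the dataset name is a placeholder, and the cache path is the shared-disk path from this issue:

import datasets

# Apply the patch, then load as usual; the builder no longer aborts on free=0
datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory=".": True
ds = datasets.load_dataset("imdb", cache_dir="/path/to/cluster/shared/path")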

@sankexin

sankexin commented Dec 4, 2024

import datasets
datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory='.': True
