When downloading a dataset, we frequently hit the Permission Denied error below. It appears to happen (at least) across datasets loaded from HF, S3, and GCS.
It looks like the temp_file being passed here can sometimes be created with 000 permissions, which leads to the permission denied error (the user running the code is still the owner of the file). Deleting that particular file and re-running the code with no changes will usually succeed.
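The manual workaround described above can be scripted. The following is a minimal sketch, not part of datasets: the helper name `remove_unreadable_incomplete_files` is hypothetical, and it assumes the stale files sit directly in the downloads directory with an `.incomplete` suffix, as in the traceback below. It should only be run while no download is in progress.

```python
import os
import stat
import tempfile
from pathlib import Path


def remove_unreadable_incomplete_files(downloads_dir):
    """Delete *.incomplete files whose permission bits are 000; return their names."""
    removed = []
    for path in Path(downloads_dir).glob("*.incomplete"):
        # S_IMODE keeps only the permission bits of st_mode.
        if stat.S_IMODE(path.stat().st_mode) == 0:
            path.unlink()  # needs write access to the directory, not to the file
            removed.append(path.name)
    return sorted(removed)


# Usage on a throwaway directory standing in for the real downloads folder.
with tempfile.TemporaryDirectory() as tmp:
    broken = Path(tmp) / "aaa.incomplete"
    healthy = Path(tmp) / "bbb.incomplete"
    broken.touch()
    healthy.touch()
    os.chmod(broken, 0o000)   # simulate the file left behind with 000 permissions
    os.chmod(healthy, 0o600)
    removed_names = remove_unreadable_incomplete_files(tmp)
    remaining_names = sorted(p.name for p in Path(tmp).iterdir())
```

Only files with no permission bits at all are removed, so partially downloaded files with normal permissions are left alone.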
Is there some race condition happening with the umask, which is process global, and the file creation?
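To illustrate why a process-global umask could produce this, here is a small sketch for POSIX systems. Python has no call that only reads the umask, so code typically sets a temporary value and restores it right after. The sketch assumes the temporary value is 0o666, which is consistent with the 000 permissions seen here but is not confirmed by this report; the helper name is hypothetical. Instead of racing real threads, it creates the file inside the probe window, which is what an unlucky thread would do.

```python
import os
import stat
import tempfile


def mode_of_file_created_during_umask_probe(directory):
    """Create a file while the umask is temporarily 0o666; return its permission bits."""
    previous = os.umask(0o666)  # assumed probe value; os.umask returns the old mask
    try:
        # Stand-in for another thread creating its temp file at exactly this moment.
        path = os.path.join(directory, "victim.incomplete")
        with open(path, "wb"):
            pass  # open() requests 0o666, and 0o666 & ~0o666 == 0o000
    finally:
        os.umask(previous)  # restore the real umask
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    mode = mode_of_file_created_during_umask_probe(tmp)

print(oct(mode))  # 0o0 on POSIX systems
```

A file created this way is still owned by the user, but a later `open(path, "wb")` by a non-root user fails with Errno 13, matching the traceback below.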
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/datasets/load.py:2084: in load_dataset
builder_instance.download_and_prepare(
.venv/lib/python3.12/site-packages/datasets/builder.py:925: in download_and_prepare
self._download_and_prepare(
.venv/lib/python3.12/site-packages/datasets/builder.py:1649: in _download_and_prepare
super()._download_and_prepare(
.venv/lib/python3.12/site-packages/datasets/builder.py:979: in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
.venv/lib/python3.12/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py:120: in _split_generators
downloaded_files = dl_manager.download(files)
.venv/lib/python3.12/site-packages/datasets/download/download_manager.py:159: in download
downloaded_path_or_paths = map_nested(
.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py:514: in map_nested
_single_map_nested((function, obj, batched, batch_size, types, None, True, None))
.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py:382: in _single_map_nested
return [mapped_item for batch in iter_batched(data_struct, batch_size) for mapped_item in function(batch)]
.venv/lib/python3.12/site-packages/datasets/download/download_manager.py:206: in _download_batched
return thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:449: in result
return self.__get_result()
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
.venv/lib/python3.12/site-packages/datasets/download/download_manager.py:229: in _download_single
out = cached_path(url_or_filename, download_config=download_config)
.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py:206: in cached_path
output_path = get_from_cache(
.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py:412: in get_from_cache
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py:331: in fsspec_get
fs.get_file(path, temp_file.name, callback=callback)
.venv/lib/python3.12/site-packages/fsspec/asyn.py:118: in wrapper
return sync(self.loop, func, *args, **kwargs)
.venv/lib/python3.12/site-packages/fsspec/asyn.py:103: in sync
raise return_result
.venv/lib/python3.12/site-packages/fsspec/asyn.py:56: in _runner
result[0] = await coro
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <s3fs.core.S3FileSystem object at 0x7f27c18b2e70>
rpath = '<my-bucket>/<my-prefix>/img_1.jpg'
lpath = '/home/runner/_work/_temp/hf_cache/downloads/6c97983efa4e24e534557724655df8247a0bd04326cdfc4a95b638c11e78222d.incomplete'
callback = <datasets.utils.file_utils.TqdmCallback object at 0x7f27c00cdbe0>
version_id = None, kwargs = {}
_open_file = <function S3FileSystem._get_file.<locals>._open_file at 0x7f27628d1120>
body = <StreamingBody at 0x7f276344fa80 for ClientResponse at 0x7f27c015fce0>
content_length = 521923, failed_reads = 0, bytes_read = 0
    async def _get_file(
        self, rpath, lpath, callback=_DEFAULT_CALLBACK, version_id=None, **kwargs
    ):
        if os.path.isdir(lpath):
            return
        bucket, key, vers = self.split_path(rpath)
        async def _open_file(range: int):
            kw = self.req_kw.copy()
            if range:
                kw["Range"] = f"bytes={range}-"
            resp = await self._call_s3(
                "get_object",
                Bucket=bucket,
                Key=key,
                **version_id_kw(version_id or vers),
                **kw,
            )
            return resp["Body"], resp.get("ContentLength", None)
        body, content_length = await _open_file(range=0)
        callback.set_size(content_length)
        failed_reads = 0
        bytes_read = 0
        try:
>           with open(lpath, "wb") as f0:
E           PermissionError: [Errno 13] Permission denied: '/home/runner/_work/_temp/hf_cache/downloads/6c97983efa4e24e534557724655df8247a0bd04326cdfc4a95b638c11e78222d.incomplete'
.venv/lib/python3.12/site-packages/s3fs/core.py:1355: PermissionError
Steps to reproduce the bug
I believe this is a race condition and cannot reliably reproduce it, but it happens fairly frequently in our GitHub Actions tests and can also be reproduced (less frequently) on cloud VMs.
Expected behavior
The dataset loads properly with no permission denied error.
It must indeed be an issue with umask being used by multiple threads. Maybe we can try to make a thread-safe function to apply the umask (using filelock, for example).
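One way the suggestion above might look is sketched below. It swaps in a `threading.Lock` from the standard library instead of filelock, which is enough within a single process because the umask is per-process. All names (`_UMASK_LOCK`, `read_umask`, `create_empty_file`, `worker`) are hypothetical, and the probe value 0o666 is an assumption. The key point is that file creation has to take the same lock as the umask probe; code that creates files without the lock stays exposed.

```python
import os
import stat
import tempfile
import threading

_UMASK_LOCK = threading.Lock()  # hypothetical module-level lock


def read_umask():
    """Return the process umask; the set-and-restore probe runs under the lock."""
    with _UMASK_LOCK:
        current = os.umask(0o666)  # assumed probe value
        os.umask(current)
    return current


def create_empty_file(path):
    """Create a file under the same lock so it can never see the probe value."""
    with _UMASK_LOCK:
        with open(path, "wb"):
            pass


def worker(directory, index, rounds=50):
    """Alternate between probing the umask and creating files, like download threads."""
    for round_number in range(rounds):
        read_umask()
        create_empty_file(os.path.join(directory, f"{index}_{round_number}.incomplete"))


with tempfile.TemporaryDirectory() as tmp:
    threads = [threading.Thread(target=worker, args=(tmp, i)) for i in range(8)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Collect the permission bits of every file the threads created.
    modes = {
        stat.S_IMODE(os.stat(os.path.join(tmp, name)).st_mode)
        for name in os.listdir(tmp)
    }

expected_mode = 0o666 & ~read_umask()
print(modes == {expected_mode})
```

Because the temp file here is later opened by a third-party filesystem library that knows nothing about this lock, an alternative worth considering is to read the umask once before any worker threads start and reuse the cached value.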
Environment info
datasets version: 3.5.0
huggingface_hub version: 0.30.2
fsspec version: 2024.12.0