Describe the bug
This bug is only triggered when streaming=True. #5459 made datasets always call fsspec with client_kwargs={"requote_redirect_url": False}, which still seems to cause incompatibility issues even in the newest versions.
Analysis of what's happening:
datasets passes the client_kwargs through fsspec
fsspec passes the client_kwargs through s3fs
s3fs passes the client_kwargs to aiobotocore which uses aiohttp
The session tries to create an aiohttp session, but the client_kwargs are not kept together as a single dict; they are unpacked and passed in as individual keyword arguments (requote_redirect_url and trust_env), which the receiving function does not accept.
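To illustrate the failure mode described above, here is a minimal sketch in plain Python. The functions create_client, set_session and try_set_session are hypothetical stand-ins written for this example; they are not the real s3fs or aiobotocore code. They only show how unpacking a dict into a callee with a fixed signature raises the same kind of TypeError.

```python
# Hypothetical stand-ins: none of these are the real library functions.

def create_client(service_name, region_name=None, config=None):
    """Stand-in for a client factory that accepts a fixed set of keyword arguments."""
    return {"service": service_name, "region": region_name, "config": config}


def set_session(client_kwargs):
    """Stand-in for a filesystem layer that unpacks client_kwargs into the factory."""
    # The dict is unpacked here, so every key becomes its own keyword argument.
    return create_client("s3", **client_kwargs)


def try_set_session(client_kwargs):
    """Return the created client, or the TypeError message if a key is rejected."""
    try:
        return set_session(client_kwargs)
    except TypeError as exc:
        return str(exc)


# A key the factory knows about works as expected.
ok = try_set_session({"region_name": "us-east-1"})

# A key the factory does not know about fails as soon as it is unpacked.
failed = try_set_session({"requote_redirect_url": False})

print(ok)
print(failed)
```

The second call fails for the same reason as the traceback below: the key arrives as a separate keyword argument that the callee's signature does not include.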
Error:
Traceback (most recent call last):
File "/Users/cxrh/Documents/GitHub/nlp_foundation/nlp_train/test.py", line 14, in <module>
batch = next(iter(ds))
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1353, in __iter__
for key, example in ex_iterable:
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 255, in __iter__
for key, pa_table in self.generate_tables_fn(**self.kwargs):
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 78, in _generate_tables
for file_idx, file in enumerate(itertools.chain.from_iterable(files)):
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 840, in __iter__
yield from self.generator(*self.args, **self.kwargs)
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 921, in _iter_from_urlpaths
elif xisdir(urlpath, download_config=download_config):
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 305, in xisdir
return fs.isdir(inner_path)
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/spec.py", line 721, in isdir
return self.info(path)["type"] == "directory"
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/archive.py", line 38, in info
self._get_dirs()
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/filesystems/compression.py", line 64, in _get_dirs
f = {**self.file.fs.info(self.file.path), "name": self.uncompressed_name}
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
raise return_result
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/s3fs/core.py", line 1302, in _info
out = await self._call_s3(
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/s3fs/core.py", line 341, in _call_s3
await self.set_session()
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/s3fs/core.py", line 524, in set_session
s3creator = self.session.create_client(
File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/aiobotocore/session.py", line 114, in create_client
return ClientCreatorContext(self._create_client(*args, **kwargs))
TypeError: AioSession._create_client() got an unexpected keyword argument 'requote_redirect_url'
Steps to reproduce the bug
Install the necessary libraries; datasets must be at least version 2.19.0:
You get the unexpected keyword argument 'requote_redirect_url' error.
Expected behavior
datasets is able to load a batch from the dataset stored on S3 without triggering this requote_redirect_url error.
Fix: I could fix this by directly removing requote_redirect_url and trust_env from the client_kwargs; the dataset then loads properly.
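As a rough sketch of that workaround, the two keys could be filtered out of the client_kwargs before they reach the S3 filesystem. The helper name strip_unsupported_client_kwargs and the shape of the storage_options dict are assumptions made for illustration; where exactly this filtering would belong inside datasets is not something this issue establishes.

```python
# Keys reported in this issue as being rejected further down the S3 call chain.
UNSUPPORTED_S3_KEYS = {"requote_redirect_url", "trust_env"}


def strip_unsupported_client_kwargs(storage_options):
    """Return a copy of storage_options without the unsupported client_kwargs keys.

    Hypothetical helper: the input dict is left unmodified.
    """
    cleaned = dict(storage_options)
    client_kwargs = cleaned.get("client_kwargs")
    if isinstance(client_kwargs, dict):
        cleaned["client_kwargs"] = {
            key: value
            for key, value in client_kwargs.items()
            if key not in UNSUPPORTED_S3_KEYS
        }
    return cleaned


# Example input: the two problematic keys next to an unrelated option.
original = {
    "anon": False,
    "client_kwargs": {
        "requote_redirect_url": False,
        "trust_env": True,
        "region_name": "us-east-1",
    },
}
cleaned = strip_unsupported_client_kwargs(original)
print(cleaned)
```

Filtering on a copy keeps the caller's options intact, so the same dict can still be used for HTTP-based filesystems where these keys are valid.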
Environment info
datasets version: 3.1.0
Platform: macOS-15.1-arm64-arm-64bit
Python version: 3.10.15
huggingface_hub version: 0.26.2
PyArrow version: 18.0.0
Pandas version: 2.2.3
fsspec version: 2024.9.0
casper-hansen changed the title from "[BUG]: Load from S3 results in Unexpected keyword error" to "[BUG]: Streaming from S3 triggers unexpected keyword argument 'requote_redirect_url'" on Nov 19, 2024