
[BUG]: Streaming from S3 triggers unexpected keyword argument 'requote_redirect_url' #7295

Open
casper-hansen opened this issue Nov 19, 2024 · 0 comments


casper-hansen commented Nov 19, 2024

Describe the bug

Note that this bug is only triggered when streaming=True. #5459 introduced always calling fsspec with client_kwargs={"requote_redirect_url": False}, which is incompatible with s3fs/aiobotocore even on the latest versions.

Analysis of what's happening:

  1. datasets passes the client_kwargs through fsspec.
  2. fsspec passes the client_kwargs through s3fs.
  3. s3fs passes the client_kwargs to aiobotocore (which uses aiohttp):
# s3fs/core.py, set_session()
s3creator = self.session.create_client(
    "s3", config=conf, **init_kwargs, **client_kwargs
)
  4. aiobotocore's AioSession._create_client() does not keep the **client_kwargs folded up; it receives requote_redirect_url and trust_env as individual keyword arguments it does not accept, so client creation fails (see the standalone sketch after this list).
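
The same failure can be reproduced without datasets by handing the offending client_kwargs straight to s3fs. This is a minimal sketch, assuming a placeholder bucket path and that s3fs forwards client_kwargs unchanged to aiobotocore's create_client(), exactly as the traceback below shows:

import s3fs

# datasets injects this key (and trust_env) into client_kwargs; s3fs forwards
# them verbatim to aiobotocore's create_client(), which does not accept them.
fs = s3fs.S3FileSystem(client_kwargs={"requote_redirect_url": False})

# Any call that needs an S3 client triggers session creation and fails:
fs.info("your_bucket/your_file.jsonl.gz")
# TypeError: AioSession._create_client() got an unexpected keyword argument 'requote_redirect_url'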

Error:

Traceback (most recent call last):
  File "/Users/cxrh/Documents/GitHub/nlp_foundation/nlp_train/test.py", line 14, in <module>
    batch = next(iter(ds))
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1353, in __iter__
    for key, example in ex_iterable:
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 255, in __iter__
    for key, pa_table in self.generate_tables_fn(**self.kwargs):
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 78, in _generate_tables
    for file_idx, file in enumerate(itertools.chain.from_iterable(files)):
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 840, in __iter__
    yield from self.generator(*self.args, **self.kwargs)
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 921, in _iter_from_urlpaths
    elif xisdir(urlpath, download_config=download_config):
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 305, in xisdir
    return fs.isdir(inner_path)
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/spec.py", line 721, in isdir
    return self.info(path)["type"] == "directory"
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/archive.py", line 38, in info
    self._get_dirs()
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/datasets/filesystems/compression.py", line 64, in _get_dirs
    f = {**self.file.fs.info(self.file.path), "name": self.uncompressed_name}
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/s3fs/core.py", line 1302, in _info
    out = await self._call_s3(
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/s3fs/core.py", line 341, in _call_s3
    await self.set_session()
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/s3fs/core.py", line 524, in set_session
    s3creator = self.session.create_client(
  File "/Users/cxrh/miniconda3/envs/s3_data_loader/lib/python3.10/site-packages/aiobotocore/session.py", line 114, in create_client
    return ClientCreatorContext(self._create_client(*args, **kwargs))
TypeError: AioSession._create_client() got an unexpected keyword argument 'requote_redirect_url'

Steps to reproduce the bug

  1. Install the necessary libraries (datasets must be at least 2.19.0):
pip install s3fs fsspec aiohttp aiobotocore botocore 'datasets>=2.19.0'
  2. Run this code:
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="s3://your_path/*.jsonl.gz",
    streaming=True,
    split="train",
)

batch = next(iter(ds))

print(batch)
  3. You get the unexpected keyword argument 'requote_redirect_url' error.

Expected behavior

datasets is able to load a batch from the dataset stored on S3 without triggering the requote_redirect_url error.

Fix: I could work around this by directly removing requote_redirect_url and trust_env from the client_kwargs before they reach s3fs; the dataset then loads properly (see the sketch below the screenshot).

(screenshot of the local change removing requote_redirect_url and trust_env)
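
For reference, a sketch of that filtering logic, not the upstream fix: the client_kwargs dict below stands in for whatever datasets builds internally, and where you intercept it depends on your datasets version, so treat it as illustrative only.

# Illustrative workaround sketch: drop the aiohttp-only keys so they never
# reach aiobotocore's create_client().
storage_options = {"client_kwargs": {"requote_redirect_url": False, "trust_env": True}}

client_kwargs = dict(storage_options.get("client_kwargs", {}))
for key in ("requote_redirect_url", "trust_env"):
    client_kwargs.pop(key, None)  # remove keys aiobotocore does not accept
storage_options["client_kwargs"] = client_kwargs

# storage_options can now be passed to s3fs / fsspec without the TypeError.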

Environment info

  • datasets version: 3.1.0
  • Platform: macOS-15.1-arm64-arm-64bit
  • Python version: 3.10.15
  • huggingface_hub version: 0.26.2
  • PyArrow version: 18.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0
casper-hansen changed the title from "[BUG]: Load from S3 results in Unexpected keyword error" to "[BUG]: Streaming from S3 triggers unexpected keyword argument 'requote_redirect_url'" on Nov 19, 2024