ibis.FileDataset read files from web #918

Open
mark-druffel opened this issue Oct 29, 2024 · 1 comment
Labels
bug (Something isn't working), datasets

Comments

@mark-druffel
Contributor

Description

ibis.FileDataset fails when trying to read files from certain domains (or possibly all domains) on the web. With ibis on its own, I can read data from Hugging Face:

import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")

Based on a conversation with @deepyaman, the above code might only work because DuckDB has a Hugging Face extension. That said, for what it's worth, I found a GitHub repo with the same file, tried to read it with ibis, and it worked:

import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv")

However, when I add either file (hf:/ or raw.githubusercontent) to the catalog as an ibis.FileDataset, my pipeline fails on load:

tracks:
  type: ibis.FileDataset
  filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
  file_format: csv
  connection: ${connection:spotify}
  load_args:
    sep: ","
  save_args:
    materialized: view
    overwrite: True

kedro.io.core.DatasetError: Failed while loading data from dataset FileDataset(backend=duckdb, file_format=csv, filepath=hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv, load_args={'sep': ,}, save_args={'materialized': view, 'overwrite': True}). IO Error: No files found that match the pattern "/spotify/hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv":

The issue appears to be with _get_load_path().
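One plausible reading of the error message, sketched below with only the standard library: if the filepath is treated as a relative POSIX path and joined onto a base directory, the double slash after the scheme collapses and the URL ends up nested under that directory. This is an illustration of the symptom, not a trace of what `_get_load_path()` actually does; the `/spotify` base directory is an assumption taken from the prefix shown in the error.

```python
from pathlib import PurePosixPath

# Assumed base directory, inferred from the "/spotify/" prefix in the error message.
project_dir = PurePosixPath("/spotify")
filepath = "hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv"

# Parsing the URL as a POSIX path collapses "hf://" to "hf:/".
as_path = PurePosixPath(filepath)
print(as_path)
# hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv

# The result is not absolute, so joining it to a directory nests it underneath.
resolved = project_dir / as_path
print(resolved)
# /spotify/hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
```

The second printed value matches the pattern DuckDB reports as not found, which is consistent with the URL being handled as a local relative path somewhere before it reaches the backend.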

Context

I don't fully understand the extent of the issue, so it's hard to advocate for fixing it. Obviously, being able to read files from data lakes (S3, Azure, etc.) is essential, but I think those already work with _get_load_path(). Being able to plug in a URL pointing to a file anywhere on the web could be really convenient, but I don't have a solid use case for it at the moment.

Possible Implementation

I'm not entirely sure this is valid, but since ibis seems to work on its own in these examples, perhaps ibis.FileDataset could try passing the raw path to ibis before failing...
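A minimal sketch of that idea, assuming the dataset could decide up front whether a filepath is a URL and, if so, hand it to the backend untouched. The names `is_remote_url` and `resolve_load_path` are hypothetical and are not part of kedro-datasets or ibis; the check uses only `urllib.parse.urlsplit` from the standard library.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def is_remote_url(filepath: str) -> bool:
    """Return True for strings such as "hf://..." or "https://...".

    Requiring both a scheme and a network location keeps Windows paths like
    "C:/data/x.csv" (scheme "c", no netloc) from being mistaken for URLs.
    """
    parts = urlsplit(filepath)
    return bool(parts.scheme and parts.netloc)


def resolve_load_path(filepath: str, base_dir: str) -> str:
    """Hypothetical helper: pass URLs through raw, resolve local paths as before."""
    if is_remote_url(filepath):
        return filepath  # leave "hf://" and "https://" untouched for the backend
    return str(PurePosixPath(base_dir) / filepath)


print(resolve_load_path("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv", "/spotify"))
print(resolve_load_path("data/01_raw/tracks.csv", "/spotify"))
```

Whether the backend can actually open the URL is a separate question; in the examples above that part is handled by DuckDB, so other ibis backends may behave differently.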

Possible Alternatives

The easiest workaround is to just download the files.
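For reference, the download workaround could look roughly like the sketch below. The helper name `download_to` and the destination path are placeholders, not anything from the project; it uses only the standard library and would be run once before the pipeline, with the catalog entry's filepath pointing at the local copy.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, destination: str) -> Path:
    """Fetch `url` and write it to `destination`, creating parent folders as needed."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of reading it all into memory.
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example usage (performs a network request, so it is left commented out):
# download_to(
#     "https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv",
#     "data/01_raw/spotify.csv",  # placeholder local path
# )
```

This only covers plain http(s) URLs; the hf:// protocol is specific to DuckDB's extension and would need the corresponding https URL instead.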

@lrcouto added the datasets and bug labels on Oct 29, 2024
@deepyaman
Member

Going to copy in my response from our Slack conversation:

I think:

  1. you're "getting lucky" that Ibis can read it. Ibis actually doesn't do anything smart about cloud protocols like hf:/, but DuckDB is handling it for you (it loads this extension: https://duckdb.org/docs/extensions/httpfs/hugging_face.html)
  2. there is a known gap in functionality; for the ibis.FileDataset to be on par with the rest, it needs to implement something like PyArrowFileSystem or FsspecFileSystem support
  3. Kedro probably doesn't generically handle the hf:/ interface, but it could

This is related to the action item to support writing files to PyArrow/fsspec-compatible filesystems from kedro-org/kedro#4190 (probably should create an issue).
