ibis.FileDataset fails when trying to read files from certain domains (or possibly all domains) on the web. With Ibis on its own, I can read data from Hugging Face:
import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")
Based on a conversation with @deepyaman, the above code might only work because DuckDB has a Hugging Face extension. That said, for what it's worth, I found a GitHub repo with the same file, tried to read it with Ibis, and it worked:
import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv")
However, when adding either file (hf:/ or raw.githubusercontent) as a FileDataset, my pipeline fails:
kedro.io.core.DatasetError: Failed while loading data from dataset FileDataset(backend=duckdb, file_format=csv, filepath=hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv, load_args={'sep': ,}, save_args={'materialized': view, 'overwrite': True}). IO Error: No files found that match the pattern "/spotify/hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv":
The issue appears to be with _get_load_path().
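To illustrate how a path like the one in the error could arise, here is a minimal sketch. It assumes (not verified against the kedro-datasets source) that the filepath is wrapped in a `PurePosixPath` and then resolved against a project directory; `naive_load_path` and the `/spotify` root are hypothetical names taken only from the error message above.

```python
from pathlib import PurePosixPath


def naive_load_path(filepath: str, project_root: str) -> str:
    """Hypothetical stand-in for a load-path helper that treats every
    filepath as a local POSIX path (an assumption, not Kedro's actual code)."""
    path = PurePosixPath(filepath)
    # A URL-like string has no leading "/", so it looks like a relative path
    # and gets joined onto the project root.
    if not path.is_absolute():
        path = PurePosixPath(project_root) / path
    return str(path)


# The protocol prefix ends up embedded in a local path, as in the error message.
print(naive_load_path(
    "hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv", "/spotify"
))
# /spotify/hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
```

Note that `PurePosixPath` also collapses a double slash after the scheme (`https://` becomes `https:/`), so the original URL cannot be recovered from the path object afterwards.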
Context
I don't fully understand the extent of the issue, so it's hard to advocate for fixing it. Obviously, being able to read files from data lakes (S3, Azure, etc.) is essential, but I think those already work with _get_load_path(). Being able to plug in a URL pointing to a file anywhere on the web could be really convenient, but I don't have a solid use case for it at the moment.
Possible Implementation
I'm not entirely sure this is valid, but since Ibis seems to work on its own in these examples, perhaps ibis.FileDataset could pass the raw path to Ibis before failing...
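A rough sketch of that idea, under stated assumptions: `resolve_load_path` is a hypothetical helper (not part of Kedro or kedro-datasets) that leaves any filepath with a URL scheme untouched, so the backend receives it as written, and only resolves scheme-less paths against the project directory. Windows drive letters such as `C:/...` would also parse as a scheme and would need extra handling.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def resolve_load_path(filepath: str, project_root: str) -> str:
    """Hypothetical helper: pass URL-like filepaths through unchanged,
    resolve everything else as a local path under project_root."""
    if urlsplit(filepath).scheme:
        # e.g. "hf:/..." or "https://...": hand the raw string to the backend.
        return filepath
    path = PurePosixPath(filepath)
    if not path.is_absolute():
        path = PurePosixPath(project_root) / path
    return str(path)


print(resolve_load_path("hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv", "/spotify"))
print(resolve_load_path("data/01_raw/tracks.csv", "/spotify"))
```

Whether the backend can actually read the resulting URL still depends on the backend itself (DuckDB in the examples above).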
Possible Alternatives
The easiest workaround is to just download the files.
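For completeness, a minimal sketch of the download workaround using only the standard library. `download_to_local` and the destination path are hypothetical; this handles schemes that `urllib` understands (such as `https://`), not `hf:/`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_to_local(url: str, dest: str) -> Path:
    """Fetch url into dest (creating parent folders) and return the local path."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, target)
    return target


# Example usage (requires network access, so it is left commented out):
# download_to_local(
#     "https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv",
#     "data/01_raw/spotify.csv",
# )
```

The FileDataset filepath can then point at the local copy instead of the URL.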
Going to copy in my response from our Slack conversation:
I think:
- You're "getting lucky" that Ibis can read it. Ibis actually doesn't do anything smart about cloud protocols like hf:/, but DuckDB is handling it for you (it loads this extension: https://duckdb.org/docs/extensions/httpfs/hugging_face.html).
- There is a known gap in functionality; for ibis.FileDataset to be on par with the rest, it needs to implement something like PyArrowFileSystem or FsspecFileSystem support.
- Kedro probably doesn't generically handle the hf:/ interface, but it could.
This is related to the action item to support writing files to PyArrow/fsspec-compatible filesystems from kedro-org/kedro#4190 (probably should create an issue).