Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support int_range(s) in streaming mode #20015

Open
kalocide opened this issue Nov 26, 2024 · 1 comment
Open

Support int_range(s) in streaming mode #20015

kalocide opened this issue Nov 26, 2024 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@kalocide
Copy link

Description

Using int_range or int_ranges in streaming mode (LazyFrame.sink_*) causes a polars.exceptions.InvalidOperationError. That makes some operations (e.g. prefix explodes) impossible within streaming mode.

Example:

import polars as pl

# imagine a larger-than-ram dataset
lf = pl.scan_parquet("hf://datasets/allenai/tulu-3-sft-mixture/data/*.parquet").head(100)

# get indices
lf = lf.with_columns(
    #indices=pl.lit([0, 1, 2, 3]) # works
    indices=pl.int_ranges(0, pl.col("messages").list.len()) # doesn't work
).explode("indices")

# prefix explode
lf = lf.select(
    messages=pl.col("messages").list.slice(0, pl.col("indices"))
)

print(lf.explain(streaming=True))

# fails here
lf.sink_parquet("output.parquet")

# but this works
#lf.collect().write_parquet("output.parquet")
@kalocide kalocide added the enhancement New feature or an improvement of an existing feature label Nov 26, 2024
@kalocide
Copy link
Author

kalocide commented Nov 26, 2024

The behavior in the example can be approximated with:

...
# get indices
lf = lf.with_columns(
    indices=pl.lit(list(range(1, 32)))
).explode("indices").filter(
    pl.col("indices") <= pl.col("messages").list.len()
)

But only if you know the maximum length of that list. Otherwise, I don't think this is possible without UDFs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

1 participant