Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add strict parameter to pl.concat(how='horizontal') #20019

Draft
wants to merge 22 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
6038130
added 'strict' as a keyword argument (default: False) to pl.concat an…
nimit Nov 27, 2024
d14049c
bug fix with keyword argument change in pl.concat
nimit Nov 27, 2024
ae2a335
Merge branch 'pola-rs:main' into strict-concat-19133
nimit Dec 1, 2024
6a0df66
rust changes
nimit Dec 1, 2024
ea847bf
fixed python errors with concat_df_horizontal, concat_lf_horizontal
nimit Dec 1, 2024
8f2883d
fixed exception type in python unit test for strict concatenation of …
nimit Dec 1, 2024
21d110a
build: Bump `chrono-tz` to `0.10` (#20094)
stinodego Dec 1, 2024
d85fc9c
chore(rust): Update AWS doc dependencies (#20095)
stinodego Dec 2, 2024
f00e6fd
docs(rust): Fix inconsistency between code and comment (#20070)
YichiZhang0613 Dec 2, 2024
a413c4a
fix: Only slice after sort when slice is smaller than frame length (#…
mcrumiller Dec 2, 2024
5a8cd16
fix: Return null instead of 0. for rolling_std when window contains a…
MarcoGorelli Dec 2, 2024
755ee47
build: Bump `atoi_simd` to version `0.16` (#20098)
stinodego Dec 2, 2024
af4d5a5
build: Bump `thiserror` to version `2` (#20097)
stinodego Dec 2, 2024
06b2ab6
build(rust): Fix path to `polars-dylib` crate in workspace (#20103)
stinodego Dec 2, 2024
511c219
build: Bump `fs4` to version `0.12` (#20101)
stinodego Dec 2, 2024
b276485
build: Bump `object_store` to version `0.11` (#20102)
stinodego Dec 2, 2024
bc3d320
build: Bump `memmap2` to version `0.9` (#20105)
stinodego Dec 2, 2024
8f72e20
refactor(rust): Replace custom `PushNode` trait with `Extend` (#20107)
nameexhaustion Dec 2, 2024
527af9c
build: Upgrade `sqlparser-rs` from version 0.49 to 0.52 (#20110)
alexander-beedie Dec 2, 2024
0d4029a
fixed LazyFrame test, Python & Rust documentation changes regarding a…
nimit Dec 2, 2024
90e65e8
Merge branch 'pola-rs:main' into strict-concat-19133
nimit Dec 2, 2024
6078964
doc fix, removed unnecessary generic
nimit Dec 2, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 17 additions & 0 deletions py-polars/polars/functions/eager.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ def concat(
how: ConcatMethod = "vertical",
rechunk: bool = False,
parallel: bool = True,
strict: bool = False,
) -> PolarsType:
"""
Combine multiple DataFrames, LazyFrames, or Series into a single object.
Expand Down Expand Up @@ -58,6 +59,8 @@ def concat(
parallel
Only relevant for LazyFrames. This determines if the concatenated
lazy computations may be executed in parallel.
strict
If True, reject concatenating DataFrames that are not the same height
Examples
--------
Expand Down Expand Up @@ -205,6 +208,12 @@ def concat(
)
).collect(no_optimization=True)
elif how == "horizontal":
if strict:
for e in elems[1:]:
if first.shape[0] != e.shape[0]:
msg = f"Number of rows need to be equal ({first.shape[0]} != {e.shape[0]}) when 'strict' is True"
raise ValueError(msg)

out = wrap_df(plr.concat_df_horizontal(elems))
else:
allowed = ", ".join(repr(m) for m in get_args(ConcatMethod))
Expand All @@ -231,6 +240,14 @@ def concat(
)
)
elif how == "horizontal":
if strict:
nrows = first.select(F.len()).collect()[0, 0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this should be implemented on the rust side is that this collect here could trigger a massive computation if the query plan is complex, which then gets tossed. The check should be performed when the concatenation operation is actually applied.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. When I initially thought about it, I failed to take into account how I would compare the number of rows on Lazyframes.

for e in elems[1:]:
nrows2 = e.select(F.len()).collect()[0, 0]
if nrows != nrows2:
msg = f"Number of rows need to be equal ({nrows} != {nrows2}) when 'strict' is True"
raise ValueError(msg)

return wrap_ldf(
plr.concat_lf_horizontal(
elems,
Expand Down
19 changes: 19 additions & 0 deletions py-polars/tests/unit/functions/test_concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,25 @@ def test_concat_lf_stack_overflow() -> None:
assert bar.collect().shape == (1001, 1)


def test_concat_horizontally_strict() -> None:
a = pl.DataFrame({"a": [0, 1], "b": [1, 2]})
b = pl.DataFrame({"c": [11], "d": [42]})

with pytest.raises(ValueError):
pl.concat([a, b], how="horizontal", strict=True)

with pytest.raises(ValueError):
pl.concat([a.lazy(), b.lazy()], how="horizontal", strict=True)
nimit marked this conversation as resolved.
Show resolved Hide resolved

out = pl.concat([a, b], how="horizontal", strict=False)
assert out.to_dict(as_series=False) == {
"a": [0, 1],
"b": [1, 2],
"c": [11, None],
"d": [42, None],
}


def test_concat_vertically_relaxed() -> None:
a = pl.DataFrame(
data={"a": [1, 2, 3], "b": [True, False, None]},
Expand Down