feat: Add strict parameter to pl.concat(how='horizontal') #20019

Draft
wants to merge 22 commits into base: main

@nimit nimit commented Nov 27, 2024

Closes #19133.
Changed the Python package so that, when how='horizontal', the number of rows in the first element is checked against the rest of the elements, for both lazy and eager frames.
strict defaults to False.

Also added unit tests covering these cases:

  1. strict=True, rows don't match in eager DataFrame
  2. strict=True, rows don't match in lazy DataFrame
  3. strict=False, rows don't match
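
The intended behaviour in these three cases can be sketched without Polars. The following is a minimal pure-Python model, not the PR's implementation: `hconcat`, `height`, the dict-of-lists frame representation, and the local `ShapeError` class are hypothetical stand-ins (the real exception is `polars.exceptions.ShapeError`). Padding shorter inputs with `None` mirrors the null-filling that non-strict horizontal concatenation performs.

```python
class ShapeError(Exception):
    """Local stand-in for polars.exceptions.ShapeError (assumption for this sketch)."""


# A "frame" here is a dict mapping column name -> list of values,
# where all columns within one frame have the same length.
Frame = dict


def height(frame: Frame) -> int:
    """Number of rows in a frame (0 for a frame with no columns)."""
    return len(next(iter(frame.values()), []))


def hconcat(frames: list, *, strict: bool = False) -> Frame:
    """Place the columns of `frames` side by side.

    strict=True  -> raise ShapeError if the frames differ in height.
    strict=False -> pad shorter frames with None up to the tallest height.
    """
    heights = [height(f) for f in frames]
    if strict and len(set(heights)) > 1:
        raise ShapeError(f"cannot concat horizontally, heights differ: {heights}")
    target = max(heights, default=0)
    out: Frame = {}
    for frame in frames:
        for name, values in frame.items():
            if name in out:
                raise ValueError(f"duplicate column name: {name!r}")
            out[name] = list(values) + [None] * (target - len(values))
    return out
```

With a three-row frame and a two-row frame, `hconcat([a, b])` pads the shorter column with `None`, while `hconcat([a, b], strict=True)` raises `ShapeError` before producing any output.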

@github-actions github-actions bot added the enhancement and python labels Nov 27, 2024
@coastalwhite
Collaborator

Heya, thank you for the PR.
I do think this needs to happen on the Rust side and not on the Python side. If it is done in Python, there is no way we can reason about it in the query optimizer / engine.

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.53%. Comparing base (4c1c51c) to head (6078964).

Files with missing lines | Patch % | Lines
crates/polars-stream/src/nodes/zip.rs | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20019   +/-   ##
=======================================
  Coverage   79.52%   79.53%           
=======================================
  Files        1563     1563           
  Lines      217104   217121   +17     
  Branches     2464     2464           
=======================================
+ Hits       172659   172690   +31     
+ Misses      43885    43871   -14     
  Partials      560      560           


@@ -231,6 +240,14 @@ def concat(
                )
            )
        elif how == "horizontal":
            if strict:
                nrows = first.select(F.len()).collect()[0, 0]
Contributor

The reason this should be implemented on the Rust side is that the collect here could trigger a massive computation if the query plan is complex, and its result then gets tossed. The check should be performed when the concatenation operation is actually applied.
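
To make the cost concrete, here is a toy model of a lazy query. It is plain Python, not Polars internals, and `LazyPlan`, `hconcat_checked_in_python`, and `hconcat_checked_at_execution` are hypothetical names. Counting how often each input actually runs shows that a height check done while building the query executes every input an extra time, whereas a check attached to the concatenation step runs only when the final result is requested.

```python
from typing import Callable, List


class LazyPlan:
    """A deferred computation that counts how many times it is actually executed."""

    def __init__(self, compute: Callable[[], list]) -> None:
        self._compute = compute
        self.executions = 0

    def collect(self) -> list:
        self.executions += 1
        return self._compute()


def hconcat_checked_in_python(plans: List[LazyPlan]) -> LazyPlan:
    """Check heights while *building* the query.

    Every input is collected just to learn its length; the rows are discarded.
    """
    heights = [len(p.collect()) for p in plans]
    if len(set(heights)) > 1:
        raise ValueError(f"heights differ: {heights}")
    return LazyPlan(lambda: list(zip(*(p.collect() for p in plans))))


def hconcat_checked_at_execution(plans: List[LazyPlan]) -> LazyPlan:
    """Defer the height check until the concatenation itself is executed."""

    def run() -> list:
        columns = [p.collect() for p in plans]
        heights = [len(c) for c in columns]
        if len(set(heights)) > 1:
            raise ValueError(f"heights differ: {heights}")
        return list(zip(*columns))

    return LazyPlan(run)
```

In this model the first variant has already executed each input once before anyone asks for a result, and executes it again on the real `collect()`. The second variant executes nothing until `collect()` and reports a mismatch at that point.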

Author

Understood. When I initially thought about it, I failed to take into account how I would compare the number of rows on LazyFrames.

@nimit nimit requested a review from orlp as a code owner December 1, 2024 18:16
@nimit
Author

nimit commented Dec 2, 2024

@mcrumiller @coastalwhite
Apologies for the delay. I was unfamiliar with some Rust language features.
Can you please have a look at this? I think I made the required changes in concat_df_horizontal, changed the UnionArgs struct to include strict as a parameter, and modified the function call in the HConcat plan executor. I also made a few changes to other functions that call concat_df_horizontal, but they pass None.
The tests should raise polars.exceptions.ShapeError when heights don't match.

Unfortunately, my machine is not capable of running make test & make pre-commit in the cargo directory and GitHub Codespaces runs out of disk space every time I try it there.

@mcrumiller
Contributor

@nimit I'm not a repo member, I just lurk here a lot, but I can try to help you get things working--what's the issue with running make on your end?

@nimit
Author

nimit commented Dec 2, 2024

> @nimit I'm not a repo member, I just lurk here a lot, but I can try to help you get things working--what's the issue with running make on your end?

Thanks for your help!
It has been running for more than 4 hours with very little progress (my guess is that it's because I have an older machine).

@mcrumiller
Contributor

Finally, I think you should rename the PR to:

feat: Add `strict` parameter to `pl.concat(how='horizontal')`
  • This isn't python exclusive ("feat(python):" -> "feat:")
  • PR titles should be active voice ("Add" instead of "Added") to indicate what happens when the PR is merged.
  • Backticks around `strict` and `pl.concat(how='horizontal')` identify them as literal keywords/code.
  • The word "parameter" is a bit more precise.

Cheers!

@nimit nimit changed the title from "feat(python): Added strict option in pl.concat(how='horizontal')" to "feat: Add strict parameter to pl.concat(how='horizontal')" Dec 2, 2024
@github-actions github-actions bot added the rust label Dec 2, 2024
@nimit
Author

nimit commented Dec 2, 2024

Thank you very much @mcrumiller for your help throughout my first open-source PR.

@mcrumiller
Contributor

I'm not sure why the test is failing, it passes on my end.

@nimit
Author

nimit commented Dec 2, 2024

> I'm not sure why the test is failing, it passes on my end.

Yeah, I am confused as well. I rely on the GitHub Actions runs to test it.

@mcrumiller
Contributor

mcrumiller commented Dec 3, 2024

@nimit it's failing on the new streaming engine:

~/projects/polars/py-polars$ export POLARS_AUTO_NEW_STREAMING=1
~/projects/polars/py-polars$ pytest /home/mcrumiller/projects/polars/py-polars/tests/unit/functions/test_concat.py
============================= test session starts ==============================
platform linux -- Python 3.12.6, pytest-8.3.2, pluggy-1.5.0
codspeed: 3.0.0 (disabled, mode: walltime, timer_resolution: 1.0ns)
rootdir: /home/mcrumiller/projects/polars/py-polars
configfile: pyproject.toml
plugins: cov-6.0.0, codspeed-3.0.0, hypothesis-6.119.4, xdist-3.6.1
collected 4 items / 2 deselected / 2 selected

tests/unit/functions/test_concat.py F.                                   [100%]

=================================== FAILURES ===================================
_______________________ test_concat_horizontally_strict ________________________
tests/unit/functions/test_concat.py:32: in test_concat_horizontally_strict
    with pytest.raises(pl.exceptions.ShapeError):
E   Failed: DID NOT RAISE <class 'polars.exceptions.ShapeError'>
=========================== short test summary info ============================
FAILED tests/unit/functions/test_concat.py::test_concat_horizontally_strict - Failed: DID NOT RAISE <class 'polars.exceptions.ShapeError'>
================== 1 failed, 1 passed, 2 deselected in 0.13s ===================

I'll look into it.
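
Why the streaming engine behaves differently is not established in this thread. One plausible factor, stated here purely as an assumption, is that a streaming zip never sees a full input height up front, so it cannot reuse an eager height comparison. The sketch below is plain Python rather than Polars code, and `zip_chunks_strict` and the local `ShapeError` are hypothetical. It shows what a strict check could look like when inputs arrive as chunks: the mismatch only becomes visible once one stream ends while another still has rows.

```python
from itertools import chain
from typing import Iterable, Iterator, List, Tuple


class ShapeError(Exception):
    """Local stand-in for polars.exceptions.ShapeError (assumption for this sketch)."""


_END = object()  # sentinel marking an exhausted stream; distinct from any data value


def zip_chunks_strict(*streams: Iterable[List[int]]) -> Iterator[Tuple[int, ...]]:
    """Zip rows from several chunked streams, raising if they end at different heights.

    Each stream yields chunks (lists of values); chunk boundaries need not line up.
    """
    row_iters = [chain.from_iterable(s) for s in streams]  # flatten chunks lazily
    while True:
        row = [next(it, _END) for it in row_iters]
        ended = [value is _END for value in row]
        if all(ended):
            return  # every input finished on the same row: heights match
        if any(ended):
            raise ShapeError("inputs to horizontal concat have different heights")
        yield tuple(row)
```

Note that in this model rows produced before the shorter input runs out have already been emitted by the time the error is raised, which is one way a streaming check differs from an up-front one.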

@mcrumiller
Contributor

@nimit can you set this PR to draft until we can get this working?

@nimit nimit marked this pull request as draft December 3, 2024 02:31
Labels
enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), rust (Related to Rust Polars)
Development

Successfully merging this pull request may close these issues.

pl.concat(how='horizontal') should be strict by default
8 participants