Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider: option to include NULLs in Array.collect()? #9703

Closed
1 task done
NickCrews opened this issue Jul 27, 2024 · 10 comments · Fixed by #9778
Closed
1 task done

consider: option to include NULLs in Array.collect()? #9703

NickCrews opened this issue Jul 27, 2024 · 10 comments · Fixed by #9778
Assignees
Labels
feature Features or general enhancements ux User experience related issues
Milestone

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Jul 27, 2024

What happened?

In #9313 we unified behavior across backends and made it so Array.collect() excluded NULLs

This behavior change broke [this util function of mine](My function that relies on this property is here.

This was due to my reliance on the previously-true-on-duckdb property that unnest() and collect() were inverses (minus the fact that unnest([]) -> no row and unnest(NULL) -> no row, so in order for x.unnest().collect() to get back to x I needed a little post-process massaging). I hadn't realized that my implementation there wasn't fully portable to other backends, this issue is informative, thank you!

Do you see another way to accomplish what I'm trying to do there that can sidestep this behavior change? Or could we consider adding a exclude_nulls=True option?

Regardless, would you like a PR that documents the intended behavior in .collect()s docstring? It was difficult for me to find this issue, I thought this behavior might have just been accidental.

What version of ibis are you using?

main

What backend(s) are you using, if any?

duckdb, though I ideally want my code as linked above to be portable to pyspark, and pyspark always exludes NULLs from list_agg(), so an entierly new implementation might be the only way to really solve this...

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@gforsyth
Copy link
Member

Hey @NickCrews -- were you able to work around the null exclusion issues?

@NickCrews
Copy link
Contributor Author

@gforsyth only by dropping down to .sql(). Which probably works just fine and won't break on me in the future?

@jcrist
Copy link
Member

jcrist commented Jul 29, 2024

We standardized on ignoring nulls in collect/first/last since it was both more consistent with other aggregations, and also the only one that all our backends could support. If it'd be useful I think adding a ignore_nulls kwarg that defaults to True for the current behavior (or include_nulls/exclude_nulls, no strong thoughts on name) would make sense here. Setting ignore_nulls=False would only work on backends that could support it though.

@gforsyth
Copy link
Member

Which probably works just fine and won't break on me in the future?

That will not break in the future -- we'll probably go ahead and add an ignore_nulls kwarg anyway, but this is helpful to know for helping set priority.

@jcrist
Copy link
Member

jcrist commented Jul 30, 2024

Adding it for collect/first/last is a pretty low lift. We don't have an equivalent API spelling to mirror currently though - anyone have thoughts for what the kwarg name should be? We'll likely want to copy the same kwarg name everywhere else we'd need it. Remember that the default (in these functions at least) is to exclude nulls.

def collect(self, ignore_nulls=True): ...

def collect(self, exclude_nulls=True): ...

def collect(self, include_nulls=False): ...

def collect(self, omit_nulls=True): ...

...?

My slight preference is for ignore_nulls=True, since IGNORE NULLS is a spelling used by a few SQL engines.

@gforsyth
Copy link
Member

+1 on ignore_nulls=True for symmetry with the SQL engines.

I think a pattern where default True kwargs are spelled include_* and ignore_* is pretty reasonable.

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2024

We do have some non-pluralization happening, like fill_null and drop_null. I don't really like that we would introduce ignore_nulls that is inconsistent with our previous discussions (IIRC people were actively against pluralizing those methods), but I agree that ignore_nulls is probably the best option.

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2024

ignore_nulls should also be included in at least lead and lag since that unblocks bfill/ffill, but definitely a follow-up/separate PR.

@gforsyth
Copy link
Member

IIRC people were actively against pluralizing those methods

I was, but I feel differently about plurals in keyword-arguments. I am trying to square that linguistically

@jcrist jcrist self-assigned this Jul 31, 2024
@NickCrews
Copy link
Contributor Author

For consistency with .fill_null() etc I would vote keep it singular eg ignore_null

@jcrist jcrist added feature Features or general enhancements ux User experience related issues and removed bug Incorrect behavior inside of ibis labels Aug 5, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Aug 7, 2024
@github-actions github-actions bot added this to the 9.3 milestone Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements ux User experience related issues
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants
@cpcloud @jcrist @gforsyth @NickCrews and others