consider: option to include NULLs in Array.collect()? #9703

NickCrews · 2024-07-27T07:37:11Z

What happened?

In #9313 we unified behavior across backends and made it so Array.collect() excluded NULLs

This behavior change broke [this util function of mine](My function that relies on this property is here.

This was due to my reliance on the previously-true-on-duckdb property that unnest() and collect() were inverses (minus the fact that unnest([]) -> no row and unnest(NULL) -> no row, so in order for x.unnest().collect() to get back to x I needed a little post-process massaging). I hadn't realized that my implementation there wasn't fully portable to other backends, this issue is informative, thank you!

Do you see another way to accomplish what I'm trying to do there that can sidestep this behavior change? Or could we consider adding a exclude_nulls=True option?

Regardless, would you like a PR that documents the intended behavior in .collect()s docstring? It was difficult for me to find this issue, I thought this behavior might have just been accidental.

What version of ibis are you using?

main

What backend(s) are you using, if any?

duckdb, though I ideally want my code as linked above to be portable to pyspark, and pyspark always exludes NULLs from list_agg(), so an entierly new implementation might be the only way to really solve this...

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

gforsyth · 2024-07-29T17:34:23Z

Hey @NickCrews -- were you able to work around the null exclusion issues?

NickCrews · 2024-07-29T17:57:06Z

@gforsyth only by dropping down to .sql(). Which probably works just fine and won't break on me in the future?

jcrist · 2024-07-29T18:00:06Z

We standardized on ignoring nulls in collect/first/last since it was both more consistent with other aggregations, and also the only one that all our backends could support. If it'd be useful I think adding a ignore_nulls kwarg that defaults to True for the current behavior (or include_nulls/exclude_nulls, no strong thoughts on name) would make sense here. Setting ignore_nulls=False would only work on backends that could support it though.

gforsyth · 2024-07-29T19:24:28Z

Which probably works just fine and won't break on me in the future?

That will not break in the future -- we'll probably go ahead and add an ignore_nulls kwarg anyway, but this is helpful to know for helping set priority.

jcrist · 2024-07-30T16:56:27Z

Adding it for collect/first/last is a pretty low lift. We don't have an equivalent API spelling to mirror currently though - anyone have thoughts for what the kwarg name should be? We'll likely want to copy the same kwarg name everywhere else we'd need it. Remember that the default (in these functions at least) is to exclude nulls.

def collect(self, ignore_nulls=True): ...

def collect(self, exclude_nulls=True): ...

def collect(self, include_nulls=False): ...

def collect(self, omit_nulls=True): ...

...?

My slight preference is for ignore_nulls=True, since IGNORE NULLS is a spelling used by a few SQL engines.

gforsyth · 2024-07-30T17:28:19Z

+1 on ignore_nulls=True for symmetry with the SQL engines.

I think a pattern where default True kwargs are spelled include_* and ignore_* is pretty reasonable.

cpcloud · 2024-07-30T18:05:20Z

We do have some non-pluralization happening, like fill_null and drop_null. I don't really like that we would introduce ignore_nulls that is inconsistent with our previous discussions (IIRC people were actively against pluralizing those methods), but I agree that ignore_nulls is probably the best option.

cpcloud · 2024-07-30T18:06:46Z

ignore_nulls should also be included in at least lead and lag since that unblocks bfill/ffill, but definitely a follow-up/separate PR.

gforsyth · 2024-07-31T14:22:47Z

IIRC people were actively against pluralizing those methods

I was, but I feel differently about plurals in keyword-arguments. I am trying to square that linguistically

NickCrews · 2024-07-31T19:48:10Z

For consistency with .fill_null() etc I would vote keep it singular eg ignore_null

NickCrews added the bug Incorrect behavior inside of ibis label Jul 27, 2024

github-project-automation bot added this to Ibis planning and roadmap Jul 27, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Jul 27, 2024

NickCrews mentioned this issue Jul 27, 2024

Incorrect join on large tables for add_tfidf NickCrews/mismo#50

Open

jcrist self-assigned this Jul 31, 2024

jcrist added feature Features or general enhancements ux User experience related issues and removed bug Incorrect behavior inside of ibis labels Aug 5, 2024

jcrist mentioned this issue Aug 6, 2024

feat(api): support include_null arg in collect/first/last aggregations #9778

Merged

jcrist closed this as completed in #9778 Aug 7, 2024

github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Aug 7, 2024

github-actions bot added this to the 9.3 milestone Aug 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consider: option to include NULLs in Array.collect()? #9703

consider: option to include NULLs in Array.collect()? #9703

NickCrews commented Jul 27, 2024 •

edited

Loading

gforsyth commented Jul 29, 2024

NickCrews commented Jul 29, 2024

jcrist commented Jul 29, 2024 •

edited

Loading

gforsyth commented Jul 29, 2024

jcrist commented Jul 30, 2024

gforsyth commented Jul 30, 2024

cpcloud commented Jul 30, 2024

cpcloud commented Jul 30, 2024

gforsyth commented Jul 31, 2024

NickCrews commented Jul 31, 2024

consider: option to include NULLs in Array.collect()? #9703

consider: option to include NULLs in Array.collect()? #9703

Comments

NickCrews commented Jul 27, 2024 • edited Loading

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

gforsyth commented Jul 29, 2024

NickCrews commented Jul 29, 2024

jcrist commented Jul 29, 2024 • edited Loading

gforsyth commented Jul 29, 2024

jcrist commented Jul 30, 2024

gforsyth commented Jul 30, 2024

cpcloud commented Jul 30, 2024

cpcloud commented Jul 30, 2024

gforsyth commented Jul 31, 2024

NickCrews commented Jul 31, 2024

NickCrews commented Jul 27, 2024 •

edited

Loading

jcrist commented Jul 29, 2024 •

edited

Loading