Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add cols_label_with() method #626

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

jrycw
Copy link
Collaborator

@jrycw jrycw commented Mar 2, 2025

Hello team,

This PR ports the cols_label_with() function from {gt} to Great Tables.

Copy link

codecov bot commented Mar 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.24%. Comparing base (dc82197) to head (ce5d5f1).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #626      +/-   ##
==========================================
+ Coverage   90.85%   91.24%   +0.39%     
==========================================
  Files          46       47       +1     
  Lines        5423     5484      +61     
==========================================
+ Hits         4927     5004      +77     
+ Misses        496      480      -16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jrycw
Copy link
Collaborator Author

jrycw commented Mar 3, 2025

Whoops! I'm unsure whether we should call cols_label() instead of directly using Boxhead._set_column_labels(), as cols_label() includes additional logic for handling unit syntax.

@jrycw
Copy link
Collaborator Author

jrycw commented Mar 3, 2025

Alternatively, we could extract this code block into a separate function, allowing both cols_label() and cols_label_with() to call it.

# Handle units syntax in labels (e.g., "Density ({{ppl / mi^2}})")
new_kwargs: dict[str, UnitStr | str | BaseText] = {}
for k, v in new_cases.items():
if isinstance(v, str):
unitstr_v = UnitStr.from_str(v)
if len(unitstr_v.units_str) == 1 and isinstance(unitstr_v.units_str[0], str):
new_kwargs[k] = unitstr_v.units_str[0]
else:
new_kwargs[k] = unitstr_v
elif isinstance(v, BaseText):
new_kwargs[k] = v
else:
raise ValueError(
"Column labels must be strings or BaseText objects. Use `md()` or `html()` for formatting."
)

@rich-iannone
Copy link
Member

Thanks for considering the units syntax codepath (I missed that too)! For incorporating that functionality, I do think your last comment (create a units handler as a shared function) is the way right to go.

@rich-iannone
Copy link
Member

As for the docs-building CI error, we solved that in a recent commit by removing the pip install of the pointblank package in the CI workflow file. Just merge with main and you should be all set!

Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@machow machow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR -- this function seems really helpful! WDYT of making columns the first argument? I detail a bit of the rationale in a comment.

Here's a detail that's not critical to this PR

One thing I learned just now going through the R library, is that there cols_label has a .fn= parameter for doing these kinds of transformations. However, it sounds like cols_label_with() is a helpful convenience function, because this is something people often want to do.

Essentially, the breakdown seems like:

  • cols_label() often involves directly relabeling columns (e.g. a = "b").
  • cols_label_with() only allows transformation via a function.

However, because we haven't added cols_label(fn=...), there's an opportunity in the future to define it. One wild idea, would be supporting Polars selectors somehow. E.g.

import polars as pl
import polars.selectors as cs

# imagine that selectors are run over a special DataFrame w/ 1 row
df_names = pl.DataFrame({"my_col": ["my_col"], "my_col2": ["my_col2"]})

# the selected column names are the columns to relabel, the value in the single
# row is the new name. There might be a more natural way to do this with
# polars, but wanted to float it as a broad idea!
df_names.select(cs.starts_with("my").str.to_uppercase())

return self._replace(_boxhead=boxhead)


def cols_label_with(self: GTSelf, fn: Callable[[str], str], columns: SelectExpr = None) -> GTSelf:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From pairing w/ Rich, @jrycw WDYT of switching the order of the parameters, so columns is before fn?

Looking through the cols_*() methods in the R library, column selection usually comes first. In the case of cols_align() where it doesn't, it sounds like it's a historical artifact. Keeping column first might help cement a pattern for cols_*() methods.

It makes sense you put it first in the PR, since it doesn't have a default. WDYT of us setting it to fn = None, and then erroring if fn is None?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machow, nice suggestion! I definitely like your idea of introducing fn= to cols_label() using Polars syntax.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to @machow for the comment! I think this is a good opportunity to introduce fn= to accept a callable in cols_label() as well.

By the way, I noticed that we have two test files for boxhead, test_boxhead.py and test__boxhead.py. I picked one to add the new test, but perhaps we should consider consolidating them in the future.

Copy link
Collaborator Author

@jrycw jrycw Mar 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, we might consider using the pl.Expr.name attribute to handle Polars syntax. For example:

df_names.select(cs.starts_with("my").name.to_uppercase())
shape: (1, 2)
┌────────┬─────────┐
│ MY_COLMY_COL2 │
│ ------     │
│ strstr     │
╞════════╪═════════╡
│ my_colmy_col2 │
└────────┴─────────┘

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I've attempted to implement this in the latest commit. I admit it's a bit of a bold move, so feel free to set it aside for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here’s a simple example:

import polars as pl
import polars.selectors as cs
from great_tables import GT

df_names = pl.DataFrame({"my_col": ["my_col"], "my_col2": ["my_col2"]})

GT(df_names).cols_label_with(converter=cs.starts_with("my").name.to_uppercase())

image

Copy link
Collaborator Author

@jrycw jrycw Mar 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When passing a list of Polars expressions, if a column is referenced multiple times, we can either superimpose the transformations or follow a "last one wins" approach (current implementation).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for experimenting with this. If it's okay with you, I think we need a little bit more time to chew on this approach / run it past some folks, since it's a bit cutting edge in terms of external libraries using Polars selectors. I definitely think it'll be useful to figure out in the long run though

Copy link
Collaborator

@machow machow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for working my feedback in. I left comments asking if we could remove fn from .cols_label() and remove unit string processing from .cols_label_with().

I'm still getting a feel for some dynamics in these functions, so thanks for being so patient on this PR 😅

self: GTSelf, cases: dict[str, str | BaseText] | None = None, **kwargs: str | BaseText
self: GTSelf,
cases: dict[str, str | BaseText] | None = None,
fn: Callable[[str | BaseText], str | BaseText] | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this and running with it. I'm so sorry to send you down a rough path, but maybe we take fn out of here for now?

We're realizing that the combination of fn= and the ability to wrap inputs with BaseText containers is a bit tricky. Since we'd have to explain and figure out how to apply fn to things like md("**some text**").

This seems resolvable, but I would punt on for the sake of getting cols_label_with() in. There's some cleanup across Great Tables we can likely do with BaseText and UnitStr/definition list, which I'm guessing will make situations like this easier to handle.

new_cases = {col: fn(col) for col in sel_cols}

# Handle units syntax in labels (e.g., "Density ({{ppl / mi^2}})")
new_kwargs = _handle_units_syntax(new_cases)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the R package doesn't automatically apply unit syntax here. My understanding is that this makes .cols_label_with() a very plain custom function applier. WDYT of not handling unit syntax here?

We could tell people they can make it happen with define_units() etc.. so hopefully they'd be able to make it happen if they need it

I'm still getting a feel for some of these dynamics, so could be totally off

Copy link
Collaborator Author

@jrycw jrycw Mar 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machow, no worries—happy to work on this. Regarding define_units(), I’m having some difficulty describing it since its definition and use cases are still a bit unclear to me (e.g., #446). Maybe you could help clarify this in the documentation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoot, I can see how this is confusing. I'll take a look at the linked PR, and think about what might put units more front and center. I think it's okay for now to just omit any mention of units in this docstring.

Copy link
Collaborator

@machow machow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrycw there are a lot of really interesting changes here, but I think this PR has a bit more in it than I can review.

When I first reviewed, I'd have been onboard with merging the PR after just the columns and fn arguments were switched (so columns was first). If you could revert to that, and make that change (plus having .cols_label_with() not do any unit syntax handling, I'm happy to merge.

(The extra expression stuff is exciting, and seems really useful down the road, but it's a lot to review at once. I do think there's a lot of interesting Polars logic in the PR right now for handling expressions. If you want to stash it somewhere, it could also go into an issue?)

#626 (comment)

@jrycw
Copy link
Collaborator Author

jrycw commented Mar 25, 2025

@jrycw there are a lot of really interesting changes here, but I think this PR has a bit more in it than I can review.

When I first reviewed, I'd have been onboard with merging the PR after just the columns and fn arguments were switched (so columns was first). If you could revert to that, and make that change (plus having .cols_label_with() not do any unit syntax handling, I'm happy to merge.

(The extra expression stuff is exciting, and seems really useful down the road, but it's a lot to review at once. I do think there's a lot of interesting Polars logic in the PR right now for handling expressions. If you want to stash it somewhere, it could also go into an issue?)

#626 (comment)

@machow, thanks for the feedback. I'm not sure which commit you're referring to, but feel free to pick any relevant one for the revert. Could you help proceed with the revert process? Meanwhile, I've created issue #648 to track the Polars expression implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants