Skip to content

TST[string]: update expecteds for using_string_dtype to fix xfails #61727

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Jun 27, 2025

It isn't 100% obvious that the new repr for Categoricals is an improvement, but it's non-crazy. One of the remaining xfails one is for eval(repr(categorical_index)) round-tripping that won't be fixable unless we revert back to the old repr behavior.

I'm pretty sure that the fix in test_astype_dt64_to_string is correct and the test is just wrong, but merits a close look.

That leaves 12 xfails, including the one un-fixable round-trip one that we'll just remove. Of those...

  • test_join_on_key i think is surfacing an unrelated bug that I'll take a look at (xref BUG[string]: incorrect index downcast in DataFrame.join #61771)
  • test_to_dict_of_blocks_item_cache is failing because we don't make series.values read-only for ArrowStringArray. I think @mroeschke can comment on how viable/important that is.
  • test_string_categorical_index_repr is about CategoricalIndex repr that span multiple lines; with the StringDtype the padding is changed.
  • 4 in pandas/tests/io/json/test_pandas.py that im hoping @WillAyd can take point on
  • test_to_string_index_with_nan theres a MultiIndex level that reprs with a nan instead of NaN. Not a huge deal but having mixed-and-matched nans/NaNs in the repr is weird.
  • test_from_records_sequencelike i don't have a good read on
  • tests.base.test_misc::test_memory_usage is skipped instead of xfailed, but the reason says that it "doesn't work properly" for arrow strings which seems xfail-adjacent. Instead of skipping can we update the expected behavior cc @jorisvandenbossche ?

(Update: looks like I missed one in test_http_headers and another in test_fsspec)

@WillAyd
Copy link
Member

WillAyd commented Jun 30, 2025

The JSON issues stem back to the fact that:

>>> pd.Series([None, '', 'c']).astype(object)

yields different behavior with/without the future string dtype. In the "old" world, this would preserve the value of None but in the new world, None gets cast to a missing value indicator when contained within a series of string values.

In theory we could try and work around those semantics by natively supporting an object type in the JSON reader, but that's a ton of effort and I don't think worth it, given JSON does not natively support object storage

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Jun 30, 2025

thanks, will update those tests' expecteds

@mroeschke
Copy link
Member

we don't make series.values read-only for ArrowStringArray

Can't speak for the viability but I think this should be read-only per CoW. Is this related to the underlying ArrowExtensionArray.__setitem__ immutably issue?

@jbrockmendel
Copy link
Member Author

Can't speak for the viability but I think this should be read-only per CoW. Is this related to the underlying ArrowExtensionArray.setitem immutably issue?

I suspect we would need a mechanism like ndarray.flags for EAs to declare an object as read-only. Definitely out of scope for this PR.

Looking at the test, the pertinent behavior can be tested by making the column specifically object dtype.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot Brock, the fixes in the current diff all look good!

@jorisvandenbossche
Copy link
Member

  • tests.base.test_misc::test_memory_usage is skipped instead of xfailed, but the reason says that it "doesn't work properly" for arrow strings which seems xfail-adjacent. Instead of skipping can we update the expected behavior cc @jorisvandenbossche ?

I quickly checked that one, and I am not entirely sure why we skipped it based on "doesn't work properly for arrow strings". The issue with the test seems to be that it is assuming too much about the input data, and its is_object definition is no longer correct. For example, a MultiIndex gets considered as object dtype, but now if all levels of the MultiIndex are using str dtype, then the test should not actually see it as is_object.
Same for is_categorical where the test code assumes that a categorical is using object dtype categories, I think.

(can do a separate PR to fix this one, now that I already dived into it)

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to backport this into 2.3.x?

@jbrockmendel
Copy link
Member Author

Looks like #61757 wasn't backported, so i dont think this one should be either

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants