fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe
#1876
base: tswast-refactor-cell-data
Conversation
I might actually want to do something in
Right now the behavior is inconsistent across the REST and BQ Storage APIs.
Marking as
Actually, I think this needs a few more tests. I'm testing manually with
# Prefer the JSON type built in to pyarrow (added in 19.0.0), if available.
# Otherwise, fall back to db-dtypes, where JSONArrowType was added in 1.4.0.
# Since an older db-dtypes might be installed, use string as a last resort.
# TODO(https://github.com/pandas-dev/pandas/issues/60958): switch to
# pyarrow.json_(pyarrow.string()) if available and supported by pandas.
if hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
This is the key change. It mostly aligns with bigframes, but we've left off pyarrow.json_(pyarrow.string()) because of pandas-dev/pandas#60958.
Marking as Edit: Mailed #2144
I've added regression tests for #1580.
tests/system/test_arrow.py
Outdated
    "json_array_col",
]
assert table.shape == (0, 5)
assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
Test failure:
____________________ test_to_arrow_query_with_empty_results ____________________
bigquery_client =
def test_to_arrow_query_with_empty_results(bigquery_client):
"""
JSON regression test for https://github.com/googleapis/python-bigquery/issues/1580.
"""
job = bigquery_client.query(
"""
select
123 as int_col,
'' as string_col,
to_json('{}') as json_col,
struct(to_json('[]') as json_field, -1 as int_field) as struct_col,
[to_json('null')] as json_array_col,
from unnest([])
"""
)
table = job.to_arrow()
assert list(table.column_names) == [
"int_col",
"string_col",
"json_col",
"struct_col",
"json_array_col",
]
assert table.shape == (0, 5)
> assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
E AttributeError: 'pyarrow.lib.StructType' object has no attribute 'names'
Need to update this to support older pyarrow, where pyarrow.lib.StructType has no names attribute.
# but we'd like this to map as closely to the BQ Storage API as
# possible, which uses the string() dtype, as JSON support in Arrow
# predates JSON support in BigQuery by several years.
"JSON": pyarrow.string,
Mapping to pa.string won't achieve a round trip, will it? A value saved locally won't be identifiable as JSON when read back. Does that matter to bigframes?
BigQuery sets metadata on the Field that can be used to determine this type. I don't want to diverge from BigQuery Storage Read API behavior.
In bigframes and pandas-gbq, we have the BigQuery schema available to disambiguate and customize the pandas types.
I can investigate if such an override is also possible here.
I made some progress plumbing a json_type through everywhere it would need to go to be able to override this, but once I got to to_arrow_iterable, it breaks down. There we very much just return the pages we get from the BQ Storage Read API. I don't really want to override that, as it would add new layers of complexity to what was a relatively straightforward internal API.
I'd prefer to leave this as-is without the change to allow overriding the arrow type.
If there are still objections, I can try the same approach with just the pandas data type. That gets a bit awkward when it comes to structs, though.
Thanks for digging into it.
Based on #2144, which should merge first.
TODO:
Fixes #1580
🦕