
fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe #1876

Open · wants to merge 4 commits into base: tswast-refactor-cell-data

Conversation

@tswast tswast commented Mar 27, 2024

Based on #2144, which should merge first.

TODO:

  • tests
  • maybe we don't want string columns for JSON? Update: keeping string because that's consistent with (1) current behavior and (2) the BQ Storage Read API.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #1580
🦕

@product-auto-label product-auto-label bot added size: xs Pull request size is extra small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Mar 27, 2024
tswast commented Mar 27, 2024

I might actually want to do something in db-dtypes so that, even though the stored value is a string, the unboxed value would be a parsed object, matching the behavior when the REST API is used.
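As a sketch of that idea (a hypothetical helper, not part of the db-dtypes API), the "unboxing" would parse stored JSON strings back into Python objects, the way the REST code path already behaves:

```python
import json

# Hypothetical sketch (not db-dtypes API): values are stored as JSON strings,
# but reading them back yields parsed Python objects (dict, list, or scalar),
# as the REST API code path already does.
def unbox_json_values(values):
    """Parse each JSON string into a Python object; pass non-strings through."""
    return [json.loads(v) if isinstance(v, str) else v for v in values]

# '{"a": 1}' parses to a dict, '[1, 2]' to a list, 'null' to None.
objects = unbox_json_values(['{"a": 1}', '[1, 2]', 'null'])
```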

tswast commented Mar 27, 2024

Right now the behavior is inconsistent across REST and BQ Storage API.

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 28, 2024
tswast commented Mar 28, 2024

Marking as do not merge for now. This makes the JSON dtype consistent, but it always returns the string dtype, like the BQ Storage Read API code path does, which isn't ideal.

@product-auto-label product-auto-label bot added size: s Pull request size is small. and removed size: xs Pull request size is extra small. labels Mar 10, 2025
@tswast tswast marked this pull request as ready for review March 10, 2025 15:28
@tswast tswast requested review from a team as code owners March 10, 2025 15:28
@tswast tswast requested a review from shollyman March 10, 2025 15:28
@tswast tswast removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 10, 2025
tswast commented Mar 10, 2025

Actually, I think this needs a few more tests. I'm testing manually with pytest 'tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_csv-json]' from googleapis/python-bigquery-pandas#893, but it's currently failing because we parse the JSON string in _row_iterator_page_columns, whereas we actually want to keep those values as strings so they can use the json_ pyarrow type.

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 10, 2025
@product-auto-label product-auto-label bot added size: m Pull request size is medium. size: xl Pull request size is extra large. and removed size: s Pull request size is small. size: m Pull request size is medium. labels Mar 10, 2025
Comment on lines 66 to 74

```python
# Prefer the JSON type built in to pyarrow (added in 19.0.0), if available.
# Otherwise, fall back to db-dtypes, where JSONArrowType was added in 1.4.0;
# since an older db-dtypes might be installed, string is the final fallback.
# TODO(https://github.com/pandas-dev/pandas/issues/60958): switch to
# pyarrow.json_(pyarrow.string()) if available and supported by pandas.
if hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
```
tswast (Contributor Author):
This is the key change. Mostly aligns with bigframes, but we've left off pyarrow.json_(pyarrow.string()) because of pandas-dev/pandas#60958.

@tswast tswast removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 11, 2025
@tswast tswast requested review from chalmerlowe and Linchin and removed request for shollyman March 11, 2025 21:17
@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 11, 2025
tswast commented Mar 11, 2025

Marking as do not merge again. I'll split out the refactor into a separate PR first.

Edit: Mailed #2144

@tswast tswast changed the base branch from main to tswast-refactor-cell-data March 11, 2025 21:59
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Mar 12, 2025
tswast commented Mar 12, 2025

I've added regression tests for #1580

```python
    "json_array_col",
]
assert table.shape == (0, 5)
assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
```
tswast (Contributor Author):
Test failure:

```
____________________ test_to_arrow_query_with_empty_results ____________________

bigquery_client = 

    def test_to_arrow_query_with_empty_results(bigquery_client):
        """
        JSON regression test for https://github.com/googleapis/python-bigquery/issues/1580.
        """
        job = bigquery_client.query(
            """
            select
            123 as int_col,
            '' as string_col,
            to_json('{}') as json_col,
            struct(to_json('[]') as json_field, -1 as int_field) as struct_col,
            [to_json('null')] as json_array_col,
            from unnest([])
            """
        )
        table = job.to_arrow()
        assert list(table.column_names) == [
            "int_col",
            "string_col",
            "json_col",
            "struct_col",
            "json_array_col",
        ]
        assert table.shape == (0, 5)
>       assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
E       AttributeError: 'pyarrow.lib.StructType' object has no attribute 'names'
```

Need to update this to support older pyarrow.

```python
# but we'd like this to map as closely to the BQ Storage API as
# possible, which uses the string() dtype, as JSON support in BigQuery
# predates JSON support in Arrow by several years.
"JSON": pyarrow.string,
```

A reviewer asked:

Mapping to pa.string won't achieve a round trip, will it? Meaning a value saved locally can't be identified as JSON when read back. Does this matter for bigframes?

tswast (Contributor Author):

BigQuery sets metadata on the Field that can be used to determine this type. I don't want to diverge from BigQuery Storage Read API behavior.

In bigframes and pandas-gbq, we have the BigQuery schema available to disambiguate to customize the pandas types.

tswast (Contributor Author):

I can investigate if such an override is also possible here.

tswast (Contributor Author):

I made some progress plumbing a json_type through everywhere it would need to go to be able to override this, but once I got to to_arrow_iterable, it breaks down: there we very much just return the pages we get from the BQ Storage Read API. I don't really want to override that, as it adds new layers of complexity to what was a relatively straightforward internal API.

I'd prefer to leave this as-is without the change to allow overriding the arrow type.

tswast (Contributor Author):

If there are still objections, I can try the same approach but only for the pandas data type. That gets a bit awkward when it comes to structs, though.

A reviewer replied:

Thanks for digging into it.


Successfully merging this pull request may close these issues.

ValueError encountered when to_dataframe returns empty resultset with JSON field