
fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe #1876

Open · wants to merge 4 commits into base: tswast-refactor-cell-data

Conversation

@tswast tswast commented Mar 27, 2024

Based on #2144, which should merge first.

TODO:

  • tests
  • maybe we don't want string columns for JSON? Update: keeping string because that's consistent with (1) current behavior and (2) the BQ Storage Read API.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #1580
🦕

@product-auto-label product-auto-label bot added size: xs Pull request size is extra small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Mar 27, 2024
tswast commented Mar 27, 2024

I might actually want to do something in db-dtypes so that, even though the stored value is a string, the unboxed value would be a parsed object, matching the behavior when the REST API is used.
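As a sketch of that idea (a hypothetical helper, not part of the db-dtypes API), the "unboxing" would parse stored JSON strings back into Python objects, the way the REST code path already behaves:

```python
import json

# Hypothetical sketch (not db-dtypes API): values are stored as JSON strings,
# but reading them back yields parsed Python objects (dict, list, or scalar),
# as the REST API code path already does.
def unbox_json_values(values):
    """Parse each JSON string into a Python object; pass non-strings through."""
    return [json.loads(v) if isinstance(v, str) else v for v in values]

# '{"a": 1}' parses to a dict, '[1, 2]' to a list, 'null' to None.
objects = unbox_json_values(['{"a": 1}', '[1, 2]', 'null'])
```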

tswast commented Mar 27, 2024

Right now the behavior is inconsistent across REST and BQ Storage API.

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 28, 2024
tswast commented Mar 28, 2024

Marking as do not merge for now. This makes the JSON dtype consistent, but it always returns the string dtype, like the BQ Storage Read API code path does, which isn't ideal.

@product-auto-label product-auto-label bot added size: s Pull request size is small. and removed size: xs Pull request size is extra small. labels Mar 10, 2025
@tswast tswast marked this pull request as ready for review March 10, 2025 15:28
@tswast tswast requested review from a team as code owners March 10, 2025 15:28
@tswast tswast requested a review from shollyman March 10, 2025 15:28
@tswast tswast removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 10, 2025
tswast commented Mar 10, 2025

Actually, I think this needs a few more tests. I'm testing manually with pytest 'tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_csv-json]' from googleapis/python-bigquery-pandas#893, but it's currently failing because we parse the JSON string in _row_iterator_page_columns, whereas we actually want to keep those values as strings so they can use the json_ pyarrow type.

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 10, 2025
@product-auto-label product-auto-label bot added size: m Pull request size is medium. size: xl Pull request size is extra large. and removed size: s Pull request size is small. size: m Pull request size is medium. labels Mar 10, 2025
Comment on lines 66 to 74

```python
# Prefer the JSON type built in to pyarrow (added in 19.0.0), if available.
# Otherwise, fall back to db-dtypes, where JSONArrowType was added in 1.4.0;
# since an older db-dtypes might be installed, string is the final fallback.
# TODO(https://github.com/pandas-dev/pandas/issues/60958): switch to
# pyarrow.json_(pyarrow.string()) if available and supported by pandas.
if hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
```
tswast (Contributor Author):
This is the key change. Mostly aligns with bigframes, but we've left off pyarrow.json_(pyarrow.string()) because of pandas-dev/pandas#60958.

@tswast tswast removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 11, 2025
@tswast tswast requested review from chalmerlowe and Linchin and removed request for shollyman March 11, 2025 21:17
@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 11, 2025
tswast commented Mar 11, 2025

Marking as do not merge again. I'll split out the refactor into a separate PR first.

Edit: Mailed #2144

@tswast tswast changed the base branch from main to tswast-refactor-cell-data March 11, 2025 21:59
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Mar 12, 2025
tswast commented Mar 12, 2025

I've added regression tests for #1580

```python
    "json_array_col",
]
assert table.shape == (0, 5)
assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
```
tswast (Contributor Author):
Test failure:

```
____________________ test_to_arrow_query_with_empty_results ____________________

bigquery_client = 

    def test_to_arrow_query_with_empty_results(bigquery_client):
        """
        JSON regression test for https://github.com/googleapis/python-bigquery/issues/1580.
        """
        job = bigquery_client.query(
            """
            select
            123 as int_col,
            '' as string_col,
            to_json('{}') as json_col,
            struct(to_json('[]') as json_field, -1 as int_field) as struct_col,
            [to_json('null')] as json_array_col,
            from unnest([])
            """
        )
        table = job.to_arrow()
        assert list(table.column_names) == [
            "int_col",
            "string_col",
            "json_col",
            "struct_col",
            "json_array_col",
        ]
        assert table.shape == (0, 5)
>       assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
E       AttributeError: 'pyarrow.lib.StructType' object has no attribute 'names'
```

Need to update this to support older pyarrow.

```python
# but we'd like this to map as closely to the BQ Storage API as
# possible, which uses the string() dtype, as JSON support in BigQuery
# predates JSON support in Arrow by several years.
"JSON": pyarrow.string,
```

A reviewer asked:

Mapping to pa.string won't achieve a round trip, will it? Meaning a value saved locally can't be identified as JSON when read back. Does this matter for bigframes?

tswast (Contributor Author):

BigQuery sets metadata on the Field that can be used to determine this type. I don't want to diverge from BigQuery Storage Read API behavior.

In bigframes and pandas-gbq, we have the BigQuery schema available to disambiguate to customize the pandas types.

tswast (Contributor Author):

I can investigate if such an override is also possible here.

tswast (Contributor Author):

I made some progress plumbing a json_type through everywhere it would need to go to be able to override this, but once I got to to_arrow_iterable, it breaks down: there we very much just return the pages we get from the BQ Storage Read API. I don't really want to override that, as it adds new layers of complexity to what was a relatively straightforward internal API.

I'd prefer to leave this as-is without the change to allow overriding the arrow type.

tswast (Contributor Author):

If there are still objections, I can try the same approach but only for the pandas data type. That gets a bit awkward when it comes to structs, though.

A reviewer replied:

Thanks for digging into it.


Successfully merging this pull request may close these issues.

ValueError encountered when to_dataframe returns empty resultset with JSON field