feat(datasets): Implement spark.GBQQueryDataset for reading data from BigQuery as a spark dataframe using SQL query #971

Draft · wants to merge 4 commits into base: main

Conversation


@abhi8893 abhi8893 commented Dec 20, 2024

Description

  • The current implementation of spark.SparkDataset does not support reading data from BigQuery using a SQL query.
  • Adding this functionality to spark.SparkDataset may not comply with the kedro_datasets design principles, as it would require making `filepath` an optional argument.
  • Further, like pandas.GBQQueryDataset, spark.GBQQueryDataset is a read-only dataset, so a separate implementation is better suited to maintaining the overall design of the datasets. A rough sketch of the intended dataset follows.
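
To make the intent concrete, here is a minimal sketch of what such a read-only dataset could look like. This is not the PR's actual implementation: the reader options (viewsEnabled, materializationDataset, materializationProject, query) come from the spark-bigquery connector, and everything else here is an assumption for illustration.

from typing import Any, Optional

from kedro.io import AbstractDataset, DatasetError
from pyspark.sql import DataFrame, SparkSession


class GBQQueryDataset(AbstractDataset[None, DataFrame]):
    """Hypothetical sketch: run a SQL query against BigQuery via the
    spark-bigquery connector and return the result as a Spark DataFrame."""

    def __init__(
        self,
        sql: str,
        materialization_dataset: str,
        materialization_project: Optional[str] = None,
    ) -> None:
        self._sql = sql
        self._materialization_dataset = materialization_dataset
        self._materialization_project = materialization_project

    def _load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        reader = (
            spark.read.format("bigquery")
            # Query-based reads need a dataset to materialize the result into.
            .option("viewsEnabled", "true")
            .option("materializationDataset", self._materialization_dataset)
        )
        if self._materialization_project:
            reader = reader.option(
                "materializationProject", self._materialization_project
            )
        return reader.option("query", self._sql).load()

    def _save(self, data: None) -> None:
        # Read-only dataset: saving is not supported.
        raise DatasetError("GBQQueryDataset is a read-only dataset")

    def _describe(self) -> dict[str, Any]:
        return {
            "sql": self._sql,
            "materialization_dataset": self._materialization_dataset,
        }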

Development notes

To test the dataset manually:

  1. Create a GCP project (project_id)
  2. Create a test dataset inside BigQuery as <project_id>.<test_dataset>
    1. Create a test materialization dataset inside BigQuery as <project_id>.<test_mat_dataset>
  3. Create a test table inside the test dataset as <project_id>.<test_dataset>.<test_table>
  4. Create a service account with the required permissions
  5. Download the service account credentials JSON key
>>> from kedro_datasets.spark import GBQQueryDataset
>>>
>>> # Define your SQL query
>>> sql = "SELECT * FROM `<project_id>.<test_dataset>.<test_table>`"
>>>
>>> # Initialize the dataset
>>> dataset = GBQQueryDataset(
...     sql=sql,
...     materialization_dataset="your_dataset",
...     materialization_project="your_project",  # optional
...     credentials=dict(file="/path/to/your/credentials.json"),
... )
>>>
>>> # Load data
>>> df = dataset.load()
>>>
>>> # Show example output
>>> df.show()

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@abhi8893 abhi8893 changed the title feat(datasets): Implement spark.GBQQueryDataset for reading spark dataframes from BigQuery using SQL query feat(datasets): Implement spark.GBQQueryDataset for reading data from BigQuery as a spark dataframe using SQL query Dec 20, 2024
@datajoely (Contributor)

So this is a great first start - the credentials resolution looks complicated, but also well thought out.

I think we'd need to see some tests for this to go in as is; you can take some inspiration from the pandas equivalent.

That being said, we could look at contributing this to the experimental part of [kedro-datasets-experimental](https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/CONTRIBUTING.md#experimental-datasets), which has a lower threshold for admission.

@astrojuanlu (Member) left a comment

Thank you @abhi8893 , this is very promising! We will look at this soon

@abhi8893 (Author) commented Dec 20, 2024

Thanks @datajoely, @astrojuanlu

Credentials Handling

Yes, the credentials handling is a bit different from the rest of the datasets. Also, I think spark.SparkDataset does not support setting GCP service account credentials directly; rather, it always relies on setting credentials for the parsed filesystem (fsspec).

Previously, I have been reading BigQuery tables with spark.SparkDataset using the following methods of auth:

my_dataset:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: /tmp/my_table.parquet  # Just because it's a non-optional arg with type `str`
  load_args:
    table: "<my_project_id>.<my_dataset>.<my_table>"

  1. Work on a GCP resource which already has the required authorization via an assigned service account
  2. OR, download a service account JSON key and set the env var:

     export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

  3. OR, set the following as Spark conf:

     spark.hadoop.google.cloud.auth.service.account.enable: true
     spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/credentials.json

With the above dataset, I wanted to allow passing credentials directly to the dataset. But it seems we may have to standardize this a bit across the other Kedro datasets for this GCP case.
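
For illustration, one possible way a credentials dict could be mapped onto the connector (just a sketch: credentialsFile and the base64 credentials option are spark-bigquery connector options, but the dict keys used here are an assumed convention, not this PR's actual API):

import base64
import json

from pyspark.sql import DataFrameReader


def apply_gbq_credentials(reader: DataFrameReader, credentials: dict) -> DataFrameReader:
    """Hypothetical helper: translate a Kedro-style credentials dict into
    spark-bigquery connector options."""
    if "file" in credentials:
        # Path to a service account JSON key, readable on driver and executors.
        return reader.option("credentialsFile", credentials["file"])
    if "json" in credentials:
        # Inline key material, which the connector accepts base64-encoded.
        encoded = base64.b64encode(json.dumps(credentials["json"]).encode()).decode()
        return reader.option("credentials", encoded)
    # Otherwise fall back to application default credentials (methods 1 and 2 above).
    return reader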

Implementing tests

Let me take a look at how tests can be implemented for this. Initial thoughts: since this dataset doesn't use a BigQuery client directly, the mocking approach used for pandas.GBQQueryDataset may not be relevant here.
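
One option might be to patch the SparkSession used by the dataset and assert on the reader calls instead of mocking a BigQuery client. A rough sketch only (the module path and constructor are assumed from the example above; mocker is the pytest-mock fixture):

from kedro_datasets.spark import GBQQueryDataset

SQL = "SELECT * FROM `project.dataset.table`"


def test_load_reads_bigquery_with_query(mocker):
    # Patch the SparkSession used inside the dataset so no real Spark or
    # BigQuery connection is needed (module path assumed for illustration).
    spark_mock = mocker.patch(
        "kedro_datasets.spark.spark_gbq_dataset.SparkSession"
    )
    reader = spark_mock.builder.getOrCreate.return_value.read

    dataset = GBQQueryDataset(sql=SQL, materialization_dataset="test_mat_dataset")
    dataset.load()

    # The dataset should request the bigquery format for the read.
    reader.format.assert_called_once_with("bigquery")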

Moving to experimental

If we decide to move this to experimental, let me know and I'll lift and shift it to the kedro_datasets_experimental namespace :)
