[SPARK-52224][CONNECT][PYTHON] Introduce pyyaml as a dependency for the Python client
### What changes were proposed in this pull request?
Introduces pyyaml as a dependency for the Python client. When `pip install`-ing the PySpark client, pyyaml will be installed along with it.
### Why are the changes needed?
The pipeline spec file described in the [Declarative Pipelines SPIP](https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0) expects data in a YAML format. YAML is superior to the alternatives for a few reasons:
- Unlike the flat files that are used for [spark-submit confs](https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file), it supports the hierarchical data required by the pipeline spec.
- It's much more user-friendly to author than JSON.
- It's consistent with the config files used for similar tools, like dbt.
The Declarative Pipelines CLI will be a Spark Connect Python client, and thus require a Python library for loading YAML. The pyyaml library is an extremely stable dependency. The `safe_load` function that we'll use to load YAML files was introduced more than a decade ago.
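As a rough illustration of how the client might load such a spec, here is a minimal sketch using `yaml.safe_load`. The spec fields shown are hypothetical placeholders, not the actual schema defined by the SPIP:

```python
import yaml

# Hypothetical pipeline spec; these field names are illustrative only,
# not the real schema from the Declarative Pipelines SPIP.
spec_text = """
name: my_pipeline
definitions:
  - glob:
      include: transformations/sales.py
configuration:
  spark.sql.shuffle.partitions: "8"
"""

# safe_load parses YAML into plain Python objects (dicts, lists, strings)
# without constructing arbitrary Python objects, unlike yaml.load.
spec = yaml.safe_load(spec_text)
print(spec["name"])  # my_pipeline
print(spec["definitions"][0]["glob"]["include"])  # transformations/sales.py
```

Note how the nested mappings and lists come back as ordinary dicts and lists, which is the hierarchical structure that flat `spark-submit` conf files cannot express.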
### Does this PR introduce _any_ user-facing change?
Yes – users who `pip install` the PySpark client library will see the pyyaml library installed.
### How was this patch tested?
- Made a clean virtualenv
- Ran `pip install python/packaging/client`
- Confirmed that I could `import yaml` in a Python shell
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #50944 from sryza/yaml-dep.
Authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>