[SPARK-52224][CONNECT][PYTHON] Introduce pyyaml as a dependency for the Python client
### What changes were proposed in this pull request?
Introduces pyyaml as a dependency for the Python client. When `pip install`-ing the PySpark client, pyyaml will be installed along with it.
### Why are the changes needed?
The pipeline spec file described in the [Declarative Pipelines SPIP](https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0) expects data in a YAML format. YAML is superior to the alternatives for a few reasons:
- Unlike the flat files that are used for [spark-submit confs](https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file), it supports the hierarchical data required by the pipeline spec.
- It's much more user-friendly to author than JSON.
- It's consistent with the config files used for similar tools, like dbt.
The Declarative Pipelines CLI will be a Spark Connect Python client, and thus require a Python library for loading YAML. The pyyaml library is an extremely stable dependency. The `safe_load` function that we'll use to load YAML files was introduced more than a decade ago.
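As a rough illustration of how the client might load such a spec, here is a minimal sketch using `yaml.safe_load`. The spec fields shown are hypothetical placeholders, not the actual schema defined by the SPIP:

```python
import yaml

# Hypothetical pipeline spec; these field names are illustrative only,
# not the real schema from the Declarative Pipelines SPIP.
spec_text = """
name: my_pipeline
definitions:
  - glob:
      include: transformations/sales.py
configuration:
  spark.sql.shuffle.partitions: "8"
"""

# safe_load parses YAML into plain Python objects (dicts, lists, strings)
# without constructing arbitrary Python objects, unlike yaml.load.
spec = yaml.safe_load(spec_text)
print(spec["name"])  # my_pipeline
print(spec["definitions"][0]["glob"]["include"])  # transformations/sales.py
```

Note how the nested mappings and lists come back as ordinary dicts and lists, which is the hierarchical structure that flat `spark-submit` conf files cannot express.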
### Does this PR introduce _any_ user-facing change?
Yes – users who `pip install` the PySpark client library will see the pyyaml library installed.
### How was this patch tested?
- Made a clean virtualenv
- Ran `pip install python/packaging/client`
- Confirmed that I could `import yaml` in a Python shell
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #50944 from sryza/yaml-dep.
Authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>