In this repo I show how to build a Docker image running Spark and PySpark that is compatible with the official Spark operator.
Please also see my other repository where I show how to deploy the operator.
- python 3.9
- poetry 1.1.7
- docker-desktop
- make
```bash
make build-container-image
```
For M1 (Apple Silicon), override the Docker build command:

```bash
make build-container-image DOCKER_BUILD="buildx build --platform linux/amd64"
```
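Under the hood the target presumably just forwards `DOCKER_BUILD` to the docker CLI, so on Apple Silicon the invocation expands to something like the following (the image name and tag are illustrative placeholders, not necessarily what the Makefile uses):

```bash
# Approximate expansion of the make target on Apple Silicon.
# The tag "datapains-spark:latest" is a placeholder assumption.
docker buildx build --platform linux/amd64 -t datapains-spark:latest .
```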
I have defined a Docker image which uses:
- Python 3.9
- Spark version 4.0.0-preview
- Delta version 4.0.0rc1
- Scala version 2.13
- Java version 17
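For orientation, here is a minimal sketch of how such an image could be put together. The base image tag, paths, and poetry installation steps are assumptions for illustration only, not the repo's actual Dockerfile:

```Dockerfile
# Illustrative sketch -- the base tag below is an assumption, not the repo's pinned image.
FROM spark:4.0.0-preview1-scala2.13-java17-python3-ubuntu

USER root
WORKDIR /opt/spark/work-dir

# Install poetry and build the project virtualenv inside the image.
RUN pip install --no-cache-dir poetry
COPY pyproject.toml poetry.lock ./
RUN poetry install --no-root

# Ship the modified entrypoint described in the note below.
COPY tools/scripts/entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]
```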
NOTE! The tools/scripts/entrypoint.sh has been modified to set up PySpark to use the poetry environment inside the Docker image.
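Judging from the trace below, the relevant addition looks roughly like this (a reconstruction from the `set -x` output, not the verbatim script):

```bash
# Resolve the poetry virtualenv baked into the image and point PySpark at its interpreter.
cd /opt/spark/work-dir
export PYSPARK_PYTHON="$(poetry show -v | head -n1 | cut -d ' ' -f 3)/bin/python"
export PYSPARK_DRIVER_PYTHON="$(poetry show -v | head -n1 | cut -d ' ' -f 3)/bin/python"
cd -
```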
```bash
make local-pyspark-shell
```
- You will see how the entrypoint works in action.
- The shell starts and you can play around with the Spark distribution without having to set it up on your local machine; instead it runs in a shell inside the Docker image.
```
+ cd /opt/spark/work-dir
++ poetry show -v
++ cut -d ' ' -f 3
++ head -n1
+ export PYSPARK_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
+ PYSPARK_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
++ poetry show -v
++ head -n1
++ cut -d ' ' -f 3
+ export PYSPARK_DRIVER_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
+ PYSPARK_DRIVER_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
+ cd -
/
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z x ']'
+ export PYSPARK_PYTHON
+ '[' -z x ']'
+ export PYSPARK_DRIVER_PYTHON
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='opt/spark/conf::opt/spark/jars/*'
+ case "$1" in
+ echo 'Non-spark-on-k8s command provided, proceeding in pass-through mode...'
Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ exec /usr/bin/tini -s -- pyspark --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/10 08:28:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.1
/_/
Using Python version 3.9.2 (default, Feb 28 2021 17:03:44)
Spark context Web UI available at http://f6c40b7da839:4040
Spark context available as 'sc' (master = local[*], app id = local-1720600130530).
SparkSession available as 'spark'.
>>>
```
Quick example:
```python
>>> from delta.tables import DeltaTable
>>> data = [[1, ("Alice", "Smith", 29)], [2, ("Bob", "Brown", 40)], [3, ("Charlie", "Johnson", 35)]]
>>> columns = ["id", "data"]
>>> df = spark.createDataFrame(data, columns)
>>>
```
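To confirm that the Delta integration works end to end, you can round-trip the DataFrame through a Delta table. The /tmp/delta-example path is just an arbitrary location inside the container:

```python
>>> df.write.format("delta").mode("overwrite").save("/tmp/delta-example")
>>> DeltaTable.forPath(spark, "/tmp/delta-example").toDF().show()
```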
Please go to my argo workflow repo to see how I deploy an example job utilising the Spark operator and this re-usable base image.