People have asked me several times how to set up a clean code organization for a Python project with PySpark. I didn't find a fully featured example project, so this is my attempt at one. It also includes a simple integration with Jupyter Notebook inside the project.
- https://mungingdata.com/pyspark/chaining-dataframe-transformations/
- https://medium.com/albert-franzi/the-spark-job-pattern-862bc518632a
- https://pawamoy.github.io/copier-poetry/
- https://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
All you need is the following configuration already installed:
- Git
- The project was tested with Python 3.10.13 managed by pyenv:
  - Use the `make pyenv` goal to launch the automated install of pyenv
- `JAVA_HOME` environment variable configured with a Java JDK 11
- `SPARK_HOME` environment variable configured with the Spark version `spark-3.5.2-bin-hadoop3` package
- `PYSPARK_PYTHON` environment variable configured with "python3.10"
- `PYSPARK_DRIVER_PYTHON` environment variable configured with "python3.10"
- Install Make to run the `Makefile`
- Why Python 3.10? Because PySpark 3.5.2 doesn't seem to work with Python 3.11 at the moment (I haven't tried with Python 3.12)
- pyenv prerequisites for Ubuntu. Check the prerequisites for your OS.
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
- pyenv installed and available in path (see pyenv installation with Prerequisites)
- Install Python 3.10 with pyenv on homebrew/linuxbrew
CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10
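The environment variables listed in the prerequisites are easy to forget on a new machine. As a minimal sketch (the helper name `missing_env_vars` is hypothetical and not part of the project), a quick check before running anything could look like this:

```python
import os

# Variable names come from the prerequisites list; the helper itself is
# a hypothetical illustration, not something shipped with the project.
REQUIRED_VARS = (
    "JAVA_HOME",
    "SPARK_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
)


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All PySpark-related environment variables are set.")
```

The check only verifies that the variables are set. It does not validate that they point at a working JDK, Spark distribution, or Python interpreter.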
- Auto format via IDE https://github.com/psf/black#pycharmintellij-idea
- [Optional] You could set up a pre-commit hook to enforce Black formatting before each commit https://github.com/psf/black#version-control-integration
- Or remember to run `black .` to apply the Black formatting rules to all sources before committing
- Add integration with Jenkins: it will complain and tests will fail if Black formatting is not applied
- Add same mypy option for vscode in `Preferences: Open User Settings`
- Use the option to lint/format with black and flake8 on editor save in vscode
Check optional types with Mypy (PEP 484)
Configure Mypy to help with type annotations/hints in Python code. It's very useful in the IDE and for catching errors/bugs early.
- Install mypy plugin for intellij
- Adjust the plugin with the following options:
"--follow-imports=silent", "--show-column-numbers", "--ignore-missing-imports", "--disallow-untyped-defs", "--check-untyped-defs"
- Documentation: Type hints cheat sheet (Python 3)
- Add same mypy option for vscode in `Preferences: Open User Settings`
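To illustrate what the `--disallow-untyped-defs` option listed above asks for, here is a small sketch. The function `normalize_name` is a hypothetical example and does not come from the project:

```python
from typing import Optional


def normalize_name(raw: Optional[str]) -> str:
    """Return a trimmed, lower-cased name, or an empty string for None."""
    if raw is None:
        return ""
    return raw.strip().lower()


# With --disallow-untyped-defs, mypy reports an error for a definition
# like the one below, because neither the parameter nor the return value
# is annotated:
#
#     def normalize_name(raw):
#         return raw.strip().lower()
```

The annotated version also makes the `None` case explicit, which is the kind of bug the type checker helps catch early.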
- isort is the default on pycharm
- isort with vscode
- Lint/format/sort import on save with vscode in `Preferences: Open User Settings`:
{
    "editor.formatOnSave": true,
    "python.formatting.provider": "black",
    "[python]": {
        "editor.codeActionsOnSave": {
            "source.organizeImports": true
        }
    }
}
- isort configuration for pycharm. See Set isort and black formatting code in pycharm
- You can use the `make lint` command to check flake8/mypy rules and automatically apply black and isort formatting to the code with the previous configuration
isort .
- Show a way to handle a malformed JSON file such as `data/pubmed.json`
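As an illustration of handling a malformed JSON file, here is a minimal sketch in plain Python. It is not the project's actual implementation: the README does not say how `data/pubmed.json` is broken, so this example assumes the only defect is a trailing comma before a closing bracket or brace, and `load_lenient_json` is a hypothetical helper name.

```python
import json
import re

# Assumption: the file is invalid only because of a trailing comma
# before a closing ] or }. Other kinds of damage are not handled.
_TRAILING_COMMA = re.compile(r",\s*([\]}])")


def load_lenient_json(text: str):
    """Parse JSON, retrying once with trailing commas removed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Caveat: this regex would also rewrite a ",]" or ",}" sequence
        # that appears inside a string value.
        cleaned = _TRAILING_COMMA.sub(r"\1", text)
        return json.loads(cleaned)


# A record list with a stray comma after the last element.
records = load_lenient_json('[{"id": 1, "title": "a"},\n]')
print(records)
```

Valid input takes the fast path and is never rewritten. If the cleaned text is still invalid, the second `json.loads` call raises `json.JSONDecodeError` as usual.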
- Create a poetry env with python 3.10
poetry env use 3.10
- Install pyenv
make pyenv
- Install dependencies in poetry env (virtualenv)
make deps
- Lint & Test
make build
- Lint, Test & Run
make run
- Run dev
make dev
- Build binary/Python wheel
make dist
poetry run drugs_gen --help
Usage: drugs_gen [OPTIONS]

Options:
  -d, --drugs TEXT             Path to drugs.csv
  -p, --pubmed TEXT            Path to pubmed.csv
  -c, --clinicals_trials TEXT  Path to clinical_trials.csv
  -o, --output TEXT            Output path to result.json (e.g
                               /path/to/result.json)
  --help                       Show this message and exit.
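The help output above describes four path options. As an illustration only, an equivalent parser can be sketched with the standard library's `argparse`. The library the project actually uses and its internal code are not shown in this README, and `build_parser` is a hypothetical name:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in the help output."""
    parser = argparse.ArgumentParser(prog="drugs_gen")
    parser.add_argument("-d", "--drugs", help="Path to drugs.csv")
    parser.add_argument("-p", "--pubmed", help="Path to pubmed.csv")
    parser.add_argument(
        "-c", "--clinicals_trials", help="Path to clinical_trials.csv"
    )
    parser.add_argument(
        "-o",
        "--output",
        help="Output path to result.json (e.g. /path/to/result.json)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example
# behaves the same wherever it is run.
args = build_parser().parse_args(["--drugs", "drugs.csv", "-o", "result.json"])
print(args.drugs, args.output)
```

Options that are not passed default to `None`, so the calling code can decide whether a missing path is an error or should fall back to a default location.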
- Use `spark-submit` with the Python wheel file built by the `make dist` command in the `dist` folder.