This repository provides a local experimental environment for data lakes and mock blob storage, built on PySpark and a Spark cluster. It lets you mimic Blob Storage locally and manage it from a Jupyter Notebook connected to a Spark cluster, closely emulating a real but simple environment.
This setup uses `mvn` to pull artefacts and their transitive dependencies for Spark (e.g. Databricks Delta Lake, used as the example in this template) directly into Spark's jars directory, so Spark itself never needs to make network requests. This makes it a useful template for CI deployment of data processing pipelines and analytics in secure or controlled settings.
Effortlessly dive in and unleash your data's potential, today!
- Mock Blob Storage: Mimics Blob Storage locally, enabling seamless integration with notebooks.
- Spark Cluster: Configured with Docker containers for distributed computing tasks and large-scale dataset processing. Dependencies are managed via the `infra-data-lake` pom file and pulled into the repository via the `mvn`-based bash script `get_spark_deps.sh`.
- PySpark Notebooks: Jupyter notebooks for interactive data exploration and analysis. These run in driver or cluster mode. There's an issue ticket open to implement Client/Cluster asynchronous programming (#3), describing the tools needed to enable this.
- CI/CD Health Checks: Implemented using bash, GitHub Actions, and Docker Compose, the CI health checks ensure services are built, up, and healthy before merging to the protected main branch.
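Because the Delta Lake jars are fetched ahead of time, a notebook session only has to switch Delta on through configuration. The following is a minimal sketch rather than the repository's own setup code: it assumes PySpark is installed in the notebook environment and the jars are already on Spark's classpath, and `build_session`, the app name, and the master URL in the usage note are hypothetical. The two configuration keys are the ones Delta Lake documents for enabling its SQL extension and catalog.

```python
from typing import Dict

# Delta Lake's documented Spark settings. The jars themselves are assumed to be
# present on Spark's classpath already, so no package download is requested here.
DELTA_CONF: Dict[str, str] = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_session(master_url: str, app_name: str = "delta-demo"):
    """Create (or reuse) a Spark session with Delta Lake enabled.

    `master_url` is whatever address your Spark master exposes; it is not
    taken from this repository.
    """
    # Imported lazily so DELTA_CONF can be inspected without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master_url).appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook this might be called as `spark = build_session("spark://spark-master:7077")`, where the host name is a placeholder for your own cluster's master address.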
Use `make` or follow these steps to set up the environment via Just:
- Clone this repository.
- Ensure Docker is installed.
- Install just.
- Run `just deploy`.
- Access Jupyter at `http://localhost:8890` with token `canttouchthis`.
- Start experimenting with data lakes, mock blob storage, and PySpark notebooks!
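After deploying, the containers can take a moment to come up. The sketch below is one way to wait for Jupyter from Python, in the spirit of a health check. It is not the repository's CI script: `jupyter_url`, `is_up`, and `wait_until` are hypothetical helpers, the retry counts are arbitrary, and passing the token as a `?token=` query parameter is assumed to be enabled, as it is in a default Jupyter server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def jupyter_url(token: str, host: str = "localhost", port: int = 8890) -> str:
    """Build the Jupyter URL with the token passed as a query parameter."""
    return f"http://{host}:{port}/?token={token}"


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 (after any redirects)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until(check: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_until(lambda: is_up(jupyter_url("canttouchthis")))` returns `True` once Jupyter responds, or `False` if it never does within the allotted attempts.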
- `infra-data-lake/localhost`: Delta Lake and notebooks for local connectivity.
- `infra-mock-blob-storage`: Local mock for Blob Storage.
- `notebook-data-lake`: Contains notebooks for data exploration and analysis.
Commands should be run from the root of the repository or using Just.
Customize the template for your specific requirements and use cases. Since everything is hard-coded for the moment, you will probably want to find and replace the term `orgname` to suit your needs.
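If you would rather script the find-and-replace than do it by hand, something like the following could work. It is only a sketch: `replace_term` is a hypothetical helper, it assumes the files of interest are UTF-8 text, it skips `.git` by default, and it does not rename files or directories whose names contain the term. Try it on a clean checkout so the changes are easy to review or revert.

```python
from pathlib import Path
from typing import List, Sequence


def replace_term(root: Path, old: str, new: str,
                 skip_dirs: Sequence[str] = (".git",)) -> List[Path]:
    """Replace `old` with `new` in every UTF-8 text file under `root`.

    Returns the list of files that were modified.
    """
    changed: List[Path] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Ignore anything inside a skipped directory such as .git.
        if any(part in skip_dirs for part in path.relative_to(root).parts):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable files (e.g. jars) are left alone
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `replace_term(Path("."), "orgname", "your-org")` from the repository root returns the files it touched, with `your-org` standing in for your own name.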
Happy Coding! ✨