Technical assignment with Jupyter Notebooks using ETL Frameworks PySpark or Pandas

A setup with Jupyter, Spark, H2O.ai AutoML and the client libraries Delta.io, PyArrow and Kafka. Ideal for exploring data with Jupyter and related tools in Docker (Ubuntu 18.04 LTS) or Windows 10 WSL (Windows Subsystem for Linux).

Setup Options

Option 1

 Run using Docker Desktop on your laptop

   ./install-container.sh

Option 2

Run with Docker the container registered in the datascience-fullstack repo

 Pull and run the Docker container with the iptables port-opening process

    ~ mac-u$ docker run  -it -p 9003:9003 -p 54321:54321 --cap-add=NET_ADMIN --name jupyter_container  gftjoao/datascience-fullstack-v1.0:latest

    
 Run docker exec to start Jupyter after the setup above finishes

    ~ mac-u$ docker exec -it jupyter_container /bin/bash -c "cd ; source .bashrc ; bash -x start-jupyter.sh ; sleep 4 ; cat notebooks/jupyter.log ; sleep infinity"
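The docker exec step only works once the container started by the docker run step is up. A minimal sketch of a guard for that, assuming the container name jupyter_container used above; the helper name container_is_running is hypothetical and not part of the repository:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): succeeds only when a
# container with exactly the given name is currently running.
container_is_running() {
  # `docker ps` lists running containers only; --format prints one name per line.
  # --filter name=... matches substrings, so grep -x enforces an exact match.
  docker ps --filter "name=$1" --format '{{.Names}}' | grep -Fxq -- "$1"
}

# Usage sketch: start Jupyter only if the container is running.
# if container_is_running jupyter_container; then
#   docker exec -it jupyter_container /bin/bash -c "cd ; source .bashrc ; bash -x start-jupyter.sh"
# else
#   echo "jupyter_container is not running; run the docker run step first" >&2
# fi
```

Because docker run -it stays in the foreground, the check and the docker exec command would typically be run from a second terminal.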

Option 3

Run setup in Windows 10 with the WSL app Ubuntu 18.04 LTS

 Clone the package

    ~ wsl-u$  cd ; git clone https://github.com/jpacerqueira/technical-test-Jupyter-Spark-Delta-Pandas.git


 Execute installation

    ~ wsl-u$  cd ; cp project_jupyter_container/library_tools/*.sh . ; bash -x anaconda_setup.sh

Issues and Workarounds

Issue 1

It is recommended to default to OpenJDK 8 rather than OpenJDK 11 or Oracle Java 8 (which comes with an Oracle license)

 - Installation : https://www.linuxuprising.com/2019/02/install-any-oracle-java-jdk-version-in.html
 - This installation resolved my issue here : https://github.com/jupyter/jupyter/issues/248
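To confirm which Java version is the default before starting Jupyter, the banner printed by java -version can be parsed. The sketch below is only an illustration; the function name java_major_version is hypothetical, and it assumes the usual banner format in which the version string appears in double quotes:

```shell
#!/bin/sh
# Hypothetical helper: print the Java major version found in a `java -version` banner.
# Java 8 and older report "1.8.0_292"; Java 9 and newer report "11.0.11".
java_major_version() {
  # Keep only the text between the first pair of double quotes.
  version=$(printf '%s\n' "$1" | sed -n 's/^[^"]*"\([^"]*\)".*$/\1/p' | head -n 1)
  case "$version" in
    "")  return 1 ;;                              # no quoted version found
    1.*) version=${version#1.}                    # "1.8.0_292" -> "8.0_292"
         printf '%s\n' "${version%%.*}" ;;        # -> "8"
    *)   printf '%s\n' "${version%%[._-]*}" ;;    # "11.0.11" -> "11"
  esac
}

# `java -version` writes its banner to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
  major=$(java_major_version "$(java -version 2>&1)") || major=unknown
  if [ "$major" = 8 ]; then
    echo "Java 8 is the default"
  else
    echo "Default Java is ${major}; Java 8 is recommended"
  fi
fi
```

On Ubuntu, when several JDKs are installed, sudo update-alternatives --config java is one way to switch the default.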

Issue 2

Mount an additional drive in Docker containers for the optional Oracle Java 8

 - Your local container may need to mount large files (Git LFS), such as the Oracle Java installer

Issue 3

If the Jupyter tools (Spark + H2O.ai + delta_lake:0.3) are not responsive on first use, please re-install in the following order:

    stop-jupyter.sh ; install-jupyter-support-packs.sh ; start-jupyter.sh

    (base) notebookuser@1662e83c8269:~$ pwd
    /home/notebookuser
    (base) notebookuser@1662e83c8269:~$ ls
    anaconda3  install-jupyter-support-packs.sh  java  knode_ds.err  knode_ds.out  library_tools  notebooks  python-additional-libraries  spark  start-jupyter.sh  stop-jupyter.sh
    (base) notebookuser@1662e83c8269:~$ netstat -anp | grep 9003
    (base) notebookuser@1662e83c8269:~$ bash -x stop-jupyter.sh 
    (base) notebookuser@1662e83c8269:~$ bash -x install-jupyter-support-packs.sh 
    (base) notebookuser@1662e83c8269:~$ bash -x start-jupyter.sh 
    (base) notebookuser@1662e83c8269:~$ tail -n 25 notebooks/jupyter.log
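The same stop, install, start order can be wrapped in a small function so that a failing step halts the sequence. This is a sketch only: the function name reinstall_jupyter is hypothetical, and it assumes each of the three scripts exits with a non-zero status on failure, which the repository does not document.

```shell
#!/bin/sh
# Hypothetical wrapper: run the three scripts in the order given above and
# stop at the first one that is missing or exits with a non-zero status.
reinstall_jupyter() {
  dir=${1:-$HOME}   # directory holding the scripts (/home/notebookuser in the session above)
  for script in stop-jupyter.sh install-jupyter-support-packs.sh start-jupyter.sh; do
    if [ ! -f "$dir/$script" ]; then
      echo "missing: $dir/$script" >&2
      return 1
    fi
    echo "running $script"
    # Subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && bash "./$script" ) || return 1
  done
}

# Usage sketch, inside the container:
# reinstall_jupyter "$HOME" && tail -n 25 "$HOME/notebooks/jupyter.log"
```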

Licensing

Our "jupyter_container" Docker container with Jupyter, SparkML and H2O.ai AutoML is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.

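Apache Spark, Apache Arrow (including PyArrow) and H2O.ai (both the h2o.jar and the PyPI h2o package) are licensed under the Apache License 2.0. OpenJDK 8 is licensed under the GNU GPL v2 with the Classpath Exception, and Python under the Python Software Foundation License.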

The end product would benefit from a DevOps engineer with experience in docker-compose or Terraform; feel free to contribute.

Useful for experimenting with the latest frameworks:

Delta.io open-source Delta Lake

Delta Lake sink with readStream and writeStream

Spark Structured Streaming programming with Delta Lake

If you are a proficient Data Engineer/Scientist, use it as it is, improve it, fix it, share it back!

