By following the steps listed in this repository you can build a Docker image that simulates an EMR cluster used for ETL tasks by Mediaset Business Digital. The main feature of the built image is the ability to use the AWS Glue Data Catalog as an external Hive Metastore.
The final Docker image contains:
- Python 3.8
- Spark 3.1.2
- Hadoop 3.3
- Hive 2.3.7
Before you start: install Docker.
Important note: if you are a macOS user, go to the Docker preferences, select Resources, and under the Advanced section increase the RAM to at least 4 GB.
- Build the Docker image `mediaset-spark-aws-glue-demo-builder`:

  ```bash
  make build-spark
  ```

  When the build is complete you will find a Spark bundle artifact in the `./dist` directory.
- Build the final dev environment Docker image, called `mediaset-spark-aws-glue-demo:python3.8-spark3.1.2`:

  ```bash
  make build-dev-env
  ```
- Before using the image, configure the Glue Data Catalog by adding the following section to `./conf/hive-site.xml`:

  ```xml
  <property>
    <name>hive.metastore.glue.catalogid</name>
    <value>YOUR_AWS_ACCOUNT_ID</value>
  </property>
  ```

  Now you are ready to locally develop Spark jobs that query Glue Data Catalogs, using the Docker image `mediaset-spark-aws-glue-demo:python3.8-spark3.1.2`.
- Launch a standalone Docker container after setting the correct AWS credentials (also add `-e AWS_SESSION_TOKEN=YOUR_AWS_SESSION_TOKEN` if you need to assume a specific role):

  ```bash
  docker run -it --rm \
    -p 4040:4040 \
    -v /PROJECT_PATH/conf/hive-site.xml:/opt/spark/conf/hive-site.xml \
    -e AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY \
    -e AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY \
    --name spark-env \
    mediaset-spark-aws-glue-demo:python3.8-spark3.1.2 \
    bash
  ```
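  If you work with temporary credentials, one way to obtain them is through AWS STS; below is a minimal boto3 sketch (the role ARN and session name are hypothetical placeholders, not part of this project):

  ```python
  import boto3

  # Assume an IAM role that grants access to the Glue Data Catalog.
  # The ARN and session name below are hypothetical placeholders.
  sts = boto3.client("sts")
  creds = sts.assume_role(
      RoleArn="arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/YOUR_ROLE_NAME",
      RoleSessionName="spark-glue-demo",
  )["Credentials"]

  # These values map onto the -e flags of the docker run command above.
  print(f"AWS_ACCESS_KEY_ID={creds['AccessKeyId']}")
  print(f"AWS_SECRET_ACCESS_KEY={creds['SecretAccessKey']}")
  print(f"AWS_SESSION_TOKEN={creds['SessionToken']}")
  ```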
- Open the Spark shell to verify the ability to connect to the Glue Data Catalog:

  ```bash
  # pyspark
  >>> spark.sql("show databases").show()
  >>> spark.sql("show tables in DB_NAME").show()
  ```
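  Once the catalog is reachable, you can also load a Glue table straight into a DataFrame; in this sketch `DB_NAME` and `TABLE_NAME` are placeholders for a database and table that exist in your catalog:

  ```python
  >>> df = spark.table("DB_NAME.TABLE_NAME")  # placeholder names
  >>> df.printSchema()                        # schema resolved from the Glue Data Catalog
  >>> df.show(10)                             # reads the underlying data, e.g. from S3
  ```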
- While keeping pyspark running, open a web browser at http://localhost:4040 to check the Spark Web UI.
- Open PyCharm Professional and import the project.
- Under File, choose Settings... (on macOS, under PyCharm, choose Preferences).
- Under Settings, choose Project Interpreter. Click the gear icon and choose Show All... from the drop-down menu.
- Choose the + icon and create a new Docker interpreter, selecting the image `mediaset-spark-aws-glue-demo:python3.8-spark3.1.2`, then press OK.
- Edit the Run/Debug Configurations of the project to properly launch the Docker image.
- Set the Script path to the path of the `main.py` module contained in the project.
- In Environment Variables add `AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY;AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY`. Also add `AWS_SESSION_TOKEN=YOUR_AWS_SESSION_TOKEN` if you need to assume a specific role.
- In Docker container settings add the following bindings in the Volume bindings section, replacing PROJECT_PATH with the project location on your computer:

  | Host path | Container path |
  | --- | --- |
  | /PROJECT_PATH/spark-events | /tmp/spark-events |
  | /PROJECT_PATH/conf/log4j.properties | /opt/spark/conf/log4j.properties |
  | /PROJECT_PATH/conf/hive-site.xml | /opt/spark/conf/hive-site.xml |

  Note: to dynamically configure a different Glue Data Catalog without rebuilding the Docker image, update the following section in `./conf/hive-site.xml` within the project folder:

  ```xml
  <property>
    <name>hive.metastore.glue.catalogid</name>
    <value>YOUR_AWS_ACCOUNT_ID</value>
  </property>
  ```
- Press the Run button.
- The Run console will show the output of the `main.py` module, which should list the databases available in the provided Glue Data Catalog.
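  For reference, here is a minimal sketch of what such a `main.py` could look like (the app name is a placeholder; the actual module in the project may differ):

  ```python
  from pyspark.sql import SparkSession


  def main():
      # enableHiveSupport() routes metastore calls through the Hive client,
      # which this image configures to use the AWS Glue Data Catalog.
      spark = (
          SparkSession.builder
          .appName("glue-catalog-demo")  # placeholder app name
          .enableHiveSupport()
          .getOrCreate()
      )

      # List the databases available in the configured Glue Data Catalog.
      spark.sql("show databases").show()

      spark.stop()


  if __name__ == "__main__":
      main()
  ```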
The level of the loggers displayed in the PyCharm console can be controlled by modifying the `conf/log4j.properties` file in your project folder.
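If you only need to tune Spark's own verbosity, you can also change the log level at runtime from the job itself instead of editing the file:

```python
# Adjust the log level of the active SparkContext at runtime;
# valid levels include "DEBUG", "INFO", "WARN" and "ERROR".
spark.sparkContext.setLogLevel("WARN")
```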
Each job launched in the Docker image will store its Spark event logs in `/tmp/spark-events` inside the container. You can bind this location to a directory on the host computer to persist the Spark logs and later review them with the Spark History Server.
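To double-check where a running session writes its event logs, you can query the standard Spark settings from pyspark (the expected values are assumptions about this image's configuration):

```python
# Standard Spark event-log settings; the fallback "unset" is returned
# if the key is not configured in the running session.
print(spark.conf.get("spark.eventLog.enabled", "unset"))  # expected: true
print(spark.conf.get("spark.eventLog.dir", "unset"))      # expected: /tmp/spark-events
```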
To launch a persistent Spark History Server that can read all the generated Spark event logs, run an independent Docker container with the following command (replace PROJECT_PATH with the path on your local computer):
```bash
docker run -it --rm \
  -p 18080:18080 \
  -v /PROJECT_PATH/spark-events:/tmp/spark-events \
  --name spark-history \
  mediaset-spark-aws-glue-demo:python3.8-spark3.1.2
```
Open a web browser at http://localhost:18080 to see all the generated Spark logs.