This is my answer to the "Data Engineer" test.
Please contact me if you have any questions about the code or the environment; I am happy to explain more details. I use Docker
to minimize the chance of incompatibilities in the environment, but Docker
can behave differently on different platforms, which I cannot fully predict and test. I have put a lot of effort into this, and I do not want to lose the chance to get the job. Thank you!
- docker
- docker-compose
- Ubuntu 20.04 in WSL2
- Clone the repository
- Run `docker-compose up`
- You will see the output in the terminal
- To see the results, run the following bash commands. You can interact with the database from inside the container in which the Python program runs.
# You are in the host machine running the docker-compose now
ssh root@localhost -p 4000 # password: passwd
# You are in the container now
PGPASSWORD=postgres psql -d postgres -U postgres -p 5432 -h postgres -c "SELECT * FROM user_logins"
- Use `Ctrl+C` to stop the whole docker-compose setup as well as the Python program
  - It takes several seconds for the program to clean up the queue and exit
- Unfortunately, there is a long-standing problem with docker-compose: it stops recording logs after the program exits, so additional information printed during the clean-up stage is not shown
- Python 3.10
- pip
- awscli-local
- awscli
- psycopg2
- cryptography
- Clone the repository
- Run with `python3 main.py --local`, since the docker-compose version redirects the network to the localstack container. You will see the output in the terminal
- Use `Ctrl+C` to stop the program
  - It takes several seconds for the program to clean up the queue and exit
  - In a local environment, you can see the clean-up logs after `Ctrl+C` is pressed
- To see the results, run the following command:
PGPASSWORD=postgres psql -d postgres -U postgres -p 5432 -h localhost -c "SELECT * FROM user_logins"
- I create a thread to read the messages; see the sketch below this list
  - The thread executes the `awslocal sqs receive-message` command through bash using `subprocess` to read the messages
  - It uses the network name `localstack` to connect to the localstack container in the docker-compose setup
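A minimal sketch of that reader thread, assuming a hypothetical localstack queue URL and a shutdown flag (the actual URL, parsing, and shutdown handling in `main.py` may differ):

```python
import json
import subprocess
import threading

QUEUE_URL = "http://localstack:4566/000000000000/login-queue"  # assumed URL, adjust to the real queue

def read_messages(shutdown_flag: threading.Event, out_queue) -> None:
    """Poll SQS through the awslocal CLI and push message bodies onto a queue."""
    while not shutdown_flag.is_set():
        # Run the CLI command described above via subprocess and capture its JSON output
        result = subprocess.run(
            ["awslocal", "sqs", "receive-message", "--queue-url", QUEUE_URL],
            capture_output=True, text=True,
        )
        if result.returncode != 0 or not result.stdout.strip():
            continue  # nothing received or the call failed; poll again
        response = json.loads(result.stdout)
        for message in response.get("Messages", []):
            out_queue.put(message["Body"])  # raw JSON body; parsed later by the worker
```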
- I use a `Queue` to store the unresolved information; see the illustration after this item
  - It is a first-in-first-out data structure, so the information is resolved in the order in which it was inserted
  - Meanwhile, `Queue` in Python is thread-safe, which makes it suitable for a multi-threaded environment
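A small illustration of the first-in-first-out, thread-safe behaviour of Python's `queue.Queue` (not the project's actual worker code):

```python
import queue
import threading

unresolved = queue.Queue()  # thread-safe FIFO shared between the reader and the worker

def worker() -> None:
    while True:
        record = unresolved.get()   # blocks until the reader thread puts something
        if record is None:          # sentinel value used here to stop the worker
            break
        print("processing in insertion order:", record)

t = threading.Thread(target=worker)
t.start()
for i in range(3):
    unresolved.put({"user_id": i})  # items are consumed in the order they were inserted
unresolved.put(None)
t.join()
```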
- I use the AES encryption algorithm with the same key to mask the PII data; a sketch follows below
  - It generates the same encrypted data for the same PII value and different encrypted data for different PII values, so duplicate values can still be identified after masking
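The README does not state which AES mode or padding the program uses; the sketch below assumes ECB with PKCS7 padding and a placeholder key, which is one way to obtain deterministic ciphertexts with a fixed key (the real `main.py` may differ):

```python
import base64
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = b"0" * 32  # placeholder 256-bit key; the real key is not shown here

def mask_pii(value: str) -> str:
    """Deterministically encrypt a PII value: equal inputs give equal outputs."""
    padder = padding.PKCS7(algorithms.AES.block_size).padder()
    padded = padder.update(value.encode()) + padder.finalize()
    encryptor = Cipher(algorithms.AES(KEY), modes.ECB(), backend=default_backend()).encryptor()
    return base64.b64encode(encryptor.update(padded) + encryptor.finalize()).decode()

assert mask_pii("203.0.113.7") == mask_pii("203.0.113.7")  # same PII -> same ciphertext
assert mask_pii("203.0.113.7") != mask_pii("203.0.113.8")  # different PII -> different ciphertext
```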
- I use the `psycopg2` library to connect to the database; a connection sketch is shown below
  - SQL commands are executed through the `cursor.execute()` function
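A minimal connection sketch with `psycopg2`; the credentials mirror the psql command above, but the column list for `user_logins` is assumed for illustration and may not match the real schema:

```python
import psycopg2

conn = psycopg2.connect(
    host="postgres", port=5432, dbname="postgres",
    user="postgres", password="postgres",  # same credentials as the psql command above
)
with conn, conn.cursor() as cursor:        # the connection context manager commits on success
    cursor.execute(
        "INSERT INTO user_logins (user_id, masked_ip, masked_device_id, app_version) "
        "VALUES (%s, %s, %s, %s)",         # assumed columns, adjust to the real table
        ("u-123", "bWFza2VkX2lw", "bWFza2VkX2lk", 560),
    )
conn.close()
```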
Please refer to the How to run section
As presented in the `docker-compose.yml` and the customized `Dockerfile.awslocal`, I would use docker-compose to deploy the application along with the database in production. The reason is that it is easy to use and maintain, and it is also easy to scale up the application by adding more containers.
- Introduce more unit/integration tests
  - Unfortunately, I did not have enough time to write the tests and to find a way to run them through docker-compose
- Use `argparse` to parse the arguments; a sketch follows this item
  - Pass the database information through argparse, making it possible to insert into a different database without changing the code
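A sketch of how the database information could be passed in via `argparse`; only the `--local` flag exists today, the database flags are hypothetical:

```python
import argparse

parser = argparse.ArgumentParser(description="Read SQS messages and write them to Postgres")
parser.add_argument("--local", action="store_true", help="run against the local environment")
# Hypothetical flags that would replace the hard-coded connection details
parser.add_argument("--db-host", default="localhost")
parser.add_argument("--db-port", type=int, default=5432)
parser.add_argument("--db-name", default="postgres")
parser.add_argument("--db-user", default="postgres")
args = parser.parse_args()
print(args.local, args.db_host, args.db_port)
```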
- Refactor the code to make it more readable
- Introduce better and more comprehensive error handling for different scenarios, such as when the connection to the database fails
- Introduce a better approach to persist the unresolved information left in the queue (e.g. to a file) when exiting the program
- Wrap the code into a class
- Wrap the global variables into classes/functions and remove the `global` keyword
  - Replace the global variables with `self.shutdown_flag = threading.Event()` to make it thread-safe; see the sketch below
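A sketch of that refactor: the shared state lives in a class and `Ctrl+C` sets a `threading.Event` instead of a global flag (illustrative, not the current code):

```python
import signal
import threading

class App:
    def __init__(self) -> None:
        self.shutdown_flag = threading.Event()    # replaces the global boolean flag

    def run(self) -> None:
        while not self.shutdown_flag.is_set():
            # ... read from SQS, mask PII, insert into Postgres ...
            self.shutdown_flag.wait(timeout=1.0)  # placeholder for the real work

    def stop(self, *_args) -> None:
        self.shutdown_flag.set()                  # thread-safe: any thread may call this

app = App()
signal.signal(signal.SIGINT, app.stop)            # Ctrl+C triggers a clean shutdown
app.run()
```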
- Use a better crypto library to mask the PII data
- Pass the arguments into docker-compose and the Dockerfile instead of hard-coding them
- Pin the versions of the software (e.g. Python) to avoid incompatibilities with newer versions
- Increase the number of threads or processes in the Python program
- Add more containers to the docker-compose setup to scale up the application
- Use Kubernetes to include more physical machines to scale up the application
- Introduce other technologies to make the application more scalable (e.g. Apache Spark, Apache Flink, etc.)
- Use the same key to decrypt the PII data
- The general format of the messages will not change; only the body of a message could change
- A message that is not in the standard format is useless and can be ignored
- An application version like `a.b.c` or `a.b` could be converted to the integer `a*pow(2,8) + b*pow(2,4) + c`, since the version numbers are not too large and this approach fits the value into the integer type in the database
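A small sketch of that conversion, assuming a missing third component (the `a.b` case) counts as 0:

```python
def version_to_int(version: str) -> int:
    """Convert 'a.b.c' (or 'a.b') to a*2**8 + b*2**4 + c for storage as an integer."""
    parts = [int(p) for p in version.split(".")] + [0, 0]  # pad the 'a.b' case with zeros
    a, b, c = parts[:3]
    return a * pow(2, 8) + b * pow(2, 4) + c

assert version_to_int("2.3.0") == 2 * 256 + 3 * 16 + 0
assert version_to_int("1.2") == 1 * 256 + 2 * 16
```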