This project sets up an AWS Lambda function for ETL (Extract, Transform, Load) operations using the DLT (Data Load Tool) framework.
```mermaid
graph LR
S[AWS Step Functions] -->|Schedule/Cron| B(AWS Lambda Function)
A[REST API] -->|Input| B
B -->|Extract| C{Data Processing}
C -->|Transform| D[Data Validation]
D -->|Load| E[(AWS RDS PostgreSQL)]
B -->|Log| F[CloudWatch Logs]
B -->|Metrics| G[CloudWatch Metrics]
```
- AWS Step Functions: Schedules the Lambda function execution.
- AWS Lambda Function: Runs the ETL process.
- REST API: Serves as the data source.
- AWS RDS PostgreSQL: Stores the processed data.
- AWS CloudWatch: Handles logging and metrics.
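To make the flow above concrete, here is a minimal sketch of what the handler could look like, assuming the dlt `rest_api` source and `postgres` destination; the base URL, resource name, and pipeline name are illustrative placeholders, not the actual code in this repo.

```python
import dlt
from dlt.sources.rest_api import rest_api_source


def handler(event, context):
    # Hypothetical REST API config; the real base URL and resources live in this repo's source.
    source = rest_api_source({
        "client": {"base_url": "https://api.example.com/v1/"},
        "resources": ["items"],
    })

    pipeline = dlt.pipeline(
        pipeline_name="etl_poc",      # illustrative name
        destination="postgres",       # credentials come from Secrets Manager (see below)
        dataset_name="raw",
    )
    load_info = pipeline.run(source)  # extract -> normalize -> load into RDS PostgreSQL
    return {"statusCode": 200, "body": str(load_info)}
```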
This repo is a serverless application that can be deployed with the Serverless Application Model (SAM) CLI. It also uses Docker to build the Lambda function image and push it to AWS ECR.

The Python code is based on the dlt REST API source and the PostgreSQL destination, and uses Rye to package the dependencies.

You will need to stand up an RDS PostgreSQL instance and a Lambda function with VPC access to that instance; these can be created manually or with AWS CloudFormation. For the POC, and to keep things simple, the RDS instance is in the same VPC as the Lambda function but in a different subnet and security group. The Lambda function also needs a policy granting access to Secrets Manager. The RDS VPC is hardcoded in the template.yml file at the moment, but a more dynamic approach could be used.
- Create Secrets: Create two secrets in AWS Secrets Manager:
a. Database Credentials:
```bash
aws secretsmanager create-secret \
  --name dev/janos \
  --description "Database credentials for DLT function" \
  --secret-string '{"username":"<your-username>","password":"<your-password>","host":"<your-host>","port":"<your-port>","database":"<your-database>"}'
```
b. API Credentials:
```bash
aws secretsmanager create-secret \
  --name DLT_ApiCredentials \
  --description "API credentials for DLT function" \
  --secret-string '{"api_base_url":"<your-api-url>","api_token":"<your-api-token>"}'
```
- Deploy the Stack: Use AWS SAM to deploy the stack:
```bash
sam build
sam deploy --stack-name=dlt-etl-poc --resolve-image-repos --resolve-s3 --capabilities CAPABILITY_IAM
```
You can check the CloudFormation console for the outputs of this command, including the endpoint to call the API; a boto3 sketch for reading the outputs follows this list.
- Generate requirements.txt: We have to generate a requirements.txt file because the Lambda function is packaged as a container image, and the image build installs dependencies from requirements.txt rather than from Rye's lockfile:
```bash
rye list --all > src/data_aws_lambda_etl/requirements.txt
```
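If you prefer the SDK to the console for checking the deployment, here is a minimal sketch of reading the stack outputs with boto3 (the stack name matches the `sam deploy` command above):

```python
import boto3

cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="dlt-etl-poc")["Stacks"][0]

# Outputs is a list of {"OutputKey": ..., "OutputValue": ...} dicts.
for output in stack.get("Outputs", []):
    print(f'{output["OutputKey"]}: {output["OutputValue"]}')
```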
The Lambda function is configured with:
- Timeout: 30 seconds
- Memory: 1024 MB
- VPC access for RDS connectivity
- Permissions to access Secrets Manager
The Lambda function uses the following environment variables:
- `DLT_PROJECT_DIR`: "/tmp"
- `DLT_DATA_DIR`: "/tmp"
- `DLT_PIPELINE_DIR`: "/tmp"
- `DATABASE_CREDENTIALS_SECRET_ARN`: ARN of the database credentials secret
- `API_CREDENTIALS_SECRET_ARN`: ARN of the API credentials secret
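A sketch of how the function might turn these variables into dlt configuration at runtime: fetch the secret via boto3 using the ARN, then hand dlt the PostgreSQL credentials through its double-underscore env var convention. The `get_secret` helper is illustrative, not code from this repo.

```python
import json
import os

import boto3

secrets = boto3.client("secretsmanager")


def get_secret(arn: str) -> dict:
    # SecretString holds the JSON document created with `aws secretsmanager create-secret`.
    return json.loads(secrets.get_secret_value(SecretId=arn)["SecretString"])


db = get_secret(os.environ["DATABASE_CREDENTIALS_SECRET_ARN"])

# dlt picks up destination credentials from this env var (double-underscore config convention).
os.environ["DESTINATION__POSTGRES__CREDENTIALS"] = (
    f"postgresql://{db['username']}:{db['password']}"
    f"@{db['host']}:{db['port']}/{db['database']}"
)
```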
For local development, use the AWS Lambda Python 3.12 runtime:
```
public.ecr.aws/lambda/python:3.12
```
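The AWS base image ships with the Lambda Runtime Interface Emulator, so after something like `docker run -p 9000:8080 <your-image>` you can invoke the function locally; a quick test call with an empty event as a placeholder:

```python
import json
import urllib.request

# The Runtime Interface Emulator in the base image exposes this fixed invocation path.
url = "http://localhost:9000/2015-03-31/functions/function/invocations"
req = urllib.request.Request(url, data=json.dumps({}).encode(), method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())
```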
- DLT REST API Source
- DLT PostgreSQL Destination
- DLT AWS Taktile Blog - an example of using dlt with Lambda in production
To delete a secret:
```bash
aws secretsmanager delete-secret \
  --secret-id arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name> \
  --force-delete-without-recovery
```