A simple, scalable use case utilizing Apache Airflow, Databricks, Delta Tables, & PySpark!
Check out the Airflow code »
Check out the Databricks code »
Download · Report Bug · Request Feature
Databricks is powerful, as is Apache Airflow. Together, they make a compelling use case for a well-rounded, all-you-need stack for many of your data pipeline needs. This project showcases the utilization of Databricks to extract, manipulate, and upsert data into its Delta Table infrastructure, all nicely wrapped (and automated!) with Apache Airflow.
Built at Astronomer as one of my many projects during my 2022 Winter/Spring internship.
Imagine a Data Analyst who works for an investment management firm, helping clients make good decisions about their investment portfolios. To do so, the Data Analyst retrieves market data regularly, and for each client provides an analysis of how the industries they are invested in perform.
The Data Analyst persists the transformed data from analyses, sends automated notifications to clients to take action when relevant, and keeps a dashboard up to date for them to check their investment health at a glance.
Let’s look into this Data Analyst’s workflow.
Step 1: Pass parameters from Airflow using notebook_params = portfolio
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

portfolio = {
    "stocks": "MSFT AAPL IBM WMT SHOP GOOGL TSLA GME AMZN COST COKE CBRE NVDA AMD PG"
}

# Run the Databricks job and push the job Run ID to XCom
run_databricks_job = DatabricksRunNowOperator(
    task_id="Run_Databricks_Job",
    databricks_conn_id=DATABRICKS_CONNECTION_ID,
    job_id=137122987688189,
    do_xcom_push=True,
    notebook_params=portfolio,
)
Step 2: Use dbutils.widgets.text(param, default_value) to load params pushed by Airflow into the Databricks notebook.
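A minimal sketch of what that notebook cell might look like (the default ticker value below is only a placeholder, not part of the original notebook):

# Read the "stocks" parameter pushed from Airflow via notebook_params;
# the default value is only used when running the notebook standalone.
dbutils.widgets.text("stocks", "MSFT AAPL IBM")
stocks = dbutils.widgets.get("stocks").split()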
Step 1: Pull today's market cap data from Yahoo Finance using the yfinance Python package.
Step 2: Aggregate today's market cap data by Industry Sector
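A rough sketch of these two steps, assuming yfinance's Ticker.info fields ("marketCap", "sector") and pandas for the aggregation:

import pandas as pd
import yfinance as yf

# Pull today's market cap and sector for each ticker in the portfolio.
# Exact field availability can vary by ticker.
rows = []
for symbol in stocks:
    info = yf.Ticker(symbol).info
    rows.append({
        "ticker": symbol,
        "sector": info.get("sector"),
        "market_cap": info.get("marketCap"),
    })
today_df = pd.DataFrame(rows)

# Aggregate today's market cap by industry sector
sector_df = today_df.groupby("sector", as_index=False)["market_cap"].sum()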
Step 1: Transform the pandas dataframe into a Spark dataframe. Write that dataframe into a temporary Delta Table.
Step 2: Upsert the temp Delta Table (containing today's data) into the permanent Delta Table containing all previous historic data.
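A sketch of the write-and-upsert, with assumed table names (sector_market_cap_temp and sector_market_cap_history) standing in for the ones used in the notebook:

from pyspark.sql import functions as F

# Convert the pandas aggregate into a Spark DataFrame and stamp it with today's date
spark_df = spark.createDataFrame(sector_df).withColumn("date", F.current_date())

# Write today's snapshot to a temporary Delta table
spark_df.write.format("delta").mode("overwrite").saveAsTable("sector_market_cap_temp")

# Upsert today's snapshot into the permanent history table
spark.sql("""
    MERGE INTO sector_market_cap_history AS target
    USING sector_market_cap_temp AS source
    ON target.sector = source.sector AND target.date = source.date
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")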
Going forward, this table can be linked to Tableau for analysis.
Step 1: Determining the Percentage Change from Day Prior
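One way this could look with a Spark window function, continuing from the assumed history table above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compare each sector's market cap today against the prior day
history = spark.table("sector_market_cap_history")
w = Window.partitionBy("sector").orderBy("date")

pct_change = (
    history
    .withColumn("prev_market_cap", F.lag("market_cap").over(w))
    .withColumn(
        "pct_change",
        (F.col("market_cap") - F.col("prev_market_cap")) / F.col("prev_market_cap") * 100,
    )
    .filter(F.col("date") == F.current_date())
)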
Step 2: Exit the Databricks Notebook with output data, which is subsequently captured by Airflow and passed around via XCom.
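A sketch of that exit call; the "No Email Required" sentinel mirrors the branch check in the DAG, while the 5% threshold and HTML summary format are assumptions:

# Exit the notebook with a result string for Airflow to pick up via XCom
big_movers = pct_change.filter(F.abs(F.col("pct_change")) > 5).toPandas()

if big_movers.empty:
    dbutils.notebook.exit("No Email Required")
else:
    dbutils.notebook.exit(big_movers.to_html(index=False))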
Step 3: Ingesting results in Airflow: This data is picked up using the DatabricksHook and assigned to the variable model_uri.
from airflow.decorators import task
from airflow.providers.databricks.hooks.databricks import DatabricksHook

@task
def Retreive_Databricks_Output(id):
    # Retrieve the notebook's XCom output using DatabricksHook
    databricks_hook = DatabricksHook()
    model_uri = databricks_hook.get_run_output(id)['notebook_output']['result']
    return model_uri

# "retreive_databricks_output" holds the XCom data returned by Databricks
retreive_databricks_output = Retreive_Databricks_Output(run_databricks_job.output['run_id'])
Step 4: Using the BranchPythonOperator to decide whether to notify
from airflow.operators.python import BranchPythonOperator

# Decide whether an email should be sent based on the content of the Databricks output
def _split(data):
    if data == "No Email Required":
        print("LOG: No big movers, no email was sent")
        return 'No_Email_Required'
    else:
        return 'Send_Email'

branching = BranchPythonOperator(
    task_id='Check_if_Email_is_Needed',
    python_callable=_split,
    op_args=[retreive_databricks_output],
)
Step 5: Send email notification.
from airflow.operators.email import EmailOperator

# Send an email containing the XCom content returned by Databricks
mail = EmailOperator(
    task_id='Send_Email',
    to='[email protected]',
    subject='Daily Movers',
    html_content=retreive_databricks_output,
)
Distributed under the Apache License 2.0. See LICENSE for more information.
Amir Zahreddine - [email protected]
Project Link: https://github.com/AmirZahre/Data_Analyst_DAG
- Santona Tuli for being an awesome mentor during my introduction to DevOps!
- The team @ Astronomer for help with any questions that arose while learning Airflow.
- Ran Aroussi for creating a fabulous Yahoo Finance API library for Python.