Ideas to be implemented can be found [below](#ideas).

    pip install datajob
    npm install -g aws-cdk

# Quickstart

See the full code of the example [here](https://github.com/vincentclaes/datajob/tree/main/examples/data_pipeline_simple).

## Configure the pipeline
**We have 3 scripts that we want to orchestrate sequentially and in parallel on AWS using Glue and Step Functions.**
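
Each task is just a Python script that datajob deploys as a Glue job. A minimal, hypothetical sketch of such a script (the real scripts live in the linked example) could look like this:

    # data_pipeline_simple/task1.py - hypothetical sketch, see the linked example for the real scripts
    import time

    print("start task 1")
    time.sleep(10)  # pretend to do some work
    print("end task 1")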

The definition of our pipeline can be found in `examples/data_pipeline_simple/datajob_stack.py` and is shown below:

    from datajob.datajob_stack import DataJobStack
    from datajob.glue.glue_job import GlueJob
    from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow


    # the datajob_stack is the instance that will result in a cloudformation stack.
    # we inject the datajob_stack object into all the resources that we want to add.
    with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:

        # here we define 3 glue jobs with the datajob_stack object,
        # a name and the relative path to the source code.
        task1 = GlueJob(
            datajob_stack=datajob_stack,
            name="task1",
            job_path="data_pipeline_simple/task1.py",
        )
        task2 = GlueJob(
            datajob_stack=datajob_stack,
            name="task2",
            job_path="data_pipeline_simple/task2.py",
        )
        task3 = GlueJob(
            datajob_stack=datajob_stack,
            name="task3",
            job_path="data_pipeline_simple/task3.py",
        )

        # we instantiate a step functions workflow and add the sources
        # we want to orchestrate. We got the orchestration idea from
        # airflow: a list runs tasks in parallel and the bitshift
        # operator '>>' chains the tasks in our workflow.
        with StepfunctionsWorkflow(
            datajob_stack=datajob_stack, name="data-pipeline-simple"
        ) as sfn:
            [task1, task2] >> task3
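
In the example above, `task1` and `task2` run in parallel and `task3` runs after both have finished. Assuming the same Airflow-style semantics of `>>`, other shapes can be sketched inside the `with StepfunctionsWorkflow(...)` block in the same way (a hypothetical illustration, not part of the example):

    # run the tasks strictly one after the other
    task1 >> task2 >> task3

    # run task1 and task2 in parallel and only then run task3 (as in the example above)
    [task1, task2] >> task3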


## Deploy

Set the AWS account number and the profile that contains your AWS credentials (`~/.aws/credentials`) as environment variables:

Point to the configuration of the pipeline using `--config` and deploy:

    cd examples/data_pipeline_simple
    datajob deploy --config datajob_stack.py
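
If you want to sanity-check the deployment from code (optional, this is not part of the datajob CLI), you could list the deployed state machines with boto3 and look for the pipeline name:

    # hedged sketch: assumes your AWS credentials and region are configured
    import boto3

    client = boto3.client("stepfunctions")
    # page through all state machines and keep the ones matching our pipeline name
    names = [
        machine["name"]
        for page in client.get_paginator("list_state_machines").paginate()
        for machine in page["stateMachines"]
    ]
    print([name for name in names if "data-pipeline-simple" in name])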

## Run

After running the `deploy` command, the code of the 3 tasks is deployed to Glue jobs, and these Glue jobs are orchestrated using Step Functions.
Go to the Step Functions service in the AWS console, look for `data-pipeline-simple` and click "Start execution".

![DataPipelineSimple](assets/data-pipeline-simple.png)
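
If you prefer to start the pipeline programmatically instead of through the console, a minimal sketch with boto3 could look like the following (the state machine ARN is a placeholder; look up the real one in the Step Functions console or with the listing sketch in the deploy section):

    import json

    import boto3

    # placeholder ARN - replace with the ARN of the deployed data-pipeline-simple state machine
    STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:data-pipeline-simple"

    client = boto3.client("stepfunctions")
    response = client.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({}),  # this example pipeline does not need any input
    )
    print(response["executionArn"])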

Follow up on the progress.

## Destroy

Once the pipeline is finished you can pull down the pipeline by using the command:

    datajob destroy --config datajob_stack.py

As simple as that!

> The datajob CLI prints out the commands it uses behind the scenes to build the pipeline.
> If you want, you can use those directly:

    cd examples/data_pipeline_simple
    cdk deploy --app "python datajob_stack.py" -c stage=dev
    cdk destroy --app "python datajob_stack.py" -c stage=dev

# Ideas

Any suggestions can be shared by starting a [discussion](https://github.com/vincentclaes/datajob/discussions)
These are the ideas we find interesting to implement:
- create sagemaker model
- create sagemaker endpoint
- expose sagemaker endpoint to the internet by leveraging lambda + api gateway

- create a serverless UI that follows up on the different pipelines deployed on possibly different AWS accounts using Datajob
