
Merge pull request #12 from vincentclaes/refine-readme
update readme
vincentclaes authored Dec 14, 2020
2 parents fae5df9 + 7ef4529 commit ffb248b
Showing 2 changed files with 44 additions and 10 deletions.
54 changes: 44 additions & 10 deletions README.md
@@ -1,22 +1,38 @@
# Datajob

Build and deploy a serverless data pipeline with no effort on AWS.
> Datajob is an MVP. Do not use this in production. <br/>
> [Feedback](https://github.com/vincentclaes/datajob/discussions) is much appreciated!
- Deploy your code to a glue job
- Package your project and make it available on AWS
### Build and deploy a serverless data pipeline with no effort on AWS.

- Datajob uses exclusively serverless services.
- There is no custom or managed application needed to deploy and run your data pipeline on AWS!
- The main dependencies are [AWS CDK](https://github.com/aws/aws-cdk) and the [Step Functions SDK for data science](https://github.com/aws/aws-step-functions-data-science-sdk-python).

Currently implemented:

- Deploy your code to a glue job.
- Package your project and make it available for your glue jobs.
- Orchestrate your pipeline using Step Functions as simply as `task1 >> [task2,task3] >> task4` (see the sketch below).
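
To give an idea of how such `>>` chaining can be modeled, here is a minimal, hypothetical Python sketch using operator overloading; it illustrates the pattern, it is not datajob's actual implementation:

    class Task:
        """A toy task that records which tasks run after it."""

        def __init__(self, name):
            self.name = name
            self.downstream = []

        def __rshift__(self, other):
            # task1 >> task2, or task1 >> [task2, task3]
            tasks = other if isinstance(other, list) else [other]
            self.downstream.extend(tasks)
            return other

        def __rrshift__(self, other):
            # [task1, task2] >> task3: lists have no __rshift__,
            # so Python falls back to the right operand's __rrshift__.
            for task in other if isinstance(other, list) else [other]:
                task.downstream.append(self)
            return self

    task1, task2, task3, task4 = (Task(f"task{i}") for i in range(1, 5))
    task1 >> [task2, task3] >> task4  # fan out, then join on task4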

Ideas to be implemented can be found [below](#ideas).


# Installation

datajob can be installed using pip. Beware that we depend on [aws cdk cli](https://github.com/aws/aws-cdk)!
Datajob can be installed using pip. <br/>
Beware that we depend on [aws cdk cli](https://github.com/aws/aws-cdk)!

pip install datajob
npm install -g aws-cdk
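
To check that both tools landed on your path (plain sanity checks, nothing datajob-specific):

    cdk --version
    pip show datajob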

# Example

A simple data pipeline with 3 Glue Python shell tasks that are executed both sequentially and in parallel.
See the full example [here](https://github.com/vincentclaes/datajob/tree/main/examples/data_pipeline_simple)
See the full code of the example [here](https://github.com/vincentclaes/datajob/tree/main/examples/data_pipeline_simple)

**We have 3 scripts that we want to orchestrate sequentially and in parallel on AWS using Glue and Step Functions**.

The definition of our pipeline can be found in `examples/data_pipeline_simple/datajob_stack.py`, and is shown below:

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
@@ -25,10 +41,12 @@ See the full example [here](https://github.com/vincentclaes/datajob/tree/main/ex

# the datajob_stack is the instance that will result in a cloudformation stack.
# we inject the datajob_stack object into all the resources that we want to add.

with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:

# here we define 3 glue jobs with the datajob_stack object,
# a name and the relative path to the source code.

task1 = GlueJob(
datajob_stack=datajob_stack,
name="task1",
@@ -49,23 +67,36 @@ See the full example [here](https://github.com/vincentclaes/datajob/tree/main/ex
# we want to orchestrate. We got the orchestration idea from
# Airflow, where we use a list to run tasks in parallel
# and we use the shift operator '>>' to chain the tasks in our workflow.

with StepfunctionsWorkflow(
datajob_stack=datajob_stack,
name="data-pipeline-simple",
) as sfn:
[task1, task2] >> task3
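
Because the diff collapses part of this example, here is a sketch of how the full definition plausibly reads; the `StepfunctionsWorkflow` import path and the keyword argument for the script path are assumptions inferred from the visible fragments:

    from datajob.datajob_stack import DataJobStack
    from datajob.glue.glue_job import GlueJob
    # assumed import path for the workflow class used below
    from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

    with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:

        # one glue job per script; the keyword for the script path is assumed
        task1 = GlueJob(
            datajob_stack=datajob_stack,
            name="task1",
            path_to_glue_job="data_pipeline_simple/task1.py",
        )
        task2 = GlueJob(
            datajob_stack=datajob_stack,
            name="task2",
            path_to_glue_job="data_pipeline_simple/task2.py",
        )
        task3 = GlueJob(
            datajob_stack=datajob_stack,
            name="task3",
            path_to_glue_job="data_pipeline_simple/task3.py",
        )

        # run task1 and task2 in parallel, then task3
        with StepfunctionsWorkflow(
            datajob_stack=datajob_stack,
            name="data-pipeline-simple",
        ) as sfn:
            [task1, task2] >> task3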


## Deploy and destroy
## Deploy, Run and Destroy

Set the aws account number and the profile that contains your aws credentials (`~/.aws/credentials`) as environment variables:

export AWS_DEFAULT_ACCOUNT=my-account-number
export AWS_PROFILE=my-profile

Deploy your pipeline using a unique identifier `--stage` and point to the configuration of the pipeline using `--config`

cd examples/data_pipeline_simple
datajob deploy --stage dev --config datajob_stack.py

After running the `deploy` command, the code of each of the 3 tasks is deployed to its own glue job, and the glue jobs are orchestrated using step functions.
In the AWS console, go to the Step Functions service, look for `data-pipeline-simple` and click on "Start execution".

![DataPipelineSimple](assets/data-pipeline-simple.png)
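
If you prefer the terminal, you can plausibly start the same execution with the AWS CLI; the state machine ARN below is a placeholder you would copy from the console or from `list-state-machines`:

    aws stepfunctions list-state-machines
    aws stepfunctions start-execution \
        --state-machine-arn "arn:aws:states:eu-west-1:123456789012:stateMachine:data-pipeline-simple"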

Follow up on the progress. Once the pipeline is finished, you can tear it down with the command:

datajob destroy --stage dev --config datajob_stack.py

As simple as that!

> Note: When using the datajob cli to deploy a pipeline, we shell out to the aws cdk.
> You can circumvent shelling out to aws cdk by running `cdk` explicitly.
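
One plausible explicit invocation, assuming the stack file is a regular CDK app and that datajob picks the stage up from the CDK context (both assumptions):

    cd examples/data_pipeline_simple
    # passing stage via CDK context is an assumption, not documented behavior
    cdk deploy --app "python datajob_stack.py" -c stage=dev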
@@ -78,6 +109,10 @@ Deploy your pipeline using a unique identifier `--stage` and point to the config

# Ideas

Any suggestions can be shared by starting a [discussion](https://github.com/vincentclaes/datajob/discussions)

These are the ideas we find interesting to implement:

- trigger a pipeline using the cli: `datajob run --pipeline my-simple-pipeline`
- implement a data bucket that's used for your pipeline.
- add a time based trigger to the step functions workflow.
@@ -91,5 +126,4 @@ Deploy your pipeline using a unique identifier `--stage` and point to the config
- create sagemaker model
- create sagemaker endpoint
- expose sagemaker endpoint to the internet by leveraging lambda + api gateway

- create a serverless UI that follows up on the different pipelines deployed with Datajob, possibly across different AWS accounts
Binary file added assets/data-pipeline-simple.png