Merge pull request #26 from vincentclaes/15-fix-deps
fix dependencies + update example in readme
vincentclaes authored Jan 29, 2021
2 parents 78d77f3 + e7f16ed commit e51ba48
Showing 4 changed files with 1,164 additions and 1,190 deletions.
52 changes: 21 additions & 31 deletions README.md
@@ -1,21 +1,14 @@
# Datajob

> Datajob is an MVP. Do not use this in production.

### Build and deploy a serverless data pipeline with no effort on AWS.

- Datajob uses exclusively serverless services.
- There is no custom or managed application needed to deploy and run your data pipeline on AWS!

Currently implemented:

- Deploy your code to a glue job.
- Package your project and make it available for your glue jobs.
- Orchestrate your pipeline using Step Functions, as simple as `task1 >> [task2,task3] >> task4` (see the sketch below for how this notation can work)

> The main dependencies are [AWS CDK](https://github.com/aws/aws-cdk) and the [Step Functions SDK for data science](https://github.com/aws/aws-step-functions-data-science-sdk-python) <br/>
> Ideas to be implemented can be found [below](#ideas) <br/>
> [Feedback](https://github.com/vincentclaes/datajob/discussions) is much appreciated!
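For the curious: the `>>`-and-list notation used above can be built with plain Python operator overloading. The toy sketch below only illustrates the idea; it is not datajob's actual implementation.

```python
# Toy illustration of '>>' chaining -- NOT datajob's real code.
from typing import List, Union


class Task:
    """A named task that records which tasks must run before it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.upstream: List["Task"] = []

    def __rshift__(self, other: Union["Task", List["Task"]]):
        # 'self >> other': self runs before other (or before each task in the list).
        targets = other if isinstance(other, list) else [other]
        for task in targets:
            task.upstream.append(self)
        return other

    def __rrshift__(self, others: List["Task"]) -> "Task":
        # '[a, b] >> self': every task in the list runs before self.
        self.upstream.extend(others)
        return self


task1, task2, task3, task4 = (Task(f"task{i}") for i in range(1, 5))
task1 >> [task2, task3] >> task4
print([t.name for t in task4.upstream])  # ['task2', 'task3']
print([t.name for t in task2.upstream])  # ['task1']
```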

# Installation
@@ -26,56 +19,51 @@
```
pip install datajob
npm install -g aws-cdk
```

# Quickstart

See the full code of the example [here](https://github.com/vincentclaes/datajob/tree/main/examples/data_pipeline_simple).

## Configure the pipeline
**We have 3 scripts that we want to orchestrate sequentially and in parallel on AWS using Glue and Step Functions**.

The definition of our pipeline can be found in `examples/data_pipeline_simple/datajob_stack.py`, and here below:

```python
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow


# the datajob_stack is the instance that will result in a cloudformation stack.
# we inject the datajob_stack object through all the resources that we want to add.
with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:

    # here we define 3 glue jobs with the datajob_stack object,
    # a name and the relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack,
        name="task1",
        job_path="data_pipeline_simple/task1.py",
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack,
        name="task2",
        job_path="data_pipeline_simple/task2.py",
    )
    task3 = GlueJob(
        datajob_stack=datajob_stack,
        name="task3",
        job_path="data_pipeline_simple/task3.py",
    )

    # we instantiate a step functions workflow and add the sources
    # we want to orchestrate. We got the orchestration idea from
    # airflow where we use a list to run tasks in parallel
    # and we use the bitshift operator '>>' to chain the tasks in our workflow.
    with StepfunctionsWorkflow(
        datajob_stack=datajob_stack, name="data-pipeline-simple"
    ) as sfn:
        [task1, task2] >> task3
```
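The three task scripts are not part of this diff; the real ones live in the examples directory of the repository. Purely as an illustration, a hypothetical `data_pipeline_simple/task1.py` can be as small as a plain Python script:

```python
# Hypothetical data_pipeline_simple/task1.py -- illustration only;
# see the examples directory in the repository for the real scripts.
import logging

logging.basicConfig(level=logging.INFO)


def main() -> None:
    # put your Glue job / ETL logic here
    logging.info("executing task1")


if __name__ == "__main__":
    main()
```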

## Deploy

Set the AWS account number and the profile that contains your AWS credentials (`~/.aws/credentials`) as environment variables:

@@ -87,14 +75,19 @@

Point to the configuration of the pipeline using `--config` and deploy:
```
cd examples/data_pipeline_simple
datajob deploy --config datajob_stack.py
```
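To double check that the deployment created the expected CloudFormation stack, a small boto3 sketch like the one below can help; the stack name used here is an assumption (datajob may add a stage suffix), so verify the exact name in the CloudFormation console.

```python
# Sketch: verify the CloudFormation stack created by the deploy step.
# The stack name is an assumption (datajob may add a stage suffix);
# check the CloudFormation console for the exact name.
import boto3

cfn = boto3.client("cloudformation")
stacks = cfn.describe_stacks(StackName="data-pipeline-simple")["Stacks"]
print(stacks[0]["StackStatus"])  # e.g. CREATE_COMPLETE
```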

## Run

After running the `deploy` command, the code of the 3 tasks is deployed to glue jobs and the glue jobs are orchestrated using Step Functions.
Go to the Step Functions service in the AWS console, look for `data-pipeline-simple` and click on "Start execution".

![DataPipelineSimple](assets/data-pipeline-simple.png)
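If you prefer to start the workflow from code rather than from the console, a minimal boto3 sketch could look like this; looking up the state machine by name and the credential setup are assumptions on top of what this README shows.

```python
# Sketch: start the deployed state machine with boto3 instead of the console.
# Assumes your AWS credentials/profile are configured; the name filter below
# is an assumption, adjust it to the state machine you actually deployed.
import boto3

sfn_client = boto3.client("stepfunctions")

# find the state machine whose name contains "data-pipeline-simple" (first page only)
state_machines = sfn_client.list_state_machines()["stateMachines"]
arn = next(
    sm["stateMachineArn"]
    for sm in state_machines
    if "data-pipeline-simple" in sm["name"]
)

# kick off one execution; the input payload is optional here
response = sfn_client.start_execution(stateMachineArn=arn, input="{}")
print(response["executionArn"])
```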

Follow up on the progress.

## Destroy

Once the pipeline is finished you can pull down the pipeline by using the command:

```
datajob destroy --config datajob_stack.py
```

As simple as that!

@@ -103,10 +96,6 @@
> The datajob cli prints out the commands it uses behind the scenes to build the pipeline.
> If you want, you can use those directly:

```
cd examples/data_pipeline_simple
cdk deploy --app "python datajob_stack.py" -c stage=dev
cdk destroy --app "python datajob_stack.py" -c stage=dev
```

# Ideas

Any suggestions can be shared by starting a [discussion](https://github.com/vincentclaes/datajob/discussions)
@@ -126,4 +115,5 @@

These are the ideas we find interesting to implement:
- create sagemaker model
- create sagemaker endpoint
- expose sagemaker endpoint to the internet by leveraging lambda + api gateway
- create a serverless UI that follows up on the different pipelines deployed on possibly different AWS accounts using Datajob
5 changes: 0 additions & 5 deletions examples/data_pipeline_simple/datajob_stack.py
@@ -33,8 +33,3 @@
        datajob_stack=datajob_stack, name="data-pipeline-simple"
    ) as sfn:
        [task1, task2] >> task3

# to package and deploy this stack execute
# cdk deploy --app "python datajob_stack.py" -c stage=dev
# or use the datajob cli
# datajob deploy --stage dev --config datajob_stack.py