Merge pull request #124 from vincentclaes/add-sagemaker-to-readme-1
Add sagemaker to readme
vincentclaes authored Aug 11, 2021
2 parents c4944ea + f3f4929 commit c106284
Showing 6 changed files with 43 additions and 18 deletions.
28 changes: 25 additions & 3 deletions README.md
@@ -11,13 +11,25 @@
</br>

- Create and deploy code for Python shell / PySpark **AWS Glue jobs**.
- Create, orchestrate and trigger **Sagemaker Training Jobs and Processing Jobs**.
- Use **AWS Sagemaker** to create ML Models.
- Orchestrate the above jobs using **AWS Stepfunctions** as simple as `task1 >> task2`.
- Let us [know](https://github.com/vincentclaes/datajob/discussions) **what you want to see next**.
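The `task1 >> task2` syntax works through plain Python operator overloading, the same mechanism the Step Functions Data Science SDK builds on. A minimal, hypothetical sketch of the idea (the `Task` class here is illustrative only, not datajob's actual API):

```python
class Task:
    """Toy stand-in for a pipeline step; illustrates `>>` chaining only."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # `self >> other` records that `other` runs after `self`,
        # and returns `other` so chains like a >> b >> c keep working.
        self.downstream.append(other)
        return other


task1, task2, task3 = Task("task1"), Task("task2"), Task("task3")
task1 >> task2 >> task3

print([t.name for t in task1.downstream])  # → ['task2']
print([t.name for t in task2.downstream])  # → ['task3']
```

Because `__rshift__` returns its right-hand operand, an arbitrarily long chain reads left to right as an ordered pipeline.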

</br>

> Dependencies are [AWS CDK](https://github.com/aws/aws-cdk) and [Step Functions SDK for data science](https://github.com/aws/aws-step-functions-data-science-sdk-python) <br/>
<div align="center">

:rocket: :new: :rocket:
</br>
</br>
[Check our new example of an End-to-end Machine Learning Pipeline with Glue, Sagemaker and Stepfunctions](examples/ml_pipeline_end_to_end)
</br>
</br>
:rocket: :new: :rocket:

</br></br>

</div>

</br>

@@ -88,7 +100,7 @@

```shell script
cd examples/data_pipeline_simple
cdk deploy --app "python datajob_stack.py"
```

-### Run
+### Execute

```shell script
datajob execute --state-machine data-pipeline-simple-workflow
```

@@ -103,6 +115,16 @@
The terminal will show a link to the step functions page to follow up on your pipeline run.
```shell script
cdk destroy --app "python datajob_stack.py"
```

# Examples

- [Data pipeline with parallel steps](./examples/data_pipeline_parallel/)
- [Data pipeline for processing big data using PySpark](./examples/data_pipeline_pyspark/)
- [Data pipeline where you package and ship your project as a wheel](./examples/data_pipeline_with_packaged_project/)
- [Machine Learning pipeline where we combine Glue jobs with Sagemaker](examples/ml_pipeline_end_to_end)

All our examples are in [./examples](./examples).

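A pipeline with parallel steps, like the first example above, is just a small dependency graph that gets translated into Step Functions states. A hypothetical sketch (not datajob's real internals) of how fan-out edges could be resolved into layers of tasks that may run concurrently:

```python
from collections import defaultdict


def execution_layers(edges):
    """Group tasks into layers that can run in parallel.

    edges: list of (upstream, downstream) task-name pairs.
    Returns a list of sets; all tasks in one set have their
    dependencies satisfied by earlier layers (Kahn's algorithm, by level).
    """
    indegree = defaultdict(int)
    children = defaultdict(list)
    nodes = set()
    for up, down in edges:
        nodes.update((up, down))
        indegree[down] += 1
        children[up].append(down)

    layers = []
    ready = {n for n in nodes if indegree[n] == 0}
    while ready:
        layers.append(ready)
        next_ready = set()
        for n in ready:
            for child in children[n]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.add(child)
        ready = next_ready
    return layers


# task1 fans out to task2/task3, which both feed task4
edges = [("task1", "task2"), ("task1", "task3"),
         ("task2", "task4"), ("task3", "task4")]
print(execution_layers(edges) == [{"task1"}, {"task2", "task3"}, {"task4"}])  # → True
```

The middle layer maps naturally onto a Step Functions `Parallel` state, while singleton layers become sequential task states.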

# Functionality

<details>
@@ -1,4 +1,4 @@
-# ML Pipeline Scikitlearn
+# End to End Machine Learning Pipeline

> This example is a datajob implementation of [an official AWS Sagemaker example](https://github.com/aws/amazon-sagemaker-examples/blob/master/step-functions-data-science-sdk/machine_learning_workflow_abalone/machine_learning_workflow_abalone.ipynb).
@@ -17,7 +17,10 @@
we have 5 steps in our ML pipeline:

## Deploy

-    cd examples/ml_pipeline_abalone
+> !!! Don't forget to pull down the sagemaker endpoint that is created at the end of the pipeline.
+
+    cd examples/ml_pipeline_end_to_end
     export AWS_PROFILE=my-profile
     export AWS_DEFAULT_REGION=eu-west-1
     cdk deploy --app "python datajob_stack.py" --require-approval never
@@ -31,7 +34,7 @@
arn:aws:cloudformation:eu-west-1:077590795309:stack/datajob-ml-pipeline-abalone/e179ec30-f45a-11eb-9731-02575f1b7adf


-execute the ml pipeline
+## Execute

    datajob execute --state-machine datajob-ml-pipeline-abalone-workflow

@@ -56,6 +59,11 @@
In the end a sagemaker endpoint is created:
![sagemaker-endpoint.png](assets/sagemaker-endpoint.png)

 In our example the name of the endpoint is `datajob-ml-pipeline-abalone-create-endpoint-20210803T165017`.
-Pull down the sagemaker endpoint by executing the following command:
+
+## Destroy
+
+    cdk destroy --app "python datajob_stack.py"
+
+Don't forget to pull down the sagemaker endpoint:

    aws sagemaker delete-endpoint --endpoint-name datajob-ml-pipeline-abalone-create-endpoint-20210803T165017
@@ -18,16 +18,12 @@
 with DataJobStack(scope=app, id="datajob-ml-pipeline-abalone") as djs:

     sagemaker_default_role = get_default_sagemaker_role(datajob_stack=djs)
-    sagemaker_session = sagemaker.Session(
-        boto_session=boto3.session.Session(region_name=djs.env.region)
-    )
-    sagemaker_default_bucket_uri = (
-        f"s3://{sagemaker_session.default_bucket()}/datajob-ml-pipeline-abalone"
-    )

-    train_path = f"{sagemaker_default_bucket_uri}/train/abalone.train"
-    validation_path = f"{sagemaker_default_bucket_uri}/validation/abalone.validation"
-    test_path = f"{sagemaker_default_bucket_uri}/test/abalone.test"
+    train_path = f"s3://{djs.context.data_bucket_name}/train/abalone.train"
+    validation_path = (
+        f"s3://{djs.context.data_bucket_name}/validation/abalone.validation"
+    )
+    test_path = f"s3://{djs.context.data_bucket_name}/test/abalone.test"

     prepare_dataset_step = GlueJob(
         datajob_stack=djs,
@@ -48,8 +44,7 @@
         train_instance_count=1,
         train_instance_type="ml.m4.4xlarge",
         train_volume_size=5,
-        output_path=f"{sagemaker_default_bucket_uri}/single-xgboost",
-        sagemaker_session=sagemaker_session,
+        output_path=f"s3://{djs.context.data_bucket_name}/single-xgboost",
     )

     xgb.set_hyperparameters(
