
Add AWS SageMaker Unified Studio Workflow Operator #45726

Merged: 45 commits into apache:main on Mar 4, 2025

Conversation

@agupta01 (Contributor) commented Jan 16, 2025

Description

Adds an operator used for executing Jupyter Notebooks, Querybooks, and Visual ETL jobs within the context of a SageMaker Unified Studio project.

SageMaker Unified Studio (SMUS) supports development of Airflow DAGs (called "workflows" within the product) that run on an MWAA cluster managed by the project. These workflows can orchestrate the execution of Unified Studio artifacts that connect to data assets stored in a SMUS project.

Implementation-wise, these notebooks are executed on a SageMaker Training Job running a SageMaker Distribution environment within the context of a SMUS project.

Components

  • SageMakerNotebookOperator: executes Unified Studio artifacts within the context of the user's project.
  • SageMakerNotebookHook: provides a wrapper around the notebook execution.
  • SageMakerNotebookSensor: waits on status updates from the notebook execution.
  • SageMakerNotebookJobTrigger: fires when the notebook execution completes.

Usage

Note that this operator introduces a dependency on the SageMaker Studio SDK (https://www.pypi.org/project/sagemaker-studio).
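
If the SDK is needed outside a SMUS-managed MWAA environment (for example, for local testing), it can be installed directly from PyPI under the package name given in the link above:

pip install sagemaker-studio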

# Import paths assumed from the provider's module layout introduced in this PR
from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker_unified_studio import (
    SageMakerNotebookOperator,
)

with DAG(...) as dag:
    ...
    run_notebook = SageMakerNotebookOperator(
        task_id="initial",
        input_config={"input_path": <notebook_path_in_s3>, "input_params": {}},
        output_config={"output_formats": ["NOTEBOOK"]},
        wait_for_completion=True,
        waiter_delay=5,
    )
    ...
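
For example, a run that passes parameters into the notebook and chains off the first task might look like the following (the keys inside input_params are hypothetical and depend on the notebook being executed):

    run_parameterized = SageMakerNotebookOperator(
        task_id="parameterized_run",
        input_config={
            "input_path": <notebook_path_in_s3>,
            # hypothetical notebook parameters, made available to the notebook at runtime
            "input_params": {"region": "us-east-1", "sample_size": 1000},
        },
        output_config={"output_formats": ["NOTEBOOK"]},
        wait_for_completion=True,
        waiter_delay=5,
    )
    run_notebook >> run_parameterized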

Testing

MWAA uses Python 3.11 and a Postgres backend, so we set those values for all tests.

Unit tests

breeze testing core-tests -p 3.11 -b postgres providers/amazon/tests/provider_tests/amazon/aws/*/test_sagemaker_unified_studio.py

System tests

Ensure a properly configured SageMaker Unified Studio domain and project, as indicated in the example_sagemaker_unified_studio.py file. Also ensure AWS credentials are populated and up to date. Then populate DOMAIN_ID, PROJECT_ID, ENVIRONMENT_ID, and S3_PATH in files/airflow-breeze-config/variables.env and run:

breeze testing system-tests -p 3.11 -b postgres --forward-credentials --test-timeout 3600 providers/amazon/tests/system/amazon/aws/example_sagemaker_unified_studio.py
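
The variables.env entries would look something like the following (placeholder values shown):

DOMAIN_ID=<smus-domain-id>
PROJECT_ID=<smus-project-id>
ENVIRONMENT_ID=<smus-environment-id>
S3_PATH=s3://<bucket>/<prefix>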


@boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels Jan 16, 2025

boring-cyborg bot commented Jan 16, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@agupta01 agupta01 changed the title Add AWS SageMaker Unified Studio Notebook Operator Add AWS SageMaker Unified Studio Workflow Operator Jan 29, 2025
@agupta01 agupta01 marked this pull request as ready for review February 11, 2025 23:00
@o-nikolas (Contributor) commented:

Python 3.10 builds are failing due to Python dependency issues:

  #45 0.924 Using Python 3.10.16 environment at: /usr/local
  #45 2.028    Building apache-airflow @ file:///opt/airflow
  #45 3.403       Built apache-airflow @ file:///opt/airflow
  #45 4.229   × No solution found when resolving dependencies:
  #45 4.229   ╰─▶ Because the current Python version (3.10.16) does not satisfy
  #45 4.229       Python>=3.11 and sagemaker-studio==1.0.7 depends on Python>=3.11, we can
  #45 4.229       conclude that sagemaker-studio==1.0.7 cannot be used.
  #45 4.229       And because only sagemaker-studio<=1.0.7 is available and
  #45 4.229       apache-airflow[devel-ci]==3.0.0.dev0 depends on sagemaker-studio>=1.0.7,
  #45 4.229       we can conclude that apache-airflow[devel-ci]==3.0.0.dev0 cannot be
  #45 4.229       used.
  #45 4.229       And because only apache-airflow[devel-ci]==3.0.0.dev0 is available
  #45 4.229       and you require apache-airflow[devel-ci], we can conclude that your
  #45 4.229       requirements are unsatisfiable.
  #45 ERROR: process "/bin/bash -o pipefail -o errexit -o nounset -o nolog -c bash /scripts/docker/install_airflow.sh" did not complete successfully: exit code: 1
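
A common way to express this kind of constraint (a general pattern, not necessarily the exact fix applied in this PR) is a PEP 508 environment marker on the dependency, so the pin only applies on Python versions the SDK supports:

  sagemaker-studio>=1.0.7; python_version >= "3.11"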

@agupta01 agupta01 requested a review from o-nikolas February 27, 2025 23:34
@o-nikolas o-nikolas merged commit 9939b1b into apache:main Mar 4, 2025
148 checks passed

boring-cyborg bot commented Mar 4, 2025

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

shahar1 pushed a commit to shahar1/airflow that referenced this pull request Mar 5, 2025
Adds an operator used for executing Jupyter Notebooks, Querybooks, and Visual ETL jobs within the context of a SageMaker Unified Studio project.
---------

Co-authored-by: Niko Oliveira <[email protected]>
agupta01 added a commit to agupta01/airflow that referenced this pull request Mar 13, 2025