
Merge pull request #64 from getindata/release-0.5.0
Release 0.5.0
marrrcin authored Aug 11, 2023
2 parents a040b3c + 9b84f03 commit 58a26ad
Showing 35 changed files with 2,843 additions and 1,389 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.4.1
current_version = 0.5.0

[bumpversion:file:pyproject.toml]

2 changes: 1 addition & 1 deletion .copier-answers.yml
@@ -7,7 +7,7 @@ description: Kedro plugin with Azure ML Pipelines support
docs_url: https://kedro-azureml.readthedocs.io/
full_name: Kedro Azure ML Pipelines plugin
github_url: https://github.com/getindata/kedro-azureml
initial_version: 0.4.1
initial_version: 0.5.0
keywords:
- kedro
- mlops
2 changes: 2 additions & 0 deletions .github/workflows/tests_and_publish.yml
@@ -134,6 +134,7 @@ jobs:
find "../dist" -name "*.tar.gz" | xargs -I@ cp @ kedro-azureml.tar.gz
echo -e "\n./kedro-azureml.tar.gz\n" >> src/requirements.txt
echo -e "kedro-docker\n" >> src/requirements.txt
echo -e "openpyxl\n" >> src/requirements.txt # temp fix for kedro-datasets issues with optional packages
sed -i '/kedro-telemetry/d' src/requirements.txt
echo $(cat src/requirements.txt)
pip install -r src/requirements.txt
@@ -150,6 +151,7 @@
cp ../tests/conf/${{ matrix.e2e_config }}/azureml.yml conf/base/azureml.yml
sed -i 's/{container_registry}/${{ secrets.REGISTRY_LOGIN_SERVER }}/g' conf/base/azureml.yml
sed -i 's/{image_tag}/${{ matrix.e2e_config }}/g' conf/base/azureml.yml
cat conf/base/azureml.yml
- name: Login via Azure CLI
11 changes: 10 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,13 @@

## [Unreleased]

## [0.5.0] - 2023-08-11

- [🚀 New dataset] Added support for `AzureMLAssetDataSet` based on Azure ML SDK v2 (fsspec) by [@tomasvanpottelbergh](https://github.com/tomasvanpottelbergh) & [@froessler](https://github.com/fdroessler)
- [📝 Docs] Updated the datasets docs with new sections
- Bumped minimal required Kedro version to `0.18.11`
- [⚠️ Deprecation warning] Starting from `0.4.0`, the plugin is not compatible with ARM macOS versions due to internal Azure dependencies (v1 SDKs). V1 SDK-based datasets will be removed in the future

## [0.4.1] - 2023-05-04

- [📝 Docs] Revamp the quickstart guide in documentation
@@ -62,7 +69,9 @@

- Initial plugin release

[Unreleased]: https://github.com/getindata/kedro-azureml/compare/0.4.1...HEAD
[Unreleased]: https://github.com/getindata/kedro-azureml/compare/0.5.0...HEAD

[0.5.0]: https://github.com/getindata/kedro-azureml/compare/0.4.1...0.5.0

[0.4.1]: https://github.com/getindata/kedro-azureml/compare/0.4.0...0.4.1

7 changes: 6 additions & 1 deletion docs/conf.py
@@ -60,7 +60,12 @@
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]

autodoc_mock_imports = ["azureml", "pandas", "backoff", "cloudpickle"]
autodoc_mock_imports = [
"azureml",
"pandas",
"backoff",
"cloudpickle",
]

# -- Options for HTML output -------------------------------------------------

36 changes: 31 additions & 5 deletions docs/source/05_data_assets.rst
@@ -1,11 +1,15 @@
Azure Data Assets
=================

``kedro-azureml`` adds support for two new datasets that can be used in the Kedro catalog, the ``AzureMLFileDataSet``
and the ``AzureMLPandasDataSet`` which translate to `File/Folder dataset`_ and `Tabular dataset`_ respectively in
``kedro-azureml`` adds support for new datasets that can be used in the Kedro catalog. Both the Azure ML v1 SDK (direct Python) and the Azure ML v2 SDK (fsspec-based) APIs are currently supported.

**For the v2 API (fsspec-based)**, use ``AzureMLAssetDataSet``, which lets you use Azure ML v2 SDK Folder/File datasets in both remote and local runs.

**For the v1 API** (deprecated ⚠️), use the ``AzureMLFileDataSet`` and the ``AzureMLPandasDataSet``, which translate to `File/Folder dataset`_ and `Tabular dataset`_ respectively in
Azure Machine Learning. Both fully support the Azure versioning mechanism and can be used in the same way as any
other dataset in Kedro.


Apart from these, ``kedro-azureml`` also adds the ``AzureMLPipelineDataSet``, which is used to pass data between
pipeline nodes when the pipeline is run on Azure ML and the *pipeline data passing* feature is enabled.
By default, data is then saved and loaded using the ``PickleDataSet`` as the underlying dataset.
@@ -24,15 +28,37 @@ For details on usage, see the :ref:`API Reference` below
API Reference
-------------

.. autoclass:: kedro_azureml.datasets.AzureMLPandasDataSet
Pipeline data passing
^^^^^^^^^^^^^^^^^^^^^

⚠️ Cannot be used when run locally.

.. autoclass:: kedro_azureml.datasets.AzureMLPipelineDataSet
:members:

-----------------

.. autoclass:: kedro_azureml.datasets.AzureMLFileDataSet

V2 SDK
^^^^^^^^^^^^^
Use the dataset below when you're using Azure ML SDK v2 (fsspec-based).

✅ Can be used for both remote and local runs.

.. autoclass:: kedro_azureml.datasets.asset_dataset.AzureMLAssetDataSet
:members:

V1 SDK
^^^^^^^^^^^^^
Use the datasets below when you're using Azure ML SDK v1 (direct Python).

⚠️ Deprecated: these datasets will be removed in a future version of ``kedro-azureml``.

.. autoclass:: kedro_azureml.datasets.AzureMLPandasDataSet
:members:

-----------------

.. autoclass:: kedro_azureml.datasets.AzureMLPipelineDataSet
.. autoclass:: kedro_azureml.datasets.AzureMLFileDataSet
:members:
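
For orientation, here is a minimal sketch of what using the new v2 dataset from Python could look like. The constructor arguments shown (`azureml_dataset` and the underlying `dataset` definition) are assumptions for illustration only; the authoritative signature lives in `kedro_azureml/datasets/asset_dataset.py`.

from kedro_azureml.datasets import AzureMLAssetDataSet

# Illustrative sketch only: argument names are assumptions, check the
# AzureMLAssetDataSet class in kedro_azureml/datasets/asset_dataset.py.
model_input = AzureMLAssetDataSet(
    azureml_dataset="my_registered_data_asset",  # name of the Azure ML data asset
    dataset={
        "type": "pandas.ParquetDataSet",  # underlying Kedro dataset used to read the file
        "filepath": "data.parquet",
    },
)
df = model_input.load()  # usable in both local and remote runs, as noted above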

2 changes: 1 addition & 1 deletion kedro_azureml/__init__.py
@@ -1,4 +1,4 @@
__version__ = "0.4.1"
__version__ = "0.5.0"

import warnings

Empty file added kedro_azureml/auth/__init__.py
Empty file.
74 changes: 74 additions & 0 deletions kedro_azureml/auth/utils.py
@@ -0,0 +1,74 @@
import os
from functools import cached_property

from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azureml.core import Datastore, Run, Workspace
from azureml.exceptions import UserErrorException


def get_azureml_credentials():
    try:
        # On an AzureML compute instance, the managed identity takes precedence,
        # but it does not have enough permissions.
        # So, if we are on an AzureML compute instance, we disable the managed identity.
        is_azureml_managed_identity = "MSI_ENDPOINT" in os.environ
        credential = DefaultAzureCredential(
            exclude_managed_identity_credential=is_azureml_managed_identity
        )
        # Check that the given credential can obtain a token successfully.
        credential.get_token("https://management.azure.com/.default")
    except Exception:
        # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work.
        credential = InteractiveBrowserCredential()
    return credential


def get_workspace(*args, **kwargs) -> Workspace:
    """
    Get an AzureML workspace.
    Args:
        *args: Positional arguments to pass to the Workspace constructor.
        **kwargs: Keyword arguments to pass to the Workspace constructor.
    """
    if args or kwargs:
        workspace = Workspace(*args, **kwargs)
    else:
        try:
            # if running on an AzureML compute instance
            workspace = Workspace.from_config()
        except UserErrorException:
            try:
                # if running on an AzureML compute cluster
                workspace = Run.get_context().experiment.workspace
            except AttributeError as e:
                raise UserErrorException(
                    "Could not connect to AzureML workspace."
                ) from e
    return workspace


class AzureMLDataStoreMixin:
    def __init__(self, workspace_args, azureml_datastore=None, workspace=None):
        self._workspace_instance = workspace
        self._azureml_datastore_name = azureml_datastore
        self._workspace_args = workspace_args or dict()

    @cached_property
    def _workspace(self) -> Workspace:
        return self._workspace_instance or get_workspace(**self._workspace_args)

    @cached_property
    def _azureml_datastore(self) -> str:
        return (
            self._azureml_datastore_name or self._workspace.get_default_datastore().name
        )

    @cached_property
    def _datastore_container_name(self) -> str:
        ds = Datastore.get(self._workspace, self._azureml_datastore)
        return ds.container_name

    @cached_property
    def _azureml_path(self):
        return f"abfs://{self._datastore_container_name}/"
42 changes: 37 additions & 5 deletions kedro_azureml/cli.py
Expand Up @@ -2,9 +2,11 @@
import logging
import os
from pathlib import Path
from typing import List, Optional, Tuple
from typing import Dict, List, Optional, Tuple

import click
from kedro.framework.cli.project import LOAD_VERSION_HELP
from kedro.framework.cli.utils import _split_load_versions
from kedro.framework.startup import ProjectMetadata

from kedro_azureml.cli_functions import (
@@ -206,6 +208,14 @@ def init(
multiple=True,
help="Environment variables to be injected in the steps, format: KEY=VALUE",
)
@click.option(
"--load-versions",
"-lv",
type=str,
default="",
help=LOAD_VERSION_HELP,
callback=_split_load_versions,
)
@click.pass_obj
@click.pass_context
def run(
@@ -218,6 +228,7 @@
params: str,
wait_for_completion: bool,
env_var: Tuple[str],
load_versions: Dict[str, str],
):
"""Runs the specified pipeline in Azure ML Pipelines; Additional parameters can be passed from command line.
Can be used with --wait-for-completion param to block the caller until the pipeline finishes in Azure ML.
@@ -236,7 +247,9 @@

mgr: KedroContextManager
extra_env = parse_extra_env_params(env_var)
with get_context_and_pipeline(ctx, image, pipeline, params, aml_env, extra_env) as (
with get_context_and_pipeline(
ctx, image, pipeline, params, aml_env, extra_env, load_versions
) as (
mgr,
az_pipeline,
):
@@ -302,6 +315,20 @@ def run(
default="pipeline.yaml",
help="Pipeline YAML definition file.",
)
@click.option(
"--env-var",
type=str,
multiple=True,
help="Environment variables to be injected in the steps, format: KEY=VALUE",
)
@click.option(
"--load-versions",
"-lv",
type=str,
default="",
help=LOAD_VERSION_HELP,
callback=_split_load_versions,
)
@click.pass_obj
def compile(
ctx: CliContext,
@@ -310,10 +337,15 @@
pipeline: str,
params: list,
output: str,
env_var: Tuple[str],
load_versions: Dict[str, str],
):
"""Compiles the pipeline into YAML format"""
params = json.dumps(p) if (p := parse_extra_params(params)) else ""
with get_context_and_pipeline(ctx, image, pipeline, params, aml_env) as (
extra_env = parse_extra_env_params(env_var)
with get_context_and_pipeline(
ctx, image, pipeline, params, aml_env, extra_env, load_versions
) as (
_,
az_pipeline,
):
@@ -342,14 +374,14 @@ def compile(
@click.option(
"--az-input",
"azure_inputs",
type=(str, click.Path(exists=True, file_okay=False, dir_okay=True)),
type=(str, click.Path(exists=True, file_okay=True, dir_okay=True)),
multiple=True,
help="Name and path of Azure ML Pipeline input",
)
@click.option(
"--az-output",
"azure_outputs",
type=(str, click.Path(exists=True, file_okay=False, dir_okay=True)),
type=(str, click.Path(exists=True, file_okay=True, dir_okay=True)),
multiple=True,
help="Name and path of Azure ML Pipeline output",
)
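
For reference, the new `--load-versions` / `-lv` option reuses Kedro's standard `LOAD_VERSION_HELP` and `_split_load_versions` helpers, so it accepts the usual `dataset_name:load_version` pairs, and `--env-var` can be repeated. An illustrative invocation (the dataset name, timestamp and variable are placeholders):

    kedro azureml run --env-var MY_FLAG=1 --load-versions companies:2023-08-11T12.00.00.000Z
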
3 changes: 3 additions & 0 deletions kedro_azureml/cli_functions.py
@@ -23,6 +23,7 @@ def get_context_and_pipeline(
params: str,
aml_env: Optional[str] = None,
extra_env: Dict[str, str] = {},
load_versions: Dict[str, str] = {},
):
with KedroContextManager(
ctx.metadata.package_name, ctx.env, parse_extra_params(params, True)
@@ -50,11 +51,13 @@
ctx.env,
mgr.plugin_config,
mgr.context.params,
mgr.context.catalog,
aml_env,
docker_image,
params,
storage_account_key,
extra_env,
load_versions,
)
az_pipeline = generator.generate()
yield mgr, az_pipeline
17 changes: 2 additions & 15 deletions kedro_azureml/client.py
@@ -1,15 +1,14 @@
import json
import logging
import os
from contextlib import contextmanager
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable, Optional

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Job
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from kedro_azureml.auth.utils import get_azureml_credentials
from kedro_azureml.config import AzureMLConfig

logger = logging.getLogger(__name__)
@@ -23,19 +22,7 @@ def _get_azureml_client(subscription_id: Optional[str], config: AzureMLConfig):
"workspace_name": config.workspace_name,
}

try:
# On a AzureML compute instance, the managed identity will take precedence,
# while it does not have enough permissions.
# So, if we are on an AzureML compute instance, we disable the managed identity.
is_azureml_managed_identity = "MSI_ENDPOINT" in os.environ
credential = DefaultAzureCredential(
exclude_managed_identity_credential=is_azureml_managed_identity
)
# Check if given credential can get token successfully.
credential.get_token("https://management.azure.com/.default")
except Exception:
# Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not work
credential = InteractiveBrowserCredential()
credential = get_azureml_credentials()

with TemporaryDirectory() as tmp_dir:
config_path = Path(tmp_dir) / "config.json"
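
The refactor above only moves the credential fallback into the shared `get_azureml_credentials` helper; the client itself still goes through a temporary `config.json`, as shown in the hunk. As a rough sketch (not the plugin's exact code path), the helper can also be combined with `MLClient` directly; the subscription, resource group, workspace and job names below are placeholders:

from azure.ai.ml import MLClient

from kedro_azureml.auth.utils import get_azureml_credentials

# DefaultAzureCredential (managed identity excluded on AzureML compute
# instances), falling back to InteractiveBrowserCredential.
credential = get_azureml_credentials()

client = MLClient(
    credential=credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
job = client.jobs.get("<job-name>")  # e.g. inspect a previously submitted pipeline job
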
2 changes: 2 additions & 0 deletions kedro_azureml/datasets/__init__.py
@@ -1,3 +1,4 @@
from kedro_azureml.datasets.asset_dataset import AzureMLAssetDataSet
from kedro_azureml.datasets.file_dataset import AzureMLFileDataSet
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataSet
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataSet
@@ -8,6 +9,7 @@

__all__ = [
"AzureMLFileDataSet",
"AzureMLAssetDataSet",
"AzureMLPipelineDataSet",
"AzureMLPandasDataSet",
"KedroAzureRunnerDataset",