
[DataCatalog]: Catalog to config #4329

Closed

ElenaKhaustova opened this issue Nov 13, 2024 · 4 comments
Labels: Issue: Feature Request (New feature or improvement to existing feature)

@ElenaKhaustova (Contributor) commented Nov 13, 2024

Description

Implement the KedroDataCatalog.to_config() method as part of the catalog serialization/deserialization feature #3932.

Context

Requirements:

  • The catalog already has from_config(), so KedroDataCatalog.to_config() has to output a configuration that can be loaded back with the existing KedroDataCatalog.from_config() method.
  • We want to solve this problem at the framework level and avoid modifying existing datasets where possible.

Implementation

Solution description

We consider 3 different ways of loading datasets:

  1. Lazy datasets loaded from the config — in this case, we store the dataset configuration at the catalog level; the dataset object is not instantiated.
  2. Materialized datasets loaded from the config — we store the dataset configuration at the catalog level and use dataset.from_config() method to instantiate dataset which calls the underlying dataset constructor.
  3. Materialized datasets added to the catalog — instantiated datasets' objects are passed to the catalog, dataset configuration is not stored at the catalog level.

Case 1 can be solved at the catalog level; cases 2 and 3 require retrieving the dataset configuration from the instantiated dataset object (see the sketch below).
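A minimal sketch of the three cases, reusing the public catalog API that also appears in the test example later in this issue; the dataset names and filepaths are illustrative:

from kedro.io import KedroDataCatalog
from kedro_datasets.pandas import CSVDataset

config = {"cars": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/cars.csv"}}

# 1. Lazy dataset loaded from config: only the configuration is stored at the
#    catalog level; the dataset object is not instantiated yet.
catalog = KedroDataCatalog.from_config(config)

# 2. Materialized dataset loaded from config: accessing it instantiates the
#    dataset from the stored configuration via the underlying constructor.
_ = catalog["cars"]

# 3. Materialized dataset added to the catalog: the instantiated object is
#    passed in directly, so no configuration is stored at the catalog level.
catalog["boats"] = CSVDataset(filepath="data/01_raw/boats.csv")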

Solution for 2 and 3 avoiding existing datasets' modifications (as per requirements)

  1. Use AbstractDataset.__init_subclass__, which allows changing the behavior of subclasses from inside AbstractDataset: https://docs.python.org/3/reference/datamodel.html#customizing-class-creation
  2. Create a decorator for the child __init__.
  3. Save the original child __init__.
  4. Replace the original child __init__ with the decorated one, which calls a post-init step where we save the call args: https://docs.python.org/3/library/inspect.html#inspect.getcallargs
  5. Save the call args at the AbstractDataset level in the _init_args field.
  6. Implement AbstractDataset.to_config() to retrieve the configuration from the instantiated dataset object based on the object's _init_args (see the sketch after this list).
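A minimal, self-contained sketch of steps 1–6 above, using a simplified stand-in class rather than Kedro's actual AbstractDataset; only _init_args and to_config() correspond to the names proposed above:

import functools
import inspect


class AbstractDatasetSketch:
    """Simplified stand-in for AbstractDataset, for illustration only."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        original_init = cls.__init__  # save the original child __init__

        @functools.wraps(original_init)
        def decorated_init(self, *args, **init_kwargs):
            original_init(self, *args, **init_kwargs)
            # Post-init: capture the constructor call args on the instance.
            call_args = inspect.getcallargs(original_init, self, *args, **init_kwargs)
            call_args.pop("self", None)
            self._init_args = call_args

        cls.__init__ = decorated_init  # replace the child __init__ with the decorated one

    def to_config(self) -> dict:
        # Rebuild the configuration from the captured init args.
        return {"type": type(self).__name__, **self._init_args}


class MyCSVDataset(AbstractDatasetSketch):
    def __init__(self, filepath, load_args=None):
        self.filepath = filepath
        self.load_args = load_args


print(MyCSVDataset("data/01_raw/cars.csv").to_config())
# {'type': 'MyCSVDataset', 'filepath': 'data/01_raw/cars.csv', 'load_args': None}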

Implement KedroDataCatalog.to_config

Once 2 and 3 are solved, we can implement a common solution at the catalog level. For that, we need to consider the cases where we work with lazy and materialized datasets and retrieve the configuration either from the catalog or via AbstractDataset.to_config().

After the configuration is retrieved, we need to "unresolve" the credentials and keep them in a separate dictionary, as we did when instantiating the catalog. For that, a CatalogConfigResolver.unresolve_config_credentials() method can be implemented to undo the result of CatalogConfigResolver._resolve_config_credentials() (see the sketch below).
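A hedged sketch of what "unresolving" means, written as a hypothetical standalone helper rather than the proposed CatalogConfigResolver.unresolve_config_credentials(); the <dataset_name>_credentials naming scheme is an assumption for illustration, and nested datasets (e.g. inside CachedDataset) are ignored:

def unresolve_credentials(config: dict) -> tuple[dict, dict]:
    """Pull inline credential dicts out of each dataset entry and replace them
    with a named reference, mirroring what from_config() originally received."""
    unresolved_config, credentials = {}, {}
    for ds_name, ds_config in config.items():
        ds_config = dict(ds_config)
        if isinstance(ds_config.get("credentials"), dict):
            creds_name = f"{ds_name}_credentials"
            credentials[creds_name] = ds_config["credentials"]
            ds_config["credentials"] = creds_name
        unresolved_config[ds_name] = ds_config
    return unresolved_config, credentials


config, credentials = unresolve_credentials(
    {
        "boats": {
            "type": "pandas.CSVDataset",
            "filepath": "data/01_raw/companies.csv",
            "credentials": {"client_kwargs": {"aws_access_key_id": "<your key id>"}},
        }
    }
)
# config["boats"]["credentials"] == "boats_credentials"
# credentials == {"boats_credentials": {"client_kwargs": {...}}}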

Excluding parameters and MemoryDatasets

We need to exclude MemoryDatasets as well as parameters from the output configuration (see the sketch below).
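A hypothetical filter illustrating the exclusion, assuming the usual params:/parameters naming convention for parameters and that memory datasets are identified by their type:

def exclude_parameters_and_memory_datasets(config: dict) -> dict:
    """Drop parameter entries and MemoryDatasets before emitting the config."""
    return {
        name: ds_config
        for name, ds_config in config.items()
        if name != "parameters"
        and not name.startswith("params:")
        and ds_config.get("type") != "MemoryDataset"
    }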

Cases not covered

Issues blocking further implementation

Tested with

  • Lazy datasets loaded from the config
  • Materialized datasets loaded from the config
  • Materialized datasets added to the catalog
  • CachedDataset, PartitionedDataset, IncrementalDataset, MemoryDataset and various other kedro datasets
  • Credentials
  • Datasets factories
  • Transcoding
  • Versioning

How to test

#4329 (comment)

@astrojuanlu (Member)

Non-serializable objects, or objects requiring additional logic implemented at the dataset level to save/load them:

Wouldn't it be possible to force datasets to only have static, primitive properties in the __init__ method so that serialising them is trivial?

For example, rather than having

class GBQQueryDataset:
    def __init__(self, ...):
        ...
        self._credentials = google.oauth2.credentials.Credentials(**credentials)
        self._client = google.cloud.bigquery.Client(credentials=self._credentials)

    def _exists(self) -> bool:
        table_ref = self._client...

we do

class GBQQueryDataset(pydantic.BaseModel):
    credentials: dict[str, str]

    def _get_client(self) -> google.cloud.bigquery.Client:
        return bigquery.Client(credentials=google.oauth2.credentials.Credentials(**self.credentials))

    def _exists(self) -> bool:
        table_ref = self._get_client()...

?

(I picked Pydantic here given that there's prior art but dataclasses would work similarly)

@ElenaKhaustova (Contributor, Author)

Non-serializable objects, or objects requiring additional logic implemented at the dataset level to save/load them:

Wouldn't it be possible to force datasets to only have static, primitive properties in the __init__ method so that serialising them is trivial?

That would be an ideal option, as a common solution would work out of the box without corner cases. However, it would require more significant changes on the datasets' side.

As a temporary solution without breaking changes, we can try extending the parent AbstractDataset.to_config() at the dataset level for those datasets and serializing such objects there. However, I cannot guarantee that we'll be able to cover all the cases (see the sketch below).
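A hedged sketch of such a dataset-level extension, reusing the AbstractDatasetSketch stand-in from the earlier sketch; the dataset class is invented (not a real kedro_datasets class) and exists only to show a non-serializable init argument being converted back to primitives:

import re


class RegexFilterDataset(AbstractDatasetSketch):  # stand-in base class sketched above
    def __init__(self, filepath: str, pattern: re.Pattern):
        # The caller passes an already-compiled pattern, so the captured init
        # args contain a non-serializable object.
        self._filepath = filepath
        self._pattern = pattern

    def to_config(self) -> dict:
        # Extend the parent method: keep the captured args, but convert the
        # non-serializable object back into primitives.
        config = super().to_config()
        config["pattern"] = self._pattern.pattern  # back to a plain string
        return config


ds = RegexFilterDataset("data/01_raw/cars.csv", re.compile(r"^c"))
ds.to_config()  # {'type': 'RegexFilterDataset', 'filepath': 'data/01_raw/cars.csv', 'pattern': '^c'}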

@ElenaKhaustova (Contributor, Author) commented Nov 20, 2024

Test example

from kedro.io import KedroDataCatalog, Version
from kedro_datasets.pandas import ExcelDataset


config = {
    "cached_ds": {
        "type": "CachedDataset",
        "versioned": "true",
        "dataset": {
            "type": "pandas.CSVDataset",
            "filepath": "data/01_raw/reviews.csv",
            "credentials": "cached_ds_credentials",
        },
        "metadata": [1, 2, 3]
    },
    "cars": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/reviews.csv"
    },
    "{dataset_name}": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/{dataset_name}.csv"
    },
    "boats": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
        "credentials": "boats_credentials",
        "save_args": {
            "index": False
        }
    },
    "cars_ibis": {
        "type": "ibis.FileDataset",
        "filepath": "data/01_raw/reviews.csv",
        "file_format": "csv",
        "table_name": "cars",
        "connection": {
            "backend": "duckdb",
            "database": "company.db"
        },
        "load_args": {
            "sep": ",",
            "nullstr": "#NA"
        },
        "save_args": {
            "sep": ",",
            "nullstr": "#NA"
        }
    },
}

credentials = {
    "boats_credentials": {
        "client_kwargs": {
            "aws_access_key_id": "<your key id>",
            "aws_secret_access_key": "<your secret>"
        }
    },
    "cached_ds_credentials": {
        "test_key": "test_val"
    },
}

version = Version(
    load="fake_load_version.csv",  # load exact version
    save=None,  # let Kedro generate the save version
)

versioned_dataset = ExcelDataset(
    filepath="data/01_raw/shuttles.xlsx", load_args={"engine": "openpyxl"}, version=version
)


def main():
    catalog = KedroDataCatalog.from_config(config, credentials)
    _ = catalog["reviews"]
    catalog["versioned_dataset"] = versioned_dataset
    catalog["memory_dataset"] = "123"
    print("-" * 20, "Catalog", "-" * 20)
    print(catalog, "\n")
    print("-" * 20, "Catalog to config", "-" * 20)

    _config, _credentials, _load_version, _save_version = catalog.to_config()
    print(_config, "\n")
    print(_credentials, "\n")
    print(_load_version, "\n")
    print(_save_version, "\n")
    print("-" * 20, "Catalog from config", "-" * 20)

    _catalog = KedroDataCatalog.from_config(_config, _credentials, _load_version, _save_version)
    # Materialize datasets
    for ds in _catalog.values():
        pass
    print(_catalog, "\n")
    print("-" * 20, "Catalog from config to config", "-" * 20)

    _config, _credentials, _load_version, _save_version = _catalog.to_config()
    print(_config, "\n")
    print(_credentials, "\n")
    print(_load_version, "\n")
    print(_save_version, "\n")


if __name__ == "__main__":
    main()

@ElenaKhaustova (Contributor, Author)

Solved in #4323
