[DataCatalog]: Catalog to config #4329
Comments
Wouldn't it be possible to force datasets to only have static, primitive properties in the constructor? For example, rather than having

```python
class GBQQueryDataset:
    def __init__(self, ...):
        ...
        self._credentials = google.oauth2.credentials.Credentials(**credentials)
        self._client = google.cloud.bigquery.Client(credentials=self._credentials)

    def _exists(self) -> bool:
        table_ref = self._client...
```

we do

```python
class GBQQueryDataset(pydantic.BaseModel):
    credentials: dict[str, str]

    def _get_client(self) -> google.cloud.bigquery.Client:
        return bigquery.Client(
            credentials=google.oauth2.credentials.Credentials(**self.credentials)
        )

    def _exists(self) -> bool:
        table_ref = self._get_client()...
```

? (I picked Pydantic here given that there's prior art, but dataclasses would work similarly)
That would be an ideal option, as a common solution would work out of the box without corner cases. However, it would require more significant changes on the datasets' side. As a temporary solution without breaking changes, we can try extending the parent `AbstractDataset`.
Test example

```python
from kedro.io import KedroDataCatalog, Version
from kedro_datasets.pandas import ExcelDataset

config = {
    "cached_ds": {
        "type": "CachedDataset",
        "versioned": "true",
        "dataset": {
            "type": "pandas.CSVDataset",
            "filepath": "data/01_raw/reviews.csv",
            "credentials": "cached_ds_credentials",
        },
        "metadata": [1, 2, 3],
    },
    "cars": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/reviews.csv",
    },
    "{dataset_name}": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/{dataset_name}.csv",
    },
    "boats": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
        "credentials": "boats_credentials",
        "save_args": {"index": False},
    },
    "cars_ibis": {
        "type": "ibis.FileDataset",
        "filepath": "data/01_raw/reviews.csv",
        "file_format": "csv",
        "table_name": "cars",
        "connection": {"backend": "duckdb", "database": "company.db"},
        "load_args": {"sep": ",", "nullstr": "#NA"},
        "save_args": {"sep": ",", "nullstr": "#NA"},
    },
}

credentials = {
    "boats_credentials": {
        "client_kwargs": {
            "aws_access_key_id": "<your key id>",
            "aws_secret_access_key": "<your secret>",
        }
    },
    "cached_ds_credentials": {"test_key": "test_val"},
}

version = Version(
    load="fake_load_version.csv",  # load exact version
    save=None,  # save to exact version
)

versioned_dataset = ExcelDataset(
    filepath="data/01_raw/shuttles.xlsx", load_args={"engine": "openpyxl"}, version=version
)


def main():
    catalog = KedroDataCatalog.from_config(config, credentials)

    _ = catalog["reviews"]
    catalog["versioned_dataset"] = versioned_dataset
    catalog["memory_dataset"] = "123"

    print("-" * 20, "Catalog", "-" * 20)
    print(catalog, "\n")

    print("-" * 20, "Catalog to config", "-" * 20)
    _config, _credentials, _load_version, _save_version = catalog.to_config()
    print(_config, "\n")
    print(_credentials, "\n")
    print(_load_version, "\n")
    print(_save_version, "\n")

    print("-" * 20, "Catalog from config", "-" * 20)
    _catalog = KedroDataCatalog.from_config(_config, _credentials, _load_version, _save_version)
    # Materialize datasets
    for ds in _catalog.values():
        pass
    print(_catalog, "\n")

    print("-" * 20, "Catalog from config to config", "-" * 20)
    _config, _credentials, _load_version, _save_version = _catalog.to_config()
    print(_config, "\n")
    print(_credentials, "\n")
    print(_load_version, "\n")
    print(_save_version, "\n")


if __name__ == "__main__":
    main()
```
Solved in #4323
Description
Implement the `KedroDataCatalog.to_config()` method as a part of the catalog serialization/deserialization feature #3932.

Context

Requirements:

- No breaking changes to `from_config`, so `KedroDataCatalog.to_config()` has to output configuration that can be further used with the existing `KedroDataCatalog.from_config()` method to load it (see the `from_config` method: kedro/kedro/io/kedro_data_catalog.py, line 268 at 9464dc7).
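In other words, the contract is a round trip. A minimal sketch, assuming the `to_config()` return shape shown in the test example above:

```python
# Whatever to_config() emits must load back unchanged through from_config().
config, credentials, load_versions, save_version = catalog.to_config()
catalog_restored = KedroDataCatalog.from_config(
    config, credentials, load_versions, save_version
)
```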
Implementation
Solution description
We consider 3 different ways of loading datasets:

1. Through the catalog configuration, at the catalog level.
2. With the `dataset.from_config()` method to instantiate a dataset, which calls the underlying dataset constructor.
3. Directly through the dataset constructor.

1 can be solved at the catalog level; 2 and 3 require retrieving the dataset configuration from the instantiated dataset object. The sketch below shows the three paths side by side.
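For illustration only (dataset names and file paths here are made up):

```python
from kedro.io import KedroDataCatalog
from kedro_datasets.pandas import CSVDataset

entry = {"type": "pandas.CSVDataset", "filepath": "data/01_raw/cars.csv"}

# 1. Catalog-level loading: the config is known to the catalog itself.
catalog = KedroDataCatalog.from_config({"cars": entry})

# 2. Dataset-level loading: the config is handed to the dataset class.
cars = CSVDataset.from_config("cars", entry)

# 3. Direct instantiation: no config dict ever exists.
cars = CSVDataset(filepath="data/01_raw/cars.csv")
```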
Solution for 2 and 3 avoiding existing datasets' modifications (as per requirements):

- Use `AbstractDataset.__init_subclass__`, which allows changing the behavior of subclasses from inside `AbstractDataset`: https://docs.python.org/3/reference/datamodel.html#customizing-class-creation
- Store the constructor arguments on `AbstractDataset` in the `_init_args` field.
- Implement `AbstractDataset.to_config()` to retrieve configuration from the instantiated dataset object based on the object's `_init_args` (a sketch of this mechanism follows the list).
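A minimal sketch of how this could look, assuming the `_init_args` field and `to_config()` method named in this issue; the wrapping details are illustrative, not the actual implementation:

```python
import functools
import inspect


class AbstractDataset:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        init = cls.__init__

        @functools.wraps(init)
        def new_init(self, *args, **kw):
            # Bind the call to the original signature so positional and
            # keyword arguments are both captured by parameter name.
            bound = inspect.signature(init).bind(self, *args, **kw)
            self._init_args = {
                name: value for name, value in bound.arguments.items() if name != "self"
            }
            init(self, *args, **kw)

        cls.__init__ = new_init

    def to_config(self) -> dict:
        # Rebuild a config entry from the recorded constructor arguments.
        config = dict(getattr(self, "_init_args", {}))
        config["type"] = f"{type(self).__module__}.{type(self).__name__}"
        return config


class CSVLikeDataset(AbstractDataset):  # hypothetical subclass for demonstration
    def __init__(self, filepath: str, load_args: dict | None = None):
        self.filepath = filepath
        self.load_args = load_args


print(CSVLikeDataset("data/01_raw/cars.csv").to_config())
# {'filepath': 'data/01_raw/cars.csv', 'type': '__main__.CSVLikeDataset'}
```

Because `__init_subclass__` runs once per subclass definition, no existing dataset needs to change: every subclass gets its `__init__` wrapped automatically.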
Implement `KedroDataCatalog.to_config`
Once 2 and 3 are solved, we can implement a common solution at the catalog level. For that, we need to consider the cases where we work with lazy and materialized datasets, retrieving the configuration either from the catalog or via `AbstractDataset.to_config()`.

After the configuration is retrieved, we need to "unresolve" the credentials and keep them in a separate dictionary, as we did when instantiating the catalog. For that, a `CatalogConfigResolver.unresolve_config_credentials()` method can be implemented to undo the result of `CatalogConfigResolver._resolve_config_credentials()`. A rough sketch of the unresolving step follows.
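This standalone sketch assumes the resolved configs still carry the injected credentials dicts, and uses a hypothetical `<dataset>_credentials` naming scheme; nested configs such as `CachedDataset`'s inner dataset would need recursion, omitted here:

```python
def unresolve_config_credentials(config: dict) -> tuple[dict, dict]:
    """Split inline credentials back out of resolved dataset configs.

    Returns (config, credentials) so the pair can be fed straight back
    into KedroDataCatalog.from_config().
    """
    stripped, credentials = {}, {}
    for ds_name, ds_config in config.items():
        ds_config = dict(ds_config)
        if isinstance(ds_config.get("credentials"), dict):
            creds_name = f"{ds_name}_credentials"  # assumed naming convention
            credentials[creds_name] = ds_config["credentials"]
            ds_config["credentials"] = creds_name
        stripped[ds_name] = ds_config
    return stripped, credentials
```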
Excluding parameters and `MemoryDataset`s

We need to exclude `MemoryDataset`s as well as `parameters` from the resulting configuration.
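For instance, a hypothetical filter applied while building the output config might look like this (the type check is just one possible approach):

```python
from kedro.io import MemoryDataset


def _should_serialize(ds_name: str, dataset) -> bool:
    # Skip parameters ("parameters", "params:...") and in-memory datasets,
    # since neither can be meaningfully reconstructed from config.
    if ds_name == "parameters" or ds_name.startswith("params:"):
        return False
    return not isinstance(dataset, MemoryDataset)
```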
Not covered cases

- Non-primitive objects passed to the dataset constructor, such as connections or `Credentials` (`from google.oauth2.credentials import Credentials`) - https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/_modules/kedro_datasets/pandas/gbq_dataset.html#GBQQueryDataset
- `type[AbstractDataset]` constructor parameters - https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/_modules/kedro_datasets/partitions/incremental_dataset.html#IncrementalDataset

These cases require overriding `AbstractDataset.to_config()` at the dataset level to serialize those objects; they can be addressed one by one in separate PRs (a sketch of such an override follows the list).

- `LambdaDataset` - not the case anymore since "Can we remove `LambdaDataset`?" #4292
- `SharedMemoryDataset` - not expected to be saved and loaded.
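A hedged sketch of such a dataset-level override for the credentials case, building on the `AbstractDataset` sketch above; this is a simplified stand-in, and the attribute names are illustrative, not taken from the real `GBQQueryDataset`:

```python
import google.oauth2.credentials
from google.cloud import bigquery


class GBQQueryDataset(AbstractDataset):  # simplified stand-in, not the real class
    def __init__(self, sql: str, credentials: dict):
        self._sql = sql
        # Keep the primitive dict so to_config() can emit it later; the
        # non-serializable Credentials/Client objects are built on demand.
        self._raw_credentials = credentials

    def _get_client(self) -> bigquery.Client:
        creds = google.oauth2.credentials.Credentials(**self._raw_credentials)
        return bigquery.Client(credentials=creds)

    def to_config(self) -> dict:
        # Dataset-level override: serialize the primitive inputs, not the
        # live client object.
        return {
            "type": "pandas.GBQQueryDataset",
            "sql": self._sql,
            "credentials": self._raw_credentials,
        }
```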
Issues blocking further implementation

- "`versioned` flag and dataset `version` parameter" #4326 - currently solved the problem by adding logic to update `VERSIONED_FLAG_KEY` if `version` is provided (a sketch follows the list).
- Which `save_version` should we save and load back: "Discrepancy between setting `save_version` via catalog constructor and when passing datasets" #4327 - needs a discussion.
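A sketch of that workaround, assuming the dataset exposes a `_version` attribute (as kedro's versioned datasets do) and using kedro's `VERSIONED_FLAG_KEY`; the helper name is hypothetical:

```python
from kedro.io.core import VERSIONED_FLAG_KEY  # == "versioned"


def _ensure_versioned_flag(ds_config: dict, dataset) -> dict:
    # If the dataset was instantiated with a Version object, reflect that
    # in the emitted config so from_config() reconstructs it as versioned.
    if getattr(dataset, "_version", None) is not None:
        ds_config[VERSIONED_FLAG_KEY] = True
    return ds_config
```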
Tested with

`CachedDataset`, `PartitionedDataset`, `IncrementalDataset`, `MemoryDataset` and various other kedro datasets.

How to test
#4329 (comment)