-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support functionalities to enhance task traceability with metadata for dependency search. #450
base: master
Are you sure you want to change the base?
Conversation
gokart/utils.py
Outdated
K = TypeVar('K') | ||
|
||
|
||
def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can use both Generics and isinstance at the same time, code would be below.
def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
if isinstance(items, dict):
return {k: map_flattenable_items(v, func) for k, v in items.items()}
if isinstance(str):
return items
if isinstance(items, Iterable[T]):
return [map_flattenable_items(i, func) for i in items]
return func(items)
gokart/task.py
Outdated
@dataclass | ||
class _RequiredTaskOutput: | ||
task_name: str | ||
output_path: str | ||
|
||
_required_task_outputs = map_flattenable_items( | ||
self.requires(), | ||
func=lambda task: map_flattenable_items( | ||
task.output(), func=lambda output: _RequiredTaskOutput(task_name=task.get_task_family(), output_path=output.path()) | ||
), | ||
) | ||
required_task_outputs: dict[str, str] | None = None | ||
if isinstance(_required_task_outputs, list): | ||
required_task_outputs = {r.task_name: r.output_path for r in _required_task_outputs} | ||
elif isinstance(_required_task_outputs, dict): | ||
required_task_outputs = _required_task_outputs | ||
else: | ||
required_task_outputs = ( | ||
{_required_task_outputs.task_name: _required_task_outputs.output_path} if isinstance(_required_task_outputs, _RequiredTaskOutput) else None | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[imo]
It would become more readable with extracting this section into a method, which returns required_task_outputs
.
gokart/target.py
Outdated
lock_at_dump: bool = True, | ||
task_params: dict[str, str] | None = None, | ||
custom_labels: dict[str, Any] | None = None, | ||
required_task_outputs: dict[str, str] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[imo]
This parameter seems to be just a metadata. But its name may indicate that it effects the functionality of the method or the class's attribute. It would be better to rename for avoiding such a misleading.
LGTM |
@mamo3gr Thank you for your thoughtful comments. I'm gonna deal with all of them. |
gokart/utils.py
Outdated
if isinstance(items, str): | ||
return items # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, T
means `str, so you should apply func for this.
if isinstance(items, str): | |
return items # type: ignore | |
if isinstance(items, str): | |
return func(items) # type: ignore |
gokart/utils.py
Outdated
K = TypeVar('K') | ||
|
||
|
||
def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.python.org/3.13/library/functions.html#map
python original map define map(function, iterable)
, so you must suit python's manner.
def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]: | |
def map_flattenable_items(func: Callable[[T], K], items: FlattenableItems[T]) -> FlattenableItems[K]: |
gokart/utils.py
Outdated
if isinstance(items, str): | ||
return items # type: ignore | ||
if isinstance(items, Iterable): | ||
return [map_flattenable_items(i, func) for i in items] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When pass tuple[T]
, it should returns tuple[K]
. But, in this implementation, this case is not cared for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, could you add testcase?
continue | ||
merged_labels[label_name] = label_value | ||
merged_labels: dict[str, str] = {} | ||
for normalized_label in normalized_labels_list[:]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for normalized_label in normalized_labels_list[:]: | |
for normalized_label in normalized_labels_list: |
…uiredTaskOutput]]
gokart/gcs_obj_metadata_client.py
Outdated
if isinstance(required_task_outputs, tuple): | ||
return tuple(required_task_output.serialize() for required_task_output in required_task_outputs) | ||
if isinstance(required_task_outputs, Iterable): | ||
return _list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuple is one type of Iterable, and python has many iterable type, e.x) list, set, tuple, so on. You can write better like the following.
if isinstance(required_task_outputs, tuple): | |
return tuple(required_task_output.serialize() for required_task_output in required_task_outputs) | |
if isinstance(required_task_outputs, Iterable): | |
return _list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs]) | |
if isinstance(required_task_outputs, Iterable): | |
iter_type = type(required_task_outputs) | |
return iter_type(_list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kitagry
At first, basically, I think your suggestion is acceptable.
But some iterable object cannot be dump as json like set.
Any iterable object can be iteratable, so basically they are kind of list, and in most cases, even if some iterable objects would be serialized as list, I think in most cases we don't have significant problems.
So I think this implementation might be ok.
How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply.
Yes, you are correct, and you can change iterable to list,
And, What I actually meant was regarding the utils.map_flattenable_items method. I'll make another comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for replying.
I deal with your suggested changes.
if isinstance(items, str): | ||
return func(items) # type: ignore | ||
if isinstance(items, Iterable): | ||
return [map_flattenable_items(func, i) for i in items] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return [map_flattenable_items(func, i) for i in items] | |
return map(lambda item: map_flattenable_items(func, i), items) |
Related works
#445
#446
#448
What does PR do?
In this Pull Request, I implement a metadata attribution feature that enables searching for tasks dependent on specific tasks executed with a given parameter set.
Why is this needed?
Gokart caches the execution results and parameter states of each task in GCS. As shown in the Related Works section, various metadata are attached to each GCS object to enhance traceability. A common use case is searching for tasks that depend on a specific task executed with a given parameter set. Currently, Gokart does not support searching and tracing task dependencies from GCS metadata. This PR introduces this functionality.
Pre-Requisists
The focus of this PR is embedding the necessary metadata to allow the CLI to search for specific dependencies. The search functionality itself will be implemented on the CLI side (CLI: https://github.com/TlexCypher/gcs-metadog).
Checklist
CI is passing
Code formatting follows project standards.
Necessary tests have been added.
Existing tests pass.