Support functionalities to enhance task traceability with metadata for dependency search. #450

Open
TlexCypher wants to merge 17 commits into base: master

Conversation

TlexCypher
Contributor

@TlexCypher TlexCypher commented Mar 5, 2025

Related works

#445
#446
#448

What does PR do?

This pull request attaches metadata to each cached task output so that tasks depending on a specific task, executed with a given parameter set, can be searched for.

Why is this needed?

Gokart caches the execution results and parameter states of each task in GCS. As shown in the Related Works section, various metadata are attached to each GCS object to enhance traceability. A common use case is searching for tasks that depend on a specific task executed with a given parameter set. Currently, Gokart does not support searching and tracing task dependencies from GCS metadata. This PR introduces this functionality.

Prerequisites

The focus of this PR is embedding the necessary metadata to allow the CLI to search for specific dependencies. The search functionality itself will be implemented on the CLI side (CLI: https://github.com/TlexCypher/gcs-metadog).
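
As a rough illustration of the idea, the sketch below builds a metadata dictionary that records the output paths of required tasks and then filters cached objects by it. The key names (`task_params`, `required_task_outputs`), the helper names, and the in-memory `objects` mapping are assumptions made for illustration; they are not taken from this PR or from gcs-metadog.

```python
import json


def build_metadata(task_params: dict[str, str], required_task_outputs: dict[str, str]) -> dict[str, str]:
    # Assumed layout: nested data is JSON-encoded so that every metadata value is a string.
    return {
        'task_params': json.dumps(task_params, sort_keys=True),
        'required_task_outputs': json.dumps(required_task_outputs, sort_keys=True),
    }


def find_dependents(objects: dict[str, dict[str, str]], upstream_output_path: str) -> list[str]:
    # Return the names of objects whose metadata lists the given upstream output path.
    dependents = []
    for name, metadata in objects.items():
        required = json.loads(metadata.get('required_task_outputs', '{}'))
        if upstream_output_path in required.values():
            dependents.append(name)
    return sorted(dependents)


# Hypothetical cached objects: object name -> attached metadata.
objects = {
    'train.pkl': build_metadata({'seed': '1'}, {'Preprocess': 'gs://bucket/preprocess.pkl'}),
    'report.pkl': build_metadata({}, {'Train': 'gs://bucket/train.pkl'}),
}
dependents_of_preprocess = find_dependents(objects, 'gs://bucket/preprocess.pkl')
```

In this sketch the search is a linear scan over the metadata; a real CLI would presumably list objects from the bucket instead of using an in-memory dictionary.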

Checklist

CI is passing.
Code formatting follows project standards.
Necessary tests have been added.
Existing tests pass.

@TlexCypher TlexCypher marked this pull request as draft March 5, 2025 09:24
gokart/utils.py Outdated
K = TypeVar('K')


def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
Contributor Author

If generics and isinstance could be used together, the code would look like this:

def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
    if isinstance(items, dict):
        return {k: map_flattenable_items(v, func) for k, v in items.items()}
    if isinstance(items, str):
        return items
    if isinstance(items, Iterable[T]):  # hypothetical: isinstance() does not accept subscripted generics
        return [map_flattenable_items(i, func) for i in items]
    return func(items)
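
For context on why this is only hypothetical, the minimal check below shows that `isinstance()` raises `TypeError` for a parameterized generic such as `Iterable[T]`, while the unparameterized `Iterable` works. The function name is made up for this illustration.

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar('T')


def accepts_subscripted_generic() -> bool:
    # isinstance() rejects parameterized generics such as Iterable[T] with a TypeError.
    try:
        isinstance([1, 2], Iterable[T])
    except TypeError:
        return False
    return True


# The unparameterized ABC works; element types would have to be checked separately.
is_iterable = isinstance([1, 2], Iterable)
```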

@TlexCypher TlexCypher changed the title from "WIP: Feat/nestmode" to "Support functionalities to add metadata to enable searching for tasks dependent on specific tasks executed with a given parameter set." Mar 5, 2025
@TlexCypher TlexCypher changed the title from "Support functionalities to add metadata to enable searching for tasks dependent on specific tasks executed with a given parameter set." to "Support functionalities to add metadata for searching tasks dependent on specific tasks executed with a given parameter set." Mar 5, 2025
@TlexCypher TlexCypher changed the title from "Support functionalities to add metadata for searching tasks dependent on specific tasks executed with a given parameter set." to "Support functionalities to enhance task traceability with metadata for dependency search." Mar 5, 2025
@TlexCypher TlexCypher marked this pull request as ready for review March 5, 2025 12:28
gokart/task.py Outdated
Comment on lines 368 to 387
@dataclass
class _RequiredTaskOutput:
    task_name: str
    output_path: str

_required_task_outputs = map_flattenable_items(
    self.requires(),
    func=lambda task: map_flattenable_items(
        task.output(), func=lambda output: _RequiredTaskOutput(task_name=task.get_task_family(), output_path=output.path())
    ),
)
required_task_outputs: dict[str, str] | None = None
if isinstance(_required_task_outputs, list):
    required_task_outputs = {r.task_name: r.output_path for r in _required_task_outputs}
elif isinstance(_required_task_outputs, dict):
    required_task_outputs = _required_task_outputs
else:
    required_task_outputs = (
        {_required_task_outputs.task_name: _required_task_outputs.output_path} if isinstance(_required_task_outputs, _RequiredTaskOutput) else None
    )
Contributor

[imo]
This would be more readable if this section were extracted into a method that returns required_task_outputs.
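
One possible shape for such an extraction is sketched below. It mirrors the branching in the excerpt above, but the names `RequiredTaskOutput` and `to_required_task_outputs` are hypothetical and the function is written standalone rather than as a method of the task class.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RequiredTaskOutput:
    task_name: str
    output_path: str


def to_required_task_outputs(items) -> dict[str, str] | None:
    """Normalize a single item, a list of items, or a dict into one dictionary."""
    if isinstance(items, RequiredTaskOutput):
        return {items.task_name: items.output_path}
    if isinstance(items, list):
        return {r.task_name: r.output_path for r in items}
    if isinstance(items, dict):
        # As in the excerpt, a dict is passed through unchanged.
        return items
    return None


single = to_required_task_outputs(RequiredTaskOutput('Preprocess', 'preprocess.pkl'))
several = to_required_task_outputs([RequiredTaskOutput('A', 'a.pkl'), RequiredTaskOutput('B', 'b.pkl')])
```

With the branching isolated like this, the calling code reduces to a single assignment and the normalization can be unit-tested on its own.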

gokart/target.py Outdated
lock_at_dump: bool = True,
task_params: dict[str, str] | None = None,
custom_labels: dict[str, Any] | None = None,
required_task_outputs: dict[str, str] | None = None,
Contributor

[imo]
This parameter appears to be metadata only, but its name suggests that it affects the behavior of the method or is an attribute of the class. Renaming it would avoid that confusion.

@mamo3gr
Contributor

mamo3gr commented Mar 6, 2025

LGTM
I left some comments on code readability; whether to apply them is up to you.

@TlexCypher
Contributor Author

@mamo3gr Thank you for your thoughtful comments. I'll address all of them.

gokart/utils.py Outdated
Comment on lines 80 to 81
if isinstance(items, str):
return items # type: ignore
Member

@kitagry kitagry Mar 6, 2025

In this case, T is `str`, so func should be applied to it.

Suggested change
- if isinstance(items, str):
-     return items # type: ignore
+ if isinstance(items, str):
+     return func(items) # type: ignore
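
The explicit `str` branch matters because a string is itself an `Iterable`, and iterating over a one-character string yields that same string again, so falling through to the iterable branch would recurse without end. The simplified mapper below, with the hypothetical name `map_items`, treats strings as leaves and applies `func` to them.

```python
from collections.abc import Iterable


def map_items(func, items):
    # Simplified sketch: strings are leaves, other iterables are traversed recursively.
    if isinstance(items, str):
        return func(items)
    if isinstance(items, Iterable):
        return [map_items(func, i) for i in items]
    return func(items)


string_is_iterable = isinstance('abc', Iterable)
upper = map_items(str.upper, ['a', ['b', 'c']])
```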

gokart/utils.py Outdated
K = TypeVar('K')


def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
Member

https://docs.python.org/3.13/library/functions.html#map

Python's built-in map is defined as map(function, iterable), so the argument order here should follow the same convention.

Suggested change
- def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
+ def map_flattenable_items(func: Callable[[T], K], items: FlattenableItems[T]) -> FlattenableItems[K]:

gokart/utils.py Outdated
if isinstance(items, str):
return items # type: ignore
if isinstance(items, Iterable):
return [map_flattenable_items(i, func) for i in items]
Member

When a tuple[T] is passed, it should return a tuple[K], but this implementation does not handle that case.

Member

Also, could you add a test case?
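
A tuple-preserving variant with matching test cases might look like the sketch below. The function name and the exact container rules (dicts keep their keys, tuples stay tuples, other iterables become lists) are assumptions for illustration, not the code that was eventually committed.

```python
from collections.abc import Iterable


def map_flattenable_items_sketch(func, items):
    # Assumed behaviour: dicts keep their keys, tuples stay tuples, other iterables become lists.
    if isinstance(items, dict):
        return {k: map_flattenable_items_sketch(func, v) for k, v in items.items()}
    if isinstance(items, tuple):
        return tuple(map_flattenable_items_sketch(func, i) for i in items)
    if isinstance(items, str):
        return func(items)
    if isinstance(items, Iterable):
        return [map_flattenable_items_sketch(func, i) for i in items]
    return func(items)


def test_tuple_is_preserved():
    assert map_flattenable_items_sketch(lambda x: x * 2, (1, 2)) == (2, 4)


def test_nested_containers():
    result = map_flattenable_items_sketch(str, {'a': [1, (2, 3)], 'b': 4})
    assert result == {'a': ['1', ('2', '3')], 'b': '4'}
```

The tuple check has to come before the generic `Iterable` check, since a tuple would otherwise be caught by the list branch.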

continue
merged_labels[label_name] = label_value
merged_labels: dict[str, str] = {}
for normalized_label in normalized_labels_list[:]:
Member

Suggested change
- for normalized_label in normalized_labels_list[:]:
+ for normalized_label in normalized_labels_list:
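
The `[:]` slice creates a shallow copy of the list, which is only useful when the list is modified while it is being iterated. The sketch below shows a merge loop that reads the list without mutating it, so no copy is needed. The function name and the rule that the first occurrence of a label name wins are assumptions, since the surrounding code is not shown in full here.

```python
def merge_labels(normalized_labels_list: list[dict[str, str]]) -> dict[str, str]:
    # Hypothetical merge rule: the first occurrence of a label name wins.
    merged_labels: dict[str, str] = {}
    for normalized_label in normalized_labels_list:  # no [:] copy: the list is not mutated
        for label_name, label_value in normalized_label.items():
            if label_name in merged_labels:
                continue
            merged_labels[label_name] = label_value
    return merged_labels


labels = [{'a': '1'}, {'a': '2', 'b': '3'}]
merged = merge_labels(labels)
```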

@TlexCypher TlexCypher requested a review from kitagry March 6, 2025 07:58
@TlexCypher TlexCypher requested review from kitagry and mamo3gr March 6, 2025 10:17
Comment on lines 134 to 137
if isinstance(required_task_outputs, tuple):
return tuple(required_task_output.serialize() for required_task_output in required_task_outputs)
if isinstance(required_task_outputs, Iterable):
return _list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs])
Member

A tuple is just one kind of Iterable, and Python has many iterable types (e.g., list, set, tuple). This could be written more generally as follows.

Suggested change
- if isinstance(required_task_outputs, tuple):
-     return tuple(required_task_output.serialize() for required_task_output in required_task_outputs)
- if isinstance(required_task_outputs, Iterable):
-     return _list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs])
+ if isinstance(required_task_outputs, Iterable):
+     iter_type = type(required_task_outputs)
+     return iter_type(_list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs]))

Contributor Author

@kitagry
Your suggestion is basically acceptable, but some iterable types, such as set, cannot be dumped as JSON.
Every iterable can be iterated over, so it can be treated much like a list, and in most cases serializing an iterable as a list causes no significant problem.
So I think the current implementation may be fine.
What do you think?
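
The JSON concern can be checked with the standard library alone: `json.dumps` raises `TypeError` for a set, while tuples and lists are both written as JSON arrays. The helper name `can_dump` and the sample paths are made up for this illustration.

```python
import json


def can_dump(value) -> bool:
    # json.dumps raises TypeError for types it cannot encode, such as set.
    try:
        json.dumps(value)
    except TypeError:
        return False
    return True


set_ok = can_dump({'a.pkl', 'b.pkl'})
tuple_as_json = json.dumps(('a.pkl', 'b.pkl'))
# Converting the set to a sorted list first makes it serializable and deterministic.
list_as_json = json.dumps(sorted({'a.pkl', 'b.pkl'}))
```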

Member

Sorry for the late reply.

Yes, you are correct; you can change Iterable to list.

What I actually meant concerned the utils.map_flattenable_items method. I'll leave a separate comment.

Contributor Author

Thank you for the reply.
I have addressed your suggested changes.

@TlexCypher TlexCypher requested a review from kitagry March 13, 2025 06:46
if isinstance(items, str):
return func(items) # type: ignore
if isinstance(items, Iterable):
return [map_flattenable_items(func, i) for i in items]
Member

Suggested change
- return [map_flattenable_items(func, i) for i in items]
+ return map(lambda item: map_flattenable_items(func, item), items)
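
One thing to keep in mind with this suggestion is that `map` returns a lazy, single-use iterator rather than a list, so callers that index, re-iterate, or JSON-encode the result would need to wrap it in `list(...)`. Whether the final code does so is not shown in this thread; the function below is a made-up stand-in that only demonstrates the behaviour.

```python
def double_all_lazy(items):
    # Stand-in for a mapper that returns the map object directly.
    return map(lambda item: item * 2, items)


lazy = double_all_lazy([1, 2, 3])
first_pass = list(lazy)
second_pass = list(lazy)  # the iterator is already exhausted at this point
eager = list(double_all_lazy([1, 2, 3]))
```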

3 participants