Support functionalities to enhance task traceability with metadata for dependency search. #450

TlexCypher · 2025-03-05T09:24:15Z

Related works

What does PR do?

In this Pull Request, I implement a metadata attribution feature that enables searching for tasks dependent on specific tasks executed with a given parameter set.

Why is this needed?

Gokart caches the execution results and parameter states of each task in GCS. As shown in the Related Works section, various metadata are attached to each GCS object to enhance traceability. A common use case is searching for tasks that depend on a specific task executed with a given parameter set. Currently, Gokart does not support searching and tracing task dependencies from GCS metadata. This PR introduces this functionality.

Prerequisites

The focus of this PR is embedding the necessary metadata to allow the CLI to search for specific dependencies. The search functionality itself will be implemented on the CLI side (CLI: https://github.com/TlexCypher/gcs-metadog).

Checklist

CI is passing
Code formatting follows project standards.
Necessary tests have been added.
Existing tests pass.

TlexCypher · 2025-03-05T09:27:26Z

gokart/utils.py

+K = TypeVar('K')
+
+
+def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:


If we could use both generics and isinstance at the same time, the code would look like this:

def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
    if isinstance(items, dict):
        return {k: map_flattenable_items(v, func) for k, v in items.items()}
    if isinstance(items, str):
        return items
    if isinstance(items, Iterable[T]):
        return [map_flattenable_items(i, func) for i in items]
    return func(items)
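For context on why this hypothetical version cannot run as written: `isinstance` rejects parameterized generics such as `Iterable[T]` at runtime. The following is a minimal sketch of that behavior; the helper name `accepts_parameterized_check` is hypothetical and exists only for this illustration.

```python
from collections.abc import Iterable


def accepts_parameterized_check(value: object) -> bool:
    """Report whether isinstance() accepts a parameterized generic for this value."""
    try:
        # Parameterized generics such as Iterable[int] are rejected at runtime.
        isinstance(value, Iterable[int])
    except TypeError:
        return False
    return True


# The unparameterized ABC works; element types would have to be checked separately.
plain_check = isinstance([1, 2, 3], Iterable)
parameterized_check = accepts_parameterized_check([1, 2, 3])
```

This is why the real implementation has to test against the bare `Iterable` and rely on `# type: ignore` or casts for the element type.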

mamo3gr · 2025-03-06T02:38:46Z

gokart/task.py

+        @dataclass
+        class _RequiredTaskOutput:
+            task_name: str
+            output_path: str
+
+        _required_task_outputs = map_flattenable_items(
+            self.requires(),
+            func=lambda task: map_flattenable_items(
+                task.output(), func=lambda output: _RequiredTaskOutput(task_name=task.get_task_family(), output_path=output.path())
+            ),
+        )
+        required_task_outputs: dict[str, str] | None = None
+        if isinstance(_required_task_outputs, list):
+            required_task_outputs = {r.task_name: r.output_path for r in _required_task_outputs}
+        elif isinstance(_required_task_outputs, dict):
+            required_task_outputs = _required_task_outputs
+        else:
+            required_task_outputs = (
+                {_required_task_outputs.task_name: _required_task_outputs.output_path} if isinstance(_required_task_outputs, _RequiredTaskOutput) else None
+            )


[imo]
Extracting this section into a method that returns required_task_outputs would make it more readable.
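One possible shape for such an extraction is sketched below. It mirrors the branches in the quoted diff, but the function name `to_required_task_outputs` and the module-level `RequiredTaskOutput` dataclass are assumptions made for this illustration, not names confirmed by the PR at this point in the review.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RequiredTaskOutput:
    # Hypothetical stand-in for the dataclass defined inline in the diff.
    task_name: str
    output_path: str


def to_required_task_outputs(collected: object) -> dict[str, str] | None:
    """Convert the mapped requires() result into a task_name -> output_path dict."""
    if isinstance(collected, RequiredTaskOutput):
        return {collected.task_name: collected.output_path}
    if isinstance(collected, list):
        return {r.task_name: r.output_path for r in collected}
    if isinstance(collected, dict):
        # Same as the diff: dict results are passed through unchanged.
        return collected
    return None
```

With a helper like this, the calling code reduces to a single assignment, and each input shape can be unit-tested on its own.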

mamo3gr · 2025-03-06T02:59:59Z

gokart/target.py

+        lock_at_dump: bool = True,
+        task_params: dict[str, str] | None = None,
+        custom_labels: dict[str, Any] | None = None,
+        required_task_outputs: dict[str, str] | None = None,


[imo]
This parameter seems to be just metadata, but its name may suggest that it affects the behavior of the method or the class's attributes. It would be better to rename it to avoid that misreading.

mamo3gr · 2025-03-06T03:04:53Z

LGTM
I left some comments on improving code readability; whether to apply them is up to you.

TlexCypher · 2025-03-06T04:31:16Z

@mamo3gr Thank you for your thoughtful comments. I'm gonna deal with all of them.

kitagry · 2025-03-06T04:34:03Z

gokart/utils.py

+    if isinstance(items, str):
+        return items  # type: ignore


In this case, T is `str`, so func should be applied to it.

Suggested change

-    if isinstance(items, str):
-        return items  # type: ignore
+    if isinstance(items, str):
+        return func(items)  # type: ignore

kitagry · 2025-03-06T04:36:49Z

gokart/utils.py

+K = TypeVar('K')
+
+
+def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:


https://docs.python.org/3.13/library/functions.html#map

Python's built-in map is defined as map(function, iterable), so this function should follow the same argument order.

Suggested change

-def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
+def map_flattenable_items(func: Callable[[T], K], items: FlattenableItems[T]) -> FlattenableItems[K]:

kitagry · 2025-03-06T04:40:26Z

gokart/utils.py

+    if isinstance(items, str):
+        return items  # type: ignore
+    if isinstance(items, Iterable):
+        return [map_flattenable_items(i, func) for i in items]


When a tuple[T] is passed, it should return a tuple[K], but this implementation does not handle that case.

Also, could you add a test case?
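A minimal sketch of how the tuple case and the string case raised in this review could be handled together. It uses the `map(function, iterable)` argument order suggested above and loose `Any` typing instead of gokart's `FlattenableItems` alias; it is not the code that was merged.

```python
from collections.abc import Iterable
from typing import Any, Callable


def map_flattenable_items(func: Callable[[Any], Any], items: Any) -> Any:
    """Apply func to every leaf of a nested dict/tuple/iterable structure."""
    if isinstance(items, dict):
        return {key: map_flattenable_items(func, value) for key, value in items.items()}
    if isinstance(items, str):
        # A str is iterable, but here it is a leaf, so func is applied directly.
        return func(items)
    if isinstance(items, tuple):
        # Preserve tuples so that tuple[T] maps to tuple[K].
        return tuple(map_flattenable_items(func, item) for item in items)
    if isinstance(items, Iterable):
        return [map_flattenable_items(func, item) for item in items]
    return func(items)
```

A test case for the tuple behavior could then assert that `map_flattenable_items(str.upper, ('a', ['b']))` returns a tuple whose second element is a list.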

kitagry · 2025-03-06T04:42:08Z

gokart/gcs_obj_metadata_client.py

-                continue
-            merged_labels[label_name] = label_value
+        merged_labels: dict[str, str] = {}
+        for normalized_label in normalized_labels_list[:]:


Suggested change

-        for normalized_label in normalized_labels_list[:]:
+        for normalized_label in normalized_labels_list:

kitagry · 2025-03-07T03:29:09Z

gokart/gcs_obj_metadata_client.py

+        if isinstance(required_task_outputs, tuple):
+            return tuple(required_task_output.serialize() for required_task_output in required_task_outputs)
+        if isinstance(required_task_outputs, Iterable):
+            return _list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs])


tuple is one kind of Iterable, and Python has many iterable types, e.g. list, set, and tuple. This could be written more generally, like the following.

Suggested change

-        if isinstance(required_task_outputs, tuple):
-            return tuple(required_task_output.serialize() for required_task_output in required_task_outputs)
-        if isinstance(required_task_outputs, Iterable):
-            return _list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs])
+        if isinstance(required_task_outputs, Iterable):
+            iter_type = type(required_task_outputs)
+            return iter_type(_list_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs]))

@kitagry
Your suggestion is basically acceptable to me.
However, some iterable types, such as set, cannot be dumped as JSON.
Every iterable can be iterated over, so it can be treated much like a list; even if some iterables end up serialized as lists, I don't think that causes significant problems in most cases.
So I think the current implementation is probably OK.
What do you think?

Sorry for the late reply.

Yes, you are correct, and you can change Iterable to list.

What I actually meant was about the utils.map_flattenable_items method. I'll make another comment.

Thank you for replying.
I will address your suggested changes.
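The JSON concern in this exchange can be checked with the standard library alone. The sketch below uses made-up output paths as sample data and shows that `json.dumps` rejects a set, while a list built from the same elements serializes without trouble.

```python
import json

# Hypothetical sample data standing in for serialized required task outputs.
outputs = {"path/a.pkl", "path/b.pkl"}

try:
    json.dumps(outputs)
    set_is_serializable = True
except TypeError:
    # json.dumps has no default encoding for set objects.
    set_is_serializable = False

# Converting to a list (sorted here for a stable order) makes it serializable.
serialized = json.dumps(sorted(outputs))
```

This is the trade-off discussed above: rebuilding the original container type preserves more information, but normalizing to a list keeps the value safe to embed in metadata as JSON.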

kitagry · 2025-03-21T01:36:09Z

gokart/utils.py

+    if isinstance(items, str):
+        return func(items)  # type: ignore
+    if isinstance(items, Iterable):
+        return [map_flattenable_items(func, i) for i in items]


Suggested change

-        return [map_flattenable_items(func, i) for i in items]
+        return map(lambda item: map_flattenable_items(func, item), items)
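One consequence of returning a bare `map` object is worth keeping in mind: it is a lazy, single-use iterator, not a list. The short sketch below, with arbitrary sample numbers, illustrates this. A later commit in this PR's history, titled "convert map object to list", suggests the result ended up being wrapped in `list(...)`, though the exact code is not shown in this thread.

```python
# map() returns a lazy iterator that can only be consumed once.
doubled = map(lambda x: x * 2, [1, 2, 3])

first_pass = list(doubled)   # consumes the iterator
second_pass = list(doubled)  # the iterator is already exhausted

# Wrapping the call in list() up front gives a reusable, serializable value.
materialized = list(map(lambda x: x * 2, [1, 2, 3]))
```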

TlexCypher · 2025-04-17T09:33:15Z

@kitagry Sorry for the late response.
I accept your suggested changes.
Could you review this PR again?

mski-iksm · 2025-04-21T15:35:02Z

gokart/task.py

+from gokart.required_task_output import RequiredTaskOutput
+from gokart.utils import map_flattenable_items
+
+if sys.version_info < (3, 13):


Maybe this part is not needed?

hiro-o918 · 2025-04-22T00:44:26Z

examples/param.ini

+local_temporary_directory=./resource/tmp
+
+[core]
+logging_conf_file=logging.ini


[nits]
Add a newline at the end of the file.

hiro-o918 · 2025-04-22T00:45:16Z

gokart/gcs_obj_metadata_client.py

        patched_metadata = GCSObjectMetadataClient._get_patched_obj_metadata(
            copy.deepcopy(original_metadata),
            task_params,
            custom_labels,
+            required_task_outputs if required_task_outputs else None,


This seems redundant.

Suggested change

-            required_task_outputs if required_task_outputs else None,
+            required_task_outputs,

hiro-o918 · 2025-04-22T00:52:52Z

gokart/gcs_obj_metadata_client.py

@@ -101,23 +107,49 @@ def _get_patched_obj_metadata(
        # However, users who utilize custom_labels are no longer expected to search using the labels generated from task parameters.
        # Instead, users are expected to search using the labels they provided.
        # Therefore, in the event of a key conflict, the value registered by the user-provided labels will take precedence.
-        _merged_labels = GCSObjectMetadataClient._merge_custom_labels_and_task_params_labels(normalized_task_params_labels, normalized_custom_labels)
+        normalized_labels = (


[imo]
I prefer this for readability.

Suggested change

-        normalized_labels = (
+        normalized_labels = [normalized_custom_labels, normalized_task_params_labels]
+        if required_task_outputs:
+            normalized_labels.append({'__required_task_outputs': json.dumps(GCSObjectMetadataClient._get_serialized_string(required_task_outputs))})

mski-iksm · 2025-04-22T01:27:16Z

gokart/gcs_obj_metadata_client.py

+        merged_labels: dict[str, str] = {}
+        for normalized_label in normalized_labels_list[:]:
+            for label_name, label_value in normalized_label.items():
+                if len(label_value) == 0:


[MUST] This code may fail, since it seems to assume that label_value is a str.

I would prefer checking that it is a str first and then checking its length, as in:

isinstance(label_value, str) and len(label_value) == 0
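To make the failure mode concrete: `len()` raises `TypeError` for values such as integers, which is what the guard above protects against. A small sketch follows; the helper names `is_empty_label` and `len_fails_on_int` are hypothetical and not part of gokart.

```python
def is_empty_label(label_value: object) -> bool:
    """Return True only for empty strings; non-str values are never 'empty'."""
    return isinstance(label_value, str) and len(label_value) == 0


def len_fails_on_int() -> bool:
    """Show that calling len() on an int raises TypeError."""
    try:
        len(3)  # type: ignore[arg-type]
    except TypeError:
        return True
    return False
```

If every value is guaranteed to be a str before this point, the guard is unnecessary, and tightening the type annotation documents that guarantee instead.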

Thank you for reviewing my code!
In my opinion, type checking is not necessary, because GCSObjectMetadataClient._normalize_labels converts all values stored in the dictionary into strings.
So label_value is definitely a string.

@TlexCypher
Then maybe the input normalized_labels_list: list[dict[str, Any]] should be normalized_labels_list: list[dict[str, str]]?

mski-iksm · 2025-04-22T01:32:10Z

gokart/gcs_obj_metadata_client.py

-                continue
-            merged_labels[label_name] = label_value
+        merged_labels: dict[str, str] = {}
+        for normalized_label in normalized_labels_list[:]:


[weak-IMO]

for normalized_label in normalized_labels_list:
    for label_name, label_value in normalized_label.items():
        if len(label_value) == 0:

I found this part a bit difficult to understand, since it is deeply nested.

It may improve if you extract the for label_name, label_value in... part into a separate function and apply it with functools.reduce().

The current code is OK, though. :)

Thank you for the great suggestion!

For this specific task of merging labels, the simple nested loop is likely more readable and Pythonic than functools.reduce.

While reduce can be used, the straightforward nested loop (or perhaps the alternative 'flattening' approach) probably offers better clarity and maintainability here.

What do you think?

I preferred the reduce approach because it expresses the intent of building merged_labels earlier, which makes the code easier for a first-time reader to understand.

merged_labels = reduce(...)

With the nested loop, you need to read down to L.147 to understand why merged_labels is being built.

However, both approaches are OK, since this is a relatively small loop nest. :)
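For comparison, here is a rough sketch of what the reduce-based variant might look like. It assumes that empty values are skipped and that labels from earlier dictionaries win on key conflicts; the full merge logic is not shown in this thread, so both rules, as well as the names `merge_two` and `merge_labels`, are assumptions.

```python
from __future__ import annotations

from functools import reduce


def merge_two(merged: dict[str, str], labels: dict[str, str]) -> dict[str, str]:
    """Add non-empty labels whose keys have not been seen yet."""
    additions = {
        name: value
        for name, value in labels.items()
        if len(value) > 0 and name not in merged
    }
    return {**merged, **additions}


def merge_labels(normalized_labels_list: list[dict[str, str]]) -> dict[str, str]:
    # The intent (building one merged dict) is visible on the first line.
    return reduce(merge_two, normalized_labels_list, {})
```

Calling `merge_labels([custom_labels, task_params_labels])` would then keep the custom label whenever both dictionaries define the same key.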

mski-iksm

@TlexCypher
I've made some comments but mainly LGTM! Thank you for your contribution!

hirosassa

Commented!

hirosassa · 2025-04-25T23:06:52Z

gokart/gcs_obj_metadata_client.py

+        def _iterable_flatten(nested_list: Iterable) -> list[str]:
+            flattened_list: list[str] = []
+            for item in nested_list:
+                if isinstance(item, Iterable):
+                    flattened_list.extend(_iterable_flatten(item))
+                else:
+                    flattened_list.append(item)
+            return flattened_list


How about using an Iterator, as below? This may reduce temporary memory usage in some cases, and it looks a bit more elegant.

Suggested change

-        def _iterable_flatten(nested_list: Iterable) -> list[str]:
-            flattened_list: list[str] = []
-            for item in nested_list:
-                if isinstance(item, Iterable):
-                    flattened_list.extend(_iterable_flatten(item))
-                else:
-                    flattened_list.append(item)
-            return flattened_list
+        def _iterable_flatten(nested_list: Iterable) -> Iterator[str]:
+            for item in nested_list:
+                if isinstance(item, Iterable):
+                    yield from _iterable_flatten(item)
+                else:
+                    yield item

We should also change L130 to:

return list(_iterable_flatten([GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs]))
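One detail to watch with either version: a `str` is itself an `Iterable`, and iterating a one-character string yields that same string again, so recursing into strings never terminates. The standalone sketch below adds an explicit `str` guard. That guard is an addition for this illustration and is not part of the suggestion above; the real code may already avoid the problem in another way.

```python
from collections.abc import Iterable, Iterator


def iterable_flatten(nested: Iterable) -> Iterator[str]:
    """Lazily yield the string leaves of an arbitrarily nested iterable."""
    for item in nested:
        # Treat str as a leaf; otherwise recursion on 'a' -> 'a' never ends.
        if isinstance(item, Iterable) and not isinstance(item, str):
            yield from iterable_flatten(item)
        else:
            yield item


# Materialize only at the point where a list is actually needed.
flattened = list(iterable_flatten(["a", ["b", ("c", ["d"])]]))
```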

荒木太一 added 6 commits March 4, 2025 13:42

WIP: End to implement the logic to gather the required task output path.

79a2881

WIP: success to add output path in nest mode, but some other case should be handled.

0cfe7ee

WIP: no ci apply.

3eee422

feat: fix to pass labels and has_seen_keys.

ec3bf4f

feat: fix conflicts

22a69d0

CI: apply ruff and mypy

08e3f59

TlexCypher marked this pull request as draft March 5, 2025 09:24

TlexCypher commented Mar 5, 2025

feat: add implementation of nest mode.

9b19a1c

TlexCypher changed the title ~~WIP: Feat/nestmode~~ Support functionalities to add metadata to enable searching for tasks dependent on specific tasks executed with a given parameter set. Mar 5, 2025

TlexCypher changed the title ~~Support functionalities to add metadata for searching tasks dependent on specific tasks executed with a given parameter set.~~ Support functionalities to enhance task traceability with metadata for dependency search. Mar 5, 2025

TlexCypher marked this pull request as ready for review March 5, 2025 12:28

TlexCypher mentioned this pull request Mar 5, 2025

Feat/nest mode TlexCypher/gcs-metadog#1

Merged

mamo3gr reviewed Mar 6, 2025

kitagry requested changes Mar 6, 2025

feat: deal with kitagry comments.

accbf1d

TlexCypher requested a review from kitagry March 6, 2025 07:58

kitagry reviewed Mar 6, 2025

荒木太一 added 6 commits March 6, 2025 18:21

feat: Remove CLI dependencies.

6719f4d

feat: remove redundant statements.

0bcc16c

feat: change serialization expression for single FlattenableItems[RequiredTaskOutput]]

5c41035

CI: fix test and apply CI.

0b951ab

feat: fix mypy error.

10795a2

feat: refactoring make _list_flatten inner function.

32b4343

TlexCypher requested review from kitagry and mamo3gr March 6, 2025 10:17

feat: fix nits miss and add __ prefix to avoid conflicts.

6f70a41

kitagry reviewed Mar 7, 2025

feat: rename _list_flatten

637f5da

TlexCypher requested a review from kitagry March 13, 2025 06:46

Merge: fix conflicts.

b607926

kitagry reviewed Mar 21, 2025

TlexCypher added 4 commits April 17, 2025 18:23

Merge: fix conflicts.

a8059a1

feat: convert map object to list, any iterable objects that would be hashed should be list.

27b1abd

Merge remote-tracking branch 'origin/master' into feat/nestmode

5ac1c4d

Merge remote-tracking branch 'origin/feat/nestmode' into feat/nestmode

f4479da

TlexCypher requested a review from kitagry April 17, 2025 09:32

mski-iksm reviewed Apr 21, 2025

hiro-o918 reviewed Apr 22, 2025

mski-iksm reviewed Apr 22, 2025

TlexCypher added 2 commits April 23, 2025 07:40

feat: add new line to end of param.ini

e71833b

feat: remove redundant expressions

46aabcf

TlexCypher requested review from mski-iksm and hiro-o918 April 23, 2025 22:24

Merge branch 'master' into feat/nestmode

7bde3b0

hirosassa approved these changes Apr 25, 2025
