job status in pipes #17857

krgn · 2023-11-09T15:21:35Z

krgn
Nov 9, 2023

Hi all!

we at Staffbase have been porting our Spark jobs to pipes (running in k8s) with some good overall results. I have observed something that could become a problem down the line, which I'd like to raise here to see if this intended behavior and how to potentially mitigate it.

For instance, when a Spark driver pod OOMs, typically the dagster context is not aware of this and will show the job as having completed successfully. I'm investigating how to report this back up the chain from the spark-submit process, but noticed in dagster logs that, if pipes is never fully initialized, dagster seems to think the job was successful too.

I can see the following message in the logs:

[pipes] did not receive any messages from external process. Check stdout / stderr logs from the external process if possible.
PipesS3ContextInjector: Attempted to inject context via a temporary file in s3. Expected PipesS3ContextLoader to be explicitly passed to open_dagster_pipes in the external process.
PipesS3MessageReader: Attempted to read messages from S3 bucket dagster-pipes. Expected PipesS3MessageWriter to be explicitly passed to open_dagster_pipes in the external process.

From the user perspective, it would seem that if no back-and-forth communication channel was established, dagster should consider the job as failed, as this could otherwise lead to things failing silently.

What is the rationale behind the current semantics? Is there a "strict mode" where we can enforce this? Would it be possible to extend this even further to the notion that a job should be considered failed if no materialization was reported in the end?

Thanks!!

jamiedemaria · 2023-11-09T18:26:39Z

jamiedemaria
Nov 9, 2023
Maintainer

I think this is worthy of a bug report, filing one here #17868

1 reply

krgn Nov 10, 2023
Author

Thank you 🙏🏼

alangenfeld · 2023-11-10T16:16:01Z

alangenfeld
Nov 10, 2023
Maintainer

The thinking so far has been

the underlying compute delegation is responsible for (and capable of) determining if the core computation has failed . So the spark-diver OOM reporting as success is exceptional relative to the integrations we have done so far.
pipes would be used to enrich computations, so if the core computation is "successful" but something went awry with the pipes metadata we should not fail (at least by default)

So definitely agree we need some opt-in "strict mode" , will focus #17868 on that.

We may shift around what the right default behavior is as we continue to get feedback.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

job status in pipes #17857

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

job status in pipes #17857

krgn Nov 9, 2023

Replies: 2 comments · 1 reply

jamiedemaria Nov 9, 2023 Maintainer

krgn Nov 10, 2023 Author

alangenfeld Nov 10, 2023 Maintainer

krgn
Nov 9, 2023

Replies: 2 comments 1 reply

jamiedemaria
Nov 9, 2023
Maintainer

krgn Nov 10, 2023
Author

alangenfeld
Nov 10, 2023
Maintainer