Replies: 1 comment 6 replies
If you think the problem is related to the global logger lock issue, I would suggest trying Dagster 1.8.10, being released today, which has a fix for that issue. I've never heard of py-spy affecting the process being measured in that way; hopefully the logger fix helps, because diagnosing a problem like this without profiling information sounds very challenging.
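If py-spy pausing the target turns out to be the problem, py-spy's `--nonblocking` option samples without stopping the process (at the cost of less consistent stacks) and may be worth trying first. Another lower-impact option, shown as a minimal sketch below, is to dump every thread's stack from inside the daemon process using only the standard library; it assumes you can get a small background thread started in the daemon, for example via a `sitecustomize.py` baked into the daemon image (Python imports that module automatically at startup if it is on `sys.path`):

```python
import sys
import threading
import time
import traceback


def dump_thread_stacks(interval_seconds: float = 60.0, path: str = "/tmp/daemon-stacks.log") -> None:
    """Append a snapshot of every thread's current stack to `path` every `interval_seconds`."""
    while True:
        time.sleep(interval_seconds)
        frames = sys._current_frames()  # maps thread id -> current frame object
        names = {t.ident: t.name for t in threading.enumerate()}
        with open(path, "a") as f:
            f.write(f"\n=== snapshot at {time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())} ===\n")
            for thread_id, frame in frames.items():
                f.write(f"--- thread {names.get(thread_id, '?')} ({thread_id}) ---\n")
                f.write("".join(traceback.format_stack(frame)))


# Run the sampler as a daemon thread so it never blocks process shutdown.
threading.Thread(target=dump_thread_stacks, daemon=True, name="stack-sampler").start()
```

Comparing snapshots taken before and during a backlog should show whether the dequeue worker threads are blocked on the logging lock or stuck somewhere else.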
On our Dagster deployment, runs are being submitted on time but are taking a long time (several hours) to be dequeued by the k8s launcher. Our previous deployment was stable with 24 dequeue workers and no cap on concurrent jobs. The problems started around the time we turned on one additional hourly schedule (on top of the ~1000 schedules already on the deployment), but it's not clear to me whether that would be a trigger. I increased the number of dequeue workers to 36, with no improvement.

Between 00:00 and 06:00 UTC, many of our schedules materialize twice as many partitions, which appears to be the origin of the backlog: dequeue times slowly rise from a few seconds to multiple hours until the queue reaches a steady state around 12:00 UTC. I have attempted to diagnose the issue using py-spy, but Dagster appears to stop dequeuing runs entirely while py-spy is running against the daemon instance. The daemon instance is using 37% of its requested CPU and 27% of its requested memory.

I suspect the global logging lock is a contributor to the problem (#23933 (reply in thread)), but I am unsure what could have made the difference from a previously stable deployment. Are there any options to diagnose the run dequeue process?
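One low-overhead way to quantify the backlog without attaching a profiler is to compute, from run storage, how long recently launched runs sat in the queue. The following is a minimal sketch, not an official Dagster utility; it assumes `DagsterInstance.get()` can reach the deployment's run storage, and the `RunRecord` fields used here (`create_timestamp`, `start_time`, `dagster_run`) should be verified against your Dagster version:

```python
import time

from dagster import DagsterInstance, DagsterRunStatus, RunsFilter


def report_queue_lag(lookback_hours: float = 6.0, max_records: int = 500) -> None:
    """Print how long recently launched runs waited between submission and start."""
    instance = DagsterInstance.get()  # resolves via DAGSTER_HOME / dagster.yaml
    cutoff_ts = time.time() - lookback_hours * 3600

    records = instance.get_run_records(
        filters=RunsFilter(statuses=[DagsterRunStatus.STARTED, DagsterRunStatus.SUCCESS]),
        limit=max_records,
    )

    lags = []
    for record in records:
        # create_timestamp is assumed to be a UTC datetime; verify for your version.
        created_ts = record.create_timestamp.timestamp()
        if record.start_time is None or created_ts < cutoff_ts:
            continue
        queued_for = record.start_time - created_ts
        lags.append(queued_for)
        print(f"{record.dagster_run.run_id}: queued for {queued_for / 60:.1f} min")

    if lags:
        print(f"{len(lags)} runs, max queue lag {max(lags) / 60:.1f} min")


if __name__ == "__main__":
    report_queue_lag()
```

Running this periodically (or charting the equivalent query against the run storage database) would show whether dequeue lag grows gradually with queue depth or jumps when the 00:00-06:00 UTC schedules fire.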