-
As we've scaled up the number of assets and schedules in our Dagster deployment, we've found that schedule ticks can take a very long time to evaluate, which is a problem since we're trying to run some jobs once every 5 minutes. At best, a schedule tick takes ~40 seconds to evaluate, but when there are many concurrent ticks (which we see at the top of the hour), a single tick can take 10-15 minutes. We've tried increasing the number of workers for the schedules, increasing CPU request limits for the daemon-server, and upgrading the database instance to increase the number of CPUs. Any advice on where to optimize or how to search for bottlenecks?
-
Hi @nsteins - running py-spy (#14771) on your daemon and your code server while schedules are running slowly could help give some insight into the source of the bottleneck. If you can include speedscope output here from those processes while things are running slowly, we can take a closer look. Every schedule tick needs to run code on your code server task, so increasing the amount of CPU available to that process/task/container is one possibility to consider.
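For reference, a rough sketch of capturing a speedscope profile from a daemon running in Kubernetes (the pod name and PID here are placeholders for your deployment, and the same steps apply to the code server pod; note py-spy may need the SYS_PTRACE capability in the pod's securityContext):

```sh
# Install py-spy in the running daemon pod (or bake it into your image).
kubectl exec -it <daemon-pod> -- pip install py-spy

# Record ~60s of samples in speedscope format while schedule ticks are
# evaluating slowly. PID 1 is typically the container's main process;
# confirm with `ps` inside the pod if your entrypoint wraps it.
kubectl exec -it <daemon-pod> -- py-spy record \
    --pid 1 \
    --format speedscope \
    --output /tmp/daemon.speedscope.json \
    --duration 60

# Copy the profile out of the pod so it can be shared here.
kubectl cp <daemon-pod>:/tmp/daemon.speedscope.json ./daemon.speedscope.json
```

The resulting .speedscope.json file can be opened at https://www.speedscope.app to inspect where time is being spent.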
And #24635 fixes something similar in our own code (which was modeled after the k8s API client's model deserialization) that I can also see appearing in those speedscopes.
I do think you could try running with fewer threads to see if that improves contention here. You could also see if things get better running the daemon at a higher log level to reduce logger contention (there is a --log-level argument to the "dagster-daemon run" command, although I think we may not expose it directly in the Helm chart). A sketch of both knobs is below.
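For concreteness, a sketch of what I mean, assuming you're using the threaded scheduler (the num_workers value is illustrative; check your instance's dagster.yaml for the current setting, and note the log-level choice is just an example):

```yaml
# dagster.yaml (instance config) - lowering num_workers reduces the number
# of schedule ticks evaluated concurrently, which may reduce contention.
schedules:
  use_threads: true
  num_workers: 4
```

```sh
# Run the daemon at a higher (less verbose) log level to cut logger overhead.
# If the Helm chart doesn't expose this, overriding the container command
# for the daemon deployment is one way to pass it.
dagster-daemon run --log-level warning
```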