-
As we've scaled up the number of assets and schedules in our Dagster deployment, we've found that schedule ticks can take a very long time to evaluate, which is a problem since we're trying to run some jobs once every 5 minutes. At best, a schedule tick takes ~40 seconds to evaluate, but when there are many concurrent ticks (which we see at the top of the hour), a single tick can take 10-15 minutes. We've tried increasing the number of workers for the schedules, increasing CPU request limits for the daemon-server, and upgrading the database instance to increase the number of CPUs. Any advice on where to optimize or how to search for bottlenecks?
-
Hi @nsteins - running py-spy (#14771) on your daemon and your code server while schedules are running slowly could help give some insight into the source of the bottleneck. If you can include speedscope output here from those processes while things are running slowly, we can take a closer look. Every schedule tick needs to run code on your code server task, so increasing the amount of CPU available to that process/task/container is one possibility to consider.
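For reference, a rough sketch of capturing a speedscope profile from a daemon running in Kubernetes (the pod name and PID here are placeholders for your deployment, and the same steps apply to the code server pod; note py-spy may need the SYS_PTRACE capability in the pod's securityContext):

```sh
# Install py-spy in the running daemon pod (or bake it into your image).
kubectl exec -it <daemon-pod> -- pip install py-spy

# Record ~60s of samples in speedscope format while schedule ticks are
# evaluating slowly. PID 1 is typically the container's main process;
# confirm with `ps` inside the pod if your entrypoint wraps it.
kubectl exec -it <daemon-pod> -- py-spy record \
    --pid 1 \
    --format speedscope \
    --output /tmp/daemon.speedscope.json \
    --duration 60

# Copy the profile out of the pod so it can be shared here.
kubectl cp <daemon-pod>:/tmp/daemon.speedscope.json ./daemon.speedscope.json
```

The resulting .speedscope.json file can be opened at https://www.speedscope.app to inspect where time is being spent.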
And #24635 fixes something similar in our own code (which was modeled after the k8s API client's model deserialization) that I can also see appearing in those speedscopes.
I do think you could try running with fewer threads to see if that improves contention here. You could also see if things get better running the daemon at a higher log level to reduce logger contention (there is a --log-level argument to the "dagster-daemon run" command, although I think we may not expose it directly in the Helm chart). A sketch of both knobs is below.
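For concreteness, a sketch of what I mean, assuming you're using the threaded scheduler (the num_workers value is illustrative; check your instance's dagster.yaml for the current setting, and note the log-level choice is just an example):

```yaml
# dagster.yaml (instance config) - lowering num_workers reduces the number
# of schedule ticks evaluated concurrently, which may reduce contention.
schedules:
  use_threads: true
  num_workers: 4
```

```sh
# Run the daemon at a higher (less verbose) log level to cut logger overhead.
# If the Helm chart doesn't expose this, overriding the container command
# for the daemon deployment is one way to pass it.
dagster-daemon run --log-level warning
```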