Replies: 1 comment 6 replies
If you think the problem is related to the global logger lock issue, I would suggest trying Dagster 1.8.10, being released today, which has a fix for that issue. I've never heard of py-spy affecting the process being measured in that way; hopefully the logger fix helps, because diagnosing a problem like this without profiling information sounds very challenging.
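If py-spy pausing the target turns out to be the problem, py-spy's `--nonblocking` option samples without stopping the process (at the cost of less consistent stacks) and may be worth trying first. Another lower-impact option, shown as a minimal sketch below, is to dump every thread's stack from inside the daemon process using only the standard library; it assumes you can get a small background thread started in the daemon, for example via a `sitecustomize.py` baked into the daemon image (Python imports that module automatically at startup if it is on `sys.path`):

```python
import sys
import threading
import time
import traceback


def dump_thread_stacks(interval_seconds: float = 60.0, path: str = "/tmp/daemon-stacks.log") -> None:
    """Append a snapshot of every thread's current stack to `path` every `interval_seconds`."""
    while True:
        time.sleep(interval_seconds)
        frames = sys._current_frames()  # maps thread id -> current frame object
        names = {t.ident: t.name for t in threading.enumerate()}
        with open(path, "a") as f:
            f.write(f"\n=== snapshot at {time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())} ===\n")
            for thread_id, frame in frames.items():
                f.write(f"--- thread {names.get(thread_id, '?')} ({thread_id}) ---\n")
                f.write("".join(traceback.format_stack(frame)))


# Run the sampler as a daemon thread so it never blocks process shutdown.
threading.Thread(target=dump_thread_stacks, daemon=True, name="stack-sampler").start()
```

Comparing snapshots taken before and during a backlog should show whether the dequeue worker threads are blocked on the logging lock or stuck somewhere else.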
On our Dagster deployment, runs are being submitted on time but are taking a long time (several hours) to be dequeued by the k8s launcher. Our previous deployment was stable with 24 dequeue workers and no cap on concurrent jobs. The problems started around the time we turned on one additional hourly schedule (on top of the ~1000 schedules already on the deployment), but it's not clear to me whether that would be a trigger. I increased the number of dequeue workers to 36, with no improvement.

Between 00:00 and 06:00 UTC, many of our schedules materialize twice as many partitions, which appears to be the origin of the backlog: dequeue times slowly rise from a few seconds to multiple hours until the queue reaches a steady state around 12:00 UTC. I have attempted to diagnose the issue using py-spy, but Dagster appears to stop dequeuing runs entirely while py-spy is running against the daemon instance. The daemon instance is using 37% of its requested CPU and 27% of its requested memory.

I suspect the global logging lock is a contributor to the problem (#23933 (reply in thread)), but I am unsure what could have made the difference from a previously stable deployment. Are there any options to diagnose the run dequeue process?
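One low-overhead way to quantify the backlog without attaching a profiler is to compute, from run storage, how long recently launched runs sat in the queue. The following is a minimal sketch, not an official Dagster utility; it assumes `DagsterInstance.get()` can reach the deployment's run storage, and the `RunRecord` fields used here (`create_timestamp`, `start_time`, `dagster_run`) should be verified against your Dagster version:

```python
import time

from dagster import DagsterInstance, DagsterRunStatus, RunsFilter


def report_queue_lag(lookback_hours: float = 6.0, max_records: int = 500) -> None:
    """Print how long recently launched runs waited between submission and start."""
    instance = DagsterInstance.get()  # resolves via DAGSTER_HOME / dagster.yaml
    cutoff_ts = time.time() - lookback_hours * 3600

    records = instance.get_run_records(
        filters=RunsFilter(statuses=[DagsterRunStatus.STARTED, DagsterRunStatus.SUCCESS]),
        limit=max_records,
    )

    lags = []
    for record in records:
        # create_timestamp is assumed to be a UTC datetime; verify for your version.
        created_ts = record.create_timestamp.timestamp()
        if record.start_time is None or created_ts < cutoff_ts:
            continue
        queued_for = record.start_time - created_ts
        lags.append(queued_for)
        print(f"{record.dagster_run.run_id}: queued for {queued_for / 60:.1f} min")

    if lags:
        print(f"{len(lags)} runs, max queue lag {max(lags) / 60:.1f} min")


if __name__ == "__main__":
    report_queue_lag()
```

Running this periodically (or charting the equivalent query against the run storage database) would show whether dequeue lag grows gradually with queue depth or jumps when the 00:00-06:00 UTC schedules fire.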