
Strange Resource group behaviour and Cluster memory metrics #25226

nikita-sheremet-java-developer opened this issue Mar 5, 2025 · 0 comments


The configuration:

  1. Trino coordinator: 16 CPU, maxHeapSize 63000M
  2. Trino workers: 16 CPU, maxHeapSize 50000M each

Resource group config:

{
  "rootGroups": [
    {
      "name": "global",
      "schedulingPolicy": "fair",
      "hardConcurrencyLimit": 1000,
      "softMemoryLimit": "100%",
      "maxQueued": 20,
      "jmxExport": true,
      "subGroups": [
        {
          "name": "common",
          "hardConcurrencyLimit": 100,
          "maxQueued": 100,
          "softMemoryLimit": "40%",
          "jmxExport": true,
          "subGroups": [
            {
              "name": "etl",
              "hardConcurrencyLimit": 100,
              "maxQueued": 100,
              "softMemoryLimit": "10%",
              "jmxExport": true
            },
            {
              "name": "analytics",
              "hardConcurrencyLimit": 100,
              "maxQueued": 100,
              "softMemoryLimit": "10%",
              "jmxExport": true
            }
          ]
        }
      ]
    }
  ]
}

Test scenario:
I send 8 queries, 4 per group. Queries are sent in pairs, one per group, with a 20-second delay between pairs: 2 queries start at the same time, then after 20 seconds the next 2 start, and so on.
One query takes about 100 GB of memory.
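For reference, a minimal sketch of the submission loop, assuming the trino Python client and that queries are routed to the etl/analytics subgroups by a source-based selector; the hostname, user, and query are placeholders:

import threading
import time

import trino  # assumed: trino-python-client

# Placeholder for a query that uses roughly 100 GB of distributed memory.
HEAVY_QUERY = "SELECT col, count(*) FROM hive.default.big_table GROUP BY col"

def run_query(source: str) -> None:
    # `source` is assumed to be matched by a resource group selector that
    # routes queries to global.common.etl / global.common.analytics.
    conn = trino.dbapi.connect(
        host="trino-coordinator.example.com",  # placeholder
        port=8080,
        user="tester",
        source=source,
    )
    cur = conn.cursor()
    cur.execute(HEAVY_QUERY)
    cur.fetchall()

threads = []
for _ in range(4):  # 4 pairs = 8 queries total
    for source in ("etl", "analytics"):  # one query per group
        t = threading.Thread(target=run_query, args=(source,))
        t.start()
        threads.append(t)
    time.sleep(20)  # 20-second delay before the next pair

for t in threads:
    t.join()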

Scenarios with autoscaling

First scenario

  1. There are 4 workers. I submit 8 queries and get 2 queries running and 6 queries queued.
  2. Trino workers scale up to 10, so now there is more memory.
  3. The 6 queued queries start running.
  4. The queries finish.
  5. I replay the query-submission scenario, but now with 10 workers from the beginning.
  6. I expect 6-8 queries to run, but I get 2 queries running and 6 queued.

10 workers * 50000M ≈ 488.28 GB, and 10% of that is ~48.8 GB. So if one query takes more than that, all the others must be queued. But why do all 6 queued queries start running after autoscaling? Not one, not two, but all of them?
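A quick back-of-the-envelope check of those numbers (a sketch; it assumes the percentage softMemoryLimit is resolved against total cluster memory):

WORKER_HEAP_MB = 50000
QUERY_MEMORY_GB = 100  # one query takes roughly 100 GB

for workers in (4, 10):
    cluster_gb = workers * WORKER_HEAP_MB / 1024   # total cluster memory
    common_gb = cluster_gb * 0.40                  # global.common: 40%
    subgroup_gb = cluster_gb * 0.10                # etl / analytics: 10%
    print(f"{workers} workers: cluster ~{cluster_gb:.1f} GB, "
          f"common ~{common_gb:.1f} GB, subgroup ~{subgroup_gb:.1f} GB")

# 4 workers:  cluster ~195.3 GB, common ~78.1 GB, subgroup ~19.5 GB
# 10 workers: cluster ~488.3 GB, common ~195.3 GB, subgroup ~48.8 GB
# In both cases a single ~100 GB query already exceeds the 10% subgroup
# limit, so any further queries in that subgroup should be queued.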

Second scenario
After I scaled the cluster from 4 to 10 workers:

  1. The Trino UI shows 10 workers.
  2. I kill 6 workers and wait until the Trino UI shows 4 nodes.
  3. Kubernetes restores the 6 workers, and the Trino UI shows 10 nodes again.
  4. I kill 6 workers again, and Kubernetes restores those workers.
  5. The Trino UI shows 10 nodes.

See the attached image. The pod count stays the same (I think Kubernetes did not react to the brief drop in pods). In any case, if the pod count is the same or lower, the cluster memory must be the same or lower; it must not increase.

Expected result:

  1. trino_memory_ClusterMemoryManager_ClusterMemoryBytes shows ~500 GB.
  2. When I submit the queries, 2 queries run and 6 are queued.

Actual result:

  1. trino_memory_ClusterMemoryManager_ClusterMemoryBytes shows ~1 TB. After about 30-40 minutes the metric returns to ~500 GB. All the while the Trino UI shows only 10 workers.
  2. The resource groups allow more than 2 queries to run. I do not remember exactly, but ~6 queries were running.

My assumption is that resource groups look at the trino_memory_ClusterMemoryManager_ClusterMemoryBytes metric and that these two problems are related. But who knows?
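One way to check this correlation is to poll the ClusterMemoryManager MBean through the jmx catalog while killing and restoring workers, and watch whether the reported value jumps to ~1 TB. A sketch, assuming the jmx connector is enabled and that the MBean and column names below match this Trino version (they are derived from the trino_memory_ClusterMemoryManager_ClusterMemoryBytes metric name):

import time

import trino  # assumed: trino-python-client

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",  # placeholder
    port=8080,
    user="tester",
    catalog="jmx",
    schema="current",
)

for _ in range(120):  # poll every 30 s for about an hour
    cur = conn.cursor()
    cur.execute(
        'SELECT clustermemorybytes '
        'FROM "trino.memory:name=clustermemorymanager"'
    )
    (cluster_memory_bytes,) = cur.fetchone()
    print(time.strftime("%H:%M:%S"), round(cluster_memory_bytes / 1024**3, 1), "GB")
    time.sleep(30)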
