ACL: add a capability to allow coarse topology insights across namespaces without actual namespace/read-job access #24334

shoeffner · 2024-10-30T11:36:38Z

Proposal

Add a policy which allows to list a summary of allocations or a topology-tailored resource so that the topology view can be filled with relevant data without exposing all job properties. For instance, to be meaningful the topology view needs only a subset of the allocation information (RAM, CPU, node, and maybe job name and namespace, which in the best case could even be "redacted") plus basic information about nodes.

Use-cases

Our users want to know the current resource allocation across nodes (ideally they would even see which GPUs are in use, but that's a different issue).
However, they only have access to a limited number of namespaces, which results in an incomplete view. For some users, the cluster looks almost empty when they navigate to http://localhost:4646/ui/topology, making the view useless at best and potentially even misleading.

The main reason is that they want an overview of all resources, both to understand why their allocations could not be scheduled and to see how much is currently in use, so that they can make better decisions (e.g., attempt to schedule on a different node type or try with less RAM).

Attempted Solutions

To view http://localhost:4646/ui/topology, one needs to have the following minimal permissions (topology-read.policy.hcl in the example below):

namespace "*" {
  capabilities = ["read-job"]
}
node {
  policy = "read"
}

To test this, run an ACL-enabled server:

nomad agent -dev -acl-enabled

Bootstrap ACLs, apply the policy, create a token for it, and open the UI with that token:

nomad acl bootstrap
export NOMAD_TOKEN=<bootstrapped token>
nomad acl policy apply -description "allows to read topology" topology-read topology-read.policy.hcl 
nomad acl token create -name topology-reader -policy topology-read
NOMAD_TOKEN=<new token> nomad ui -authenticate

You can change topology-read.policy.hcl and re-apply it to try out other combinations, but in my testing these were the minimum permissions required to read the topology.

However, granting the read-job capability across all namespaces is exactly what we do not want to do, as some projects are confidential and should not leak job specs (which might contain secrets or other confidential data, despite our best efforts to educate users about and promote the use of Vault secrets, template stanzas, etc.).

Alternative Solution

We are also considering setting up a service which gets the topology-read policy and essentially filters the JSON output, so that we can provide our own visualization to our users.
We already run an additional service to simplify scheduling of common jobs and fetching Nomad job logs via OpenSearch, so this could be an option. It would still require another service to be set up and maintained, because the existing one works with the Nomad token for API access and does not have any "service" permissions of its own in the background.
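To make the filtering idea concrete, here is a minimal sketch of what such a service could do with the allocation list before handing it to a custom visualization. It is only an illustration under assumptions: the key names (ID, NodeID, Namespace, JobID, ClientStatus, AllocatedResources) reflect my understanding of the allocation list response and should be checked against a real cluster, and redact_allocations and its readable_namespaces parameter are hypothetical names, not part of Nomad.

```python
from typing import Any, Iterable

# Keys the topology view plausibly needs; everything else is dropped.
# These key names are assumptions about the allocation list response.
TOPOLOGY_KEYS = ("ID", "NodeID", "Namespace", "JobID", "ClientStatus", "AllocatedResources")

# Keys that identify a job; masked when the caller cannot read the namespace.
IDENTIFYING_KEYS = ("Namespace", "JobID")


def redact_allocations(
    allocs: Iterable[dict[str, Any]],
    readable_namespaces: set[str],
) -> list[dict[str, Any]]:
    """Return slimmed-down allocations that are safe to show in a topology view."""
    redacted = []
    for alloc in allocs:
        # Whitelist: copy only the keys needed to draw the topology.
        slim = {key: alloc[key] for key in TOPOLOGY_KEYS if key in alloc}
        # Mask identifying fields for namespaces the caller has no access to.
        if alloc.get("Namespace") not in readable_namespaces:
            for key in IDENTIFYING_KEYS:
                if key in slim:
                    slim[key] = "redacted"
        redacted.append(slim)
    return redacted
```

A whitelist is used rather than a blacklist so that fields added to the response in the future are not exposed by accident.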


jrasell · 2024-11-04T10:27:35Z

Hi @shoeffner and thanks for raising this issue and including the great detail. To start I wanted to detail the API endpoints that the topology page uses, some of which are used for filtering options:

/v1/allocations: lists all allocations within the target region.
/v1/nodes: lists all nodes within the target region.
/v1/node/pools: lists all node pools in the target region; used for filtering and info panels.
/v1/namespaces: lists all namespaces in the target region; used for filtering and info panels.
/v1/job/{jobID}: used when highlighting an allocation on a node, to see more detail about it.

The namespace capability read-job importantly provides access to the allocation list API endpoint which is how we calculate the resource consumption (allocated resources) on each node.
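As a rough illustration of that calculation (not the UI's actual code), the sketch below sums allocated CPU and memory per node from an already-fetched allocation list. It assumes the list was requested with resources included, that each entry carries AllocatedResources.Tasks.<task>.Cpu.CpuShares and .Memory.MemoryMB, and that only pending and running allocations count; all of these are assumptions to verify. allocated_per_node is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Any, Iterable

# Client statuses treated as still holding resources; an assumption for this sketch.
ACTIVE_STATUSES = {"pending", "running"}


def allocated_per_node(allocs: Iterable[dict[str, Any]]) -> dict[str, dict[str, int]]:
    """Sum allocated CPU shares and memory (MB) per node ID."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: {"CpuShares": 0, "MemoryMB": 0})
    for alloc in allocs:
        # Skip allocations that no longer hold resources on the node.
        if alloc.get("ClientStatus") not in ACTIVE_STATUSES:
            continue
        # Tolerate missing or null resource blocks instead of raising.
        tasks = (alloc.get("AllocatedResources") or {}).get("Tasks") or {}
        for task in tasks.values():
            node_totals = totals[alloc["NodeID"]]
            node_totals["CpuShares"] += (task.get("Cpu") or {}).get("CpuShares", 0)
            node_totals["MemoryMB"] += (task.get("Memory") or {}).get("MemoryMB", 0)
    return dict(totals)
```

Subtracting these totals from each node's capacity, as reported by the node endpoints, would give the free resources per node.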

> Add a policy which allows to list a summary of allocations or a topology-tailored resource so that the topology view can be filled with relevant data without exposing all job properties.

Adding such a policy or capability would likely also require new API and RPC endpoints. My initial feeling here is that this is not something we would want to allow and would open the door for similar future requests to expose certain job specification fields to custom capabilities outside read-job.

I think your alternative solution makes sense here. Another option would be to feed the telemetry that Nomad emits about client resource allocation into a tool such as Prometheus/Grafana. This could be as simple as showing nomad.client.unallocated.cpu and nomad.client.unallocated.memory for each client. The jq commands below show how these gauges can be pulled out manually to check:

$ curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("client.unallocated.cpu"))'
$ curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("client.unallocated.memory"))'
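For a small consumer rather than a manual check, the same selection could look roughly like the sketch below, which works on the already-parsed JSON body of /v1/metrics. It assumes each gauge has Name, Value, and a Labels map containing node_id; the label name in particular is an assumption to verify against real output. unallocated_by_node is a hypothetical helper name.

```python
from typing import Any

# Metric name suffixes mapped to short field names. Suffix matching mirrors the
# contains() filter in the jq commands and tolerates a configured name prefix.
WANTED_SUFFIXES = {
    "client.unallocated.cpu": "cpu",
    "client.unallocated.memory": "memory",
}


def unallocated_by_node(metrics: dict[str, Any]) -> dict[str, dict[str, float]]:
    """Group the unallocated CPU and memory gauges by node."""
    result: dict[str, dict[str, float]] = {}
    for gauge in metrics.get("Gauges") or []:
        name = gauge.get("Name", "")
        for suffix, field in WANTED_SUFFIXES.items():
            if name.endswith(suffix):
                # "node_id" is the assumed label that identifies the client.
                node = (gauge.get("Labels") or {}).get("node_id", "unknown")
                result.setdefault(node, {})[field] = gauge["Value"]
    return result
```

The input would be the decoded response of the metrics endpoint, for example json.load() applied to the output of the curl commands above without the jq filter.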

I'll keep this issue open for a couple of days; please feel free to add any comments or questions.

shoeffner · 2024-11-12T10:17:42Z

Thanks for your thoughtful reply, @jrasell. I had not thought about the metrics. That makes sense and would not require any permissions to read jobs, so it's a good idea and we'll have a look at it, thanks! It's likely easier to implement a consumer for those than to have a service layer in the middle.

I understand and agree that this feature could potentially open up more requests for fine-grained APIs and permissions, so we will go with the metrics endpoint.

Thanks for taking the time to review this, and for your response!

shoeffner added the type/enhancement label Oct 30, 2024

jrasell added this to Nomad - Community Issues Triage Oct 30, 2024

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Oct 30, 2024

jrasell added the stage/waiting-reply label Nov 4, 2024

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Nov 8, 2024

jrasell closed this as not planned Nov 18, 2024

github-project-automation bot moved this from Triaging to Done in Nomad - Community Issues Triage Nov 18, 2024
