ACL: add a capability to allow coarse topology insights across namespaces without actual namespace/read-job access #24334
Comments
Hi @shoeffner and thanks for raising this issue and including the great detail. To start I wanted to detail the API endpoints that the topology page uses, some of which are used for filtering options:
The namespace capability
Adding such a policy or capability would likely also require new API and RPC endpoints. My initial feeling here is that this is not something we would want to allow and would open the door for similar future requests to expose certain job specification fields to custom capabilities outside I think your alternative solution makes sense here. Another option would be to utilize the telemetry that Nomad emits around client resource allocation within a tool such as Prometheus/Grafana. This could be as simple as showing
I'll keep this issue open for a couple of days; please feel free to add any comments or questions.
Thanks for your thoughtful reply, @jrasell. I had not thought about the metrics. That makes sense and would not require any permissions to read jobs, so it's a good idea and we'll have a look at that, thanks! It's likely easier to implement a consumer for those than to have a service layer in the middle. I understand and agree that this feature could potentially open up more requests for fine-grained APIs and permissions, so we will go with the metrics endpoint. Thanks for taking the time to review this and for your response!
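For anyone taking the same metrics route, here is a minimal sketch of such a consumer. It only parses Prometheus text output and sums memory per node; fetching the text from the agents is left out. The metric names `nomad_client_allocated_memory` and `nomad_client_unallocated_memory` and the `node_id` label are assumptions based on how Nomad's client allocation gauges are usually rendered, so check them against what your agents actually emit.

```python
import re
from collections import defaultdict

# Assumed metric names (Prometheus-style renderings of Nomad's client
# allocation gauges). Verify against your agents' real output.
ALLOCATED = "nomad_client_allocated_memory"
UNALLOCATED = "nomad_client_unallocated_memory"

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric name, labels dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def memory_by_node(text, node_label="node_id"):
    """Return {node: {"allocated": MB, "total": MB, "used_fraction": f}}."""
    raw = defaultdict(dict)
    for name, labels, value in parse_samples(text):
        if name in (ALLOCATED, UNALLOCATED) and node_label in labels:
            raw[labels[node_label]][name] = value
    summary = {}
    for node, values in raw.items():
        allocated = values.get(ALLOCATED, 0.0)
        total = allocated + values.get(UNALLOCATED, 0.0)
        summary[node] = {
            "allocated": allocated,
            "total": total,
            "used_fraction": allocated / total if total else 0.0,
        }
    return summary
```

Because these gauges describe nodes rather than jobs, a consumer like this needs no read-job capability in any namespace. The same pattern should extend to CPU by swapping in the corresponding gauge names.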
Proposal
Add a policy that allows listing a summary of allocations, or a topology-tailored resource, so that the topology view can be filled with relevant data without exposing all job properties. For instance, to be meaningful, the topology view needs only a subset of the allocation information (RAM, CPU, node, maybe job name and namespace; in the best case these could even be "redacted") and the basic information about nodes.
Use-cases
Our users want to know the current resource allocation across nodes (ideally they would even see GPUs in use, but that's a different issue).
However, they only get access to a limited number of namespaces which results in an incomplete view. For some users, the cluster looks almost empty when they navigate to http://localhost:4646/ui/topology, making the view useless at best but potentially even misleading.
The main reason is that they want to have an overview of all the resources to get a better idea why their allocations could not be scheduled and how much is currently in use to make better decisions (e.g., attempt to schedule on a different node type or try with less RAM, etc.).
Attempted Solutions
To view http://localhost:4646/ui/topology, one needs to have the following minimal permissions (topology-read.policy.hcl in the example below):
To test this, run an ACL-enabled server:
Bootstrap ACLs
You can change the topology-read.policy.hcl and re-apply it to try out other combinations, but in general I found out that these are the required policies to read the topology.
However, granting read-job capabilities across all namespaces is exactly what we do not want to do, as some projects are confidential and should not leak job specs (which might contain secrets or other confidential data – despite best efforts of educating about and promoting use of Vault secrets, template stanzas, etc.).
Alternative Solution
We are currently also considering setting up a service which gets the topology-read policy and essentially filters the JSON outputs such that we can provide our own visualization to our users.
Since we already have an additional service to simplify scheduling of common jobs and fetching Nomad job logs via OpenSearch, this could be an option, although it would require some additional service to be set up and maintained (as the existing service works with the Nomad token for API access and does not have any "service" permissions in the background).
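To make the filtering idea concrete, here is a rough sketch of what the redaction step in such a service might look like. It assumes allocation objects shaped like Nomad's allocation JSON (`NodeID`, `Namespace`, `JobID`, `ClientStatus`, and `AllocatedResources.Tasks.<task>.Cpu.CpuShares` / `Memory.MemoryMB`); those field names are assumptions to verify against the actual API responses, and `summarize_allocations` is a hypothetical helper, not part of Nomad.

```python
from collections import defaultdict


def summarize_allocations(allocations, redact=True):
    """Reduce full allocation objects to per-node resource totals.

    Only the node ID, CPU, and memory are kept. Job names and namespaces
    are dropped unless redact is False.
    """
    nodes = defaultdict(
        lambda: {"cpu_mhz": 0, "memory_mb": 0, "allocations": []}
    )
    for alloc in allocations:
        # Assumption: only running allocations hold resources on a node.
        if alloc.get("ClientStatus", "running") != "running":
            continue
        # Assumed field layout; missing resource data counts as zero.
        tasks = (alloc.get("AllocatedResources") or {}).get("Tasks") or {}
        cpu = sum((t.get("Cpu") or {}).get("CpuShares", 0) for t in tasks.values())
        mem = sum((t.get("Memory") or {}).get("MemoryMB", 0) for t in tasks.values())
        entry = {"cpu_mhz": cpu, "memory_mb": mem}
        if not redact:
            entry["namespace"] = alloc.get("Namespace")
            entry["job"] = alloc.get("JobID")
        node = nodes[alloc["NodeID"]]
        node["cpu_mhz"] += cpu
        node["memory_mb"] += mem
        node["allocations"].append(entry)
    return dict(nodes)
```

Building the output from an allow-list of fields, rather than deleting known sensitive keys from the original objects, means any field added to the upstream response later stays hidden by default.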