
[Feature Request] Using ACL, add a new purge-job capability so that we can restrict purging of our jobs #24147

Closed
DamianArado opened this issue Oct 8, 2024 · 10 comments


@DamianArado

Proposal

Currently, we have no control over the purging of our jobs on Nomad.

Use-cases

We want to restrict some developers' access to purge jobs on Nomad. Ideally, we want to achieve this using ACLs. Please add a new purge-job capability so that we can achieve our objective. Also, when a job is stopped on Nomad, it gets removed. We don't want this to happen because sometimes a dev is fixing bugs on that job and keeps it stopped for a while. After fixing the bugs, they start the job again so that it picks up the new code. Please ask if you want more clarification on our use case.

Attempted Solutions

No solution found yet: https://stackoverflow.com/questions/79053941/how-can-i-deny-users-the-capability-to-purge-jobs-from-nomad-ui/79054009#79054009

@schmichael
Member

We want to restrict some developers' access to purge jobs on Nomad. Ideally, we want to achieve this using ACLs.

We intend to make our ACLs more fine-grained, but do you mean an ACL representing nomad job stop or nomad job stop -purge specifically? If purge specifically, I'm curious what your use case is. That API should rarely be needed, and I can't think why it would need a different ACL than nomad job stop.

Also, when a job is stopped on Nomad, it gets removed. We don't want this to happen because sometimes a dev is fixing bugs on that job and keeps it stopped for a while. After fixing the bugs, they start the job again so that it picks up the new code.

Good news! Nomad v1.9.0, which is in beta, will allow you to tag versions of jobs, which prevents them from being purged/garbage-collected! #24055 is the PR.

That being said, it's unclear to me why the developer prefers restarting a stopped job to running the job again. It should be the nomad job run ... command either way. I'm not sure how the job being garbage collected impacts your workflow.

@DamianArado
Author

We intend to make our ACLs more fine-grained, but do you mean an ACL representing nomad job stop or nomad job stop -purge specifically? If purge specifically, I'm curious what your use case is. That API should rarely be needed, and I can't think why it would need a different ACL than nomad job stop.

We want to implement an ACL capability that will let us allow only specific people to purge a job. Currently, there is no such capability supported by Nomad. Ideally, a purge-job capability would help us a lot in protecting our jobs from being purged by devs who are only allowed to view them.

When someone removes a job from Nomad, we don't know who did it. I also tried using the Events API (the JobDeregistered event type, to be specific), but it does not satisfy our use case since this event is also emitted when we stop any job.

I think if you guys can introduce an event type like JobPurged that gets triggered when we purge a job, it would also be helpful since our main concern is to keep our jobs running and to be able to ensure that no job gets purged without our notice. We would then trigger notifications through this event so that we know which jobs were purged on which day.

I hope you understand our use case now.

Good news! Nomad v1.9.0, which is in beta, will allow you to tag versions of jobs, which prevents them from being purged/garbage-collected! #24055 is the PR.

Our jobs get updated regularly, so tagging versions won't help us since the version would eventually be incremented sooner or later.

That being said, it's unclear to me why the developer prefers restarting a stopped job to running the job again. It should be the nomad job run ... command either way. I'm not sure how the job being garbage collected impacts your workflow.

We use nomad-pack to deploy our jobs (thousands of allocations and jobs) since we don't want our devs to create .hcl files themselves (thanks to nomad-pack!).

If a job gets purged without our notice, many things that depend on it would fail, leading to customer escalations and potential loss in revenue. Hence, we want to be notified when a job gets purged in order to protect our critical jobs and ensure they keep running at all times.

Please let me know if you need any further explanations. Thanks!

@tgross
Member

tgross commented Oct 14, 2024

@DamianArado that's all pretty clear. I do want to make sure you know that if you stop a job it becomes eligible for garbage collection after job_gc_threshold passes (4 hours by default). So stopping users from using the -purge flag doesn't keep jobs from disappearing from state.
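
For reference, here's a minimal sketch of where that knob lives in the agent configuration (the "4h" shown is just the default; raise it if you want stopped jobs to linger in state longer before GC):

```hcl
# Nomad server agent configuration (sketch, not a complete config).
server {
  enabled = true

  # How long a job must be stopped/dead before it becomes eligible
  # for garbage collection. The default is 4 hours.
  job_gc_threshold = "4h"
}
```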

@DamianArado
Author

@DamianArado that's all pretty clear. I do want to make sure you know that if you stop a job it becomes eligible for garbage collection after job_gc_threshold passes (4 hours by default). So stopping users from using the -purge flag doesn't keep jobs from disappearing from state.

Yes, I'm aware of this. We can increase this threshold.

Still, any of these would be quite helpful for us:

  1. Restricting purging of jobs using ACLs.
  2. Getting a JobPurged event through the Events API.

@schmichael
Member

Ideally, a purge-job capability would help us a lot in protecting our jobs from being purged by devs who are only allowed to view them.

submit-job (register and delete/purge) vs read-job already allow you to give users readonly access to jobs.
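
For example, a read-only policy might look roughly like this (a sketch only; substitute your own namespace, and list-jobs is assumed here just so jobs show up in listings):

```hcl
# Read-only access: can view jobs and their details, but cannot
# register, stop, or purge them (no submit-job capability).
namespace "default" {
  capabilities = ["read-job", "list-jobs"]
}
```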

...it would also be helpful since our main concern is to keep our jobs running and to be able to ensure that no job gets purged without our notice.

This is what I'm still curious about: you say that your concern is to keep jobs running but the feature request is about jobs that are not running (already stopped and eligible for purge/gc).

By "running" do you mean "registered in Nomad and available in the Nomad API/UI for retrieval/editing"?

If so this is possible today for service jobs by scaling them to 0 with nomad job scale $job_name 0. The job will have a dead status but not be garbage collected. This behavior is not well documented and still receives updates (e.g. UI support just shipped in Nomad v1.9.0 with #23591), so I can understand why you wouldn't reach for it!

In fact I think the scale-job ACL capability ends up being precisely the fine-grained "can start/stop a job but not purge it" permission you're looking for!
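
Roughly, such a policy could look like this (a sketch under the assumption that scaling to/from 0 is an acceptable stand-in for stop/start in your workflow; the namespace name is an example):

```hcl
# Devs can view jobs and scale them (including down to 0 and back up),
# but cannot register new versions or purge them (no submit-job).
namespace "default" {
  capabilities = ["read-job", "list-jobs", "scale-job"]
}
```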

I think if you guys can introduce an event type like JobPurged that gets triggered when we purge a job, it would also be helpful

This seems like a reasonable feature request. If job-scale solves this issue, can you open a new issue specifically for JobPurged and we can close this one? The more focused and concise tickets are, the easier they are to implement!

If a job gets purged without our notice, many things that depend on it would fail

You said the magic word ("dependent") so I'll link the original (ancient!) issue for that to capture your use case: #545

@DamianArado
Author

submit-job (register and delete/purge) vs read-job already allow you to give users readonly access to jobs.

With read-job, they'll still be able to view the job on the UI and its logs, right?

This is what I'm still curious about: you say that your concern is to keep jobs running but the feature request is about jobs that are not running (already stopped and eligible for purge/gc).

Okay, let me clarify this: By running, I mean these jobs should not be removed by unauthorized devs who are only allowed to view them.

By "running" do you mean "registered in Nomad and available in the Nomad API/UI for retrieval/editing"?

We want to keep these jobs running on Nomad.

If so this is possible today for service jobs by scaling them to 0 with nomad job scale $job_name 0. The job will have a dead status but not be garbage collected. This behavior is not well documented and still receives updates (e.g. UI support just shipped in Nomad v1.9.0 with #23591), so I can understand why you wouldn't reach for it!
In fact I think the scale-job ACL capability ends up being precisely the fine-grained "can start/stop a job but not purge it" permission you're looking for!

Will this job keep running on Nomad? Most of the time, our jobs have more than 1 allocation. So, will all these allocations keep running? If yes, then it will satisfy our use case.

This seems like a reasonable feature request. If job-scale solves this issue, can you open a new issue specifically for JobPurged and we can close this one? The more focused and concise tickets are, the easier they are to implement!

I don't know if job-scale can solve our issue; please let me know your response first. However, I can give you 2 possible features that we desire:

  1. Ability to give devs tokens to view a job on the UI/CLI and start/stop it from the UI/CLI, but not purge it from the UI/CLI.
  2. Receiving a JobPurged event indicating the details of the job that got purged, just like JobRegistered/JobDeregistered.

You said the magic word ("dependent") so I'll link the original (ancient!) issue for that to capture your use case: #545

This won't completely solve the issue as many things that are dependent on our Nomad jobs run outside Nomad as well.
Thanks a lot for your time and attention!

@DamianArado
Author

@schmichael
Are there any updates about whether this will be picked up in the future?

We just want to have a way to log which jobs got purged on Nomad.

@tgross
Member

tgross commented Dec 6, 2024

@DamianArado I think we were trying to pin down what you meant by "running" here, but I'm reasonably satisfied at this point. For clarity, the job object that you're trying to prevent from being GC'd has 3 states: pending, running, and dead (stopped). Jobs only transition to dead when all allocations for the job are stopped and the count > 0. Jobs are only eligible for GC when they are dead.

  • job stop -purge stops the job, stops all the allocations, and GC's the job.
  • If instead you scale down a job to 0, the count==0, so a job with stopped allocations will not get GC'd.
  1. Ability to give devs tokens to view a job on the UI/CLI and start/stop it from the UI/CLI, but not purge it from the UI/CLI.

So with the information above, as @schmichael noted, this is covered today by the scale-job ACL (+ read-job).

  2. Receiving a JobPurged event indicating the details of the job that got purged, just like JobRegistered/JobDeregistered.

The event stream API already has a JobDeregister event, but it doesn't currently include the Purge boolean flag that's present in the RPC request. I was a little surprised by this but it has to do with where we're hooking the state store transaction for the event stream. It's probably possible to plumb that flag all the way down into the event stream though.

As @schmichael requested, I'll open a new issue for this.

@tgross
Member

tgross commented Dec 6, 2024

See: #24618

@DamianArado
Author

Thanks! @tgross
