-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Nomad workload information to CNI args #24319
Conversation
ed60f2b
to
745b279
Compare
Heyhey, thanks for the PR! One reason we structured the new(ish) cni {
args = {"key": "val", ...}
} was to leave the door open for something like this alongside the explicit user-provided args: cni {
nomad_args = true
} which would magically transform into some standard args that CNI plugin authors could depend on (like requested in #23830!) I'm more inclined to provide an opt-in mechanism like that. What do you think? Although I haven't written many CNI plugins myself (a couple), I also would prefer separate args instead of a JSON blob, and I think following the same convention as environment variables would be nice too, for consistency. Following that ^, this task could be accomplished with jobspec mutation at registration time, which would turn cni {
args = {
"NOMAD_NAMESPACE": "${NOMAD_NAMESPACE}",
"NOMAD_JOB_ID": "${NOMAD_JOB_ID}",
... etc ...
}
} and that would get interpolated just as user-provided args do, so wouldn't need changes all the way down in allocrunner. How does that strike you @apollo13? |
I think @gulducat is on the right track here, with one suggestion:
We probably don't need/want to do that mutation at registration time, because we have the values available on the client as part of the |
In that case, the new Alternatively / additionally, the opt-in toggle could be in |
That's a good point. The cluster admin is nominally the owner of things like what CNI plugins are available. Maybe just make it opt-out on the client? |
Works for me, I haven't written any yet I just wanted some PR to get the discussion started. I am fine with separate env variables instead.
Do we win much by making it possible to opt-out at all? I am asking because the more options the more to test and given that we set |
745b279
to
7439f05
Compare
Pushed a commit with env variables. Are there any others we should add? Task imo makes no sense since the networking information is shared by all tasks in a group. |
Aside from "plugins may not properly follow the spec" I can think of only one Nomad reason to make this opt-out-able (or even opt-in). Job IDs (and group labels) can have job "semi;colon" {
type = "batch"
group "et;tu;groupname" {
network {
mode = "bridge"
}
task "t" {
driver = "docker"
config {
image = "hello-world"
}
}
}
}
That job would suddenly break if ID/group were sent in CNI_ARGS (bridge CNI would barf; see comment from cni_args PR: #23538 (comment)). Maybe we shouldn't need to protect users from this tomfoolery, and it may be exceedingly rare (or never happen!), but it could cause a bad day for someone. I would love to be able to say "just change the job ID" but who knows what wacky conventions people have wrapped up in their IDs, so it would be nice to have some kind of opt-out mechanism, either in jobspec or client config. In this weird example, the "error" is in the spec, only affects this one job, so opting out would make sense to include there. But people may have plugins that handle it poorly, which would be a client/node concern... I'm on the fence. Do y'all have any more thoughts here? As for other args to include, I think this is a nice set to start with, and folks can request more later if the need arises. |
Sigh, and according to the spec there seems to be no way to escape that. We could escape it in json though ;)
Agreed. Sadly the non-restrictiveness of jobnames has already caused quite a few problems elsewhere.
But then we'd need it on the job-level again and not on the client I fear. Btw, this is completely ugly but mostly a matter of documentation. We could encode the semicolon using |
7439f05
to
ed2ccaa
Compare
We kicked this around internally, and maybe this is a nice simple solution that puts the onus on job authors to not use semicolons, instead of plugin authors to parse them: Only set the CNI arg if the value does not contain a How does that sound to you @apollo13 ? |
And what is the plugin supposed to do then? Throw it's hands up in the air or just panic and not assign an IP to the container 🗡️ Joking aside, I think that would be an even harder sell. If a job author uses ";" in the name and the plugin does something different because it has no name (ie network policies can't be applied) then who is supposed to figure out what went wrong and how? Sounds like a debugging nightmare. |
If it could degrade gracefully, depending on the plugin, then maybe "do something different", but I would exit with a non-zero exit code and output a message with whatever required info is missing (job ID and/or group name). We'd have documentation specifying these limits, maybe log WARN in Nomad agent logs that we're skipping the value, and if the CNI plugin makes a fuss and fails then that gets surfaced too. e.g. the existing
and it also shows up as a task event in alloc status:
I'll grant you the log is rather verbose for a short message from the plugin, but it's pretty discoverable, eh? |
Works for me, just wanted to be sure that this really what we want. That said, I personally don't have any use for jobnames containing a semicolon and would probably welcome a stricter validation there (yeah backwards compat is not going to allow for that -- though I wonder if it might be possible to make it stricter over multiple releases).
…On Tue, Nov 5, 2024, at 22:19, Daniel Bennett wrote:
If it could degrade gracefully, depending on the plugin, then maybe "do
something different", but I would exit with a non-zero exit code and
output a message with whatever required info is missing (job ID and/or
group name). We'd have documentation specifying these limits, maybe log
WARN in Nomad agent logs that we're skipping the value, and if the CNI
plugin makes a fuss and fails then that gets surfaced too.
e.g. the existing `bridge` plugin fails if it encounters a
spec-breaking `;` and that shows up in Nomad client agent logs:
> [ERROR] allocrunner/alloc_runner.go:383: client.alloc_runner: prerun failed: alloc_id=e42a7676-31f6-ca3c-2d4d-9945e82b3759 error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): ARGS: invalid pair "colon""
>
and it also shows up as a task event in alloc status:
> Recent Events:
> Time Type Description
> 2024-11-05T14:57:29-05:00 Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): ARGS: invalid pair "colon"
>
I'll grant you the log is rather verbose for a short message from the
plugin, but it's pretty discoverable, eh?
—
Reply to this email directly, view it on GitHub
<#24319 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAT5C3ZWNDG2HOXDCBBE5LZ7EY5NAVCNFSM6AAAAABQZ3C7HOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINJYGE3TAMBSGE>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Yeah I think the safe thing for today that will also work for 99.9% (totally guessing at that number) of jobs is to:
and otherwise always provide these defaults (except job id instead of name). Bonus discoverability option would communicate directly to users earlier on, too:
That all work for you @apollo13 ? You up for implementing these changes? 😁 |
Works for me yeah, should be able to implement it as well. No promises on the time-frame though since I am swamped with other work currently and don't want to commit to anything… |
@gulducat Found some time after all (weekends and such ;)). Feel free to adjust the PR as needed, especially if it is less work than playing back & forth with me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey thanks, this looks great!
I'm happy to ship your version as-is, but I toyed around with it a bit to see how it felt. feel free to take or leave my suggestions as you like.
Your changes look good imo, I especially like that we don't have duplicate the env names now. |
Co-authored-by: Daniel Bennett <[email protected]>
8b8ba8e
to
d8a9359
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title & description update, let me know if anything else is missing. |
Thanks @apollo13! |
Tihs PR adds nomad workload information to CNI_ARGS when a
cni/*
is used (not for thebridge
network which also uses CNI). This information can be used by CNI plugins to correlate the assigned IP to nomad workloads/allocations (among other things).Fixes #23830