
Making Cincinnati updates work with ostree containers #1263

Open · jlebon opened this issue Jul 21, 2022 · 29 comments
Labels: area/bootable-containers (Related to the bootable containers effort.)

@jlebon (Member) commented Jul 21, 2022

This is split out of #1219 to discuss specifically how we'll make system updates work on hosts using CoreOS layering. Quoting:

  • Updates via Zincati (update barriers; update graphs)
    • Zincati/Cincinnati offer us a "safe" path to traverse when deploying
      updates to systems. When following a container image in a registry, the
      user is just following whatever is latest. Work still needs to be done to
      bring the added value of Zincati into the CoreOS layering workflow.

Let's discuss ideas on how to address this gap.

@jlebon (Member, Author) commented Jul 21, 2022

(Originally posted by @cgwalters in #1219 (comment))

But regarding zincati specifically:

half-baked strawman: embed barriers in the container image

We encode "epochs"/barriers in the tag names, like this:

  • quay.io/coreos-assembler/fcos:stable-v0
  • quay.io/coreos-assembler/fcos:stable-v1

We embed metadata in the container images that says that quay.io/coreos-assembler/fcos:stable-v1 is the successor to quay.io/coreos-assembler/fcos:stable-v0.

This is related to ostreedev/ostree#874
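
To make the strawman concrete, here's a hedged sketch of what that successor metadata could look like as an OCI image label; the label name is invented for illustration and is not an agreed-upon convention:

    # Hypothetical: bake a "successor" pointer into the barrier image as a label
    # (the label name org.example.fcos.successor is made up).
    podman build \
      --label org.example.fcos.successor=quay.io/coreos-assembler/fcos:stable-v1 \
      -t quay.io/coreos-assembler/fcos:stable-v0 .

    # A client could then discover the next barrier tag without an update graph:
    skopeo inspect docker://quay.io/coreos-assembler/fcos:stable-v0 \
      | jq -r '.Labels["org.example.fcos.successor"]'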

@jlebon (Member, Author) commented Jul 22, 2022

Another half-baked proposal:

Tag-based updates

Zincati itself doesn't have much OSTree knowledge. Its main output is telling rpm-ostree which version to deploy. This model could carry over in the CoreOS layering world with the right semantics in place:

  1. We change the FCOS pipeline to also tag released images with the version. E.g. we'd have both the moving tag quay.io/fedora/fedora-coreos:stable and also the fixed tag quay.io/fedora/fedora-coreos:36.20220703.3.1.
  2. We require that whatever tooling users use to rebuild their layered images retains the same base tag. E.g. quay.io/jlebon/my-layered-fedora-coreos:36.20220703.3.1.
  3. We keep Zincati configured to follow the canonical update graph, but in CoreOS layering mode it notices that the host refspec is a container image and so passes the version tag to rpm-ostree rather than the OSTree checksum. I.e. instead of
    rpm-ostree deploy --lock-finalization --skip-branch-check revision=${checksum}
    
    it would run e.g.:
    rpm-ostree deploy --lock-finalization tag=${version}
    
    If the corresponding container image hasn't been built yet, it will happily block there and periodically retry until it succeeds.
  4. We teach rpm-ostree deploy to handle this (a rough sketch of the flow follows this list).
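
A rough sketch of that flow; the skopeo/podman invocations in steps 1–2 are ordinary existing commands, while the deploy syntax in step 3 is the proposed (not yet implemented) one from above:

    # 1. Pipeline side: publish the release under a fixed version tag as well
    #    as the moving "stable" tag.
    skopeo copy \
      docker://quay.io/fedora/fedora-coreos:stable \
      docker://quay.io/fedora/fedora-coreos:36.20220703.3.1

    # 2. User side: rebuild the layered image, retaining the base version tag.
    podman build -t quay.io/jlebon/my-layered-fedora-coreos:36.20220703.3.1 .
    podman push quay.io/jlebon/my-layered-fedora-coreos:36.20220703.3.1

    # 3. Zincati side (proposed): hand rpm-ostree the version tag instead of
    #    an OSTree checksum.
    rpm-ostree deploy --lock-finalization tag=36.20220703.3.1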

Rolling out out-of-cycle changes

When users modify their layered content, they might not want to wait until the next FCOS release to roll it out. What I don't want is Zincati learning to speak to container registries. Instead, we can have it periodically ask rpm-ostree to check if updates to that tag are available.

There's a similarity here with client-side RPM layering from repos: new RPM versions won't actually be applied until the next release, or unless one explicitly runs rpm-ostree upgrade. One could imagine a future Zincati knob to also force it to periodically ask rpm-ostree to check for layered RPM updates and e.g. roll out if a CVE is fixed.
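
A minimal sketch of that periodic check using the existing rpm-ostree CLI; whether --check fully covers container-image origins is an assumption here, and the Zincati wiring is the part that doesn't exist yet:

    # Ask rpm-ostree whether the tracked source has something newer, without
    # deploying anything.
    rpm-ostree upgrade --check

    # Inspect the booted/staged deployments and their (container) origin.
    rpm-ostree status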

@jlebon (Member, Author) commented Jul 22, 2022

Rolling out out-of-cycle changes

I've updated this section of the proposal based on discussions OOB with @lucab.

@dustymabe (Member):

What I don't want is Zincati learning to speak to container registries.

I'm not so sure. This doesn't feel super heavyweight to me. I guess it gets more complicated if there's authentication involved.

@cgwalters (Member):

[jlebon] What I don't want is Zincati learning to speak to container registries.
[dusty] I'm not so sure. This doesn't feel super heavyweight to me. I guess it gets more complicated if there's authentication involved.

What's cool is: I specifically split out Rust bindings for using skopeo into a dependency of ostree to support use cases exactly like this! (That project is also already being used by at least one project not published on crates.io.)

(Recently there is apparently also https://github.com/confidential-containers/image-rs/blob/main/docs/design.md, which explicitly cites the proxy code.)

@cgwalters (Member):

All this said: storing things in registries today that aren't actually runnable containers is awkward. But OCI Artifacts is coming to make that better.

@jlebon (Member, Author) commented Jul 25, 2022

What I don't want is Zincati learning to speak to container registries.

I'm not so sure. This doesn't feel super heavyweight to me. I guess it gets more complicated if there's authentication involved.

The reason I said this was that I didn't want yet another thing pulling in a container stack, but containers-image-proxy-rs is a good counterpoint. Though even then, rpm-ostree obviously needs to know how to do this, so it seems cleaner to me to have Zincati use an rpm-ostree API to check for updates instead of checking directly itself. It also allows abstracting over the delivery method (containers vs. ostree + RPMs).

cgwalters changed the title from "Making updates work with CoreOS layering" to "Making updates work with ostree containers" on Oct 2, 2022
@cgwalters (Member):

There was some out of band discussion on this as it relates to Fedora IoT and the idea of supporting Cincinnati there too; the more I think about this the more I feel like it makes sense to entirely fold the core functionality of zincati into rpm-ostree at some point.

(The whole thing we did with "update drivers" is really complex, and while I think it's still logically something we want for the general complex case, it'd be a lot more obvious from a UX point of view to have e.g. rpm-ostree upgrade do and/or output something more useful)

cgwalters added the area/bootable-containers label on Oct 3, 2022
cgwalters added a commit to cgwalters/zincati that referenced this issue Nov 2, 2022
This is part of coreos/fedora-coreos-tracker#1263

If we're booted into a container image, then instead of looking
for the special `fedora-coreos.stream` ostree commit metadata,
we can do the much more obvious and natural thing of looking at the
container image tag.
@cgwalters (Member):

Some work on this in coreos/zincati#878

cgwalters changed the title from "Making updates work with ostree containers" to "Making Cincinnati updates work with ostree containers" on Nov 18, 2022
cgwalters added commits to cgwalters/zincati that referenced this issue Nov 18, 2022 (same commit message as above)
cgwalters added a commit to cgwalters/zincati that referenced this issue Nov 19, 2022
This is part of coreos/fedora-coreos-tracker#1263

We don't yet have an official stance on how zincati and custom
container images interact.  Today, zincati just crash loops.
This changes things so that we gracefully exit if we detect
the booted system is using a container image origin.

(The code here isn't quite as clean as it could be; calling
 `std::process::exit()` in the middle of the call chain isn't
 elegant, but doing better would require plumbing an
 `Option<T>` through many layers)
cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Nov 19, 2022
cgwalters added a commit to cgwalters/zincati that referenced this issue Nov 21, 2022
This is part of coreos/fedora-coreos-tracker#1263

We don't yet have an official stance on how zincati and custom
container images interact.  Today, zincati just crash loops.
This changes things so that we exit (but still with an error) if we detect
the booted system is using a container image origin.

One nicer thing here is that the unit status is also updated, e.g.
`systemctl status zincati` will show:

`Status: "Automatic updates disabled; booted into container image ..."`
cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Jan 11, 2023
cgwalters self-assigned this on Jan 11, 2023
jlebon pushed a commit to coreos/coreos-assembler that referenced this issue Jan 11, 2023
@cgwalters (Member):

I think this currently blocks on #1367

@bgilbert (Contributor):

Re things the graph gives us, I'd split the second one and add one more:

2a. Scheduled rollouts over a defined time interval.
2b. The ability to pause or terminate a rollout.
3. Deadend releases, which can't be upgraded from. We've used this exactly once, and I'm not aware of a pressing use case for it in the initial design. But e.g. we've had conversations in the past about forcing nodes older than X to reprovision from scratch.

Re dropping the barrier releases, I'm not so concerned about the runtime cost of carrying upgrade code (it's usually a shell script, with a unit that can check for a stamp file), but I do think regressions are a concern. We'd need to get better at carrying tests for upgrading from a specified older release. Key rotation seems harder to solve though.

@cgwalters (Member) commented Jan 20, 2023

2a/2b can also be done by having the registry server itself do this, right? (Now, an interesting topic here is whether we'd want to somehow apply the same policies to clients fetching it as a container image via podman/docker/kube rather than booting it directly.)
In this implementation model, perhaps we have e.g. registry.osupdates.fedoraproject.org, a "smart" server that redirects (e.g. HTTP 302) to quay.io for blobs.

Deadend releases (item 3) are something doable via the container image metadata itself, no? There's actually an ostree EOL metadata key already... but I think we should standardize a metadata key for this and have it emit an error.
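
A hedged sketch of how that could look on the wire; the hostname is the hypothetical one above, the /v2/ paths follow the standard registry API, and the redirect responses shown in comments are what such a frontend might return (auth omitted):

    # Manifest requests hit the rollout-aware frontend...
    curl -sI \
      -H "Accept: application/vnd.oci.image.manifest.v1+json" \
      "https://registry.osupdates.fedoraproject.org/v2/fedora/fedora-coreos/manifests/stable"

    # ...while blob requests are redirected to quay.io, so the frontend never
    # serves the actual bytes. DIGEST is a placeholder.
    curl -sI \
      "https://registry.osupdates.fedoraproject.org/v2/fedora/fedora-coreos/blobs/sha256:${DIGEST}"
    # e.g. HTTP/1.1 302 Found
    #      Location: https://quay.io/v2/fedora/fedora-coreos/blobs/sha256:${DIGEST}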

@jlebon (Member, Author) commented Jan 20, 2023

In this implementation model, perhaps we have registry.osupdates.fedoraproject.org for example that's a "smart" server that redirects (304) to quay.io for blobs for example.

The redirect approach sounds interesting. If we want to keep the current wariness stuff (which is how the phasing happens), we'd have to somehow have the request include a stable UUID, e.g. as an HTTP header.

Deadend releases (item 3) are something doable via the container image metadata itself, no? There's actually an ostree EOL metadata key already... but I think we should standardize a metadata key for this and have it emit an error.

The ostree EOL metadata key addresses the case where you know at build time that it's going to be the last commit in the stream. Deadend releases are usually understood to be deadends after the fact.

Unless you mean including information about past deadend releases in the metadata of new images we push on that stream, until the deadend releases go EOL. It's a bit of a hack, but nice in its simplicity (and anyway, deadend releases should be quite rare, so I don't expect that metadata to grow out of control). I guess another approach is keeping it as a separate OCI artifact.
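
A sketch of the "carry past deadends in new image metadata" variant; the label name is made up and the exact encoding would need to be standardized:

    # Hypothetical label on newly pushed images listing releases that are
    # deadends (i.e. must not be upgraded from in place).
    skopeo inspect docker://quay.io/fedora/fedora-coreos:stable \
      | jq -r '.Labels["org.example.deadend-releases"] // empty'
    # A client that finds its own version in that list would refuse to
    # upgrade in place and surface an error instead.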

@cgwalters (Member):

The redirect approach sounds interesting. If we want to keep the current wariness stuff (which is how the phasing happens), we'd have to somehow have the request include a stable UUID, e.g. as an HTTP header.

Yes, this would require a bit of work in the containers/image (and proxy) stack to support adding a header to requests, but that seems straightforward.

Unless you mean including information about past deadend releases in the metadata of new images we push on that stream, until the deadend releases go EOL.

Right, we can push a new image that just changes the manifest, but leaves all the blobs the same. (Right now, the bootc stack actually will reboot in the metadata-only case, but we can optimize that...)

@jlebon (Member, Author) commented Jan 20, 2023

Unless you mean including information about past deadend releases in the metadata of new images we push on that stream, until the deadend releases go EOL.

Right, we can push a new image that just changes the manifest, but leaves all the blobs the same. (Right now, the bootc stack actually will reboot in the metadata-only case, but we can optimize that...)

That's not quite what I meant. :) I mean putting information about old deadends into new images and carrying it for a while. Changing manifests we've already pushed for the same version doesn't feel right. (Edit: well... I don't know, maybe it's called for given that deadends are exceptional events. But it feels funny changing the digest of an existing image.)

@bgilbert (Contributor):

The redirection server could be generic over derived images too: rollout.updates.fedoraproject.org/quay.io/fedora/fedora-coreos could be our base image, and rollout.updates.fedoraproject.org/quay.io/bgilbert/derived could apply rollout logic to a derived image. Perhaps the rollout start/duration could be driven by labels, allowing custom derives to pick their own rollout schedules? Pausing/restarting rollouts would be a manifest change (...which of course would change the hash). Registries could choose to implement this functionality natively, avoiding the redirection server, and we could ship a redirection server container for use with private registries.

...also this reinvents orchestration a bit, which doesn't feel great. In that sense, Zincati/Cincinnati is closer to the existing k8s model, where there's an external orchestrator deciding what to pull. The label approach may be easier to deploy though.

If we want to keep the current wariness stuff (which is how the phasing happens), we'd have to somehow have the request include a stable UUID, e.g. as an HTTP header.

I thought I remembered that at one point the FCOS Cincinnati handshake was changed to avoid sending a client UUID, since we wanted to avoid uniquely identifying the client. Instead, either the conditional update would be encoded into the graph so that the wariness could be applied client-side, or the client would pick a wariness for each rollout and send it to the server. Looking at the code, apparently that never happened in the default case, and we are indeed sending a UUID to the server.

Perhaps we should avoid perpetuating that design, though, and the client should send a wariness value instead of a UUID. While a sufficiently fine-grained wariness would uniquely identify the client for the period between successive updates, it wouldn't be a long-term identifier.
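
A minimal sketch of "client picks a wariness per rollout" (not what Zincati does today; it just shows that a fresh uniform value in [0,1) is all the client would need to send):

    # Draw a fresh wariness for this rollout instead of sending a long-term
    # node UUID. (awk's rand() is fine for a sketch; a real agent would use a
    # better entropy source.)
    wariness=$(awk 'BEGIN { srand(); printf "%.4f", rand() }')
    echo "rollout_wariness=${wariness}"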

@lucab (Contributor) commented Jan 21, 2023

The reference for the existing protocol is https://coreos.github.io/zincati/development/cincinnati/protocol/#graph-api, while Zincati configuration details are at https://coreos.github.io/zincati/usage/agent-identity/#identity-configuration.

On the two points above about UUIDs and wariness:

  • the client can optionally send a wariness value. If it is present in the request, the server obeys it; otherwise, it computes one for the request. By default Zincati doesn't assign a value, but it can send one if configured by the user.
  • the client can optionally send a UUID. The protocol does not require it, though it helps compute a wariness value that is sticky to a client (instead of per-request). Zincati sends an app-specific ID by default; it can be overridden through configuration and is also used for reboot-lock management. So it is possible to identify a client (by design), serve a custom graph, and cross-link with reboot management. But it shouldn't be possible to track back the machine-id from it.
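
For reference, a sketch of both knobs: a Zincati drop-in pinning a wariness value (per the agent-identity docs linked above), and a hand-rolled graph request; the query parameter names are approximations of what the agent sends, so treat them as illustrative rather than authoritative:

    # Pin a client-side rollout wariness (0.0 = eager, 1.0 = most wary).
    printf '[identity]\nrollout_wariness = 0.5\n' \
      > /etc/zincati/config.d/51-rollout-wariness.toml

    # Roughly the kind of request the agent then makes against the graph API
    # (parameter names are illustrative; see the protocol doc above).
    curl -s -H 'Accept: application/json' \
      'https://updates.coreos.fedoraproject.org/v1/graph?basearch=x86_64&stream=stable&rollout_wariness=0.5'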

cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Feb 14, 2023
Fedora CoreOS is not yet using containers by default for updates;
xref coreos/fedora-coreos-tracker#1263
etc.

Consequently, when one boots a FCOS system and wants
to rebase to a custom image, one ends up downloading the entire image,
including the parts of FCOS that one already has.

This changes things so that when we generate disk images by default,
we write the *layer refs* of the component parts - but we
delete the "merged" container image ref.

The semantics here will be:

- Only a tiny amount of additional data used by default;
  the layer refs are just metadata, the bulk of the data still
  lives in regular file content.
- When a FCOS system auto-updates to its by-default usage of
  an ostree commit, the unused layer refs will be garbage collected.
- But, as noted above when rebasing to a container image instead,
  if the target container image reuses some of those layers (as
  we expect when rebasing FCOS to a FCOS-derived container) then
  we don't need to redownload them - we only download what the user
  provided.

Hence, this significantly improves rebasing to container images,
with basically no downsides.

The alternative code path to actually deploy *as a container*
remains off by default.  When that is enabled, `rpm-ostree upgrade`
fetches a container by default, which is a distinct thing.
cgwalters added a commit to coreos/coreos-assembler that referenced this issue Mar 8, 2023 (same commit message as above)
@cgwalters (Member) commented May 2, 2023

I re-read this thread and I still find myself somewhat unconvinced that we need to require an update graph instead of just fetching from a container tag. It definitely has value, but comes with a lot of operational complexity as a cost.

Of all of the things discussed here, I think we need to make a decision on whether or not we try to require barriers in the future. My vote is no: we just carry the upgrade code for longer, say a year (two Fedora majors).

@cgwalters (Member):

Moving the thread re bootloaders from #1485 (comment)

To be clear, what I was just trying to say is that I found it slightly confusing to have to dig through git history to find the failing systemd unit; if we'd tried to do something without barriers, the unit would probably have used a "stamp file" approach, e.g. ConditionPathExists=/var/lib/bootloader-updated-once.stamp. (Or, better, driven natively into bootupd, like bootupctl update --if-older-than.)

I would agree that it seems hard to solve this particular problem without barriers.


That said, there is a whole other new thing we could add here, which is a mechanism to pull and execute a container image before trying to apply an OS update.

That would align with what we're doing effectively in OCP with the MCO, and be an extremely flexible and powerful escape hatch. Basically zincati (or maybe rpm-ostree) would do e.g. podman run --privileged --rm quay.io/fedora/fedora-coreos-preupgrade:latest before trying to actually apply an update.
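
A rough sketch of that escape hatch, using the image name from above; the mount/flag choices and the idea that a non-zero exit blocks the update are all hypothetical wiring:

    #!/bin/bash
    # Hypothetical pre-upgrade hook runner, invoked before staging an update.
    set -euo pipefail

    preupgrade_image=quay.io/fedora/fedora-coreos-preupgrade:latest

    # Give the hook privileged access to the host (mounted at /run/host);
    # a non-zero exit aborts the script and therefore blocks the upgrade.
    podman run --privileged --rm --net=host \
      -v /:/run/host \
      "${preupgrade_image}"

    # Only reached if the hook succeeded.
    rpm-ostree upgrade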

@dustymabe (Member):

Moving the thread re bootloaders from #1485 (comment)

To be clear, what I was just trying to say is that I found it slightly confusing to have to dig through git history to find the failing systemd unit; if we'd tried to do something without barriers, the unit would probably have used a "stamp file" approach, e.g. ConditionPathExists=/var/lib/bootloader-updated-once.stamp. (Or, better, driven natively into bootupd, like bootupctl update --if-older-than.)

Right. We sometimes use the stamp-file approach and don't aggressively remove the unit until the next barrier is created (i.e. when we put in migration code, we don't always have to create a barrier at the same time like we had to here).

I would agree that it seems hard to solve this particular problem without barriers.

Thanks. This problem was super tricky and we were lucky we had barriers and bootupd to help us get out of the position we were in. We needed them both.

That said, there is a whole other new thing we could add here, which is a mechanism to pull and execute a container image before trying to apply an OS update.

That would align with what we're doing effectively in OCP with the MCO, and be an extremely flexible and powerful escape hatch. Basically zincati (or maybe rpm-ostree) would do e.g. podman run --privileged --rm quay.io/fedora/fedora-coreos-preupgrade:latest before trying to actually apply an update.

I've thought about this problem before and brought it up. In my mind it would just be extra code shipped in the OSTree commit we build today that would know how to handle migrations depending on different factors. I.e. the OSTree commit gets downloaded (from a container registry or OSTree repo, doesn't matter), and special migration code gets extracted from it and run before the upgrade is applied. This code could choose to block the upgrade or allow it to continue, etc.

I don't really see why this would need to be a separate container image (just seems like more work IMO).
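
A sketch of that in-image variant; `ostree checkout --subpath` and `ostree rev-parse` are real commands, while the migration-script path, the ref name, and the overall wiring are assumptions for illustration:

    # Hypothetical: after the new commit is pulled but before deploying it,
    # extract migration code shipped inside it and run it first.
    new_rev=$(ostree --repo=/ostree/repo rev-parse fedora/x86_64/coreos/stable)
    tmpdir=$(mktemp -d)
    ostree --repo=/ostree/repo checkout \
      --subpath=/usr/libexec/coreos-migrations "${new_rev}" "${tmpdir}/migrations"
    if ! "${tmpdir}/migrations/pre-upgrade"; then
      echo "migration hook failed; blocking the upgrade" >&2
      exit 1
    fi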

@cgwalters (Member):

I don't really see why this would need to be a separate container image (just seems like more work IMO).

Pre-upgrade logic could have different dependencies than the OS, for one. And we wouldn't pay the storage cost of carrying them on disk after completion.

@dustymabe (Member):

Pre-upgrade logic could have different dependencies than the OS, for one.

Right, in which case we could just have the migration code call podman? Though that is one thing to think about too: at least for FCOS, users can choose between docker and podman, and just running a container engine usually "initializes" some things, which may be undesirable.

@cgwalters (Member):

OK, moving this to ostreedev/ostree#2855 and I think with that, we can stop requiring barriers in many cases. That could use some further analysis, but going through a few of them (e.g. the recent bootloader one, or the iptables one) I am pretty sure it'd be viable.

@jlebon (Member, Author) commented May 4, 2023

I'm really not a fan of pre-upgrade code execution, but I agree it's a way out of this. Between hooks and barriers, I'd indeed prefer the former. If we do this, I'd like to see tight policies and maintenance around what we do in there. (E.g. we can say we drop workarounds there after X months.)

I'd agree with @dustymabe re. container runtimes. I think all our migration code so far has been fairly simple scripts. It's not ideal, but it also has minimal impact on the node state, and dependency concerns are much less of an issue.

@cgwalters (Member):

My inclination BTW is to try to drop the rollouts and wariness etc. and keep it super simple - the client tracks a container image tag, fetched once a day by default.

Anyone who wants to do anything more than that (replicating something like current "wariness") can mirror the OS update containers on their own to a registry and update it on their own timeframe and schedule. Which is what they already need to know how to do for application containers! Because there's no zincati/cincinnati for podman/kubelet (or for dnf/RPMs for that matter).
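
A sketch of that mirroring workflow with a placeholder internal registry; the `ostree-unverified-registry:` transport is the existing rpm-ostree one, the registry name is made up:

    # Mirror the OS update image into your own registry on your own schedule...
    skopeo copy \
      docker://quay.io/fedora/fedora-coreos:stable \
      docker://registry.internal.example.com/os/fedora-coreos:stable

    # ...and point hosts at the mirror; they pick up whatever you promote there.
    rpm-ostree rebase \
      ostree-unverified-registry:registry.internal.example.com/os/fedora-coreos:stable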
