Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Graceful Shutdown #416

Conversation

jerpeter1
Copy link

If the kubelet where the the sriov pod is running has gracefulShutdown configured, we'll delay in preStop for a while if and while /tmp/sriov-delay-shutdown exists, up to a maximum wait that's less than 15 minutes (which is now specified in the daemonset pod's terminationGracePeriodSeconds).

I wrote this as a possible alternative approach to the implementation in this PR, that was reverted last year.

@github-actions
Copy link

github-actions bot commented Mar 7, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

1 similar comment
@github-actions
Copy link

github-actions bot commented Mar 8, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@jerpeter1 jerpeter1 force-pushed the support-graceful-shutdown branch from 08ff040 to f947aa4 Compare March 9, 2023 04:46
@github-actions
Copy link

github-actions bot commented Mar 9, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@coveralls
Copy link

Pull Request Test Coverage Report for Build 4371184613

  • 0 of 26 (0.0%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.03%) to 25.64%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/daemon/daemon.go 0 26 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/daemon/daemon.go 2 40.81%
Totals Coverage Status
Change from base Build 4363097788: -0.03%
Covered Lines: 1944
Relevant Lines: 7582

💛 - Coveralls

@jerpeter1
Copy link
Author

/test-all

@jerpeter1
Copy link
Author

/assign @bn222

@jerpeter1
Copy link
Author

/assign @SchSeba

@github-actions
Copy link

github-actions bot commented Apr 4, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@jerpeter1
Copy link
Author

/test-all

pkg/daemon/daemon.go Outdated Show resolved Hide resolved
pkg/daemon/daemon.go Outdated Show resolved Hide resolved
pkg/daemon/daemon.go Outdated Show resolved Hide resolved
memory: 10Mi
volumeMounts:
- name: cnibin
mountPath: /host/opt/cni/bin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you run make manifests bundle too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not.

Here's what I get when I run make manifests bundle:

[jerpeter@fedora sriov-network-operator-upstream]$ make manifests bundle
/home/jerpeter/src/sriov-network-operator-upstream/bin/controller-gen "crd:crdVersions={v1}" webhook paths="./..." output:crd:artifacts:config=./config/crd/bases
cp ./config/crd/bases/* ./deployment/sriov-network-operator/crds/
operator-sdk generate kustomize manifests --interactive=false -q
cd config/manager && /home/jerpeter/src/sriov-network-operator-upstream/bin/kustomize edit set image controller=controller:latest
/home/jerpeter/src/sriov-network-operator-upstream/bin/kustomize build config/manifests | operator-sdk generate bundle -q --overwrite --version 4.7.0
Error: accumulating resources: accumulation err='accumulating resources from '../samples': '/home/jerpeter/src/sriov-network-operator-upstream/config/samples' must resolve to a file': couldn't make target for path '/home/jerpeter/src/sriov-network-operator-upstream/config/samples': unable to find one of 'kustomization.yaml', 'kustomization.yml' or 'Kustomization' in directory '/home/jerpeter/src/sriov-network-operator-upstream/config/samples'
INFO[0000] Creating bundle.Dockerfile
INFO[0000] Creating bundle/metadata/annotations.yaml
INFO[0000] Bundle metadata generated suceessfully
make: *** [Makefile:173: bundle] Error 1

[jerpeter@fedora sriov-network-operator-upstream]$ git status
On branch support-graceful-shutdown
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   bundle.Dockerfile
	modified:   config/manager/kustomization.yaml

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	bundle/
	config/manifests/

no changes added to commit (use "git add" and/or "git commit -a")

[jerpeter@fedora sriov-network-operator-upstream]$ git diff
diff --git a/bundle.Dockerfile b/bundle.Dockerfile
index bf295a04..7b73150d 100644
--- a/bundle.Dockerfile
+++ b/bundle.Dockerfile
@@ -1,16 +1,20 @@
 FROM scratch

+# Core bundle labels.
 LABEL operators.operatorframework.io.bundle.mediatype.v1=registry+v1
 LABEL operators.operatorframework.io.bundle.manifests.v1=manifests/
 LABEL operators.operatorframework.io.bundle.metadata.v1=metadata/
 LABEL operators.operatorframework.io.bundle.package.v1=sriov-network-operator
 LABEL operators.operatorframework.io.bundle.channels.v1=alpha
-LABEL operators.operatorframework.io.bundle.channel.default.v1=
-LABEL operators.operatorframework.io.metrics.builder=operator-sdk-v1.0.1
+LABEL operators.operatorframework.io.metrics.builder=operator-sdk-v1.12.0+git
 LABEL operators.operatorframework.io.metrics.mediatype.v1=metrics+v1
-LABEL operators.operatorframework.io.metrics.project_layout=go.kubebuilder.io/v2
-LABEL operators.operatorframework.io.test.config.v1=tests/scorecard/
+LABEL operators.operatorframework.io.metrics.project_layout=go.kubebuilder.io/v3
+
+# Labels for testing.
 LABEL operators.operatorframework.io.test.mediatype.v1=scorecard+v1
+LABEL operators.operatorframework.io.test.config.v1=tests/scorecard/

+# Copy files to locations specified by labels.
 COPY bundle/manifests /manifests/
 COPY bundle/metadata /metadata/
+COPY bundle/tests/scorecard /tests/scorecard/
diff --git a/config/manager/kustomization.yaml b/config/manager/kustomization.yaml
index 2bcd3eea..5e793dd1 100644
--- a/config/manager/kustomization.yaml
+++ b/config/manager/kustomization.yaml
@@ -5,6 +5,12 @@ generatorOptions:
   disableNameSuffixHash: true

 configMapGenerator:
-- name: manager-config
-  files:
+- files:
   - controller_manager_config.yaml
+  name: manager-config
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+images:
+- name: controller
+  newName: controller
+  newTag: latest

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please commit that too in a separate commit.

Please fix

make: *** [Makefile:173: bundle] Error 1 first

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerpeter1 any news on this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I had resolved this and pushed a branch, but that doesn't appear to be the case. Today, I'm getting the same error running make manifests bundle from the head of the current master branch (without any of my changes):

[jerpeter@fedora sriov-network-operator-upstream]$ git rev-parse --short HEAD
990648ee
[jerpeter@fedora sriov-network-operator-upstream]$ git reset --hard upstream/master
HEAD is now at 990648ee Merge pull request #451 from zeeke/exclude-topology-conformance-test
[jerpeter@fedora sriov-network-operator-upstream]$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
[jerpeter@fedora sriov-network-operator-upstream]$ make manifests bundle
/home/jerpeter/src/sriov-network-operator-upstream/bin/controller-gen "crd:crdVersions={v1}" webhook paths="./..." output:crd:artifacts:config=./config/crd/bases
cp ./config/crd/bases/* ./deployment/sriov-network-operator/crds/
operator-sdk generate kustomize manifests --interactive=false -q
cd config/manager && /home/jerpeter/src/sriov-network-operator-upstream/bin/kustomize edit set image controller=controller:latest
/home/jerpeter/src/sriov-network-operator-upstream/bin/kustomize build config/manifests | operator-sdk generate bundle -q --overwrite --version 4.7.0
Error: unable to find one of 'kustomization.yaml', 'kustomization.yml' or 'Kustomization' in directory '/home/jerpeter/src/sriov-network-operator-upstream/config/manifests'
INFO[0000] Creating bundle.Dockerfile
INFO[0000] Creating bundle/metadata/annotations.yaml
INFO[0000] Bundle metadata generated successfully
make: *** [Makefile:180: bundle] Error 1
[jerpeter@fedora sriov-network-operator-upstream]$ operator-sdk version
operator-sdk version: "v1.30.0", commit: "b794fe909abc1affa1f28cfb75ceaf3bf79187e6", kubernetes version: "v1.26.0", go version: "go1.20.5", GOOS: "linux", GOARCH: "amd64"

Any suggestions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've investigated this a bit, I think we don't support csv in upstream. I like other's chime in too but I think we should remove make bundle completely from upstream (only).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SchSeba We need to get this one merged. Please chime in wrt to csv u/s support. I think we should remove make bundle to avoid confusion. Then I'm ok to merge this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove make bundle to avoid confusion.

Is that going to happen in another PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @jerpeter1 please prep a PR that does exactly that. We can get that one merged first.

If the kubelet where the the sriov pod is running has gracefulShutdown
configured, we'll delay in preStop for a while if and while /tmp/sriov-delay-shutdown
exists, up to a maximum wait of 10 minutes (less than the 15 minutes which is specified in
the daemonset pod's terminationGracePeriodSeconds).
@jerpeter1 jerpeter1 force-pushed the support-graceful-shutdown branch from 4a4a629 to f6f872a Compare May 10, 2023 03:07
@github-actions
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@@ -93,6 +93,7 @@ spec:
volumeMounts:
- name: cnibin
mountPath: /host/opt/cni/bin
terminationGracePeriodSeconds: 900
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need 15 minutes to terminate ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i think kubelet ShutdownGracePeriod is intended to be in seconds not minutes.

so in reboot/shutdown we would probably time out and forcefully killed.

@@ -1,12 +1,36 @@
#!/bin/bash

chroot_path="/proc/1/root"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ?

this what /host before.

touch "$chroot_path/var/log/sriov-delay-start"
while [ $(( $(date +%s) - $start )) -lt $wait_time ]; do
if [ ! -f "$delay_shutdown_path" ]; then # don't have to wait anymore
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the meaning of NOT exiting this loop through this break statement ?

@@ -1,12 +1,36 @@
#!/bin/bash

chroot_path="/proc/1/root"
delay_shutdown_path="$chroot_path/tmp/sriov-delay-shutdown"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prestop shares the same file system as the container its defined on no?
if yes, why do you need to write this file on the host file system?

# 10 minutes - this should be shorter than the time that is specifed for the
# terminationGracePeriodSeconds in the daemonset's pod spec, so that everything
# else in the preStop hook has time to run and the Pod can be terminated properly.
wait_time=600
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont like this assumption, this should be configurable, value provided to config daemon and propagated here.

(e.g the controller reads terminationGracePeriod value and adds env var to the daemon to use)

@@ -52,6 +52,9 @@ const (
// maxUpdateBackoff is the maximum time to react to a change as we back off
// in the face of errors.
maxUpdateBackoff = 60 * time.Second

// the presence of this file indicates that the sriov shutdown should be delayed
delayShutdownPath = "/host/tmp/sriov-delay-shutdown"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not clear to me why we need this file on the host filesystem

if grep "$kubelet_config_path" -e shutdownGracePeriod | grep -qv \"0s\"; then
start=$(date +%s)
touch "$chroot_path/var/log/sriov-delay-start"
while [ $(( $(date +%s) - $start )) -lt $wait_time ]; do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not clear to me why do we need this loop.

the file in delay_shutdown_path is created when config daemon needs reboot.
after witch the daemon exits.

@adrianchiris
Copy link
Collaborator

adrianchiris commented Jul 30, 2023

@SchSeba
Copy link
Collaborator

SchSeba commented Dec 21, 2023

Hi @jerpeter1 do you still want to push this one? there are some comments from @adrianchiris and this PR needs a rebase

@SchSeba SchSeba closed this Dec 21, 2023
@SchSeba SchSeba reopened this Dec 21, 2023
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@SchSeba
Copy link
Collaborator

SchSeba commented Dec 11, 2024

closing this one feel free to reopen if needed

@SchSeba SchSeba closed this Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants