Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NO-JIRA: restrict etcdctl defrag to single member #1394

Merged

Conversation

Elbehery
Copy link
Contributor

Most of etcdctl e2e test are disabled by default.

During last weeks, this test has failed most of the time, from changes that is non related to it.

It is better to disable it, as it blocks more important work, for no reason.

cc @tjungblu

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 31, 2025
@openshift-ci-robot
Copy link

@Elbehery: This pull request explicitly references no jira issue.

In response to this:

Most of etcdctl e2e test are disabled by default.

During last weeks, this test has failed most of the time, from changes that is non related to it.

It is better to disable it, as it blocks more important work, for no reason.

cc @tjungblu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Elbehery
Copy link
Contributor Author

cc @benluddy

@openshift-ci openshift-ci bot requested review from dusk125 and hasbro17 January 31, 2025 15:15
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2025
@Elbehery
Copy link
Contributor Author

@benluddy has faced this issue 2 weeks ago

testCases := []struct {
command string
skip bool
skipReason string
expectedError string
}{
{"etcdctl alarm disarm", false, "", ""},
{"etcdctl alarm list", false, "", ""},
{"etcdctl check perf", true, "OpenShift disabled command", ""},
{"etcdctl compaction 0", false, "", "Error: etcdserver: mvcc: required revision has been compacted"},
{"etcdctl defrag", false, "", ""},
{"etcdctl elect myelection", false, "", "no proposal argument but -l not set"},

@Elbehery
Copy link
Contributor Author

I had the same issue here #1391 (comment)

although running the test against a cluster passed from first run 🤔

@benluddy
Copy link
Contributor

When I saw this, etcdctl was timing out but the defrag completed on the last member after a few seconds. Could you set a more generous timeout on the etcdctl invocation instead of skipping it? I don't have a sense of how important this test coverage is.

@tjungblu
Copy link
Contributor

I think the issue is that this defrags all members, but there is no guarantee that all members are really up and running at the time. One might be down with a static pod rollout and then your call times out.

Timeout might work, but would be 5-10m to withstand a static pod rollout. Maybe you can scope this down with --endpoints to a known pod that's up and running instead

@Elbehery
Copy link
Contributor Author

I think the issue is that this defrags all members, but there is no guarantee that all members are really up and running at the time. One might be down with a static pod rollout and then your call times out.

Timeout might work, but would be 5-10m to withstand a static pod rollout. Maybe you can scope this down with --endpoints to a known pod that's up and running instead

let me try this, I would go for localhost :D

@Elbehery Elbehery force-pushed the disable_etcdctl_defrag_test branch from 8d47e9c to 4135910 Compare January 31, 2025 15:45
@Elbehery
Copy link
Contributor Author

now it is

		{"etcdctl --endpoints=localhost:2379 defrag", false, "", ""},

let see

@tjungblu
Copy link
Contributor

yeah that could work

@Elbehery Elbehery force-pushed the disable_etcdctl_defrag_test branch from 4135910 to 6958d4d Compare February 1, 2025 21:12
@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 1, 2025

Failed with msg

== RUN   TestEtcdctlCommands/etcdctl_--endpoints=localhost:2379_defrag
    etcdctl_test.go:96: 
        	Error Trace:	/go/src/github.com/openshift/cluster-etcd-operator/test/e2e/etcdctl_test.go:96
        	Error:      	Received unexpected error:
        	            	exit status 1
        	Test:       	TestEtcdctlCommands/etcdctl_--endpoints=localhost:2379_defrag
        	Messages:   	output was {"level":"fatal","ts":"2025-01-31T16:57:19.828221Z","caller":"flags/flag.go:85","msg":"conflicting environment variable is shadowed by corresponding command-line flag (either unset environment variable or disable flag))","environment-variable":"ETCDCTL_ENDPOINTS","stacktrace":"go.etcd.io/etcd/pkg/v3/flags.verifyEnv\n\tgo.etcd.io/etcd/pkg/[email protected]/flags/flag.go:85\ngo.etcd.io/etcd/pkg/v3/flags.SetPflagsFromEnv\n\tgo.etcd.io/etcd/pkg/[email protected]/flags/flag.go:63\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.clientConfigFromCmd\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/command/global.go:129\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.mustClientFromCmd\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/command/global.go:175\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.defragCommandFunc\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/command/defrag_command.go:53\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/ctl.go:111\nmain.main\n\tgo.etcd.io/etcd/etcdctl/v3/main.go:59\nruntime.main\n\truntime/proc.go:271"}
        	            	command terminated with exit code 1
    --- FAIL: TestEtcdctlCommands/etcdctl_--endpoints=localhost:2379_defrag (0.16s)

The test runs within etcdctl container, which contains the ETCDCTL_ENDPOINTS environment variable populated.

The erros is cause by etcd checks, that no flag overrides any envVar in use.

The workaround is to unset the envVar before running this test, use localhost during the test, and then again set the original envVar, so that the rest of the tests can run within the etcdctl container.

lets see 🤞🏽

@Elbehery Elbehery force-pushed the disable_etcdctl_defrag_test branch from 6958d4d to c23edad Compare February 1, 2025 23:46
@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 1, 2025

trying diff approach

env ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 etcdctl defrag

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 1, 2025

trying diff approach

env ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 etcdctl defrag

tested against CRC cluster

    --- PASS: TestEtcdctlCommands/env_ETCDCTL_ENDPOINTS=https://127.0.0.1:2379_etcdctl_defrag (0.91s)

@Elbehery Elbehery changed the title NO-JIRA: disable etcdctl defrag flaky test NO-JIRA: restrict etcdctl defrag to single member Feb 1, 2025
@Elbehery Elbehery changed the title NO-JIRA: restrict etcdctl defrag to single member NO-JIRA: restrict etcdctl defrag to single member Feb 1, 2025
@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

/retest-required

1 similar comment
@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

/retest-required

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

ran the test against a 4.19 cluster, and everything is fine

 e2e % go test -run TestEtcdctlCommands -v                                                                                                                                         disable_etcdctl_defrag_test
=== RUN   TestEtcdctlCommands
=== RUN   TestEtcdctlCommands/etcdctl_alarm_disarm
=== RUN   TestEtcdctlCommands/etcdctl_alarm_list
=== RUN   TestEtcdctlCommands/etcdctl_check_perf
    etcdctl_test.go:89: OpenShift disabled command
=== RUN   TestEtcdctlCommands/etcdctl_compaction_0
=== RUN   TestEtcdctlCommands/env_ETCDCTL_ENDPOINTS=https://127.0.0.1:2379_etcdctl_defrag
=== RUN   TestEtcdctlCommands/etcdctl_elect_myelection
=== RUN   TestEtcdctlCommands/etcdctl_endpoint_hashkv
=== RUN   TestEtcdctlCommands/etcdctl_endpoint_health
=== RUN   TestEtcdctlCommands/etcdctl_endpoint_status
=== RUN   TestEtcdctlCommands/etcdctl_lease_grant_10
=== RUN   TestEtcdctlCommands/etcdctl_lease_keep-alive_694d74da6ca8020b
=== RUN   TestEtcdctlCommands/etcdctl_lease_list
=== RUN   TestEtcdctlCommands/etcdctl_lease_revoke_694d74da6ca8020b
=== RUN   TestEtcdctlCommands/etcdctl_lease_timetolive_694d74da6ca8020b
=== RUN   TestEtcdctlCommands/etcdctl_member_list
=== RUN   TestEtcdctlCommands/etcdctl_member_add_tester
=== RUN   TestEtcdctlCommands/etcdctl_member_promote_8e9e05c52164694d
=== RUN   TestEtcdctlCommands/etcdctl_member_remove_8e9e05c52164694d
=== RUN   TestEtcdctlCommands/etcdctl_member_update_8e9e05c52164694d
=== RUN   TestEtcdctlCommands/etcdctl_del_--prefix_/etcdctl-check-perf/
=== RUN   TestEtcdctlCommands/etcdctl_get_/foo
=== RUN   TestEtcdctlCommands/etcdctl_put_/etcdctl-check-perf/foo_bar
=== RUN   TestEtcdctlCommands/env_ETCDCTL_ENDPOINTS=https://127.0.0.1:2379_etcdctl_snapshot_save_./backup.db
=== RUN   TestEtcdctlCommands/etcdctl_snapshot_status_./backup.db
=== RUN   TestEtcdctlCommands/etcdctl_snapshot_restore_./backup.db
=== RUN   TestEtcdctlCommands/etcdctl_version
=== RUN   TestEtcdctlCommands/etcdctl_auth_enable
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_auth_disable
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_lock
    etcdctl_test.go:89: distributed locks should not be done via etcdctl
=== RUN   TestEtcdctlCommands/etcdctl_make-mirror
    etcdctl_test.go:89: make-mirror takes one destination argument
=== RUN   TestEtcdctlCommands/etcdctl_migrate
    etcdctl_test.go:89: migrate from v2 is not support for OCP4
=== RUN   TestEtcdctlCommands/etcdctl_role_add
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_role_delete
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_role_get
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_role_grant-permission
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_role_list
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_role_revoke-permission
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_user_add
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_user_delete
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_user_get
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_user_grant-role
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_user_list
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_user_revoke-role
    etcdctl_test.go:89: k8s does not use internal etcd RBAC
=== RUN   TestEtcdctlCommands/etcdctl_move-leader_8e9e05c52164694d
    etcdctl_test.go:89: https://bugzilla.redhat.com/show_bug.cgi?id=1918413
=== RUN   TestEtcdctlCommands/etcdctl_check_datascale
    etcdctl_test.go:89: TODO currently fails in test but works fine in container directly
=== RUN   TestEtcdctlCommands/etcdctl_txn
    etcdctl_test.go:89: TODO figure out multiline command to work in test
--- PASS: TestEtcdctlCommands (49.90s)
    --- PASS: TestEtcdctlCommands/etcdctl_alarm_disarm (2.07s)
    --- PASS: TestEtcdctlCommands/etcdctl_alarm_list (1.77s)
    --- SKIP: TestEtcdctlCommands/etcdctl_check_perf (0.00s)
    --- PASS: TestEtcdctlCommands/etcdctl_compaction_0 (1.76s)
    --- PASS: TestEtcdctlCommands/env_ETCDCTL_ENDPOINTS=https://127.0.0.1:2379_etcdctl_defrag (2.20s)
    --- PASS: TestEtcdctlCommands/etcdctl_elect_myelection (1.75s)
    --- PASS: TestEtcdctlCommands/etcdctl_endpoint_hashkv (1.84s)
    --- PASS: TestEtcdctlCommands/etcdctl_endpoint_health (1.77s)
    --- PASS: TestEtcdctlCommands/etcdctl_endpoint_status (1.76s)
    --- PASS: TestEtcdctlCommands/etcdctl_lease_grant_10 (1.85s)
    --- PASS: TestEtcdctlCommands/etcdctl_lease_keep-alive_694d74da6ca8020b (1.89s)
    --- PASS: TestEtcdctlCommands/etcdctl_lease_list (1.82s)
    --- PASS: TestEtcdctlCommands/etcdctl_lease_revoke_694d74da6ca8020b (1.73s)
    --- PASS: TestEtcdctlCommands/etcdctl_lease_timetolive_694d74da6ca8020b (1.77s)
    --- PASS: TestEtcdctlCommands/etcdctl_member_list (1.79s)
    --- PASS: TestEtcdctlCommands/etcdctl_member_add_tester (1.77s)
    --- PASS: TestEtcdctlCommands/etcdctl_member_promote_8e9e05c52164694d (1.77s)
    --- PASS: TestEtcdctlCommands/etcdctl_member_remove_8e9e05c52164694d (1.78s)
    --- PASS: TestEtcdctlCommands/etcdctl_member_update_8e9e05c52164694d (1.77s)
    --- PASS: TestEtcdctlCommands/etcdctl_del_--prefix_/etcdctl-check-perf/ (1.78s)
    --- PASS: TestEtcdctlCommands/etcdctl_get_/foo (1.80s)
    --- PASS: TestEtcdctlCommands/etcdctl_put_/etcdctl-check-perf/foo_bar (1.78s)
    --- PASS: TestEtcdctlCommands/env_ETCDCTL_ENDPOINTS=https://127.0.0.1:2379_etcdctl_snapshot_save_./backup.db (2.06s)
    --- PASS: TestEtcdctlCommands/etcdctl_snapshot_status_./backup.db (1.79s)
    --- PASS: TestEtcdctlCommands/etcdctl_snapshot_restore_./backup.db (2.04s)
    --- PASS: TestEtcdctlCommands/etcdctl_version (1.77s)
    --- SKIP: TestEtcdctlCommands/etcdctl_auth_enable (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_auth_disable (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_lock (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_make-mirror (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_migrate (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_role_add (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_role_delete (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_role_get (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_role_grant-permission (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_role_list (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_role_revoke-permission (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_user_add (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_user_delete (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_user_get (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_user_grant-role (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_user_list (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_user_revoke-role (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_move-leader_8e9e05c52164694d (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_check_datascale (0.00s)
    --- SKIP: TestEtcdctlCommands/etcdctl_txn (0.00s)
PASS
ok  	github.com/openshift/cluster-etcd-operator/test/e2e	50.663s

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

/retest-required

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

e2e operator failure is not related to this change

=== RUN   TestWrongScheduleDegradesOperator
    backup_test.go:333: 
        	Error Trace:	/go/src/github.com/openshift/cluster-etcd-operator/test/e2e/backup_test.go:333
        	Error:      	Received unexpected error:
        	            	backups.config.openshift.io "periodic-backup-wrong-schedule" already exists
        	Test:       	TestWrongScheduleDegradesOperator
--- FAIL: TestWrongScheduleDegradesOperator (7.36s)

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

/retest-required

1 similar comment
@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

/retest-required

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 2, 2025
@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 2, 2025

/retest-required

Copy link
Contributor

openshift-ci bot commented Feb 3, 2025

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown b2558b8 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown b2558b8 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/okd-scos-e2e-aws-ovn b2558b8 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-etcd-certrotation b2558b8 link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-aws-etcd-recovery b2558b8 link false /test e2e-aws-etcd-recovery

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 3, 2025

/retest-required

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 3, 2025

/test e2e-operator-fips

@tjungblu
Copy link
Contributor

tjungblu commented Feb 3, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 3, 2025
Copy link
Contributor

openshift-ci bot commented Feb 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 3, 2025

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Feb 3, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 5bbe494 into openshift:master Feb 3, 2025
13 of 18 checks passed
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-etcd-operator
This PR has been included in build cluster-etcd-operator-container-v4.19.0-202502031606.p0.g5bbe494.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants