From CI runs like [1]:

```
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
Run #0: Failed 0s

1 unexpected clusteroperator state transitions during e2e test run

Apr 09 13:23:49.846 - 13s   E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
```

With:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-storage-version-migrator condition/Available'
Apr 09 13:23:49.846 E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/NoMigratorPod changed: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
Apr 09 13:23:49.846 - 13s   E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
Apr 09 13:24:03.175 W clusteroperator/kube-storage-version-migrator condition/Available status/True reason/AsExpected changed: All is well
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
```

Very popular:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/kube-storage-version-migrator+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 20 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
```

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152
This really should have been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport the fix to 4.8 as soon as is reasonable. We really need to get rid of the negative signal we generate during upgrades when operators report themselves degraded or unavailable during normal operations.
Checking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1432988446850289664:

```
4 unexpected clusteroperator state transitions during e2e test run

Sep 01 09:20:16.134 - 206ms E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: could not be retrieved
1 tests failed during this blip (2021-09-01 09:20:16.134607583 +0000 UTC to 2021-09-01 09:20:16.134607583 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

Sep 01 10:03:16.976 - 9s    E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
1 tests failed during this blip (2021-09-01 10:03:16.976376214 +0000 UTC to 2021-09-01 10:03:16.976376214 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
```

From Loki (https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221630447200000%22,%221630533599000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade%2F1432988446850289664%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-kube-storage-version-migrator-operator%5C%22%22%7D%5D):

```
2021-09-01 11:20:16	I0901 09:20:16.314719       1 status_controller.go:211] clusteroperator/kube-storage-version-migrator diff {"status":{"conditions":[{"lastTransitionTime":"2021-09-01T09:01:59Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-09-01T09:20:16Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-09-01T09:20:16Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-09-01T09:01:59Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
2021-09-01 11:20:16	I0901 09:20:16.086073       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-storage-version-migrator-operator", Name:"kube-storage-version-migrator-operator", UID:"ba007b65-7e7a-46f3-ab29-6cbea1b7f264", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-storage-version-migrator changed: Degraded message changed from "All is well" to "TargetDegraded: \"deployments\": etcdserver: leader changed\nTargetDegraded: ",Progressing changed from False to True ("Progressing: syncing openshift-kube-storage-version-migrator resources: \"deployments\": etcdserver: leader changed"),Available changed from True to False ("Available: deployment/migrator.openshift-kube-storage-version-migrator: could not be retrieved")
```

Short blip (0.228646s) when condition/Available flipped to False: client-go returned the "etcdserver: leader changed" error instead of retrying.

Solutions:

- have client-go retry on the etcd leader change
- have the operator check for the leader-change error and retry without flipping condition/Available to False (see the sketch below)
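A minimal sketch of the second option, assuming the operator's sync loop does a plain client-go Get of the migrator Deployment today. The helper names (`isLeaderChanged`, `getMigratorDeployment`) are illustrative, not the operator's actual code; the point is just that a single transient etcd error should be retried rather than immediately turned into Available=False:

```go
package operator

import (
	"context"
	"strings"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// isLeaderChanged reports whether err is the transient etcd
// "leader changed" failure seen in the Loki logs above.
// (Hypothetical helper; matching on the error string is a sketch,
// not necessarily how the real fix should detect it.)
func isLeaderChanged(err error) bool {
	return err != nil && strings.Contains(err.Error(), "etcdserver: leader changed")
}

// getMigratorDeployment retries the Get with client-go's default backoff
// when the error is the transient leader change, so one failed read does
// not flip condition/Available to False.
func getMigratorDeployment(ctx context.Context, client kubernetes.Interface) (*appsv1.Deployment, error) {
	var deployment *appsv1.Deployment
	err := retry.OnError(retry.DefaultBackoff, isLeaderChanged, func() error {
		var getErr error
		deployment, getErr = client.AppsV1().
			Deployments("openshift-kube-storage-version-migrator").
			Get(ctx, "migrator", metav1.GetOptions{})
		return getErr
	})
	return deployment, err
}
```

If the retries are exhausted and the error persists, the operator can still report Available=False as it does now; the sketch only smooths over the sub-second blips.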
Pushing to 4.10.0.
Moved to https://issues.redhat.com/browse/OCPBUGS-20062