Update from 4.7.17 to 4.8.0-0.ci-2021-06-14-090014 stuck on [1]:

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	2h31m26s
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-1fiidp14/release@sha256:92fa389aa952b8ed952a3ffe624be6353e4be9ec2bec1a811f4172b324a7f84d
Unexpected error:
    <*errors.errorString | 0xc0012d4d90>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator authentication is degraded",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator authentication is degraded
occurred

: [sig-cluster-lifecycle] Cluster completes upgrade	2h30m0s
Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator authentication is degraded

: Operator upgrade authentication	0s
Failed to upgrade authentication, operator was degraded (EncryptionMigrationController_Error): EncryptionMigrationControllerDegraded: etcdserver: request timed out

: [bz-apiserver-auth] clusteroperator/authentication should not change condition/Degraded
Run #0: Failed	0s
2 unexpected clusteroperator state transitions during e2e test run:

Jun 21 01:16:31.171 - 7104s E clusteroperator/authentication condition/Degraded status/True reason/EncryptionMigrationControllerDegraded: etcdserver: request timed out

1 tests failed during this blip (2021-06-21 01:16:31.171240906 +0000 UTC to 2021-06-21 01:16:31.171240906 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

There were some other signs that etcd was briefly sad, but only the auth operator seems to have decided to stay mad about etcd for hours.
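The stuck condition itself should also be visible straight from the gathered clusteroperators. A sketch of that check (assuming gather-extra dumped its usual clusteroperators.json next to pods.json; I haven't re-confirmed that path for this run):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "authentication").status.conditions[] | select(.type == "Degraded") | .lastTransitionTime + " " + .status + " " + .reason + ": " + .message'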
Checking on the auth pods:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace | contains("auth")) | .metadata as $m | .status.containerStatuses[].state.running.startedAt + " " + $m.namespace + " " + $m.name' | sort | tail -n3
2021-06-21T01:16:24Z openshift-oauth-apiserver apiserver-7cfbf75f7d-dvvnn
2021-06-21T01:19:18Z openshift-oauth-apiserver apiserver-7cfbf75f7d-bscbj
2021-06-21T03:16:41Z openshift-oauth-apiserver apiserver-7cfbf75f7d-v7pxg

Poking at that late one, looks like a single restart:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "apiserver-7cfbf75f7d-v7pxg").status.containerStatuses[] | {restartCount, lastState, state}'
{
  "restartCount": 1,
  "lastState": {
    "terminated": {
      "containerID": "cri-o://46452442dcccabe46958da0d3cce07860b9ba893df9e0c90212878f2d0ccfcbf",
      "exitCode": 0,
      "finishedAt": "2021-06-21T03:16:41Z",
      "reason": "Completed",
      "startedAt": "2021-06-21T01:17:50Z"
    }
  },
  "state": {
    "running": {
      "startedAt": "2021-06-21T03:16:41Z"
    }
  }
}

Not sure why that container was terminated, but we have logs from its whole run, including the termination:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/pods/openshift-oauth-apiserver_apiserver-7cfbf75f7d-v7pxg_oauth-apiserver_previous.log | grep -1 03:15:40.011632
I0621 03:15:20.999809       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0621 03:15:40.011632       1 genericapiserver.go:684] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"kube-system", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationStart' Received signal to terminate, becoming unready, but keeping serving
I0621 03:15:40.011692       1 genericapiserver.go:684] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"kube-system", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationPreShutdownHooksFinished' All pre-shutdown hooks have been finished

Seems fairly rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=144h&type=junit&search=EncryptionMigrationControllerDegraded.*etcdserver:+request+timed+out' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-devfile-integration-tests-main-v4.8.odo-integration-devfile-odo-periodic (all) - 24 runs, 29% failed, 14% of failures match = 4% impact
pull-ci-openshift-machine-api-operator-release-4.8-e2e-aws-disruptive (all) - 33 runs, 97% failed, 3% of failures match = 3% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.3 (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi (all) - 31 runs, 65% failed, 5% of failures match = 3% impact
periodic-ci-openshift-windows-machine-config-operator-release-4.9-vsphere-e2e-periodic (all) - 18 runs, 100% failed, 6% of failures match = 6% impact
pull-ci-openshift-installer-master-e2e-aws-workers-rhel7 (all) - 71 runs, 66% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 68 runs, 59% failed, 3% of failures match = 1% impact

Anyhow, I expect there was an etcd instability trigger at the beginning of this, but this bug is about the auth operator not recovering from that instability despite having lots of time to do so. It's also possible that I'm just completely misreading these assets :p And despite the rarity, the severity of hanging an update makes me think this is high-severity, at least until we have a better idea of what's going on.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512
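PS: if anyone wants to chase down why that container was terminated, the gathered events might say. A sketch (assuming gather-extra's usual events.json dump; untested against this run):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select((.involvedObject.name // "") | startswith("apiserver-7cfbf75f7d")) | (.lastTimestamp // .eventTime // "") + " " + (.reason // "") + ": " + (.message // "")' | sort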
The new instance:

acquired a lock at:

2021-06-21T01:14:51.630754688Z I0621 01:14:51.630702       1 leaderelection.go:253] successfully acquired lease openshift-authentication-operator/cluster-authentication-operator-lock

waited for caches for the encryption controllers at:

2021-06-21T01:14:51.706080667Z I0621 01:14:51.705910       1 base_controller.go:66] Waiting for caches to sync for EncryptionPruneController
2021-06-21T01:14:51.706093456Z I0621 01:14:51.705933       1 base_controller.go:66] Waiting for caches to sync for EncryptionMigrationController

observed a degraded condition at:

2021-06-21T01:14:51.806808962Z I0621 01:14:51.806784       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-06-21T00:40:05Z","message":"EncryptionMigrationControllerDegraded: etcdserver: request timed out\nAPIServerStaticResourcesDegraded: \"oauth-apiserver/svc.yaml\" (string): etcdserver: request timed out\nAPIServerStaticResourcesDegraded: \nRouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: etcdserver: request timed out","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-06-21T00:28:17Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-06-21T00:40:05Z","message":"OAuthServerDeploymentAvailable: availableReplicas==2","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-06-21T00:18:52Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}],"versions":[{"name":"oauth-apiserver","version":"4.7.17"},{"name":"operator","version":"4.8.0-0.ci-2021-06-14-090014"},{"name":"oauth-openshift","version":"4.7.17_openshift"}]}}

started the encryption controllers at:

2021-06-21T01:14:55.306409205Z I0621 01:14:55.306165       1 base_controller.go:72] Caches are synced for EncryptionMigrationController
2021-06-21T01:14:55.306423053Z I0621 01:14:55.306411       1 base_controller.go:109] Starting #1 worker of EncryptionMigrationController controller
...

Based on the above and the attached screenshots, the degraded condition must have been set by the previous instance (4.7). The new precondition checker added in 4.8 prevents the encryption controllers from running and clearing the stale conditions.
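As a cross-check on that timeline: the stale EncryptionMigrationControllerDegraded message was already present in the condition diff at 01:14:51.806, before this instance's encryption controllers even started at 01:14:55. A sketch for pulling the condition timestamps side by side (same clusteroperators.json assumption as in comment 0):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406763993887936512/artifacts/e2e-ovirt-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "authentication").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status'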
Seems more rare now, with no hits in jobs that could touch 4.9 code:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=144h&type=junit&search=EncryptionMigrationControllerDegraded.*etcdserver:+request+timed+out' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 6 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 19 runs, 100% failed, 5% of failures match = 5% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.3 (all) - 6 runs, 100% failed, 17% of failures match = 17% impact

Back in comment 0, we had a few hits in periodic-ci-openshift-windows-machine-config-operator-release-4.9-vsphere-e2e-periodic and pull-ci-openshift-installer-master-e2e-aws-workers-rhel7.
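To eyeball the 4.9 claim without scanning by hand, the same search output can be grepped down to 4.9 and master jobs (a rough sketch; I'm assuming master pull jobs carry 4.9 code at this point, so empty output means no 4.9 hits):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=144h&type=junit&search=EncryptionMigrationControllerDegraded.*etcdserver:+request+timed+out' | grep 'failures match' | grep -e '-4\.9' -e '-master-'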
From the PR code:

1. It clears stale conditions when the preconditions are not fulfilled, like below:

   if err != nil {
           degradedCondition = nil
           progressingCondition = nil
   }

2. All encryption conditions are covered in the code:

   $ oc get kubeapiserver -o yaml | grep -i encr
       type: EncryptionPruneControllerDegraded
       type: EncryptionStateControllerDegraded
       type: Encrypted
       type: EncryptionKeyControllerDegraded
       type: EncryptionMigrationControllerDegraded
       type: EncryptionMigrationControllerProgressing

From the CI results, there are no hits in jobs that touch 4.9 code:

$ curl -s 'https://search.ci.openshift.org/?maxAge=144h&type=junit&search=EncryptionMigrationControllerDegraded.*etcdserver:+request+timed+out' | sed 's/<[^>]*>//g' | grep 'failures match' | sort
periodic-ci-openshift-release-master-nightly-4.8-e2e-azure (all) - 9 runs, 11% failed, 100% of failures match = 11% impact

Moving to VERIFIED based on the above.
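For anyone re-verifying on a live cluster, one way to watch the authentication operator's Degraded condition (a sketch, not part of the formal verification above):

$ oc get clusteroperator authentication -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.lastTransitionTime} {.status} {.reason}: {.message}{"\n"}{end}'

With the fix, this should settle back to status False on the operator's next successful sync after an etcd hiccup, rather than staying True for hours as in comment 0.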
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759