Description of problem:

Upgrade jobs failing with:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
Unexpected error:
    <*errors.errorString | 0xc0022d5740>: {
        s: "Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition",
    }
    Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition
occurred

Example: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107

Recurring quite a bit: https://search.apps.build01.ci.devcluster.openshift.com/?search=Cluster+did+not+acknowledge+request+to+upgrade+in+a+reasonable+time&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
moving the upgrade bug to CVO
Notes from the example job:

* Update was from 4.5.0-0.ci-2020-06-02-094408 -> 4.5.0-0.ci-2020-06-03-045338 [3,8].
* 5:34:06, update triggered [4]; the test suite gives two minutes to acknowledge [10].
* 5:36:06, test suite gives up [11].
* 5:36:27, ClusterVersion history entry records the attempt for the new release version [8].
* 5:36:55, CVO pod starts logging [2], running the new release image [6].
* 5:42:03, test harness gave up on waiting [1].
* Gathered ClusterVersion claims the new target version [7] and has both generation and observedGeneration 2 [9].

So it's hard to know exactly what's going on here without logs from the old pod, but my impression is that, due to informer caching or some such, we aren't noticing the ClusterVersion change quickly enough to satisfy the tests. I'm going to keep this bug to be about making the CVO more responsive, but given the high failure rates [12], maybe we should temporarily raise the test suite timeout [8] to 3m or 4m so we don't mask other, more significant issues?
[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt

failed: (11m10s) 2020-06-03T05:42:03 "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]"

[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/pods/openshift-cluster-version_cluster-version-operator-7ff65cb79-qs6lj_cluster-version-operator.log

I0603 05:36:55.646275 1 start.go:19] ClusterVersionOperator v1.0.0-240-g2fdbb166-dirty

[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt

2020/06/03 04:59:22 Imported release 4.5.0-0.ci-2020-06-03-045338 created at 2020-06-03 04:56:04 +0000 UTC with 110 images to tag release:latest
2020/06/03 05:00:15 Imported release 4.5.0-0.ci-2020-06-02-094408 created at 2020-06-02 09:48:28 +0000 UTC with 110 images to tag release:initial

[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt

Jun 3 05:34:06.190: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338

[5]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message'
...
2020-06-03T05:13:20Z - 2020-06-03T05:13:20Z (1) openshift-cluster-version Pulling image "registry.svc.ci.openshift.org/ocp/release@sha256:0f69c029d63dde168fceeb729fb37f48a2173d1a199c502fb175d860d9628e44"
...
2020-06-03T05:23:18Z - 2020-06-03T05:23:18Z (1) openshift-cluster-version ip-10-0-231-208_00b6f5ea-d95a-4a65-9f7c-9ffb26cfcbb8 became leader
2020-06-03T05:36:26Z - 2020-06-03T05:36:26Z (1) openshift-cluster-version ip-10-0-231-208_22dd3d94-ae5f-45fe-b74b-92d4440afea3 became leader
...
2020-06-03T05:36:29Z - 2020-06-03T05:36:29Z (1) openshift-cluster-version Pulling image "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338"
...
2020-06-03T05:36:36Z - 2020-06-03T05:36:36Z (1) openshift-cluster-version Deleted pod: cluster-version-operator-65d8d56b9f-knwt2
2020-06-03T05:36:36Z - 2020-06-03T05:36:36Z (1) openshift-cluster-version Scaled down replica set cluster-version-operator-65d8d56b9f to 0
...
2020-06-03T05:36:49.270551Z openshift-cluster-version Successfully assigned openshift-cluster-version/cluster-version-operator-7ff65cb79-qs6lj to ip-10-0-129-130.us-west-1.compute.internal
2020-06-03T05:36:49Z - 2020-06-03T05:36:49Z (1) openshift-cluster-version Pulling image "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338"
2020-06-03T05:36:49Z - 2020-06-03T05:36:49Z (1) openshift-cluster-version Created pod: cluster-version-operator-7ff65cb79-qs6lj
2020-06-03T05:36:49Z - 2020-06-03T05:36:49Z (1) openshift-cluster-version Scaled up replica set cluster-version-operator-7ff65cb79 to 1
2020-06-03T05:36:55Z - 2020-06-03T05:36:55Z (1) openshift-cluster-version Successfully pulled image "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338"
2020-06-03T05:36:55Z - 2020-06-03T05:36:55Z (1) openshift-cluster-version Created container cluster-version-operator
...
[6]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator-")).spec.containers[].image'
registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338

[7]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/clusterversion.json | jq '.items[].status.desired'
{
  "force": true,
  "image": "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338",
  "version": "4.5.0-0.ci-2020-06-03-045338"
}

[8]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring) + " " + .image'
2020-06-03T05:36:27Z  4.5.0-0.ci-2020-06-03-045338 Partial false registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
2020-06-03T05:07:12Z 2020-06-03T05:30:49Z 4.5.0-0.ci-2020-06-02-094408 Completed false registry.svc.ci.openshift.org/ocp/release@sha256:0f69c029d63dde168fceeb729fb37f48a2173d1a199c502fb175d860d9628e44

[9]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[] | (.metadata.generation | tostring) + " " + (.status.observedGeneration | tostring)'
2 2

[10]: https://github.com/openshift/origin/blob/a6374911c68081c3e87a01fb67a40f56ca2403ec/test/e2e/upgrade/upgrade.go#L295-L306

[11]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt

Jun 3 05:36:06.566: INFO: Cluster version:

[12]: $ w3m -dump -cols 200 'https://search.apps.build01.ci.devcluster.openshift.com/?search=Cluster+did+not+acknowledge+request+to+upgrade+in+a+reasonable+time&maxAge=48h&context=0&type=junit&groupBy=job' | grep 'of failures match'
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 11 runs, 27% failed, 33% of failures match
pull-ci-openshift-cluster-samples-operator-master-e2e-aws-upgrade - 6 runs, 50% failed, 33% of failures match
pull-ci-openshift-installer-master-e2e-aws-upgrade - 81 runs, 43% failed, 11% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-upgrade - 20 runs, 20% failed, 25% of failures match
rehearse-9448-pull-ci-openshift-kubernetes-marun-testing2-e2e-gcp-upgrade - 10 runs, 90% failed, 11% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 307 runs, 32% failed, 2% of failures match
pull-ci-openshift-cluster-samples-operator-release-4.5-e2e-aws-upgrade - 6 runs, 33% failed, 50% of failures match
pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-upgrade - 7 runs, 57% failed, 25% of failures match
pull-ci-openshift-oc-master-e2e-aws-upgrade - 20 runs, 35% failed, 14% of failures match
pull-ci-openshift-origin-master-e2e-gcp-upgrade - 41 runs, 32% failed, 8% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 70 runs, 30% failed, 5% of failures match
pull-ci-openshift-cloud-credential-operator-master-e2e-aws-upgrade - 22 runs, 45% failed, 10% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 2 runs, 100% failed, 50% of failures match
pull-ci-openshift-cluster-dns-operator-master-e2e-aws-upgrade - 15 runs, 67% failed, 20% of failures match
pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-gcp-upgrade - 48 runs, 29% failed, 7% of failures match
Still hard to debug without logs from the pre-update CVO, but one possibility would be that the delay is the result of the local Kubernetes API server going down, like bug 1842002.

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.name + " " + .message' | grep 'Started container kube-apiserver$'
2020-06-03T05:15:21Z - 2020-06-03T05:15:21Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f0ca982a2a38 Started container kube-apiserver
2020-06-03T05:16:14Z - 2020-06-03T05:16:14Z (1) kube-apiserver-ip-10-0-191-192.us-west-1.compute.internal.1614f0d6ef78c3f8 Started container kube-apiserver
2020-06-03T05:16:59Z - 2020-06-03T05:16:59Z (1) kube-apiserver-ip-10-0-129-130.us-west-1.compute.internal.1614f0e16bb47ee3 Started container kube-apiserver
2020-06-03T05:20:44Z - 2020-06-03T05:20:44Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f115c76d261f Started container kube-apiserver
2020-06-03T05:25:05Z - 2020-06-03T05:25:05Z (1) kube-apiserver-ip-10-0-191-192.us-west-1.compute.internal.1614f152b4353bb0 Started container kube-apiserver
2020-06-03T05:29:37Z - 2020-06-03T05:29:37Z (1) kube-apiserver-ip-10-0-129-130.us-west-1.compute.internal.1614f191fa156dfd Started container kube-apiserver
2020-06-03T05:33:51Z - 2020-06-03T05:33:51Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f1ccfe43c8a5 Started container kube-apiserver
2020-06-03T05:39:44Z - 2020-06-03T05:39:44Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f21f4c1631d0 Started container kube-apiserver

Those last two are close to the 5:34:06-to-5:36:06 window, and they're both on ip-10-0-231-208, which is where both the pre-update CVO and its immediate successor were running.
We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features. Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.
With David's events around this landed [1], here's a recent CI job that failed this test-case [2]. Timeline:

* 10:05:37, update requested [3]
* 10:05:37, RetrievePayload event [4]
* 10:05:38, created pod version--n59hb-r2qkg to retrieve the version [4]
* 10:05:44, VerifyPayload event [4]
* 10:05:44, PreconditionsPassed event [4]
* 10:05:44, PayloadLoaded event [4]
* 10:05:55, new CVO pod (presumably running the new image) [4]
* 10:05:59, new CVO pod pulls its own image (useless, but didn't take too long) [4]
* 10:06:00, new CVO container started (I'm not clear on why this doesn't happen before the redundant image pull; maybe it's not actually redundant) [4]
* 10:07:38, test-case failed, status.desired still pointing at 0.0.1-2020-08-02-091930 [3]
* 10:07:42, new CVO container becomes the leader [4]
* 10:07:43, RetrievePayload again (this one seems redundant too) [4]
* 10:07:43, VerifyPayload again (probably can drop this too when the CVO skips an explicit load when the desired payload is the current CVO's image) [4]
* 10:07:45, PayloadLoaded again [4]

So there's some useless stuff in there that we can drop to help stay under the 2m timeout. But the main issue seems to be leader election. Since [5,6] (bug 1670727), we've been using the 'Recreate' strategy for the CVO, so I don't see a need to keep leader election around. Should we drop it?

[1]: https://github.com/openshift/cluster-version-operator/pull/411
[2]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/build-log.txt
...
Aug 2 10:05:37.925: INFO: Starting upgrade to version= image=default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95
Aug 2 10:07:38.043: INFO: Cluster version:
...

[4]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .reason + ": " + .message' | grep '^2020-08-02T10:0[5-9]'
2020-08-02T10:05:37Z - 2020-08-02T10:05:37Z (1) RetrievePayload: retrieving payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:05:38Z - 2020-08-02T10:05:38Z (1) SuccessfulCreate: Created pod: version--n59hb-r2qkg
2020-08-02T10:05:40Z - 2020-08-02T10:05:40Z (1) AddedInterface: Add eth0 [10.129.0.62/23]
2020-08-02T10:05:41Z - 2020-08-02T10:05:41Z (1) Pulling: Pulling image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:05:42Z - 2020-08-02T10:05:42Z (1) Pulled: Successfully pulled image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95" in 1.575062464s
2020-08-02T10:05:42Z - 2020-08-02T10:05:42Z (1) Created: Created container payload
2020-08-02T10:05:43Z - 2020-08-02T10:05:43Z (1) Started: Started container payload
2020-08-02T10:05:43Z - 2020-08-02T10:05:43Z (1) Completed: Job completed
2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) SandboxChanged: Pod sandbox changed, it will be killed and re-created.
2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) VerifyPayload: verifying payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) PreconditionsPassed: preconditions passed for payload loaded version="0.0.1-2020-08-02-092717" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) PayloadLoaded: payload loaded version="0.0.1-2020-08-02-092717" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:05:45Z - 2020-08-02T10:05:45Z (1) Killing: Stopping container cluster-version-operator
2020-08-02T10:05:45Z - 2020-08-02T10:05:45Z (1) SuccessfulDelete: Deleted pod: cluster-version-operator-79b7cd95db-5qz4j
2020-08-02T10:05:45Z - 2020-08-02T10:05:45Z (1) ScalingReplicaSet: Scaled down replica set cluster-version-operator-79b7cd95db to 0
2020-08-02T10:05:46Z - 2020-08-02T10:05:46Z (1) AddedInterface: Add eth0 [10.129.0.63/23]
2020-08-02T10:05:55.451880Z Scheduled: Successfully assigned openshift-cluster-version/cluster-version-operator-5b6ff896c6-57ppb to ci-op-g5n8dbpm-db044-6wcxt-master-1
2020-08-02T10:05:55Z - 2020-08-02T10:05:55Z (1) SuccessfulCreate: Created pod: cluster-version-operator-5b6ff896c6-57ppb
2020-08-02T10:05:55Z - 2020-08-02T10:05:55Z (1) ScalingReplicaSet: Scaled up replica set cluster-version-operator-5b6ff896c6 to 1
2020-08-02T10:05:56Z - 2020-08-02T10:05:56Z (1) Pulling: Pulling image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:05:59Z - 2020-08-02T10:05:59Z (1) Pulled: Successfully pulled image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95" in 3.495433181s
2020-08-02T10:06:00Z - 2020-08-02T10:06:00Z (1) Created: Created container cluster-version-operator
2020-08-02T10:06:00Z - 2020-08-02T10:06:00Z (1) Started: Started container cluster-version-operator
2020-08-02T10:07:42Z - 2020-08-02T10:07:42Z (1) LeaderElection: ci-op-g5n8dbpm-db044-6wcxt-master-1_b5e78072-56bf-4d7d-bb2e-493b9b3630fa became leader
2020-08-02T10:07:43Z - 2020-08-02T10:07:43Z (1) RetrievePayload: retrieving payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:07:43Z - 2020-08-02T10:07:43Z (1) VerifyPayload: verifying payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
2020-08-02T10:07:45Z - 2020-08-02T10:07:45Z (1) PayloadLoaded: payload loaded version="0.0.1-2020-08-02-092717" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"

[5]: https://github.com/openshift/cluster-version-operator/pull/140
[6]: https://github.com/openshift/cluster-version-operator/pull/155
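For context on the 'Recreate' point above: with that strategy the Deployment controller scales the old ReplicaSet down before scaling the new one up, so two CVO pods should never run concurrently, which is the same exclusivity guarantee leader election provides. A sketch of the relevant stanza (field values other than the strategy type are illustrative, not copied from the actual CVO manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-version-operator
  namespace: openshift-cluster-version
spec:
  replicas: 1
  # Recreate tears down the old pod before creating its replacement,
  # unlike the default RollingUpdate, which can briefly run both.
  strategy:
    type: Recreate
```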
Abhinav points out that the delay may be the outgoing CVO not releasing the lease, so the incoming CVO has to wait for the old lease to expire. We have a TODO about setting ReleaseOnCancel [1]. I'll address that TODO, and we'll see if it fixes the slow leader elections... [1]: https://github.com/openshift/cluster-version-operator/blob/ed864d6f1ed3b43e7ec719d8b3691813a05cc34f/pkg/start/start.go#L136
From the PR job cited in comment 7, it's obvious we're waiting 1m41s on a lease, and the test-case failure falls within that waiting period:

I0802 10:06:01.056591 1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
I0802 10:06:01.057825 1 metrics.go:132] Metrics port listening for HTTP on 0.0.0.0:9099
I0802 10:06:01.057832 1 metrics.go:140] Metrics port listening for HTTPS on 0.0.0.0:9099
W0802 10:06:26.618036 1 metrics.go:404] availableUpdates is nil
W0802 10:06:27.659247 1 metrics.go:404] availableUpdates is nil
W0802 10:06:56.609267 1 metrics.go:404] availableUpdates is nil
W0802 10:06:57.652903 1 metrics.go:404] availableUpdates is nil
W0802 10:07:26.609033 1 metrics.go:404] availableUpdates is nil
W0802 10:07:27.655429 1 metrics.go:404] availableUpdates is nil
*** 10:07:38, test-case failed, status.desired still pointing at 0.0.1-2020-08-02-091930 [3] ***
I0802 10:07:42.632719 1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version
cvo#424 still waiting on review.
*** Bug 1870344 has been marked as a duplicate of this bug. ***
*** Bug 1872826 has been marked as a duplicate of this bug. ***
For Sippy, from bug 1872826: it seems some test suites call this test:

[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade
Setting "No Doc Update", because the effect of this was a new CVO possibly taking a minute or two to pick up the orphaned lease. Doesn't seem like a big deal, and we don't have a formal commitment around how quickly we turn that status acknowledgement around.
Preliminary CI results look promising. Adjusting maxAge from 72h to 24h to 12h:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&context=0&type=junit&groupBy=job&search=Cluster+version+operator+acknowledges+upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 405 runs, 37% failed, 5% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 27 runs, 37% failed, 20% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 53 runs, 23% failed, 8% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.5-to-4.6 - 3 runs, 100% failed, 33% of failures match

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&context=0&type=junit&groupBy=job&search=Cluster+version+operator+acknowledges+upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 91 runs, 41% failed, 3% of failures match

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=12h&context=0&type=junit&groupBy=job&search=Cluster+version+operator+acknowledges+upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 56 runs, 34% failed, 5% of failures match

We'll need more cook time to get more of those rollback runs under our belt, and we probably need the backport bug 1872906 for them too. But this doesn't seem to be a big player anymore. Not entirely clear where release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci went; I would have expected that to stay mad until the backport lands...
Most recent release-openshift-origin-installer-e2e-aws-upgrade failure was [1], from 4.5.7 to 4.6.0-0.nightly-2020-08-27-005538, so yeah, that's going to need the 4.5 backport to land as well. [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1298787752820609024
After a week's cook time, looking at CI test results from the past 24h, there are still "Cluster version operator acknowledges upgrade" failures in the release-openshift-origin-installer-e2e-aws-upgrade job, such as:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1301201781459521536
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1301241878347255808

So we may need further debugging here; assigning the bug back.
(In reply to liujia from comment #19)
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1301241878347255808

This one was 4.5.7 -> 4.6.0-0.ci-2020-09-02-193151, which won't have a fix until the 4.5 backport in bug 1872906 lands. But looking through [1] turns up a 4.6.0-0.nightly-2020-09-09-021653 -> 4.6.0-0.nightly-2020-09-09-062306 job [2] and a 4.6.0-0.nightly-2020-09-09-111337 -> 4.6.0-0.nightly-2020-09-09-152646 job [3]. Both hung after install, without noticing the update at all:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303581180779565056/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-09-09T06:37:38Z 2020-09-09T07:08:24Z 4.6.0-0.nightly-2020-09-09-021653 Completed false

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303718112817844224/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-09-09T15:47:01Z 2020-09-09T16:18:16Z 4.6.0-0.nightly-2020-09-09-111337 Completed false

So they sound more like bug 1873900 than this bug's lease-release issue.

[1]: https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=bug%2Bjunit&name=release-openshift-
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303581180779565056
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303718112817844224
Moving this bug to 4.7, as it is not critical for 4.6.
We keep landing logging improvements, most recently https://github.com/openshift/cluster-version-operator/pull/451. I still need to track down something from CI since that has landed and see if it helps us understand why we sometimes stop syncing manifests. I think that's orthogonal to the lease releases, though, so I'm fine moving this back to ON_QA if we want to narrowly consider the lease-release angle for this bug.
*** Bug 1881050 has been marked as a duplicate of this bug. ***
As per the description of https://bugzilla.redhat.com/show_bug.cgi?id=1881050, this issue is affecting 8% of all upgrades in 4.6, so changing the release target to 4.6 and bumping the severity.

test: [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade is failing frequently in CI; see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-cluster-lifecycle%5C%5D+Cluster+version+operator+acknowledges+upgrade

5% of upgrade attempts are still failing on this test. It seems to be worse on gcp and azure than on aws. It may be worse on 4.5 to 4.6 (so you may need backports first).

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307977375606116352
(In reply to Lalatendu Mohanty from comment #24)
> is failing frequently in CI, see search results:
> https://search.ci.openshift.org/?maxAge=168h...

168h takes us well before the workCh deadlock fix [1] for bug 1873900. Checking a smaller time window:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=72h&type=junit&search=Cluster+version+operator+acknowledges+upgrade' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1307238963374723072
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1307346484340985856
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1307898498426867712
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307091197818310656
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307352263563415552
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307502212473163776
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307977375606116352
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1307073492423806976
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1307618568002605056
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci/1307346028545970176

Let's also exclude OKD for simplicity (feel free to file a separate bug if folks feel there is an OKD-specific issue). We need to exclude 4.5 -> 4.6 updates until this 4.6 bug has been VERIFIED, to unlock the 4.5 backport (bug 1872906). That covers the 4.5-stable-to-4.6-ci jobs, and also the e2e-aws-upgrade/1307977375606116352 job. The remaining three e2e-aws-upgrade jobs are all 4.6 -> 4.7, and I don't want to worry about 4.7 releases until we accept a 4.7 CI payload [2].

[1]: https://github.com/openshift/cluster-version-operator/pull/455#event-3780199378
[2]: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.7.0-0.ci
From CI test results in the past 48h, there are two "Cluster version operator acknowledges upgrade" failures in the release-openshift-origin-installer-e2e-aws-upgrade job:

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307977375606116352
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307502212473163776

[1] is for 4.5.11 to 4.6.0-0.nightly-2020-09-21-093308, which can be excluded due to bug 1872906. [2] is for 4.6.0-0.fc.5 to 4.7.0-0.ci-2020-09-20-020844, which can also be excluded, since no 4.7 CI payload is available yet, per comment 25. So moving this bug to VERIFIED to get the 4.5 backport unblocked.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196