Bug 1843505 - Cluster did not acknowledge request to upgrade in a reasonable time
Summary: Cluster did not acknowledge request to upgrade in a reasonable time
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Duplicates: 1870344 1872826 1881050 (view as bug list)
Depends On:
Blocks: 1872906
 
Reported: 2020-06-03 12:44 UTC by Ben Parees
Modified: 2020-09-22 03:04 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1872906 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 424 None closed Bug 1843505: pkg/start: Release leader lease on graceful shutdown 2020-09-21 13:13:25 UTC
Github openshift cluster-version-operator pull 426 None closed Bug 1843505: Have the leader election use faster defaults 2020-09-21 13:13:25 UTC

Description Ben Parees 2020-06-03 12:44:03 UTC
Description of problem:

Upgrade jobs failing with:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
Unexpected error:
    <*errors.errorString | 0xc0022d5740>: {
        s: "Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition",
    }
    Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition
occurred

example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107

recurring quite a bit:
https://search.apps.build01.ci.devcluster.openshift.com/?search=Cluster+did+not+acknowledge+request+to+upgrade+in+a+reasonable+time&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Abhinav Dahiya 2020-06-03 15:59:22 UTC
Moving the upgrade bug to the CVO component.

Comment 2 W. Trevor King 2020-06-04 23:40:52 UTC
Notes from the example job:

* Update was from 4.5.0-0.ci-2020-06-02-094408 -> 4.5.0-0.ci-2020-06-03-045338 [3,8].
* 5:34:06, update triggered [4], test suite gives two minutes to acknowledge [10].
* 5:36:06, test suite gives up [11].
* 5:36:27, ClusterVersion history entry records the attempt for the new release version [8].
* 5:36:55, CVO pod starts logging [2], running the new release image [6].
* 5:42:03, test harness gave up on waiting [1].
* Gathered ClusterVersion claims the new target version [7] and has both generation and observedGeneration 2 [9].

So it's hard to know exactly what's going on here without logs from the old pod, but my impression is that due to informer caching or some such we aren't noticing the ClusterVersion change quickly enough to satisfy the tests.  I'm going to keep this bug to be about making the CVO more responsive, but given the high failure rates [12], maybe we should temporarily raise the test suite timeout [8] to 3m or 4m so we don't mask other, more significant issues?

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt
     failed: (11m10s) 2020-06-03T05:42:03 "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]"
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/pods/openshift-cluster-version_cluster-version-operator-7ff65cb79-qs6lj_cluster-version-operator.log
     I0603 05:36:55.646275       1 start.go:19] ClusterVersionOperator v1.0.0-240-g2fdbb166-dirty
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt
     2020/06/03 04:59:22 Imported release 4.5.0-0.ci-2020-06-03-045338 created at 2020-06-03 04:56:04 +0000 UTC with 110 images to tag release:latest
     2020/06/03 05:00:15 Imported release 4.5.0-0.ci-2020-06-02-094408 created at 2020-06-02 09:48:28 +0000 UTC with 110 images to tag release:initial
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt
     Jun  3 05:34:06.190: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
[5]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message'
     ...
     2020-06-03T05:13:20Z - 2020-06-03T05:13:20Z (1) openshift-cluster-version Pulling image "registry.svc.ci.openshift.org/ocp/release@sha256:0f69c029d63dde168fceeb729fb37f48a2173d1a199c502fb175d860d9628e44"
     ...
     2020-06-03T05:23:18Z - 2020-06-03T05:23:18Z (1) openshift-cluster-version ip-10-0-231-208_00b6f5ea-d95a-4a65-9f7c-9ffb26cfcbb8 became leader
     2020-06-03T05:36:26Z - 2020-06-03T05:36:26Z (1) openshift-cluster-version ip-10-0-231-208_22dd3d94-ae5f-45fe-b74b-92d4440afea3 became leader
     ...
     2020-06-03T05:36:29Z - 2020-06-03T05:36:29Z (1) openshift-cluster-version Pulling image "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338"
     ...
     2020-06-03T05:36:36Z - 2020-06-03T05:36:36Z (1) openshift-cluster-version Deleted pod: cluster-version-operator-65d8d56b9f-knwt2
     2020-06-03T05:36:36Z - 2020-06-03T05:36:36Z (1) openshift-cluster-version Scaled down replica set cluster-version-operator-65d8d56b9f to 0
     ...
     2020-06-03T05:36:49.270551Z openshift-cluster-version Successfully assigned openshift-cluster-version/cluster-version-operator-7ff65cb79-qs6lj to ip-10-0-129-130.us-west-1.compute.internal
     2020-06-03T05:36:49Z - 2020-06-03T05:36:49Z (1) openshift-cluster-version Pulling image "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338"
     2020-06-03T05:36:49Z - 2020-06-03T05:36:49Z (1) openshift-cluster-version Created pod: cluster-version-operator-7ff65cb79-qs6lj
     2020-06-03T05:36:49Z - 2020-06-03T05:36:49Z (1) openshift-cluster-version Scaled up replica set cluster-version-operator-7ff65cb79 to 1
     2020-06-03T05:36:55Z - 2020-06-03T05:36:55Z (1) openshift-cluster-version Successfully pulled image "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338"
     2020-06-03T05:36:55Z - 2020-06-03T05:36:55Z (1) openshift-cluster-version Created container cluster-version-operator
     ...
[6]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator-")).spec.containers[].image'
     registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
[7]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/clusterversion.json | jq '.items[].status.desired'
     {
       "force": true,
       "image": "registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338",
       "version": "4.5.0-0.ci-2020-06-03-045338"
     }
[8]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring) + " " + .image'
     2020-06-03T05:36:27Z  4.5.0-0.ci-2020-06-03-045338 Partial false registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
     2020-06-03T05:07:12Z 2020-06-03T05:30:49Z 4.5.0-0.ci-2020-06-02-094408 Completed false registry.svc.ci.openshift.org/ocp/release@sha256:0f69c029d63dde168fceeb729fb37f48a2173d1a199c502fb175d860d9628e44
[9]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[] | (.metadata.generation | tostring) + " " + (.status.observedGeneration | tostring)'
     2 2
[10]: https://github.com/openshift/origin/blob/a6374911c68081c3e87a01fb67a40f56ca2403ec/test/e2e/upgrade/upgrade.go#L295-L306
[11]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/build-log.txt
      Jun  3 05:36:06.566: INFO: Cluster version:
[12]: $ w3m -dump -cols 200 'https://search.apps.build01.ci.devcluster.openshift.com/?search=Cluster+did+not+acknowledge+request+to+upgrade+in+a+reasonable+time&maxAge=48h&context=0&type=junit&groupBy=job' | grep 'of failures match'
      release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 11 runs, 27% failed, 33% of failures match
      pull-ci-openshift-cluster-samples-operator-master-e2e-aws-upgrade - 6 runs, 50% failed, 33% of failures match
      pull-ci-openshift-installer-master-e2e-aws-upgrade - 81 runs, 43% failed, 11% of failures match
      pull-ci-openshift-cluster-network-operator-master-e2e-gcp-upgrade - 20 runs, 20% failed, 25% of failures match
      rehearse-9448-pull-ci-openshift-kubernetes-marun-testing2-e2e-gcp-upgrade - 10 runs, 90% failed, 11% of failures match
      release-openshift-origin-installer-e2e-aws-upgrade - 307 runs, 32% failed, 2% of failures match
      pull-ci-openshift-cluster-samples-operator-release-4.5-e2e-aws-upgrade - 6 runs, 33% failed, 50% of failures match
      pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-upgrade - 7 runs, 57% failed, 25% of failures match
      pull-ci-openshift-oc-master-e2e-aws-upgrade - 20 runs, 35% failed, 14% of failures match
      pull-ci-openshift-origin-master-e2e-gcp-upgrade - 41 runs, 32% failed, 8% of failures match
      pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 70 runs, 30% failed, 5% of failures match
      pull-ci-openshift-cloud-credential-operator-master-e2e-aws-upgrade - 22 runs, 45% failed, 10% of failures match
      release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 2 runs, 100% failed, 50% of failures match
      pull-ci-openshift-cluster-dns-operator-master-e2e-aws-upgrade - 15 runs, 67% failed, 20% of failures match
      pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-gcp-upgrade - 48 runs, 29% failed, 7% of failures match
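The two-minute acknowledgement window from [10] is the usual poll-until-deadline pattern: the suite repeatedly reads ClusterVersion and checks whether status.desired has moved to the requested image. A minimal sketch of that pattern (the function and helper names here are illustrative, not the origin suite's actual code):

```python
import time

def wait_for_acknowledgement(get_cluster_version, target_image,
                             timeout=2 * 60, interval=10):
    """Poll ClusterVersion until status.desired points at the requested
    image, mirroring the two-minute acknowledgement window the origin
    test suite uses. get_cluster_version returns the object as a dict."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        cv = get_cluster_version()
        desired = cv.get("status", {}).get("desired", {})
        if desired.get("image") == target_image:
            return True  # cluster acknowledged the update request
        time.sleep(interval)
    return False  # "Cluster did not acknowledge request ... in a reasonable time"
```

In the failed job above, the CVO's answer arrived at 5:36:27, twenty-one seconds after such a deadline expired at 5:36:06.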

Comment 3 W. Trevor King 2020-06-11 04:30:17 UTC
Still hard to debug without logs from the pre-update CVO, but one possibility would be that the delay is the result of the local Kubernetes API server going down, like bug 1842002.

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107/artifacts/e2e-aws-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.name + " " + .message' | grep 'Started container kube-apiserver$'
2020-06-03T05:15:21Z - 2020-06-03T05:15:21Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f0ca982a2a38 Started container kube-apiserver
2020-06-03T05:16:14Z - 2020-06-03T05:16:14Z (1) kube-apiserver-ip-10-0-191-192.us-west-1.compute.internal.1614f0d6ef78c3f8 Started container kube-apiserver
2020-06-03T05:16:59Z - 2020-06-03T05:16:59Z (1) kube-apiserver-ip-10-0-129-130.us-west-1.compute.internal.1614f0e16bb47ee3 Started container kube-apiserver
2020-06-03T05:20:44Z - 2020-06-03T05:20:44Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f115c76d261f Started container kube-apiserver
2020-06-03T05:25:05Z - 2020-06-03T05:25:05Z (1) kube-apiserver-ip-10-0-191-192.us-west-1.compute.internal.1614f152b4353bb0 Started container kube-apiserver
2020-06-03T05:29:37Z - 2020-06-03T05:29:37Z (1) kube-apiserver-ip-10-0-129-130.us-west-1.compute.internal.1614f191fa156dfd Started container kube-apiserver
2020-06-03T05:33:51Z - 2020-06-03T05:33:51Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f1ccfe43c8a5 Started container kube-apiserver
2020-06-03T05:39:44Z - 2020-06-03T05:39:44Z (1) kube-apiserver-ip-10-0-231-208.us-west-1.compute.internal.1614f21f4c1631d0 Started container kube-apiserver

Those last two are close to the 5:34:06 to 5:36:06 window, and they're both on ip-10-0-231-208 which is where both the pre-update CVO and its immediate successor were running.

Comment 4 Lalatendu Mohanty 2020-06-22 11:37:35 UTC
We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features.  Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.

Comment 6 Jack Ottofaro 2020-07-30 20:07:35 UTC
We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features.  Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.

Comment 7 W. Trevor King 2020-08-03 19:41:02 UTC
With David's events for this area landed [1], here's a recent CI job that failed this test case [2].  Timeline:

* 10:05:37, update requested [3]
* 10:05:37, RetrievePayload event [4]
* 10:05:38, created pod version--n59hb-r2qkg to retrieve the version [4]
* 10:05:44, VerifyPayload event [4]
* 10:05:44, PreconditionsPassed event [4]
* 10:05:44, PayloadLoaded event [4]
* 10:05:55, new CVO pod (presumably running the new image) [4]
* 10:05:59, new CVO pod pulls its own image (useless, but didn't take too long) [4]
* 10:06:00, new CVO container started (I'm not clear on why this doesn't happen before the redundant image pull; maybe it's not actually redundant) [4]
* 10:07:38, test-case failed, status.desired still pointing at 0.0.1-2020-08-02-091930 [3]
* 10:07:42, new CVO container becomes the leader [4]
* 10:07:43, RetrievePayload again (this one seems redundant too) [4]
* 10:07:43, VerifyPayload again (probably can drop this too when the CVO skips an explicit load when the desired payload is the current CVO's image) [4]
* 10:07:45, PayloadLoaded again [4]

So there is some useless work in there that we can drop to help stay under the 2m timeout.  But the main issue seems to be leader election.  Since [5,6] (bug 1670727), we've been using the 'Recreate' strategy for the CVO, so I don't see a need to keep leader election around.  Should we drop it?

[1]: https://github.com/openshift/cluster-version-operator/pull/411
[2]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/build-log.txt
     ...
     Aug  2 10:05:37.925: INFO: Starting upgrade to version= image=default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95
     Aug  2 10:07:38.043: INFO: Cluster version:
     ...
[4]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .reason + ": " + .message' | grep '^2020-08-02T10:0[5-9]'
     2020-08-02T10:05:37Z - 2020-08-02T10:05:37Z (1) RetrievePayload: retrieving payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:05:38Z - 2020-08-02T10:05:38Z (1) SuccessfulCreate: Created pod: version--n59hb-r2qkg
     2020-08-02T10:05:40Z - 2020-08-02T10:05:40Z (1) AddedInterface: Add eth0 [10.129.0.62/23]
     2020-08-02T10:05:41Z - 2020-08-02T10:05:41Z (1) Pulling: Pulling image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:05:42Z - 2020-08-02T10:05:42Z (1) Pulled: Successfully pulled image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95" in 1.575062464s
     2020-08-02T10:05:42Z - 2020-08-02T10:05:42Z (1) Created: Created container payload
     2020-08-02T10:05:43Z - 2020-08-02T10:05:43Z (1) Started: Started container payload
     2020-08-02T10:05:43Z - 2020-08-02T10:05:43Z (1) Completed: Job completed
     2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) SandboxChanged: Pod sandbox changed, it will be killed and re-created.
     2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) VerifyPayload: verifying payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) PreconditionsPassed: preconditions passed for payload loaded version="0.0.1-2020-08-02-092717" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:05:44Z - 2020-08-02T10:05:44Z (1) PayloadLoaded: payload loaded version="0.0.1-2020-08-02-092717" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:05:45Z - 2020-08-02T10:05:45Z (1) Killing: Stopping container cluster-version-operator
     2020-08-02T10:05:45Z - 2020-08-02T10:05:45Z (1) SuccessfulDelete: Deleted pod: cluster-version-operator-79b7cd95db-5qz4j
     2020-08-02T10:05:45Z - 2020-08-02T10:05:45Z (1) ScalingReplicaSet: Scaled down replica set cluster-version-operator-79b7cd95db to 0
     2020-08-02T10:05:46Z - 2020-08-02T10:05:46Z (1) AddedInterface: Add eth0 [10.129.0.63/23]
     2020-08-02T10:05:55.451880Z Scheduled: Successfully assigned openshift-cluster-version/cluster-version-operator-5b6ff896c6-57ppb to ci-op-g5n8dbpm-db044-6wcxt-master-1
     2020-08-02T10:05:55Z - 2020-08-02T10:05:55Z (1) SuccessfulCreate: Created pod: cluster-version-operator-5b6ff896c6-57ppb
     2020-08-02T10:05:55Z - 2020-08-02T10:05:55Z (1) ScalingReplicaSet: Scaled up replica set cluster-version-operator-5b6ff896c6 to 1
     2020-08-02T10:05:56Z - 2020-08-02T10:05:56Z (1) Pulling: Pulling image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:05:59Z - 2020-08-02T10:05:59Z (1) Pulled: Successfully pulled image "default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95" in 3.495433181s
     2020-08-02T10:06:00Z - 2020-08-02T10:06:00Z (1) Created: Created container cluster-version-operator
     2020-08-02T10:06:00Z - 2020-08-02T10:06:00Z (1) Started: Started container cluster-version-operator
     2020-08-02T10:07:42Z - 2020-08-02T10:07:42Z (1) LeaderElection: ci-op-g5n8dbpm-db044-6wcxt-master-1_b5e78072-56bf-4d7d-bb2e-493b9b3630fa became leader
     2020-08-02T10:07:43Z - 2020-08-02T10:07:43Z (1) RetrievePayload: retrieving payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:07:43Z - 2020-08-02T10:07:43Z (1) VerifyPayload: verifying payload version="" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
     2020-08-02T10:07:45Z - 2020-08-02T10:07:45Z (1) PayloadLoaded: payload loaded version="0.0.1-2020-08-02-092717" image="default-route-openshift-image-registry.apps.build01.ci.devcluster.openshift.com/ci-op-g5n8dbpm/release@sha256:8333b4cb94a336245770f173b9054c0d86e038a17f15ed849a49deb0da68fd95"
[5]: https://github.com/openshift/cluster-version-operator/pull/140
[6]: https://github.com/openshift/cluster-version-operator/pull/155

Comment 8 W. Trevor King 2020-08-03 19:56:22 UTC
Abhinav points out that the delay may be the outgoing CVO not releasing the lease, so the incoming CVO has to wait for the old lease to expire.  We have a TODO about setting ReleaseOnCancel [1].  I'll address that TODO, and we'll see if it fixes the slow leader elections...

[1]: https://github.com/openshift/cluster-version-operator/blob/ed864d6f1ed3b43e7ec719d8b3691813a05cc34f/pkg/start/start.go#L136
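The fix amounts to the outgoing leader clearing its lease record on graceful shutdown instead of abandoning it, so the successor can acquire immediately rather than waiting out the remaining lease duration. A rough stdlib-only model of the two shutdown behaviors (the CVO actually uses client-go's leader election and its ReleaseOnCancel option; the Lease class and timings here are illustrative):

```python
class Lease:
    """Toy lease record, standing in for the lock object that
    client-go leader election renews."""
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

def acquire(lease, candidate, now, lease_duration):
    """A candidate may take the lease only if it is unheld or expired."""
    if lease.holder is None or now >= lease.expires_at:
        lease.holder = candidate
        lease.expires_at = now + lease_duration
        return True
    return False

def shutdown(lease, holder, release_on_cancel):
    """Graceful shutdown: with release_on_cancel the lease is freed at
    once; otherwise the successor must wait for it to expire."""
    if release_on_cancel and lease.holder == holder:
        lease.holder = None
        lease.expires_at = 0.0
```

Without the release, a new CVO that starts seconds after the old one exits is blocked until the orphaned lease times out; with it, the handoff is immediate.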

Comment 9 Scott Dodson 2020-08-04 13:08:43 UTC
From the PR job cited in comment 7 it's clear we're waiting 1m41s on a lease, and the test-case failure falls within that waiting period:

I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
I0802 10:06:01.057825       1 metrics.go:132] Metrics port listening for HTTP on 0.0.0.0:9099
I0802 10:06:01.057832       1 metrics.go:140] Metrics port listening for HTTPS on 0.0.0.0:9099
W0802 10:06:26.618036       1 metrics.go:404] availableUpdates is nil
W0802 10:06:27.659247       1 metrics.go:404] availableUpdates is nil
W0802 10:06:56.609267       1 metrics.go:404] availableUpdates is nil
W0802 10:06:57.652903       1 metrics.go:404] availableUpdates is nil
W0802 10:07:26.609033       1 metrics.go:404] availableUpdates is nil
W0802 10:07:27.655429       1 metrics.go:404] availableUpdates is nil

*** 10:07:38, test-case failed, status.desired still pointing at 0.0.1-2020-08-02-091930 [3] ***

I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

Comment 10 W. Trevor King 2020-08-21 22:28:24 UTC
cvo#424 still waiting on review.

Comment 11 W. Trevor King 2020-08-25 17:22:19 UTC
*** Bug 1870344 has been marked as a duplicate of this bug. ***

Comment 12 W. Trevor King 2020-08-26 18:07:40 UTC
*** Bug 1872826 has been marked as a duplicate of this bug. ***

Comment 13 W. Trevor King 2020-08-26 18:08:34 UTC
For Sippy, from bug 1872826, it seems some test suites call this test:

  [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade

Comment 16 W. Trevor King 2020-08-27 04:45:54 UTC
Setting "No Doc Update", because the effect of this was a new CVO possibly taking a minute or two to pick up the orphaned lease.  Doesn't seem like a big deal, and we don't have a formal commitment around how quickly we turn that status acknowledgement around.

Comment 17 W. Trevor King 2020-08-27 04:55:19 UTC
Preliminary CI results look promising.  Adjusting maxAge from 72h to 24h to 12h:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&context=0&type=junit&groupBy=job&search=Cluster+version+operator+acknowledges+upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 405 runs, 37% failed, 5% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 27 runs, 37% failed, 20% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 53 runs, 23% failed, 8% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.5-to-4.6 - 3 runs, 100% failed, 33% of failures match
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&context=0&type=junit&groupBy=job&search=Cluster+version+operator+acknowledges+upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 91 runs, 41% failed, 3% of failures match
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=12h&context=0&type=junit&groupBy=job&search=Cluster+version+operator+acknowledges+upgrade' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 56 runs, 34% failed, 5% of failures match

We'll need more cook time to get more of those rollback runs under our belt.  And we probably need the backport bug 1872906 for them too.  But doesn't seem to be a big player anymore.  Not entirely clear where release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci went; I would have expected that to stay mad until the backport lands...

Comment 18 W. Trevor King 2020-08-27 04:58:29 UTC
Most recent release-openshift-origin-installer-e2e-aws-upgrade failure was [1], from 4.5.7 to 4.6.0-0.nightly-2020-08-27-005538, so yeah, that's going to need the 4.5 backport to land as well.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1298787752820609024

Comment 19 liujia 2020-09-03 08:19:28 UTC
After a week of cook time, CI test results from the past 24h still show failures of "Cluster version operator acknowledges upgrade" in the "release-openshift-origin-installer-e2e-aws-upgrade" job.  For example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1301201781459521536
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1301241878347255808

So we may need further debugging here; assigning the bug back.

Comment 20 W. Trevor King 2020-09-10 05:01:04 UTC
(In reply to liujia from comment #19)
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1301241878347255808

This one was 4.5.7 -> 4.6.0-0.ci-2020-09-02-193151, which won't have a fix until the 4.5 backport in bug 1872906 lands.  But looking through [1] turns up a 4.6.0-0.nightly-2020-09-09-021653 -> 4.6.0-0.nightly-2020-09-09-062306 job [2] and a 4.6.0-0.nightly-2020-09-09-111337 -> 4.6.0-0.nightly-2020-09-09-152646 job [3].  Both hung after install, without noticing the update at all:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303581180779565056/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-09-09T06:37:38Z 2020-09-09T07:08:24Z 4.6.0-0.nightly-2020-09-09-021653 Completed false
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303718112817844224/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-09-09T15:47:01Z 2020-09-09T16:18:16Z 4.6.0-0.nightly-2020-09-09-111337 Completed false

so they sound more like bug 1873900 than this bug's lease-release.

[1]: https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=bug%2Bjunit&name=release-openshift-
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303581180779565056
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1303718112817844224

Comment 21 Lalatendu Mohanty 2020-09-10 12:55:27 UTC
Moving this bug to 4.7 as it is not critical for 4.6.

Comment 22 W. Trevor King 2020-09-13 05:21:58 UTC
We keep landing logging improvements, most recently https://github.com/openshift/cluster-version-operator/pull/451 .  I still need to track down something from CI since that landed, to see whether it helps us understand why we sometimes stop syncing manifests.  I think that's orthogonal to the lease releases, though, so I'm fine moving this back to ON_QA if we want to narrowly consider the lease-release angle for this bug.

Comment 23 Lalatendu Mohanty 2020-09-21 14:29:06 UTC
*** Bug 1881050 has been marked as a duplicate of this bug. ***

Comment 24 Lalatendu Mohanty 2020-09-21 14:32:17 UTC
As per the description of https://bugzilla.redhat.com/show_bug.cgi?id=1881050 this issue is affecting 8% of all upgrades in 4.6.  So changing the release target to 4.6 and bumping the severity.

test:
[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-cluster-lifecycle%5C%5D+Cluster+version+operator+acknowledges+upgrade

5% of upgrade attempts are still failing on this test.  It seems to be worse on gcp and azure than on aws.  It may be worse on 4.5 to 4.6 (so you may need backports first).

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307977375606116352

Comment 25 W. Trevor King 2020-09-21 17:18:01 UTC
(In reply to Lalatendu Mohanty from comment #24)
> is failing frequently in CI, see search results:
> https://search.ci.openshift.org/?maxAge=168h...

168h takes us well before the workCh deadlock fix [1] for bug 1873900.  Checking a smaller time window:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=72h&type=junit&search=Cluster+version+operator+acknowledges+upgrade' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1307238963374723072
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1307346484340985856
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1307898498426867712
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307091197818310656
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307352263563415552
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307502212473163776
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307977375606116352
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1307073492423806976
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1307618568002605056
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci/1307346028545970176

Let's also exclude OKD for simplicity (feel free to file a separate bug if folks feel like there is an OKD-specific issue).

We need to exclude 4.5 -> 4.6 updates until this 4.6 bug has been VERIFIED to unlock the 4.5 bug 1872906 backport.  That covers the 4.5-stable-to-4.6-ci jobs, and also the e2e-aws-upgrade/1307977375606116352 job.

The remaining three e2e-aws-upgrade jobs are all 4.6 -> 4.7, and I don't want to worry about 4.7 releases until we accept a 4.7 CI payload [2].

[1]: https://github.com/openshift/cluster-version-operator/pull/455#event-3780199378
[2]: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.7.0-0.ci

Comment 26 liujia 2020-09-22 03:04:18 UTC
From CI test results in the past 48h, there are two failures of "Cluster version operator acknowledges upgrade" in the "release-openshift-origin-installer-e2e-aws-upgrade" job.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307977375606116352
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1307502212473163776

[1] is for 4.5.11 to 4.6.0-0.nightly-2020-09-21-093308, which can be excluded due to #1872906.
[2] is for 4.6.0-fc.5 to 4.7.0-0.ci-2020-09-20-020844, which can also be excluded since no 4.7 CI payload has been accepted yet, according to comment 25.

So moving this bug to VERIFIED to unblock the 4.5 backport.

