Bug 1909875

Summary: [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade : timed out waiting for cluster to acknowledge upgrade
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Cluster Version Operator
Assignee: Jack Ottofaro <jack.ottofaro>
Status: CLOSED ERRATA
QA Contact: liujia <jiajliu>
Severity: medium
Priority: medium
Docs Contact:
Version: 4.7
CC: aos-bugs, jack.ottofaro, jiajliu, jokerman, lmohanty, pmahajan, vrutkovs, wking
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text: This bug allows backport to version where issue occurs.
Story Points: ---
Clone Of:
: 1915559 (view as bug list)
Environment:
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed: 2021-07-27 22:35:34 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1942055    
Attachments:
  CVO logs (flags: none)

Description Ben Parees 2020-12-21 21:55:07 UTC
test:
[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job


sample job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1341025758704832512


: [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade 
timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition

Comment 1 W. Trevor King 2020-12-21 23:16:13 UTC
The most recent issue with this test-case should have been resolved in 4.5.15 [1], and that fix has not been backported to 4.4.  The job you link is a 4.4.31 -> 4.5.0-0.nightly-2020-12-21-141644 update.  From the test-case stdout in that job:

  Dec 21 14:58:07.871: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-12-21-141644
  STEP: continuously hitting pods through the service's LoadBalancer
  Dec 21 15:00:08.007: INFO: Cluster version:

However, the gathered assets show the new target was picked up a few seconds later at 15:00:33Z:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1341025758704832512/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring)'
  2020-12-21T15:00:33Z - Partial 4.5.0-0.nightly-2020-12-21-141644 false
  2020-12-21T14:34:05Z 2020-12-21T14:55:25Z Completed 4.4.31 false

4.4 maintenance ends when 4.7 GAs [2], so not quite yet, but given the minor delay at play here, I don't think this is important enough to be worth touching before 4.4 reaches end-of-life.

Going back to the CI-search query:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h' | grep 'failures match' | sort
  pull-ci-openshift-installer-master-e2e-aws-upgrade - 26 runs, 81% failed, 5% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade - 107 runs, 22% failed, 17% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 16 runs, 88% failed, 7% of failures match
  release-openshift-origin-installer-e2e-azure-ovn-upgrade-4.4-stable-to-4.5-ci - 7 runs, 100% failed, 14% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 7 runs, 86% failed, 17% of failures match
  release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci - 7 runs, 86% failed, 33% of failures match

So let's look at some of the 4.5 -> 4.6 failures.  [3] is the only match.  From its stdout:

  Dec 21 14:28:36.080: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153
  STEP: continuously hitting pods through the service's LoadBalancer
  Dec 21 14:30:36.476: INFO: Cluster version:

And from its gathered assets:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring) + " " + .image'
  2020-12-21T14:31:02Z - Partial 4.6.0-0.ci-2020-12-21-112400 false registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153
  2020-12-21T14:01:26Z 2020-12-21T14:25:38Z Completed 4.5.24 false registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11

So this is another case where we were a few seconds over the test-case's threshold for noticing the new target.  We have logs from a CVO pod starting at 14:31:37.658518 [4], so that pod launched after the new target was noticed.  It looks like that job was using the centrally-hosted Loki [5], but I don't remember how to find those logs at the moment.  Looking at events:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/gather-extra/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
  ...
  2020-12-21T14:17:23Z 1 LeaderElection: ip-10-0-190-118_ae94cf77-a279-41a4-a919-543580f87867 became leader
  2020-12-21T14:31:02Z 1 LeaderElection: ip-10-0-190-118_0e7ed6e6-6980-4681-a100-0435ea453cfc became leader
  2020-12-21T14:31:02Z 1 SuccessfulCreate: Created pod: version--h7d82-ljwqc
  2020-12-21T14:31:05Z 1 AddedInterface: Add eth0 [10.129.0.61/23]
  2020-12-21T14:31:05Z 1 Pulling: Pulling image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:11Z 1 Completed: Job completed
  2020-12-21T14:31:11Z 1 Created: Created container payload
  2020-12-21T14:31:11Z 1 Pulled: Successfully pulled image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:11Z 1 Started: Started container payload
  2020-12-21T14:31:12Z 1 Killing: Stopping container cluster-version-operator
  2020-12-21T14:31:12Z 1 SandboxChanged: Pod sandbox changed, it will be killed and re-created.
  2020-12-21T14:31:12Z 1 ScalingReplicaSet: Scaled down replica set cluster-version-operator-7bfc778cdb to 0
  2020-12-21T14:31:12Z 1 SuccessfulDelete: Deleted pod: cluster-version-operator-7bfc778cdb-5fzgs
  2020-12-21T14:31:14Z 1 AddedInterface: Add eth0 [10.129.0.62/23]
  2020-12-21T14:31:21Z 1 Pulling: Pulling image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:21Z 1 ScalingReplicaSet: Scaled up replica set cluster-version-operator-659478b6d4 to 1
  2020-12-21T14:31:21Z 1 SuccessfulCreate: Created pod: cluster-version-operator-659478b6d4-z9ns9
  2020-12-21T14:31:26Z 1 Created: Created container cluster-version-operator
  2020-12-21T14:31:26Z 1 Pulled: Successfully pulled image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:26Z 1 Started: Started container cluster-version-operator
  2020-12-21T14:31:27Z 1 LeaderElection: ip-10-0-155-110_91825bd9-329f-42aa-95f3-cfd8f45850b5 became leader
  2020-12-21T14:31:27Z 1 RetrievePayload: retrieving payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:27Z 1 VerifyPayload: verifying payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:28Z 1 PayloadLoaded: payload loaded version="4.6.0-0.ci-2020-12-21-112400" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:28Z 1 ScalingReplicaSet: Scaled down replica set cluster-version-operator-659478b6d4 to 0
  2020-12-21T14:31:28Z 1 SuccessfulDelete: Deleted pod: cluster-version-operator-659478b6d4-z9ns9
  2020-12-21T14:31:29Z 1 Killing: Stopping container cluster-version-operator
  2020-12-21T14:31:36Z 1 ScalingReplicaSet: Scaled up replica set cluster-version-operator-79d9b74fd8 to 1
  2020-12-21T14:31:36Z 1 SuccessfulCreate: Created pod: cluster-version-operator-79d9b74fd8-q5vnj
  2020-12-21T14:31:37Z 1 Created: Created container cluster-version-operator
  2020-12-21T14:31:37Z 1 Pulled: Container image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153" already present on machine
  2020-12-21T14:31:37Z 1 Started: Started container cluster-version-operator
  2020-12-21T14:31:38Z 1 LeaderElection: ip-10-0-155-110_979e790d-f90e-48c7-b3c3-ded7c683362e became leader
  2020-12-21T14:31:38Z 1 PayloadLoaded: payload loaded version="4.6.0-0.ci-2020-12-21-112400" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:38Z 1 RetrievePayload: retrieving payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
  2020-12-21T14:31:38Z 1 VerifyPayload: verifying payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"

It's not entirely clear to me, but it looks like there may be an extra CVO pod in the middle there.  And the excitement starts at 14:31:02Z, which is already ~1.5 minutes after the new target was requested.  Something may be going on there, but we'd need to find the hosted-Loki logs to say for sure.  And if there's an issue in that job, it would be with the 4.5.24 jumping-off point.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1872906#c10
[2]: https://access.redhat.com/support/policy/updates/openshift#dates
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-cluster-version_cluster-version-operator-79d9b74fd8-q5vnj_cluster-version-operator.log
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/ipi-install-hosted-loki/

Comment 2 Vadim Rutkovsky 2021-01-06 17:56:36 UTC
Created attachment 1745014 [details]
CVO logs

It seems the API connection was unstable at this time:
----
E1221 14:27:23.661434       1 leaderelection.go:320] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
I1221 14:27:23.977808       1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2020-12-21 14:27:23.977787534 +0000 UTC m=+750.236526296)
I1221 14:27:23.977909       1 cvo.go:463] Desired version from operator is v1.Update{Version:"4.5.24", Image:"registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11", Force:false}
----

and so the CVO restarted:
----
I1221 14:27:38.660520       1 leaderelection.go:277] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
I1221 14:27:38.660664       1 start.go:218] Run context completed; beginning two-minute graceful shutdown period.
I1221 14:27:38.660646       1 start.go:204] Stopped leading; shutting down.
E1221 14:27:38.660633       1 leaderelection.go:296] Failed to release lock: resource name may not be empty
I1221 14:27:38.660601       1 request.go:807] Error in request: resource name may not be empty
I1221 14:27:38.660695       1 start.go:214] Waiting on 2 outstanding goroutines.
I1221 14:27:38.660688       1 start.go:240] Collected leader controller goroutine.
I1221 14:27:38.660676       1 start.go:214] Waiting on 3 outstanding goroutines.
E1221 14:27:38.660671       1 event.go:269] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I1221 14:27:38.660690       1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-190-118_ae94cf77-a279-41a4-a919-543580f87867 stopped leading
I1221 14:27:38.660744       1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2020-12-21 14:27:38.660732924 +0000 UTC m=+764.919471709)
I1221 14:27:38.660810       1 cvo.go:463] Desired version from operator is v1.Update{Version:"4.5.24", Image:"registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11", Force:false}
I1221 14:27:38.660954       1 cvo.go:357] Shutting down ClusterVersionOperator
I1221 14:27:38.660921       1 cvo.go:433] Finished syncing cluster version "openshift-cluster-version/version" (182.199µs)
I1221 14:27:38.660977       1 start.go:214] Waiting on 1 outstanding goroutines.
I1221 14:27:38.660966       1 start.go:240] Collected main operator goroutine.
I1221 14:27:38.661161       1 start.go:240] Collected metrics server goroutine.
I1221 14:27:38.661224       1 start.go:250] Finished collecting operator goroutines.
W1221 14:27:38.661360       1 persist.go:51] Unable to save signatures during final flush: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-config-managed/configmaps/signatures-managed: dial tcp 127.0.0.1:6443: connect: connection refused
I1221 14:27:38.661349       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v4.5.0-202012120433.p0-dirty.
I1221 14:27:39.837434       1 start.go:21] ClusterVersionOperator v4.5.0-202012120433.p0-dirty
----

During that time the leader lock could not be acquired; that only happened a few minutes later:

I1221 14:30:21.484926       1 leaderelection.go:247] failed to acquire lease openshift-cluster-version/version
I1221 14:30:21.484895       1 leaderelection.go:341] lock is held by ip-10-0-190-118_ae94cf77-a279-41a4-a919-543580f87867 and has not yet expired
I1221 14:31:02.353489       1 leaderelection.go:252] successfully acquired lease openshift-cluster-version/version
I1221 14:31:02.353710       1 shared_informer.go:253] caches populated
I1221 14:31:02.353681       1 cvo.go:323] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
I1221 14:31:02.354012       1 cvo.go:498] Started syncing available updates "openshift-cluster-version/version" (2020-12-21 14:31:02.354004609 +0000 UTC m=+202.527064530)
I1221 14:31:02.354096       1 cvo.go:500] Finished syncing available updates "openshift-cluster-version/version" (85.572µs)
I1221 14:31:02.354137       1 cvo.go:521] Started syncing upgradeable "openshift-cluster-version/version" (2020-12-21 14:31:02.354130545 +0000 UTC m=+202.527190646)
I1221 14:31:02.354216       1 cvo.go:523] Finished syncing upgradeable "openshift-cluster-version/version" (80.279µs)
I1221 14:31:02.354260       1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-cluster-version", Name:"version", UID:"44e221d7-a7e5-43a6-bcf8-c76bb8812dfb", APIVersion:"v1", ResourceVersion:"27090", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-190-118_0e7ed6e6-6980-4681-a100-0435ea453cfc became leader
I1221 14:31:02.354256       1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2020-12-21 14:31:02.35424947 +0000 UTC m=+202.527309474)
I1221 14:31:02.354483       1 cvo.go:460] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153", Force:true}

Grafana link: https://grafana-loki.svc.ci.openshift.org/goto/ByUs7o-Gk
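
For context on the lease mechanics above, a minimal client-go leader-election sketch follows; the lock namespace/name match the events quoted earlier, but the durations, identity, and callback wiring are illustrative assumptions rather than the CVO's actual configuration:

```
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The lock is the "version" ConfigMap in openshift-cluster-version, per the
	// LeaderElection events and log lines above.
	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{Namespace: "openshift-cluster-version", Name: "version"},
		Client:        client.CoreV1(),
		LockConfig:    resourcelock.ResourceLockConfig{Identity: "example-identity"}, // hypothetical identity
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Illustrative durations: a replacement pod can wait up to roughly
		// LeaseDuration for a stale lock (held by a dead leader) to expire
		// before it may lead, which contributes to the minutes-long gap
		// visible in the log above.
		LeaseDuration: 90 * time.Second,
		RenewDeadline: 45 * time.Second,
		RetryPeriod:   10 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start the operator's sync loops */ },
			OnStoppedLeading: func() { /* begin graceful shutdown */ },
		},
	})
}
```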

Comment 3 Jack Ottofaro 2021-01-07 13:51:38 UTC
(In reply to Vadim Rutkovsky from comment #2)
> [...]

The test requests the upgrade at 14:58:07.871 and then starts monitoring progress [1], but never sees the acknowledgement. It would be nice to know what's in the CVO log around that time.

...
Dec 21 14:58:07.871: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-12-21-141644
...

[1] https://github.com/openshift/origin/blob/ad9e07aa75958fc920f08464dba926b57a626a92/test/e2e/upgrade/upgrade.go#L317
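
For reference, a minimal sketch of the acknowledgement wait that the monitoring loop in [1] performs; the client interfaces, poll interval, and timeout below are assumptions, not the verbatim origin test code:

```
package sketch

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForAcknowledgement polls the ClusterVersion object until the CVO reports
// that it has observed the generation carrying the requested update.
func waitForAcknowledgement(ctx context.Context, c configclient.Interface, updated *configv1.ClusterVersion) error {
	return wait.PollImmediate(10*time.Second, 2*time.Minute, func() (bool, error) {
		cv, err := c.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient API disruption and keep polling
		}
		// The CVO acknowledges the update by bumping status.observedGeneration.
		return cv.Status.ObservedGeneration >= updated.Generation, nil
	})
}
```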

Comment 4 Vadim Rutkovsky 2021-01-08 18:28:53 UTC
That's odd: we don't have records after 14:45 for the whole cluster. I'm not sure why, since the cluster seems to have been torn down much later (there is a test result at 15:05).


The final log entries in the CVO pod are:
```
I1221 14:45:55.636966       1 cvo.go:332] Shutting down ClusterVersionOperator
I1221 14:45:55.637078       1 start.go:286] Collected main operator goroutine.
I1221 14:45:55.637217       1 start.go:260] Waiting on 2 outstanding goroutines.
I1221 14:45:55.636447       1 task_graph.go:532] Workers finished
I1221 14:45:55.645546       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is still updating]
I1221 14:45:55.637345       1 reflector.go:213] Stopping reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I1221 14:45:55.637417       1 reflector.go:213] Stopping reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
I1221 14:45:55.637463       1 reflector.go:213] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I1221 14:45:55.637520       1 reflector.go:213] Stopping reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I1221 14:45:55.637566       1 reflector.go:213] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I1221 14:45:55.645710       1 sync_worker.go:865] Summarizing 1 errors
I1221 14:45:55.646131       1 sync_worker.go:869] Update error 93 of 617: ClusterOperatorNotAvailable Cluster operator kube-apiserver is still updating (*errors.errorString: cluster operator kube-apiserver is still updating)
I1221 14:45:55.665697       1 sync_worker.go:361] Worker shut down
I1221 14:45:55.665682       1 sync_worker.go:274] Stopped worker
E1221 14:45:55.665641       1 sync_worker.go:348] unable to synchronize image (waiting 1m25.478249824s): Cluster operator kube-apiserver is still updating
F1221 14:45:55.875270       1 start.go:193] Received shutdown signal twice, exiting: terminated
k8s.io/klog/v2.(*loggingT).output(0x26b07c0, 0xc000000003, 0x0, 0x0, 0xc0001f25b0, 0x25f7daa, 0x8, 0xc1, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.stacks(0xc000012001, 0xc000ab2000, 0x60, 0x1fe)
goroutine 27 [running]:
```

At this time kube API started failing:

Dec 21 14:56:11.245 E kube-apiserver Kube API started failing: Get https://api.ci-op-2cjtqwm4-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system?timeout=5s: dial tcp 52.15.198.53:6443: connect: connection refused

and then

Dec 21 15:05:26.562 E clusteroperator/kube-apiserver changed Degraded to True: NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 8:\nNodeInstallerDegraded: 


I'll take another look at this on fresh, workflow-based upgrade tests.
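
The "Received shutdown signal twice, exiting" fatal above is the CVO's double-signal handling; a generic sketch of that pattern (not the actual start.go code) looks like this:

```
package sketch

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// handleSignals sketches the behaviour seen in the log: the first TERM/INT
// cancels the run context and begins the graceful shutdown (collecting
// goroutines), while a second signal aborts the process immediately.
func handleSignals(cancel context.CancelFunc) {
	ch := make(chan os.Signal, 2)
	signal.Notify(ch, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-ch
		cancel() // start graceful shutdown
		<-ch
		os.Exit(1) // second signal: exit without waiting
	}()
}
```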

Comment 5 Vadim Rutkovsky 2021-01-11 10:03:34 UTC
Another example of this test failing: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1348405150905012224

Test started at 23:42:

> 2021/01/10 23:42:09 Executing pod "e2e-aws-upgrade-openshift-e2e-test" with activeDeadlineSeconds=10800

At 23:43 kube-apiserver connection is lost:

> E0110 23:43:38.059077       1 reflector.go:383] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.ConfigMap: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-config/configmaps?allowWatchBookmarks=true&resourceVersion=23787&timeout=8m27s&timeoutSeconds=507&watch=true: dial tcp 127.0.0.1:6443: connect: connection refused

The leader lease cannot be renewed, so the operator shuts down:

> I0110 23:44:03.841022       1 leaderelection.go:277] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
> E0110 23:44:03.841111       1 leaderelection.go:296] Failed to release lock: resource name may not be empty
> E0110 23:44:03.841101       1 event.go:269] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
> I0110 23:44:03.841089       1 request.go:807] Error in request: resource name may not be empty
> I0110 23:44:03.841119       1 start.go:204] Stopped leading; shutting down.
...
> I0110 23:44:03.841623       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v4.5.0-202101051830.p0-dirty.

After the restart the leader lease cannot be acquired - the previous lease hasn't expired yet:

>	I0110 23:45:54.791121       1 leaderelection.go:247] failed to acquire lease openshift-cluster-version/version
>	I0110 23:45:54.791095       1 leaderelection.go:341] lock is held by ip-10-0-229-138_debbb6eb-1ce3-4d07-ad54-f78f86d55fc8 and has not yet expired

Meanwhile the 4.6 CI update payload is downloaded and loaded:
>	I0110 23:47:58.523747       1 verify.go:298] Unable to verify sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee against keyring verifier-public-key-redhat
> I0110 23:48:10.546761       1 payload.go:230] Loading updatepayload from "/etc/cvo/updatepayloads/XjG1rNwq9kcoYCqQPjIK4Q"
> I0110 23:48:11.293071       1 sync_worker.go:547] Payload loaded from registry.svc.ci.openshift.org/ci-op-fc9sc8b0/release@sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee with hash Kv17vphVbJM=

So the CVO starts syncing manifests:
>	 I0110 23:48:11.293190       1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2021-01-10 23:48:11.293184032 +0000 UTC m=+246.667742394)
> I0110 23:48:11.293295       1 cvo.go:460] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ci-op-fc9sc8b0/release@sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee", Force:true}
> I0110 23:48:11.295993       1 task_graph.go:586] Waiting for workers to complete
> I0110 23:48:11.296086       1 task_graph.go:575] Running 0 on 3

Finally the lease is acquired (after 5 mins):

> I0110 23:48:46.467442       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version
> I0110 23:48:52.084669       1 cvo.go:406] Started syncing cluster version "openshift-cluster-version/version" (2021-01-10 23:48:52.084654977 +0000 UTC m=+16.151734730)

But apparently the ClusterVersion status was not updated after the new lease was acquired, while the test expects:

> return cv.Status.ObservedGeneration >= updated.Generation, nil

in https://github.com/openshift/origin/blob/release-4.6/test/e2e/upgrade/upgrade.go#L317,

and eventually the test fails:

```
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:137]: during upgrade to registry.svc.ci.openshift.org/ci-op-fc9sc8b0/release@sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee
Unexpected error:
    <*errors.errorString | 0xc0021db070>: {
        s: "timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition",
    }
    timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition
occurred

failed: (10m37s) 2021-01-10T23:52:55 "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]"
```

It seems we should resync the ClusterVersion status when the lease is acquired.
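
A hypothetical sketch of that resync, assuming a workqueue-driven controller; the queue key matches the lock name seen in the logs, but the function and its wiring are illustrative, not the eventual fix:

```
package sketch

import (
	"context"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/util/workqueue"
)

// newLeaderCallbacks queues an immediate ClusterVersion sync the moment
// leadership is (re)acquired, so status.observedGeneration catches up with any
// spec change (such as the desiredUpdate set by the test) that happened while
// no leader was active.
func newLeaderCallbacks(queue workqueue.RateLimitingInterface, run func(context.Context)) leaderelection.LeaderCallbacks {
	return leaderelection.LeaderCallbacks{
		OnStartedLeading: func(ctx context.Context) {
			queue.Add("openshift-cluster-version/version") // force an early status resync
			run(ctx)                                       // then start the normal controller loops
		},
		OnStoppedLeading: func() {
			// fall back to the existing graceful-shutdown path
		},
	}
}
```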

Comment 6 Lalatendu Mohanty 2021-03-09 17:33:39 UTC
We have not seen this in 4.6 to 4.7 updates (CI tests). Reducing the severity/priority to medium.

Comment 8 Jack Ottofaro 2021-03-22 15:19:57 UTC
@jialiu I just need verification that you do not see this post-4.6, so I can use this bug to backport https://github.com/openshift/origin/pull/25984.

Comment 9 liujia 2021-03-23 06:43:15 UTC
From CI test results in the past 48h, there are no failures post-4.6.

# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep 'failures match'
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-azure-ovn-upgrade (all) - 8 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 22 runs, 59% failed, 23% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 20 runs, 100% failed, 30% of failures match = 30% impact
release-openshift-origin-installer-e2e-aws-upgrade (all) - 10 runs, 20% failed, 50% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade (all) - 8 runs, 63% failed, 20% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact

periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-azure-ovn-upgrade -> v4.4 to v4.5 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade -> v4.5 to v4.6 upgrade
release-openshift-origin-installer-e2e-aws-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade -> v4.4 to v4.5 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade  -> v4.5 to v4.6 upgrade

So I'm moving this bug to VERIFIED to unblock the v4.6 backport.

Comment 12 errata-xmlrpc 2021-07-27 22:35:34 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438