Bug 1909875

Summary: [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade : timed out waiting for cluster to acknowledge upgrade
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.7
Target Release: 4.8.0
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Ben Parees <bparees>
Assignee: Jack Ottofaro <jack.ottofaro>
QA Contact: liujia <jiajliu>
CC: aos-bugs, jack.ottofaro, jiajliu, jokerman, lmohanty, pmahajan, vrutkovs, wking
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Doc Text: This bug allows backport to version where issue occurs.
Cloned as: 1915559 (view as bug list)
Environment: [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Type: Bug
Bug Blocks: 1942055
Last Closed: 2021-07-27 22:35:34 UTC
Description Ben Parees 2020-12-21 21:55:07 UTC
The most recent issue with this test-case should have been resolved in 4.5.15 [1] and has not been backported to 4.4. The job you link is 4.4.31 -> 4.5.0-0.nightly-2020-12-21-141644. From the test-case stdout in the job you linked:

```
Dec 21 14:58:07.871: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-12-21-141644
STEP: continuously hitting pods through the service's LoadBalancer
Dec 21 15:00:08.007: INFO: Cluster version:
```

However, the gathered assets show the new target was picked up a few seconds later, at 15:00:33Z:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1341025758704832512/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring)'
2020-12-21T15:00:33Z - Partial 4.5.0-0.nightly-2020-12-21-141644 false
2020-12-21T14:34:05Z 2020-12-21T14:55:25Z Completed 4.4.31 false
```

4.4 maintenance ends when 4.7 GAs [2], so not quite yet, but given the minor delay at play here, I don't think this is important enough to be worth touching before 4.4 reaches end-of-life.

Going back to the CI-search query:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h' | grep 'failures match' | sort
pull-ci-openshift-installer-master-e2e-aws-upgrade - 26 runs, 81% failed, 5% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 107 runs, 22% failed, 17% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 16 runs, 88% failed, 7% of failures match
release-openshift-origin-installer-e2e-azure-ovn-upgrade-4.4-stable-to-4.5-ci - 7 runs, 100% failed, 14% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 7 runs, 86% failed, 17% of failures match
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci - 7 runs, 86% failed, 33% of failures match
```

So let's look at some of the 4.5 -> 4.6 failures. [3] is the only match. From its stdout:

```
Dec 21 14:28:36.080: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153
STEP: continuously hitting pods through the service's LoadBalancer
Dec 21 14:30:36.476: INFO: Cluster version:
```

And from its gathered assets:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring) + " " + .image'
2020-12-21T14:31:02Z - Partial 4.6.0-0.ci-2020-12-21-112400 false registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153
2020-12-21T14:01:26Z 2020-12-21T14:25:38Z Completed 4.5.24 false registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11
```

So another case where we were a few seconds over the test-case's threshold for noticing the new target. We have logs of a CVO pod starting at 14:31:37.658518 [4], so that's launched after we noticed the new target.
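For reference, the acknowledgment check the test performs is essentially a poll on the ClusterVersion object. A minimal Go sketch, assuming the openshift/client-go config clientset; the real check (quoted later in this bug) compares against the generation returned by the update request, and the 5s/2m poll parameters here are illustrative, not the test's actual values:

```go
package upgradetest

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// WaitForAcknowledgedUpgrade polls ClusterVersion until the CVO reports that
// it has observed the generation carrying the new desired update. Comparing
// against cv.Generation (rather than the generation returned by the update
// request, as the origin test does) is a simplification.
func WaitForAcknowledgedUpgrade(ctx context.Context, client configclient.Interface) error {
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient API blips, as seen in the logs below
		}
		return cv.Status.ObservedGeneration >= cv.Generation, nil
	})
	if err != nil {
		return fmt.Errorf("timed out waiting for cluster to acknowledge upgrade: %w", err)
	}
	return nil
}
```

If the CVO is down, restarting, or blocked on leader election during the whole window, Status.ObservedGeneration never advances and this poll times out with exactly the failure message in this bug's summary.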
Looks like that job was using the centrally-hosted Loki [5], but I'm not remembering how to find those at the moment. Looking at events:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/gather-extra/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
...
2020-12-21T14:17:23Z 1 LeaderElection: ip-10-0-190-118_ae94cf77-a279-41a4-a919-543580f87867 became leader
2020-12-21T14:31:02Z 1 LeaderElection: ip-10-0-190-118_0e7ed6e6-6980-4681-a100-0435ea453cfc became leader
2020-12-21T14:31:02Z 1 SuccessfulCreate: Created pod: version--h7d82-ljwqc
2020-12-21T14:31:05Z 1 AddedInterface: Add eth0 [10.129.0.61/23]
2020-12-21T14:31:05Z 1 Pulling: Pulling image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:11Z 1 Completed: Job completed
2020-12-21T14:31:11Z 1 Created: Created container payload
2020-12-21T14:31:11Z 1 Pulled: Successfully pulled image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:11Z 1 Started: Started container payload
2020-12-21T14:31:12Z 1 Killing: Stopping container cluster-version-operator
2020-12-21T14:31:12Z 1 SandboxChanged: Pod sandbox changed, it will be killed and re-created.
2020-12-21T14:31:12Z 1 ScalingReplicaSet: Scaled down replica set cluster-version-operator-7bfc778cdb to 0
2020-12-21T14:31:12Z 1 SuccessfulDelete: Deleted pod: cluster-version-operator-7bfc778cdb-5fzgs
2020-12-21T14:31:14Z 1 AddedInterface: Add eth0 [10.129.0.62/23]
2020-12-21T14:31:21Z 1 Pulling: Pulling image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:21Z 1 ScalingReplicaSet: Scaled up replica set cluster-version-operator-659478b6d4 to 1
2020-12-21T14:31:21Z 1 SuccessfulCreate: Created pod: cluster-version-operator-659478b6d4-z9ns9
2020-12-21T14:31:26Z 1 Created: Created container cluster-version-operator
2020-12-21T14:31:26Z 1 Pulled: Successfully pulled image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:26Z 1 Started: Started container cluster-version-operator
2020-12-21T14:31:27Z 1 LeaderElection: ip-10-0-155-110_91825bd9-329f-42aa-95f3-cfd8f45850b5 became leader
2020-12-21T14:31:27Z 1 RetrievePayload: retrieving payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:27Z 1 VerifyPayload: verifying payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:28Z 1 PayloadLoaded: payload loaded version="4.6.0-0.ci-2020-12-21-112400" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:28Z 1 ScalingReplicaSet: Scaled down replica set cluster-version-operator-659478b6d4 to 0
2020-12-21T14:31:28Z 1 SuccessfulDelete: Deleted pod: cluster-version-operator-659478b6d4-z9ns9
2020-12-21T14:31:29Z 1 Killing: Stopping container cluster-version-operator
2020-12-21T14:31:36Z 1 ScalingReplicaSet: Scaled up replica set cluster-version-operator-79d9b74fd8 to 1
2020-12-21T14:31:36Z 1 SuccessfulCreate: Created pod: cluster-version-operator-79d9b74fd8-q5vnj
2020-12-21T14:31:37Z 1 Created: Created container cluster-version-operator
2020-12-21T14:31:37Z 1 Pulled: Container image "registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153" already present on machine
2020-12-21T14:31:37Z 1 Started: Started container cluster-version-operator
2020-12-21T14:31:38Z 1 LeaderElection: ip-10-0-155-110_979e790d-f90e-48c7-b3c3-ded7c683362e became leader
2020-12-21T14:31:38Z 1 PayloadLoaded: payload loaded version="4.6.0-0.ci-2020-12-21-112400" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:38Z 1 RetrievePayload: retrieving payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
2020-12-21T14:31:38Z 1 VerifyPayload: verifying payload version="" image="registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153"
```

Not entirely clear to me, but it looks like maybe there's an extra CVO pod in the middle there. And the excitement starts at 14:31:02Z, which is already ~1.5 minutes after the new target was requested. Might be something going on there, but we'd need to find the hosted-Loki logs to say for sure. And if there's an issue in that job, it would be with the 4.5.24 jumping-off point.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1872906#c10
[2]: https://access.redhat.com/support/policy/updates/openshift#dates
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-cluster-version_cluster-version-operator-79d9b74fd8-q5vnj_cluster-version-operator.log
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1341017974881914880/artifacts/e2e-aws-upgrade/ipi-install-hosted-loki/

Created attachment 1745014 [details]
CVO logs

Seems the API connection was unstable at this time:

----
E1221 14:27:23.661434 1 leaderelection.go:320] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
I1221 14:27:23.977808 1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2020-12-21 14:27:23.977787534 +0000 UTC m=+750.236526296)
I1221 14:27:23.977909 1 cvo.go:463] Desired version from operator is v1.Update{Version:"4.5.24", Image:"registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11", Force:false}
----

and so the CVO restarted:

----
I1221 14:27:38.660520 1 leaderelection.go:277] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
I1221 14:27:38.660664 1 start.go:218] Run context completed; beginning two-minute graceful shutdown period.
I1221 14:27:38.660646 1 start.go:204] Stopped leading; shutting down.
E1221 14:27:38.660633 1 leaderelection.go:296] Failed to release lock: resource name may not be empty
I1221 14:27:38.660601 1 request.go:807] Error in request: resource name may not be empty
I1221 14:27:38.660695 1 start.go:214] Waiting on 2 outstanding goroutines.
I1221 14:27:38.660688 1 start.go:240] Collected leader controller goroutine.
I1221 14:27:38.660676 1 start.go:214] Waiting on 3 outstanding goroutines.
E1221 14:27:38.660671 1 event.go:269] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I1221 14:27:38.660690 1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-190-118_ae94cf77-a279-41a4-a919-543580f87867 stopped leading
I1221 14:27:38.660744 1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2020-12-21 14:27:38.660732924 +0000 UTC m=+764.919471709)
I1221 14:27:38.660810 1 cvo.go:463] Desired version from operator is v1.Update{Version:"4.5.24", Image:"registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11", Force:false}
I1221 14:27:38.660954 1 cvo.go:357] Shutting down ClusterVersionOperator
I1221 14:27:38.660921 1 cvo.go:433] Finished syncing cluster version "openshift-cluster-version/version" (182.199µs)
I1221 14:27:38.660977 1 start.go:214] Waiting on 1 outstanding goroutines.
I1221 14:27:38.660966 1 start.go:240] Collected main operator goroutine.
I1221 14:27:38.661161 1 start.go:240] Collected metrics server goroutine.
I1221 14:27:38.661224 1 start.go:250] Finished collecting operator goroutines.
W1221 14:27:38.661360 1 persist.go:51] Unable to save signatures during final flush: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-config-managed/configmaps/signatures-managed: dial tcp 127.0.0.1:6443: connect: connection refused
I1221 14:27:38.661349 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v4.5.0-202012120433.p0-dirty.
I1221 14:27:39.837434 1 start.go:21] ClusterVersionOperator v4.5.0-202012120433.p0-dirty
----

During that time the leader lock could not be refreshed; the lease was only reacquired a few minutes later:

----
I1221 14:30:21.484926 1 leaderelection.go:247] failed to acquire lease openshift-cluster-version/version
I1221 14:30:21.484895 1 leaderelection.go:341] lock is held by ip-10-0-190-118_ae94cf77-a279-41a4-a919-543580f87867 and has not yet expired
I1221 14:31:02.353489 1 leaderelection.go:252] successfully acquired lease openshift-cluster-version/version
I1221 14:31:02.353710 1 shared_informer.go:253] caches populated
I1221 14:31:02.353681 1 cvo.go:323] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
I1221 14:31:02.354012 1 cvo.go:498] Started syncing available updates "openshift-cluster-version/version" (2020-12-21 14:31:02.354004609 +0000 UTC m=+202.527064530)
I1221 14:31:02.354096 1 cvo.go:500] Finished syncing available updates "openshift-cluster-version/version" (85.572µs)
I1221 14:31:02.354137 1 cvo.go:521] Started syncing upgradeable "openshift-cluster-version/version" (2020-12-21 14:31:02.354130545 +0000 UTC m=+202.527190646)
I1221 14:31:02.354216 1 cvo.go:523] Finished syncing upgradeable "openshift-cluster-version/version" (80.279µs)
I1221 14:31:02.354260 1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-cluster-version", Name:"version", UID:"44e221d7-a7e5-43a6-bcf8-c76bb8812dfb", APIVersion:"v1", ResourceVersion:"27090", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-190-118_0e7ed6e6-6980-4681-a100-0435ea453cfc became leader
I1221 14:31:02.354256 1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2020-12-21 14:31:02.35424947 +0000 UTC m=+202.527309474)
I1221 14:31:02.354483 1 cvo.go:460] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ci-op-gtjmrr32/release@sha256:bb61b99c72d8addbe71800d38e6a1d961fb558af38372f8822fc164a3f0d0153", Force:true}
----

Grafana link: https://grafana-loki.svc.ci.openshift.org/goto/ByUs7o-Gk
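The gap between losing the lease at 14:27 and reacquiring it at 14:31 is characteristic of client-go leader election: a restarted candidate must wait out the previous holder's lease duration before it can take over, which is exactly the "lock is held by ... and has not yet expired" message above. A minimal sketch of that mechanism, assuming client-go's leaderelection package with a Lease lock; the identity, lock type, and durations here are illustrative, not the CVO's actual configuration (the logs show it used a ConfigMap lock):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "openshift-cluster-version", Name: "version"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Until LeaseDuration elapses after the previous holder's last
		// renewal, a new candidate logs "failed to acquire lease" and
		// retries every RetryPeriod — the delay visible in the logs above.
		LeaseDuration: 90 * time.Second,
		RenewDeadline: 45 * time.Second,
		RetryPeriod:   10 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start controllers here; this is also where a prompt
				// status resync after takeover would naturally live.
			},
			OnStoppedLeading: func() {
				log.Println("stopped leading; shutting down")
			},
		},
	})
}
```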
(In reply to Vadim Rutkovsky from comment #2)
> Seems the API connection was unstable at this time:
> ...
> Grafana link: https://grafana-loki.svc.ci.openshift.org/goto/ByUs7o-Gk

The test requests the upgrade at 14:58:07.871 and then starts monitoring progress [1], but gets no response. It would be nice to know what's in the log around that time.

...
Dec 21 14:58:07.871: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-12-21-141644
...

[1] https://github.com/openshift/origin/blob/ad9e07aa75958fc920f08464dba926b57a626a92/test/e2e/upgrade/upgrade.go#L317

That's odd, but we don't have records after 14:45 for the whole cluster. Not sure why, though; it seems the cluster was torn down much later (there was a test result at 15:05). Final log entries in the CVO pod are:

```
I1221 14:45:55.636966 1 cvo.go:332] Shutting down ClusterVersionOperator
I1221 14:45:55.637078 1 start.go:286] Collected main operator goroutine.
I1221 14:45:55.637217 1 start.go:260] Waiting on 2 outstanding goroutines.
I1221 14:45:55.636447 1 task_graph.go:532] Workers finished
I1221 14:45:55.645546 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is still updating]
I1221 14:45:55.637345 1 reflector.go:213] Stopping reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I1221 14:45:55.637417 1 reflector.go:213] Stopping reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
I1221 14:45:55.637463 1 reflector.go:213] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I1221 14:45:55.637520 1 reflector.go:213] Stopping reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I1221 14:45:55.637566 1 reflector.go:213] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I1221 14:45:55.645710 1 sync_worker.go:865] Summarizing 1 errors
I1221 14:45:55.646131 1 sync_worker.go:869] Update error 93 of 617: ClusterOperatorNotAvailable Cluster operator kube-apiserver is still updating (*errors.errorString: cluster operator kube-apiserver is still updating)
I1221 14:45:55.665697 1 sync_worker.go:361] Worker shut down
I1221 14:45:55.665682 1 sync_worker.go:274] Stopped worker
E1221 14:45:55.665641 1 sync_worker.go:348] unable to synchronize image (waiting 1m25.478249824s): Cluster operator kube-apiserver is still updating
F1221 14:45:55.875270 1 start.go:193] Received shutdown signal twice, exiting: terminated
k8s.io/klog/v2.(*loggingT).output(0x26b07c0, 0xc000000003, 0x0, 0x0, 0xc0001f25b0, 0x25f7daa, 0x8, 0xc1, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.stacks(0xc000012001, 0xc000ab2000, 0x60, 0x1fe)
goroutine 27 [running]:
```

At this time the kube API started failing:

Dec 21 14:56:11.245 E kube-apiserver Kube API started failing: Get https://api.ci-op-2cjtqwm4-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system?timeout=5s: dial tcp 52.15.198.53:6443: connect: connection refused

and then

Dec 21 15:05:26.562 E clusteroperator/kube-apiserver changed Degraded to True: NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 8:\nNodeInstallerDegraded:

I'll revisit this on fresh, workflow-based upgrade tests and have another look.
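The fatal "Received shutdown signal twice, exiting" line reflects a common two-stage signal pattern: the first TERM/INT starts a bounded graceful shutdown, and a second one aborts immediately. A minimal sketch of that pattern using only the standard library; the two-minute budget matches the "two-minute graceful shutdown period" in the logs above, while the function names and messages are illustrative, not the CVO's actual code:

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ch := make(chan os.Signal, 2)
	signal.Notify(ch, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		sig := <-ch
		log.Printf("received %s; beginning two-minute graceful shutdown period", sig)
		cancel() // first signal: stop accepting work and let goroutines drain

		select {
		case sig = <-ch:
			// Second signal before the drain finishes: exit immediately,
			// as in the "Received shutdown signal twice" log line above.
			log.Fatalf("received shutdown signal twice, exiting: %s", sig)
		case <-time.After(2 * time.Minute):
			log.Fatal("graceful shutdown budget exhausted, exiting")
		}
	}()

	run(ctx) // operator main loop; returns once ctx is cancelled and work drains
	log.Print("graceful shutdown complete")
}

// run stands in for the operator's controllers; it blocks until ctx is done.
func run(ctx context.Context) {
	<-ctx.Done()
	// ... collect outstanding goroutines here ...
}
```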
Another example of this test failing: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1348405150905012224

The test started at 23:42:

> 2021/01/10 23:42:09 Executing pod "e2e-aws-upgrade-openshift-e2e-test" with activeDeadlineSeconds=10800

At 23:43 the kube-apiserver connection is lost:

> E0110 23:43:38.059077 1 reflector.go:383] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.ConfigMap: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-config/configmaps?allowWatchBookmarks=true&resourceVersion=23787&timeout=8m27s&timeoutSeconds=507&watch=true: dial tcp 127.0.0.1:6443: connect: connection refused

The leader lease cannot be renewed, so the operator shuts down:

> I0110 23:44:03.841022 1 leaderelection.go:277] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
> E0110 23:44:03.841111 1 leaderelection.go:296] Failed to release lock: resource name may not be empty
> E0110 23:44:03.841101 1 event.go:269] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
> I0110 23:44:03.841089 1 request.go:807] Error in request: resource name may not be empty
> I0110 23:44:03.841119 1 start.go:204] Stopped leading; shutting down.
...
> I0110 23:44:03.841623 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v4.5.0-202101051830.p0-dirty.

After the restart the leader lease cannot be acquired - the previous lease hasn't expired yet:

> I0110 23:45:54.791121 1 leaderelection.go:247] failed to acquire lease openshift-cluster-version/version
> I0110 23:45:54.791095 1 leaderelection.go:341] lock is held by ip-10-0-229-138_debbb6eb-1ce3-4d07-ad54-f78f86d55fc8 and has not yet expired

Meanwhile the 4.6.ci update payload is downloaded:

> I0110 23:47:58.523747 1 verify.go:298] Unable to verify sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee against keyring verifier-public-key-redhat
> I0110 23:48:10.546761 1 payload.go:230] Loading updatepayload from "/etc/cvo/updatepayloads/XjG1rNwq9kcoYCqQPjIK4Q"
> I0110 23:48:11.293071 1 sync_worker.go:547] Payload loaded from registry.svc.ci.openshift.org/ci-op-fc9sc8b0/release@sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee with hash Kv17vphVbJM=

and so the operator starts syncing manifests:

> I0110 23:48:11.293190 1 cvo.go:431] Started syncing cluster version "openshift-cluster-version/version" (2021-01-10 23:48:11.293184032 +0000 UTC m=+246.667742394)
> I0110 23:48:11.293295 1 cvo.go:460] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ci-op-fc9sc8b0/release@sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee", Force:true}
> I0110 23:48:11.295993 1 task_graph.go:586] Waiting for workers to complete
> I0110 23:48:11.296086 1 task_graph.go:575] Running 0 on 3

Finally the lease is acquired (after ~5 minutes):

> I0110 23:48:46.467442 1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version
> I0110 23:48:52.084669 1 cvo.go:406] Started syncing cluster version "openshift-cluster-version/version" (2021-01-10 23:48:52.084654977 +0000 UTC m=+16.151734730)

But apparently the ClusterVersion status was not updated after the new lease was acquired, while the test expects that:

> return cv.Status.ObservedGeneration >= updated.Generation, nil

in https://github.com/openshift/origin/blob/release-4.6/test/e2e/upgrade/upgrade.go#L317, and eventually the test fails:

```
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:137]: during upgrade to registry.svc.ci.openshift.org/ci-op-fc9sc8b0/release@sha256:6ecfa480f2d4c44cf32abe9d6fca687c87b509525db56a8721dd05e6306f2bee
Unexpected error:
    <*errors.errorString | 0xc0021db070>: {
        s: "timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition",
    }
    timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition
occurred

failed: (10m37s) 2021-01-10T23:52:55 "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]"
```

Seems we should resync the ClusterVersion status when the lease is acquired.

We have not seen this in 4.6 to 4.7 updates (CI tests). Reducing the sev/prio to medium.

@jialiu Just need verification that you do not see this post-4.6 so I can use this bug to backport https://github.com/openshift/origin/pull/25984.

From CI test results in the past 48h, there are no failures post-4.6:

```
# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-azure-ovn-upgrade (all) - 8 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 22 runs, 59% failed, 23% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 20 runs, 100% failed, 30% of failures match = 30% impact
release-openshift-origin-installer-e2e-aws-upgrade (all) - 10 runs, 20% failed, 50% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade (all) - 8 runs, 63% failed, 20% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
```

periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-azure-ovn-upgrade -> v4.4 to v4.5 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade -> v4.5 to v4.6 upgrade
release-openshift-origin-installer-e2e-aws-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade -> v4.4 to v4.5 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade -> v4.5 to v4.6 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade -> v4.5 to v4.6 upgrade

So moving this bug to VERIFIED to get the v4.6 backport unblocked.
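For illustration, the resync-on-lease-acquisition idea suggested above could look roughly like the sketch below, assuming the openshift/client-go config clientset. The helper name and the bare generation bump are hypothetical; this is not necessarily what the linked PR does (that PR is against openshift/origin, i.e. the test suite), and a real CVO sync would update conditions and history as part of a full reconcile:

```go
package cvoresync

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// ResyncClusterVersionStatus is a hypothetical helper for a leader-election
// OnStartedLeading callback: it re-reads ClusterVersion and records that the
// (possibly updated) spec generation has been observed, so a watcher polling
// Status.ObservedGeneration sees the acknowledgment promptly after a takeover.
func ResyncClusterVersionStatus(ctx context.Context, client configclient.Interface) error {
	cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cv.Status.ObservedGeneration >= cv.Generation {
		return nil // nothing new to acknowledge
	}
	cv.Status.ObservedGeneration = cv.Generation
	// UpdateStatus writes only the status subresource of the object.
	_, err = client.ConfigV1().ClusterVersions().UpdateStatus(ctx, cv, metav1.UpdateOptions{})
	return err
}
```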
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438