Bug 1843732
Summary: [RHOCP4.4] Unable to upgrade OCP 4.3.19 to OCP 4.4 in a disconnected environment: CVO enters reconciling mode without applying any manifests in update mode

Product: OpenShift Container Platform
Component: Cluster Version Operator
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 4.3.z
Target Milestone: ---
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Reporter: W. Trevor King <wking>
Assignee: W. Trevor King <wking>
QA Contact: Johnny Liu <jialiu>
Docs Contact:
CC: aos-bugs, jiajliu, jialiu, jokerman, mfuruta
Doc Type: Bug Fix
Doc Text:
    Cause: The Cluster Version Operator had a race in which it could treat a timed-out update reconciliation cycle as a successful update. The race was very rare, except on restricted-network clusters, where the operator timed out while attempting to fetch release-image signatures.
    Consequence: The Cluster Version Operator would enter its shuffled-manifest reconciliation mode, possibly breaking the cluster if the manifests were applied in an order that the components could not handle.
    Fix: The Cluster Version Operator now treats those timed-out updates as failures.
    Result: The Cluster Version Operator no longer enters reconciling mode before the update succeeds.
Story Points: ---
Clone Of: 1843526
Duplicates: 1843987 (view as bug list)
Environment:
Last Closed: 2020-06-17 22:27:05 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1843526
Bug Blocks: 1843987, 1844117
Attachments: qe cvo log (attachment 1695988)
Description (W. Trevor King, 2020-06-03 23:37:41 UTC)
*** Bug 1843987 has been marked as a duplicate of this bug. ***

Comment 4 (Johnny Liu):

Re-tested this bug with 4.4.0-0.nightly-2020-06-07-075345; still reproduced.

1. Set up a fully disconnected cluster with 4.3.19.
2. Trigger an upgrade towards 4.4.0-0.nightly-2020-06-07-075345 with the --force option.

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        True          2m40s   Working towards registry.svc.ci.openshift.org/ocp/release@sha256:7bfabfd2569bf719033f7c08c8040535627e07d36c5ecb78db9e7857ea325a4c: downloading update

[root@preserve-jialiu-ansible ~]# oc get node
NAME                                        STATUS                        ROLES    AGE    VERSION
ip-10-0-50-99.us-east-2.compute.internal    NotReady,SchedulingDisabled   master   135m   v1.16.2
ip-10-0-54-103.us-east-2.compute.internal   Ready                         worker   126m   v1.16.2
ip-10-0-55-33.us-east-2.compute.internal    Ready                         master   135m   v1.16.2
ip-10-0-63-79.us-east-2.compute.internal    NotReady,SchedulingDisabled   worker   126m   v1.16.2
ip-10-0-66-133.us-east-2.compute.internal   Ready                         master   135m   v1.16.2
ip-10-0-77-46.us-east-2.compute.internal    Ready

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        True          39m     Unable to apply 4.4.0-0.nightly-2020-06-07-075345: the cluster operator etcd is degraded

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-06-07-075345   True        False         False      132m
cloud-credential                           4.3.19                              True        False         False      132m
cluster-autoscaler                         4.4.0-0.nightly-2020-06-07-075345   True        False         False      163m
console                                    4.3.19                              True        False         False      40m
dns                                        4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
etcd                                       4.4.0-0.nightly-2020-06-07-075345   True        False         True       25m
image-registry                             4.4.0-0.nightly-2020-06-07-075345   True        False         False      85m
ingress                                    4.3.19                              True        False         False      40m
insights                                   4.3.19                              True        False         False      164m
kube-apiserver                             4.3.19                              True        False         True       166m
kube-controller-manager                    4.3.19                              True        False         True       166m
kube-scheduler                             4.3.19                              True        False         True       166m
machine-api                                4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
machine-config                             4.3.19                              False       True          True       29m
marketplace                                4.4.0-0.nightly-2020-06-07-075345   True        False         False      50m
monitoring                                 4.4.0-0.nightly-2020-06-07-075345   True        False         False      81m
network                                    4.3.19                              True        False         False      168m
node-tuning                                4.4.0-0.nightly-2020-06-07-075345   True        False         False      47m
openshift-apiserver                        4.4.0-0.nightly-2020-06-07-075345   True        False         True       42m
openshift-controller-manager               4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
openshift-samples                          4.3.19                              True        False         False      154m
operator-lifecycle-manager                 4.3.19                              True        False         False      164m
operator-lifecycle-manager-catalog         4.3.19                              True        False         False      164m
operator-lifecycle-manager-packageserver   4.3.19                              True        False         False      39m
service-ca                                 4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
service-catalog-apiserver                  4.4.0-0.nightly-2020-06-07-075345   True        False         False      164m
service-catalog-controller-manager         4.4.0-0.nightly-2020-06-07-075345   True        False         False      164m
storage                                    4.4.0-0.nightly-2020-06-07-075345   True        False         False      53m

[root@ip-10-0-50-99 ~]# journalctl -f -u kubelet
<--snip-->
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.983939 48332 kubelet_node_status.go:70] Attempting to register node ip-10-0-50-99.us-east-2.compute.internal
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984028 48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984072 48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984265 48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientPID' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasSufficientPID
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.998603 48332 kubelet_node_status.go:112] Node ip-10-0-50-99.us-east-2.compute.internal was previously registered
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: E0608 05:01:00.005588 48332 kubelet_node_status.go:122] Unable to reconcile node "ip-10-0-50-99.us-east-2.compute.internal" with API server: error updating node: failed to patch status "{\"metadata\":{\"labels\":{\"node.kubernetes.io/instance-type\":\"m4.xlarge\",\"topology.kubernetes.io/region\":\"us-east-2\",\"topology.kubernetes.io/zone\":\"us-east-2a\"}}}" for node "ip-10-0-50-99.us-east-2.compute.internal": nodes "ip-10-0-50-99.us-east-2.compute.internal" is forbidden: is not allowed to modify labels: topology.kubernetes.io/region, topology.kubernetes.io/zone
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: I0608 05:01:00.083904 48332 prober.go:129] Liveness probe for "openshift-kube-scheduler-ip-10-0-50-99.us-east-2.compute.internal_openshift-kube-scheduler(9afa2b85dd4b8d7bf23e023894147856):scheduler" succeeded
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: I0608 05:01:00.173944 48332 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: E0608 05:01:00.174004 48332 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: the server could not find the requested resource
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: F0608 05:01:00.174014 48332 csi_plugin.go:287] Failed to initialize CSINode after retrying: timed out waiting for the condition
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Consumed 2.790s CPU tim

Created attachment 1695988 [details]
qe cvo log
(In reply to Johnny Liu from comment #4)
> Re-test this bug with 4.4.0-0.nightly-2020-06-07-075345, still reproduced.
>
> 1. set up a full-disconnected cluster with 4.3.19
> 2. trigger upgrade towards 4.4.0-0.nightly-2020-06-07-075345 with --force
> option

The bug should be fixed when the patches land in the source release; the target release is irrelevant. This bug targets 4.4.z, so you should be verifying with a recent-4.4-nightly -> whatever update. That could be recent-4.4-nightly -> other-4.4.z, or recent-4.4-nightly -> 4.5.0-rc.1. But 4.3.z -> 4.4 is going to continue to fail until this 4.4.z bug gets VERIFIED, which will unblock the bug 1844117 4.3.z backport. Once bug 1844117 gets fixed, 4.3.z -> 4.4 should be safe.

In my testing I could not reproduce this issue from 4.4 to 4.5; it only reproduced from 4.3 to 4.4. Here I triggered an upgrade of a fully disconnected cluster from 4.4.0-0.nightly-2020-06-07-075345 to 4.5.0-0.nightly-2020-06-04-001344: SUCCESS. I will run one more round of testing, from the latest 4.3 nightly to 4.4.0-0.nightly-2020-06-07-075345, to confirm the original issue is really fixed.

Moving this bug to VERIFIED to unblock the 4.3.z backport.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2445