Bug 1843732
| Field | Value |
|---|---|
| Summary | [RHOCP4.4] Unable to upgrade OCP4.3.19 to OCP4.4 in disconnected env: CVO enters reconciling mode without applying any manifests in update mode |
| Product | OpenShift Container Platform |
| Component | Cluster Version Operator |
| Version | 4.3.z |
| Target Release | 4.4.z |
| Reporter | W. Trevor King <wking> |
| Assignee | W. Trevor King <wking> |
| QA Contact | Johnny Liu <jialiu> |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| CC | aos-bugs, jiajliu, jialiu, jokerman, mfuruta |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Clone Of | 1843526 |
| Duplicates | 1843987 (view as bug list) |
| Bug Depends On | 1843526 |
| Bug Blocks | 1843987, 1844117 |
| Last Closed | 2020-06-17 22:27:05 UTC |
| Attachments | qe cvo log (attachment 1695988) |

Doc Text:

Cause: The Cluster Version Operator had a race in which it could treat a timed-out update reconciliation cycle as a successful update. The race was very rare, except on restricted-network clusters, where the operator timed out while attempting to fetch release image signatures.

Consequence: The Cluster Version Operator would enter its shuffled-manifest reconciliation mode, possibly breaking the cluster if the manifests were applied in an order that the components could not handle.

Fix: The Cluster Version Operator now treats those timed-out updates as failures.

Result: The Cluster Version Operator no longer enters reconciling mode before the update succeeds.
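A minimal sketch of how to observe the symptom described above (the CVO declaring an update complete while components are still mid-update): the ClusterVersion update history and Progressing condition can be inspected with standard jsonpath queries, for example:

```console
# Update history: a "Completed" entry for the target version while cluster
# operators still report mixed versions would match the consequence above.
oc get clusterversion version \
  -o jsonpath='{range .status.history[*]}{.state}{"\t"}{.version}{"\n"}{end}'

# The Progressing condition should remain True, with a "Working towards ..."
# or "Unable to apply ..." message, until the update actually finishes.
oc get clusterversion version \
  -o jsonpath='{range .status.conditions[?(@.type=="Progressing")]}{.status}{"\t"}{.message}{"\n"}{end}'
```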
Description
W. Trevor King
2020-06-03 23:37:41 UTC
*** Bug 1843987 has been marked as a duplicate of this bug. ***

Johnny Liu (comment 4):

Re-test this bug with 4.4.0-0.nightly-2020-06-07-075345, still reproduced.
1. Set up a fully disconnected cluster on 4.3.19.
2. Trigger an upgrade towards 4.4.0-0.nightly-2020-06-07-075345 with the --force option (see the example command after this list).
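For reference, a minimal sketch of the trigger in step 2, assuming the release image has been mirrored so the cluster can reach it; the pullspec is the one from the transcript below and would normally point at the local mirror registry:

```console
# Force the update to an explicit release image (bypasses signature
# verification, which a fully disconnected cluster cannot perform).
# --allow-explicit-upgrade may be required by newer oc clients when the
# target is not listed in availableUpdates.
oc adm upgrade --force --allow-explicit-upgrade \
  --to-image registry.svc.ci.openshift.org/ocp/release@sha256:7bfabfd2569bf719033f7c08c8040535627e07d36c5ecb78db9e7857ea325a4c

# Then watch progress:
oc get clusterversion
oc get co
```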
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.19 True True 2m40s Working towards registry.svc.ci.openshift.org/ocp/release@sha256:7bfabfd2569bf719033f7c08c8040535627e07d36c5ecb78db9e7857ea325a4c: downloading update
[root@preserve-jialiu-ansible ~]# oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-50-99.us-east-2.compute.internal NotReady,SchedulingDisabled master 135m v1.16.2
ip-10-0-54-103.us-east-2.compute.internal Ready worker 126m v1.16.2
ip-10-0-55-33.us-east-2.compute.internal Ready master 135m v1.16.2
ip-10-0-63-79.us-east-2.compute.internal NotReady,SchedulingDisabled worker 126m v1.16.2
ip-10-0-66-133.us-east-2.compute.internal Ready master 135m v1.16.2
ip-10-0-77-46.us-east-2.compute.internal Ready
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.19 True True 39m Unable to apply 4.4.0-0.nightly-2020-06-07-075345: the cluster operator etcd is degraded
[root@preserve-jialiu-ansible ~]# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.4.0-0.nightly-2020-06-07-075345 True False False 132m
cloud-credential 4.3.19 True False False 132m
cluster-autoscaler 4.4.0-0.nightly-2020-06-07-075345 True False False 163m
console 4.3.19 True False False 40m
dns 4.4.0-0.nightly-2020-06-07-075345 True False False 168m
etcd 4.4.0-0.nightly-2020-06-07-075345 True False True 25m
image-registry 4.4.0-0.nightly-2020-06-07-075345 True False False 85m
ingress 4.3.19 True False False 40m
insights 4.3.19 True False False 164m
kube-apiserver 4.3.19 True False True 166m
kube-controller-manager 4.3.19 True False True 166m
kube-scheduler 4.3.19 True False True 166m
machine-api 4.4.0-0.nightly-2020-06-07-075345 True False False 168m
machine-config 4.3.19 False True True 29m
marketplace 4.4.0-0.nightly-2020-06-07-075345 True False False 50m
monitoring 4.4.0-0.nightly-2020-06-07-075345 True False False 81m
network 4.3.19 True False False 168m
node-tuning 4.4.0-0.nightly-2020-06-07-075345 True False False 47m
openshift-apiserver 4.4.0-0.nightly-2020-06-07-075345 True False True 42m
openshift-controller-manager 4.4.0-0.nightly-2020-06-07-075345 True False False 168m
openshift-samples 4.3.19 True False False 154m
operator-lifecycle-manager 4.3.19 True False False 164m
operator-lifecycle-manager-catalog 4.3.19 True False False 164m
operator-lifecycle-manager-packageserver 4.3.19 True False False 39m
service-ca 4.4.0-0.nightly-2020-06-07-075345 True False False 168m
service-catalog-apiserver 4.4.0-0.nightly-2020-06-07-075345 True False False 164m
service-catalog-controller-manager 4.4.0-0.nightly-2020-06-07-075345 True False False 164m
storage 4.4.0-0.nightly-2020-06-07-075345 True False False 53m
[root@ip-10-0-50-99 ~]# journalctl -f -u kubelet
<--snip-->
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.983939 48332 kubelet_node_status.go:70] Attempting to register node ip-10-0-50-99.us-east-2.compute.internal
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984028 48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984072 48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984265 48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientPID' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasSufficientPID
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.998603 48332 kubelet_node_status.go:112] Node ip-10-0-50-99.us-east-2.compute.internal was previously registered
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: E0608 05:01:00.005588 48332 kubelet_node_status.go:122] Unable to reconcile node "ip-10-0-50-99.us-east-2.compute.internal" with API server: error updating node: failed to patch status "{\"metadata\":{\"labels\":{\"node.kubernetes.io/instance-type\":\"m4.xlarge\",\"topology.kubernetes.io/region\":\"us-east-2\",\"topology.kubernetes.io/zone\":\"us-east-2a\"}}}" for node "ip-10-0-50-99.us-east-2.compute.internal": nodes "ip-10-0-50-99.us-east-2.compute.internal" is forbidden: is not allowed to modify labels: topology.kubernetes.io/region, topology.kubernetes.io/zone
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: I0608 05:01:00.083904 48332 prober.go:129] Liveness probe for "openshift-kube-scheduler-ip-10-0-50-99.us-east-2.compute.internal_openshift-kube-scheduler(9afa2b85dd4b8d7bf23e023894147856):scheduler" succeeded
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: I0608 05:01:00.173944 48332 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: E0608 05:01:00.174004 48332 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: the server could not find the requested resource
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: F0608 05:01:00.174014 48332 csi_plugin.go:287] Failed to initialize CSINode after retrying: timed out waiting for the condition
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Consumed 2.790s CPU time
Created attachment 1695988: qe cvo log
(In reply to Johnny Liu from comment #4)
> Re-test this bug with 4.4.0-0.nightly-2020-06-07-075345, still reproduced.
>
> 1. set up a full-disconnected cluster with 4.3.19
> 2. trigger upgrade towards 4.4.0-0.nightly-2020-06-07-075345 with --force option

The bug should be fixed when the patches land in the source release; the target release is irrelevant. This bug targets 4.4.z, so you should be verifying with a recent-4.4-nightly -> whatever update. Could be recent-4.4-nightly -> other-4.4.z. Could be recent-4.4-nightly -> 4.5.0-rc.1. But 4.3.z -> 4.4 is going to continue to fail until this 4.4.z bug gets VERIFIED, which will unblock the bug 1844117 4.3.z backport. Once bug 1844117 gets fixed, then 4.3.z -> 4.4 should be safe.

In my testing, I cannot reproduce this issue from 4.4 to 4.5; it only reproduces from 4.3 to 4.4. Here I triggered an upgrade of a fully disconnected cluster from 4.4.0-0.nightly-2020-06-07-075345 to 4.5.0-0.nightly-2020-06-04-001344. SUCCESS. I will trigger one more round of testing from latest-4.3-nightly to 4.4.0-0.nightly-2020-06-07-075345 to confirm the original issue is really fixed. Moving this bug to verified to unblock the 4.3.z backport.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2445
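As a supplement to the verification notes above, a minimal sketch of how to watch the CVO while re-testing from a recent 4.4 nightly (exact log wording varies by release, so no specific message is assumed):

```console
# Confirm the source release before triggering the test update.
oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'

# Follow what the Cluster Version Operator is doing during the update.
oc -n openshift-cluster-version logs deployment/cluster-version-operator --tail=50 -f

# Summarize the current update status as the CVO sees it.
oc adm upgrade
```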