Bug 1843732

Summary: [RHOCP4.4] Unable to upgrade OCP4.3.19 to OCP4.4 in disconnected env: CVO enters reconciling mode without applying any manifests in update mode
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cluster Version Operator
Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 4.3.z
CC: aos-bugs, jiajliu, jialiu, jokerman, mfuruta
Target Milestone: ---
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Cluster Version Operator had a race where it would consider a timed-out update reconciliation cycle as a successful update. The race was very rare, except for restricted-network clusters where the operator timed out attempting to fetch release image signatures.
Consequence: The Cluster Version Operator would enter its shuffled-manifest reconciliation mode, possibly breaking the cluster if the manifests were applied in an order that the components could not handle.
Fix: The Cluster Version Operator now treats those timed-out updates as failures.
Result: The Cluster Version Operator no longer enters reconciling mode before the update succeeds.
Story Points: ---
Clone Of: 1843526
Clones: 1843987
Environment:
Last Closed: 2020-06-17 22:27:05 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1843526
Bug Blocks: 1843987, 1844117
Attachments: qe cvo log (flags: none)

Description W. Trevor King 2020-06-03 23:37:41 UTC
+++ This bug was initially created as a clone of Bug #1843526 +++

+++ This bug was initially created as a clone of Bug #1838497 +++

--- Additional comment from W. Trevor King on 2020-05-28 08:39:16 UTC ---

Ok, we have a better lead on this issue after looking at some local reproducers and then taking a closer look at the CVO logs attached in comment 6.  From the attached unsuccessful_oc_logs_cluster_version_operator.log:

I0524 16:02:50.774740       1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
...
I0524 16:02:51.342076       1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
...
I0524 16:02:51.444381       1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Updating at attempt 0
...
I0524 16:08:36.950963       1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc with hash h110xMINmng=
...
I0524 16:08:36.953283       1 task_graph.go:611] Result of work: [update was cancelled at 0 of 573]
...
I0524 16:11:29.479262       1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Reconciling at attempt 0
...

So the 4.3.19 CVO loads the 4.4.3 manifests (in that case), begins updating to them, immediately hits a cancel/timeout [1], and then mistakenly decides that it successfully completed the update and starts Reconciling.  We're still not clear on exactly what the mistake is.

In the meantime, reconciliation's shuffled, flattened manifest graph can do bad things like updating the kubelets before updating the Kubernetes API server.  Raising to urgent while we work on bottoming this out.

[1]: https://github.com/openshift/cluster-version-operator/pull/372 will improve the logging here; in the meantime, the 5m45s between 16:02:51 and 16:08:36 roughly matches 2 * 2m52s [2].
[2]: https://github.com/openshift/cluster-version-operator/blob/86b9bdba55a85e2e071603916db4c43b481e7588/pkg/cvo/sync_worker.go#L296
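
If you want to check whether a given cluster has tripped this, the giveaway in the CVO log is a "Result of work: [update was cancelled ...]" line followed shortly by a "Running sync ... in state Reconciling" line, even though the ClusterVersion never reported the update as Completed.  Assuming the usual namespace and deployment names, something like:

$ oc -n openshift-cluster-version logs deployment/cluster-version-operator | grep -E 'Result of work|Running sync .* in state'

A healthy update keeps the worker in state Updating until the payload has actually been applied; flipping to Reconciling right after a cancellation, as in the excerpt above, is this race.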

--- Additional comment from W. Trevor King on 2020-05-28 12:46:48 UTC ---

PR submitted.  We should backport through 4.2, when we started supporting restricted-network flows, because the timed-out signature retrievals plus the forced updates that are common there are what make tripping this race more likely.  Here's a full impact statement, now that we understand the issue:

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  All customers upgrading out of a CVO that does not contain the patch are potentially affected, but the chance of tripping the race is very small except for restricted-network users who are forcing updates.
  The impact when the race trips is also small for patch-level bumps, so the main concern is restricted-network users who are performing minor bumps like 4.2->4.3.
What is the impact?  Is it serious enough to warrant blocking edges?
  The CVO enters reconciliation mode on the target version, attempting to apply a flat, shuffled manifest graph.  All kinds of terrible things could happen, like the machine-config operator trying to roll out the newer machine-os-content and its 4.4 hyperkube binary before rolling out prerequisites like the 4.4 kube-apiserver operator.  That one will make manifest application stick, but it would not surprise me if you could find some terrible ordering that might brick a cluster.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  The admin must update to a CVO that is not vulnerable to the race.  Using an unforced update (e.g. by copying in a signature ConfigMap [1] for 4.3.12+ or 4.4.0+; rough sketch below) would help reduce the likelihood of tripping the race.  Using a patch-level update would reduce the impact if the race trips anyway.

[1]: https://github.com/openshift/openshift-docs/pull/21993
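
For concreteness, the signature ConfigMap from [1] is roughly the following shape (digest taken from the 4.4.3 payload in the logs above; treat the exact name, label, and binaryData key format here as from-memory assumptions and defer to [1] for the authoritative steps):

$ oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  # name is arbitrary; the label is what the CVO's verifier looks for (assumption, see [1])
  name: sha256-039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc
  namespace: openshift-config-managed
  labels:
    release.openshift.io/verification-signatures: ""
binaryData:
  # base64-encoded signature-1 file for the target release digest (placeholder here)
  sha256-039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc-1: <base64 signature>
EOF

With a signature in place the update can run without --force, so the CVO verifies it locally instead of timing out trying to fetch it.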

--- Additional comment from Brenton Leanhardt on 2020-05-28 12:51:30 UTC ---

Are clusters that hit this bug permanently wedged or is there a chance a subsequent attempt avoids the race?

--- Additional comment from W. Trevor King on 2020-05-28 13:11:06 UTC ---

> Are clusters that hit this bug permanently wedged or is there a chance a subsequent attempt avoids the race?

Once you trip the race and move from UpdatingPayload to ReconcilingPayload, that CVO will not go back to updating.  You can re-target your update with 'oc adm upgrade ...' and that will get you back into UpdatingPayload mode.  But while the CVO was running between the race trip and your re-target, it could have been doing all sorts of things as it tried to push out the flattened, shuffled manifest graph.  Recovering a cluster that has hit this bug is going to be hard, and will probably involve a case-by-case review of its current state to determine a next-hop update target that is as close as possible to what the cluster is currently running, and that accounts for which directions and orders operators can transition in.  Worst case, short of a bricked cluster, would be having to turn the CVO off entirely and push manifests one at a time on its behalf to slowly unwind any components that had become too tangled up.
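
To give a rough idea of that worst case (a sketch, not a tested procedure; the namespace and deployment names are the standard ones, the manifest filename is a made-up example, and in a restricted network you would use the mirrored pullspec):

# stop the CVO so it cannot keep pushing the shuffled graph
$ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0
# pull the target release's manifests to disk
$ oc adm release extract --to=manifests quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc
# apply individual manifests in an order the components can tolerate
$ oc apply -f manifests/0000_20_example-operator_deployment.yaml
# once things are unwound, bring the CVO back and re-target the update
$ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=1

That is very much a last resort; re-targeting with 'oc adm upgrade ...' before the reconcile loop has done much damage is the far better path.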

Comment 1 Lalatendu Mohanty 2020-06-04 14:55:54 UTC
*** Bug 1843987 has been marked as a duplicate of this bug. ***

Comment 4 Johnny Liu 2020-06-08 05:11:46 UTC
Re-tested this bug with 4.4.0-0.nightly-2020-06-07-075345; still reproduced.

1. Set up a fully disconnected cluster with 4.3.19
2. Trigger an upgrade towards 4.4.0-0.nightly-2020-06-07-075345 with the --force option
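
For reference, the forced, by-digest trigger looks roughly like this (pullspec taken from the clusterversion status below; whether --allow-explicit-upgrade is also required depends on the oc version):

$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:7bfabfd2569bf719033f7c08c8040535627e07d36c5ecb78db9e7857ea325a4c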

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        True          2m40s   Working towards registry.svc.ci.openshift.org/ocp/release@sha256:7bfabfd2569bf719033f7c08c8040535627e07d36c5ecb78db9e7857ea325a4c: downloading update


[root@preserve-jialiu-ansible ~]# oc get node
NAME                                        STATUS                        ROLES    AGE    VERSION
ip-10-0-50-99.us-east-2.compute.internal    NotReady,SchedulingDisabled   master   135m   v1.16.2
ip-10-0-54-103.us-east-2.compute.internal   Ready                         worker   126m   v1.16.2
ip-10-0-55-33.us-east-2.compute.internal    Ready                         master   135m   v1.16.2
ip-10-0-63-79.us-east-2.compute.internal    NotReady,SchedulingDisabled   worker   126m   v1.16.2
ip-10-0-66-133.us-east-2.compute.internal   Ready                         master   135m   v1.16.2
ip-10-0-77-46.us-east-2.compute.internal    Ready

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        True          39m     Unable to apply 4.4.0-0.nightly-2020-06-07-075345: the cluster operator etcd is degraded


[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-06-07-075345   True        False         False      132m
cloud-credential                           4.3.19                              True        False         False      132m
cluster-autoscaler                         4.4.0-0.nightly-2020-06-07-075345   True        False         False      163m
console                                    4.3.19                              True        False         False      40m
dns                                        4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
etcd                                       4.4.0-0.nightly-2020-06-07-075345   True        False         True       25m
image-registry                             4.4.0-0.nightly-2020-06-07-075345   True        False         False      85m
ingress                                    4.3.19                              True        False         False      40m
insights                                   4.3.19                              True        False         False      164m
kube-apiserver                             4.3.19                              True        False         True       166m
kube-controller-manager                    4.3.19                              True        False         True       166m
kube-scheduler                             4.3.19                              True        False         True       166m
machine-api                                4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
machine-config                             4.3.19                              False       True          True       29m
marketplace                                4.4.0-0.nightly-2020-06-07-075345   True        False         False      50m
monitoring                                 4.4.0-0.nightly-2020-06-07-075345   True        False         False      81m
network                                    4.3.19                              True        False         False      168m
node-tuning                                4.4.0-0.nightly-2020-06-07-075345   True        False         False      47m
openshift-apiserver                        4.4.0-0.nightly-2020-06-07-075345   True        False         True       42m
openshift-controller-manager               4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
openshift-samples                          4.3.19                              True        False         False      154m
operator-lifecycle-manager                 4.3.19                              True        False         False      164m
operator-lifecycle-manager-catalog         4.3.19                              True        False         False      164m
operator-lifecycle-manager-packageserver   4.3.19                              True        False         False      39m
service-ca                                 4.4.0-0.nightly-2020-06-07-075345   True        False         False      168m
service-catalog-apiserver                  4.4.0-0.nightly-2020-06-07-075345   True        False         False      164m
service-catalog-controller-manager         4.4.0-0.nightly-2020-06-07-075345   True        False         False      164m
storage                                    4.4.0-0.nightly-2020-06-07-075345   True        False         False      53m


[root@ip-10-0-50-99 ~]# journalctl -f  -u kubelet
<--snip-->
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.983939   48332 kubelet_node_status.go:70] Attempting to register node ip-10-0-50-99.us-east-2.compute.internal
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984028   48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984072   48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.984265   48332 event.go:281] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-50-99.us-east-2.compute.internal", UID:"ip-10-0-50-99.us-east-2.compute.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientPID' Node ip-10-0-50-99.us-east-2.compute.internal status is now: NodeHasSufficientPID
Jun 08 05:00:59 ip-10-0-50-99 hyperkube[48332]: I0608 05:00:59.998603   48332 kubelet_node_status.go:112] Node ip-10-0-50-99.us-east-2.compute.internal was previously registered
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: E0608 05:01:00.005588   48332 kubelet_node_status.go:122] Unable to reconcile node "ip-10-0-50-99.us-east-2.compute.internal" with API server: error updating node: failed to patch status "{\"metadata\":{\"labels\":{\"node.kubernetes.io/instance-type\":\"m4.xlarge\",\"topology.kubernetes.io/region\":\"us-east-2\",\"topology.kubernetes.io/zone\":\"us-east-2a\"}}}" for node "ip-10-0-50-99.us-east-2.compute.internal": nodes "ip-10-0-50-99.us-east-2.compute.internal" is forbidden: is not allowed to modify labels: topology.kubernetes.io/region, topology.kubernetes.io/zone
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: I0608 05:01:00.083904   48332 prober.go:129] Liveness probe for "openshift-kube-scheduler-ip-10-0-50-99.us-east-2.compute.internal_openshift-kube-scheduler(9afa2b85dd4b8d7bf23e023894147856):scheduler" succeeded
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: I0608 05:01:00.173944   48332 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: E0608 05:01:00.174004   48332 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: the server could not find the requested resource
Jun 08 05:01:00 ip-10-0-50-99 hyperkube[48332]: F0608 05:01:00.174014   48332 csi_plugin.go:287] Failed to initialize CSINode after retrying: timed out waiting for the condition
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 08 05:01:00 ip-10-0-50-99 systemd[1]: kubelet.service: Consumed 2.790s CPU tim

Comment 6 Johnny Liu 2020-06-08 05:41:27 UTC
Created attachment 1695988 [details]
qe cvo log

Comment 7 W. Trevor King 2020-06-08 19:18:45 UTC
(In reply to Johnny Liu from comment #4)
> Re-test this bug with 4.4.0-0.nightly-2020-06-07-075345, still reproduced. 
> 
> 1. set up a full-disconnected cluster with 4.3.19
> 2. trigger upgrade towards 4.4.0-0.nightly-2020-06-07-075345 with --force
> option

The bug should be fixed when the patches land in the source release; the target release is irrelevant.  This bug targets 4.4.z, so you should be verifying with a recent-4.4-nightly -> whatever update.  Could be recent-4.4-nightly -> other-4.4.z.  Could be recent-4.4-nightly -> 4.5.0-rc.1.  But 4.3.z -> 4.4 is going to continue to fail until this 4.4.z bug gets VERIFIED, which will unblock the bug 1844117 4.3.z backport.  Once bug 1844117 gets fixed, then 4.3.z -> 4.4 should be safe.

Comment 8 Johnny Liu 2020-06-09 05:48:56 UTC
In my testing, I could not reproduce this issue from 4.4 to 4.5; it only reproduced from 4.3 to 4.4.

Here I trigger an upgrade against a fully disconnected cluster from 4.4.0-0.nightly-2020-06-07-075345 to 4.5.0-0.nightly-2020-06-04-001344. SUCCESS.

I will trigger one more round of testing from latest-4.3-nightly to 4.4.0-0.nightly-2020-06-07-075345 to confirm the original issue is really fixed.

Moving this bug to VERIFIED to unblock the 4.3.z backport.

Comment 10 errata-xmlrpc 2020-06-17 22:27:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2445