+++ This bug was initially created as a clone of Bug #1838497 +++

--- Additional comment from W. Trevor King on 2020-05-28 08:39:16 UTC ---

Ok, we have a better lead on this issue after looking at some local reproducers and then taking a closer look at the CVO logs attached in comment 6. From the attached unsuccessful_oc_logs_cluster_version_operator.log:

I0524 16:02:50.774740 1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
...
I0524 16:02:51.342076 1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
...
I0524 16:02:51.444381 1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Updating at attempt 0
...
I0524 16:08:36.950963 1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc with hash h110xMINmng=
...
I0524 16:08:36.953283 1 task_graph.go:611] Result of work: [update was cancelled at 0 of 573]
...
I0524 16:11:29.479262 1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Reconciling at attempt 0
...

So the 4.3.19 CVO loads the 4.4.3 (in that case) manifests, begins updating to them, immediately hits a cancel/timeout [1], and then decides (mistakenly) that it successfully completed the update and starts Reconciling. We're still not clear on exactly what the mistake is. In the meantime, reconciliation's shuffled, flattened manifest graph can do bad things like updating the kubelets before updating the Kubernetes API server. Raising to urgent while we work on bottoming this out.

[1]: https://github.com/openshift/cluster-version-operator/pull/372 is filed to improve the logging here, but the 5m45s duration between 2:51 and 8:36 roughly matches 2 * 2m52s [2].
[2]: https://github.com/openshift/cluster-version-operator/blob/86b9bdba55a85e2e071603916db4c43b481e7588/pkg/cvo/sync_worker.go#L296
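If you want to check whether a given cluster has already tripped this, a rough sketch (assuming the stock CVO deployment name and namespace; adjust for your environment) is to dump the CVO logs and look for the cancellation-then-Reconciling sequence shown above:

  # Dump the current CVO logs; the deployment normally lives in openshift-cluster-version.
  $ oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log

  # A cancelled payload application shows up as "Result of work: [update was cancelled at ...]".
  $ grep 'update was cancelled at' cvo.log

  # On a vulnerable CVO that tripped the race, the next sync runs "in state Reconciling" even
  # though the new payload was never fully applied.
  $ grep 'Running sync .* in state' cvo.log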
--- Additional comment from W. Trevor King on 2020-05-28 12:46:48 UTC ---

PR submitted. We should backport through 4.2, when we started supporting restricted-network flows, because the timing-out signature retrievals plus forced updates common there are what make tripping this race more likely. Here's a full impact statement, now that we understand the issue:

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

All customers upgrading out of a CVO that does not contain the patch are potentially affected, but the chance of tripping the race is very small except for restricted-network users who are forcing updates. The impact when the race trips is also small for patch-level bumps, so the main concern is restricted-network users who are performing minor bumps like 4.2->4.3.

What is the impact? Is it serious enough to warrant blocking edges?

The CVO enters reconciliation mode on the target version, attempting to apply a flat, shuffled manifest graph. All kinds of terrible things could happen, like the machine-config operator trying to roll out the newer machine-os-content and its 4.4 hyperkube binary before rolling out prerequisites like the 4.4 kube-apiserver operator. That one will make manifest application sticky, but it would not surprise me if you could find some terrible ordering that might brick a cluster.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Admins must update to a CVO that is not vulnerable to the race. Using an unforced update (e.g. by copying in a signature ConfigMap [1] for 4.3.12+ or 4.4.0+) would help reduce the likelihood of tripping the race. Using a patch-level update would reduce the impact if the race trips anyway.

[1]: https://github.com/openshift/openshift-docs/pull/21993

--- Additional comment from Brenton Leanhardt on 2020-05-28 12:51:30 UTC ---

Are clusters that hit this bug permanently wedged, or is there a chance a subsequent attempt avoids the race?

--- Additional comment from W. Trevor King on 2020-05-28 13:11:06 UTC ---

> Are clusters that hit this bug permanently wedged or is there a chance a subsequent attempt avoids the race?

Once you trip the race and move from UpdatingPayload to ReconcilingPayload, that CVO will not go back to updating. You can re-target your update with 'oc adm upgrade ...' and that will get you back into UpdatingPayload mode. But while the CVO was running between the race trip and your update, it could have been doing all sorts of things as it tried to push out the flattened, shuffled manifest graph. Recovering a cluster that has hit this bug is going to be hard, and will probably involve a case-by-case review of its current state to try to determine a next-hop update target that is as close as possible to what the cluster is currently running, and that also accounts for which directions and orders operators can transition in. Worst case, short of a bricked cluster, would be having to turn the CVO off entirely and push manifests one at a time on its behalf to slowly unwind any components that had been too tangled up.
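To make the re-targeting and worst-case steps concrete, here is a rough sketch (the pull spec is a placeholder and exact oc flags vary by version; this is illustrative, not a prescribed recovery procedure):

  # Re-target the update at a release image; this moves the CVO from ReconcilingPayload back
  # into UpdatingPayload mode. Prefer an unforced update so signature verification still applies.
  $ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:<digest>

  # Worst case, short of a bricked cluster: stop the CVO entirely while manifests are pushed
  # by hand, then scale it back up once the components are untangled.
  $ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0
  $ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=1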
Verified this bug with 4.5.0-0.nightly-2020-06-04-001344; passed. Set up a pure disconnected cluster with 4.4.4, then triggered an upgrade with --force to 4.5.0-0.nightly-2020-06-04-001344.

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h24m
cloud-credential                           4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h22m
cluster-autoscaler                         4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h53m
config-operator                            4.5.0-0.nightly-2020-06-04-001344   True        False         False      111m
console                                    4.5.0-0.nightly-2020-06-04-001344   True        False         False      81m
csi-snapshot-controller                    4.5.0-0.nightly-2020-06-04-001344   True        False         False      97m
dns                                        4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h57m
etcd                                       4.5.0-0.nightly-2020-06-04-001344   True        False         False      3h31m
image-registry                             4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h52m
ingress                                    4.5.0-0.nightly-2020-06-04-001344   True        False         False      3h29m
insights                                   4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h54m
kube-apiserver                             4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h56m
kube-controller-manager                    4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h56m
kube-scheduler                             4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h56m
kube-storage-version-migrator              4.5.0-0.nightly-2020-06-04-001344   True        False         False      83m
machine-api                                4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h57m
machine-approver                           4.5.0-0.nightly-2020-06-04-001344   True        False         False      175m
machine-config                             4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h57m
marketplace                                4.5.0-0.nightly-2020-06-04-001344   True        False         False      81m
monitoring                                 4.5.0-0.nightly-2020-06-04-001344   True        False         False      80m
network                                    4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h58m
node-tuning                                4.5.0-0.nightly-2020-06-04-001344   True        False         False      108m
openshift-apiserver                        4.5.0-0.nightly-2020-06-04-001344   True        False         False      98m
openshift-controller-manager               4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h52m
openshift-samples                          4.5.0-0.nightly-2020-06-04-001344   True        False         False      99m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h57m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h57m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-06-04-001344   True        False         False      81m
service-ca                                 4.5.0-0.nightly-2020-06-04-001344   True        False         False      5h58m
storage                                    4.5.0-0.nightly-2020-06-04-001344   True        False         False      109m

[root@preserve-jialiu-ansible ~]# oc get node
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-50-112.us-east-2.compute.internal   Ready    master   6h3m    v1.18.3+a637491
ip-10-0-50-64.us-east-2.compute.internal    Ready    master   6h3m    v1.18.3+a637491
ip-10-0-62-65.us-east-2.compute.internal    Ready    worker   5h54m   v1.18.3+a637491
ip-10-0-67-224.us-east-2.compute.internal   Ready    master   6h3m    v1.18.3+a637491
ip-10-0-75-234.us-east-2.compute.internal   Ready    worker   5h53m   v1.18.3+a637491
ip-10-0-78-237.us-east-2.compute.internal   Ready    worker   5h53m   v1.18.3+a637491

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-06-04-001344   True        False         71m     Cluster version is 4.5.0-0.nightly-2020-06-04-001344

[root@preserve-jialiu-ansible ~]# grep -r "update was cancelled at" cvo.log
<--empty-->
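For reference, the forced update in a disconnected verification like the one above is typically triggered with something along these lines (a sketch; the pull spec is a placeholder for the mirrored target release, and some oc versions also require --allow-explicit-upgrade for a release image that is not in the advertised update graph):

  # Force the update to a specific mirrored release image, skipping signature verification --
  # the restricted-network flow that made this race easiest to trip on unpatched CVOs.
  $ oc adm upgrade --to-image=<mirrored-registry>/ocp-release@sha256:<digest> --force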
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409