4.5 bug 1814397 is still only in POST, so it is too soon for the Bugzilla bot to link this 4.4 bug to its backport PR. I'm linking it manually now, because that makes it easier for me to understand where we stand on 4.4 UpgradeBlockers ;).
*** Bug 1817847 has been marked as a duplicate of this bug. ***
Please make the bug description public unless it contains sensitive information. A private description impairs searching for bugs from CI tooling.
The upstream bug has Upgrade Impact Assessment questions; please answer them on the upstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=1814397#c14
*** Bug 1819232 has been marked as a duplicate of this bug. ***
Reproduced the bug when upgrading from 4.3.9 to nightly 4.4.

$ oc get node
NAME                                 STATUS                     ROLES    AGE   VERSION
qe-upg-share-mmc2q-compute-0         Ready                      worker   19h   v1.17.1
qe-upg-share-mmc2q-compute-1         Ready                      worker   19h   v1.17.1
qe-upg-share-mmc2q-compute-2         Ready                      worker   19h   v1.17.1
qe-upg-share-mmc2q-control-plane-0   Ready,SchedulingDisabled   master   19h   v1.16.2
qe-upg-share-mmc2q-control-plane-1   Ready                      master   19h   v1.16.2
qe-upg-share-mmc2q-control-plane-2   Ready                      master   19h   v1.16.2

$ oc get co machine-config -o yaml
<--snip-->
status:
  conditions:
  - lastTransitionTime: "2020-04-03T07:16:10Z"
    message: Cluster not available for 4.4.0-0.nightly-2020-04-02-130551
    status: "False"
    type: Available
  - lastTransitionTime: "2020-04-03T07:02:01Z"
    message: Working towards 4.4.0-0.nightly-2020-04-02-130551
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-04-03T07:16:10Z"
    message: 'Unable to apply 4.4.0-0.nightly-2020-04-02-130551: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-1d1b2a692c950078690d8b3b215bec2f expected a7b13759061f645a76f03c04d385d275bbbd0c02 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca, retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
<--snip-->

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-1d1b2a692c950078690d8b3b215bec2f   False     True       True       3              0                   0                     1                      20h
worker   rendered-worker-72f43e4889519a6ede04333776de8d32   True      False      False      3              3                   3                     0                      20h

$ oc get mcp master -o yaml
<--snip-->
  - lastTransitionTime: "2020-04-03T07:08:32Z"
    message: 'Node qe-upg-share-mmc2q-control-plane-0 is reporting: "rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
<--snip-->
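For readers unfamiliar with the "invalid cross-device link" message above: it is the EXDEV error that rename(2) returns when source and destination are on different filesystems. The sketch below only illustrates that error and one common way to tolerate it (copy, then rename within the target filesystem). It is not the machine-config-daemon code and not the fix shipped for this bug; `restore_file` and the `.tmp` suffix are hypothetical names chosen for the example.

```python
import errno
import os
import shutil


def restore_file(backup_path, target_path):
    """Move backup_path to target_path, even across filesystems.

    Hypothetical helper for illustration only.
    """
    try:
        # rename(2) is atomic, but only works within a single filesystem.
        os.rename(backup_path, target_path)
    except OSError as err:
        if err.errno != errno.EXDEV:
            raise
        # Cross-device case: copy the file next to the target first,
        # then swap it into place with a rename inside that filesystem.
        tmp_path = target_path + ".tmp"
        shutil.copy2(backup_path, tmp_path)
        os.replace(tmp_path, target_path)
        os.remove(backup_path)
```

Copying to a temporary name before the final `os.replace` keeps the target from ever being observed half-written, which is the property a plain rename would have given on a single filesystem.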
*** Bug 1821369 has been marked as a duplicate of this bug. ***
*** Bug 1821364 has been marked as a duplicate of this bug. ***
*** Bug 1821716 has been marked as a duplicate of this bug. ***
Upgraded from 4.3.9 to nightly 4.4.0-0.nightly-2020-04-07-130324 successfully and did not hit the issue from this bug, so moving the bug to "Verified".
Hit the issue again for the 4.2.29->4.3.18->4.4.1 upgrade. After upgrading v4.2.29 to v4.3.18 successfully, we continued upgrading the cluster to v4.4.1, but the upgrade failed.

# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.4.1: the cluster operator openshift-apiserver is degraded

# ./oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.1     True        False         False      4h56m
cloud-credential                           4.4.1     True        False         False      5h8m
cluster-autoscaler                         4.4.1     True        False         False      5h2m
console                                    4.4.1     True        False         False      137m
csi-snapshot-controller                    4.4.1     True        False         False      123m
dns                                        4.4.1     True        False         False      5h8m
etcd                                       4.4.1     True        False         False      145m
image-registry                             4.4.1     True        False         False      117m
ingress                                    4.4.1     True        False         False      117m
insights                                   4.4.1     True        False         False      5h8m
kube-apiserver                             4.4.1     True        False         False      5h7m
kube-controller-manager                    4.4.1     True        False         False      152m
kube-scheduler                             4.4.1     True        False         False      152m
kube-storage-version-migrator              4.4.1     True        False         False      123m
machine-api                                4.4.1     True        False         False      5h9m
machine-config                             4.3.18    False       True          True       116m
marketplace                                4.4.1     True        False         False      144m
monitoring                                 4.4.1     True        False         False      3h42m
network                                    4.4.1     True        False         False      5h7m
node-tuning                                4.4.1     True        False         False      145m
openshift-apiserver                        4.4.1     True        False         True       137m
openshift-controller-manager               4.4.1     True        False         False      5h7m
openshift-samples                          4.4.1     False       True          True       1s
operator-lifecycle-manager                 4.4.1     True        False         False      5h2m
operator-lifecycle-manager-catalog         4.4.1     True        False         False      5h5m
operator-lifecycle-manager-packageserver   4.4.1     True        False         False      136m
service-ca                                 4.4.1     True        False         False      5h8m
service-catalog-apiserver                  4.4.1     True        False         False      137m
service-catalog-controller-manager         4.4.1     True        False         False      4h7m
storage                                    4.4.1     True        False         False      145m

Checked: openshift-apiserver is degraded because one of the masters is unschedulable (SchedulingDisabled).

# ./oc get node
NAME                                             STATUS                     ROLES    AGE     VERSION
ugdci2-x9ljw-m-0.c.openshift-qe.internal         Ready,SchedulingDisabled   master   5h11m   v1.16.2
ugdci2-x9ljw-m-1.c.openshift-qe.internal         Ready                      master   5h11m   v1.16.2
ugdci2-x9ljw-m-2.c.openshift-qe.internal         Ready                      master   5h11m   v1.16.2
ugdci2-x9ljw-w-a-frg9s.c.openshift-qe.internal   Ready                      worker   5h5m    v1.17.1
ugdci2-x9ljw-w-b-qggrc.c.openshift-qe.internal   Ready                      worker   5h5m    v1.17.1
ugdci2-x9ljw-w-c-7knmr.c.openshift-qe.internal   Ready                      worker   5h6m    v1.17.1

# ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-51bc0454ee0f0a886b5812eb225f400e   False     True       True       3              0                   0                     1                      5h13m
worker   rendered-worker-1893f230e08f250db257307b8a6db414   True      False      False      3              3                   3                     0                      5h13m

Conditions of the master pool:

    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2020-04-30T08:47:50Z
    Message:               All nodes are updating to rendered-master-8cf84e0dd6e3e8e6b2a0d533d74074d8
    Reason:
    Status:                True
    Type:                  Updating
    Last Transition Time:  2020-04-30T08:48:09Z
    Message:               Node ugdci2-x9ljw-m-0.c.openshift-qe.internal is reporting: "rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link"
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2020-04-30T08:48:09Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
The upgrade from 4.3.18 to 4.4.1 succeeds for me on ipi-on-azure.
Dropping in a more complete summary of frequency from QE (from xiaoli): for bug 1817455, QE hit it 4 times going from 4.2 to 4.3 to 4.4 and 1 time going from 4.3 to 4.4 (AWS); 3 upgrades from 4.3 to 4.4 succeeded (on Azure, GCP, vSphere).
Ok, I've looked at an Azure cluster and I'm confirming my previous comment. The different failures across platforms are all triggered (or not) by the same root cause: a bugged backup and restore routine.

There are mainly two different ways to trigger this bug, but again with the same root cause:

- the diff between 3 machine configs (rendered in our case) triggers this

In the Azure case we can only see 2 rendered MCs, **that's why it doesn't trigger**.
In the AWS case we can see 3 rendered MCs, because another MC has been deployed to tweak chrony.
In the 4.2->4.3->4.4 case, regardless of the platform, we'll always have 3+ MCs, so this bug is triggered.

So: same root cause, same fix for the >=3 rendered machineconfigs case.
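The ">=3 rendered MCs" condition above can be sketched as a small check over a list of MachineConfig names. This is a rough illustration only: it assumes rendered configs are named `rendered-<pool>-<hash>` as in the output quoted in this bug, it takes the threshold of 3 from the comment above, and the function names and the example hashes are hypothetical. Counting names alone is a heuristic and says nothing about whether the diff between those configs actually touches an affected file.

```python
from collections import Counter


def rendered_counts(mc_names):
    """Count rendered MachineConfigs per pool from a list of MC names."""
    counts = Counter()
    for name in mc_names:
        if name.startswith("rendered-"):
            # Assumed layout: rendered-<pool>-<hash>; strip the prefix and
            # the trailing hash so that pool names containing dashes survive.
            pool = name[len("rendered-"):].rsplit("-", 1)[0]
            counts[pool] += 1
    return counts


def pools_at_risk(mc_names, threshold=3):
    """Return pools with at least `threshold` rendered configs (heuristic)."""
    counts = rendered_counts(mc_names)
    return sorted(pool for pool, n in counts.items() if n >= threshold)


# Hypothetical names: three rendered master configs, two rendered worker configs.
names = [
    "00-master",
    "rendered-master-aaaa",
    "rendered-master-bbbb",
    "rendered-master-cccc",
    "rendered-worker-dddd",
    "rendered-worker-eeee",
]
print(pools_at_risk(names))
```

The list of names would have to come from the cluster (for example from the NAME column of the MachineConfig listing); that step is left out here on purpose.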
Verified on 4.4.0-0.nightly-2020-04-30-145451. Upgrades are working again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again. [1]: https://github.com/openshift/enhancements/pull/475