+++ This bug was initially created as a clone of Bug #1834194 +++

Description of problem:
The machine-config cluster operator becomes Degraded during the upgrade.

Version-Release number of selected component (if applicable):
during upgrade from 4.1.0-0.nightly-2020-05-04-100857 -> 4.2.0-0.nightly-2020-05-07-194422 -> 4.3.0-0.nightly-2020-05-07-171148 -> 4.4.0-0.nightly-2020-05-08-224132

How reproducible:
Sometimes; hit twice in upgrade CI in one day.

Steps to Reproduce:
1. Set up a 4.1 cluster and upgrade it to 4.4 step by step
2. Monitor the upgrade progress

Actual results:
2. The upgrade to 4.4.0-0.nightly-2020-05-08-224132 failed with the error:
timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h3m
cloud-credential                           4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
cluster-autoscaler                         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
console                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      105m
csi-snapshot-controller                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      103m
dns                                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
etcd                                       4.4.0-0.nightly-2020-05-08-224132   True        False         False      119m
image-registry                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      143m
ingress                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h9m
insights                                   4.4.0-0.nightly-2020-05-08-224132   True        False         False      3h33m
kube-apiserver                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      118m
kube-controller-manager                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-scheduler                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-storage-version-migrator              4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m
machine-api                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
machine-config                             4.3.0-0.nightly-2020-05-07-171148   False       True          True       88m
marketplace                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      109m
monitoring                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      136m
network                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
node-tuning                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      100m
openshift-apiserver                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
openshift-controller-manager               4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h13m
openshift-samples                          4.4.0-0.nightly-2020-05-08-224132   True        False         False      4m53s
operator-lifecycle-manager                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-ca                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
service-catalog-apiserver                  4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-catalog-controller-manager         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h2m
storage                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m

# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-05-09T22:22:18Z
  Generation:          1
  Resource Version:    137088
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 8b48ddbf-9243-11ea-b1ae-0050568b3a4a
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Cluster not available for 4.4.0-0.nightly-2020-05-08-224132
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-05-10T01:04:34Z
    Message:               Working towards 4.4.0-0.nightly-2020-05-08-224132
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-05-08-224132: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-09T23:34:49Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.3.0-0.nightly-2020-05-07-171148
Events:  <none>

Expected results:
2. The upgrade from 4.3 -> 4.4 should succeed.

Additional info:
Please see the must-gather logs in the comments.

--- Additional comment from Yadan Pei on 2020-05-11 09:49:52 UTC ---

must-gather logs:

http://10.73.131.57:9000/minio/openshift-must-gather/2020-05-09-22-45-29/must-gather.local.6006406755885421395.tar.gz
Access Key: 'openshift'
Secret Key: 'am5bM2Es8SRYe$^A'

http://10.73.131.57:9000/minio/openshift-must-gather/2020-05-10-22-33-11/must-gather.local.1373645183016184295.tar.gz
Access Key: 'openshift'
Secret Key: 'am5bM2Es8SRYe$^A'

--- Additional comment from Yu Qi Zhang on 2020-05-11 21:11:13 UTC ---

First glance: the error blocking the upgrade is:

2020-05-10T02:42:20.322072985Z E0510 02:42:20.322026       1 container_runtime_config_controller.go:374] could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.322072985Z I0510 02:42:20.322048       1 container_runtime_config_controller.go:375] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.355788752Z I0510 02:42:20.355743       1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

You can see this in the machine-config-controller logs. Basically a nil pointer error for PlatformStatus on vSphere.

A question: you say this happens "sometimes". How reproducible is this on vSphere?

CC'ing Christian since he worked on this recently and may have a better idea what the root cause is.

--- Additional comment from Yu Qi Zhang on 2020-05-11 21:19:27 UTC ---

Actually, I can see that the vSphere file in question was last updated in February on the release-4.4 branch: https://github.com/openshift/machine-config-operator/blob/release-4.4/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml. Adding Joseph to see if he knows what's up.

--- Additional comment from Joseph Callen on 2020-05-11 21:35:39 UTC ---

4.4 does not include vSphere IPI. This check was added so that UPI did not get the in-network services. Perhaps we need an additional check for .Infra.Status.PlatformStatus or .Infra.Status.

--- Additional comment from Yu Qi Zhang on 2020-05-11 21:46:44 UTC ---

Should we add a {{ if .Infra.Status -}} to everything in https://github.com/openshift/machine-config-operator/commit/49dbfb23527502b7201241dcb865cd197d088a0e, or is that a bit overkill?

--- Additional comment from Joseph Callen on 2020-05-12 14:47:50 UTC ---

PR: https://github.com/openshift/machine-config-operator/pull/1728

Need to clone this bug for 4.4 and 4.5.
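For reference, the nil-pointer failure and the effect of an {{ if }} guard can be reproduced outside the cluster with Go's text/template package (which the MCO uses for these template files). This is a minimal sketch with simplified stand-in structs, not the actual MCO render-config types:

```go
package main

import (
	"fmt"
	"io"
	"text/template"
)

// Simplified stand-ins for the render config; the real types are richer,
// but the nil-pointer behaviour during template execution is the same.
type PlatformStatus struct{ VSphere string }
type InfrastructureStatus struct{ PlatformStatus *PlatformStatus }
type Infrastructure struct{ Status *InfrastructureStatus }
type RenderConfig struct{ Infra Infrastructure }

// render executes tmpl against cfg, discarding output; only the error matters here.
func render(tmpl string, cfg RenderConfig) error {
	t, err := template.New("t").Parse(tmpl)
	if err != nil {
		return err
	}
	return t.Execute(io.Discard, cfg)
}

func main() {
	// On clusters installed before platformStatus existed, the field is unset.
	cfg := RenderConfig{Infra: Infrastructure{Status: &InfrastructureStatus{}}}

	// Unguarded field access fails, mirroring the controller log above.
	fmt.Println("unguarded:", render(`{{.Infra.Status.PlatformStatus.VSphere}}`, cfg))

	// {{if}} treats a nil pointer as false and skips the block instead of erroring.
	fmt.Println("guarded:  ", render(
		`{{if .Infra.Status.PlatformStatus}}{{.Infra.Status.PlatformStatus.VSphere}}{{end}}`, cfg))
}
```

The unguarded call returns a "nil pointer evaluating" error like the one in the logs, while the guarded version succeeds by skipping the block, which is the shape of the fix discussed above.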
Version: 4.5.0-0.nightly-2020-05-19-031245

Upgraded ocp/vsphere from v4.4.4 to 4.5.0-0.nightly-2020-05-19-031245 successfully; the machine-config operator works well.

# ./oc get co machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.5.0-0.nightly-2020-05-19-031245   True        False         False      12h
This is a high-priority bug: it affects upgrades of OpenShift clusters to 4.4. Hence increasing the severity and adding the UpgradeBlocker keyword. The 4.4 backport bug has similar severity: https://bugzilla.redhat.com/show_bug.cgi?id=1834194#c11
Is there a manual workaround for this issue that customers can apply to get unblocked?
Moving back to assigned based on 4.4 testing
The customer cases attached to this BZ should be on: https://bugzilla.redhat.com/show_bug.cgi?id=1834194

Sorry for the inconvenience this has caused. The templates that we added to the MCO in version 4.4 were the precursor to enabling vSphere IPI; the checks on various variables were there to distinguish UPI from IPI. In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26) to implement a CI job that upgrades a cluster on vSphere through the releases, in the hope that we catch this kind of failure before a customer does.

Workaround:
1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller
4.) Confirm the object above is regenerated: oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller
5.) Then perform the update.
Verified on 4.5.0-0.nightly-2020-06-18-114733.

Upgraded OCP from 4.4.0-0.nightly-2020-06-18-212632 to 4.5.0-0.nightly-2020-06-18-114733; the upgrade succeeded and the machine-config operator works well.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-06-18-114733   True        False         20m     Cluster version is 4.5.0-0.nightly-2020-06-18-114733

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      98m
cloud-credential                           4.5.0-0.nightly-2020-06-18-114733   True        False         False      114m
cluster-autoscaler                         4.5.0-0.nightly-2020-06-18-114733   True        False         False      105m
config-operator                            4.5.0-0.nightly-2020-06-18-114733   True        False         False      73m
console                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      33m
csi-snapshot-controller                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      103m
dns                                        4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
etcd                                       4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
image-registry                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      33m
ingress                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      103m
insights                                   4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
kube-apiserver                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
kube-controller-manager                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      108m
kube-scheduler                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      108m
kube-storage-version-migrator              4.5.0-0.nightly-2020-06-18-114733   True        False         False      33m
machine-api                                4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
machine-approver                           4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m
machine-config                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
marketplace                                4.5.0-0.nightly-2020-06-18-114733   True        False         False      32m
monitoring                                 4.5.0-0.nightly-2020-06-18-114733   True        False         False      62m
network                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      111m
node-tuning                                4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m
openshift-apiserver                        4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
openshift-controller-manager               4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
openshift-samples                          4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-06-18-114733   True        False         False      32m
service-ca                                 4.5.0-0.nightly-2020-06-18-114733   True        False         False      110m
storage                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409