Description of problem:
Failure during the upgrade (machine-config operator degraded; see below).

Version-Release number of selected component (if applicable):
During upgrade from 4.1.0-0.nightly-2020-05-04-100857 -> 4.2.0-0.nightly-2020-05-07-194422 -> 4.3.0-0.nightly-2020-05-07-171148 -> 4.4.0-0.nightly-2020-05-08-224132

How reproducible:
Sometimes; found twice in upgrade CI in one day.

Steps to Reproduce:
1. Set up a 4.1 cluster and upgrade to 4.4 step by step.
2. Monitor upgrade progress.

Actual results:
2. Failed to upgrade to 4.4.0-0.nightly-2020-05-08-224132 due to error: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h3m
cloud-credential                           4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
cluster-autoscaler                         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
console                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      105m
csi-snapshot-controller                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      103m
dns                                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
etcd                                       4.4.0-0.nightly-2020-05-08-224132   True        False         False      119m
image-registry                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      143m
ingress                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h9m
insights                                   4.4.0-0.nightly-2020-05-08-224132   True        False         False      3h33m
kube-apiserver                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      118m
kube-controller-manager                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-scheduler                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-storage-version-migrator              4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m
machine-api                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
machine-config                             4.3.0-0.nightly-2020-05-07-171148   False       True          True       88m
marketplace                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      109m
monitoring                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      136m
network                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
node-tuning                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      100m
openshift-apiserver                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
openshift-controller-manager               4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h13m
openshift-samples                          4.4.0-0.nightly-2020-05-08-224132   True        False         False      4m53s
operator-lifecycle-manager                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-ca                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
service-catalog-apiserver                  4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-catalog-controller-manager         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h2m
storage                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m

# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-05-09T22:22:18Z
  Generation:          1
  Resource Version:    137088
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 8b48ddbf-9243-11ea-b1ae-0050568b3a4a
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Cluster not available for 4.4.0-0.nightly-2020-05-08-224132
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-05-10T01:04:34Z
    Message:               Working towards 4.4.0-0.nightly-2020-05-08-224132
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-05-08-224132: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-09T23:34:49Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.3.0-0.nightly-2020-05-07-171148
Events:  <none>

Expected results:
2. Upgrade from 4.3 -> 4.4 should succeed.

Additional info:
Please get must-gather logs from comment.
First glance: the error blocking the upgrade is:

2020-05-10T02:42:20.322072985Z E0510 02:42:20.322026 1 container_runtime_config_controller.go:374] could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.322072985Z I0510 02:42:20.322048 1 container_runtime_config_controller.go:375] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.355788752Z I0510 02:42:20.355743 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

You can see this in the machine-config-controller logs. Basically a nil pointer error for PlatformStatus on vSphere. A question: you say this happens "sometimes". How reproducible is this within vSphere? CC'ing Christian, since he worked on this recently and may have a better idea what the root cause is.
Actually, I can see that the vSphere file in question was last updated in February on the release-4.4 branch: https://github.com/openshift/machine-config-operator/blob/release-4.4/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml. Adding Joseph to see if he knows what's up.
4.4 does not include vSphere IPI. This check was added so that UPI did not get the in-network services. Perhaps we need an additional check for .Infra.Status.PlatformStatus or .Infra.Status.
Should we add a {{ if .Infra.Status -}} guard to everything in https://github.com/openshift/machine-config-operator/commit/49dbfb23527502b7201241dcb865cd197d088a0e, or is that a bit overkill?
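For anyone reading along, here is a minimal, self-contained Go sketch of why the guard helps. The structs below are hypothetical stand-ins (not the real openshift/api or MCO render-config types); only the field names along the template path matter. With Go's text/template, dereferencing a field on a nil pointer is an execution error (the "nil pointer evaluating ..." message in the MCC logs), while a nil pointer inside {{if}} is simply falsy, so the body is skipped without error:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Hypothetical stand-ins mirroring the template path
// .Infra.Status.PlatformStatus.VSphere; not the real API types.
type VSphereStatus struct{}

type PlatformStatus struct {
	Type    string
	VSphere *VSphereStatus
}

type InfraStatus struct {
	PlatformStatus *PlatformStatus
}

type Infrastructure struct {
	Status *InfraStatus
}

type RenderConfig struct {
	Infra Infrastructure
}

// Render parses and executes tmpl against cfg, returning the output and
// any execution error.
func Render(tmpl string, cfg RenderConfig) (string, error) {
	t, err := template.New("t").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = t.Execute(&buf, cfg)
	return buf.String(), err
}

func main() {
	// A born-in-4.1 cluster: Status exists but PlatformStatus was never populated.
	old := RenderConfig{Infra: Infrastructure{Status: &InfraStatus{}}}

	// Unguarded access, as in the failing template: execution errors with
	// a "nil pointer evaluating" message.
	_, err := Render("{{.Infra.Status.PlatformStatus.VSphere}}", old)
	fmt.Println("unguarded errored:", err != nil)

	// Guarded access: the nil pointer is falsy inside {{if}}, so the body
	// is skipped and rendering succeeds with empty output.
	out, err := Render("{{if .Infra.Status.PlatformStatus}}kni-conf{{end}}", old)
	fmt.Printf("guarded output: %q, errored: %v\n", out, err != nil)
}
```

Note that guarding only at the deepest level (VSphere) would not help here, because the nil is higher up the chain; the guard has to sit at the shallowest field that may be unset.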
PR: https://github.com/openshift/machine-config-operator/pull/1728 Need to clone this bug for 4.4 and 4.5
Setting this up as the 4.4.z backport version of bug 1834925. My understanding is that we don't need to bother manually cloning backport bugs anymore, because `/cherrypick ...` will create them as needed on our behalf.
Please note that we have a customer facing this issue during a 4.3 to 4.4 upgrade (OCP on vSphere).
I'd thought that this should have been caught by the job mentioned in bug 1787765. Hopefully someone's reviewing that, since it's not ideal to have an upgrade to a stable-4.4 release fail. FWIW, I'm the customer Fatima mentioned. The cluster in question started with 4.1 and has been upgraded through to 4.4, with the 4.3 -> 4.4 step failing on the MCO. I'm happy to provide any info I can if there are any questions.
Hello, I am facing the exact same problem, upgrading from 4.3.18 to 4.4.5 on vSphere. Is there any fix? Cheers

# oc logs machine-config-controller-5d58b57c47-bffvq
I0525 22:39:03.615421 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0525 22:39:07.489762 1 render_controller.go:376] Error syncing machineconfigpool worker: ControllerConfig has not completed: completed(false) running(false) failing(true)
I0525 22:39:07.490300 1 render_controller.go:376] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
I0525 22:39:10.772401 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

# oc describe co machine-config
Status:
  Conditions:
    Last Transition Time:  2020-05-25T14:41:08Z
    Message:               Cluster not available for 4.4.5
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-05-25T14:31:22Z
    Message:               Working towards 4.4.5
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-25T14:41:08Z
    Message:               Unable to apply 4.4.5: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-04-23T01:00:59Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
The expectation is that the assignee answers these questions. We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
- Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup
How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
- No, it's always been like this; we just never noticed
- Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
Who is impacted?
- All customers running OCP 4.x.x on vSphere upgrading to 4.4.x.
What is the impact?
- From my perspective, I would think MCO would always fail.
How involved is remediation?
- Upgrade to a 4.4.x version with this PR change in place.
Is this a regression?
- No, it's always been like this; we just never noticed. The templates were part of the 4.4 work for vSphere IPI.
Thanks. An existing upgrade that's stuck on this can be re-targeted to the version with this fix (once it becomes available), and the upgrade should then complete? QE, can we make sure to test that last question? It's OK if that part fails, but then we'll need to write up additional doc around how to unstick stuck upgrades.
(In reply to Joseph Callen from comment #13)
> Who is impacted?
> - All customers running on vSphere and OCP 4.x.x upgrading to 4.4.x

Does it impact 4.2.x clusters upgraded to 4.3 and then 4.4, or must the cluster start at 4.1?
(In reply to Vadim Rutkovsky from comment #18)
> (In reply to Joseph Callen from comment #13)
> > Who is impacted?
> > - All customers running on vSphere and OCP 4.x.x upgrading to 4.4.x
>
> Does it impact 4.2.x clusters upgraded to 4.3 and then 4.4 - or the cluster
> must start at 4.1?

If a customer starts with 4.4, they should not have a problem. If they upgrade from 4.1, 4.2, or 4.3, they will experience the bug.
Telemetry data shows we have 4.2 clusters which upgraded to 4.3 and then to 4.4 successfully. It seems this bug affects only clusters born in 4.1.
Curious: why is a cluster that started at 4.1 and was upgraded through to 4.3 different from a cluster that started with 4.3? I'd thought that the combination of custom resources, operators, and RHCOS was meant to mitigate this sort of drift.
Yeah, ideally everything is managed by the cluster and there is no drift. However, there are still a few things, like the Infrastructure config, that aren't currently managed by an in-cluster operator. In this case, the Infrastructure object grew providerStatus in 4.2 [1] but there was, at the time, no suitable operator to migrate existing clusters. We've since grown out the config operator, and bug 1814332 landed in 4.5 to port born-in-4.1 Infrastructure configs. But the PR that closed bug 1814332 only addressed AWS, not vSphere, and was also not backported to 4.4. There should be some similar way to recover in this case with:

$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
$ oc -n openshift-machine-config-operator get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep machine-config-controller- | while read POD; do oc -n openshift-machine-config-operator delete pod "${POD}"; done

or some such, but we haven't worked out the details yet.

[1]: https://github.com/openshift/api/pull/300
Thanks. It's good to hear this is being addressed. That schema change had hit us in 4.1 -> 4.2 (bug 1773870), but it didn't seem like there was an appropriate place to handle this specific change. We'd worked around it by adding the new field after upgrading to 4.2, so our current value includes the following; I'd expect it to mirror an install originating from 4.2:

status:
  platform: VSphere
  platformStatus:
    type: VSphere

The full value from our cluster is included in the comments on support case 02659494, in case that helps.
(In reply to W. Trevor King from comment #22)
> Yeah, ideally everything is managed by the cluster and there is no drift.
> However, there are still a few things, like the Infrastructure config, that
> aren't currently managed by an in-cluster operator. In this case, the
> Infrastructure object grew providerStatus in 4.2 [1] but there was, at the
> time, no suitable operator to migrate existing clusters. We've since grown
> out the config operator, and bug 1814332 landed in 4.5 to port born-in-4.1
> Infrastructure configs. But the PR that closed bug 1814332 only addressed
> AWS, not vSphere, and was also not backported to 4.4. There should be some
> similar way to recover in this case with:
>
> $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path":
> "/status/platformStatus", "value": {"type": "VSphere"}}]'
> $ oc -n openshift-machine-config-operator get -o jsonpath='{range
> .items[*]}{.metadata.name}{"\n"}{end}' pods | grep
> machine-config-controller- | while read POD; do oc -n
> openshift-machine-config-operator delete pod "${POD}"; done
>
> or some such, but we haven't worked out the details yet.
>
> [1]: https://github.com/openshift/api/pull/300

Should we apply those two commands to our failing cluster upgrade?
> Should we apply that two commands to our failing cluster upgrade? They should be safe enough, but we haven't had time to work out whether they are sufficient to unstick things. If your cluster-version operator is just blocked on the machine-config operator (and not some earlier manifest), then yeah, go ahead and try and report back. If that feels too risky, wait a bit, and we'll get a more formal recovery procedure out (possibly re-targeting your update to a new 4.4.z with the fix this bug is backporting).
That patch reported no change on my cluster:

$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
infrastructure.config.openshift.io/cluster patched (no change)

And after terminating the machine-config-controller pod, its replacement is still logging a lot of entries like the following:

I0528 08:50:46.470311 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0528 08:50:46.551618 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0528 08:50:46.850498 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
Ah, we need to stick something in /status/platformStatus/vsphere too. How about:

$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vsphere": {}}}]'

or some such?
$ oc get infrastructure cluster -o json | jq .status.platformStatus
{
  "type": "VSphere",
  "vsphere": {}
}

I0528 09:22:31.283915 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0528 09:22:31.504563 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

I'm not sure how case-sensitive this is. I note that the log is generally LikeThis, and the JSON shows likeThis. Since the log says VSphere, I tried lowercasing the first letter (producing vSphere instead of vsphere):

$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vSphere": {}}}]'

That didn't work either; the new key disappeared entirely after that:

$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vSphere": {}}}]'
infrastructure.config.openshift.io/cluster patched

$ oc get infrastructure cluster -o json | jq .status.platformStatus
{
  "type": "VSphere"
}
The PlatformStatus.VSphere from the logs is probably Go's casing, while platformstatus/vsphere is JSON's casing, per [1]. I don't understand the machine-config controller implementation well enough to know why it was still complaining about a nil pointer in <.Infra.Status.PlatformStatus.VSphere> when there were no nils in that chain. Maybe the old Infrastructure object without the patching is getting cached somewhere that killing the MCC pod does not reset. [1]: https://github.com/openshift/api/blob/0f159fee64dbf711d40dac3fa2ec8b563a2aaca8/config/v1/types_infrastructure.go#L170
(In reply to W. Trevor King from comment #27)
> Ah, we need to stick something in /status/platformStatus/vsphere too. How
> about:
>
> $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path":
> "/status/platformStatus", "value": {"type": "VSphere", "vsphere": {}}}]'
>
> or some such?

I am getting this error afterwards:

I0529 15:06:57.578864 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0529 15:06:57.582976 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0529 15:06:57.702248 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status

It seems like it's accepting the PlatformStatus, but a new nil pointer appeared?!
(In reply to Andreas Söhnlein from comment #34)
> I am getting this error afterwards:
> [machine-config-controller log lines snipped; the same "<.Infra.Status>: nil pointer" errors quoted above]
> Seems like hes accepting the PlatforumStatus but a new nil pointer appeared?!

Well, I didn't check it again before applying the patch.
I also get that error message without patching the PlatformStatus. BTW, the cluster is trying to upgrade to 4.4.6 at the moment. Maybe there was a change from 4.4.5 to 4.4.6, so now this error message appears?
Tried the following upgrade path today: 4.1.41 -> 4.2.34 -> 4.3.0-0.nightly-2020-06-01-043839 -> 4.4.0-0.nightly-2020-06-01-021027

1. Check the starting version:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.41    True        False         20m     Cluster version is 4.1.41

# oc get co
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.41    True        False         False      20m
cloud-credential                     4.1.41    True        False         False      41m
cluster-autoscaler                   4.1.41    True        False         False      41m
console                              4.1.41    True        False         False      29m
dns                                  4.1.41    True        False         False      40m
image-registry                       4.1.41    True        False         False      32m
ingress                              4.1.41    True        False         False      33m
kube-apiserver                       4.1.41    True        False         False      35m
kube-controller-manager              4.1.41    True        False         False      34m
kube-scheduler                       4.1.41    True        False         False      34m
machine-api                          4.1.41    True        False         False      40m
machine-config                       4.1.41    True        False         False      34m
marketplace                          4.1.41    True        False         False      32m
monitoring                           4.1.41    True        False         False      31m
network                              4.1.41    True        False         False      41m
node-tuning                          4.1.41    True        False         False      34m
openshift-apiserver                  4.1.41    True        False         False      34m
openshift-controller-manager         4.1.41    True        False         False      35m
openshift-samples                    4.1.41    True        False         False      29m
operator-lifecycle-manager           4.1.41    True        False         False      36m
operator-lifecycle-manager-catalog   4.1.41    True        False         False      36m
service-ca                           4.1.41    True        False         False      40m
service-catalog-apiserver            4.1.41    True        False         False      34m
service-catalog-controller-manager   4.1.41    True        False         False      34m
storage                              4.1.41    True        False         False      33m

2. Update to 4.2.34:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.34    True        False         87s     Cluster version is 4.2.34

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.34    True        False         False      62m
cloud-credential                           4.2.34    True        False         False      82m
cluster-autoscaler                         4.2.34    True        False         False      82m
console                                    4.2.34    True        False         False      8m
dns                                        4.2.34    True        False         False      82m
image-registry                             4.2.34    True        False         False      6m
ingress                                    4.2.34    True        False         False      74m
insights                                   4.2.34    True        False         False      31m
kube-apiserver                             4.2.34    True        False         False      77m
kube-controller-manager                    4.2.34    True        False         False      76m
kube-scheduler                             4.2.34    True        False         False      76m
machine-api                                4.2.34    True        False         False      82m
machine-config                             4.2.34    True        False         False      76m
marketplace                                4.2.34    True        False         False      3m39s
monitoring                                 4.2.34    True        False         False      5m34s
network                                    4.2.34    True        False         False      82m
node-tuning                                4.2.34    True        False         False      2m56s
openshift-apiserver                        4.2.34    True        False         False      2m58s
openshift-controller-manager               4.2.34    True        False         False      77m
openshift-samples                          4.2.34    True        False         False      31m
operator-lifecycle-manager                 4.2.34    True        False         False      78m
operator-lifecycle-manager-catalog         4.2.34    True        False         False      78m
operator-lifecycle-manager-packageserver   4.2.34    True        False         False      3m1s
service-ca                                 4.2.34    True        False         False      82m
service-catalog-apiserver                  4.2.34    True        False         False      75m
service-catalog-controller-manager         4.2.34    True        False         False      75m
storage                                    4.2.34    True        False         False      30m

3. Update to 4.3:

# oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-06-01-043839 --force --allow-explicit-upgrade
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-06-01-043839

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-06-01-043839   True        False         2m46s   Cluster version is 4.3.0-0.nightly-2020-06-01-043839

# oc get node
NAME              STATUS   ROLES    AGE    VERSION
compute-0         Ready    worker   137m   v1.16.2
compute-1         Ready    worker   137m   v1.16.2
compute-2         Ready    worker   137m   v1.16.2
control-plane-0   Ready    master   137m   v1.16.2
control-plane-1   Ready    master   137m   v1.16.2
control-plane-2   Ready    master   137m   v1.16.2

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      123m
cloud-credential                           4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
cluster-autoscaler                         4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
console                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      18m
dns                                        4.3.0-0.nightly-2020-06-01-043839   True        False         False      142m
image-registry                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      24m
ingress                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      135m
insights                                   4.3.0-0.nightly-2020-06-01-043839   True        False         False      92m
kube-apiserver                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
kube-controller-manager                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
kube-scheduler                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
machine-api                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
machine-config                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
marketplace                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      23m
monitoring                                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      20m
network                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
node-tuning                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      13m
openshift-apiserver                        4.3.0-0.nightly-2020-06-01-043839   True        False         False      14m
openshift-controller-manager               4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
openshift-samples                          4.3.0-0.nightly-2020-06-01-043839   True        False         False      8m43s
operator-lifecycle-manager                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      138m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-06-01-043839   True        False         False      139m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-06-01-043839   True        False         False      19m
service-ca                                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      142m
service-catalog-apiserver                  4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
service-catalog-controller-manager         4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
storage                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      51m

4. Update to the latest 4.4 nightly:

# oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027 --force --allow-explicit-upgrade
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-06-01-043839   True        True          45m     Unable to apply 4.4.0-0.nightly-2020-06-01-021027: the cluster operator machine-config has not yet successfully rolled out

# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-06-01T07:35:53Z
  Generation:          1
  Resource Version:    107409
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 85d2801c-a3da-11ea-9656-0050568b03c3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-06-01T10:33:58Z
    Message:               Cluster not available for 4.4.0-0.nightly-2020-06-01-021027
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-06-01T10:26:25Z
    Message:               Working towards 4.4.0-0.nightly-2020-06-01-021027
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-06-01T10:33:58Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-06-01-021027: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-06-01T08:55:50Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.3.0-0.nightly-2020-06-01-043839
Events:  <none>

# oc logs -n openshift-machine-config-operator -f machine-config-controller-7488656c67-7phnq
I0601 10:46:37.401973 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:37.688608 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:38.027834 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:41.898894 1 render_controller.go:376] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)

5. Make sure the latest 4.4 nightly has the required fix:

# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027 --pullspecs | grep machine-config-operator
  machine-config-operator  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db68e2fe62120fb429c0127e4ea562316150612115d8fb2255f4bcaeaaca690f

# oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db68e2fe62120fb429c0127e4ea562316150612115d8fb2255f4bcaeaaca690f | grep commit
  io.openshift.build.commit.id=e3f4e2596eaf47a0081a4df04607eec9acd88e05
  io.openshift.build.commit.url=https://github.com/openshift/machine-config-operator/commit/e3f4e2596eaf47a0081a4df04607eec9acd88e05

# git log e3f4e2596eaf47a0081a4df04607eec9acd88e05 | grep '#1735'
    Merge pull request #1735 from jcpowermac/CP1728

Assigning back; you can also take a look at the cluster using the attached kubeconfig.
Sorry for the inconvenience this has caused. The templates we added to the MCO in 4.4 were the precursor to enabling vSphere IPI; the checks on various variables were meant to distinguish UPI from IPI. In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26) to implement a CI job that upgrades a cluster on vSphere through the releases, in the hope that we would catch this kind of failure before a customer does.

Workaround:
1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller
4.) Confirm the above ^ is regenerated: `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller`
5.) Then perform an update
(In reply to Joseph Callen from comment #38)
> Sorry for the inconvenience this has caused. The templates that we added to
> MCO in version 4.4 were the precursor to enabling vSphere IPI. The check on
> various variables was to ensure the difference between UPI and IPI.
>
> In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26)
> to implement a CI job that will upgrade a cluster on vSphere through the
> releases in the hope that we would catch this failure before a customer does.
>
> Workaround:
> 1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
> 2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
> 3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller
> 4.) confirm the above ^ is regenerated `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller`
> 5.) Then perform an update

Hello Joseph, thanks for your response. I just managed to successfully upgrade our cluster to 4.4.6 without failure.

The only issue that remains on my side is that the cluster complains about "unhealthy etcd members generated from openshift-cluster-etcd-operator-etcd-client" after etcd-operator hit this error: "Operation cannot be fulfilled on etcds.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again".

# oc logs etcd-operator-dd8898d94-gcdmx
I0603 14:50:41.896787       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://172.16.11.234:2379 0 <nil>} {https://172.16.11.230:2379 0 <nil>} {https://172.16.11.233:2379 0 <nil>} {https://172.16.11.232:2379 0 <nil>} {https://172.16.11.231:2379 0 <nil>}]
I0603 14:50:41.922569       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922768       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922778       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922785       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922791       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.907107       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0603 14:50:44.908266       1 client.go:361] parsed scheme: "endpoint"
I0603 14:50:44.908327       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://172.16.11.231:2379 0 <nil>} {https://172.16.11.232:2379 0 <nil>} {https://172.16.11.230:2379 0 <nil>} {https://172.16.11.233:2379 0 <nil>} {https://172.16.11.234:2379 0 <nil>}]
I0603 14:50:44.927584       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.933897       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935793       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935804       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935834       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:47.917208       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap

I don't know whether it has anything to do with this bug, but it did not appear before the upgrade. Cheers
Hi Andreas, if you can, please open a new BZ for that issue or contact customer support; I think that would be the best way forward for an issue not related to this BZ's subject. Thanks!
Thanks! FYI, this completed on my cluster, which was already hung on 4.3 -> 4.4.

1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'

This was a no-op, since this field had already been set for a 4.1 -> 4.2 issue.

2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller

Done and done.

4.) confirm the above ^ is regenerated `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller`

This took about a minute to regenerate.

5.) Then perform an update

My 4.3 -> 4.4 upgrade was already in progress. Does this mean I can run these steps before starting the upgrade to 4.4 to avoid the stoppage?
It seems that `oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller` was the wrong thing to do on a 4.3 cluster. I ran it on my production cluster in preparation for the move to 4.4. Several nodes are now in NotReady state, and I see new machine CSRs. Upon approving one, a new entry appeared in `oc get nodes` listed as "localhost.localdomain". After approving that machine's CSR, the new entry progressed to Ready, but the original node entry still shows NotReady. With 4.3 out of full support, I'm trying to stay on a current configuration. It would help a lot if Red Hat could test upgrades before dropping full support for the (n-1) point release.
The workaround in comment 38 is only relevant to a cluster that is already stuck on an upgrade; it should not be used proactively. vSphere UPI clusters installed at 4.1 and currently running 4.3 should delay upgrading to 4.4 until this bug has been resolved. In general, please engage with support rather than working directly through Bugzilla, as they will have the most context regarding your specific environment and requirements.
I have been engaged with support. Here's an excerpt from that conversation:

> >> Does this mean I can run these steps before starting the upgrade to 4.4 to avoid the stoppage?
>
> As of now, the engineering team was able to test this workaround before initiating an upgrade only.
> So yes, if you have any different cluster where an upgrade is due, then you can apply this workaround prior to the upgrade.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2445
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.