Bug 1834194
| Field | Value |
|---|---|
| Summary | Upgrade from 4.1 -> 4.2 -> 4.3 -> 4.4 failed at waitForControllerConfigToBeCompleted |
| Product | OpenShift Container Platform |
| Component | Machine Config Operator |
| Sub component | platform-vsphere |
| Reporter | Yadan Pei <yapei> |
| Assignee | Joseph Callen <jcallen> |
| QA Contact | Yadan Pei <yapei> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.4 |
| Target Release | 4.4.z |
| Keywords | Upgrades |
| Hardware | Unspecified |
| OS | Unspecified |
| CC | adeshpan, aelganzo, andreas.soehnlein, aos-bugs, ChetRHosey, fshaikh, jcallen, jerzhang, jima, kgarriso, knaeem, lmohanty, mabajodu, miabbott, mkrejci, mnguyen, rkshirsa, rsandu, sdodson, vlaad, vrutkovs, wking, yapei |
| Clones | 1834925 (view as bug list) |
| Bug Depends On | 1834925 |
| Last Closed | 2020-06-17 22:26:36 UTC |
| Type | Bug |
Description (Yadan Pei, 2020-05-11 09:48:24 UTC)
First glance: the error blocking the upgrade is:

```
2020-05-10T02:42:20.322072985Z E0510 02:42:20.322026 1 container_runtime_config_controller.go:374] could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.322072985Z I0510 02:42:20.322048 1 container_runtime_config_controller.go:375] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.355788752Z I0510 02:42:20.355743 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
```

You can see this in the machine-config-controller logs. Basically, a nil pointer error for PlatformStatus on vSphere.

A question: you say this happens "sometimes". How reproducible is this within vSphere? CC'ing Christian, since he worked on this recently and may have a better idea what the root cause is.
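As an editorial aside: the error above is a Go-template nil-pointer dereference raised while rendering /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml, because the template reads through a field chain that is nil on older clusters. The following is a minimal Python sketch of the failure mode and of a short-circuiting guard; the names here are illustrative stand-ins, not the MCO's actual types or code (the real templates are Go text/template):

```python
# Hypothetical sketch of the nil-chain failure, NOT the MCO implementation.
# Go text/template raises "nil pointer evaluating ..." when a field is read
# through a nil pointer, much as Python raises AttributeError when an
# attribute is read from None.

class Infrastructure:
    def __init__(self, status=None):
        # A born-in-4.1 cluster may effectively have no populated status.
        self.status = status

def render_unguarded(infra):
    # Mirrors the template chain: {{ .Infra.Status.PlatformStatus.VSphere }}
    return infra.status.platformStatus.vsphere  # raises if status is None

def render_guarded(infra):
    # Mirrors the proposed {{ if .Infra.Status -}} ... {{- end }} guard:
    # emit nothing instead of crashing when the chain is incomplete.
    if infra.status is None or getattr(infra.status, "platformStatus", None) is None:
        return ""
    return str(infra.status.platformStatus)

old_cluster = Infrastructure(status=None)

try:
    render_unguarded(old_cluster)
    failed = False
except AttributeError:
    failed = True  # analogous to the "failed to execute template" errors above

print("unguarded render fails:", failed)   # True
print("guarded render:", repr(render_guarded(old_cluster)))  # ''
```

The actual fix discussed below (machine-config-operator PR #1728) adds the equivalent guard inside the vSphere template files themselves.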
Actually, I can see that the vSphere file in question was last updated in February for the release-4.4 branch: https://github.com/openshift/machine-config-operator/blob/release-4.4/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml. Adding Joseph to see if he knows what's up.

4.4 does not include vSphere IPI. This check was added so that UPI did not get the in-network services. Perhaps an additional check for .Infra.Status.PlatformStatus or .Infra.Status is needed.

Should we add a `{{ if .Infra.Status -}}` to everything in https://github.com/openshift/machine-config-operator/commit/49dbfb23527502b7201241dcb865cd197d088a0e, or is that a bit overkill?

PR: https://github.com/openshift/machine-config-operator/pull/1728

Need to clone this bug for 4.4 and 4.5.

Setting this up as the 4.4.z backport version of bug 1834925. My understanding is that we don't need to bother manually cloning backport bugs anymore, because `/cherrypick ...` will create them as needed on our behalf.

Please note that we have a customer facing this issue during a 4.3 to 4.4 (OCP + vSphere) upgrade.

I'd thought that this should have been caught by the job mentioned in bug 1787765. Hopefully someone's reviewing that, since it's not ideal to have an upgrade to a stable-4.4 release fail.

FWIW, I'm the customer Fatima mentioned. The cluster in question started with 4.1 and has been upgraded through to 4.4, with the 4.3 -> 4.4 step failing on the MCO. I'm happy to provide any info I can if there are any questions.

Hello, I am facing the exact same problem, upgrading from 4.3.18 to 4.4.5 on vSphere. Is there any fix?
Cheers

```
# oc logs machine-config-controller-5d58b57c47-bffvq
I0525 22:39:03.615421 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0525 22:39:07.489762 1 render_controller.go:376] Error syncing machineconfigpool worker: ControllerConfig has not completed: completed(false) running(false) failing(true)
I0525 22:39:07.490300 1 render_controller.go:376] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
I0525 22:39:10.772401 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
```

```
# oc describe co machine-config
Status:
  Conditions:
    Last Transition Time:  2020-05-25T14:41:08Z
    Message:               Cluster not available for 4.4.5
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-05-25T14:31:22Z
    Message:               Working towards 4.4.5
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-25T14:41:08Z
    Message:               Unable to apply 4.4.5: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-04-23T01:00:59Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
```

The expectation is that the assignee answers these questions. We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
- Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Who is impacted?
- All customers running on vSphere and OCP 4.x.x upgrading to 4.4.x

What is the impact?
- From my perspective, I would think the MCO would always fail.

How involved is remediation?
- Upgrade to a 4.4.x version with this PR change in place.

Is this a regression?
- No, it's always been like this, we just never noticed. The templates were part of the 4.4 work for vSphere IPI.

Thanks. An existing upgrade that's stuck on this can be re-targeted to the version with this fix (once it becomes available), and the upgrade should complete? QE, can we make sure to test that last question?
It's OK if that part fails, but we'll need to write up additional documentation around how to unstick stuck upgrades.

(In reply to Joseph Callen from comment #13)
> Who is impacted?
> - All customers running on vSphere and OCP 4.x.x upgrading to 4.4.x

Does it impact 4.2.x clusters upgraded to 4.3 and then 4.4, or must the cluster start at 4.1?

(In reply to Vadim Rutkovsky from comment #18)
> Does it impact 4.2.x clusters upgraded to 4.3 and then 4.4, or must the cluster start at 4.1?

If a customer starts with 4.4 they should not have a problem. If they upgrade from 4.1, 4.2, or 4.3 they will experience the bug.

Telemetry data shows we have 4.2 clusters which upgraded to 4.3 and then to 4.4 successfully. It seems this bug affects clusters born in 4.1 only.

Curious: why is a cluster that started at 4.1 and was upgraded through to 4.3 different from a cluster that started with 4.3? I'd thought that the combination of custom resources, operators, and RHCOS was meant to mitigate this sort of drift.

Yeah, ideally everything is managed by the cluster and there is no drift. However, there are still a few things, like the Infrastructure config, that aren't currently managed by an in-cluster operator. In this case, the Infrastructure object grew platformStatus in 4.2 [1], but there was, at the time, no suitable operator to migrate existing clusters. We've since grown out the config operator, and bug 1814332 landed in 4.5 to port born-in-4.1 Infrastructure configs. But the PR that closed bug 1814332 only addressed AWS, not vSphere, and was also not backported to 4.4.
There should be some similar way to recover in this case with:

```
$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
$ oc -n openshift-machine-config-operator get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep machine-config-controller- | while read POD; do oc -n openshift-machine-config-operator delete pod "${POD}"; done
```

or some such, but we haven't worked out the details yet.

[1]: https://github.com/openshift/api/pull/300

Thanks, it's good to hear this is being addressed. That schema change had hit us in 4.1 -> 4.2 (bug 1773870), but it didn't seem like there was an appropriate place to handle this specific change. We'd worked around it by adding the new field after upgrading to 4.2, so our current value includes the following; I'd expect it to mirror an install originating from 4.2:

```
status:
  platform: VSphere
  platformStatus:
    type: VSphere
```

The full value from our cluster is included in the comments on support case 02659494, in case that helps.

(In reply to W. Trevor King from comment #22)
> [...] There should be some similar way to recover in this case with:
>
> $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
> $ oc -n openshift-machine-config-operator get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep machine-config-controller- | while read POD; do oc -n openshift-machine-config-operator delete pod "${POD}"; done
>
> or some such, but we haven't worked out the details yet.

Should we apply those two commands to our failing cluster upgrade?

> Should we apply those two commands to our failing cluster upgrade?
They should be safe enough, but we haven't had time to work out whether they are sufficient to unstick things. If your cluster-version operator is just blocked on the machine-config operator (and not some earlier manifest), then yeah, go ahead and try and report back. If that feels too risky, wait a bit, and we'll get a more formal recovery procedure out (possibly re-targeting your update to a new 4.4.z with the fix this bug is backporting).
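For context on the `oc patch --type json` command above: it sends an RFC 6902 JSON Patch, and an `add` operation on an object member sets that member, replacing it if it already exists, which is why re-running it against an already-correct object effectively changes nothing. The following is a minimal sketch of the `add` op on a nested dict; it is an illustration, not the Kubernetes API server's implementation, and it omits real JSON Pointer details (escaping, array indices):

```python
# Minimal, illustrative sketch of an RFC 6902 "add" operation on a dict.
# NOT the API server's implementation; JSON Pointer escaping and array
# index handling are omitted for brevity.

def json_patch_add(doc, path, value):
    """Apply one {"op": "add"} operation in place and return the document."""
    parts = [p for p in path.split("/") if p]  # "/status/platformStatus" -> ["status", "platformStatus"]
    target = doc
    for key in parts[:-1]:
        target = target.setdefault(key, {})
    target[parts[-1]] = value  # "add" on an existing member replaces it
    return doc

# Infrastructure status from a born-in-4.1 cluster: no platformStatus yet.
infra = {"status": {"platform": "VSphere"}}

json_patch_add(infra, "/status/platformStatus", {"type": "VSphere"})
print(infra["status"]["platformStatus"])  # {'type': 'VSphere'}

# Re-applying the same patch has no effect, which matches the
# "patched (no change)" message reported later in the thread when the
# field was already present with the same value.
before = dict(infra["status"]["platformStatus"])
json_patch_add(infra, "/status/platformStatus", {"type": "VSphere"})
assert infra["status"]["platformStatus"] == before
```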
That patch reported no change on my cluster:

```
$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
infrastructure.config.openshift.io/cluster patched (no change)
```

And after terminating the machine-config-controller pod, its replacement is still logging a lot of entries like the following:

```
I0528 08:50:46.470311 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0528 08:50:46.551618 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0528 08:50:46.850498 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
```

Ah, we need to stick something in /status/platformStatus/vsphere too. How about:

```
$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vsphere": {}}}]'
```

or some such?

```
$ oc get infrastructure cluster -o json | jq .status.platformStatus
{
  "type": "VSphere",
  "vsphere": {}
}
```

But the errors continue:

```
I0528 09:22:31.283915 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0528 09:22:31.504563 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
```

I'm not sure how case-sensitive this is. I note that the log is generally LikeThis, and the JSON shows likeThis.
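One reason casing matters here: the stored object only keeps fields that the resource's schema recognizes, and the schema key is the JSON-serialized (lowercase) name from the Go struct's json tag, not the Go field name. The following is a hypothetical Python sketch of that kind of unknown-key pruning; the allowed-key set is assumed for illustration, and this is not the API server's actual code:

```python
# Hypothetical illustration of schema-based pruning of unknown object keys,
# the behavior that would make a wrongly-cased key vanish from the stored
# object. NOT the Kubernetes API server's implementation.

# Assumed, reduced key set for status.platformStatus: only these persist.
PLATFORM_STATUS_KEYS = {"type", "vsphere"}

def prune(obj, allowed):
    """Keep only schema-known keys; everything else is silently dropped."""
    return {k: v for k, v in obj.items() if k in allowed}

good = prune({"type": "VSphere", "vsphere": {}}, PLATFORM_STATUS_KEYS)
bad = prune({"type": "VSphere", "vSphere": {}}, PLATFORM_STATUS_KEYS)

print(good)  # {'type': 'VSphere', 'vsphere': {}}
print(bad)   # {'type': 'VSphere'} -- the miscased key is dropped
```

If pruning like this is in play, a key spelled `vSphere` would simply never persist, which is consistent with the `jq` output reported later in the thread.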
Since the log says VSphere, I tried lowercasing the first letter (producing vSphere instead of vsphere):

```
$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vSphere": {}}}]'
```

That didn't work either; the new key disappeared entirely after that:

```
$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vSphere": {}}}]'
infrastructure.config.openshift.io/cluster patched
$ oc get infrastructure cluster -o json | jq .status.platformStatus
{
  "type": "VSphere"
}
```

The PlatformStatus.VSphere from the logs is probably Go's casing, while platformStatus/vsphere is JSON's casing, per [1]. I don't understand the machine-config controller implementation well enough to know why it was still complaining about a nil pointer in <.Infra.Status.PlatformStatus.VSphere> when there were no nils in that chain. Maybe the old Infrastructure object, without the patching, is getting cached somewhere that killing the MCC pod does not reset.

[1]: https://github.com/openshift/api/blob/0f159fee64dbf711d40dac3fa2ec8b563a2aaca8/config/v1/types_infrastructure.go#L170

(In reply to W. Trevor King from comment #27)
> Ah, we need to stick something in /status/platformStatus/vsphere too. How about:
>
> $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vsphere": {}}}]'
>
> or some such?

I am getting this error afterwards:

```
I0529 15:06:57.578864 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0529 15:06:57.582976 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0529 15:06:57.702248 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
```

It seems like it's accepting the PlatformStatus, but a new nil pointer appeared?!

(In reply to Andreas Söhnlein from comment #34)
> I am getting this error afterwards:
> [...]
> Seems like hes accepting the PlatforumStatus but a new nil pointer appeared?!

Well, I didn't check it again before applying the patch; I also get that error message without patching the PlatformStatus. BTW, the cluster is upgrading to 4.4.6 at the moment. Maybe there was a change from 4.4.5 to 4.4.6, so now this error message appears?

Tried the following upgrade path today: 4.1.41 -> 4.2.34 -> 4.3.0-0.nightly-2020-06-01-043839 -> 4.4.0-0.nightly-2020-06-01-021027

1.
Check starting version:

```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.41    True        False         20m     Cluster version is 4.1.41

# oc get co
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.41    True        False         False      20m
cloud-credential                     4.1.41    True        False         False      41m
cluster-autoscaler                   4.1.41    True        False         False      41m
console                              4.1.41    True        False         False      29m
dns                                  4.1.41    True        False         False      40m
image-registry                       4.1.41    True        False         False      32m
ingress                              4.1.41    True        False         False      33m
kube-apiserver                       4.1.41    True        False         False      35m
kube-controller-manager              4.1.41    True        False         False      34m
kube-scheduler                       4.1.41    True        False         False      34m
machine-api                          4.1.41    True        False         False      40m
machine-config                       4.1.41    True        False         False      34m
marketplace                          4.1.41    True        False         False      32m
monitoring                           4.1.41    True        False         False      31m
network                              4.1.41    True        False         False      41m
node-tuning                          4.1.41    True        False         False      34m
openshift-apiserver                  4.1.41    True        False         False      34m
openshift-controller-manager         4.1.41    True        False         False      35m
openshift-samples                    4.1.41    True        False         False      29m
operator-lifecycle-manager           4.1.41    True        False         False      36m
operator-lifecycle-manager-catalog   4.1.41    True        False         False      36m
service-ca                           4.1.41    True        False         False      40m
service-catalog-apiserver            4.1.41    True        False         False      34m
service-catalog-controller-manager   4.1.41    True        False         False      34m
storage                              4.1.41    True        False         False      33m
```

2.
Update to 4.2.34:

```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.34    True        False         87s     Cluster version is 4.2.34

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.34    True        False         False      62m
cloud-credential                           4.2.34    True        False         False      82m
cluster-autoscaler                         4.2.34    True        False         False      82m
console                                    4.2.34    True        False         False      8m
dns                                        4.2.34    True        False         False      82m
image-registry                             4.2.34    True        False         False      6m
ingress                                    4.2.34    True        False         False      74m
insights                                   4.2.34    True        False         False      31m
kube-apiserver                             4.2.34    True        False         False      77m
kube-controller-manager                    4.2.34    True        False         False      76m
kube-scheduler                             4.2.34    True        False         False      76m
machine-api                                4.2.34    True        False         False      82m
machine-config                             4.2.34    True        False         False      76m
marketplace                                4.2.34    True        False         False      3m39s
monitoring                                 4.2.34    True        False         False      5m34s
network                                    4.2.34    True        False         False      82m
node-tuning                                4.2.34    True        False         False      2m56s
openshift-apiserver                        4.2.34    True        False         False      2m58s
openshift-controller-manager               4.2.34    True        False         False      77m
openshift-samples                          4.2.34    True        False         False      31m
operator-lifecycle-manager                 4.2.34    True        False         False      78m
operator-lifecycle-manager-catalog         4.2.34    True        False         False      78m
operator-lifecycle-manager-packageserver   4.2.34    True        False         False      3m1s
service-ca                                 4.2.34    True        False         False      82m
service-catalog-apiserver                  4.2.34    True        False         False      75m
service-catalog-controller-manager         4.2.34    True        False         False      75m
storage                                    4.2.34    True        False         False      30m
```

3.
Updating to 4.3:

```
# oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-06-01-043839 --force --allow-explicit-upgrade
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-06-01-043839

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-06-01-043839   True        False         2m46s   Cluster version is 4.3.0-0.nightly-2020-06-01-043839

# oc get node
NAME              STATUS   ROLES    AGE    VERSION
compute-0         Ready    worker   137m   v1.16.2
compute-1         Ready    worker   137m   v1.16.2
compute-2         Ready    worker   137m   v1.16.2
control-plane-0   Ready    master   137m   v1.16.2
control-plane-1   Ready    master   137m   v1.16.2
control-plane-2   Ready    master   137m   v1.16.2

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      123m
cloud-credential                           4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
cluster-autoscaler                         4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
console                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      18m
dns                                        4.3.0-0.nightly-2020-06-01-043839   True        False         False      142m
image-registry                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      24m
ingress                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      135m
insights                                   4.3.0-0.nightly-2020-06-01-043839   True        False         False      92m
kube-apiserver                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
kube-controller-manager                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
kube-scheduler                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
machine-api                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
machine-config                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
marketplace                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      23m
monitoring                                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      20m
network                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
node-tuning                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      13m
openshift-apiserver                        4.3.0-0.nightly-2020-06-01-043839   True        False         False      14m
openshift-controller-manager               4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
openshift-samples                          4.3.0-0.nightly-2020-06-01-043839   True        False         False      8m43s
operator-lifecycle-manager                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      138m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-06-01-043839   True        False         False      139m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-06-01-043839   True        False         False      19m
service-ca                                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      142m
service-catalog-apiserver                  4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
service-catalog-controller-manager         4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
storage                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      51m
```

4. Updating to latest 4.4 nightly:

```
# oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027 --force --allow-explicit-upgrade
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-06-01-043839   True        True          45m     Unable to apply 4.4.0-0.nightly-2020-06-01-021027: the cluster operator machine-config has not yet successfully rolled out

# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-06-01T07:35:53Z
  Generation:          1
  Resource Version:    107409
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 85d2801c-a3da-11ea-9656-0050568b03c3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-06-01T10:33:58Z
    Message:               Cluster not available for 4.4.0-0.nightly-2020-06-01-021027
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-06-01T10:26:25Z
    Message:               Working towards 4.4.0-0.nightly-2020-06-01-021027
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-06-01T10:33:58Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-06-01-021027: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-06-01T08:55:50Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.3.0-0.nightly-2020-06-01-043839
Events:  <none>

# oc logs -n openshift-machine-config-operator -f machine-config-controller-7488656c67-7phnq
I0601 10:46:37.401973 1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:37.688608 1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:38.027834 1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:41.898894 1 render_controller.go:376] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
```

5. Make sure the latest 4.4 nightly has the required fix:

```
# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027 --pullspecs | grep machine-config-operator
machine-config-operator quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db68e2fe62120fb429c0127e4ea562316150612115d8fb2255f4bcaeaaca690f
# oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db68e2fe62120fb429c0127e4ea562316150612115d8fb2255f4bcaeaaca690f | grep commit
io.openshift.build.commit.id=e3f4e2596eaf47a0081a4df04607eec9acd88e05
io.openshift.build.commit.url=https://github.com/openshift/machine-config-operator/commit/e3f4e2596eaf47a0081a4df04607eec9acd88e05
# git log e3f4e2596eaf47a0081a4df04607eec9acd88e05 | grep '#1735'
Merge pull request #1735 from jcpowermac/CP1728
```

Assigning back; you can also take a look at the cluster using the attached kubeconfig.

Sorry for the inconvenience this has caused. The templates that we added to the MCO in version 4.4 were the precursor to enabling vSphere IPI. The checks on various variables were there to distinguish UPI from IPI.

In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26) to implement a CI job that will upgrade a cluster on vSphere through the releases, in the hope that we would catch this failure before a customer does.

Workaround:

1.)
oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]' 2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml 3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller 4.) confirm the above ^ is regenerated `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller` 5.) Then perform an update (In reply to Joseph Callen from comment #38) > Sorry for the inconvenience this has caused. The templates that we added to > MCO in version 4.4 were the precursor to enabling vSphere IPI. The check on > various variables was to ensure the difference between UPI and IPI. > > In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26) > to implement a CI job that will upgrade a cluster on vSphere through the > releases in the hope that we would catch this failure before a customer does. > > Workaround: > 1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": > "/status/platformStatus", "value": {"type": "VSphere"}}]' > 2.) oc get controllerconfigs.machineconfiguration.openshift.io > machine-config-controller -o yaml > mcc.yaml > 3.) oc delete controllerconfigs.machineconfiguration.openshift.io > machine-config-controller > 4.) confirm the above ^ is regenerated `oc get > controllerconfigs.machineconfiguration.openshift.io > machine-config-controller` > 5.) Then perform an update Hello Joseph, thanks for your response. I just managed to successfully upgrade our cluster to 4.4.6 without failure. 
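For convenience, the five workaround steps quoted above can be collected into a small shell sketch. This is my consolidation, not an official Red Hat script: it assumes an `oc` session logged in with cluster-admin on the affected cluster, and setting `DRY_RUN=1` prints each command instead of executing it, so the steps can be reviewed before touching a production cluster.

```shell
#!/bin/sh
# Hedged sketch of the comment-38 workaround -- not an official script.
# Assumes `oc` is logged in with cluster-admin on the affected cluster.
# Run with DRY_RUN=1 to print the commands instead of executing them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

mco_workaround() {
  # 1. Populate status.platformStatus on the infrastructure object.
  run oc patch infrastructure cluster --type json \
    -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
  # 2. Back up the current controllerconfig first (mcc.yaml, as in the steps).
  run sh -c 'oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml'
  # 3. Delete it so the machine-config-operator regenerates it.
  run oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller
  # 4. Confirm it has been regenerated (can take about a minute) before updating.
  run oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller
}
```

As noted later in this bug, the workaround is only appropriate for a cluster already stuck on the upgrade; do not run it proactively.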
The only issue that remains on my side is that the cluster complains about "unhealthy etcd members generated from openshift-cluster-etcd-operator-etcd-client" after the etcd-operator hit this issue: "Operation cannot be fulfilled on etcds.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again"

# oc logs etcd-operator-dd8898d94-gcdmx
I0603 14:50:41.896787       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://172.16.11.234:2379 0 <nil>} {https://172.16.11.230:2379 0 <nil>} {https://172.16.11.233:2379 0 <nil>} {https://172.16.11.232:2379 0 <nil>} {https://172.16.11.231:2379 0 <nil>}]
I0603 14:50:41.922569       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922768       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922778       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922785       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922791       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.907107       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0603 14:50:44.908266       1 client.go:361] parsed scheme: "endpoint"
I0603 14:50:44.908327       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://172.16.11.231:2379 0 <nil>} {https://172.16.11.232:2379 0 <nil>} {https://172.16.11.230:2379 0 <nil>} {https://172.16.11.233:2379 0 <nil>} {https://172.16.11.234:2379 0 <nil>}]
I0603 14:50:44.927584       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.933897       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935793       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935804       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935834       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:47.917208       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap

I don't know if it has anything to do with this bug, but it did not appear before the upgrade.

Cheers

Hi Andreas,

If you can, please open a new BZ for that issue or contact customer support; I think that would be the best way forward for an issue not related to the BZ subject.

Thanks!

FYI, this completed on my cluster, which was already hung on 4.3 -> 4.4.

1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'

This was a noop, since this field had already been set for a 4.1 -> 4.2 issue.

2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller

Done and done.

4.) confirm the above ^ is regenerated `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller`

This took about a minute to regenerate.

5.) Then perform an update

My 4.3 -> 4.4 was already in progress.

Does this mean I can run these steps before starting the upgrade to 4.4 to avoid the stoppage?

It seems that `oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller` was the wrong thing to do on a 4.3 cluster. I ran it on my production cluster in preparation for the move to 4.4. Several nodes are now in NotReady state. I see new machine CSRs.
Upon approving one, I saw a new entry in `oc get nodes` listed as "localhost.localdomain". Approving that machine's CSR, it progressed to "Ready" status, while the original node entry is still showing NotReady.

With 4.3 out of full support I'm trying to stay on a current configuration. It would help a lot if Red Hat could test upgrades before dropping full support for the (n-1) point release.

The workaround in comment 38 is only relevant to a cluster that's already stuck on an upgrade and should not be used in any proactive manner. vSphere UPI clusters installed in 4.1 currently running 4.3 should delay upgrading to 4.4 until this bug has been resolved.

In general, please engage with support rather than directly via bugzilla, as they'll have the most context regarding your specific environment and requirements.

I have been engaged with support. Here's an excerpt from that conversation:
> >> Does this mean I can run these steps before starting the upgrade to 4.4 to avoid the stoppage?
>
> As of now, the engineering team was able to test this workaround before initiating an upgrade only.
> So yes, if you have any different cluster where an upgrade is due then you can apply this workaround prior to the upgrade.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2445

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.