Description of problem:
Upgrading an OCP-on-OSP13 cluster from 4.6.5 to 4.7.0-0.nightly-2020-11-24-080601 gets stuck on "the cluster operator machine-config has not yet successfully rolled out".

Version-Release number of selected component (if applicable):

How reproducible:
1/1

Steps to Reproduce:
1. Install an OCP cluster on PSI.
2. Add the label "node-workload=app" to all worker nodes.
3. Update the scheduler to use that label as the default node selector:
$ oc get scheduler cluster -ojson | jq .spec
{
  "defaultNodeSelector": "node-workload=app",
  "mastersSchedulable": false,
  "policy": {
    "name": ""
  }
}
4. Upgrade from 4.6.5 to 4.7.0-0.nightly-2020-11-24-080601 with:
$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-11-24-080601 --force --allow-explicit-upgrade

Actual results:
The upgrade has been stuck on "the cluster operator machine-config has not yet successfully rolled out" for more than 10 hours.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.5     True        True          11h     Unable to apply 4.7.0-0.nightly-2020-11-24-080601: the cluster operator machine-config has not yet successfully rolled out

Error from machine-config-operator:
E1124 18:19:58.749041 1 operator.go:321] timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)

Errors from machine-config-controller:
I1125 01:25:28.967749 1 render_controller.go:376] Error syncing machineconfigpool worker: ControllerConfig has not completed: completed(false) running(false) failing(true)
E1125 01:25:33.933436 1 kubelet_config_controller.go:318] GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml:53:22: executing "/etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml" at <.DNS.Spec.BaseDomain>: nil pointer evaluating *v1.DNS.Spec
I1125 01:25:33.933478 1 kubelet_config_controller.go:319] Dropping featureconfig "cluster" out of the queue: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml:53:22: executing "/etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml" at <.DNS.Spec.BaseDomain>: nil pointer evaluating *v1.DNS.Spec
I1125 01:25:34.419123 1 template_controller.go:366] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml:53:22: executing "/etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml" at <.DNS.Spec.BaseDomain>: nil pointer evaluating *v1.DNS.Spec

Expected results:
The upgrade completes successfully.

Additional info:
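For context on the template failure above, the following is a minimal, self-contained Go sketch (not MCO code) that reproduces the class of error in the machine-config-controller logs: Go's text/template reports "nil pointer evaluating *T.Field" when the render data carries a nil DNS pointer. The type and field names here are illustrative stand-ins for the real ControllerConfig/DNS API types.

package main

import (
	"fmt"
	"os"
	"text/template"
)

// Stand-ins for the API types the on-prem resolv-prepender template dereferences.
type DNSSpec struct {
	BaseDomain string
}

type DNS struct {
	Spec DNSSpec
}

type RenderData struct {
	// DNS is nil here, mimicking an upgrade window in which the new
	// controllerconfig exists but its DNS data has not been populated yet.
	DNS *DNS
}

func main() {
	// The on-prem template references .DNS.Spec.BaseDomain, just like
	// NetworkManager-resolv-prepender.yaml does.
	tmpl := template.Must(template.New("resolv-prepender").Parse("{{ .DNS.Spec.BaseDomain }}"))

	// With a nil DNS pointer, Execute returns an error along the lines of:
	//   executing "resolv-prepender" at <.DNS.Spec.BaseDomain>:
	//   nil pointer evaluating *main.DNS.Spec
	if err := tmpl.Execute(os.Stdout, RenderData{DNS: nil}); err != nil {
		fmt.Println("render failed:", err)
	}

	// Once the pointer is populated, the same template renders cleanly.
	_ = tmpl.Execute(os.Stdout, RenderData{DNS: &DNS{Spec: DNSSpec{BaseDomain: "example.openshift.local"}}})
	fmt.Println()
}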
The issue seems to be related to some churn in this on-prem template: https://github.com/openshift/machine-config-operator/commit/ab7b38da3889421ffc0b41d096eba8360bb3f582

@Qin: Was this a bad nightly? Have you seen this again with another nightly?

@Ben: Have you seen failing 4.6 -> 4.7 upgrades on this, or was this a one-off failure?
I don't know whether we've started upgrade testing on bare metal; I'll check with our QE people. I'm surprised to see this, though. In the must-gather I see that we have the new controllerconfig with the DNS field, but it's not populated. I would have expected that to happen before the templates got processed again. It's kind of the opposite of the previous upgrade bug here, where old templates were generated using new data; this time the new templates are being processed using some old data. :-/
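To make the ordering issue above concrete, here is a hypothetical sketch (not the actual MCO fix; function and type names are assumptions) of the kind of guard that would turn an unpopulated DNS field into a retryable sync error instead of a template nil-pointer failure:

package main

import (
	"errors"
	"fmt"
)

type DNSSpec struct {
	BaseDomain string
}

type DNS struct {
	Spec DNSSpec
}

// ControllerConfigSpec is a stand-in for the slice of the real spec that the
// on-prem templates need.
type ControllerConfigSpec struct {
	DNS *DNS
}

// errDNSNotPopulated marks a transient state: the upgraded schema has a DNS
// field, but the operator has not filled it in yet.
var errDNSNotPopulated = errors.New("controllerconfig DNS not populated yet; requeueing")

// renderOnPremTemplates fails fast with a retryable error rather than letting
// text/template dereference a nil DNS pointer.
func renderOnPremTemplates(spec ControllerConfigSpec) error {
	if spec.DNS == nil || spec.DNS.Spec.BaseDomain == "" {
		return errDNSNotPopulated
	}
	// ... render NetworkManager-resolv-prepender.yaml and friends here ...
	fmt.Println("rendering with base domain:", spec.DNS.Spec.BaseDomain)
	return nil
}

func main() {
	// During the upgrade window described above, DNS is still nil.
	if err := renderOnPremTemplates(ControllerConfigSpec{}); err != nil {
		fmt.Println("sync error:", err)
	}

	// After the operator populates the field, rendering proceeds.
	_ = renderOnPremTemplates(ControllerConfigSpec{DNS: &DNS{Spec: DNSSpec{BaseDomain: "example.openshift.local"}}})
}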
The issue is also reproduced on a UPI-on-vSphere cluster when upgrading from 4.6.0-0.nightly-2020-12-08-021151 to 4.7.0-0.nightly-2020-12-04-013308.
*** Bug 1897048 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633