Bug 1901376

Summary: [on-prem] Upgrade from 4.6 to 4.7 failed with "timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)"
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.7
Reporter: Qin Ping <piqin>
Assignee: Ben Nemec <bnemec>
QA Contact: Ori Michaeli <omichael>
CC: bnemec, jima, kgarriso, mkrejci, omichael, wking, yanyang
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 15:35:48 UTC

Description Qin Ping 2020-11-25 01:38:56 UTC
Description of problem:
Upgraded an OCP-on-OSP13 cluster from 4.6.5 to 4.7.0-0.nightly-2020-11-24-080601.
The upgrade is stuck on "the cluster operator machine-config has not yet successfully rolled out".

Version-Release number of selected component (if applicable):


How reproducible:
1/1

Steps to Reproduce:
1. Install an OCP cluster on PSI
2. Add label "node-workload=app" to all worker nodes.
3. Update the scheduler so that its spec matches the following (example commands for steps 2 and 3 are shown after step 4):
$ oc get scheduler cluster -ojson|jq .spec
{
  "defaultNodeSelector": "node-workload=app",
  "mastersSchedulable": false,
  "policy": {
    "name": ""
  }
}
4. Upgrade from 4.6.5 to 4.7.0-0.nightly-2020-11-24-080601 with cmd:
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-11-24-080601 --force --allow-explicit-upgrade
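
For reference, steps 2 and 3 can be applied with commands along these lines (a sketch; the placeholder node name and the exact invocation are assumptions, not the reporter's exact commands):

$ oc label node <worker-node-name> node-workload=app
$ oc patch scheduler cluster --type merge -p '{"spec":{"defaultNodeSelector":"node-workload=app"}}'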

Actual results:
The upgrade has been stuck on "the cluster operator machine-config has not yet successfully rolled out" for more than 10 hours:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.5     True        True          11h     Unable to apply 4.7.0-0.nightly-2020-11-24-080601: the cluster operator machine-config has not yet successfully rolled out
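
The operator and controller errors below come from the MCO pod logs; they can be retrieved with commands like the following (assuming the standard deployment names in the openshift-machine-config-operator namespace):

$ oc -n openshift-machine-config-operator logs deployment/machine-config-operator
$ oc -n openshift-machine-config-operator logs deployment/machine-config-controller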

Error msg from machine-config-operator:
E1124 18:19:58.749041       1 operator.go:321] timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)

Error msg from machine-config-controller:
I1125 01:25:28.967749       1 render_controller.go:376] Error syncing machineconfigpool worker: ControllerConfig has not completed: completed(false) running(false) failing(true)
E1125 01:25:33.933436       1 kubelet_config_controller.go:318] GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml:53:22: executing "/etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml" at <.DNS.Spec.BaseDomain>: nil pointer evaluating *v1.DNS.Spec
I1125 01:25:33.933478       1 kubelet_config_controller.go:319] Dropping featureconfig "cluster" out of the queue: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml:53:22: executing "/etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml" at <.DNS.Spec.BaseDomain>: nil pointer evaluating *v1.DNS.Spec
I1125 01:25:34.419123       1 template_controller.go:366] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml:53:22: executing "/etc/mcc/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml" at <.DNS.Spec.BaseDomain>: nil pointer evaluating *v1.DNS.Spec
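
For context on the failure mode: Go's text/template aborts as soon as it evaluates a field through a nil pointer, which is what happens here when .DNS has not been populated in the render data. A minimal, self-contained Go sketch (stand-in types only, not the real MCO structs) reproduces the same class of error:

package main

import (
    "fmt"
    "os"
    "text/template"
)

// Stand-in types for illustration; the real render data lives in the
// machine-config-operator repo and uses the openshift/api DNS type.
type DNSSpec struct{ BaseDomain string }
type DNS struct{ Spec DNSSpec }
type RenderConfig struct{ DNS *DNS }

func main() {
    tmpl := template.Must(template.New("prepender").Parse("search {{ .DNS.Spec.BaseDomain }}\n"))

    // DNS is left nil, mirroring the upgrade case where the new field has not
    // been populated yet; Execute returns a "nil pointer evaluating" error
    // instead of rendering the file.
    if err := tmpl.Execute(os.Stdout, RenderConfig{}); err != nil {
        fmt.Fprintln(os.Stderr, "template error:", err)
    }
}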


Expected results:
Upgrade successfully.

Additional info:

Comment 2 Kirsten Garrison 2020-11-30 20:38:01 UTC
The issue seems to be related to some churn in this on-prem template: https://github.com/openshift/machine-config-operator/commit/ab7b38da3889421ffc0b41d096eba8360bb3f582

@Qin Was this a bad nightly, have you seen this again using another nightly?

@Ben have you seen failing 4.6-> 4.7 upgrades on this or was this a one off failure?

Comment 4 Ben Nemec 2020-12-03 22:41:40 UTC
I don't know whether we've started upgrade testing on baremetal. I'll check with our QE people.

I'm surprised to see this, though. I see in the must-gather that we have the new controllerconfig with the dns field, but it's not populated. I would have expected that to happen before templates got processed again. It's kind of the opposite of the previous upgrade bug with this, where old templates were generated using new data. This time it looks like the new templates are being processed using some old data. :-/
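
For reference, one way to check this is to dump the DNS field from the ControllerConfig, along the lines of the command below; the .spec.dns path is an assumption based on the linked template change, so adjust it if the field lives elsewhere:

$ oc get controllerconfig machine-config-controller -o jsonpath='{.spec.dns}'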

Comment 9 jima 2020-12-10 08:11:08 UTC
The issue is also reproduced on a UPI-on-vSphere cluster when upgrading from 4.6.0-0.nightly-2020-12-08-021151 to 4.7.0-0.nightly-2020-12-04-013308.

Comment 11 Ben Nemec 2020-12-16 22:40:57 UTC
*** Bug 1897048 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2021-02-24 15:35:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633