Bug 1834925 - [vsphere] upgrade from 4.1 -> 4.2 -> 4.3 -> 4.4 upgrade failed at waitForControllerConfigToBeCompleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joseph Callen
QA Contact: jima
URL:
Whiteboard:
Depends On: 1842952
Blocks: 1834194
Reported: 2020-05-12 16:53 UTC by Joseph Callen
Modified: 2021-11-03 05:59 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1834194
Clones: 1842952
Environment:
Last Closed: 2020-07-13 17:37:59 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1728 0 None closed bug 1834925: vsphere: check if .Infra.Status and .Infra.Status.PlatformStatus is nil 2021-01-13 03:24:05 UTC
Github openshift machine-config-operator pull 1783 0 None closed [release-4.5] Bug 1834925: vsphere templates check if Infra is nil 2021-01-13 03:24:06 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:38:20 UTC

Description Joseph Callen 2020-05-12 16:53:46 UTC
+++ This bug was initially created as a clone of Bug #1834194 +++

Description of problem:
during the upgrade

Version-Release number of selected component (if applicable):
during upgrade from 4.1.0-0.nightly-2020-05-04-100857 -> 4.2.0-0.nightly-2020-05-07-194422 -> 4.3.0-0.nightly-2020-05-07-171148 -> 4.4.0-0.nightly-2020-05-08-224132

How reproducible:
sometimes, found twice in upgrade CI one day

Steps to Reproduce:
1. setup 4.1 cluster, and upgrade to 4.4 step by step
2. monitor upgrade progress


Actual results:
2. failed to upgrade to 4.4.0-0.nightly-2020-05-08-224132 due to error: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h3m
cloud-credential                           4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
cluster-autoscaler                         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
console                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      105m
csi-snapshot-controller                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      103m
dns                                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
etcd                                       4.4.0-0.nightly-2020-05-08-224132   True        False         False      119m
image-registry                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      143m
ingress                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h9m
insights                                   4.4.0-0.nightly-2020-05-08-224132   True        False         False      3h33m
kube-apiserver                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      118m
kube-controller-manager                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-scheduler                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-storage-version-migrator              4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m
machine-api                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
machine-config                             4.3.0-0.nightly-2020-05-07-171148   False       True          True       88m
marketplace                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      109m
monitoring                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      136m
network                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
node-tuning                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      100m
openshift-apiserver                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
openshift-controller-manager               4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h13m
openshift-samples                          4.4.0-0.nightly-2020-05-08-224132   True        False         False      4m53s
operator-lifecycle-manager                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-ca                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
service-catalog-apiserver                  4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-catalog-controller-manager         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h2m
storage                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m

# oc describe co machine-config 
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-05-09T22:22:18Z
  Generation:          1
  Resource Version:    137088
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 8b48ddbf-9243-11ea-b1ae-0050568b3a4a
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Cluster not available for 4.4.0-0.nightly-2020-05-08-224132
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-05-10T01:04:34Z
    Message:               Working towards 4.4.0-0.nightly-2020-05-08-224132
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-05-08-224132: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-09T23:34:49Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.3.0-0.nightly-2020-05-07-171148
Events:       <none>


Expected results:
2. upgrade from 4.3 -> 4.4 should succeed

Additional info:
Please get the must-gather logs from the comment below.

--- Additional comment from Yadan Pei on 2020-05-11 09:49:52 UTC ---

must-gather logs:

http://10.73.131.57:9000/minio/openshift-must-gather/2020-05-09-22-45-29/must-gather.local.6006406755885421395.tar.gz          Access Key: 'openshift' Secret Key: 'am5bM2Es8SRYe$^A'
http://10.73.131.57:9000/minio/openshift-must-gather/2020-05-10-22-33-11/must-gather.local.1373645183016184295.tar.gz          Access Key: 'openshift' Secret Key: 'am5bM2Es8SRYe$^A'

--- Additional comment from Yu Qi Zhang on 2020-05-11 21:11:13 UTC ---

First glance: the error blocking the upgrade is:

2020-05-10T02:42:20.322072985Z E0510 02:42:20.322026       1 container_runtime_config_controller.go:374] could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.322072985Z I0510 02:42:20.322048       1 container_runtime_config_controller.go:375] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.355788752Z I0510 02:42:20.355743       1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

You can see this in the machine-config-controller logs. Basically, it is a nil pointer error for PlatformStatus on vSphere. A question: you say this happens "sometimes". How reproducible is this on vSphere?

CC'ing Christian since he worked on this recently and may have a better idea what the root cause is.

--- Additional comment from Yu Qi Zhang on 2020-05-11 21:19:27 UTC ---

Actually, I can see that the vsphere file in question was last updated in Feb for release-4.4 branch: https://github.com/openshift/machine-config-operator/blob/release-4.4/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml.

Adding Joseph to see if he knows what's up

--- Additional comment from Joseph Callen on 2020-05-11 21:35:39 UTC ---

4.4 does not include vSphere IPI.
This check was added so that UPI did not get the in-network services.
Perhaps we need an additional check for whether .Infra.Status.PlatformStatus or .Infra.Status is nil.

--- Additional comment from Yu Qi Zhang on 2020-05-11 21:46:44 UTC ---

Should we add a

{{ if .Infra.Status -}}

to everything in

https://github.com/openshift/machine-config-operator/commit/49dbfb23527502b7201241dcb865cd197d088a0e

or is that a bit overkill?
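
For illustration, here is a minimal sketch of how Go's text/template behaves when a field is read through a nil pointer, and how a chain of nested `if` guards avoids the error. The types below (RenderConfig, Infrastructure, InfraStatus, PlatformStatus, VSphereStatus) are hypothetical stand-ins, not the real openshift/api or MCO types, and the guarded template is only one possible shape for the check, not necessarily what the PR does.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Hypothetical stand-ins for the render config and the Infrastructure object.
// The real types live in openshift/api and the MCO and differ in detail.
type VSphereStatus struct{ APIServerInternalIP string }
type PlatformStatus struct{ VSphere *VSphereStatus }
type InfraStatus struct{ PlatformStatus *PlatformStatus }
type Infrastructure struct{ Status InfraStatus }
type RenderConfig struct{ Infra *Infrastructure }

// unguarded walks the whole chain at once; any nil pointer on the way
// aborts template execution with a "nil pointer evaluating" error.
const unguarded = `{{ if .Infra.Status.PlatformStatus.VSphere }}ipi{{ else }}upi{{ end }}`

// guarded tests each pointer before dereferencing it.
const guarded = `{{ if .Infra }}{{ if .Infra.Status.PlatformStatus }}{{ if .Infra.Status.PlatformStatus.VSphere }}ipi{{ end }}{{ end }}{{ end }}`

// render parses and executes a template against cfg.
func render(text string, cfg RenderConfig) (string, error) {
	tmpl, err := template.New("example").Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, cfg); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	cases := []RenderConfig{
		{},                         // Infra is nil
		{Infra: &Infrastructure{}}, // PlatformStatus is nil
		{Infra: &Infrastructure{Status: InfraStatus{
			PlatformStatus: &PlatformStatus{VSphere: &VSphereStatus{}},
		}}},
	}
	for _, cfg := range cases {
		_, errUnguarded := render(unguarded, cfg)
		out, errGuarded := render(guarded, cfg)
		fmt.Printf("unguarded error: %v\n", errUnguarded)
		fmt.Printf("guarded output: %q, error: %v\n\n", out, errGuarded)
	}
}
```

In this sketch the unguarded template fails for the first two inputs and only renders when every pointer in the chain is set, while the guarded one renders an empty string instead of failing.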

--- Additional comment from Joseph Callen on 2020-05-12 14:47:50 UTC ---

PR:
https://github.com/openshift/machine-config-operator/pull/1728

Need to clone this bug for 4.4 and 4.5

Comment 3 liujia 2020-05-20 01:43:42 UTC
Version: 4.5.0-0.nightly-2020-05-19-031245

Upgraded OCP on vSphere from v4.4.4 to 4.5.0-0.nightly-2020-05-19-031245 successfully. The machine-config-operator works well.

# ./oc get co machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.5.0-0.nightly-2020-05-19-031245   True        False         False      12h

Comment 5 Lalatendu Mohanty 2020-05-26 12:36:34 UTC
This is a high-priority bug; it affects upgrades of OpenShift clusters to 4.4. Hence increasing the severity and adding the upgradeblocker keyword. The bug for the 4.4 backport also has similar severity: https://bugzilla.redhat.com/show_bug.cgi?id=1834194#c11

Comment 6 Lalatendu Mohanty 2020-05-26 12:44:22 UTC
Is there a manual workaround for this issue that customers can apply to get unblocked?

Comment 8 Joseph Callen 2020-06-02 00:03:40 UTC
Moving back to assigned based on 4.4 testing

Comment 9 Joseph Callen 2020-06-03 12:56:35 UTC
The customer cases attached to this BZ should be on: https://bugzilla.redhat.com/show_bug.cgi?id=1834194



Sorry for the inconvenience this has caused. The templates that we added to MCO in version 4.4 were the precursor to enabling vSphere IPI. The check on various variables was to ensure the difference between UPI and IPI.

In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26) to implement a CI job that will upgrade a cluster on vSphere through the releases in the hope that we would catch this failure before a customer does.

Workaround:
1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller 
4.) Confirm that the ControllerConfig deleted in step 3 is regenerated: `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller`
5.) Then perform an update
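
To show what step 1 changes, here is a rough sketch of the effect of that JSON Patch "add" operation on a decoded object, using only the Go standard library. The function name addPlatformStatus and the sample document are hypothetical; this is not how `oc patch` is implemented, only an illustration of the resulting shape of `status`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addPlatformStatus mimics the "add" operation at /status/platformStatus.
// As with JSON Patch, the parent object ("status") must already exist, and
// an existing platformStatus member would be replaced.
func addPlatformStatus(doc []byte, platformType string) ([]byte, error) {
	var obj map[string]interface{}
	if err := json.Unmarshal(doc, &obj); err != nil {
		return nil, err
	}
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("document has no status object")
	}
	status["platformStatus"] = map[string]interface{}{"type": platformType}
	return json.Marshal(obj)
}

func main() {
	// Hypothetical, heavily trimmed Infrastructure object without platformStatus.
	before := []byte(`{"status":{"platform":"VSphere"}}`)
	after, err := addPlatformStatus(before, "VSphere")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(after))
}
```

Once `status.platformStatus` is non-nil, the template lookup that previously hit the nil pointer has something to evaluate.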

Comment 11 jima 2020-06-19 05:44:54 UTC
verified on 4.5.0-0.nightly-2020-06-18-114733
Upgraded OCP from 4.4.0-0.nightly-2020-06-18-212632 to 4.5.0-0.nightly-2020-06-18-114733; the upgrade succeeded and machine-config-operator works well.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-06-18-114733   True        False         20m     Cluster version is 4.5.0-0.nightly-2020-06-18-114733
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      98m
cloud-credential                           4.5.0-0.nightly-2020-06-18-114733   True        False         False      114m
cluster-autoscaler                         4.5.0-0.nightly-2020-06-18-114733   True        False         False      105m
config-operator                            4.5.0-0.nightly-2020-06-18-114733   True        False         False      73m
console                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      33m
csi-snapshot-controller                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      103m
dns                                        4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
etcd                                       4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
image-registry                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      33m
ingress                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      103m
insights                                   4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
kube-apiserver                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
kube-controller-manager                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      108m
kube-scheduler                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      108m
kube-storage-version-migrator              4.5.0-0.nightly-2020-06-18-114733   True        False         False      33m
machine-api                                4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
machine-approver                           4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m
machine-config                             4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
marketplace                                4.5.0-0.nightly-2020-06-18-114733   True        False         False      32m
monitoring                                 4.5.0-0.nightly-2020-06-18-114733   True        False         False      62m
network                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      111m
node-tuning                                4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m
openshift-apiserver                        4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
openshift-controller-manager               4.5.0-0.nightly-2020-06-18-114733   True        False         False      106m
openshift-samples                          4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-06-18-114733   True        False         False      109m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-06-18-114733   True        False         False      32m
service-ca                                 4.5.0-0.nightly-2020-06-18-114733   True        False         False      110m
storage                                    4.5.0-0.nightly-2020-06-18-114733   True        False         False      65m

Comment 12 errata-xmlrpc 2020-07-13 17:37:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

