Bug 1834194 - upgrade from 4.1 -> 4.2 -> 4.3 -> 4.4 failed at waitForControllerConfigToBeCompleted [NEEDINFO]
Summary: upgrade from 4.1 -> 4.2 -> 4.3 -> 4.4 upgrade failed at waitForControllerConf...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Joseph Callen
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On: 1834925
Blocks:
 
Reported: 2020-05-11 09:48 UTC by Yadan Pei
Modified: 2020-06-25 10:52 UTC
CC: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1834925 (view as bug list)
Environment:
Last Closed: 2020-06-17 22:26:36 UTC
Target Upstream Version:
Flags: vlaad: needinfo? (mnguyen)


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1735 None closed Bug 1834194: vsphere: check if .Infra.Status and .Infra.Status.PlatformStatus is nil 2020-11-23 13:22:33 UTC
Github openshift machine-config-operator pull 1769 None closed Bug 1834194: vsphere: 4.1 to 4.4 upgrade bug 2020-11-23 13:22:35 UTC
Red Hat Knowledge Base (Solution) 5098731 None None None 2020-06-01 07:45:50 UTC
Red Hat Product Errata RHBA-2020:2445 None None None 2020-06-17 22:26:53 UTC

Description Yadan Pei 2020-05-11 09:48:24 UTC
Description of problem:
upgrade failed at waitForControllerConfigToBeCompleted during the upgrade

Version-Release number of selected component (if applicable):
during upgrade from 4.1.0-0.nightly-2020-05-04-100857 -> 4.2.0-0.nightly-2020-05-07-194422 -> 4.3.0-0.nightly-2020-05-07-171148 -> 4.4.0-0.nightly-2020-05-08-224132

How reproducible:
Sometimes; observed twice in upgrade CI in one day

Steps to Reproduce:
1. setup 4.1 cluster, and upgrade to 4.4 step by step
2. monitor upgrade progress


Actual results:
2. failed to upgrade to 4.4.0-0.nightly-2020-05-08-224132 due to error: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h3m
cloud-credential                           4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
cluster-autoscaler                         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
console                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      105m
csi-snapshot-controller                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      103m
dns                                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
etcd                                       4.4.0-0.nightly-2020-05-08-224132   True        False         False      119m
image-registry                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      143m
ingress                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h9m
insights                                   4.4.0-0.nightly-2020-05-08-224132   True        False         False      3h33m
kube-apiserver                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      118m
kube-controller-manager                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-scheduler                             4.4.0-0.nightly-2020-05-08-224132   True        False         False      116m
kube-storage-version-migrator              4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m
machine-api                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
machine-config                             4.3.0-0.nightly-2020-05-07-171148   False       True          True       88m
marketplace                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      109m
monitoring                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      136m
network                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
node-tuning                                4.4.0-0.nightly-2020-05-08-224132   True        False         False      100m
openshift-apiserver                        4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
openshift-controller-manager               4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h13m
openshift-samples                          4.4.0-0.nightly-2020-05-08-224132   True        False         False      4m53s
operator-lifecycle-manager                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h11m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-ca                                 4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h18m
service-catalog-apiserver                  4.4.0-0.nightly-2020-05-08-224132   True        False         False      104m
service-catalog-controller-manager         4.4.0-0.nightly-2020-05-08-224132   True        False         False      4h2m
storage                                    4.4.0-0.nightly-2020-05-08-224132   True        False         False      110m

# oc describe co machine-config 
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-05-09T22:22:18Z
  Generation:          1
  Resource Version:    137088
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 8b48ddbf-9243-11ea-b1ae-0050568b3a4a
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Cluster not available for 4.4.0-0.nightly-2020-05-08-224132
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-05-10T01:04:34Z
    Message:               Working towards 4.4.0-0.nightly-2020-05-08-224132
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-10T01:11:44Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-05-08-224132: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-09T23:34:49Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.3.0-0.nightly-2020-05-07-171148
Events:       <none>


Expected results:
2. upgrade from 4.3 -> 4.4 should succeed

Additional info:
Please get must-gather logs from the comment

Comment 2 Yu Qi Zhang 2020-05-11 21:11:13 UTC
First glance: the error blocking the upgrade is:

2020-05-10T02:42:20.322072985Z E0510 02:42:20.322026       1 container_runtime_config_controller.go:374] could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.322072985Z I0510 02:42:20.322048       1 container_runtime_config_controller.go:375] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
2020-05-10T02:42:20.355788752Z I0510 02:42:20.355743       1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

You can see this in the machine-config-controller logs. Basically a nil pointer error for PlatformStatus on vSphere. A question: you say this happens "sometimes". How reproducible is this within vSphere?

CC'ing Christian since he worked on this recently and may have a better idea what the root cause is.

Comment 3 Yu Qi Zhang 2020-05-11 21:19:27 UTC
Actually, I can see that the vsphere file in question was last updated in Feb for release-4.4 branch: https://github.com/openshift/machine-config-operator/blob/release-4.4/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml.

Adding Joseph to see if he knows what's up

Comment 4 Joseph Callen 2020-05-11 21:35:39 UTC
4.4 does not include vSphere IPI.
This check was added so that UPI did not get the in-network services.
Perhaps an additional check for .Infra.Status.PlatformStatus or .Infra.Status

Comment 5 Yu Qi Zhang 2020-05-11 21:46:44 UTC
Should we add a

{{ if .Infra.Status -}}

to everything in

https://github.com/openshift/machine-config-operator/commit/49dbfb23527502b7201241dcb865cd197d088a0e

or is that a bit overkill?
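The suggested `{{ if ... }}` guard works because Go's text/template treats a nil pointer as falsy in an `if` action, so evaluation short-circuits before the dereference that produces the "nil pointer evaluating" error. A minimal sketch of both behaviors, using illustrative stand-in types rather than the real openshift/api definitions:

```go
package main

import (
	"bytes"
	"strings"
	"text/template"
)

// Simplified stand-ins mirroring the .Infra.Status.PlatformStatus.VSphere
// chain from the MCO templates; illustrative only, not the real API types.
type PlatformStatus struct {
	VSphere *struct{}
}
type InfraStatus struct {
	PlatformStatus *PlatformStatus
}
type Infra struct {
	Status *InfraStatus
}

func render(text string, data interface{}) (string, error) {
	var buf bytes.Buffer
	err := template.Must(template.New("kni").Parse(text)).Execute(&buf, data)
	return buf.String(), err
}

func main() {
	// Born-in-4.1 cluster: Status was never populated, so the pointer is nil.
	data := struct{ Infra Infra }{}

	// Unguarded access fails the same way the MCO template did.
	_, err := render(`{{.Infra.Status.PlatformStatus.VSphere}}`, data)
	if err == nil || !strings.Contains(err.Error(), "nil pointer") {
		panic("expected a nil-pointer template error")
	}

	// Guarding each level with {{if ...}} short-circuits before dereferencing.
	guarded := `{{if .Infra.Status}}{{if .Infra.Status.PlatformStatus}}` +
		`{{.Infra.Status.PlatformStatus.VSphere}}{{end}}{{end}}`
	out, err := render(guarded, data)
	if err != nil || out != "" {
		panic("guarded template should render empty without error")
	}
}
```

Guarding every level of the chain, as Comment 5 asks, is only "overkill" in verbosity; each `{{if}}` is needed because any link in the chain can independently be nil.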

Comment 6 Joseph Callen 2020-05-12 14:47:50 UTC
PR:
https://github.com/openshift/machine-config-operator/pull/1728

Need to clone this bug for 4.4 and 4.5

Comment 7 W. Trevor King 2020-05-12 22:18:41 UTC
Setting this up as the 4.4.z backport version of bug 1834925.  My understanding is that we don't need to bother manually cloning backport bugs anymore, because `/cherrypick ...` will create them as needed on our behalf.

Comment 8 Fatima 2020-05-22 09:41:06 UTC
Please note that we have a customer facing this issue during 4.3 to 4.4 (OCP + vsphere) upgrade.

Comment 9 Chet Hosey 2020-05-22 22:22:47 UTC
I'd thought that this should have been caught by the job mentioned in bug 1787765. Hopefully someone's reviewing that, since it's not ideal to have an upgrade to a stable-4.4 release fail.

FWIW I'm the customer Fatima mentioned. The cluster in question started with 4.1 and has been upgraded through to 4.4, with the 4.3 -> 4.4 failing on the MCO.

I'm happy to provide any info I can if there are any questions.

Comment 10 Andreas Söhnlein 2020-05-25 22:48:51 UTC
Hello,

I am facing the exact same problem.
Upgrading from 4.3.18 to 4.4.5 on vSphere.
Is there any fix?

Cheers



# oc logs machine-config-controller-5d58b57c47-bffvq

I0525 22:39:03.615421       1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
I0525 22:39:07.489762       1 render_controller.go:376] Error syncing machineconfigpool worker: ControllerConfig has not completed: completed(false) running(false) failing(true)
I0525 22:39:07.490300       1 render_controller.go:376] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
I0525 22:39:10.772401       1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status


# oc describe co machine-config

Status:
  Conditions:
    Last Transition Time:  2020-05-25T14:41:08Z
    Message:               Cluster not available for 4.4.5
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-05-25T14:31:22Z
    Message:               Working towards 4.4.5
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-25T14:41:08Z
    Message:               Unable to apply 4.4.5: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-04-23T01:00:59Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable

Comment 12 Scott Dodson 2020-05-26 13:17:23 UTC
The expectation is that the assignee answers these questions.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.
 
Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this we just never noticed
  Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 13 Joseph Callen 2020-05-26 14:24:52 UTC
Who is impacted?
- All customers running on vSphere and OCP 4.x.x upgrading to 4.4.x
What is the impact?
- From my perspective I would think MCO would always fail
How involved is remediation?
- Upgrade to a 4.4.x version with this PR change in place.
Is this a regression?
- No, it’s always been like this; we just never noticed. The templates were a part of 4.4 work for vSphere IPI.

Comment 14 Scott Dodson 2020-05-26 14:42:19 UTC
Thanks, an existing upgrade that's stuck on this can be re-targeted to the version with this fix (once it becomes available) and the upgrade should complete?

QE can we make sure to test the last question? It's ok if that part fails but we'll need to write up additional doc around how to unstick stuck upgrades.

Comment 18 Vadim Rutkovsky 2020-05-26 19:00:11 UTC
(In reply to Joseph Callen from comment #13)
> Who is impacted?
> - All customers running on vSphere and OCP 4.x.x upgrading to 4.4.x

Does it impact 4.2.x clusters upgraded to 4.3 and then 4.4 - or the cluster must start at 4.1?

Comment 19 Joseph Callen 2020-05-26 19:12:04 UTC
(In reply to Vadim Rutkovsky from comment #18)
> (In reply to Joseph Callen from comment #13)
> > Who is impacted?
> > - All customers running on vSphere and OCP 4.x.x upgrading to 4.4.x
> 
> Does it impact 4.2.x clusters upgraded to 4.3 and then 4.4 - or the cluster
> must start at 4.1?

If a customer starts with 4.4 they should not have a problem.
If they upgrade from 4.1, 4.2 or 4.3 they will experience the bug

Comment 20 Vadim Rutkovsky 2020-05-26 20:21:49 UTC
Telemetry data shows we have 4.2 clusters which upgraded to 4.3 and then to 4.4 successfully.

Seems this bug affects clusters born in 4.1 only

Comment 21 Chet Hosey 2020-05-26 23:08:33 UTC
Curious, why is a cluster that started at 4.1 and was upgraded through to 4.3 different from a cluster that started with 4.3?

I'd thought that the combination of custom resources, operators, and RHCOS was to mitigate this sort of drift.

Comment 22 W. Trevor King 2020-05-27 03:24:52 UTC
Yeah, ideally everything is managed by the cluster and there is no drift.  However, there are still a few things, like the Infrastructure config, that aren't currently managed by an in-cluster operator.  In this case, the Infrastructure object grew platformStatus in 4.2 [1] but there was, at the time, no suitable operator to migrate existing clusters.  We've since grown out the config operator, and bug 1814332 landed in 4.5 to port born-in-4.1 Infrastructure configs.  But the PR that closed bug 1814332 only addressed AWS, not vSphere, and was also not backported to 4.4.  There should be some similar way to recover in this case with:

$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
$ oc -n openshift-machine-config-operator get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep machine-config-controller- | while read POD; do oc -n openshift-machine-config-operator delete pod "${POD}"; done

or some such, but we haven't worked out the details yet.

[1]: https://github.com/openshift/api/pull/300
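The drift Trevor describes can be reproduced outside a cluster: a status object serialized before the field existed simply has no `platformStatus` key, so decoding it into the newer schema leaves the pointer nil. A pared-down sketch (illustrative types, not the real openshift/api definitions):

```go
package main

import "encoding/json"

// Simplified mirror of the Infrastructure status schema; the field names
// follow the real API but the types here are pared down for illustration.
type VSpherePlatformStatus struct{}

type PlatformStatus struct {
	Type    string                 `json:"type"`
	VSphere *VSpherePlatformStatus `json:"vsphere,omitempty"`
}

type InfrastructureStatus struct {
	Platform       string          `json:"platform"`
	PlatformStatus *PlatformStatus `json:"platformStatus,omitempty"`
}

func main() {
	// A born-in-4.1 cluster serialized its status before platformStatus
	// existed, so the stored object simply lacks that field.
	born41 := []byte(`{"platform": "VSphere"}`)

	var st InfrastructureStatus
	if err := json.Unmarshal(born41, &st); err != nil {
		panic(err)
	}
	if st.PlatformStatus != nil {
		panic("expected nil PlatformStatus on a born-in-4.1 object")
	}
	// Any template or code that dereferences st.PlatformStatus.VSphere
	// without a nil check will fail on exactly these clusters, which is
	// why clusters born in 4.2 or later never hit the bug.
}
```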

Comment 23 Chet Hosey 2020-05-27 04:55:34 UTC
Thanks. It's good to hear this is being addressed. That schema change had hit us in 4.1 -> 4.2 (bug 1773870), but it didn't seem like there was an appropriate place to handle this specific change.

We'd worked around it by adding the new field after upgrading to 4.2. So our current value includes the following. I'd expect to mirror an install originating from 4.2.

  status:
    platform: VSphere
    platformStatus:
      type: VSphere

The full value from our cluster is included in the comments on support case 02659494, in case that helps.

Comment 24 Andreas Söhnlein 2020-05-27 23:44:48 UTC
(In reply to W. Trevor King from comment #22)
> Yeah, ideally everything is managed by the cluster and there is no drift. 
> However, there are still a few things, like the Infrastructure config, that
> aren't currently managed by an in-cluster operator.  In this case, the
> Infrastructure object grew providerStatus in 4.2 [1] but there was, at the
> time, no suitable operator to migrate existing clusters.  We've since grown
> out the config operator, and bug 1814332 landed in 4.5 to port born-in-4.1
> Infrastructure configs.  But the PR that closed bug 1814332 only addressed
> AWS, not vSphere, and was also not backported to 4.4.  There should be some
> similar way to recover in this case with:
> 
> $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path":
> "/status/platformStatus", "value": {"type": "VSphere"}}]'
> $ oc -n openshift-machine-config-operator get -o jsonpath='{range
> .items[*]}{.metadata.name}{"\n"}{end}' pods | grep
> machine-config-controller- | while read POD; do oc -n
> openshift-machine-config-operator delete pod "${POD}"; done
> 
> or some such, but we haven't worked out the details yet.
> 
> [1]: https://github.com/openshift/api/pull/300

Should we apply those two commands to our failing cluster upgrade?

Comment 25 W. Trevor King 2020-05-28 08:25:17 UTC
> Should we apply that two commands to our failing cluster upgrade?

They should be safe enough, but we haven't had time to work out whether they are sufficient to unstick things.  If your cluster-version operator is just blocked on the machine-config operator (and not some earlier manifest), then yeah, go ahead and try and report back.  If that feels too risky, wait a bit, and we'll get a more formal recovery procedure out (possibly re-targeting your update to a new 4.4.z with the fix this bug is backporting).

Comment 26 Chet Hosey 2020-05-28 08:53:00 UTC
That patch reported no change on my cluster:

    $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
    infrastructure.config.openshift.io/cluster patched (no change)

And after terminating the machine-config-controller pod, its replacement is still logging a lot of entries like the following:

    I0528 08:50:46.470311       1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
    I0528 08:50:46.551618       1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
    I0528 08:50:46.850498       1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

Comment 27 W. Trevor King 2020-05-28 09:10:02 UTC
Ah, we need to stick something in /status/platformStatus/vsphere too.  How about:

$ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vsphere": {}}}]'

or some such?

Comment 28 Chet Hosey 2020-05-28 09:27:56 UTC
    oc get infrastructure cluster -o json | jq .status.platformStatus
    {
      "type": "VSphere",
      "vsphere": {}
    }

    I0528 09:22:31.283915       1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status
    I0528 09:22:31.504563       1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status.PlatformStatus.VSphere>: nil pointer evaluating *v1.Infrastructure.Status

I'm not sure how case-sensitive this is. I note that the log is generally LikeThis, and the JSON shows likeThis. Since the log says VSphere, I tried lowercasing the first letter (producing vSphere instead of vsphere):

    $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vSphere": {}}}]'

That didn't work either; the new key disappeared entirely after that:

    $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere", "vSphere": {}}}]'
    infrastructure.config.openshift.io/cluster patched
    $ oc get infrastructure cluster -o json | jq .status.platformStatus
    {
      "type": "VSphere"
    }

Comment 29 W. Trevor King 2020-05-28 10:10:59 UTC
The PlatformStatus.VSphere from the logs is probably Go's casing, while platformstatus/vsphere is JSON's casing, per [1].  I don't understand the machine-config controller implementation well enough to know why it was still complaining about a nil pointer in <.Infra.Status.PlatformStatus.VSphere> when there were no nils in that chain.  Maybe the old Infrastructure object without the patching is getting cached somewhere that killing the MCC pod does not reset.

[1]: https://github.com/openshift/api/blob/0f159fee64dbf711d40dac3fa2ec8b563a2aaca8/config/v1/types_infrastructure.go#L170
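The casing split comes straight from Go struct tags: the Go field name is `VSphere` (which is what the template errors print), while the wire format uses the tag's lowercase `vsphere` (which is what a JSON patch must use). A small illustration with a pared-down type (the real PlatformStatus declares more fields):

```go
package main

import (
	"encoding/json"
	"strings"
)

// Illustrative type: the real openshift/api PlatformStatus declares the Go
// field `VSphere` with the JSON tag `json:"vsphere,omitempty"`.
type PlatformStatus struct {
	Type    string    `json:"type"`
	VSphere *struct{} `json:"vsphere,omitempty"`
}

func main() {
	// Go casing (PlatformStatus.VSphere) appears in template errors;
	// JSON casing (platformStatus.vsphere) is what serializes to the wire.
	out, err := json.Marshal(PlatformStatus{Type: "VSphere", VSphere: &struct{}{}})
	if err != nil {
		panic(err)
	}
	if !strings.Contains(string(out), `"vsphere":{}`) {
		panic("expected the lowercase vsphere key in serialized JSON")
	}
	// The API server validates patched objects against this JSON schema, so
	// a key with the wrong case ("vSphere") matches no schema property and
	// is pruned, which is why the key in Comment 28 disappeared on read-back.
}
```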

Comment 34 Andreas Söhnlein 2020-05-29 15:08:07 UTC
(In reply to W. Trevor King from comment #27)
> Ah, we need to stick something in /status/platformStatus/vsphere too.  How
> about:
> 
> $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path":
> "/status/platformStatus", "value": {"type": "VSphere", "vsphere": {}}}]'
> 
> or some such?

I am getting this error afterwards:

I0529 15:06:57.578864       1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0529 15:06:57.582976       1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0529 15:06:57.702248       1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status

Seems like it's accepting the PlatformStatus, but a new nil pointer appeared?!

Comment 35 Andreas Söhnlein 2020-05-29 15:14:10 UTC
(In reply to Andreas Söhnlein from comment #34)
> (In reply to W. Trevor King from comment #27)
> > Ah, we need to stick something in /status/platformStatus/vsphere too.  How
> > about:
> > 
> > $ oc patch infrastructure cluster --type json -p '[{"op": "add", "path":
> > "/status/platformStatus", "value": {"type": "VSphere", "vsphere": {}}}]'
> > 
> > or some such?
> 
> I am getting this error afterwards:
> 
> I0529 15:06:57.578864       1 template_controller.go:365] Error syncing
> controllerconfig machine-config-controller: failed to create MachineConfig
> for role master: failed to execute template: template:
> /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:
> 6:16: executing
> "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.
> yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
> I0529 15:06:57.582976       1 container_runtime_config_controller.go:369]
> Error syncing image config openshift-config: could not Create/Update
> MachineConfig: could not generate origin ContainerRuntime Configs:
> generateMachineConfigsforRole failed with error failed to execute template:
> template:
> /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:
> 6:16: executing
> "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.
> yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
> I0529 15:06:57.702248       1 kubelet_config_controller.go:313] Error
> syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with
> error failed to execute template: template:
> /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:
> 6:16: executing
> "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.
> yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
> 
> Seems like hes accepting the PlatforumStatus but a new nil pointer appeared?!


Well, I didn't check it again before applying the patch. I also get that error message without patching the PlatformStatus.
By the way, the cluster is currently trying to upgrade to 4.4.6. Maybe there was a change from 4.4.5 to 4.4.6 that makes this error message appear?

Comment 36 Yadan Pei 2020-06-01 10:49:33 UTC
Tried following upgrade path today: 4.1.41 -> 4.2.34 -> 4.3.0-0.nightly-2020-06-01-043839 -> 4.4.0-0.nightly-2020-06-01-021027

1. Check starting version
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.41    True        False         20m     Cluster version is 4.1.41
# oc get co
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.41    True        False         False      20m
cloud-credential                     4.1.41    True        False         False      41m
cluster-autoscaler                   4.1.41    True        False         False      41m
console                              4.1.41    True        False         False      29m
dns                                  4.1.41    True        False         False      40m
image-registry                       4.1.41    True        False         False      32m
ingress                              4.1.41    True        False         False      33m
kube-apiserver                       4.1.41    True        False         False      35m
kube-controller-manager              4.1.41    True        False         False      34m
kube-scheduler                       4.1.41    True        False         False      34m
machine-api                          4.1.41    True        False         False      40m
machine-config                       4.1.41    True        False         False      34m
marketplace                          4.1.41    True        False         False      32m
monitoring                           4.1.41    True        False         False      31m
network                              4.1.41    True        False         False      41m
node-tuning                          4.1.41    True        False         False      34m
openshift-apiserver                  4.1.41    True        False         False      34m
openshift-controller-manager         4.1.41    True        False         False      35m
openshift-samples                    4.1.41    True        False         False      29m
operator-lifecycle-manager           4.1.41    True        False         False      36m
operator-lifecycle-manager-catalog   4.1.41    True        False         False      36m
service-ca                           4.1.41    True        False         False      40m
service-catalog-apiserver            4.1.41    True        False         False      34m
service-catalog-controller-manager   4.1.41    True        False         False      34m
storage                              4.1.41    True        False         False      33m


2. Update to 4.2.34
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.34    True        False         87s     Cluster version is 4.2.34
# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.34    True        False         False      62m
cloud-credential                           4.2.34    True        False         False      82m
cluster-autoscaler                         4.2.34    True        False         False      82m
console                                    4.2.34    True        False         False      8m
dns                                        4.2.34    True        False         False      82m
image-registry                             4.2.34    True        False         False      6m
ingress                                    4.2.34    True        False         False      74m
insights                                   4.2.34    True        False         False      31m
kube-apiserver                             4.2.34    True        False         False      77m
kube-controller-manager                    4.2.34    True        False         False      76m
kube-scheduler                             4.2.34    True        False         False      76m
machine-api                                4.2.34    True        False         False      82m
machine-config                             4.2.34    True        False         False      76m
marketplace                                4.2.34    True        False         False      3m39s
monitoring                                 4.2.34    True        False         False      5m34s
network                                    4.2.34    True        False         False      82m
node-tuning                                4.2.34    True        False         False      2m56s
openshift-apiserver                        4.2.34    True        False         False      2m58s
openshift-controller-manager               4.2.34    True        False         False      77m
openshift-samples                          4.2.34    True        False         False      31m
operator-lifecycle-manager                 4.2.34    True        False         False      78m
operator-lifecycle-manager-catalog         4.2.34    True        False         False      78m
operator-lifecycle-manager-packageserver   4.2.34    True        False         False      3m1s
service-ca                                 4.2.34    True        False         False      82m
service-catalog-apiserver                  4.2.34    True        False         False      75m
service-catalog-controller-manager         4.2.34    True        False         False      75m
storage                                    4.2.34    True        False         False      30m

3. Update to 4.3
# oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-06-01-043839  --force --allow-explicit-upgrade
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-06-01-043839

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-06-01-043839   True        False         2m46s   Cluster version is 4.3.0-0.nightly-2020-06-01-043839

# oc get node
NAME              STATUS   ROLES    AGE    VERSION
compute-0         Ready    worker   137m   v1.16.2
compute-1         Ready    worker   137m   v1.16.2
compute-2         Ready    worker   137m   v1.16.2
control-plane-0   Ready    master   137m   v1.16.2
control-plane-1   Ready    master   137m   v1.16.2
control-plane-2   Ready    master   137m   v1.16.2
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      123m
cloud-credential                           4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
cluster-autoscaler                         4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
console                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      18m
dns                                        4.3.0-0.nightly-2020-06-01-043839   True        False         False      142m
image-registry                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      24m
ingress                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      135m
insights                                   4.3.0-0.nightly-2020-06-01-043839   True        False         False      92m
kube-apiserver                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
kube-controller-manager                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
kube-scheduler                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
machine-api                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
machine-config                             4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
marketplace                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      23m
monitoring                                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      20m
network                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      143m
node-tuning                                4.3.0-0.nightly-2020-06-01-043839   True        False         False      13m
openshift-apiserver                        4.3.0-0.nightly-2020-06-01-043839   True        False         False      14m
openshift-controller-manager               4.3.0-0.nightly-2020-06-01-043839   True        False         False      137m
openshift-samples                          4.3.0-0.nightly-2020-06-01-043839   True        False         False      8m43s
operator-lifecycle-manager                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      138m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-06-01-043839   True        False         False      139m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-06-01-043839   True        False         False      19m
service-ca                                 4.3.0-0.nightly-2020-06-01-043839   True        False         False      142m
service-catalog-apiserver                  4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
service-catalog-controller-manager         4.3.0-0.nightly-2020-06-01-043839   True        False         False      136m
storage                                    4.3.0-0.nightly-2020-06-01-043839   True        False         False      51m


4. Update to the latest 4.4 nightly
# oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027 --force --allow-explicit-upgrade
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-06-01-043839   True        True          45m     Unable to apply 4.4.0-0.nightly-2020-06-01-021027: the cluster operator machine-config has not yet successfully rolled out

# oc describe co machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-06-01T07:35:53Z
  Generation:          1
  Resource Version:    107409
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 85d2801c-a3da-11ea-9656-0050568b03c3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-06-01T10:33:58Z
    Message:               Cluster not available for 4.4.0-0.nightly-2020-06-01-021027
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-06-01T10:26:25Z
    Message:               Working towards 4.4.0-0.nightly-2020-06-01-021027
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-06-01T10:33:58Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-06-01-021027: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: ControllerConfig has not completed: completed(false) running(false) failing(true)
    Reason:                MachineConfigControllerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-06-01T08:55:50Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.3.0-0.nightly-2020-06-01-043839
Events:       <none>

# oc logs -n openshift-machine-config-operator -f machine-config-controller-7488656c67-7phnq
I0601 10:46:37.401973       1 container_runtime_config_controller.go:369] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not generate origin ContainerRuntime Configs: generateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:37.688608       1 kubelet_config_controller.go:313] Error syncing kubeletconfig cluster: GenerateMachineConfigsforRole failed with error failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:38.027834       1 template_controller.go:365] Error syncing controllerconfig machine-config-controller: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml:6:16: executing "/etc/mcc/templates/common/vsphere/files/vsphere-NetworkManager-kni-conf.yaml" at <.Infra.Status>: nil pointer evaluating *v1.Infrastructure.Status
I0601 10:46:41.898894       1 render_controller.go:376] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)

5. Make sure the latest 4.4 nightly has the required fix:
# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-06-01-021027 --pullspecs | grep machine-config-operator
  machine-config-operator                        quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db68e2fe62120fb429c0127e4ea562316150612115d8fb2255f4bcaeaaca690f
# oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db68e2fe62120fb429c0127e4ea562316150612115d8fb2255f4bcaeaaca690f | grep commit
             io.openshift.build.commit.id=e3f4e2596eaf47a0081a4df04607eec9acd88e05
             io.openshift.build.commit.url=https://github.com/openshift/machine-config-operator/commit/e3f4e2596eaf47a0081a4df04607eec9acd88e05
# git log e3f4e2596eaf47a0081a4df04607eec9acd88e05 | grep '#1735'
    Merge pull request #1735 from jcpowermac/CP1728


Assigning back; you can also take a look at the cluster using the attached kubeconfig.

Comment 38 Joseph Callen 2020-06-03 12:54:28 UTC
Sorry for the inconvenience this has caused. The templates that we added to the MCO in version 4.4 were a precursor to enabling vSphere IPI. The checks on various variables were there to distinguish between UPI and IPI installations.
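To make the failure mode concrete, here is a small illustrative Go sketch (the struct names and template text are hypothetical stand-ins, not the actual MCO types or the real vsphere-NetworkManager-kni-conf.yaml template): an unguarded field chain fails with the same "nil pointer evaluating" error seen in the logs when `Status` is nil, while a `{{with}}`-guarded lookup degrades gracefully.

```go
// Illustrative sketch only: why a template fails at <.Infra.Status> on a
// cluster born in 4.1, and how a nil-safe {{with}} lookup avoids it.
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Hypothetical stand-ins for the Infrastructure object's status fields.
type PlatformStatus struct{ Type string }
type InfraStatus struct{ PlatformStatus *PlatformStatus }
type Infrastructure struct{ Status *InfraStatus }
type RenderConfig struct{ Infra Infrastructure }

// Unguarded access: errors when Status is nil, as in the MCC logs above.
const unguarded = `platform={{.Infra.Status.PlatformStatus.Type}}`

// Guarded access: {{with}} skips its body for nil pointers.
const guarded = `{{with .Infra.Status}}{{with .PlatformStatus}}platform={{.Type}}{{else}}platform=unset{{end}}{{else}}platform=unset{{end}}`

func render(text string, cfg RenderConfig) (string, error) {
	var b bytes.Buffer
	err := template.Must(template.New("kni").Parse(text)).Execute(&b, cfg)
	return b.String(), err
}

func main() {
	// A 4.1-era cluster upgraded in place can carry a nil Status, which
	// is what the "nil pointer evaluating *v1.Infrastructure.Status"
	// log lines reflect.
	old := RenderConfig{Infra: Infrastructure{Status: nil}}

	_, err := render(unguarded, old)
	fmt.Println("unguarded errors:", err != nil) // true

	out, _ := render(guarded, old)
	fmt.Println("guarded output:", out) // platform=unset
}
```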

In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26) to implement a CI job that upgrades a cluster on vSphere through the releases, in the hope that we would catch this kind of failure before a customer does.

Workaround:
1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'
2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller 
4.) confirm the above ^ is regenerated `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller`
5.) Then perform an update
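The steps above can be sketched as a single script (a sketch only, for a cluster that is already stuck on the 4.4 upgrade, assuming `oc` is logged in with cluster-admin; the 30-attempt wait loop is an arbitrary choice, not part of the documented workaround):

```shell
#!/usr/bin/env bash
# Sketch of the comment-38 workaround. Per comment 49, only run this on a
# cluster already stuck on the upgrade -- never proactively.
set -euo pipefail

CC=controllerconfigs.machineconfiguration.openshift.io

# 1. Ensure /status/platformStatus carries the VSphere type.
oc patch infrastructure cluster --type json \
  -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'

# 2. Back up the current ControllerConfig before deleting it.
oc get "$CC" machine-config-controller -o yaml > mcc.yaml

# 3. Delete it so the machine-config-operator regenerates it.
oc delete "$CC" machine-config-controller

# 4. Wait for the regenerated object (took about a minute in comment 41).
for _ in $(seq 1 30); do
  if oc get "$CC" machine-config-controller >/dev/null 2>&1; then
    echo "ControllerConfig regenerated; resume the upgrade."
    exit 0
  fi
  sleep 10
done
echo "ControllerConfig was not regenerated; investigate before upgrading." >&2
exit 1
```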

Comment 39 Andreas Söhnlein 2020-06-03 14:52:23 UTC
(In reply to Joseph Callen from comment #38)
> Sorry for the inconvenience this has caused. The templates that we added to
> MCO in version 4.4 were the precursor to enabling vSphere IPI. The check on
> various variables was to ensure the difference between UPI and IPI.
> 
> In our backlog we have a story (https://issues.redhat.com/browse/SPLAT-26)
> to implement a CI job that will upgrade a cluster on vSphere through the
> releases in the hope that we would catch this failure before a customer does.
> 
> Workaround:
> 1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path":
> "/status/platformStatus", "value": {"type": "VSphere"}}]'
> 2.) oc get controllerconfigs.machineconfiguration.openshift.io
> machine-config-controller -o yaml > mcc.yaml
> 3.) oc delete controllerconfigs.machineconfiguration.openshift.io
> machine-config-controller 
> 4.) confirm the above ^ is regenerated `oc get
> controllerconfigs.machineconfiguration.openshift.io
> machine-config-controller`
> 5.) Then perform an update

Hello Joseph,

thanks for your response.
I just managed to successfully upgrade our cluster to 4.4.6 without failure.
The only issue that remains on my side is that the cluster complains about "unhealthy etcd members generated from openshift-cluster-etcd-operator-etcd-client" after the etcd-operator hit this issue:
"Operation cannot be fulfilled on etcds.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again"

# oc logs etcd-operator-dd8898d94-gcdmx

I0603 14:50:41.896787       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://172.16.11.234:2379 0  <nil>} {https://172.16.11.230:2379 0  <nil>} {https://172.16.11.233:2379 0  <nil>} {https://172.16.11.232:2379 0  <nil>} {https://172.16.11.231:2379 0  <nil>}]
I0603 14:50:41.922569       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922768       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922778       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922785       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:41.922791       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.907107       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0603 14:50:44.908266       1 client.go:361] parsed scheme: "endpoint"
I0603 14:50:44.908327       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://172.16.11.231:2379 0  <nil>} {https://172.16.11.232:2379 0  <nil>} {https://172.16.11.230:2379 0  <nil>} {https://172.16.11.233:2379 0  <nil>} {https://172.16.11.234:2379 0  <nil>}]
I0603 14:50:44.927584       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.933897       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935793       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935804       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:44.935834       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0603 14:50:47.917208       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap

I don't know if it has anything to do with this bug, but it did not appear before the upgrade.


Cheers

Comment 40 Joseph Callen 2020-06-03 16:15:37 UTC
Hi Andreas, 

Please open a new BZ for that issue or contact customer support; I think that would be the best way forward for an issue not related to this BZ's subject.

Thanks!

Comment 41 Chet Hosey 2020-06-04 08:34:41 UTC
Thanks! FYI, this workaround completed on my cluster, which was already hung on the 4.3 -> 4.4 upgrade.

1.) oc patch infrastructure cluster --type json -p '[{"op": "add", "path": "/status/platformStatus", "value": {"type": "VSphere"}}]'

This was a no-op, since the field had already been set to work around a 4.1 -> 4.2 issue.

2.) oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller -o yaml > mcc.yaml
3.) oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller 

Done and done.

4.) confirm the above ^ is regenerated `oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller`

This took about a minute to regenerate.

5.) Then perform an update

My 4.3 -> 4.4 upgrade was already in progress. Does this mean I can run these steps before starting the upgrade to 4.4 to avoid the stoppage?

Comment 46 Chet Hosey 2020-06-10 05:19:59 UTC
It seems that `oc delete controllerconfigs.machineconfiguration.openshift.io machine-config-controller` was the wrong thing to do on a 4.3 cluster.

I ran it on my production cluster in preparation for the move to 4.4. Several nodes are now in NotReady state, and I see new machine CSRs. Upon approving one, I saw a new entry in `oc get nodes` listed as "localhost.localdomain". After approving that machine's CSR, the new entry progressed to "Ready" status, while the original node entry is still showing NotReady.

With 4.3 out of full support, I'm trying to stay on a current configuration. It would help a lot if Red Hat could test upgrades before dropping full support for the (n-1) point release.

Comment 49 Scott Dodson 2020-06-10 16:38:27 UTC
The workaround in comment 38 is only relevant to a cluster that is already stuck on an upgrade and should not be applied proactively. vSphere UPI clusters installed at 4.1 and currently running 4.3 should delay upgrading to 4.4 until this bug has been resolved.

In general please engage with support rather than directly via bugzilla as they'll have the most context regarding your specific environment and requirements.

Comment 50 Chet Hosey 2020-06-10 16:42:33 UTC
I have been engaged with support. Here's an excerpt from that conversation:

> >> Does this mean I can run these steps before starting the upgrade to 4.4 to avoid the stoppage?
> 
> As of now, the engineering team was able to test this workaround before initiating an upgrade only. 
> So yes, if you have any different cluster where an upgrade is due then you can apply this workaround prior to the upgrade.

Comment 53 errata-xmlrpc 2020-06-17 22:26:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2445
