Bug 2044745 - Upgrading cluster from 4.9 to 4.10 on Azure (ARO) causes the cloud-network-config-controller pod to CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Ben Bennett
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-25 06:16 UTC by David Newman
Modified: 2022-03-10 16:42 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:42:18 UTC
Target Upstream Version:
Embargoed:


Attachments
Do not panic when no NIC properties (Primary) are present and gracefully handle when no NICs returned (1.91 KB, patch)
2022-01-25 06:16 UTC, David Newman


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift cloud-network-config-controller pull 23 | 0 | None | Merged | Bug 2044745: Add check on `NetworkInterfaceReferenceProperties` for Azure | 2022-02-02 23:19:10 UTC
Red Hat Product Errata RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:42:34 UTC

Description David Newman 2022-01-25 06:16:20 UTC
Created attachment 1853245
Do not panic when no NIC properties (Primary) are present and gracefully handle when no NICs returned

Description of problem:
When upgrading from 4.9 to 4.10 on Azure (ARO) the cloud-network-config-controller goes into a CrashLoopBackOff state during cluster upgrade.

$ oc get all -n openshift-cloud-network-config-controller
NAME                                                  READY   STATUS             RESTARTS       AGE
pod/cloud-network-config-controller-b57fdcd98-4l8qv   0/1     CrashLoopBackOff   53 (30s ago)   3h51m

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cloud-network-config-controller   0/1     1            0           23h

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/cloud-network-config-controller-b57fdcd98   1         1         0       23h

The cause is a panic: a nil pointer dereference in (*Azure).getNetworkInterfaces (azure.go:286) when handling the master subnet NICs:
W0124 08:12:49.912680       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0124 08:12:49.913767       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I0124 08:12:49.962711       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0124 08:12:49.963517       1 controller.go:88] Starting node controller
I0124 08:12:49.963534       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0124 08:12:49.963631       1 controller.go:88] Starting cloud-private-ip-config controller
I0124 08:12:49.963647       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0124 08:12:49.963823       1 controller.go:88] Starting secret controller
I0124 08:12:49.963839       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0124 08:12:49.973212       1 controller.go:182] Assigning key: dnewman-test-zmsx5-master-0 to node workqueue
I0124 08:12:49.973314       1 controller.go:182] Assigning key: dnewman-test-zmsx5-master-1 to node workqueue
I0124 08:12:49.973399       1 controller.go:182] Assigning key: dnewman-test-zmsx5-master-2 to node workqueue
I0124 08:12:49.973440       1 controller.go:182] Assigning key: dnewman-test-zmsx5-worker-eastus1-q2j8g to node workqueue
I0124 08:12:49.973722       1 controller.go:182] Assigning key: dnewman-test-zmsx5-worker-eastus2-4t9p8 to node workqueue
I0124 08:12:49.973789       1 controller.go:182] Assigning key: dnewman-test-zmsx5-worker-eastus3-8zzns to node workqueue
I0124 08:12:50.063834       1 controller.go:96] Starting node workers
I0124 08:12:50.063872       1 controller.go:96] Starting cloud-private-ip-config workers
I0124 08:12:50.063924       1 controller.go:102] Started cloud-private-ip-config workers
I0124 08:12:50.064017       1 controller.go:102] Started node workers
I0124 08:12:50.064175       1 controller.go:96] Starting secret workers
I0124 08:12:50.064208       1 controller.go:102] Started secret workers
E0124 08:12:50.359395       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 120 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1f0a840, 0x380ef40})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1f0a840, 0x380ef40})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).getNetworkInterfaces(0xc0005c3400, 0xc0002b7900)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:286 +0xb5
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).GetNodeEgressIPConfiguration(0x1ed69e0, 0xc00011c360)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:171 +0x4c
github.com/openshift/cloud-network-config-controller/pkg/controller/node.(*NodeController).SyncHandler(0xc0005bc500, {0xc00011c360, 0x1b})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/node/node_controller.go:93 +0x150
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc0001147e0, {0x1d3fdc0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x126
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc0001147e0)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f183afef858)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x2615620, 0xc00059c0c0}, 0x1, 0xc0005d0360)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x398
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1949db5]

goroutine 120 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1f0a840, 0x380ef40})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).getNetworkInterfaces(0xc0005c3400, 0xc0002b7900)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:286 +0xb5
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).GetNodeEgressIPConfiguration(0x1ed69e0, 0xc00011c360)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:171 +0x4c
github.com/openshift/cloud-network-config-controller/pkg/controller/node.(*NodeController).SyncHandler(0xc0005bc500, {0xc00011c360, 0x1b})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/node/node_controller.go:93 +0x150
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc0001147e0, {0x1d3fdc0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x126
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc0001147e0)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f183afef858)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x2615620, 0xc00059c0c0}, 0x1, 0xc0005d0360)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x398
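
The trace appears twice because the crash handler first logs the panic (the runtime.go:74 frame) and then raises it again (the runtime.go:55 frame), so a failure in a single worker goroutine ends the whole process and the pod restarts. The following is a minimal standard-library sketch of that log-then-re-panic pattern. The names handleCrash, syncNode and panicEscapes are hypothetical stand-ins for illustration, not the controller's or the library's code.

```go
package main

import (
	"fmt"
	"log"
)

// handleCrash is a hypothetical stand-in for a crash handler: it logs a
// recovered panic and then raises it again, so the panic is not swallowed.
func handleCrash() {
	if r := recover(); r != nil {
		log.Printf("Observed a panic: %v", r)
		panic(r)
	}
}

// syncNode simulates a worker step that dereferences a pointer that may be nil.
func syncNode(primary *bool) bool {
	defer handleCrash()
	return *primary
}

// panicEscapes reports whether a panic propagated out of fn.
func panicEscapes(fn func()) (escaped bool) {
	defer func() {
		if recover() != nil {
			escaped = true
		}
	}()
	fn()
	return false
}

func main() {
	escaped := panicEscapes(func() { syncNode(nil) })
	fmt.Println("panic escaped the crash handler:", escaped)
}
```

In a real process nothing plays the role of panicEscapes around the worker, so the re-raised panic terminates the program; the wrapper exists here only so the sketch can report the outcome.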


Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.2   True        False         22h     Cluster version is 4.10.0-fc.2

$ oc get node
NAME                                      STATUS   ROLES    AGE   VERSION
dnewman-test-zmsx5-master-0               Ready    master   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-master-1               Ready    master   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-master-2               Ready    master   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-worker-eastus1-q2j8g   Ready    worker   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-worker-eastus2-4t9p8   Ready    worker   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-worker-eastus3-8zzns   Ready    worker   28h   v1.23.0+60f5a1c

How reproducible:
Upgrading a cluster from 4.9 to 4.10 on Azure (ARO).

Steps to Reproduce:
1. Upgrade from 4.9 to 4.10 on Azure (ARO).

Actual results:
Cluster upgrade does not finalise, due to the '"openshift-cloud-network-config-controller/cloud-network-config-controller" is not available' error status in the network operator.

Expected results:
Cluster upgrade succeeds.

Additional info:
I have a patch to circumvent the panic, which I've attached. If the patch is deemed to be a suitable path forward, then I will create a PR for it.
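
As a rough illustration of the kind of guard the attachment title describes (not the attached patch itself), the sketch below checks the NIC reference properties for nil before reading Primary and returns an error when no NICs are returned. The two struct types are local, hypothetical stand-ins shaped like the NetworkInterfaceReferenceProperties named in the linked PR title; primaryNICID, errNoNICs and the fall-back-to-first-NIC behaviour are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the SDK types; only the fields needed here.
type NetworkInterfaceReferenceProperties struct {
	Primary *bool
}

type NetworkInterfaceReference struct {
	// Embedded pointer: reading ref.Primary panics when this is nil.
	*NetworkInterfaceReferenceProperties
	ID *string
}

// errNoNICs is an assumed sentinel error for "nothing usable was returned".
var errNoNICs = errors.New("no network interfaces found for node")

// primaryNICID returns the ID of the NIC flagged as primary. If no NIC is
// flagged (for example because the properties are absent), it falls back to
// the first NIC that has an ID. The fallback is an assumption of this sketch.
func primaryNICID(refs []NetworkInterfaceReference) (string, error) {
	if len(refs) == 0 {
		return "", errNoNICs
	}
	var fallback *string
	for _, ref := range refs {
		if ref.ID == nil {
			continue
		}
		if fallback == nil {
			fallback = ref.ID
		}
		// Guard the embedded properties pointer and the Primary pointer
		// before dereferencing either of them.
		if ref.NetworkInterfaceReferenceProperties != nil && ref.Primary != nil && *ref.Primary {
			return *ref.ID, nil
		}
	}
	if fallback == nil {
		return "", errNoNICs
	}
	return *fallback, nil
}

func main() {
	id := "nic-without-properties"
	got, err := primaryNICID([]NetworkInterfaceReference{{ID: &id}})
	fmt.Println(got, err)
}
```

Returning an error instead of panicking lets the caller log the problem and requeue the node rather than crash the whole controller.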

Comment 1 Tim Rozet 2022-01-25 14:32:12 UTC
@dnewman can you please push your PR ASAP?

Comment 2 Alexander Constantinescu 2022-01-25 14:32:43 UTC
Your patch looks good. Please file it and I will review it.

It's really strange that there are instances that __do not__ have NICs on Azure though...

Comment 3 Alexander Constantinescu 2022-01-25 15:31:14 UTC
Filed a PR myself, since time is of the essence this close to code freeze.

Comment 4 Alexander Constantinescu 2022-01-25 16:07:07 UTC
Updated the QE Contact: zzhao could not be matched to a linked GitHub user account, so the bot refused to update the BZ.

Comment 5 Leandro Rebosio 2022-01-26 14:43:29 UTC
I am monitoring this issue because our cluster is affected too. I have nothing to add, as my output is the same as David's.

Comment 11 W. Trevor King 2022-02-02 23:20:58 UTC
I'm dropping UpgradeBlocker from this bug without requesting an impact statement, because UpgradeBlocker is about "do we want to modify the update graph to protect folks from this issue?".  And since this is verified fixed in 4.10 before GA, no supported update edges would be impacted, and we won't need to make any modifications to the graph, even if an impact statement here had been "it's really terrible".

[1]: https://github.com/openshift/enhancements/blob/2911c46bf7d2f22eb1ab81739b4f9c2603fd0c07/enhancements/update/update-blocker-lifecycle/README.md

Comment 13 errata-xmlrpc 2022-03-10 16:42:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

