Created attachment 1853245 [details]
Do not panic when no NIC properties (Primary) are present and gracefully handle when no NICs returned

Description of problem:

When upgrading from 4.9 to 4.10 on Azure (ARO), the cloud-network-config-controller goes into a CrashLoopBackOff state during the cluster upgrade.

$ oc get all -n openshift-cloud-network-config-controller
NAME                                                  READY   STATUS             RESTARTS       AGE
pod/cloud-network-config-controller-b57fdcd98-4l8qv   0/1     CrashLoopBackOff   53 (30s ago)   3h51m

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cloud-network-config-controller   0/1     1            0           23h

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/cloud-network-config-controller-b57fdcd98   1         1         0       23h

The reason for this is a panic caused by a nil reference from the master subnet NICs:

W0124 08:12:49.912680       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0124 08:12:49.913767       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I0124 08:12:49.962711       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0124 08:12:49.963517       1 controller.go:88] Starting node controller
I0124 08:12:49.963534       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0124 08:12:49.963631       1 controller.go:88] Starting cloud-private-ip-config controller
I0124 08:12:49.963647       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0124 08:12:49.963823       1 controller.go:88] Starting secret controller
I0124 08:12:49.963839       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0124 08:12:49.973212       1 controller.go:182] Assigning key: dnewman-test-zmsx5-master-0 to node workqueue
I0124 08:12:49.973314       1 controller.go:182] Assigning key: dnewman-test-zmsx5-master-1 to node workqueue
I0124 08:12:49.973399       1 controller.go:182] Assigning key: dnewman-test-zmsx5-master-2 to node workqueue
I0124 08:12:49.973440       1 controller.go:182] Assigning key: dnewman-test-zmsx5-worker-eastus1-q2j8g to node workqueue
I0124 08:12:49.973722       1 controller.go:182] Assigning key: dnewman-test-zmsx5-worker-eastus2-4t9p8 to node workqueue
I0124 08:12:49.973789       1 controller.go:182] Assigning key: dnewman-test-zmsx5-worker-eastus3-8zzns to node workqueue
I0124 08:12:50.063834       1 controller.go:96] Starting node workers
I0124 08:12:50.063872       1 controller.go:96] Starting cloud-private-ip-config workers
I0124 08:12:50.063924       1 controller.go:102] Started cloud-private-ip-config workers
I0124 08:12:50.064017       1 controller.go:102] Started node workers
I0124 08:12:50.064175       1 controller.go:96] Starting secret workers
I0124 08:12:50.064208       1 controller.go:102] Started secret workers
E0124 08:12:50.359395       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 120 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1f0a840, 0x380ef40})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1f0a840, 0x380ef40})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).getNetworkInterfaces(0xc0005c3400, 0xc0002b7900)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:286 +0xb5
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).GetNodeEgressIPConfiguration(0x1ed69e0, 0xc00011c360)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:171 +0x4c
github.com/openshift/cloud-network-config-controller/pkg/controller/node.(*NodeController).SyncHandler(0xc0005bc500, {0xc00011c360, 0x1b})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/node/node_controller.go:93 +0x150
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc0001147e0, {0x1d3fdc0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x126
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc0001147e0)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f183afef858)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x2615620, 0xc00059c0c0}, 0x1, 0xc0005d0360)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x398
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1949db5]

goroutine 120 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1f0a840, 0x380ef40})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).getNetworkInterfaces(0xc0005c3400, 0xc0002b7900)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:286 +0xb5
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).GetNodeEgressIPConfiguration(0x1ed69e0, 0xc00011c360)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:171 +0x4c
github.com/openshift/cloud-network-config-controller/pkg/controller/node.(*NodeController).SyncHandler(0xc0005bc500, {0xc00011c360, 0x1b})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/node/node_controller.go:93 +0x150
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc0001147e0, {0x1d3fdc0, 0xc0001e0cc0})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x126
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc0001147e0)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f183afef858)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x2615620, 0xc00059c0c0}, 0x1, 0xc0005d0360)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x398

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.2   True        False         22h     Cluster version is 4.10.0-fc.2

$ oc get node
NAME                                      STATUS   ROLES    AGE   VERSION
dnewman-test-zmsx5-master-0               Ready    master   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-master-1               Ready    master   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-master-2               Ready    master   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-worker-eastus1-q2j8g   Ready    worker   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-worker-eastus2-4t9p8   Ready    worker   28h   v1.23.0+60f5a1c
dnewman-test-zmsx5-worker-eastus3-8zzns   Ready    worker   28h   v1.23.0+60f5a1c

How reproducible:

Upgrading a cluster from 4.9 to 4.10 on Azure (ARO).

Steps to Reproduce:
1. Upgrade from 4.9 to 4.10 on Azure (ARO).

Actual results:

Cluster upgrade does not finalise, due to the '"openshift-cloud-network-config-controller/cloud-network-config-controller" is not available' error status in the network operator.

Expected results:

Cluster upgrade succeeds.

Additional info:

I have a patch to circumvent the panic, which I've attached. If the patch is deemed a suitable path forward, I will create a PR for it.
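To illustrate the shape of the fix: the panic happens because the Azure SDK models optional fields as pointers, so an interface with no properties (or an IP configuration with no `Primary` flag) dereferences nil. The sketch below uses simplified, hypothetical stand-in types and a made-up `primaryNIC` helper (not the actual controller code or SDK types) to show the kind of nil-guarding the attached patch applies: skip NICs with missing properties and return an error instead of crashing when nothing usable remains.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the Azure SDK shapes involved in the panic.
// In the real SDK these optional fields are pointers and can all be nil.
type IPConfiguration struct {
	Primary *bool
}

type InterfacePropertiesFormat struct {
	IPConfigurations *[]IPConfiguration
}

type Interface struct {
	Name                      string
	InterfacePropertiesFormat *InterfacePropertiesFormat
}

// primaryNIC is a hypothetical helper sketching the defensive pattern:
// guard every pointer dereference and degrade to an error, never a panic.
func primaryNIC(nics []Interface) (*Interface, error) {
	for i := range nics {
		props := nics[i].InterfacePropertiesFormat
		if props == nil || props.IPConfigurations == nil {
			continue // NIC has no properties: skip it rather than dereference nil
		}
		for _, ipCfg := range *props.IPConfigurations {
			if ipCfg.Primary != nil && *ipCfg.Primary {
				return &nics[i], nil
			}
		}
	}
	// No NICs, or none with a primary IP configuration: report it gracefully.
	return nil, errors.New("no primary NIC found for instance")
}

func main() {
	yes := true
	nics := []Interface{
		{Name: "nic-broken"}, // nil properties: previously the SIGSEGV source
		{Name: "nic-ok", InterfacePropertiesFormat: &InterfacePropertiesFormat{
			IPConfigurations: &[]IPConfiguration{{Primary: &yes}},
		}},
	}
	if nic, err := primaryNIC(nics); err == nil {
		fmt.Println("found:", nic.Name)
	}
	if _, err := primaryNIC(nil); err != nil {
		fmt.Println("empty input:", err) // error, not a crash
	}
}
```

The key design point is that a worker goroutine in a controller must never panic on malformed cloud API responses; returning an error lets the workqueue retry with backoff instead of taking down the whole pod.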
@dnewman can you please push your PR ASAP?
Your patch looks good. Please file it and I will review. It's really strange that there are instances that __do not__ have NICs on Azure, though...
Filed a PR myself, since time is of the essence this close to code freeze.
Updated the QE Contact: zzhao couldn't be matched to a linked GitHub user account, so the bot refused to update the BZ.
I was just monitoring the issue since our cluster is affected too, nothing to add as my output is the same as David's.
I'm dropping UpgradeBlocker from this bug without requesting an impact statement, because UpgradeBlocker is about "do we want to modify the update graph to protect folks from this issue?" [1]. Since this is verified fixed in 4.10 before GA, no supported update edges are impacted and we won't need to make any modifications to the graph, even if an impact statement here had been "it's really terrible".

[1]: https://github.com/openshift/enhancements/blob/2911c46bf7d2f22eb1ab81739b4f9c2603fd0c07/enhancements/update/update-blocker-lifecycle/README.md
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056