I think this is something that the cloud team owns. Cloud team, sending it your way; let me know if that's not the case. Looking at the must-gather:

```
2022-06-28T16:29:58.474645585Z /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack/openstack_instances.go:178 +0x8e
2022-06-28T16:29:58.474645585Z k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle.ensureNodeExistsByProviderID(0x55fa968, 0xc0000520c0, 0x5646f58, 0xc000e18280, 0xc00a4b6000, 0xc00011b850, 0x1, 0x1)
2022-06-28T16:29:58.474645585Z /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go:238 +0xc8
2022-06-28T16:29:58.474645585Z k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle.(*CloudNodeLifecycleController).MonitorNodes(0xc0071b56d0)
2022-06-28T16:29:58.474645585Z /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go:146 +0x179
2022-06-28T16:29:58.474645585Z k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc005e6df90)
```
After a first investigation, the issue happens in ensureNodeExistsByProviderID(), which is only called when nodes are in NotReady status. Instances() returns an object with an empty compute client when the cloud provider is not initialized. Later, InstanceExistsByProviderID() tries to list servers using this nil client, and we get the panic.

It's not completely clear to me why you're getting `cannot initialize cloud provider, only limited functionality is available : cloud provider is not initialized: cannot initialize cloud provider using data from the secret: Authentication failed` errors from the cloud provider. We used to have a race condition during the initialization of the cloud provider (https://bugzilla.redhat.com/show_bug.cgi?id=2039373), however it was fixed in 4.9.21 and I see you're using 4.9.31.

We'll need to figure out why you're getting the `Authentication failed` error, fix whatever caused it, and also fix the panic when the cloud provider is not initialized.
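For illustration, here's a minimal, self-contained Go sketch of that failure mode. The types are toy stand-ins, not the vendored provider code: the compute client field is left nil when initialisation fails, and the first method that dereferences it panics.

```go
// Toy sketch of the failure mode described above (not the vendored code):
// a partially initialised Instances whose compute client is nil panics as
// soon as a method dereferences that client.
package main

import "fmt"

// computeClient stands in for the gophercloud compute ServiceClient.
type computeClient struct {
	endpoint string
}

func (c *computeClient) listServers() ([]string, error) {
	// c.endpoint dereferences the receiver; with c == nil this is the
	// nil pointer dereference that surfaces as the runtime panic.
	return nil, fmt.Errorf("no servers found at %s", c.endpoint)
}

// instances mirrors the shape of the provider's Instances object.
type instances struct {
	compute *computeClient // left nil when initialisation failed
}

func (i *instances) instanceExistsByProviderID(providerID string) (bool, error) {
	// No nil check here: this is the landmine KCM steps on for NotReady nodes.
	_, err := i.compute.listServers()
	return err == nil, err
}

func main() {
	i := &instances{} // compute is nil, as after a failed initialisation
	_, _ = i.instanceExistsByProviderID("openstack:///abc")
	// panic: runtime error: invalid memory address or nil pointer dereference
}
```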
I'm not reproducing this with a fresh install of 4.9:

```
moc-dev ❯ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.41    True        False         34m     Cluster version is 4.9.41

moc-dev ❯ oc get nodes
NAME                          STATUS     ROLES    AGE   VERSION
mandre-lpw9p-master-0         Ready      master   59m   v1.22.8+f34b40c
mandre-lpw9p-master-1         Ready      master   59m   v1.22.8+f34b40c
mandre-lpw9p-master-2         Ready      master   58m   v1.22.8+f34b40c
mandre-lpw9p-worker-0-45zbh   Ready      worker   43m   v1.22.8+f34b40c
mandre-lpw9p-worker-0-qbg5g   Ready      worker   44m   v1.22.8+f34b40c
mandre-lpw9p-worker-0-rmhrl   NotReady   worker   42m   v1.22.8+f34b40c

moc-dev ❯ oc get pods -n openshift-kube-controller-manager
NAME                                            READY   STATUS    RESTARTS      AGE
kube-controller-manager-mandre-lpw9p-master-0   4/4     Running   0             48m
kube-controller-manager-mandre-lpw9p-master-1   4/4     Running   3 (49m ago)   52m
kube-controller-manager-mandre-lpw9p-master-2   4/4     Running   0             49m
```

Logs from KCM:

```
I0707 07:33:03.696997 1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.007782123s. Last Ready is: &NodeCondition{Type:Ready,Status:True,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:56:15 +0000 UTC,Reason:KubeletReady,Message:kubelet is posting ready status,}
I0707 07:33:03.697199 1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.007998509s. Last MemoryPressure is: &NodeCondition{Type:MemoryPressure,Status:False,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:54:34 +0000 UTC,Reason:KubeletHasSufficientMemory,Message:kubelet has sufficient memory available,}
I0707 07:33:03.697259 1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.008042003s. Last DiskPressure is: &NodeCondition{Type:DiskPressure,Status:False,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:54:34 +0000 UTC,Reason:KubeletHasNoDiskPressure,Message:kubelet has no disk pressure,}
I0707 07:33:03.697283 1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.008084428s. Last PIDPressure is: &NodeCondition{Type:PIDPressure,Status:False,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:54:34 +0000 UTC,Reason:KubeletHasSufficientPID,Message:kubelet has sufficient PID available,}
I0707 07:33:03.856669 1 controller_utils.go:185] Recording status change NodeNotReady event message for node mandre-lpw9p-worker-0-rmhrl
I0707 07:33:03.856793 1 controller_utils.go:122] Update ready status of pods on node [mandre-lpw9p-worker-0-rmhrl]
I0707 07:33:03.856896 1 controller_utils.go:140] Updating ready status of pod network-check-target-642wb to false
I0707 07:33:03.857424 1 event.go:291] "Event occurred" object="mandre-lpw9p-worker-0-rmhrl" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node mandre-lpw9p-worker-0-rmhrl status is now: NodeNotReady"
```

A bit of a wild guess, but I noticed https://github.com/openshift/kubernetes/pull/1267/, which updates kubelet's secret and configmap handling and made it into 4.9.35. I wonder if that's our fix. I'll try again with 4.9.31.
Let's ignore the authentication warning for now, as we have no direct evidence that it's related to the panic. We can come back to it if it turns out to be related. Do we have a stack trace from the panic? Is there any additional indication of what 'runtime.go:78' is?
Concur with Martin's analysis: we're returning an Instances with an empty compute client here after ensureCloudProviderWasInitialized() was called: https://github.com/openshift/kubernetes/blob/f34b40c88df901a828fd50da9c0ccdcae7d72b10/staging/src/k8s.io/legacy-cloud-providers/openstack/openstack_instances.go#L51-L59

Setting aside the question of why that call failed, this is a landmine: we should be returning nil for Instances there. Incidentally, the scary reference to deleteURL() in the gophercloud stack trace is nothing to worry about: its implementation is simply shared with Get().
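For what it's worth, here is a hedged sketch of what "returning nil for Instances" could look like, using toy types rather than the actual openstack_instances.go code (the real function's signature and fields may differ):

```go
// Toy sketch of the behaviour suggested above: when the provider is not
// initialised, Instances() returns (nil, false) so callers see "no Instances
// support" instead of receiving a partially initialised object.
// Names are illustrative, not the vendored code.
package main

import (
	"errors"
	"fmt"
)

type computeClient struct{ endpoint string }

type instances struct{ compute *computeClient }

type openStackProvider struct{ initialized bool }

func (p *openStackProvider) ensureCloudProviderWasInitialized() error {
	if !p.initialized {
		return errors.New("cloud provider is not initialized")
	}
	return nil
}

// Instances reports false rather than handing back a half-built object.
func (p *openStackProvider) Instances() (*instances, bool) {
	if err := p.ensureCloudProviderWasInitialized(); err != nil {
		return nil, false
	}
	return &instances{compute: &computeClient{endpoint: "https://nova.example"}}, true
}

func main() {
	p := &openStackProvider{initialized: false}
	if _, ok := p.Instances(); !ok {
		// The caller can handle the unsupported case instead of panicking later.
		fmt.Println("instances interface not supported")
	}
}
```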
This behaviour is downstream-only. If upstream fails to authenticate, it returns a nil Instances. Downstream, we return a partially initialised Instances where any attempt to call a method requiring the compute client will panic.

The cloud provider is primarily used in two contexts: kubelet and KCM. KCM makes a call to cloudprovider.SetInformers(), which we use to initialise a secretLister and subsequently read cloud credentials from a secret. Kubelet does not make this call, so when running in kubelet the openstack cloud provider never gets cloud credentials. They must either be set in cloud.conf, which we do not do for security reasons, or we must be able to function using only Nova metadata. With these constraints, openstack.Instances() must return a partially functional cloudprovider.Instances in order to be functional at all, hence returning the partially initialised Instances. Presumably it just so happens that kubelet never calls any of the methods which will panic.

However, KCM expects to be able to call all methods. If we fail to get a provider for some reason other than "we're kubelet and we have no access to secrets", then we will call methods which panic with a partially initialised Instances.

Upstream does not return an Instances object at all if the provider is uninitialised, which would mean that kubelet cannot function at all without an initialised provider. This presumably means one of:

* kubelet never gets a functional cloud provider
* config is passed in cloud.conf
* [Global]/kubeconfig-path in cloud.conf specifies a local kubeconfig which kubelet can use to fetch the secret

The first seems unlikely, so I suspect a combination of the latter two in practice.

The primary purpose of the above is to explain that this code lives in a non-trivial amount of context, and changes to it would need to be considered and well tested. I think it is safe to say we would not design it this way if we were starting again. Fortunately, we are starting again! This code is obsolete in 4.12, when it will be replaced by external CCM. Given that:

* The code is downstream-only
* The code is obsolete
* The code is frozen, and even small changes to it are usually rejected
* The code does not have as good test coverage as its replacement
* The potential for regressions is high
* The impact of a regression is high

I am not inclined to make the structural changes this code probably requires. Consequently, I'm going to submit a hopefully minimal patch to simply make it not panic in this situation.
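As a rough illustration of that direction, here is a minimal sketch assuming the fix is simply a nil check in the methods that dereference the compute client. The types and names are toy stand-ins; this is an assumption about the shape of the fix, not the submitted patch.

```go
// Toy sketch of the "make it not panic" direction: guard the methods that
// need the compute client and return an error instead of dereferencing nil.
package main

import (
	"errors"
	"fmt"
)

type computeClient struct{ endpoint string }

type instances struct {
	compute *computeClient // nil when the cloud provider failed to initialise
}

var errNotInitialized = errors.New("cloud provider is not initialized: compute client is nil")

func (i *instances) instanceExistsByProviderID(providerID string) (bool, error) {
	if i.compute == nil {
		// Fail cleanly; KCM's node lifecycle controller can log the error
		// instead of crashing on a nil pointer dereference.
		return false, errNotInitialized
	}
	// ... the real implementation would query Nova here ...
	return true, nil
}

func main() {
	i := &instances{} // initialisation failed, compute left nil
	exists, err := i.instanceExistsByProviderID("openstack:///abc")
	fmt.Println(exists, err) // false, "cloud provider is not initialized..." -- no panic
}
```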
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399