Bug 2102383 - Kube controllers crash when nodes are shut off in OpenStack
Summary: Kube controllers crash when nodes are shut off in OpenStack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Matthew Booth
QA Contact: Jon Uriarte
Docs Contact: Jeana Routh
URL:
Whiteboard:
Depends On:
Blocks: 2109270
Reported: 2022-06-29 20:03 UTC by bshaw
Modified: 2023-01-17 19:51 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, a bug in the {rh-openstack} legacy cloud provider resulted in a crash if certain {rh-openstack} operations were attempted after authentication had failed. For example, shutting down a server caused the Kubernetes controller manager to fetch server information from {rh-openstack}, which triggered this bug. As a result, if initial cloud authentication failed or was configured incorrectly, shutting down a server caused the Kubernetes controller manager to crash. With this release, the {rh-openstack} legacy cloud provider is updated so that it does not attempt any {rh-openstack} API calls unless it has previously authenticated successfully. Now, shutting down a server when the cloud credentials are invalid no longer causes the Kubernetes controller manager to crash. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2102383[*BZ#2102383*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:51:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 1315 0 None Merged Bug 2102383: UPSTREAM: 89885: Fix panic in openstack.InstanceExistsByProviderID() 2022-08-11 11:19:33 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:51:38 UTC

Comment 2 ravig 2022-07-06 02:38:21 UTC
I think this is something that the cloud team owns. Cloud team - sending it your way; let me know if that is not the case. Looking at the must-gather:

```
2022-06-28T16:29:58.474645585Z 	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack/openstack_instances.go:178 +0x8e
2022-06-28T16:29:58.474645585Z k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle.ensureNodeExistsByProviderID(0x55fa968, 0xc0000520c0, 0x5646f58, 0xc000e18280, 0xc00a4b6000, 0xc00011b850, 0x1, 0x1)
2022-06-28T16:29:58.474645585Z 	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go:238 +0xc8
2022-06-28T16:29:58.474645585Z k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle.(*CloudNodeLifecycleController).MonitorNodes(0xc0071b56d0)
2022-06-28T16:29:58.474645585Z 	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go:146 +0x179
2022-06-28T16:29:58.474645585Z k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc005e6df90)
```

Comment 3 Martin André 2022-07-06 10:45:21 UTC
After a first investigation: the issue happens in ensureNodeExistsByProviderID(), which is only called when nodes are in NotReady status.

Instances() returns an object with an empty compute client when the cloud provider is not initialized. Later, InstanceExistsByProviderID() tries to list servers using this nil client, and we get the panic.
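
For illustration, here is a minimal, standalone sketch of that failure shape. The types are stand-ins rather than the real gophercloud or cloud-provider types; only the structure (an Instances handed out with a nil compute client) mirrors the vendored code:

```go
package main

import "fmt"

// computeClient stands in for *gophercloud.ServiceClient (hypothetical stand-in).
type computeClient struct{ endpoint string }

// instances mirrors the shape of the provider's Instances object: it can be
// handed out even though authentication failed, leaving compute nil.
type instances struct{ compute *computeClient }

func (i *instances) instanceExistsByProviderID(providerID string) (bool, error) {
	// The real provider panics at the equivalent point: looking up the server
	// dereferences the nil compute client.
	fmt.Println("querying", i.compute.endpoint, "for", providerID)
	return true, nil
}

func main() {
	// What Instances() effectively returns after a failed initialisation.
	i := &instances{compute: nil}
	_, _ = i.instanceExistsByProviderID("openstack:///abc123")
	fmt.Println("not reached: the call above panics with a nil pointer dereference")
}
```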

It's not completely clear to me why you're getting `cannot initialize cloud provider, only limited functionality is available : cloud provider is not initialized: cannot initialize cloud provider using data from the secret: Authentication failed` errors from the cloud provider. We used to have a race condition during the initialization of the cloud provider (https://bugzilla.redhat.com/show_bug.cgi?id=2039373); however, it was fixed in 4.9.21 and I see you're using 4.9.31.

We'll need to figure out why you're getting the "Authentication failed" error, fix whatever caused it, and also fix the panic when the cloud provider is not initialized.

Comment 6 Martin André 2022-07-07 08:21:01 UTC
I can't reproduce this with a fresh install of 4.9.

moc-dev ❯ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.41    True        False         34m     Cluster version is 4.9.41

moc-dev ❯ oc get nodes
NAME                          STATUS     ROLES    AGE   VERSION
mandre-lpw9p-master-0         Ready      master   59m   v1.22.8+f34b40c
mandre-lpw9p-master-1         Ready      master   59m   v1.22.8+f34b40c
mandre-lpw9p-master-2         Ready      master   58m   v1.22.8+f34b40c
mandre-lpw9p-worker-0-45zbh   Ready      worker   43m   v1.22.8+f34b40c
mandre-lpw9p-worker-0-qbg5g   Ready      worker   44m   v1.22.8+f34b40c
mandre-lpw9p-worker-0-rmhrl   NotReady   worker   42m   v1.22.8+f34b40c

moc-dev ❯ oc get pods -n openshift-kube-controller-manager                                          
NAME                                            READY   STATUS      RESTARTS      AGE
kube-controller-manager-mandre-lpw9p-master-0   4/4     Running     0             48m
kube-controller-manager-mandre-lpw9p-master-1   4/4     Running     3 (49m ago)   52m
kube-controller-manager-mandre-lpw9p-master-2   4/4     Running     0             49m

Logs from KCM:
I0707 07:33:03.696997       1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.007782123s. Last Ready is: &NodeCondition{Type:Ready,Status:True,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:56:15 +0000 UTC,Reason:KubeletReady,Message:kubelet is posting ready status,}
I0707 07:33:03.697199       1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.007998509s. Last MemoryPressure is: &NodeCondition{Type:MemoryPressure,Status:False,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:54:34 +0000 UTC,Reason:KubeletHasSufficientMemory,Message:kubelet has sufficient memory available,}
I0707 07:33:03.697259       1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.008042003s. Last DiskPressure is: &NodeCondition{Type:DiskPressure,Status:False,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:54:34 +0000 UTC,Reason:KubeletHasNoDiskPressure,Message:kubelet has no disk pressure,}
I0707 07:33:03.697283       1 node_lifecycle_controller.go:1092] node mandre-lpw9p-worker-0-rmhrl hasn't been updated for 40.008084428s. Last PIDPressure is: &NodeCondition{Type:PIDPressure,Status:False,LastHeartbeatTime:2022-07-07 07:31:28 +0000 UTC,LastTransitionTime:2022-07-07 06:54:34 +0000 UTC,Reason:KubeletHasSufficientPID,Message:kubelet has sufficient PID available,}
I0707 07:33:03.856669       1 controller_utils.go:185] Recording status change NodeNotReady event message for node mandre-lpw9p-worker-0-rmhrl
I0707 07:33:03.856793       1 controller_utils.go:122] Update ready status of pods on node [mandre-lpw9p-worker-0-rmhrl]
I0707 07:33:03.856896       1 controller_utils.go:140] Updating ready status of pod network-check-target-642wb to false
I0707 07:33:03.857424       1 event.go:291] "Event occurred" object="mandre-lpw9p-worker-0-rmhrl" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node mandre-lpw9p-worker-0-rmhrl status is now: NodeNotReady"

A bit of a wild guess, but I noticed https://github.com/openshift/kubernetes/pull/1267/, which updates kubelet's secret and configmap handling and made it into 4.9.35. I wonder if that's our fix. I'll try again with 4.9.31.

Comment 10 Matthew Booth 2022-07-07 20:31:02 UTC
Let's ignore the authentication warning for now, as we have no direct evidence that it's related to the panic. We can come back to it if it turns out to be related.

Do we have a stack trace from the panic? Is there any additional indication of what 'runtime.go:78' is?

Comment 13 Matthew Booth 2022-07-08 11:29:34 UTC
Concur with Martin's analysis:

We're returning an Instances with an empty compute client here after ensureCloudProviderWasInitialized() was called: https://github.com/openshift/kubernetes/blob/f34b40c88df901a828fd50da9c0ccdcae7d72b10/staging/src/k8s.io/legacy-cloud-providers/openstack/openstack_instances.go#L51-L59

Aside from why this call failed, this is a landmine. We should be returning nil for Instances there.
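
As a sketch of the alternative suggested above (again with stand-in types, not a patch against the real provider), the safer shape is to report the Instances interface as unsupported rather than hand back a half-built object:

```go
package main

import "fmt"

// Hypothetical stand-ins; not the real cloudprovider or gophercloud types.
type computeClient struct{ endpoint string }

type cloudInstances interface {
	InstanceExistsByProviderID(providerID string) (bool, error)
}

type instancesImpl struct{ compute *computeClient }

func (i *instancesImpl) InstanceExistsByProviderID(providerID string) (bool, error) {
	_ = i.compute.endpoint // safe: only reachable with a non-nil client
	return true, nil
}

type openStack struct{ compute *computeClient } // nil until authentication succeeds

// The suggestion above: if the provider never initialised, return (nil, false)
// so callers see "unsupported" instead of a partially initialised Instances.
func (os *openStack) Instances() (cloudInstances, bool) {
	if os.compute == nil {
		return nil, false
	}
	return &instancesImpl{compute: os.compute}, true
}

func main() {
	provider := &openStack{} // authentication failed, compute never set
	if _, ok := provider.Instances(); !ok {
		fmt.Println("Instances() unsupported: cloud provider not initialised")
	}
}
```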

Incidentally, the scary reference to deleteURL() in the gophercloud stack trace is nothing to worry about: the implementation is just shared with Get().

Comment 14 Matthew Booth 2022-07-08 19:50:06 UTC
This behaviour is downstream-only. If upstream fails to authenticate it returns a nil Instances. Downstream we return a partially initialised Instances where any attempt to call a method requiring the compute client will panic.

The cloud provider is primarily used in two contexts: kubelet and KCM. KCM makes a call to cloudprovider.SetInformers(), which we use to initialise a secretLister and subsequently read cloud credentials from a secret. Kubelet does not make this call, so when running in kubelet the openstack cloud provider never gets cloud credentials. They must either be set in cloud.conf, which we do not do for security reasons, or we must be able to function using only Nova metadata. With these constraints, openstack.Instances() must return a partially functional cloudprovider.Instances() in order to be functional at all, hence the partially initialised Instances. Presumably it just so happens that kubelet never calls any of the methods which will panic.
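
For anyone unfamiliar with the mechanism, this is roughly how the informer hook works (a hedged sketch following the k8s.io/cloud-provider and client-go signatures, not the actual openstack provider code):

```go
// Sketch only: a provider that implements the optional InformerUser hook gets
// a shared informer factory from KCM and builds a secret lister from it.
// Kubelet never calls SetInformers, so under kubelet the lister stays nil and
// credentials can never be read from the secret.
package sketch

import (
	"k8s.io/client-go/informers"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

type openStack struct {
	secretLister listersv1.SecretLister // nil when running under kubelet
}

// SetInformers is the hook KCM calls on providers implementing cloudprovider.InformerUser.
func (os *openStack) SetInformers(factory informers.SharedInformerFactory) {
	os.secretLister = factory.Core().V1().Secrets().Lister()
}
```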

However, KCM expects to be able to call all methods. If we fail to get a provider for some reason other than "we're kubelet and we have no access to secrets" then we will call methods which panic with a partially initialised Instances.

Upstream does not return an Instances object at all if the provider is uninitialised, which means that kubelet cannot function at all without an initialised provider. This presumably means one of:
* kubelet never gets a functional cloud provider
* config is passed in cloud.conf
* [Global]/kubeconfig-path in cloud.conf specifies a local kubeconfig which kubelet can use to fetch the secret

The former seems unlikely, so I suspect a combination of the latter two in practice; a rough illustration of such a cloud.conf follows.
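
Illustrative only: the values below are made up and not taken from this cluster. kubeconfig-path is the option named above, and the remaining keys are the usual legacy-provider [Global] authentication settings.

```
# Alternative 1: credentials passed directly in cloud.conf
[Global]
auth-url = https://openstack.example.com:13000/v3
username = kcm-user
password = REDACTED
tenant-name = openshift
domain-name = Default
region = regionOne

# Alternative 2: point the provider at a local kubeconfig so it can fetch the
# credentials secret itself (the [Global]/kubeconfig-path option named above)
# [Global]
# kubeconfig-path = /etc/kubernetes/kubeconfig
```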

The primary purpose of the above is to explain that this code lives in a non-trivial amount of context, and changes to it would need to be carefully considered and well tested. I think it is safe to say we would not design it this way if we were starting again.

Fortunately, we are starting again! This code is obsolete in 4.12, when it will be replaced by the external CCM. Given that:
* The code is downstream-only
* The code is obsolete
* The code is frozen and even small changes to it are usually rejected
* The code does not have as good test coverage as its replacement
* The potential for regressions is high
* The impact of regression is high

I am not inclined to make the structural changes this code probably requires. Consequently, I'm going to submit a hopefully minimal patch that simply makes it not panic in this situation.
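
A minimal guard of roughly this shape would avoid the panic (an illustrative sketch with stand-in types; the actual change is in the linked pull request):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types again; the guard itself is the point.
type computeClient struct{ endpoint string }

type instances struct{ compute *computeClient }

var errNotInitialized = errors.New("cloud provider is not initialized")

func (i *instances) instanceExistsByProviderID(providerID string) (bool, error) {
	// Minimal, non-structural fix: refuse to touch the compute client if it
	// was never set, instead of panicking on a nil pointer dereference.
	if i.compute == nil {
		return false, errNotInitialized
	}
	fmt.Println("querying", i.compute.endpoint, "for", providerID)
	return true, nil
}

func main() {
	i := &instances{} // provider never authenticated, compute is nil
	exists, err := i.instanceExistsByProviderID("openstack:///abc123")
	fmt.Println(exists, err) // prints: false cloud provider is not initialized
}
```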

Comment 36 errata-xmlrpc 2023-01-17 19:51:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

