Bug 1969633 - SNO node failed to become ready because the kubelet could not talk to the API server due to a NotAuthorized issue
Summary: SNO node failed to become ready because the kubelet could not talk to the API server due to a NotAuthorized issue
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dean West
QA Contact: Sandeep
URL:
Whiteboard:
Depends On:
Blocks: 1978417
 
Reported: 2021-06-08 19:14 UTC by Igal Tsoiref
Modified: 2022-04-27 11:29 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-27 11:29:41 UTC
Target Upstream Version:
Embargoed:


Attachments
api-logs (3.01 MB, application/gzip), 2021-06-08 19:14 UTC, Igal Tsoiref
sos-report (13.48 MB, application/x-bzip), 2021-06-08 19:15 UTC, Igal Tsoiref
must-gather-no-pod-logs (2.15 MB, application/gzip), 2021-06-08 19:18 UTC, Igal Tsoiref
kubelet-logs (506.89 KB, application/gzip), 2021-06-08 19:49 UTC, Igal Tsoiref


Links
Github openshift/cluster-etcd-operator pull 619 (open): Bug 1969633: pkg/operator/targetconfigcontroller: wait for kcm-o to generate certs before rollout (last updated 2021-07-01 15:30:57 UTC)
Github openshift/cluster-kube-apiserver-operator pull 1169 (open): Bug 1969633: pkg/operator/targetconfigcontroller: wait for kcm-o to generate certs before rollout (last updated 2021-07-01 14:13:55 UTC)

Description Igal Tsoiref 2021-06-08 19:14:56 UTC
Created attachment 1789455 [details]
api-logs

Description of problem:
In the SNO flow, after the bootstrap kube-apiserver was removed and the kube-apiserver operator spawned the master's kube-apiserver, the kubelet failed to talk to it due to authentication issues.

Many messages like this:

2021-06-04T23:31:29.227084500+00:00 stderr F E0604 23:31:29.226995      19 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority, verifying certificate SN=168548637191173065529478113719475632231, SKID=, AKID=3A:37:BC:B0:BB:1A:31:E0:DD:06:31:4C:9E:1E:14:BC:22:69:E5:65 failed: x509: certificate signed by unknown authority]"
2021-06-04T23:31:29.281116727+00:00 stderr F W0604 23:31:29.280929      19 patch_genericapiserver.go:123] Request to "/apis/storage.k8s.io/v1/csinodes/sno00029" (source IP 198.18.8.146:38316, user agent "kubelet/v1.21.0 (linux/amd64) kubernetes/4b2b6ff") before server is ready, possibly a sign for a broken load balancer setup.
2021-06-04T23:31:29.281184225+00:00 stderr F E0604 23:31:29.281143      19 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority, verifying certificate SN=168548637191173065529478113719475632231, SKID=, AKID=3A:37:BC:B0:BB:1A:31:E0:DD:06:31:4C:9E:1E:14:BC:22:69:E5:65 failed: x509: certificate signed by unknown authority]"
2021-06-04T23:31:29.285809712+00:00 stderr F I0604 23:31:29.285677      19 healthz.go:266] api-openshift-apiserver-available,api-openshift-oauth-apiserver-available check failed: readyz
2021-06-04T23:31:29.285809712+00:00 stderr F [-]api-openshift-apiserver-available failed: check is not yet complete
2021-06-04T23:31:29.285809712+00:00 stderr F [-]api-openshift-oauth-apiserver-available failed: check is not yet complete


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Igal Tsoiref 2021-06-08 19:15:54 UTC
Created attachment 1789456 [details]
sos-report

Comment 2 Igal Tsoiref 2021-06-08 19:18:03 UTC
Created attachment 1789457 [details]
must-gather-no-pod-logs

Comment 3 Igal Tsoiref 2021-06-08 19:49:50 UTC
Created attachment 1789459 [details]
kubelet-logs

Comment 4 Stefan Schimanski 2021-06-09 07:22:26 UTC
I see that the kubelet is getting

   Jun 04 23:46:22 sno00029 hyperkube[13597]: E0604 23:46:22.649682   13597 kubelet.go:2303] "Error getting node" err="node \"sno00029\" not found"

which means that kube-apiserver is up and running. It just rejects the client certificate used by CSI. Please prove that the CSI cert is valid and expected to work.
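For reference, a minimal way to check that directly on the node might look like the sketch below (the file paths and layout are the usual kubelet/OpenShift locations and are assumptions, not verified against this cluster):

```
# Inspect the client certificate the kubelet is currently presenting
# (kubelet-client-current.pem is the usual rotated client cert location).
openssl x509 -noout -subject -issuer -serial -dates \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem

# Locate the client CA bundle the kube-apiserver trusts; the layout below
# (a client-ca configmap copied into static-pod-resources) is assumed.
find /etc/kubernetes/static-pod-resources -path '*client-ca*' -name '*.crt'

# Verify the kubelet cert against that bundle (substitute the path found above).
openssl verify -CAfile <client-ca-bundle-found-above> \
  /var/lib/kubelet/pki/kubelet-client-current.pem
```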

Comment 5 Jan Safranek 2021-06-09 08:23:51 UTC
There are 3 kubelet processes in the log.

The first one started at 23:29:11 and was able to talk to the API server (at least I think these lines come from a real API server):
Jun 04 23:25:51 sno00029 hyperkube[1955]: I0604 23:25:51.224416    1955 kubelet_node_status.go:76] "Successfully registered node" node="sno00029"
Jun 04 23:29:03 sno00029 hyperkube[1955]: I0604 23:29:03.421153    1955 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver-operator/kube-apiserver-operator-99cdbdcdf-nw4n7]
Jun 04 23:29:10 sno00029 hyperkube[1955]: E0604 23:29:10.610005    1955 secret.go:195] Couldn't get secret openshift-monitoring/node-exporter-tls: secret "node-exporter-tls" not found

The second one started at 23:29:13 and looks fine too (but was very short-lived):
Jun 04 23:29:24 sno00029 hyperkube[9476]: I0604 23:29:24.697486    9476 kubelet_node_status.go:76] "Successfully registered node" node="sno00029"

The last kubelet starts at 23:30:05 and it never connects to the API server.
There are a couple of "connect: connection refused" errors (until 23:31:09.720795) and then it's just Unauthorized.

The question is who changed the certificates for the third kubelet, or who did not give it the right certificates - "connection refused" indicates that the API server itself might have been restarted.
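
One way to correlate the kubelet restarts with the certificate state could be the following sketch (it assumes node access and a working admin kubeconfig; nothing in it is specific to this cluster):

```
# Show kubelet (re)starts and the auth failures around them.
journalctl -u kubelet --no-pager | grep -E 'Started|Unauthorized|connection refused' | head -n 50

# Check which client certs the kubelet has on disk and when they were written.
ls -l /var/lib/kubelet/pki/

# See whether new kubelet client CSRs were issued (or left pending) around 23:30.
oc get csr --sort-by=.metadata.creationTimestamp
```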

Comment 6 Igal Tsoiref 2021-06-09 08:46:58 UTC
@jsafrane in the SNO flow we replace the bootstrap kube-apiserver with the master's one. The kubelet worked fine with the bootstrap kube-apiserver but failed to work with the master's.

Comment 9 Jan Safranek 2021-06-15 11:06:19 UTC
This is not a storage issue.

>    Jun 04 23:46:22 sno00029 hyperkube[13597]: E0604 23:46:22.649682   13597 kubelet.go:2303] "Error getting node" err="node \"sno00029\" not found"

This does not come from the API server; it's from an unsynced informer. As shown in comment #5, the whole kubelet can't talk to the API server and gets 401 Unauthorized. I am sorry, but I am not able to judge which component is at fault, so I am moving this back to the API server component.

Comment 10 Stefan Schimanski 2021-06-15 12:00:21 UTC
My ask in #4 stands: the SNO and/or node team has to analyse which certs are in place at the time. kube-apiserver uses normal golang TLS logic to verify client certs. As this does not happen on HA OpenShift, it's an SNO integration issue.

Comment 11 Sam Batschelet 2021-06-29 20:37:39 UTC
I took a look at this and wanted to add a few notes.

Order of events for `csr-controller-ca`

1.) csr-controller-ca is created during bootstrap by the kube-apiserver-operator in the openshift-config-managed namespace [1]

2.) The kcm operator starts up; the first sync of the resourcesynccontroller deletes the source created in step 1 because there is no csr-controller-ca in the openshift-kube-controller-manager-operator namespace yet [2]

3.) The targetconfigcontroller creates csr-controller-ca in openshift-kube-controller-manager-operator [3]

4.) The resourcesynccontroller copies the configmap from openshift-kube-controller-manager-operator to the openshift-config-managed namespace (see the sketch below).
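
A quick way to watch this handoff on a live cluster could be the sketch below (the configmap and namespace names are the ones from the steps above):

```
# Does the copy in openshift-config-managed (steps 1 and 4) exist right now?
oc get configmap csr-controller-ca -n openshift-config-managed

# Has the source in the kcm-o namespace been created yet (step 3)?
oc get configmap csr-controller-ca -n openshift-kube-controller-manager-operator
```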

#############

The problem observed is that kcm-o deleted the CM in step 2:

```
I0624 22:57:22.546515       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"79601a22-4d54-4667-b0af-e068645caa80", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TargetConfigDeleted' Deleted target configmap openshift-config-managed/csr-controller-ca because source config does not exist
```
But then the operator loses leadership a short time after

```
W0624 22:58:33.739114       1 leaderelection.go:75] leader election lost
```

The node then goes NotReady and the operator never starts again to recreate the resource.

When I describe the node I see:

> machineconfig.machineconfiguration.openshift.io "rendered-master-6e84be82089936a6eb19997113cd255b" not found

MCD logs

```
I0624 23:27:28.486093   12494 node.go:45] Setting initial node config: rendered-master-6e84be82089936a6eb19997113cd255b
I0624 23:27:28.526958   12494 daemon.go:767] In bootstrap mode
E0624 23:27:28.527033   12494 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-6e84be82089936a6eb19997113cd255b" not found
```


```
$ oc get machineconfig -n openshift-machine-config-operator
NAME                              GENERATEDBYCONTROLLER   IGNITIONVERSION   AGE
50-masters-chrony-configuration                           2.2.0             4d21h
50-workers-chrony-configuration                           2.2.0             4d21h
99-master-dnsmasq-configuration                           2.2.0             4d21h
99-master-ssh                                             3.2.0             4d21h
99-worker-ssh                                             3.2.0             4d21h
```

vs. the output of a working SNO cluster; notice the rendered machine configs are present:

```
oc get machineconfig -n openshift-machine-config-operator
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
00-worker                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-master-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-master-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-worker-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-worker-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
99-master-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
99-master-ssh                                                                                 3.2.0             5d20h
99-worker-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
99-worker-ssh                                                                                 3.2.0             5d20h
rendered-master-2574071d90bd90c0886107b8c4d6673f   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
rendered-worker-86ce1e4078442603ca20ad7b31459ceb   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
```
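
For comparison, a minimal check of what the node is pinned to versus what actually exists could look like this (the annotation name is the one the MCO normally sets and should be treated as an assumption):

```
# Which rendered config does the MCO expect this node to run?
oc get node sno00029 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'

# Which rendered configs actually exist in the cluster?
oc get machineconfig | grep rendered
```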


[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.8/bindata/bootkube/manifests/configmap-csr-controller-ca.yaml
[2] https://github.com/openshift/cluster-kube-controller-manager-operator/blob/release-4.8/pkg/operator/resourcesynccontroller/resourcesynccontroller.go#L16

Moving to the MCO team to better explain the MCD error and the missing rendered machineconfig.

Comment 13 Sam Batschelet 2021-06-30 00:41:51 UTC
After further review, it seems the failure observed with MCO is probably just a result of other operators failing.

Going to dig into whether we can ensure we rotate csr-controller-ca before any disruption of etcd or kube-apiserver as the result of a rollout.

Comment 15 Sam Batschelet 2021-07-02 10:08:34 UTC
An experiment to test whether waiting for kcm-o to generate the CA helps has merged; a retest using ACM will be used to validate it. If the experiment is a success, we should consider moving the logic into library-go, possibly as part of the static pod operator logic.
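
A rough way to validate the experiment on a test SNO install could be the sketch below (run each watch in its own terminal; this is just an observation aid, not part of the fix):

```
# The CA copy should exist before the kube-apiserver rollout starts
# and should never disappear while the rollout progresses.
oc get configmap csr-controller-ca -n openshift-config-managed -w

# In a second terminal, follow the kube-apiserver rollout.
oc get clusteroperator kube-apiserver -w
```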

Comment 16 Stefan Schimanski 2021-08-26 13:11:20 UTC
With https://github.com/openshift/cluster-kube-apiserver-operator/pull/1209 this bug will be back. Moving back to POST.

Comment 17 Lili Cosic 2021-09-01 08:13:50 UTC
This only happens at the scale of 1000 single-node clusters, and even then not very frequently. I lowered the priority for this reason.

If we ever need to come back to it because it happens more frequently, here are some steps we might look into, from my and Sam's discussion:

We want to essentially revert this revert https://bugzilla.redhat.com/show_bug.cgi?id=1994857 but also add more conditions, besides the managed field:
- CA resource actually changed content
- signer is the same

We would also need to observe and look for other changes.
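
A minimal sketch of how those two extra conditions could be checked by hand (the configmap data key is assumed to be ca-bundle.crt; adjust if it differs):

```
# Did the CA content actually change? Compare a hash before and after the sync.
oc get configmap csr-controller-ca -n openshift-config-managed \
  -o jsonpath='{.data.ca-bundle\.crt}' | sha256sum

# Is the signer still the same? (openssl x509 only prints the first cert in a bundle.)
oc get configmap csr-controller-ca -n openshift-config-managed \
  -o jsonpath='{.data.ca-bundle\.crt}' | openssl x509 -noout -subject -issuer -enddate
```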

