Created attachment 1789455 [details]
api-logs

Description of problem:
In the SNO flow, after the bootstrap kube-apiserver was removed and the kube-apiserver operator spawned the master kube-apiserver, the kubelet failed to talk to it due to auth issues. Many messages like this:

2021-06-04T23:31:29.227084500+00:00 stderr F E0604 23:31:29.226995 19 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority, verifying certificate SN=168548637191173065529478113719475632231, SKID=, AKID=3A:37:BC:B0:BB:1A:31:E0:DD:06:31:4C:9E:1E:14:BC:22:69:E5:65 failed: x509: certificate signed by unknown authority]"
2021-06-04T23:31:29.281116727+00:00 stderr F W0604 23:31:29.280929 19 patch_genericapiserver.go:123] Request to "/apis/storage.k8s.io/v1/csinodes/sno00029" (source IP 198.18.8.146:38316, user agent "kubelet/v1.21.0 (linux/amd64) kubernetes/4b2b6ff") before server is ready, possibly a sign for a broken load balancer setup.
2021-06-04T23:31:29.281184225+00:00 stderr F E0604 23:31:29.281143 19 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority, verifying certificate SN=168548637191173065529478113719475632231, SKID=, AKID=3A:37:BC:B0:BB:1A:31:E0:DD:06:31:4C:9E:1E:14:BC:22:69:E5:65 failed: x509: certificate signed by unknown authority]"
2021-06-04T23:31:29.285809712+00:00 stderr F I0604 23:31:29.285677 19 healthz.go:266] api-openshift-apiserver-available,api-openshift-oauth-apiserver-available check failed: readyz
2021-06-04T23:31:29.285809712+00:00 stderr F [-]api-openshift-apiserver-available failed: check is not yet complete
2021-06-04T23:31:29.285809712+00:00 stderr F [-]api-openshift-oauth-apiserver-available failed: check is not yet complete

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Created attachment 1789456 [details] sos-report
Created attachment 1789457 [details] must-gather-no-pod-logs
Created attachment 1789459 [details] kubelet-logs
I see that the kubelet is getting

Jun 04 23:46:22 sno00029 hyperkube[13597]: E0604 23:46:22.649682 13597 kubelet.go:2303] "Error getting node" err="node \"sno00029\" not found"

which means that kube-apiserver is up and running. It just rejects the client certificate used by CSI. Please prove that the CSI cert is valid and expected to work.
There are 3 kubelet processes in the log. The first one started at 23:29:11 and it was able to talk to the API server (at least I think these come from a real API server):

Jun 04 23:25:51 sno00029 hyperkube[1955]: I0604 23:25:51.224416 1955 kubelet_node_status.go:76] "Successfully registered node" node="sno00029"
Jun 04 23:29:03 sno00029 hyperkube[1955]: I0604 23:29:03.421153 1955 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver-operator/kube-apiserver-operator-99cdbdcdf-nw4n7]
Jun 04 23:29:10 sno00029 hyperkube[1955]: E0604 23:29:10.610005 1955 secret.go:195] Couldn't get secret openshift-monitoring/node-exporter-tls: secret "node-exporter-tls" not found

The second one started at 23:29:13 and looks fine too (but very short-lived):

Jun 04 23:29:24 sno00029 hyperkube[9476]: I0604 23:29:24.697486 9476 kubelet_node_status.go:76] "Successfully registered node" node="sno00029"

The last kubelet starts at 23:30:05 and it never connects to the API server. There are a couple of "connect: connection refused" errors (until 23:31:09.720795) and then it's just Unauthorized. The question is, who changed the certificates for the third kubelet, or who did not give it the right certificates - "connection refused" indicates that the API server itself might have been restarted.
@jsafrane in the SNO flow we replace the bootstrap kube-apiserver with the masters' one. The kubelet worked fine with the bootstrap kube-apiserver but failed to work with the masters' one.
This is not a storage issue.

> Jun 04 23:46:22 sno00029 hyperkube[13597]: E0604 23:46:22.649682 13597 kubelet.go:2303] "Error getting node" err="node \"sno00029\" not found"

This does not come from the API server, it's from an unsynced informer. As shown in comment #5, the whole kubelet can't talk to the API server and gets 401 Unauthorized. I am sorry, but I am not able to judge which component is faulty; moving back to the API server.
My ask in #4 stands: the SNO and/or node team has to analyse which certs are in place at the time. kube-apiserver uses normal golang TLS logic to verify client certs. As this does not happen on HA OpenShift, it's an SNO integration issue.
I took a look at this and wanted to add a few notes.

Order of events for `csr-controller-ca`:

1.) csr-controller-ca created during bootstrap by kube-apiserver-operator in the openshift-config-managed namespace[1]
2.) kcm operator starts up; the first sync of the resourcesynccontroller will delete the copy created in step 1 because there is no csr-controller-ca in the openshift-kube-controller-manager-operator namespace yet[2]
3.) targetconfigcontroller will create csr-controller-ca in openshift-kube-controller-manager-operator[3]
4.) resourcesynccontroller will copy the cm from openshift-kube-controller-manager-operator to the openshift-config-managed namespace.

#############

The problem observed is that kcm-o deleted the CM in step 2:

```
I0624 22:57:22.546515 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"79601a22-4d54-4667-b0af-e068645caa80", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TargetConfigDeleted' Deleted target configmap openshift-config-managed/csr-controller-ca because source config does not exist
```

But then the operator loses leadership a short time after:

```
W0624 22:58:33.739114 1 leaderelection.go:75] leader election lost
```

The node then goes NotReady and the operator never starts again to recreate the resource.
When I described the nodes I see

> machineconfig.machineconfiguration.openshift.io "rendered-master-6e84be82089936a6eb19997113cd255b" not found

MCD logs:

```
I0624 23:27:28.486093 12494 node.go:45] Setting initial node config: rendered-master-6e84be82089936a6eb19997113cd255b
I0624 23:27:28.526958 12494 daemon.go:767] In bootstrap mode
E0624 23:27:28.527033 12494 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-6e84be82089936a6eb19997113cd255b" not found
```

```
$ oc get machineconfig -n openshift-machine-config-operator
NAME                              GENERATEDBYCONTROLLER   IGNITIONVERSION   AGE
50-masters-chrony-configuration                           2.2.0             4d21h
50-workers-chrony-configuration                           2.2.0             4d21h
99-master-dnsmasq-configuration                           2.2.0             4d21h
99-master-ssh                                             3.2.0             4d21h
99-worker-ssh                                             3.2.0             4d21h
```

vs the output of a working SNO cluster - notice the rendered machine configs:

```
$ oc get machineconfig -n openshift-machine-config-operator
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
00-worker                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-master-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-master-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-worker-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
01-worker-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
99-master-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
99-master-ssh                                                                                 3.2.0             5d20h
99-worker-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
99-worker-ssh                                                                                 3.2.0             5d20h
rendered-master-2574071d90bd90c0886107b8c4d6673f   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
rendered-worker-86ce1e4078442603ca20ad7b31459ceb   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             5d19h
```

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.8/bindata/bootkube/manifests/configmap-csr-controller-ca.yaml
[2]
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/release-4.8/pkg/operator/resourcesynccontroller/resourcesynccontroller.go#L16

Moving to the MCO team to better explain the MCD error and the missing rendered machineconfig.
After further review, it seems the failure observed with MCO is probably just a result of the other operators failing. Going to dig into whether we can ensure we rotate csr-controller-ca before any disruption of etcd or kube-apiserver as the result of a rollout.
An experiment to test waiting for kcm-o to generate the CA has merged; a retest using ACM will be used to validate. If the experiment is a success, we should consider moving the logic into library-go, possibly as part of the static pod operator logic.
With https://github.com/openshift/cluster-kube-apiserver-operator/pull/1209 this bug will be back. Moving back to POST.
This only happens at the scale of 1000 single-node clusters, and even then not that frequently. I lowered the priority for this reason. If we ever need to come back to it because it starts happening more frequently, here are some steps we might look into, from my and Sam's discussion:

We want to essentially revert this revert https://bugzilla.redhat.com/show_bug.cgi?id=1994857 but also add more conditions, besides the managed field:

- the CA resource actually changed content
- the signer is the same

We would also need to observe and look for other changes.