Sometime on Tuesday or Wednesday (April 6th or later) we started to see a very large spike in tests failing with the error "tls: internal error":

---
fail [k8s.io/kubernetes.0/test/e2e/storage/drivers/in_tree.go:1910]: Unexpected error:
    <*errors.StatusError | 0xc0004c7180>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "error dialing backend: remote error: tls: internal error",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    error dialing backend: remote error: tls: internal error
---

This occurs on AWS/GCP/Azure. Before the 6th we were seeing it only 1-4 times a day, but now it's tens of failures a day, which roughly correlates with the rebase. It seems to happen on exec - it could be exec-specific behavior, a TLS error, a failure to properly handle a new type of TLS error, a failure in the node, etc., but it's hard to say:

https://search.ci.openshift.org/?search=tls%3A+internal+error&maxAge=336h&context=0&type=bug%2Bjunit&name=4%5C.8%7Cmaster&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none

We also see it when calling the thanos querier for alert info via exec. I *believe* the message here is purely on the kube-apiserver -> kubelet exec side, not the curl side (so not a thanos-specific problem):

---
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:454]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "unable to execute query ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' \"https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1\":\nCommand stdout:\n\nstderr:\nError from server: error dialing backend: remote error: tls: internal error\n\nerror:\nexit status 1\n",
        },
    ]
    unable to execute query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1":
    Command stdout:

    stderr:
    Error from server: error dialing backend: remote error: tls: internal error

    error:
    exit status 1
    occurred
---

Urgent because in some jobs we see 1/2 of runs fail (not sure why; possibly a platform-specific factor, but it fails on all platforms that we have seen).
It looks like the TLS errors match kubelet serving certificate rotation. Please have a look at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032 or https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp/1382275964557332480/artifacts/e2e-gcp/gather-extra/artifacts/nodes/ci-op-p4f8h5y2-16f7c-rzt5v-worker-c-jg9tw/ and grep for "no serving certificate available for the kubelet" and "Rotating certificates" on the worker nodes.
I took a look at the failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032

Journal logs for the worker that served this are here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/nodes/ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9/journal

Query made at 11:53:09 through 11:53:50 (from test stdout):

Apr 14 11:53:09.660: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'
...
Apr 14 11:53:50.684: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'

In the accompanying kubelet logs, looking at that time range, we see a lot of the following errors in the journal:

Apr 14 11:53:09.955199 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:09.955133 1505 log.go:184] http: TLS handshake error from 10.0.0.4:58696: no serving certificate available for the kubelet
Apr 14 11:53:13.456495 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:13.456458 1505 log.go:184] http: TLS handshake error from 10.0.128.4:56440: no serving certificate available for the kubelet

I looked for the first instance of this error, and it seems to start at 11:39:02, not long after the node booted (11:36:53):

Apr 14 11:39:02.235310 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.235272 1505 log.go:184] http: TLS handshake error from 10.0.0.4:40238: no serving certificate available for the kubelet

The last instance of this error is at 11:54:04, after which the kubelet successfully starts a CSR reflector:

Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688 1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
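For context on how the two symptoms fit together: the apiserver-side "remote error: tls: internal error" and the kubelet-side "no serving certificate available for the kubelet" are the same handshake. In Go's crypto/tls, when the server's GetCertificate callback returns an error (which is what the kubelet's dynamic serving-cert provider does while it has no signed cert yet), the server aborts the handshake with an internal_error alert, and that is exactly what the dialing side reports. A minimal standalone sketch (plain Go, not OpenShift code) that reproduces the pairing:

package main

import (
	"crypto/tls"
	"errors"
	"fmt"
)

func main() {
	// Stand-in for the kubelet's dynamic serving-cert provider before a signed
	// certificate is available: GetCertificate fails, so every handshake fails.
	srvCfg := &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			return nil, errors.New("no serving certificate available for the kubelet")
		},
	}

	ln, err := tls.Listen("tcp", "127.0.0.1:0", srvCfg)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		// The handshake fails server-side with the GetCertificate error; the
		// server replies with an internal_error TLS alert.
		_ = conn.(*tls.Conn).Handshake()
		conn.Close()
	}()

	// The dialing side (standing in for the kube-apiserver dialing the kubelet
	// for exec) prints: remote error: tls: internal error
	_, err = tls.Dial("tcp", ln.Addr().String(), &tls.Config{InsecureSkipVerify: true})
	fmt.Println(err)
}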
Tracing the certificate manager:

Apr 14 11:38:55.867442 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.867411 1505 certificate_manager.go:282] Certificate rotation is enabled.
Apr 14 11:38:55.869072 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.869035 1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:38:55.879208 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879148 1505 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
Apr 14 11:38:55.879416 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879383 1505 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/kubelet-ca.crt
Apr 14 11:38:55.900724 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.900676 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:56.111279 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.111176 1505 csr.go:249] certificate signing request csr-mqgcd is approved, waiting to be issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126665 1505 csr.go:245] certificate signing request csr-mqgcd is issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126750 1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128385 1505 certificate_manager.go:556] Certificate expiration is 2021-04-15 11:18:03 +0000 UTC, rotation deadline is 2021-04-15 04:59:43.812583687 +0000 UTC
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128444 1505 certificate_manager.go:288] Waiting 17h20m46.68414809s for next certificate rotation

First run of CSRs is fine; we're supposed to wait 17h.

Apr 14 11:39:01.216428 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:01.216406 1505 certificate_manager.go:282] Certificate rotation is enabled.

But then, 5 seconds later, we rotate them again?

Apr 14 11:39:02.217698 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.217652 1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:39:02.232684 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.232560 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:54:02.229274 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:02.229248 1505 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition

This round fails, but it takes 15m to time out. Then, in the next run of the certificate manager, it successfully pulls certs, and the errors go away.
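For reference on the "rotation deadline" / "Waiting 17h20m..." lines: client-go's certificate manager schedules the next rotation at a jittered point roughly 70-90% through the certificate's lifetime, so a fleet of kubelets doesn't all rotate at once. A paraphrased sketch of that computation (not a verbatim copy of certificate_manager.go), using values approximated from the journal above:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextRotationDeadline paraphrases how client-go picks the rotation deadline:
// 70% of the certificate lifetime plus up to 20% of random jitter.
func nextRotationDeadline(notBefore, notAfter time.Time) time.Time {
	total := float64(notAfter.Sub(notBefore))
	jittered := time.Duration(total * (0.7 + 0.2*rand.Float64()))
	return notBefore.Add(jittered)
}

func main() {
	// Approximated from the journal: cert valid from ~11:18:03 on the 14th for
	// 24h, with the manager evaluating the deadline at 11:38:57.
	notBefore := time.Date(2021, 4, 14, 11, 18, 3, 0, time.UTC)
	notAfter := time.Date(2021, 4, 15, 11, 18, 3, 0, time.UTC)
	now := time.Date(2021, 4, 14, 11, 38, 57, 0, time.UTC)

	deadline := nextRotationDeadline(notBefore, notAfter)
	// Lands somewhere between ~04:06 and ~08:54 on the 15th; the journal's
	// 04:59:43 deadline and "Waiting 17h20m..." fall inside that window.
	fmt.Printf("rotation deadline %v, waiting %v\n", deadline, deadline.Sub(now))
}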
Apr 14 11:54:04.366789 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.366731 1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688 1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:54:04.410213 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.410156 1505 csr.go:249] certificate signing request csr-sz2fb is approved, waiting to be issued
Apr 14 11:54:04.456706 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456618 1505 csr.go:245] certificate signing request csr-sz2fb is issued
Apr 14 11:54:04.456895 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456839 1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146

I have some questions:
- Why does the kubelet request another cert after only 5s?
- Why does the controller manager fail to respond with a cert?
Apr 14 11:54:03.377045 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:03.376886 1505 kubelet_pods.go:1285] "Killing unwanted pod" podName="execpod"
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:03.377757 1505 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: goroutine 24750 [running]:
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: panic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult.func1(0xc0007976f0, 0xc000b3a580, 0x0, 0x0, 0xc001c432c0, 0xc0008a62a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:697 +0x33a
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: created by k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:690 +0x105

Kubelet appears to have crashed at 11:54:03, but this might be a red herring, since the timeout and next CSR request occur within 2s of each other. (We should probably follow up on this regardless.)
Another question:
- Why is the node marked as NodeReady if its cert broke?

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880 1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"
Networking was down when the original CSR was filed. See e.g. errors like:

Apr 14 11:39:01.784836 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:01.784813 1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-multus/network-metrics-daemon-qdt8t" podUID=7be17f6c-5ee9-4e1d-a340-d2602b8f7979
...
Apr 14 11:39:55.371302 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:55.370401 1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-network-diagnostics/network-check-target-zf6tb" podUID=08b8d181-a0eb-4e08-ae4b-d66e972e9f6e

after which it finally starts networking and declares the node ready at 11:40:01:

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880 1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"

So we know the answer to:
- Why does the controller manager fail to respond with a cert?

Networking wasn't up on the node, and the certificate manager in client-go waits a whole 15m to retry:
https://github.com/openshift/kubernetes/blob/74497632f4118e7989da73de0ed6f1268aeb64ad/staging/src/k8s.io/client-go/util/certificate/certificate_manager.go#L47

Remaining questions:
- Why does the kubelet request another cert after only 5s?
- Why is the node marked as NodeReady if it doesn't have a cert? (amended slightly)
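For reference on where the 15m comes from: one rotation attempt creates the CSR and then blocks on it for a fixed 15-minute window before giving up and retrying. A self-contained sketch of that control flow (an assumed simplification for illustration, not client-go itself):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// certificateWaitTimeout mirrors the fixed 15-minute wait referenced at the
// certificate_manager.go line linked above.
const certificateWaitTimeout = 15 * time.Minute

// waitForIssued stands in for the manager blocking on the CSR being issued; it
// returns the cert or times out with the error string seen in the journal.
func waitForIssued(ctx context.Context, issued <-chan []byte) ([]byte, error) {
	select {
	case cert := <-issued:
		return cert, nil
	case <-ctx.Done():
		return nil, errors.New("certificate request was not signed: timed out waiting for the condition")
	}
}

// rotateCerts models one rotation attempt: create the CSR (elided here), then
// wait up to 15m for it to be approved and issued.
func rotateCerts(issued <-chan []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), certificateWaitTimeout)
	defer cancel()
	_, err := waitForIssued(ctx, issued)
	return err
}

func main() {
	// Simulate an approver that never responds, e.g. because node networking
	// was down when the CSR was filed. This prints the timeout error after 15m.
	neverIssued := make(chan []byte)
	fmt.Println(rotateCerts(neverIssued))
}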
Cert request at 11:39:02 is neither accepted nor rejected:

{
    "apiVersion": "certificates.k8s.io/v1",
    "kind": "CertificateSigningRequest",
    "metadata": {
        "creationTimestamp": "2021-04-14T11:39:02Z",
        "generateName": "csr-",
        "name": "csr-6jl4f",
        "resourceVersion": "20413",
        "uid": "c470f86a-f2b4-4912-9d5f-a4b32d7965b1"
    },
    "spec": {
        "groups": [
            "system:nodes",
            "system:authenticated"
        ],
        "request": "blah",
        "signerName": "kubernetes.io/kubelet-serving",
        "usages": [
            "digital signature",
            "key encipherment",
            "server auth"
        ],
        "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
    },
    "status": {}
},

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-5f895cd888-z87rf_machine-approver-controller.log

Machine-approver-controller logs (it is running on "nodeName": "ci-op-7mr744x6-fc795-jdxcp-master-0") say:

I0414 11:39:02.229047 1 controller.go:114] Reconciling CSR: csr-6jl4f
I0414 11:39:02.229676 1 csr_check.go:150] csr-6jl4f: CSR does not appear to be client csr
I0414 11:39:02.232695 1 csr_check.go:441] retrieving serving cert from ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 (10.0.128.3:10250)
I0414 11:39:02.235536 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0414 11:39:02.235678 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9
E0414 11:39:02.235733 1 csr_check.go:196] csr-6jl4f: Serving Cert: No target machine for node "ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
I0414 11:39:02.235757 1 controller.go:172] csr-6jl4f: CSR not authorized

Same error: why is this running into the same issue?

Second cert request succeeds at 11:54:04:

{
    "apiVersion": "certificates.k8s.io/v1",
    "kind": "CertificateSigningRequest",
    "metadata": {
        "creationTimestamp": "2021-04-14T11:54:04Z",
        ...
        "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
    },
    "status": {
        "certificate": "blah",
        "conditions": [
            {
                "lastTransitionTime": "2021-04-14T11:54:04Z",
                "lastUpdateTime": "2021-04-14T11:54:04Z",
                "message": "This CSR was approved by the Node CSR Approver",
                "reason": "NodeCSRApprove",
                "status": "True",
                "type": "Approved"
            }
        ]
    }
},
No evidence of this issue in the logs of any of the masters, but the machine-approver-controller is still hitting the "tls: internal error" when it fails to approve the cert. Tracing the code from the logs above, it tries to dial back to the (known-problem) kubelet that requested the cert, which seems a bit circular:
https://github.com/openshift/cluster-machine-approver/blob/6768880fc1c3eabdc7b5db9987b69ce5a3f52521/pkg/controller/csr_check.go#L441-L443
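To illustrate the circularity, the recovery check boils down to something like the following (a paraphrase of the idea, not the actual csr_check.go code): dial the kubelet's serving port and read whatever certificate it presents. On a node in this state, that dial fails with the very same "remote error: tls: internal error", so this path can never help approve the node's first serving cert:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net"
	"time"
)

// fetchCurrentServingCert dials the kubelet's serving port and returns the
// leaf certificate it presents. Verification is skipped on purpose: the point
// is to inspect whatever cert is currently being served.
func fetchCurrentServingCert(addr string) (*x509.Certificate, error) {
	conn, err := tls.DialWithDialer(&net.Dialer{Timeout: 5 * time.Second}, "tcp", addr,
		&tls.Config{InsecureSkipVerify: true})
	if err != nil {
		// On a kubelet with no serving cert this is
		// "remote error: tls: internal error" - the same error that blocks the
		// CSR from being approved via this path.
		return nil, err
	}
	defer conn.Close()

	state := conn.ConnectionState()
	if len(state.PeerCertificates) == 0 {
		return nil, fmt.Errorf("no peer certificate presented by %s", addr)
	}
	return state.PeerCertificates[0], nil
}

func main() {
	// Address taken from the machine-approver log above.
	cert, err := fetchCurrentServingCert("10.0.128.3:10250")
	if err != nil {
		fmt.Println("Failed to retrieve current serving cert:", err)
		return
	}
	fmt.Println("current serving cert CN:", cert.Subject.CommonName)
}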
Ryan thinks there's an issue where the cluster-machine-approver is on golang 1.15, but the rest of OpenShift has been updated to 1.16, resulting in some incompatibilities: https://bugzilla.redhat.com/show_bug.cgi?id=1949741 We'll see if this change helps. The timing makes sense, given when the go 1.16 bump landed.
Marking as not a blocker: this does not appear to be a regression, and the impact is more of an issue in CI than in production due to short-lived nodes - this error has been pretty prevalent and is not new, e.g. this log from Dec. 15, 2020:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/98/pull-ci-openshift-cluster-machine-approver-master-e2e-aws/1338899425333874688/artifacts/e2e-aws/gather-extra/pods/openshift-cluster-machine-approver_machine-approver-7f767df9d7-2d92x_machine-approver-controller.log

I think there are two issues here:
1. There seems to be a bug in the kubelet's TLS serving cert management prior to getting a CSR approved, and in the NodeReady logic.
2. Something changed recently, possibly coinciding with the rebase, causing the "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured" test to fail much more frequently. It's probably worth reviewing how that test is implemented.

I'll clone this and send a bug to the monitoring team to investigate #2, and will continue using this one to track #1.
Also proposed upstream: https://github.com/kubernetes/kubernetes/pull/101200
Where is the follow-up to prevent the racing? If this is correct, we would be regressing startup by 15m, which is unacceptable. Reopening this - I expect to see the race condition fully debugged and a gap of <2m from the time a master node comes up to it being ready (it's the approver's job to approve masters with low latency, especially during install). I could tolerate a known few-minute regression at install, but not 15m.
Let me describe expected behavior:
1. master starts up during install
2. creates server certificate
3. machine approver accepts certificate <2-3m after the machine approver has the necessary info
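A rough way to check that expectation against a cluster (a throwaway diagnostic of mine, not product code): list the kubelet-serving CSRs and print how long each took from creation to approval. The kubeconfig path below is just an example:

package main

import (
	"context"
	"fmt"

	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Example kubeconfig path; point it at the cluster under test.
	config, err := clientcmd.BuildConfigFromFlags("", "/tmp/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	csrs, err := client.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, csr := range csrs.Items {
		// Only the kubelet serving CSRs are interesting here.
		if csr.Spec.SignerName != certificatesv1.KubeletServingSignerName {
			continue
		}
		for _, cond := range csr.Status.Conditions {
			if cond.Type == certificatesv1.CertificateApproved {
				latency := cond.LastUpdateTime.Sub(csr.CreationTimestamp.Time)
				fmt.Printf("%s for %s approved after %v\n", csr.Name, csr.Spec.Username, latency)
			}
		}
	}
}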
INFO[2021-04-16T13:06:26Z] Step e2e-aws-serial-ipi-install-install succeeded after 37m20.041593719s.

Average startup time is about 32m on AWS, so this looks like it regresses AWS by about 5m (just a guess), which is too significant for me to accept fixing by waiting alone, given its impact on every customer, CI, and services. So I don't consider "waiting" a complete solution: waiting better is great, but if this won't fix the wait, we need to spawn a separate bug to resolve the wait before this merges.
We now have an answer to:
- Why does the kubelet request another cert after only 5s?

The first cert requested is the kubelet's client cert, which succeeds. The server cert request is the one that hangs for 15m. Unfortunately, the logging doesn't differentiate between the two.

I'll look into seeing if we can retry on the failed CSR faster.
> I'll look into seeing if we can retry on the failed CSR faster.

https://bugzilla.redhat.com/show_bug.cgi?id=1940972

I don't think we can do anything client-side; the controller should not hang for 15m.

> Unfortunately, the logging doesn't differentiate between the two.

Addressed by https://github.com/kubernetes/kubernetes/pull/101252

And https://github.com/kubernetes/kubernetes/pull/101200 is slightly less urgent, but will ensure that kubelets don't mark themselves as ready without a serving cert. No word on this upstream yet.

As Clayton mentioned, this doesn't address the regression, but the approver BZ above should.
Now that this is split and the CSR approval delays will be addressed in #1940972, moving back to POST. Adding BZ link.
Dropping severity to High as the kubelet delays in getting a valid serving cert are recoverable. Leaving priority as Urgent due to CI impact. I'm hoping that the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1940972 will address the majority of this issue.

The approach in https://github.com/kubernetes/kubernetes/pull/101200 won't work because there are a number of cases where it would be undesirable to mark the kubelet NotReady when it does not have a serving cert available. For example, if the serving cert expired and the CSR required manual approval, this could potentially cause a node outage/regression from current behaviour. I will look into surfacing this as a NodeCondition.
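Roughly what I have in mind for the NodeCondition, purely as a sketch - the condition type, reason strings, and helper below are made up and not an accepted Kubernetes API:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// servingCertCondition reports whether the kubelet currently holds a signed
// serving certificate without flipping overall node readiness. The condition
// type and reasons are hypothetical.
func servingCertCondition(hasServingCert bool, now time.Time) corev1.NodeCondition {
	cond := corev1.NodeCondition{
		Type:               corev1.NodeConditionType("KubeletServingCertificateAvailable"), // hypothetical
		LastHeartbeatTime:  metav1.NewTime(now),
		LastTransitionTime: metav1.NewTime(now),
	}
	if hasServingCert {
		cond.Status = corev1.ConditionTrue
		cond.Reason = "ServingCertificateIssued"
		cond.Message = "kubelet has a signed serving certificate"
	} else {
		cond.Status = corev1.ConditionFalse
		cond.Reason = "AwaitingCertificateSigningRequest"
		cond.Message = "no serving certificate available for the kubelet"
	}
	return cond
}

func main() {
	fmt.Printf("%+v\n", servingCertCondition(false, time.Now()))
}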
Commented on https://bugzilla.redhat.com/show_bug.cgi?id=1940972#c12 - I am still seeing 15m timeouts in CI, so I am not sure the fix that went in has addressed machine-approver responsiveness yet.
Since the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1952282 went in on the 23rd, the test has stopped failing. Upstream seems lukewarm on https://github.com/kubernetes/kubernetes/pull/101200, even surfaced as a condition. Is there any remaining work for the kubelet?
Closing as CI is passing. Does not appear to have been a kubelet issue and no remaining follow-ups for kubelet.