Sometime on Tuesday or Wednesday (April 6th or later) we started to see a very large spike in tests failing with the error "tls: internal error":

---
fail [k8s.io/kubernetes.0/test/e2e/storage/drivers/in_tree.go:1910]: Unexpected error:
    <*errors.StatusError | 0xc0004c7180>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "error dialing backend: remote error: tls: internal error",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    error dialing backend: remote error: tls: internal error
---

This occurs on AWS/GCP/Azure. Before the 6th we were seeing it only 1-4 times a day, but now it's tens of failures a day, which roughly correlates with the rebase. It seems to happen on exec - it could be exec-specific behavior, a TLS error, a failure to properly handle a new type of TLS error, a failure in the node, etc., but it's hard to say:

https://search.ci.openshift.org/?search=tls%3A+internal+error&maxAge=336h&context=0&type=bug%2Bjunit&name=4%5C.8%7Cmaster&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none

We also see it when calling the thanos querier for alert info via exec. I *believe* the message here is purely on the kube-apiserver -> kubelet exec side, not the curl side (so not a thanos-specific problem):

---
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:454]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "unable to execute query ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' \"https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1\":\nCommand stdout:\n\nstderr:\nError from server: error dialing backend: remote error: tls: internal error\n\nerror:\nexit status 1\n",
        },
    ]
    unable to execute query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1":
    Command stdout:

    stderr:
    Error from server: error dialing backend: remote error: tls: internal error

    error:
    exit status 1
    occurred
---

Urgent because in some jobs we see 1/2 of runs fail (not sure why; possibly a platform-specific factor, but it fails on all platforms that we have seen).
It looks like the TLS errors match kubelet serving certificate rotation. Please have a look at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032 or https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp/1382275964557332480/artifacts/e2e-gcp/gather-extra/artifacts/nodes/ci-op-p4f8h5y2-16f7c-rzt5v-worker-c-jg9tw/ and grep for "no serving certificate available for the kubelet" and "Rotating certificates" on the worker nodes.
I took a look at the failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032

Journal logs for the worker that served this are here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/nodes/ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9/journal

Query made at 11:53:09 through 11:53:50 (from test stdout):

Apr 14 11:53:09.660: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'
...
Apr 14 11:53:50.684: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'

In the accompanying kubelet logs, looking at that time range, we see a lot of the following errors in the journal:

Apr 14 11:53:09.955199 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:09.955133 1505 log.go:184] http: TLS handshake error from 10.0.0.4:58696: no serving certificate available for the kubelet
Apr 14 11:53:13.456495 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:13.456458 1505 log.go:184] http: TLS handshake error from 10.0.128.4:56440: no serving certificate available for the kubelet

I looked for the first instance of this error, and it seems to start at 11:39:02, not long after the node booted (11:36:53):

Apr 14 11:39:02.235310 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.235272 1505 log.go:184] http: TLS handshake error from 10.0.0.4:40238: no serving certificate available for the kubelet

The last instance of this error is at 11:54:04, after which the kubelet successfully starts a CSR reflector:

Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688 1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
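For context on how the two symptoms fit together: the apiserver-side "remote error: tls: internal error" and the kubelet-side "no serving certificate available for the kubelet" are the same handshake. In Go's crypto/tls, when the server's GetCertificate callback returns an error (which is what the kubelet's dynamic serving-cert provider does while it has no signed cert yet), the server aborts the handshake with an internal_error alert, and that is exactly what the dialing side reports. A minimal standalone sketch (plain Go, not OpenShift code) that reproduces the pairing:

package main

import (
	"crypto/tls"
	"errors"
	"fmt"
)

func main() {
	// Stand-in for the kubelet's dynamic serving-cert provider before a signed
	// certificate is available: GetCertificate fails, so every handshake fails.
	srvCfg := &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			return nil, errors.New("no serving certificate available for the kubelet")
		},
	}

	ln, err := tls.Listen("tcp", "127.0.0.1:0", srvCfg)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		// The handshake fails server-side with the GetCertificate error; the
		// server replies with an internal_error TLS alert.
		_ = conn.(*tls.Conn).Handshake()
		conn.Close()
	}()

	// The dialing side (standing in for the kube-apiserver dialing the kubelet
	// for exec) prints: remote error: tls: internal error
	_, err = tls.Dial("tcp", ln.Addr().String(), &tls.Config{InsecureSkipVerify: true})
	fmt.Println(err)
}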
Tracing the certificate manager:

Apr 14 11:38:55.867442 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.867411 1505 certificate_manager.go:282] Certificate rotation is enabled.
Apr 14 11:38:55.869072 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.869035 1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:38:55.879208 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879148 1505 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
Apr 14 11:38:55.879416 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879383 1505 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/kubelet-ca.crt
Apr 14 11:38:55.900724 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.900676 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:56.111279 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.111176 1505 csr.go:249] certificate signing request csr-mqgcd is approved, waiting to be issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126665 1505 csr.go:245] certificate signing request csr-mqgcd is issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126750 1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128385 1505 certificate_manager.go:556] Certificate expiration is 2021-04-15 11:18:03 +0000 UTC, rotation deadline is 2021-04-15 04:59:43.812583687 +0000 UTC
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128444 1505 certificate_manager.go:288] Waiting 17h20m46.68414809s for next certificate rotation

First run of CSRs is fine; we're supposed to wait 17h.

Apr 14 11:39:01.216428 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:01.216406 1505 certificate_manager.go:282] Certificate rotation is enabled.

But then, 5 seconds later, we rotate them again?

Apr 14 11:39:02.217698 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.217652 1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:39:02.232684 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.232560 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:54:02.229274 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:02.229248 1505 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition

This round fails, but it takes 15m to time out. Then, in the next run of the certificate manager, it successfully pulls certs, and the errors go away.
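For reference on the "rotation deadline" / "Waiting 17h20m..." lines: client-go's certificate manager schedules the next rotation at a jittered point roughly 70-90% through the certificate's lifetime, so a fleet of kubelets doesn't all rotate at once. A paraphrased sketch of that computation (not a verbatim copy of certificate_manager.go), using values approximated from the journal above:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextRotationDeadline paraphrases how client-go picks the rotation deadline:
// 70% of the certificate lifetime plus up to 20% of random jitter.
func nextRotationDeadline(notBefore, notAfter time.Time) time.Time {
	total := float64(notAfter.Sub(notBefore))
	jittered := time.Duration(total * (0.7 + 0.2*rand.Float64()))
	return notBefore.Add(jittered)
}

func main() {
	// Approximated from the journal: cert valid from ~11:18:03 on the 14th for
	// 24h, with the manager evaluating the deadline at 11:38:57.
	notBefore := time.Date(2021, 4, 14, 11, 18, 3, 0, time.UTC)
	notAfter := time.Date(2021, 4, 15, 11, 18, 3, 0, time.UTC)
	now := time.Date(2021, 4, 14, 11, 38, 57, 0, time.UTC)

	deadline := nextRotationDeadline(notBefore, notAfter)
	// Lands somewhere between ~04:06 and ~08:54 on the 15th; the journal's
	// 04:59:43 deadline and "Waiting 17h20m..." fall inside that window.
	fmt.Printf("rotation deadline %v, waiting %v\n", deadline, deadline.Sub(now))
}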
Apr 14 11:54:04.366789 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.366731 1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688 1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998 1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:54:04.410213 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.410156 1505 csr.go:249] certificate signing request csr-sz2fb is approved, waiting to be issued
Apr 14 11:54:04.456706 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456618 1505 csr.go:245] certificate signing request csr-sz2fb is issued
Apr 14 11:54:04.456895 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456839 1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146

I have some questions:
- Why does the kubelet request another cert after only 5s?
- Why does the controller manager fail to respond with a cert?
Apr 14 11:54:03.377045 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:03.376886 1505 kubelet_pods.go:1285] "Killing unwanted pod" podName="execpod"
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:03.377757 1505 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: goroutine 24750 [running]:
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: panic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult.func1(0xc0007976f0, 0xc000b3a580, 0x0, 0x0, 0xc001c432c0, 0xc0008a62a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:697 +0x33a
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: created by k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:690 +0x105

Kubelet appears to have crashed at 11:54:03, but this might be a red herring, since the timeout and next CSR request occur within 2s of each other. (We should probably follow up on this regardless.)
Another question:
- Why is the node marked as NodeReady if its cert broke?

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880 1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"
Networking was down when the original CSR was filed. See e.g. errors like:

Apr 14 11:39:01.784836 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:01.784813 1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-multus/network-metrics-daemon-qdt8t" podUID=7be17f6c-5ee9-4e1d-a340-d2602b8f7979
...
Apr 14 11:39:55.371302 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:55.370401 1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-network-diagnostics/network-check-target-zf6tb" podUID=08b8d181-a0eb-4e08-ae4b-d66e972e9f6e

after which it finally starts networking and declares the node ready at 11:40:01:

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880 1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"

So we know the answer to:
- Why does the controller manager fail to respond with a cert?

Networking wasn't up on the node, and the certificate manager in client-go waits a whole 15m to retry:
https://github.com/openshift/kubernetes/blob/74497632f4118e7989da73de0ed6f1268aeb64ad/staging/src/k8s.io/client-go/util/certificate/certificate_manager.go#L47

Remaining questions:
- Why does the kubelet request another cert after only 5s?
- Why is the node marked as NodeReady if it doesn't have a cert? (amended slightly)
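For reference on where the 15m comes from: one rotation attempt creates the CSR and then blocks on it for a fixed 15-minute window before giving up and retrying. A self-contained sketch of that control flow (an assumed simplification for illustration, not client-go itself):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// certificateWaitTimeout mirrors the fixed 15-minute wait referenced at the
// certificate_manager.go line linked above.
const certificateWaitTimeout = 15 * time.Minute

// waitForIssued stands in for the manager blocking on the CSR being issued; it
// returns the cert or times out with the error string seen in the journal.
func waitForIssued(ctx context.Context, issued <-chan []byte) ([]byte, error) {
	select {
	case cert := <-issued:
		return cert, nil
	case <-ctx.Done():
		return nil, errors.New("certificate request was not signed: timed out waiting for the condition")
	}
}

// rotateCerts models one rotation attempt: create the CSR (elided here), then
// wait up to 15m for it to be approved and issued.
func rotateCerts(issued <-chan []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), certificateWaitTimeout)
	defer cancel()
	_, err := waitForIssued(ctx, issued)
	return err
}

func main() {
	// Simulate an approver that never responds, e.g. because node networking
	// was down when the CSR was filed. This prints the timeout error after 15m.
	neverIssued := make(chan []byte)
	fmt.Println(rotateCerts(neverIssued))
}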
Cert request at 11:39:02 is neither accepted nor rejected:

{
    "apiVersion": "certificates.k8s.io/v1",
    "kind": "CertificateSigningRequest",
    "metadata": {
        "creationTimestamp": "2021-04-14T11:39:02Z",
        "generateName": "csr-",
        "name": "csr-6jl4f",
        "resourceVersion": "20413",
        "uid": "c470f86a-f2b4-4912-9d5f-a4b32d7965b1"
    },
    "spec": {
        "groups": [
            "system:nodes",
            "system:authenticated"
        ],
        "request": "blah",
        "signerName": "kubernetes.io/kubelet-serving",
        "usages": [
            "digital signature",
            "key encipherment",
            "server auth"
        ],
        "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
    },
    "status": {}
},

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-5f895cd888-z87rf_machine-approver-controller.log

Machine-approver-controller logs (it is running on "nodeName": "ci-op-7mr744x6-fc795-jdxcp-master-0") say:

I0414 11:39:02.229047 1 controller.go:114] Reconciling CSR: csr-6jl4f
I0414 11:39:02.229676 1 csr_check.go:150] csr-6jl4f: CSR does not appear to be client csr
I0414 11:39:02.232695 1 csr_check.go:441] retrieving serving cert from ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 (10.0.128.3:10250)
I0414 11:39:02.235536 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0414 11:39:02.235678 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9
E0414 11:39:02.235733 1 csr_check.go:196] csr-6jl4f: Serving Cert: No target machine for node "ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
I0414 11:39:02.235757 1 controller.go:172] csr-6jl4f: CSR not authorized

Same error: why is this running into the same issue?

Second cert request succeeds at 11:54:04:

{
    "apiVersion": "certificates.k8s.io/v1",
    "kind": "CertificateSigningRequest",
    "metadata": {
        "creationTimestamp": "2021-04-14T11:54:04Z",
        ...
        "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
    },
    "status": {
        "certificate": "blah",
        "conditions": [
            {
                "lastTransitionTime": "2021-04-14T11:54:04Z",
                "lastUpdateTime": "2021-04-14T11:54:04Z",
                "message": "This CSR was approved by the Node CSR Approver",
                "reason": "NodeCSRApprove",
                "status": "True",
                "type": "Approved"
            }
        ]
    }
},
No evidence of this issue in the logs of any of the masters, but the machine-approver-controller is still hitting the "tls: internal error" when it fails to approve the cert. Tracing the code from the logs above, it tries to dial back to the (known-problem) kubelet that requested the cert, which seems a bit circular:
https://github.com/openshift/cluster-machine-approver/blob/6768880fc1c3eabdc7b5db9987b69ce5a3f52521/pkg/controller/csr_check.go#L441-L443
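To illustrate the circularity, the recovery check boils down to something like the following (a paraphrase of the idea, not the actual csr_check.go code): dial the kubelet's serving port and read whatever certificate it presents. On a node in this state, that dial fails with the very same "remote error: tls: internal error", so this path can never help approve the node's first serving cert:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net"
	"time"
)

// fetchCurrentServingCert dials the kubelet's serving port and returns the
// leaf certificate it presents. Verification is skipped on purpose: the point
// is to inspect whatever cert is currently being served.
func fetchCurrentServingCert(addr string) (*x509.Certificate, error) {
	conn, err := tls.DialWithDialer(&net.Dialer{Timeout: 5 * time.Second}, "tcp", addr,
		&tls.Config{InsecureSkipVerify: true})
	if err != nil {
		// On a kubelet with no serving cert this is
		// "remote error: tls: internal error" - the same error that blocks the
		// CSR from being approved via this path.
		return nil, err
	}
	defer conn.Close()

	state := conn.ConnectionState()
	if len(state.PeerCertificates) == 0 {
		return nil, fmt.Errorf("no peer certificate presented by %s", addr)
	}
	return state.PeerCertificates[0], nil
}

func main() {
	// Address taken from the machine-approver log above.
	cert, err := fetchCurrentServingCert("10.0.128.3:10250")
	if err != nil {
		fmt.Println("Failed to retrieve current serving cert:", err)
		return
	}
	fmt.Println("current serving cert CN:", cert.Subject.CommonName)
}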
Ryan thinks there's an issue where the cluster-machine-approver is on golang 1.15, but the rest of OpenShift has been updated to 1.16, resulting in some incompatibilities: https://bugzilla.redhat.com/show_bug.cgi?id=1949741 We'll see if this change helps. The timing makes sense, given when the go 1.16 bump landed.
Marking as not a blocker: this does not appear to be a regression, and the impact is more of an issue in CI than in production due to short-lived nodes - this error has been pretty prevalent and is not new, e.g. this log from Dec. 15, 2020:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/98/pull-ci-openshift-cluster-machine-approver-master-e2e-aws/1338899425333874688/artifacts/e2e-aws/gather-extra/pods/openshift-cluster-machine-approver_machine-approver-7f767df9d7-2d92x_machine-approver-controller.log

I think there are two issues here:
1. There seems to be a bug in the kubelet's TLS serving cert management prior to getting a CSR approved, and in the NodeReady logic.
2. Something changed recently, possibly coinciding with the rebase, causing the "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured" test to fail much more frequently. It's probably worth reviewing how that test is implemented.

I'll clone this and send a bug to the monitoring team to investigate #2, and will continue using this one to track #1.
Also proposed upstream: https://github.com/kubernetes/kubernetes/pull/101200
Where is the follow-up to prevent the racing? If this is correct, we would be regressing startup by 15m, which is unacceptable. Reopening this - I expect to see the race condition fully debugged and a gap of <2m from the time a master node comes up to it being ready (it's the approver's job to approve masters with low latency, especially during install). I could tolerate a known few-minute regression at install, but not 15m.
Let me describe expected behavior:
1. master starts up during install
2. creates server certificate
3. machine approver accepts certificate <2-3m after the machine approver has the necessary info
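A rough way to check that expectation against a cluster (a throwaway diagnostic of mine, not product code): list the kubelet-serving CSRs and print how long each took from creation to approval. The kubeconfig path below is just an example:

package main

import (
	"context"
	"fmt"

	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Example kubeconfig path; point it at the cluster under test.
	config, err := clientcmd.BuildConfigFromFlags("", "/tmp/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	csrs, err := client.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, csr := range csrs.Items {
		// Only the kubelet serving CSRs are interesting here.
		if csr.Spec.SignerName != certificatesv1.KubeletServingSignerName {
			continue
		}
		for _, cond := range csr.Status.Conditions {
			if cond.Type == certificatesv1.CertificateApproved {
				latency := cond.LastUpdateTime.Sub(csr.CreationTimestamp.Time)
				fmt.Printf("%s for %s approved after %v\n", csr.Name, csr.Spec.Username, latency)
			}
		}
	}
}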
INFO[2021-04-16T13:06:26Z] Step e2e-aws-serial-ipi-install-install succeeded after 37m20.041593719s.

Average startup time is about 32m on AWS, so this looks like it regresses AWS by about 5m (just a guess), which is too significant for me to accept fixing by waiting alone, given its impact on every customer, CI, and services. So I don't consider "waiting" a complete solution: waiting better is great, but if this won't fix the wait, we need to spawn a separate bug to resolve the wait before this merges.
We now have an answer to:
- Why does the kubelet request another cert after only 5s?

The first cert requested is the kubelet's client cert, which succeeds. The server cert request is the one that hangs for 15m. Unfortunately, the logging doesn't differentiate between the two.

I'll look into seeing if we can retry on the failed CSR faster.
> I'll look into seeing if we can retry on the failed CSR faster.

https://bugzilla.redhat.com/show_bug.cgi?id=1940972

I don't think we can do anything client-side; the controller should not hang for 15m.

> Unfortunately, the logging doesn't differentiate between the two.

Addressed by https://github.com/kubernetes/kubernetes/pull/101252

And https://github.com/kubernetes/kubernetes/pull/101200 is slightly less urgent, but will ensure that kubelets don't mark themselves as ready without a serving cert. No word on this upstream yet.

As Clayton mentioned, this doesn't address the regression, but the approver BZ above should.
Now that this is split and the CSR approval delays will be addressed in #1940972, moving back to POST. Adding BZ link.
Dropping severity to High as the kubelet delays in getting a valid serving cert are recoverable. Leaving priority as Urgent due to CI impact. I'm hoping that the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1940972 will address the majority of this issue.

The approach in https://github.com/kubernetes/kubernetes/pull/101200 won't work because there are a number of cases where it would be undesirable to mark the kubelet NotReady when it does not have a serving cert available. For example, if the serving cert expired and the CSR required manual approval, this could potentially cause a node outage/regression from current behaviour. I will look into surfacing this as a NodeCondition.
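Roughly what I have in mind for the NodeCondition, purely as a sketch - the condition type, reason strings, and helper below are made up and not an accepted Kubernetes API:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// servingCertCondition reports whether the kubelet currently holds a signed
// serving certificate without flipping overall node readiness. The condition
// type and reasons are hypothetical.
func servingCertCondition(hasServingCert bool, now time.Time) corev1.NodeCondition {
	cond := corev1.NodeCondition{
		Type:               corev1.NodeConditionType("KubeletServingCertificateAvailable"), // hypothetical
		LastHeartbeatTime:  metav1.NewTime(now),
		LastTransitionTime: metav1.NewTime(now),
	}
	if hasServingCert {
		cond.Status = corev1.ConditionTrue
		cond.Reason = "ServingCertificateIssued"
		cond.Message = "kubelet has a signed serving certificate"
	} else {
		cond.Status = corev1.ConditionFalse
		cond.Reason = "AwaitingCertificateSigningRequest"
		cond.Message = "no serving certificate available for the kubelet"
	}
	return cond
}

func main() {
	fmt.Printf("%+v\n", servingCertCondition(false, time.Now()))
}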
Commented on https://bugzilla.redhat.com/show_bug.cgi?id=1940972#c12 - I am still seeing 15m timeouts in CI, so I am not sure the fix that went in has addressed machine-approver responsiveness yet.
Since the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1952282 went in on the 23rd, the test has stopped failing. Upstream seems lukewarm on https://github.com/kubernetes/kubernetes/pull/101200, even surfaced as a condition. Is there any remaining work for the kubelet?
Closing as CI is passing. Does not appear to have been a kubelet issue and no remaining follow-ups for kubelet.