Bug 1950114

Summary: Significant spike in "tls: internal errors" when contacting apiserver via exec (roughly same timeframe as rebase)

Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Status: CLOSED WORKSFORME
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Reporter: Elana Hashman <ehashman>
Assignee: Simon Pasquier <spasquie>
QA Contact: Junqi Zhao <juzhao>
Docs Contact:
CC: alegrand, anpicker, aos-bugs, ccoleman, dgrisonn, erooth, kakkoyun, lcosic, mfojtik, pkrupa, schoudha, spasquie, surbania, wking, xxia
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1949288
Environment:
Type: ---
Last Closed: 2021-04-26 07:57:37 UTC
Bug Depends On: 1940972, 1949288, 1951213
Bug Blocks:
Attachments: kubelet_server_expiration_renew_errors (no flags)

Description Elana Hashman 2021-04-15 19:08:19 UTC
+++ This bug was initially created as a clone of Bug #1949288 +++

Sometime on Tuesday or Wednesday April 6th+ we started to see a very large spike in tests failing with the error "tls: internal error":

---

fail [k8s.io/kubernetes.0/test/e2e/storage/drivers/in_tree.go:1910]: Unexpected error:
    <*errors.StatusError | 0xc0004c7180>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "error dialing backend: remote error: tls: internal error",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    error dialing backend: remote error: tls: internal error

---

This occurs on AWS/GCP/Azure. Before the 6th we were seeing it only 1-4 times a day, but now it's tens of failures a day, which roughly correlates with the rebase.  It seems to be tied to exec - it could be exec-specific behavior, a TLS error, a failure to properly handle a new type of TLS error, a failure in the node, etc., but it's hard to say.

https://search.ci.openshift.org/?search=tls%3A+internal+error&maxAge=336h&context=0&type=bug%2Bjunit&name=4%5C.8%7Cmaster&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none


We also see it when calling the thanos querier for alert info via exec - I *believe* the failure here is purely on the kube-apiserver -> kubelet exec side, not the curl side (so not a thanos-specific problem)


fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:454]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "unable to execute query ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' \"https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1\":\nCommand stdout:\n\nstderr:\nError from server: error dialing backend: remote error: tls: internal error\n\nerror:\nexit status 1\n",
        },
    ]
    unable to execute query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1":
    Command stdout:
    
    stderr:
    Error from server: error dialing backend: remote error: tls: internal error
    
    error:
    exit status 1
    
occurred

---

Urgent because in some jobs we see 1 in 2 runs fail (not sure why; possibly something platform-specific amplifies it, but it fails on all platforms we have seen)

--- Additional comment from Lukasz Szaszkiewicz on 2021-04-14 14:38:24 UTC ---

It looks like the TLS errors coincide with kubelet serving certificate rotation.

Please have a look at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032 or https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp/1382275964557332480/artifacts/e2e-gcp/gather-extra/artifacts/nodes/ci-op-p4f8h5y2-16f7c-rzt5v-worker-c-jg9tw/

and grep for "no serving certificate available for the kubelet" and "Rotating certificates" on the worker nodes.

--- Additional comment from Elana Hashman on 2021-04-14 21:21:37 UTC ---

I took a look at the failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032

Journal logs for the worker that served this are here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/nodes/ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9/journal

Query made at 11:53:09 through 11:53:50 (from test stdout):

Apr 14 11:53:09.660: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"' 
...
Apr 14 11:53:50.684: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'


In the accompanying kubelet logs, looking at that time range, we see a lot of the following errors in the journal:

Apr 14 11:53:09.955199 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:09.955133    1505 log.go:184] http: TLS handshake error from 10.0.0.4:58696: no serving certificate available for the kubelet
Apr 14 11:53:13.456495 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:13.456458    1505 log.go:184] http: TLS handshake error from 10.0.128.4:56440: no serving certificate available for the kubelet
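
As background on why the caller only sees "remote error: tls: internal error": in Go's crypto/tls, when the server's GetCertificate callback returns an error (which is what the kubelet's dynamic serving-cert provider does while it has no issued certificate), the server aborts the handshake with an internal_error alert, and the detailed reason is only visible in the server's own log. A minimal, self-contained sketch of that mechanism, for illustration only and not the kubelet's actual code:

package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
)

func main() {
	// A TLS server whose GetCertificate callback has no certificate yet,
	// similar to a kubelet still waiting for its serving CSR to be approved.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srvConf := &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			return nil, errors.New("no serving certificate available (simulated)")
		},
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		// The handshake fails here; the real reason is only visible in the
		// server's own log (compare the kubelet journal lines above).
		_ = tls.Server(conn, srvConf).Handshake()
		conn.Close()
	}()

	// The client (standing in for the apiserver dialing the kubelet)
	// only sees a generic internal_error alert.
	_, err = tls.Dial("tcp", ln.Addr().String(), &tls.Config{InsecureSkipVerify: true})
	fmt.Println(err) // prints: remote error: tls: internal error
}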


I looked for the first instance of this error, and it seems to start at 11:39:02, not long after the node booted (11:36:53):

Apr 14 11:39:02.235310 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.235272    1505 log.go:184] http: TLS handshake error from 10.0.0.4:40238: no serving certificate available for the kubelet

The last instance of this error is at 11:54:04, after which the kubelet successfully starts a CSR reflector:

Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688    1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146

--- Additional comment from Elana Hashman on 2021-04-14 21:29:08 UTC ---

Tracing the certificate manager:

Apr 14 11:38:55.867442 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.867411    1505 certificate_manager.go:282] Certificate rotation is enabled.
Apr 14 11:38:55.869072 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.869035    1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:38:55.879208 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879148    1505 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
Apr 14 11:38:55.879416 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879383    1505 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/kubelet-ca.crt
Apr 14 11:38:55.900724 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.900676    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:56.111279 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.111176    1505 csr.go:249] certificate signing request csr-mqgcd is approved, waiting to be issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126665    1505 csr.go:245] certificate signing request csr-mqgcd is issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126750    1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128385    1505 certificate_manager.go:556] Certificate expiration is 2021-04-15 11:18:03 +0000 UTC, rotation deadline is 2021-04-15 04:59:43.812583687 +0000 UTC
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128444    1505 certificate_manager.go:288] Waiting 17h20m46.68414809s for next certificate rotation

The first round of CSRs is fine; we're supposed to wait ~17h before the next rotation.
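
For reference, the ~17h comes from client-go picking a jittered rotation deadline somewhere between roughly 70% and 90% of the certificate's validity window. A rough sketch of that computation (my own names, and a NotBefore assumed for illustration), not the upstream code:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextRotationDeadline approximates how the client-go certificate manager
// schedules rotation: a random point between ~70% and ~90% of the
// certificate's validity window. (Sketch only; the real logic lives in
// k8s.io/client-go/util/certificate/certificate_manager.go.)
func nextRotationDeadline(notBefore, notAfter time.Time) time.Time {
	total := float64(notAfter.Sub(notBefore))
	jittered := time.Duration(total * (0.7 + 0.2*rand.Float64()))
	return notBefore.Add(jittered)
}

func main() {
	// Expiration taken from the log above; NotBefore is an assumption
	// (a ~24h validity window) for illustration only.
	notAfter := time.Date(2021, 4, 15, 11, 18, 3, 0, time.UTC)
	notBefore := notAfter.Add(-24 * time.Hour)
	deadline := nextRotationDeadline(notBefore, notAfter)
	fmt.Println("rotation deadline:", deadline)
	// With a 24h window this lands between ~04:06 and ~08:54 UTC on Apr 15,
	// consistent with the observed deadline of 04:59:43.
}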

Apr 14 11:39:01.216428 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:01.216406    1505 certificate_manager.go:282] Certificate rotation is enabled.

But then, 5 seconds later, we rotate them again?

Apr 14 11:39:02.217698 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.217652    1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:39:02.232684 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.232560    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146


Apr 14 11:54:02.229274 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:02.229248    1505 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition

This round fails, but it takes 15m to time out.

Then, in the next run of the certificate manager, it successfully pulls certs, and the errors go away.

Apr 14 11:54:04.366789 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.366731    1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688    1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:54:04.410213 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.410156    1505 csr.go:249] certificate signing request csr-sz2fb is approved, waiting to be issued
Apr 14 11:54:04.456706 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456618    1505 csr.go:245] certificate signing request csr-sz2fb is issued
Apr 14 11:54:04.456895 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456839    1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146


I have some questions:

- Why does the kubelet request another cert after only 5s?
- Why does the controller manager fail to respond with a cert?

--- Additional comment from Elana Hashman on 2021-04-14 21:37:32 UTC ---

Apr 14 11:54:03.377045 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:03.376886    1505 kubelet_pods.go:1285] "Killing unwanted pod" podName="execpod"
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:03.377757    1505 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: goroutine 24750 [running]:
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: panic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult.func1(0xc0007976f0, 0xc000b3a580, 0x0, 0x0, 0xc001c432c0, 0xc0008a62a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:697 +0x33a
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: created by k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:690 +0x105

Kubelet appears to have crashed at 11:54:03, but this might be a red herring, since the timeout and next CSR request occur within 2s of each other. (We should probably follow up on this regardless.)

--- Additional comment from Elana Hashman on 2021-04-14 21:54:53 UTC ---

Another question:

- Why is the node marked as NodeReady if its cert broke?

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880    1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"

--- Additional comment from Elana Hashman on 2021-04-14 22:17:53 UTC ---

Networking was down when the original CSR was filed. See e.g. errors like

Apr 14 11:39:01.784836 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:01.784813    1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-multus/network-metrics-daemon-qdt8t" podUID=7be17f6c-5ee9-4e1d-a340-d2602b8f7979
...
Apr 14 11:39:55.371302 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:55.370401    1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-network-diagnostics/network-check-target-zf6tb" podUID=08b8d181-a0eb-4e08-ae4b-d66e972e9f6e

after which it finally starts networking and declares the node ready at 11:40:01

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880    1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"


So we know the answer to:

- Why does the controller manager fail to respond with a cert?

The networking wasn't up on the node. The certificate manager in client-go waits a full 15m for the CSR to be signed before giving up and retrying: https://github.com/openshift/kubernetes/blob/74497632f4118e7989da73de0ed6f1268aeb64ad/staging/src/k8s.io/client-go/util/certificate/certificate_manager.go#L47
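
The linked constant is a 15-minute ceiling on how long a single rotation attempt waits for its CSR to be approved and issued. A sketch of the shape of that logic, with placeholder names rather than the upstream implementation:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// certificateWaitTimeout mirrors the 15-minute constant linked above: the
// maximum time one rotation attempt waits for its CSR to be signed.
const certificateWaitTimeout = 15 * time.Minute

// rotateCerts sketches one rotation attempt: submit a CSR, then block until it
// is issued or the timeout fires. waitForCertificate stands in for the
// client-go helper that watches the CSR object for approval/issuance.
func rotateCerts(ctx context.Context, waitForCertificate func(context.Context) ([]byte, error)) error {
	ctx, cancel := context.WithTimeout(ctx, certificateWaitTimeout)
	defer cancel()

	if _, err := waitForCertificate(ctx); err != nil {
		// This is the path behind "certificate request was not signed:
		// timed out waiting for the condition" in the kubelet journal.
		return fmt.Errorf("certificate request was not signed: %w", err)
	}
	// The real manager would store the new cert and swap it into the
	// kubelet's TLS config here.
	return nil
}

func main() {
	// Simulate an approver that never answers (e.g. node networking is down).
	// Note: running this as-is really does block for the full 15 minutes.
	neverSigned := func(ctx context.Context) ([]byte, error) {
		<-ctx.Done()
		return nil, errors.New("timed out waiting for the condition")
	}
	fmt.Println(rotateCerts(context.Background(), neverSigned))
}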


Remaining questions:

- Why does the kubelet request another cert after only 5s?
- Why is the node marked as NodeReady if it doesn't have a cert? (amended slightly)

--- Additional comment from Elana Hashman on 2021-04-14 22:33:23 UTC ---

Cert request at 11:39:02 is neither accepted nor rejected:

{
            "apiVersion": "certificates.k8s.io/v1",
            "kind": "CertificateSigningRequest",
            "metadata": {
                "creationTimestamp": "2021-04-14T11:39:02Z",
                "generateName": "csr-",
                "name": "csr-6jl4f",
                "resourceVersion": "20413",
                "uid": "c470f86a-f2b4-4912-9d5f-a4b32d7965b1"
            },
            "spec": {
                "groups": [
                    "system:nodes",
                    "system:authenticated"
                ],
                "request": "blah",
                "signerName": "kubernetes.io/kubelet-serving",
                "usages": [
                    "digital signature",
                    "key encipherment",
                    "server auth"
                ],
                "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
            },
            "status": {}
        },

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-5f895cd888-z87rf_machine-approver-controller.log

The machine-approver-controller logs (the pod runs on "nodeName": "ci-op-7mr744x6-fc795-jdxcp-master-0") say:

I0414 11:39:02.229047       1 controller.go:114] Reconciling CSR: csr-6jl4f
I0414 11:39:02.229676       1 csr_check.go:150] csr-6jl4f: CSR does not appear to be client csr
I0414 11:39:02.232695       1 csr_check.go:441] retrieving serving cert from ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 (10.0.128.3:10250)
I0414 11:39:02.235536       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0414 11:39:02.235678       1 csr_check.go:191] Falling back to machine-api authorization for ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9
E0414 11:39:02.235733       1 csr_check.go:196] csr-6jl4f: Serving Cert: No target machine for node "ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
I0414 11:39:02.235757       1 controller.go:172] csr-6jl4f: CSR not authorized

Same error: why is this running into the same issue?


Second cert request succeeds at 11:54:04:

        {
            "apiVersion": "certificates.k8s.io/v1",
            "kind": "CertificateSigningRequest",
            "metadata": {
                "creationTimestamp": "2021-04-14T11:54:04Z",
...
                "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
            },
            "status": {
                "certificate": "blah",
                "conditions": [
                    {
                        "lastTransitionTime": "2021-04-14T11:54:04Z",
                        "lastUpdateTime": "2021-04-14T11:54:04Z",
                        "message": "This CSR was approved by the Node CSR Approver",
                        "reason": "NodeCSRApprove",
                        "status": "True",
                        "type": "Approved"
                    }
                ]
            }
        },

--- Additional comment from Elana Hashman on 2021-04-14 23:37:47 UTC ---

No evidence of this issue occurring in the logs of any of the masters, but the machine-approver-controller is still hitting the "tls: internal error" issue when it fails to approve the cert.

Tracing the code from the logs above, the approver seems to dial back to the (known-problem) kubelet that requested the cert in order to fetch its current serving cert; that seems a bit circular? https://github.com/openshift/cluster-machine-approver/blob/6768880fc1c3eabdc7b5db9987b69ce5a3f52521/pkg/controller/csr_check.go#L441-L443
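
Roughly, that dial-back looks like the sketch below; the function name and error wrapping are mine, not the actual csr_check.go code. Because the kubelet has no serving cert yet, the handshake fails with the same "tls: internal error" the exec path sees, and the approver falls back to machine-api authorization:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net"
	"time"
)

// getCurrentServingCert sketches the "retrieving serving cert from ..." step:
// dial the kubelet's TLS port and return whatever leaf certificate it
// presents. When the kubelet has no serving cert yet, the handshake fails
// with "remote error: tls: internal error" (the same error the exec path
// sees), and the approver has to fall back to machine-api checks.
func getCurrentServingCert(nodeAddr string) (*x509.Certificate, error) {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", nodeAddr, &tls.Config{
		// We only want to inspect the presented cert, not authenticate it.
		InsecureSkipVerify: true,
	})
	if err != nil {
		return nil, fmt.Errorf("failed to retrieve current serving cert: %w", err)
	}
	defer conn.Close()
	return conn.ConnectionState().PeerCertificates[0], nil
}

func main() {
	// Address taken from the machine-approver log above.
	cert, err := getCurrentServingCert("10.0.128.3:10250")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("kubelet currently serves cert for:", cert.Subject.CommonName)
}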

--- Additional comment from Elana Hashman on 2021-04-15 00:44:14 UTC ---

Ryan thinks there's an issue where the cluster-machine-approver is on golang 1.15, but the rest of OpenShift has been updated to 1.16, resulting in some incompatibilities: https://bugzilla.redhat.com/show_bug.cgi?id=1949741

We'll see if this change helps. The timing makes sense, given when the go 1.16 bump landed.

--- Additional comment from Elana Hashman on 2021-04-15 19:07:21 UTC ---

Marking as not a blocker: this does not appear to be a regression, and the impact appears to be more of an issue in CI than in production because CI nodes are short-lived. We can see that this error has been pretty prevalent and is not new, e.g. this log from Dec. 15, 2020: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/98/pull-ci-openshift-cluster-machine-approver-master-e2e-aws/1338899425333874688/artifacts/e2e-aws/gather-extra/pods/openshift-cluster-machine-approver_machine-approver-7f767df9d7-2d92x_machine-approver-controller.log

I think there are two issues here:

1. There seems to be a bug in the kubelet's TLS cert management / NodeReady logic prior to getting a CSR approved.
2. Something changed recently, possibly coinciding with the rebase, causing the "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured" test to fail much more frequently. It's probably worth reviewing how that test is implemented.

I'll clone and send a bug to the monitoring team to investigate #2, and will continue using this to track #1.

Comment 1 Damien Grisonnet 2021-04-16 10:23:34 UTC
There was a change to the CI around a month ago to require no firing alerts during upgrades: https://github.com/openshift/origin/pull/25904

As far as I can tell, this change is the main reason why we are seeing this test fail more often. Before that, the test was already failing quite often, since alerts are a very powerful means of detecting bugs and regressions. But now, from what I've seen in the recent Bugzillas, the test has been detecting resiliency and availability issues during upgrades, which is to be expected since some operators/operands may not be highly available or might not implement the HA conventions: https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability.

I also had a look at the implementation of the test and nothing really struck me as having possibly been broken by the 1.21 rebase.

Let me know if you think there might be something that I missed, otherwise I think we can safely close this bug.

Comment 2 Simon Pasquier 2021-04-16 13:43:07 UTC
Agree with Damien. My understanding is that the test fails due to something failing between the apiserver and the kubelet.

Looking at one specific CI failure [1], the pod started by the e2e suite to check for alerts is scheduled on node ip-10-0-253-175.ec2.internal and runs from 09:50:57 to 09:52:01.
The journal from this node [2] shows that it tries to rotate certificates at 09:37:34 but times out at 09:52:34.


Apr 16 09:37:28.230740 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:28.230712    1413 certificate_manager.go:282] Certificate rotation is enabled.
Apr 16 09:37:28.231028 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:28.230994    1413 certificate_manager.go:412] Rotating certificates
Apr 16 09:37:29.592548 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:29.592471    1413 certificate_manager.go:556] Certificate expiration is 2021-04-17 09:18:25 +0000 UTC, rotation deadline is 2021-04-17 02:52:32.076574345 +0000 UTC
Apr 16 09:37:29.592548 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:29.592529    1413 certificate_manager.go:288] Waiting 17h15m2.484051627s for next certificate rotation
Apr 16 09:37:33.354568 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:33.354535    1413 certificate_manager.go:282] Certificate rotation is enabled.
Apr 16 09:37:34.355906 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:34.355848    1413 certificate_manager.go:412] Rotating certificates
Apr 16 09:52:34.519284 ip-10-0-253-175 hyperkube[1413]: E0416 09:52:34.519281    1413 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition
Apr 16 09:52:36.716896 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:36.716831    1413 certificate_manager.go:412] Rotating certificates
Apr 16 09:52:37.745923 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:37.745875    1413 certificate_manager.go:556] Certificate expiration is 2021-04-17 09:18:25 +0000 UTC, rotation deadline is 2021-04-17 05:51:42.525959018 +0000 UTC
Apr 16 09:52:37.745923 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:37.745917    1413 certificate_manager.go:288] Waiting 19h59m4.780049227s for next certificate rotation
Apr 16 09:52:38.747149 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:38.747084    1413 certificate_manager.go:556] Certificate expiration is 2021-04-17 09:18:25 +0000 UTC, rotation deadline is 2021-04-17 06:26:01.796651049 +0000 UTC
Apr 16 09:52:38.747149 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:38.747133    1413 certificate_manager.go:288] Waiting 20h33m23.049523342s for next certificate rotation


[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26054/pull-ci-openshift-origin-master-e2e-aws-disruptive/1382983481180557312
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26054/pull-ci-openshift-origin-master-e2e-aws-disruptive/1382983481180557312/artifacts/e2e-aws-disruptive/gather-extra/artifacts/nodes/ip-10-0-253-175.ec2.internal/journal

Comment 3 Simon Pasquier 2021-04-16 13:53:14 UTC
Created attachment 1772448 [details]
kubelet_server_expiration_renew_errors

Some kubelets report a non-zero value for their kubelet_server_expiration_renew_errors metric.

Comment 4 Elana Hashman 2021-04-16 18:08:04 UTC
One possible thing that comes to mind: right now, this test fetches alerts by (I think) using kubectl exec to run curl inside a pod. This seems a little convoluted and more likely to pick up spurious failures. Is there another way we could pull the alert data that might be less failure-prone?

Comment 5 W. Trevor King 2021-04-16 18:36:35 UTC
While there may be other mechanisms to make PromQL calls to list alerts, we want 'exec' to be reliable, right?  We could probably cover "we don't want alerts firing" and "we want exec to be reliably available" separately, but I don't think we want to address only the first without having something covering the latter.

Comment 6 Elana Hashman 2021-04-16 20:21:28 UTC
Yes, the issue with the kubelet being marked as Ready when it can't yet exec is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1949288 separately.

Comment 7 W. Trevor King 2021-04-16 23:07:38 UTC
Ok.  And there are unit tests in the kubelet PRs proposed for bug 1949288.  But currently this "shouldn't report any alerts in firing state" test-case is your integration coverage, right?  So "stop using 'exec' for the alerts-are-firing test-case, and use something more reliable" makes sense to me.  But you'd want to couple it with a "make sure 'exec' is reliable" integration test (possibly under the bug 1949288 umbrella), right?  Or are we confident enough in bug 1949288's new unit tests that we don't think we need exec-is-reliable integration coverage?

Comment 8 Elana Hashman 2021-04-17 01:03:35 UTC
Generally I don't think integration coverage should be a side effect of another test's implementation; it's better to be direct.

There's already integration coverage for exec working, see e.g. "[sig-cli] Kubectl client Simple pod should support exec using resource/name [Suite:openshift/conformance/parallel] [Suite:k8s]"

The timing of this test seems to be exposing a race condition at initial kubelet boot, which has gotten more frequent recently due to what I suspect is a number of OpenShift-specific startup timing changes lining up. I've never seen an instance of this failure in upstream Kubernetes (yet!). In a production environment, I suspect the likelihood and impact of this happening would be low/rare. I think it's also a coincidence that an exec exposes this issue, as the impact would not be limited to exec; anything that tries to connect to a kubelet's server would be affected (e.g. oc logs).

Comment 9 W. Trevor King 2021-04-17 02:27:09 UTC
I agree that direct tests are better.  Currently this test covers both (a) alerts not firing and (b) 'exec' works.  Moving it to just cover (a) is great, but we want something on top of (b) too.  Can we poll the kubelet APIs on each node from the test suite to get an uptime monitor, like we already do for the Kube and OpenShift APIs?  Or something like that?  Because I expect we will continue to see OpenShift-specific startup timing issues, and we want to notice when we do introduce them, even if they don't have much impact outside of node reboots.  Because node reboots are already painful enough due to rescheduling users' legacy workloads; the smoother our parts are, the better.
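
One possible shape for such a monitor, sketched with placeholder node addresses and intervals rather than an existing origin helper: poll each node's kubelet port and record any window in which the TLS handshake fails, since a failed handshake is exactly the symptom in this bug.

package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// pollKubeletServing periodically attempts a TLS handshake against each
// node's kubelet port and logs any window in which it fails (for example,
// "no serving certificate available for the kubelet"). Addresses and the
// interval are placeholders for whatever the test suite would resolve.
func pollKubeletServing(nodeAddrs []string, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			for _, addr := range nodeAddrs {
				d := &net.Dialer{Timeout: 3 * time.Second}
				conn, err := tls.DialWithDialer(d, "tcp", addr, &tls.Config{InsecureSkipVerify: true})
				if err != nil {
					fmt.Printf("%s kubelet %s unavailable: %v\n",
						time.Now().Format(time.RFC3339), addr, err)
					continue
				}
				conn.Close()
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go pollKubeletServing([]string{"10.0.128.3:10250"}, 10*time.Second, stop)
	time.Sleep(30 * time.Second)
	close(stop)
}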

Comment 10 Clayton Coleman 2021-04-19 13:29:01 UTC
Exec may NOT be unreliable (i.e. exec must be reliable).  Every exec call to OpenShift should pass except those that take >2m to complete, and this one does not.

Alerts may not fire during upgrade; that part is definite.

I don't think this is a bug at all (please update the description though).

Comment 11 Clayton Coleman 2021-04-19 13:32:17 UTC
Let me describe expected behavior:

1. exec during a normal serial or parallel conformance run has no disruption for exec calls <1m30s
2. apiserver rotation should be transparent to end users of exec
3. tests that fail on exec are only at fault if they take >1m30s to run AND don't retry their operation if the exec session is closed gracefully
4. it should be possible to see a graceful exec session close by the error code returned by exec closure

Comment 12 Simon Pasquier 2021-04-20 11:06:21 UTC
Is there anything that the monitoring team needs to do regarding this bug? As far as I understand, we don't want to increase the retry period of the "kubectl exec" command above 2 minutes in the end-to-end tests.

Comment 13 W. Trevor King 2021-04-22 01:18:40 UTC
Per [1], I thought this bug was about the implementation of the:

  Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured

test case, and specifically, the use of exec to evaluate PromQL [2].  Bug 1949288 is about actually fixing the unreliable exec behavior.  Per comment 8 here, Elana doesn't like exec availability getting covered indirectly via exec-based-PromQL, and that's fair.  I'm not sure anyone's stepped up yet to own something that directly covers exec availability.  It's possible that could go in under bug 1949288 as well, or via a third bug series.

If for whatever reason the monitoring folks can think of a more canonical approach to executing PromQL within e2e test-cases, this bug would track that work.  But with the hypothetical direct guard on exec availability in place, it's not obvious to me what the motivation would be for moving to another, exec-free PromQL approach.  In which case we can close this NOTABUG or WORKSFORME or some such.

Am I missing something?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1949288#c10
[2]: https://github.com/openshift/origin/blob/e945cb88da780e21c021b6c8b430454bcfb881cf/test/extended/util/prometheus/helpers.go#L155-L156

Comment 14 Elana Hashman 2021-04-24 00:18:04 UTC
That's an accurate summary.

Comment 15 Simon Pasquier 2021-04-26 07:57:37 UTC
Thanks Trevor for the detailed summary! The only other option I can think of is to go through the public prometheus-k8s route. This is already implemented by the "leader changes are not excessive [Late]" e2e test [1][2]: instead of depending on "oc exec", the test now depends on the availability of the route.

Closing as WORKSFORME for now.

[1] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/test/extended/etcd/leader_changes.go#L20
[2] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/test/extended/prometheus/client/via_route.go#L27
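
For illustration, a route-based query could look roughly like the sketch below; the route host, token source, and TLS handling are placeholders, not what via_route.go actually does:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// queryViaRoute sketches the route-based alternative: instead of using
// "oc exec" to curl the in-cluster service, hit the externally exposed
// monitoring route directly with a bearer token.
func queryViaRoute(routeHost, token, promQL string) (string, error) {
	u := url.URL{
		Scheme:   "https",
		Host:     routeHost,
		Path:     "/api/v1/query",
		RawQuery: url.Values{"query": {promQL}}.Encode(),
	}
	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	// The real helper would trust the router CA; skipping verification here
	// just keeps the sketch self-contained.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("query returned %s: %s", resp.Status, body)
	}
	return string(body), nil
}

func main() {
	// PROM_ROUTE_HOST and PROM_TOKEN are hypothetical environment variables
	// standing in for whatever the test suite resolves from the cluster.
	out, err := queryViaRoute(os.Getenv("PROM_ROUTE_HOST"), os.Getenv("PROM_TOKEN"),
		`ALERTS{alertstate="firing",severity!="info"} >= 1`)
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println(out)
}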