Bug 1951213 - [cluster-machine-approver] CSR approval hangs for 15m+
Summary: [cluster-machine-approver] CSR approval hangs for 15m+
Keywords:
Status: CLOSED DUPLICATE of bug 1940972
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: 4.8.0
Assignee: Alberto
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 1949288
Blocks: 1950114
TreeView+ depends on / blocked
 
Reported: 2021-04-19 20:17 UTC by Elana Hashman
Modified: 2021-04-19 20:36 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1949288
Environment:
Last Closed: 2021-04-19 20:36:37 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Elana Hashman 2021-04-19 20:17:40 UTC
+++ This bug was initially created as a clone of Bug #1949288 +++

Sometime on Tuesday or Wednesday April 6th+ we started to see a very large spike in tests failing with the error "tls: internal error":

---

fail [k8s.io/kubernetes.0/test/e2e/storage/drivers/in_tree.go:1910]: Unexpected error:
    <*errors.StatusError | 0xc0004c7180>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "error dialing backend: remote error: tls: internal error",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    error dialing backend: remote error: tls: internal error

---

This occurs on AWS/GCP/Azure, and before the 6th we were seeing it only 1-4 times a day, but now it's tens of failures a day, which roughly correlates with the rebase.  It seems to be on exec - it could be an exec specific behavior, a TLS error, a failure to properly handle a new type of TLS error, a failure in the node, etc, but hard to say

https://search.ci.openshift.org/?search=tls%3A+internal+error&maxAge=336h&context=0&type=bug%2Bjunit&name=4%5C.8%7Cmaster&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none


We also see it when calling the thanos querier for alert info via exec - I *believe* the message here is purely on the kube-apiserver -> kubelet exec side, but not the curl side (so not a thanos specific problem)


fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:454]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "unable to execute query ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' \"https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1\":\nCommand stdout:\n\nstderr:\nError from server: error dialing backend: remote error: tls: internal error\n\nerror:\nexit status 1\n",
        },
    ]
    unable to execute query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-ggqljv22-ce9aa.ci.azure.devcluster.openshift.com:6443 --kubeconfig=/tmp/admin.kubeconfig --namespace=e2e-test-prometheus-frrpm exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1":
    Command stdout:
    
    stderr:
    Error from server: error dialing backend: remote error: tls: internal error
    
    error:
    exit status 1
    
occurred

---

Urgent because in some jobs we see 1/2 runs fail (not sure why, possibly due to a platform specific implication, but fails on all platforms that we have seen)

--- Additional comment from Lukasz Szaszkiewicz on 2021-04-14 14:38:24 UTC ---

It looks like the TLS error match kubelet serving certificate rotation.

Please have a look at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032 or https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp/1382275964557332480/artifacts/e2e-gcp/gather-extra/artifacts/nodes/ci-op-p4f8h5y2-16f7c-rzt5v-worker-c-jg9tw/

and grep for "no serving certificate available for the kubelet" and "Rotating certificates" on the worker nodes.

--- Additional comment from Elana Hashman on 2021-04-14 21:21:37 UTC ---

I took a look at the failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032

Journal logs for the worker that served this are here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/nodes/ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9/journal

Query made at 11:53:09 through 11:53:50 (from test stdout):

Apr 14 11:53:09.660: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"' 
...
Apr 14 11:53:50.684: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-7mr744x6-fc795.origin-ci-int-gce.dev.openshift.com:6443 --kubeconfig=/tmp/kubeconfig-949214330 --namespace=e2e-test-prometheus-md6xb exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IllLeGJIMW9ES2t1ZExINldVeFNfVWlLN09zTVpzM1YzVVpWRm1xeUN1YjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNHg3cnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjA3NTlkZmYtMGZiZC00MjVlLTljYTItZWJkYWI2YjBhNWI5Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.GQ3FehZWaDDy5Tk7OiKnjZmlrNiQQH1JxNYhg9KTWCZg0f0rcMYZsuqBDKTXS3q90rVrHJRXwKKndWCkR4RuHJFHbOrofpRQ6YTZkfbJQAqG6IbAy8L_kVcbXlnrE0iQ-oNpT-OR6bv64tOiV5Ye6s-XssVaeUIRCdNeJJWJyUO4iK4mTeapMfzM3AtfURM-iV7ZmiJx6fIQnlwCYMZ-wDbhEi9mmTPCd97hr5NVYHoMkWa-9Uc5yWMHHa7tmQihyUHESMwwopUI81pq7fEsrq2NFIpvkzl7qEAuujo8WylL7tR10JBDvlnRuIrVYQohUPof4HlzFdX49zQTFteToQ' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'


In the accompanying kubelet logs, looking at that time range, we see a lot of the following errors in the journal:

Apr 14 11:53:09.955199 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:09.955133    1505 log.go:184] http: TLS handshake error from 10.0.0.4:58696: no serving certifica
te available for the kubelet
Apr 14 11:53:13.456495 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:53:13.456458    1505 log.go:184] http: TLS handshake error from 10.0.128.4:56440: no serving certifi
cate available for the kubelet


I looked for the first instance of this error, and it seems to start at 11:39:02, which is not that long after when the node booted (11:36:53):

Apr 14 11:39:02.235310 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.235272    1505 log.go:184] http: TLS handshake error from 10.0.0.4:40238: no serving certificate available for the kubelet

The last instance of this error is at 11:54:04, after which the kubelet successfully starts a CSR reflector:

Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688    1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146

--- Additional comment from Elana Hashman on 2021-04-14 21:29:08 UTC ---

Tracing the certificate manager:

Apr 14 11:38:55.867442 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.867411    1505 certificate_manager.go:282] Certificate rotation is enabled.
Apr 14 11:38:55.869072 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.869035    1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:38:55.879208 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879148    1505 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
Apr 14 11:38:55.879416 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.879383    1505 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/kubelet-ca.crt
Apr 14 11:38:55.900724 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:55.900676    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:56.111279 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.111176    1505 csr.go:249] certificate signing request csr-mqgcd is approved, waiting to be issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126665    1505 csr.go:245] certificate signing request csr-mqgcd is issued
Apr 14 11:38:56.126804 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:56.126750    1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128385    1505 certificate_manager.go:556] Certificate expiration is 2021-04-15 11:18:03 +0000 UTC, rotation deadline is 2021-04-15 04:59:43.812583687 +0000 UTC
Apr 14 11:38:57.128476 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:38:57.128444    1505 certificate_manager.go:288] Waiting 17h20m46.68414809s for next certificate rotation

First run of CSRs is fine, we're supposed to wait 17h.

Apr 14 11:39:01.216428 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:01.216406    1505 certificate_manager.go:282] Certificate rotation is enabled.

But then, 5 seconds later, we rotate them again?

Apr 14 11:39:02.217698 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.217652    1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:39:02.232684 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:39:02.232560    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146


Apr 14 11:54:02.229274 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:02.229248    1505 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition

This round fails, but it takes 15m to time out.

Then, in the next run of the certificate manager, it successfully pulls certs, and the errors go away.

Apr 14 11:54:04.366789 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.366731    1505 certificate_manager.go:412] Rotating certificates
Apr 14 11:54:04.380719 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380688    1505 log.go:184] http: TLS handshake error from 10.0.0.4:59510: no serving certificate available for the kubelet
Apr 14 11:54:04.381022 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.380998    1505 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146
Apr 14 11:54:04.410213 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.410156    1505 csr.go:249] certificate signing request csr-sz2fb is approved, waiting to be issued
Apr 14 11:54:04.456706 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456618    1505 csr.go:245] certificate signing request csr-sz2fb is issued
Apr 14 11:54:04.456895 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:04.456839    1505 reflector.go:225] Stopping reflector *v1.CertificateSigningRequest (0s) from k8s.io/client-go/tools/watch/informerwatcher.go:146


I have some questions:

- Why does the kubelet request another cert after only 5s?
- Why does the controller manager fail to respond with a cert?

--- Additional comment from Elana Hashman on 2021-04-14 21:37:32 UTC ---

Apr 14 11:54:03.377045 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:54:03.376886    1505 kubelet_pods.go:1285] "Killing unwanted pod" podName="execpod"
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:54:03.377757    1505 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: goroutine 24750 [running]:
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: panic(0x41fffe0, 0x75e46a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult.func1(0xc0007976f0, 0xc000b3a580, 0x0, 0x0, 0xc001c432c0, 0xc0008a62a0)
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:697 +0x33a
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: created by k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult
Apr 14 11:54:03.377798 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]:         /tmp/openshift/build-rpms/rpm/BUILD/openshift-1.21.0/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:690 +0x105

Kubelet appears to have crashed at 11:54:03, but this might be a red herring, since the timeout and next CSR request occur within 2s of each other. (We should probably follow up on this regardless.)

--- Additional comment from Elana Hashman on 2021-04-14 21:54:53 UTC ---

Another question:

- Why is the node marked as NodeReady if its cert broke?

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880    1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"

--- Additional comment from Elana Hashman on 2021-04-14 22:17:53 UTC ---

Networking was down when the original CSR was filed. See e.g. errors like

Apr 14 11:39:01.784836 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:01.784813    1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-multus/network-metrics-daemon-qdt8t" podUID=7be17f6c-5ee9-4e1d-a340-d2602b8f7979
...
Apr 14 11:39:55.371302 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: E0414 11:39:55.370401    1505 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: co
ntainer runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your netw
ork provider started?" pod="openshift-network-diagnostics/network-check-target-zf6tb" podUID=08b8d181-a0eb-4e08-ae4b-d66e972e9f6e

after which it finally starts networking and declares the node ready at 11:40:01

Apr 14 11:40:01.667852 ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 hyperkube[1505]: I0414 11:40:01.664880    1505 kubelet_node_status.go:554] "Recording event message for node" node="ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9" event="NodeReady"


So we know the answer to:

- Why does the controller manager fail to respond with a cert?

The networking wasn't up on the node. The certificate manager in client go waits a whole 15m to retry: https://github.com/openshift/kubernetes/blob/74497632f4118e7989da73de0ed6f1268aeb64ad/staging/src/k8s.io/client-go/util/certificate/certificate_manager.go#L47


Remaining questions:

- Why does the kubelet request another cert after only 5s?
- Why is the node marked as NodeReady if it doesn't have a cert? (amended slightly)

--- Additional comment from Elana Hashman on 2021-04-14 22:33:23 UTC ---

Cert request at 11:39:02 is neither accepted nor rejected:

{
            "apiVersion": "certificates.k8s.io/v1",
            "kind": "CertificateSigningRequest",
            "metadata": {
                "creationTimestamp": "2021-04-14T11:39:02Z",
                "generateName": "csr-",
                "name": "csr-6jl4f",
                "resourceVersion": "20413",
                "uid": "c470f86a-f2b4-4912-9d5f-a4b32d7965b1"
            },
            "spec": {
                "groups": [
                    "system:nodes",
                    "system:authenticated"
                ],
                "request": "blah",
                "signerName": "kubernetes.io/kubelet-serving",
                "usages": [
                    "digital signature",
                    "key encipherment",
                    "server auth"
                ],
                "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
            },
            "status": {}
        },

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/656/pull-ci-openshift-kubernetes-master-e2e-agnostic-cmd/1382280733116076032/artifacts/e2e-agnostic-cmd/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-5f895cd888-z87rf_machine-approver-controller.log

Machine-approver-controller logs say (which is running on "nodeName": "ci-op-7mr744x6-fc795-jdxcp-master-0"):

I0414 11:39:02.229047       1 controller.go:114] Reconciling CSR: csr-6jl4f
I0414 11:39:02.229676       1 csr_check.go:150] csr-6jl4f: CSR does not appear to be client csr
I0414 11:39:02.232695       1 csr_check.go:441] retrieving serving cert from ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9 (10.0.128.3:10250)
I0414 11:39:02.235536       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0414 11:39:02.235678       1 csr_check.go:191] Falling back to machine-api authorization for ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9
E0414 11:39:02.235733       1 csr_check.go:196] csr-6jl4f: Serving Cert: No target machine for node "ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
I0414 11:39:02.235757       1 controller.go:172] csr-6jl4f: CSR not authorized

Same error: why is this running into the same issue?


Second cert request succeeds at 11:54:04:

        {
            "apiVersion": "certificates.k8s.io/v1",
            "kind": "CertificateSigningRequest",
            "metadata": {
                "creationTimestamp": "2021-04-14T11:54:04Z",
...
                "username": "system:node:ci-op-7mr744x6-fc795-jdxcp-worker-d-n5qw9"
            },
            "status": {
                "certificate": "blah",
                "conditions": [
                    {
                        "lastTransitionTime": "2021-04-14T11:54:04Z",
                        "lastUpdateTime": "2021-04-14T11:54:04Z",
                        "message": "This CSR was approved by the Node CSR Approver",
                        "reason": "NodeCSRApprove",
                        "status": "True",
                        "type": "Approved"
                    }
                ]
            }
        },

--- Additional comment from Elana Hashman on 2021-04-14 23:37:47 UTC ---

No evidence of this issue occurring on the logs of any of the masters, but the machine-approver-controller is still hitting the "tls: internal error" issue when it fails to approve the cert.

Tracing the code from the logs above, it seems to try to dial back to the (known problem) kubelet that requested the cert; that seems a bit circular? https://github.com/openshift/cluster-machine-approver/blob/6768880fc1c3eabdc7b5db9987b69ce5a3f52521/pkg/controller/csr_check.go#L441-L443

--- Additional comment from Elana Hashman on 2021-04-15 00:44:14 UTC ---

Ryan thinks there's an issue where the cluster-machine-approver is on golang 1.15, but the rest of OpenShift has been updated to 1.16, resulting in some incompatibilities: https://bugzilla.redhat.com/show_bug.cgi?id=1949741

We'll see if this change helps. The timing makes sense, given when the go 1.16 bump landed.

--- Additional comment from Elana Hashman on 2021-04-15 19:07:21 UTC ---

Marking as not a blocker as this does not appear to be a regression and impact appears to be more of an issue in CI than prod due to short-lived nodes - we see that this error has been pretty prevalent and is not new, e.g. this log from Dec. 15, 2020: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/98/pull-ci-openshift-cluster-machine-approver-master-e2e-aws/1338899425333874688/artifacts/e2e-aws/gather-extra/pods/openshift-cluster-machine-approver_machine-approver-7f767df9d7-2d92x_machine-approver-controller.log

I think there are two issues here:

1. There seems to be a bug with the Kubelet's TLS cert management prior to getting a CSR approved/NodeReady logic.
2. Something changed recently, possibly coinciding with the rebase, causing the "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured" test to fail much more frequently. It's probably worth reviewing how that test is implemented.

I'll clone and send a bug to the monitoring team to investigate #2, and will continue using this to track #1.

--- Additional comment from Elana Hashman on 2021-04-16 20:22:15 UTC ---

Also proposed upstream: https://github.com/kubernetes/kubernetes/pull/101200

--- Additional comment from Clayton Coleman on 2021-04-19 13:29:37 UTC ---

Where is the follow up to prevent the racing?  If this is correct, we would be regressing startup by 15m which is unacceptable.

Reopening this - I expect to see the race condition fully debugged and a gap <2m from time master node comes up to it being ready (it's the approvers job to approve masters with low latency especially during install).

I could tolerate a known few minute regression install, but not 15m.

--- Additional comment from Clayton Coleman on 2021-04-19 13:30:31 UTC ---

Let me describe expected behavior:

1. master starts up during install
2. creates server certificate
3. machine approver accepts certificate <2-3m after machine approver has the necessary info

--- Additional comment from Clayton Coleman on 2021-04-19 13:38:18 UTC ---

INFO[2021-04-16T13:06:26Z] Step e2e-aws-serial-ipi-install-install succeeded after 37m20.041593719s.

Average startup time is about 32m on AWS, this looks like this regresses AWS by about 5m (just a guess) which would be too significant to just fix by waiting for me given its impact on every customer, CI, and services.  So I don't consider "waiting" a single solution and if this won't fix the wait (waiting better is great) but we need to spawn a separate bug to resolve the wait before this merges.

--- Additional comment from Elana Hashman on 2021-04-19 16:58:39 UTC ---

We now have an answer to:

- Why does the kubelet request another cert after only 5s?

The first cert requested is the kubelet's client cert, which succeeds. The server cert request hangs for 15m. Unfortunately, the logging doesn't differentiate between the two.


I'll look into seeing if we can retry on the failed CSR faster.

Comment 1 Michael Gugino 2021-04-19 20:36:37 UTC

*** This bug has been marked as a duplicate of bug 1940972 ***


Note You need to log in before you can comment on or make changes to this bug.