Description of problem:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22273/pull-ci-openshift-origin-master-e2e-aws/5359

Mar 8 19:07:03.048: INFO: Error running &{/usr/bin/oc [oc exec --config=/tmp/configfile276575367 --namespace=e2e-test-multicast-42j7x multicast-3 -- omping -c 5 -T 60 -q -q 10.129.2.18 10.129.2.26 10.128.2.19] [] Error from server: error dialing backend: remote error: tls: internal error Error from server: error dialing backend: remote error: tls: internal error [] <nil> 0xc421f0fd70 exit status 1 <nil> <nil> true [0xc4234df560 0xc4234df588 0xc4234df588] [0xc4234df560 0xc4234df588] [0xc4234df568 0xc4234df580] [0x8be420 0x8be520] 0xc421254420 <nil>}: Error from server: error dialing backend: remote error: tls: internal error

How reproducible:
Flaking in CI.

Actual results:
David suggested that this might involve kubelet trust somehow.

Expected results:

Additional info:
Looks like there is no serving cert installed for that kubelet:

Mar 08 19:05:06 ip-10-0-153-248 hyperkube[4092]: I0308 19:05:06.013265 4092 log.go:172] http: TLS handshake error from 10.128.2.10:55754: no serving certificate available for the kubelet

I also see:

Mar 08 19:08:25 ip-10-0-153-248 hyperkube[4092]: E0308 19:08:25.363003 4092 certificate_manager.go:386] Certificate request was not signed: timed out waiting for the condition

I find the CSR in pending:

{
    "apiVersion": "certificates.k8s.io/v1beta1",
    "kind": "CertificateSigningRequest",
    "metadata": {
        "creationTimestamp": "2019-03-08T18:59:39Z",
        "generateName": "csr-",
        "name": "csr-97n6g",
        "resourceVersion": "12289",
        "selfLink": "/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-97n6g",
        "uid": "53481e3c-41d4-11e9-a44d-0a0505ac077a"
    },
    "spec": {
        "groups": [
            "system:nodes",
            "system:authenticated"
        ],
        "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlJEQ0I3QUlCQURCS01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14TVRBdkJnTlZCQU1US0hONQpjM1JsYlRwdWIyUmxPbWx3TFRFd0xUQXRNVFV6TFRJME9DNWxZekl1YVc1MFpYSnVZV3d3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFScEZqaXZ5cndSM3Z5S01Vb3R4Y2NZZVNzYk9WbnArd1gwZ29wVVVmbFgKaFdITHEyVWc1cTNGNURIc0ZWV0NlVlE3ektsRG5wZ3M0aG1hNzRWYUtOU1hvRUF3UGdZSktvWklodmNOQVFrTwpNVEV3THpBdEJnTlZIUkVFSmpBa2doeHBjQzB4TUMwd0xURTFNeTB5TkRndVpXTXlMbWx1ZEdWeWJtRnNod1FLCkFKbjRNQW9HQ0NxR1NNNDlCQU1DQTBjQU1FUUNJQjRpWUdRNUQ0b2RHOFJVOFFxL2Y1cXNNQWkrcm9XLzVmWU4KeGxJUGpQOGZBaUE3cnh1d0NKVngrL2RsOHRQT0NmM3kzWllKQmFFSU93V0I4ZFJVd05qZ0ZRPT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUgUkVRVUVTVC0tLS0tCg==",
        "usages": [
            "digital signature",
            "key encipherment",
            "server auth"
        ],
        "username": "system:node:ip-10-0-153-248.ec2.internal"
    },
    "status": {}
}

Seems like an issue with the machine-approver. It does seem to be up and approving other CSRs for kubelet server certs, so... strange.
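For reference, a stuck serving-cert CSR like this can be inspected and, as a manual workaround, approved from the CLI. These are standard oc commands (the CSR name is the one from this cluster); they work around a stuck node but do not fix the approver itself:

$ oc get csr                               # look for CSRs stuck in Pending
$ oc get csr csr-97n6g -o json             # inspect the request
$ oc adm certificate approve csr-97n6g     # approve so the kubelet gets its serving cert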
Additional debug information:

$ grep 97n6g openshift-cluster-machine-approver_machine-approver-7df6f64fc4-jcr27_machine-approver-controller.log
I0308 18:59:39.768367 1 main.go:97] CSR csr-97n6g added
I0308 18:59:39.794830 1 main.go:123] CSR csr-97n6g not authorized: No target machine

I don't see anywhere in the e2e artifacts where the list of Machines is gathered.
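If anyone reproduces this on a live cluster, the Machine list that the approver compares the CSR against can be gathered with standard commands; it would be useful to add this to the e2e artifact collection:

$ oc get machines.machine.openshift.io -n openshift-machine-api -o wide
$ oc get nodes -o wide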
Has this continued to come up? Or has the issue resolved itself with recent releases?
I have not seen it recently
Closing for now. Please re-open if it comes back.
It's back/still here:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/142/pull-ci-openshift-cluster-network-operator-master-e2e-aws/950

fail [github.com/openshift/origin/test/extended/networking/multicast.go:50]: Expected success, but got an error:
    <*util.ExitError | 0xc422042900>: {
        Cmd: "oc exec --config=/tmp/configfile320269332 --namespace=e2e-test-multicast-wlzrp multicast-1 -- omping -c 5 -T 60 -q -q 10.129.2.33 10.129.2.35 10.128.2.52",
        StdErr: "Error from server: error dialing backend: remote error: tls: internal error",

Kubelet logs show lots of:

Apr 17 12:44:19 ip-10-0-143-105 hyperkube[1114]: I0417 12:44:19.149508 1114 log.go:172] http: TLS handshake error from 10.0.137.30:50458: no serving certificate available for the kubelet
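For anyone chasing this on a live 4.x cluster, the kubelet journal quoted above can be pulled without SSHing to the node (standard oc command; the node name here is from this run and may need the full internal DNS form):

$ oc adm node-logs ip-10-0-143-105.ec2.internal -u kubelet | grep "no serving certificate"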
More debug info from cluster-machine-approver pod logs:

I0308 18:49:20.257176 1 main.go:128] machine api not available: the server could not find the requested resource (get machines.machine.openshift.io)

and

I0308 18:59:39.768367 1 main.go:97] CSR csr-97n6g added
I0308 18:59:39.794830 1 main.go:123] CSR csr-97n6g not authorized: No target machine

and

I0308 18:59:31.464412 1 main.go:97] CSR csr-qcb78 added
I0308 18:59:31.539715 1 main.go:123] CSR csr-qcb78 not authorized: Doesn't match expected prefix

Perhaps cluster-machine-approver needs a getMachine retry if machines/machineStatus is nil. Looking into that now; a sketch of the idea is below.
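A minimal sketch of what that retry could look like, assuming a hypothetical getMachine helper (the names and types here are illustrative stand-ins, not the actual cluster-machine-approver code), using the apimachinery wait package:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Machine is a stand-in for the machine.openshift.io Machine object.
type Machine struct {
	Name string
}

// getMachine is a placeholder for the approver's real lookup, which can fail
// transiently ("machine api not available") or find no Machine for the node yet.
func getMachine(nodeName string) (*Machine, error) {
	return nil, fmt.Errorf("machine api not available")
}

// getMachineWithRetry polls getMachine until a Machine shows up or the timeout
// expires, instead of rejecting the CSR on the first nil/error result.
func getMachineWithRetry(nodeName string, timeout time.Duration) (*Machine, error) {
	var found *Machine
	err := wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		m, err := getMachine(nodeName)
		if err != nil || m == nil {
			return false, nil // treat as transient and keep polling
		}
		found = m
		return true, nil
	})
	if err != nil {
		return nil, fmt.Errorf("no Machine found for node %s after %v: %v", nodeName, timeout, err)
	}
	return found, nil
}

func main() {
	if _, err := getMachineWithRetry("ip-10-0-153-248.ec2.internal", 15*time.Second); err != nil {
		fmt.Println(err)
	}
}

The real change would live in the approver's authorization path; the point is just to tolerate a not-yet-registered Machine rather than failing the CSR outright.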
Talked to Sally; I think [1] may fix this issue. It was committed on April 23, 2019.

[1] https://github.com/openshift/cluster-machine-approver/commit/9fa24364770520132612e74f0e05f9ce5936f4fb
(In reply to Ryan Phillips from comment #8)
> Talked to Sally; I think [1] may fix this issue. It was committed on
> April 23, 2019.
>
> [1] https://github.com/openshift/cluster-machine-approver/commit/9fa24364770520132612e74f0e05f9ce5936f4fb

That's a fix for bug 1702098 specifically, but I haven't confirmed that it fixes this flake. I have been trying to reproduce this one with debug logging but have not been able to make it show up.
https://github.com/openshift/cluster-machine-approver/pull/22
Verified. Ran the e2e test locally and it passed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758