Bug 1687295
| Summary: | flake: remote error: tls: internal error on oc exec | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michal Fojtik <mfojtik> |
| Component: | apiserver-auth | Assignee: | Matt Rogers <mrogers> |
| Status: | CLOSED ERRATA | QA Contact: | Chuan Yu <chuyu> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, danw, evb, jokerman, mmccomas, mrogers, nagrawal, rphillips, sjenning, somalley |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:45:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michal Fojtik
2019-03-11 08:44:26 UTC
Looks like there is no serving cert installed for that kubelet
Mar 08 19:05:06 ip-10-0-153-248 hyperkube[4092]: I0308 19:05:06.013265 4092 log.go:172] http: TLS handshake error from 10.128.2.10:55754: no serving certificate available for the kubelet
I also see
Mar 08 19:08:25 ip-10-0-153-248 hyperkube[4092]: E0308 19:08:25.363003 4092 certificate_manager.go:386] Certificate request was not signed: timed out waiting for the condition
I find the CSR stuck in Pending:
{
  "apiVersion": "certificates.k8s.io/v1beta1",
  "kind": "CertificateSigningRequest",
  "metadata": {
    "creationTimestamp": "2019-03-08T18:59:39Z",
    "generateName": "csr-",
    "name": "csr-97n6g",
    "resourceVersion": "12289",
    "selfLink": "/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-97n6g",
    "uid": "53481e3c-41d4-11e9-a44d-0a0505ac077a"
  },
  "spec": {
    "groups": [
      "system:nodes",
      "system:authenticated"
    ],
    "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlJEQ0I3QUlCQURCS01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14TVRBdkJnTlZCQU1US0hONQpjM1JsYlRwdWIyUmxPbWx3TFRFd0xUQXRNVFV6TFRJME9DNWxZekl1YVc1MFpYSnVZV3d3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFScEZqaXZ5cndSM3Z5S01Vb3R4Y2NZZVNzYk9WbnArd1gwZ29wVVVmbFgKaFdITHEyVWc1cTNGNURIc0ZWV0NlVlE3ektsRG5wZ3M0aG1hNzRWYUtOU1hvRUF3UGdZSktvWklodmNOQVFrTwpNVEV3THpBdEJnTlZIUkVFSmpBa2doeHBjQzB4TUMwd0xURTFNeTB5TkRndVpXTXlMbWx1ZEdWeWJtRnNod1FLCkFKbjRNQW9HQ0NxR1NNNDlCQU1DQTBjQU1FUUNJQjRpWUdRNUQ0b2RHOFJVOFFxL2Y1cXNNQWkrcm9XLzVmWU4KeGxJUGpQOGZBaUE3cnh1d0NKVngrL2RsOHRQT0NmM3kzWllKQmFFSU93V0I4ZFJVd05qZ0ZRPT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUgUkVRVUVTVC0tLS0tCg==",
    "usages": [
      "digital signature",
      "key encipherment",
      "server auth"
    ],
    "username": "system:node:ip-10-0-153-248.ec2.internal"
  },
  "status": {}
}
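To check programmatically whether a CSR like the one above is still pending, a minimal Python sketch follows. It assumes the object has been fetched as JSON (for example with `oc get csr csr-97n6g -o json`) and treats a CSR as pending when its status carries no `Approved` or `Denied` condition. The helper names `is_pending`, `request_pem`, and `describe` are hypothetical, introduced only for this illustration.

```python
import base64
import json


def is_pending(csr: dict) -> bool:
    """Return True when the CSR has neither an Approved nor a Denied condition."""
    conditions = (csr.get("status") or {}).get("conditions") or []
    return not any(c.get("type") in ("Approved", "Denied") for c in conditions)


def request_pem(csr: dict) -> str:
    """Decode spec.request (base64) into the PEM text of the signing request."""
    return base64.b64decode(csr["spec"]["request"]).decode("ascii")


def describe(csr_json: str) -> str:
    """Summarize a CSR given as a JSON string: '<name>: pending' or '<name>: decided'."""
    csr = json.loads(csr_json)
    state = "pending" if is_pending(csr) else "decided"
    return f'{csr["metadata"]["name"]}: {state}'
```

Feeding the JSON of csr-97n6g to `describe` would report it as pending, since its `status` is empty. The PEM text returned by `request_pem` can be inspected further with any X.509 tool to see the requested subject and SANs.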
Seems like an issue with the machine-approver. It does seem to be up and approving other CSRs for kubelet server certs so... strange.
Additional debug information
$ grep 97n6g openshift-cluster-machine-approver_machine-approver-7df6f64fc4-jcr27_machine-approver-controller.log
I0308 18:59:39.768367 1 main.go:97] CSR csr-97n6g added
I0308 18:59:39.794830 1 main.go:123] CSR csr-97n6g not authorized: No target machine
I don't see anywhere in the e2e artifacts where the list of Machines is gathered

Has this continued to come up? Or has the issue resolved itself with recent releases?

I have not seen it recently

Closing for now. Please re-open if it comes back.

It's back/still here:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/142/pull-ci-openshift-cluster-network-operator-master-e2e-aws/950
fail [github.com/openshift/origin/test/extended/networking/multicast.go:50]: Expected success, but got an error:
<*util.ExitError | 0xc422042900>: {
Cmd: "oc exec --config=/tmp/configfile320269332 --namespace=e2e-test-multicast-wlzrp multicast-1 -- omping -c 5 -T 60 -q -q 10.129.2.33 10.129.2.35 10.128.2.52",
StdErr: "Error from server: error dialing backend: remote error: tls: internal error",
Kubelet logs show lots of
Apr 17 12:44:19 ip-10-0-143-105 hyperkube[1114]: I0417 12:44:19.149508 1114 log.go:172] http: TLS handshake error from 10.0.137.30:50458: no serving certificate available for the kubelet

More debug info from cluster-machine-approver pod logs:
I0308 18:49:20.257176 1 main.go:128] machine api not available: the server could not find the requested resource (get machines.machine.openshift.io)
and
I0308 18:59:39.768367 1 main.go:97] CSR csr-97n6g added
I0308 18:59:39.794830 1 main.go:123] CSR csr-97n6g not authorized: No target machine
and
I0308 18:59:31.464412 1 main.go:97] CSR csr-qcb78 added
I0308 18:59:31.539715 1 main.go:123] CSR csr-qcb78 not authorized: Doesn't match expected prefix
Perhaps cluster-machine-approver needs a getMachine retry if machines/machineStatus is nil. Looking into that now.
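The retry idea floated above could look roughly like the sketch below. It is written in Python purely for illustration and is not the cluster-machine-approver code; `find_machine`, `lookup`, and the `attempts`/`delay`/`sleep` parameters are hypothetical names. The assumption is that a lookup returning `None` (machine or machine status not yet populated) is worth retrying a few times before the CSR is rejected with "No target machine".

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def find_machine(
    lookup: Callable[[], Optional[T]],
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call lookup() until it returns a non-None value or attempts run out.

    The delay doubles after each failed attempt (simple exponential backoff).
    Returns None if every attempt came back empty.
    """
    for attempt in range(attempts):
        machine = lookup()
        if machine is not None:
            return machine
        # Do not sleep after the final attempt; just give up.
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return None
```

The `sleep` parameter is injectable so the backoff can be exercised without real waiting. Whether a bounded retry is the right fix depends on why the Machine list was empty in the first place, which the logs in this report do not settle.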
Talked to Sally, I think [1] may fix this issue and was committed on April 23, 2019.
[1] https://github.com/openshift/cluster-machine-approver/commit/9fa24364770520132612e74f0e05f9ce5936f4fb

(In reply to Ryan Phillips from comment #8)
> Talked to Sally, I think [1] may fix this issue and was committed on April 23, 2019.
>
> [1] https://github.com/openshift/cluster-machine-approver/commit/9fa24364770520132612e74f0e05f9ce5936f4fb
That's a fix for 1702098 specifically, but I haven't confirmed that it has fixed this flake. I have been trying to reproduce this one with debug logging but have not been able to make it show up.

Verified. Ran the e2e test locally and it passed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:0758