Bug 1687295 - flake: remote error: tls: internal error on oc exec
Summary: flake: remote error: tls: internal error on oc exec
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Auth
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Matt Rogers
QA Contact: Chuan Yu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-11 08:44 UTC by Michal Fojtik
Modified: 2019-06-04 10:45 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:45:33 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:45:41 UTC

Description Michal Fojtik 2019-03-11 08:44:26 UTC
Description of problem:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22273/pull-ci-openshift-origin-master-e2e-aws/5359

Mar  8 19:07:03.048: INFO: Error running &{/usr/bin/oc [oc exec --config=/tmp/configfile276575367 --namespace=e2e-test-multicast-42j7x multicast-3 -- omping -c 5 -T 60 -q -q 10.129.2.18 10.129.2.26 10.128.2.19] []   Error from server: error dialing backend: remote error: tls: internal error
 Error from server: error dialing backend: remote error: tls: internal error
 [] <nil> 0xc421f0fd70 exit status 1 <nil> <nil> true [0xc4234df560 0xc4234df588 0xc4234df588] [0xc4234df560 0xc4234df588] [0xc4234df568 0xc4234df580] [0x8be420 0x8be520] 0xc421254420 <nil>}:
Error from server: error dialing backend: remote error: tls: internal error

How reproducible:

Flaking in CI.


Actual results:

David suggested that this might involve kubelet trust somehow.

Expected results:


Additional info:

Comment 1 Seth Jennings 2019-03-20 16:16:04 UTC
Looks like there is no serving cert installed for that kubelet

Mar 08 19:05:06 ip-10-0-153-248 hyperkube[4092]: I0308 19:05:06.013265    4092 log.go:172] http: TLS handshake error from 10.128.2.10:55754: no serving certificate available for the kubelet

I also see 

Mar 08 19:08:25 ip-10-0-153-248 hyperkube[4092]: E0308 19:08:25.363003    4092 certificate_manager.go:386] Certificate request was not signed: timed out waiting for the condition

I found the CSR stuck in Pending:

        {
            "apiVersion": "certificates.k8s.io/v1beta1",
            "kind": "CertificateSigningRequest",
            "metadata": {
                "creationTimestamp": "2019-03-08T18:59:39Z",
                "generateName": "csr-",
                "name": "csr-97n6g",
                "resourceVersion": "12289",
                "selfLink": "/apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-97n6g",
                "uid": "53481e3c-41d4-11e9-a44d-0a0505ac077a"
            },
            "spec": {
                "groups": [
                    "system:nodes",
                    "system:authenticated"
                ],
                "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlJEQ0I3QUlCQURCS01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14TVRBdkJnTlZCQU1US0hONQpjM1JsYlRwdWIyUmxPbWx3TFRFd0xUQXRNVFV6TFRJME9DNWxZekl1YVc1MFpYSnVZV3d3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFScEZqaXZ5cndSM3Z5S01Vb3R4Y2NZZVNzYk9WbnArd1gwZ29wVVVmbFgKaFdITHEyVWc1cTNGNURIc0ZWV0NlVlE3ektsRG5wZ3M0aG1hNzRWYUtOU1hvRUF3UGdZSktvWklodmNOQVFrTwpNVEV3THpBdEJnTlZIUkVFSmpBa2doeHBjQzB4TUMwd0xURTFNeTB5TkRndVpXTXlMbWx1ZEdWeWJtRnNod1FLCkFKbjRNQW9HQ0NxR1NNNDlCQU1DQTBjQU1FUUNJQjRpWUdRNUQ0b2RHOFJVOFFxL2Y1cXNNQWkrcm9XLzVmWU4KeGxJUGpQOGZBaUE3cnh1d0NKVngrL2RsOHRQT0NmM3kzWllKQmFFSU93V0I4ZFJVd05qZ0ZRPT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUgUkVRVUVTVC0tLS0tCg==",
                "usages": [
                    "digital signature",
                    "key encipherment",
                    "server auth"
                ],
                "username": "system:node:ip-10-0-153-248.ec2.internal"
            },
            "status": {}
        },

Seems like an issue with the machine-approver.  It does seem to be up and approving other CSRs for kubelet server certs so... strange.
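
As a side note for anyone debugging this class of failure: the spec.request field above is just a base64-encoded PEM certificate request, so it can be decoded offline to see which node the kubelet was asking a serving cert for. A minimal Python sketch (the request value is copied verbatim from the CSR above; the decoding steps are a convenience, not part of the approver's logic):

```python
import base64

# spec.request from csr-97n6g above, copied verbatim (a base64-encoded PEM CSR)
req_b64 = (
    "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K"
    "TUlJQlJEQ0I3QUlCQURCS01SVXdFd1lEVlFRS0V3eHplWE4wWlcwNmJtOWtaWE14TVRBdkJnTlZCQU1US0hONQpj"
    "M1JsYlRwdWIyUmxPbWx3TFRFd0xUQXRNVFV6TFRJME9DNWxZekl1YVc1MFpYSnVZV3d3V1RBVEJnY3Foa2pPClBR"
    "SUJCZ2dxaGtqT1BRTUJCd05DQUFScEZqaXZ5cndSM3Z5S01Vb3R4Y2NZZVNzYk9WbnArd1gwZ29wVVVmbFgKaFdI"
    "THEyVWc1cTNGNURIc0ZWV0NlVlE3ektsRG5wZ3M0aG1hNzRWYUtOU1hvRUF3UGdZSktvWklodmNOQVFrTwpNVEV3"
    "THpBdEJnTlZIUkVFSmpBa2doeHBjQzB4TUMwd0xURTFNeTB5TkRndVpXTXlMbWx1ZEdWeWJtRnNod1FLCkFKbjRN"
    "QW9HQ0NxR1NNNDlCQU1DQTBjQU1FUUNJQjRpWUdRNUQ0b2RHOFJVOFFxL2Y1cXNNQWkrcm9XLzVmWU4KeGxJUGpQ"
    "OGZBaUE3cnh1d0NKVngrL2RsOHRQT0NmM3kzWllKQmFFSU93V0I4ZFJVd05qZ0ZRPT0K"
    "LS0tLS1FTkQgQ0VSVElGSUNBVEUgUkVRVUVTVC0tLS0tCg=="
)

pem = base64.b64decode(req_b64).decode()
print(pem.splitlines()[0])  # PEM header of the embedded CSR

# The DER body carries the subject/SAN strings as plain ASCII, so even without
# an ASN.1 parser you can confirm which node requested the cert:
der = base64.b64decode("".join(pem.strip().splitlines()[1:-1]))
print(b"ip-10-0-153-248.ec2.internal" in der)
```

The same PEM can also be fed to `openssl req -noout -text` for the full subject and SANs.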

Comment 2 Seth Jennings 2019-03-20 16:20:15 UTC
Additional debug information

$ grep 97n6g openshift-cluster-machine-approver_machine-approver-7df6f64fc4-jcr27_machine-approver-controller.log 
I0308 18:59:39.768367       1 main.go:97] CSR csr-97n6g added
I0308 18:59:39.794830       1 main.go:123] CSR csr-97n6g not authorized: No target machine

I don't see anywhere in the e2e artifacts where the list of Machines is gathered.
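
For context, the check behind "not authorized: No target machine" boils down to: strip the system:node: prefix from the CSR's username, then look for a Machine matching that node; if none is found, the CSR is left in Pending. A simplified sketch of that decision (Python for brevity; the real controller is Go, and the names here are illustrative, not the actual code):

```python
def authorize_serving_csr(csr_username, machine_node_names):
    """Mimic the approver's decision: a kubelet serving CSR is only
    approved when a Machine exists for the requesting node."""
    prefix = "system:node:"
    if not csr_username.startswith(prefix):
        return "Doesn't match expected prefix"
    node_name = csr_username[len(prefix):]
    if node_name not in machine_node_names:
        return "No target machine"  # the path csr-97n6g hit
    return "approved"

# Machine list that is missing the node from the CSR above:
machines = {"ip-10-0-143-105.ec2.internal"}
print(authorize_serving_csr("system:node:ip-10-0-153-248.ec2.internal", machines))
# prints: No target machine
```

Since the approver never denies the CSR, a node in this state keeps retrying and logging "no serving certificate available for the kubelet" until a matching Machine shows up.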

Comment 3 Erica von Buelow 2019-04-08 14:59:13 UTC
Has this continued to come up? Or has the issue resolved itself with recent releases?

Comment 4 Seth Jennings 2019-04-08 15:29:38 UTC
I have not seen it recently

Comment 5 Erica von Buelow 2019-04-08 15:44:35 UTC
Closing for now. Please re-open if it comes back.

Comment 6 Dan Winship 2019-04-17 13:38:05 UTC
It's back/still here: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/142/pull-ci-openshift-cluster-network-operator-master-e2e-aws/950

    fail [github.com/openshift/origin/test/extended/networking/multicast.go:50]: Expected success, but got an error:
        <*util.ExitError | 0xc422042900>: {
            Cmd: "oc exec --config=/tmp/configfile320269332 --namespace=e2e-test-multicast-wlzrp multicast-1 -- omping -c 5 -T 60 -q -q 10.129.2.33 10.129.2.35 10.128.2.52",
            StdErr: "Error from server: error dialing backend: remote error: tls: internal error",

Kubelet logs show lots of

    Apr 17 12:44:19 ip-10-0-143-105 hyperkube[1114]: I0417 12:44:19.149508    1114 log.go:172] http: TLS handshake error from 10.0.137.30:50458: no serving certificate available for the kubelet

Comment 7 Sally 2019-04-22 14:53:32 UTC
More debug info from cluster-machine-approver pod logs:
I0308 18:49:20.257176       1 main.go:128] machine api not available: the server could not find the requested resource (get machines.machine.openshift.io)
and
I0308 18:59:39.768367       1 main.go:97] CSR csr-97n6g added
I0308 18:59:39.794830       1 main.go:123] CSR csr-97n6g not authorized: No target machine
and
I0308 18:59:31.464412       1 main.go:97] CSR csr-qcb78 added
I0308 18:59:31.539715       1 main.go:123] CSR csr-qcb78 not authorized: Doesn't match expected prefix

Perhaps cluster-machine-approver needs a getMachine retry if machines/machineStatus is nil.  Looking into that now.
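
A hedged sketch of what such a retry could look like (hypothetical; the eventual fix in cluster-machine-approver may well be structured differently): keep re-fetching the Machine list with backoff while the machine API reports not-available, instead of immediately treating the CSR as unauthorized:

```python
import time

def get_machines_with_retry(fetch, retries=5, delay=0.5):
    """Hypothetical retry loop: if the machine API is not available yet
    (as in 'machine api not available' above), back off and retry
    rather than failing the lookup outright."""
    last_err = None
    for attempt in range(retries):
        try:
            machines = fetch()
            if machines is not None:
                return machines
        except RuntimeError as err:  # stand-in for "server could not find the resource"
            last_err = err
        time.sleep(delay * (2 ** attempt))
    raise RuntimeError(f"machine api still unavailable after {retries} attempts: {last_err}")

# Simulate an API that only comes up on the third call:
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("the server could not find the requested resource")
    return ["ip-10-0-153-248.ec2.internal"]

print(get_machines_with_retry(flaky_fetch, delay=0.01))
# prints: ['ip-10-0-153-248.ec2.internal']
```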

Comment 8 Ryan Phillips 2019-04-26 17:06:40 UTC
Talked to Sally, I think [1] may fix this issue and was committed on April 23, 2019.

[1] https://github.com/openshift/cluster-machine-approver/commit/9fa24364770520132612e74f0e05f9ce5936f4fb

Comment 9 Matt Rogers 2019-04-26 18:10:59 UTC
(In reply to Ryan Phillips from comment #8)
> Talked to Sally, I think [1] may fix this issue and was committed on April
> 23, 2019.
> 
> [1]
> https://github.com/openshift/cluster-machine-approver/commit/
> 9fa24364770520132612e74f0e05f9ce5936f4fb

That's a fix for bug 1702098 specifically, but I haven't confirmed that it fixes this flake. I've been trying to reproduce it with debug logging, but haven't been able to make it show up.

Comment 12 Chuan Yu 2019-04-30 02:36:01 UTC
Verified.

Ran the e2e test locally and it passed.

Comment 14 errata-xmlrpc 2019-06-04 10:45:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

