Bug 1733331 - logs, rsh and exec give remote error tls internal error
Summary: logs, rsh and exec give remote error tls internal error
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Seth Jennings
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-25 17:53 UTC by Steven Walter
Modified: 2023-12-15 16:39 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-29 19:07:49 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Knowledge Base (Solution) 4307511: "Cannot see logs in console and oc logs, oc exec, etc give tls internal server error" (last updated 2019-07-26 21:03:51 UTC)

Description Steven Walter 2019-07-25 17:53:19 UTC
Description of problem:
Anything that requires WebSockets fails with an internal server error.

Version-Release number of selected component (if applicable):
4.1.4

How reproducible:

Seen in two unrelated clusters; no reproduction steps are known.


Actual results:
In the web console:

WebSocket connection to 'wss://console-openshift-console.apps.example.com/api/kubernetes/api/v1/namespaces/openshift-console/pods/console-79b6c7bb87-gt2ck/log?container=console&follow=true&tailLines=1000&x-csrf-token=ESx4l2bhkAyUQ8nx9f0%2FmA3qThlJEI6IOptYX2N%2FSPBDwcQuQ1K91DDjT0I3J99QYF4rogNwgleVtq6FV%2BkL7Q%3D%3D' failed: Error during WebSocket handshake: Unexpected response code: 500


oc logs console-79b6c7bb87-gt2ck

Error from server: Get https://master0.example.com:10250/containerLogs/openshift-console/console-79b6c7bb87-gt2ck/console: remote error: tls: internal error
The must-gather command fails as well:

oc adm must-gather
namespace/openshift-must-gather-pp7qd created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-zhlt9 created
container logs unavailable: Get https://master0.example.com:10250/containerLogs/openshift-must-gather-pp7qd/must-gather-4zsd8/gather?follow=true: remote error: tls: internal error


Expected results:
No error

Additional info:

Comment 1 Steven Walter 2019-07-25 17:55:11 UTC
https://tools.ietf.org/html/rfc5246#page-33

Comment 2 Eric Rich 2019-07-25 18:00:46 UTC
(In reply to Steven Walter from comment #1)
> https://tools.ietf.org/html/rfc5246#page-33

This is a TLS spec issue (and something the kubelet is not handling properly, or at the very least should be logging an error about). I don't see any logs from kubelet.service when these issues happen.
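
Illustrative checks (an assumption, not from this report): kubelet-side evidence, if any, would normally land in the node journal, and the failing handshake can be reproduced directly against port 10250. The hostname master0.example.com is taken from the error messages above.

$ journalctl -u kubelet --since "1 hour ago" | grep -iE 'tls|certificate|csr'
$ echo | openssl s_client -connect master0.example.com:10250 2>&1 | grep -iE 'error|handshake|certificate'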

Comment 3 Eric Rich 2019-07-25 18:05:47 UTC
I also see this with 4.1.6

Example of the issues: 

$ oc get pods -n openshift-marketplace
NAME                                   READY   STATUS    RESTARTS   AGE
marketplace-operator-768b99959-9pftm   1/1     Running   1          128m

$ oc rsh marketplace-operator-768b99959-9pftm -n openshift-marketplace
Error from server (NotFound): pods "marketplace-operator-768b99959-9pftm" not found

$ oc exec marketplace-operator-768b99959-9pftm -n openshift-marketplace -- echo foo
Error from server: error dialing backend: remote error: tls: internal error

$ oc logs marketplace-operator-768b99959-9pftm -n openshift-marketplace
Error from server: Get https://master:10250/containerLogs/openshift-marketplace/marketplace-operator-768b99959-9pftm/marketplace-operator: remote error: tls: internal error

--- 
$ sudo crictl ps | grep kube-api
239ec13eeaf4e       beaf65fce4dc16947c5bd5d1ca7e16313234c393e8ca1c4251ac9b85094972bb   About an hour ago   Running             kube-apiserver-operator                   3                   bd197ceb6f882
6f2bdcab072ca       beaf65fce4dc16947c5bd5d1ca7e16313234c393e8ca1c4251ac9b85094972bb   About an hour ago   Running             kube-apiserver-cert-syncer-8              1                   6938a6ebc2c3d
e6b9db2994d07       0d8dcfc307048a0f0400e644fcd1c9929018103b15d0f9b23b4841f1e71937bc   About an hour ago   Running             kube-apiserver-8                          1                   6938a6ebc2c3d

sudo crictl logs e6b9db2994d07
...
E0725 17:38:54.707552       1 status.go:64] apiserver received an error that is not an metav1.Status: &url.Error{Op:"Get", URL:"https://master:10250/containerLogs/openshift-kube-apiserver/kube-apiserver-master/kube-apiserver-8", Err:(*net.OpError)(0xc01ec89270)}
...
> No other relevant logs outside of entries like this.

Comment 4 Steven Walter 2019-07-26 14:22:57 UTC
This is fixed by approving pending CSRs. I wrote a solution to document this: https://access.redhat.com/solutions/4307511
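
For readers hitting the same thing, a minimal sketch of that workaround (the linked solution is the authoritative write-up): list the certificate signing requests, then approve the pending ones, after which the kubelet can present a valid serving certificate on port 10250.

$ oc get csr
$ oc adm certificate approve <csr_name>
# or approve everything currently listed (use with care):
$ oc get csr -o name | xargs oc adm certificate approve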

Comment 5 Eric Rich 2019-07-26 21:41:53 UTC
(In reply to Steven Walter from comment #4)
> This is fixed by approving pending CSRs. I wrote a solution to document
> this: https://access.redhat.com/solutions/4307511

We need better error handling, or some clue that the issue hit here comes down to a lack of approved certificates.

Comment 6 David Eads 2019-07-29 12:25:06 UTC
The API server doesn't have an opinion on CSRs. In general, errors about kubelet certificates cannot be fixed by the user receiving the message. It seems there needs to be better feedback about the relative health of kubelets and the state of their credentials. I would expect this to come either from something that manages kubelets (this comes up a lot, so perhaps something should be built) or from the agent handling CSR approval when it decides not to approve open ones.

Reassigning to node team to get it closer to a solution.

Comment 7 Seth Jennings 2019-07-29 19:07:49 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1733331#c4
https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#installation-approve-csrs_installing-bare-metal

This is not a bug. For UPI installs, the customer must provide the method for approving kubelet serving CSRs (client CSRs are approved by the kube-controller-manager).

An agent on the node that monitored the status of the kubelet serving certificate and surfaced it to the cluster admin in some way would be an RFE.
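
As a rough stand-in until such an agent exists, an illustrative admin-side check (not an existing component) is to watch for CSRs stuck in Pending, since a long-pending kubelet serving CSR is exactly what produces the tls: internal error above:

$ oc get csr | grep -w Pending
# serving CSRs typically show a requestor of the form system:node:<nodename>; approve as in comment 4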

