Description of problem:
After installing the environment, we can't exec/rsh into the control-plane pods.

Version-Release number of the following components:
openshift v3.10.0-0.28.0
kubernetes v1.10.0+b81c8f8
openshift-ansible-3.10.0-0.28.0.git.0.439cb5c.el7.noarch.rpm

How reproducible:
Sometimes

Steps to Reproduce:
1. Check the etcd version in the cluster

[root@ip-172-18-7-123 ~]# oc get po -n default
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-c2hlp    1/1       Running   0          38m
registry-console-1-5jd6l   1/1       Running   0          40m
router-1-7s4ws             1/1       Running   0          39m
router-1-dmrgl             1/1       Running   0          39m

[root@ip-172-18-7-123 ~]# oc rsh -n default docker-registry-1-c2hlp
sh-4.2$
sh-4.2$
sh-4.2$ exit
exit

[root@ip-172-18-7-123 ~]# oc get po -n kube-system
NAME                                              READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-7-123.ec2.internal           1/1       Running   0          42m
master-controllers-ip-172-18-7-123.ec2.internal   1/1       Running   0          43m
master-etcd-ip-172-18-7-123.ec2.internal          1/1       Running   0          42m

[root@ip-172-18-7-123 ~]# oc rsh -n kube-system master-api-ip-172-18-7-123.ec2.internal
Error from server: error dialing backend: remote error: tls: internal error

[root@ip-172-18-7-123 ~]# oc rsh -n kube-system master-api-ip-172-18-7-123.ec2.internal --loglevel=10
Error from server: error dialing backend: remote error: tls: internal error

[root@ip-172-18-7-123 ~]# oc rsh --loglevel=10 -n kube-system master-api-ip-172-18-7-123.ec2.internal
I0424 22:46:03.085323 13959 loader.go:357] Config loaded from file /root/.kube/config
I0424 22:46:03.086802 13959 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: oc/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8" https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal
I0424 22:46:03.100309 13959 round_trippers.go:405] GET https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal 200 OK in 13 milliseconds
I0424 22:46:03.100349 13959 round_trippers.go:411] Response Headers:
I0424 22:46:03.100361 13959 round_trippers.go:414]     Content-Length: 3253
I0424 22:46:03.100379 13959 round_trippers.go:414]     Date: Wed, 25 Apr 2018 02:46:03 GMT
I0424 22:46:03.100389 13959 round_trippers.go:414]     Cache-Control: no-store
I0424 22:46:03.100397 13959 round_trippers.go:414]     Content-Type: application/json
I0424 22:46:03.100491 13959 request.go:874] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"master-api-ip-172-18-7-123.ec2.internal","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal","uid":"ae0b61fd-482c-11e8-93fa-0e34412527e8","resourceVersion":"1721","creationTimestamp":"2018-04-25T02:02:17Z","labels":{"openshift.io/component":"api","openshift.io/control-plane":"true"},"annotations":{"kubernetes.io/config.hash":"69fd8cf417ec055a66ce2ec660ab3dcc","kubernetes.io/config.mirror":"69fd8cf417ec055a66ce2ec660ab3dcc","kubernetes.io/config.seen":"2018-04-24T21:59:29.915059507-04:00","kubernetes.io/config.source":"file"}},"spec":{"volumes":[{"name":"master-config","hostPath":{"path":"/etc/origin/master/","type":""}},{"name":"master-cloud-provider","hostPath":{"path":"/etc/origin/cloudprovider","type":""}},{"name":"master-data","hostPath":{"path":"/var/lib/origin","type":""}}],"containers":[{"name":"api","image":"registry.reg-aws.openshift.com:443/openshift3/ose-control-plane:v3.10","command":["/bin/bash","-c"],"args":["#!/bin/bash\nset -euo pipefail\nif [[ -f /etc/origin/master/master.env ]]; then\n set -o allexport\n source /etc/origin/master/master.env\nfi\nexec openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=${DEBUG_LOGLEVEL:-2}\n"],"resources":{},"volumeMounts":[{"name":"master-config","mountPath":"/etc/origin/master/"},{"name":"master-cloud-provider","mountPath":"/etc/origin/cloudprovider/"},{"name":"master-data","mountPath":"/var/lib/origin/"}],"livenessProbe":{"httpGet":{"path":"healthz","port":8443,"scheme":"HTTPS"},"initialDelaySeconds":45,"timeoutSeconds":1,"periodSeconds":10,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"httpGet":{"path":"healthz/ready","port":8443,"scheme":"HTTPS"},"initialDelaySeconds":10,"timeoutSeconds":1,"periodSeconds":10,"successThreshold":1,"failureThreshold":3},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent","securityContext":{"privileged":true}}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeName":"ip-172-18-7-123.ec2.internal","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"operator":"Exists","effect":"NoExecute"}]},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-04-25T01:59:33Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-04-25T02:03:12Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-04-25T01:59:33Z"}],"hostIP":"172.18.7.123","podIP":"172.18.7.123","startTime":"2018-04-25T01:59:33Z","containerStatuses":[{"name":"api","state":{"running":{"startedAt":"2018-04-25T02:00:47Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"registry.reg-aws.openshift.com:443/openshift3/ose-control-plane:v3.10","imageID":"docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-control-plane@sha256:1eaa6802938f335271c9835e002374d674e2fbfe46e3436bcd80d15d60b54832","containerID":"docker://4d7601e400e85e2726320cd512a41f2efaa860a8e1c40d78f9a095299ec18556"}],"qosClass":"BestEffort"}}
I0424 22:46:03.102293 13959 round_trippers.go:386] curl -k -v -XPOST -H "X-Stream-Protocol-Version: v4.channel.k8s.io" -H "X-Stream-Protocol-Version: v3.channel.k8s.io" -H "X-Stream-Protocol-Version: v2.channel.k8s.io" -H "X-Stream-Protocol-Version: channel.k8s.io" -H "User-Agent: oc/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8" https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal/exec?command=%2Fbin%2Fsh&command=-c&command=TERM%3D%22xterm%22+%2Fbin%2Fsh&container=api&container=api&stdin=true&stdout=true&tty=true
I0424 22:46:03.116992 13959 round_trippers.go:405] POST https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal/exec?command=%2Fbin%2Fsh&command=-c&command=TERM%3D%22xterm%22+%2Fbin%2Fsh&container=api&container=api&stdin=true&stdout=true&tty=true 500 Internal Server Error in 14 milliseconds
I0424 22:46:03.117036 13959 round_trippers.go:411] Response Headers:
I0424 22:46:03.117052 13959 round_trippers.go:414]     Content-Type: application/json
I0424 22:46:03.117142 13959 round_trippers.go:414]     Date: Wed, 25 Apr 2018 02:46:03 GMT
I0424 22:46:03.117159 13959 round_trippers.go:414]     Content-Length: 149
I0424 22:46:03.117171 13959 round_trippers.go:414]     Cache-Control: no-store
I0424 22:46:03.117614 13959 helpers.go:201] server response object: [{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "error dialing backend: remote error: tls: internal error",
  "code": 500
}]
F0424 22:46:03.117694 13959 helpers.go:119] Error from server: error dialing backend: remote error: tls: internal error

Actual results:
1.
[root@ip-172-18-7-123 ~]# oc -n kube-system exec master-etcd-ip-172-18-7-123.ec2.internal -- rpm -q etcd
Error from server: error dialing backend: remote error: tls: internal error

Expected results:
1. The command should succeed.

Additional info:
It looks like the RotateKubeletServerCertificate feature gate is not always working:

Apr 25 01:27:09 ip-172-18-7-123.ec2.internal atomic-openshift-node[17928]: I0425 01:27:09.798484 17928 logs.go:49] http: TLS handshake error from 172.18.7.123:52280: no serving certificate available for the kubelet

[root@ip-172-18-7-123 ~]# openssl s_client -connect ip-172-18-7-123.ec2.internal:10250 -cert /etc/origin/master/admin.crt -key /etc/origin/master/admin.key -debug
CONNECTED(00000003)
write to 0x14a4160 [0x14b4b60] (289 bytes => 289 (0x121))
0000 - 16 03 01 01 1c 01 00 01-18 03 03 f6 e5 4e be a2   .............N..
0010 - 94 8c b6 d7 06 84 31 5f-9f 40 ce 14 85 2a 1d 08   ......1_.@...*..
0020 - d3 cc 7c b0 37 04 11 b9-e6 71 a9 00 00 ac c0 30   ..|.7....q.....0
0030 - c0 2c c0 28 c0 24 c0 14-c0 0a 00 a5 00 a3 00 a1   .,.(.$..........
0040 - 00 9f 00 6b 00 6a 00 69-00 68 00 39 00 38 00 37   ...k.j.i.h.9.8.7
0050 - 00 36 00 88 00 87 00 86-00 85 c0 32 c0 2e c0 2a   .6.........2...*
0060 - c0 26 c0 0f c0 05 00 9d-00 3d 00 35 00 84 c0 2f   .&.......=.5.../
0070 - c0 2b c0 27 c0 23 c0 13-c0 09 00 a4 00 a2 00 a0   .+.'.#..........
0080 - 00 9e 00 67 00 40 00 3f-00 3e 00 33 00 32 00 31   ...g.@.?.>.3.2.1
0090 - 00 30 00 9a 00 99 00 98-00 97 00 45 00 44 00 43   .0.........E.D.C
00a0 - 00 42 c0 31 c0 2d c0 29-c0 25 c0 0e c0 04 00 9c   .B.1.-.).%......
00b0 - 00 3c 00 2f 00 96 00 41-c0 12 c0 08 00 16 00 13   .<./...A........
00c0 - 00 10 00 0d c0 0d c0 03-00 0a 00 07 c0 11 c0 07   ................
00d0 - c0 0c c0 02 00 05 00 04-00 ff 01 00 00 43 00 0b   .............C..
00e0 - 00 04 03 00 01 02 00 0a-00 0a 00 08 00 17 00 19   ................
00f0 - 00 18 00 16 00 23 00 00-00 0d 00 20 00 1e 06 01   .....#..... ....
0100 - 06 02 06 03 05 01 05 02-05 03 04 01 04 02 04 03   ................
0110 - 03 01 03 02 03 03 02 01-02 02 02 03 00 0f 00 01   ................
0120 - 01                                                .
read from 0x14a4160 [0x14ba0c0] (7 bytes => 7 (0x7))
0000 - 15 03 03 00 02 02 50                              ......P
139824508856208:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:769:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 289 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1524634937
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---
And all the CSRs are in Pending status:

[root@ip-172-18-7-123 ~]# oc get csr
NAME        AGE       REQUESTOR                                    CONDITION
csr-4ngmc   2h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-56r9t   1h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-5hblp   1h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-6hw44   3h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-9vzhn   3h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-9w4fq   1h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-b5q55   24m       system:node:ip-172-18-7-123.ec2.internal     Pending
csr-fhskh   37m       system:node:ip-172-18-7-123.ec2.internal     Pending
csr-h5tgj   3h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-hrhx4   2h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-j8gjw   50m       system:node:ip-172-18-7-123.ec2.internal     Pending
csr-mvxdj   2h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-npc8s   3h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-qg4gj   2h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-sd2p6   2h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-swmdg   3h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-tczzq   11m       system:node:ip-172-18-7-123.ec2.internal     Pending
csr-txmq9   1h        system:node:ip-172-18-7-123.ec2.internal     Pending
csr-v5zs6   1h        system:node:ip-172-18-7-123.ec2.internal     Pending
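Note: whether a pending CSR is the kubelet's serving-certificate request or its client-certificate request can be read from its usages. A hypothetical check against one of the CSRs above (field names per the certificates.k8s.io/v1beta1 API used by 3.10; verify on your cluster):

# prints the requested usages; serving CSRs include "server auth", client CSRs include "client auth"
oc get csr csr-4ngmc -o jsonpath='{.spec.usages}{"\n"}'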
How was this cluster installed? Ansible has a mode that approves CSRs, or there is an option to set up an auto-approver. Was either of those options used?
(In reply to Jordan Liggitt from comment #3)
> how was this cluster installed?

Via ansible, without openshift_master_bootstrap_auto_approve=true.

> ansible has a mode that approves CSRs, or there is an option to set up an
> auto-approver. was either of those options used?

We set the cluster up via ansible, and once it was done we found that commands that need a protocol upgrade against a specific node did not work. We then found that the node was not serving with any certificate, and all CSRs were pending.
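For reference, enabling the installer's auto-approver is done through the inventory variable named above; a minimal sketch (placement under [OSEv3:vars] is the usual convention, this was not set on the affected cluster):

[OSEv3:vars]
openshift_master_bootstrap_auto_approve=true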
If you installed with ansible, there should have been specific tasks that ran, called:

- name: Approve node certificates when bootstrapping
- name: Bootstrap all nodes that were identified with bootstrap metadata
- name: approve nodes

If you had pending CSRs post-installation and any of these tasks ran, then those CSRs should have been approved.
Can you verify those ansible tasks ran?
Hit this issue again when running an installation with openshift-ansible-3.10.0-0.31.0.git.0.9f771c7.el7.noarch. From my installation log, I did not see any of the above ansible tasks run:

- name: Approve node certificates when bootstrapping
- name: Bootstrap all nodes that were identified with bootstrap metadata
- name: approve nodes

I searched for those tasks in the openshift-ansible code; it seems a fresh install would not call them:

[jialiu@dhcp-141-223 openshift-ansible]$ git checkout openshift-ansible-3.10.0-0.31.0 -b openshift-ansible-3.10.0-0.31.0
Switched to a new branch 'openshift-ansible-3.10.0-0.31.0'
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "Approve node certificates when bootstrapping" -i *
roles/openshift_node/tasks/upgrade.yml:- name: Approve node certificates when bootstrapping
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "Bootstrap all nodes that were identified with bootstrap metadata" -i *
roles/openshift_gcp/tasks/configure_master_bootstrap.yml:- name: Bootstrap all nodes that were identified with bootstrap metadata
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "approve nodes" * -i
roles/openshift_aws/tasks/accept_nodes.yml:- name: approve nodes
After discussing with Jordan and Clayton, we believe the problem here is that we're not waiting for both a client and a server CSR on each node. We need to extend oc_adm_csr to support a list of expected usages and loop until all of those conditions have been fulfilled:

usages: ['client auth', 'server auth']
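For illustration only, a rough shell equivalent of that loop (the real change will live in the oc_adm_csr module; the node name is an example from this cluster and jq is assumed to be available):

node="ip-172-18-7-123.ec2.internal"
for usage in "client auth" "server auth"; do
  # keep approving this node's pending CSRs until one CSR with the given usage is approved
  until oc get csr -o json | jq -e --arg n "system:node:${node}" --arg u "${usage}" \
        '.items[] | select(.spec.username == $n)
                  | select(.spec.usages | index($u))
                  | select([.status.conditions[]?.type] | index("Approved"))' > /dev/null; do
    oc get csr -o json | jq -r --arg n "system:node:${node}" \
        '.items[] | select(.spec.username == $n) | .metadata.name' \
      | xargs -r oc adm certificate approve
    sleep 5
  done
done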
Hit this issue again with v3.10.0-0.50.0.
*** Bug 1582295 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1579267 ***
SVT hit this on multiple clusters on 3.10.15. The symptom is that at least one node is non-functional. Actions against pods running on that node (getting logs, heapster trying to get metrics, etc.) get:

remote error: tls: internal error

The node journal on the affected node is full of:

Jul 10 19:58:32 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:58:32.936204 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36604: no serving certificate available for the kubelet
Jul 10 19:58:34 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:58:34.680273 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36614: no serving certificate available for the kubelet
Jul 10 19:59:32 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:59:32.936305 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36776: no serving certificate available for the kubelet
Jul 10 19:59:34 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:59:34.680065 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36786: no serving certificate available for the kubelet

It seems more prevalent now. Manually approving the pending CSRs and restarting the node seems to fix it, but we're not sure whether the node missed anything in the upgrade because of this.
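One way to script the manual workaround described above (the exact commands are an illustration, not taken from this report; the go-template approves only CSRs with an empty status, i.e. pending ones):

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs -r oc adm certificate approve
systemctl restart atomic-openshift-node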
Also hit this bug when doing OpenShift v3.10.15 integration testing on OpenStack. Increased the log level to a higher verbosity.
Jian, what version was comment #22 using? We've fixed some problems, and we just fixed https://bugzilla.redhat.com/show_bug.cgi?id=1594341, which is pretty similar.
Scott, the version is openshift v3.10.15, from the OpenStack integration testing.
We are seeing a similar CSR problem in the OpenShift on OpenStack automation for 3.11. The symptom is that CNS fails the installer. The error message looks heketi-related and does not mention CSRs. The task that fails is:

TASK [openshift_storage_glusterfs : Load heketi topology] **********************
... (verbose lines omitted)
    "stdout_lines": [
        "Creating cluster ... ID: 495b9db9d299dcf9a8d3e38db62fbb40",
        "\tAllowing file volumes on cluster.",
        "\tAllowing block volumes on cluster.",
        "\tCreating node cns-1.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running",
        "\tCreating node cns-0.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running",
        "\tCreating node cns-2.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running"
    ]

While trying to debug the CNS problem we tried rsh and showing logs and got errors:

root@master-0: ~ # oc rsh --namespace=glusterfs glusterfs-storage-btd25 /bin/bash
Error from server: error dialing backend: remote error: tls: internal error
root@master-0: ~ # oc logs sync-zx2b7 -n openshift-node
Error from server: Get https://cns-0.scale-ci.example.com:10250/containerLogs/openshift-node/sync-zx2b7/sync: remote error: tls: internal error

This led me to check the CSRs, and I saw hundreds of them in Pending state:

root@master-0: ~ # oc get csr
NAME        AGE       REQUESTOR                                        CONDITION
...
csr-c8qph   8h        system:node:infra-node-1.scale-ci.example.com    Pending
csr-c9ckt   6h        system:node:app-node-1.scale-ci.example.com      Pending
csr-cgpb6   7h        system:node:app-node-1.scale-ci.example.com      Pending
csr-ck4mk   8h        system:node:cns-1.scale-ci.example.com           Pending
csr-ckwhf   10h       system:node:cns-1.scale-ci.example.com           Pending
csr-clcjk   7h        system:node:infra-node-2.scale-ci.example.com    Pending
csr-cn4kk   1h        system:node:app-node-1.scale-ci.example.com      Pending
csr-cnkrl   14h       system:node:cns-0.scale-ci.example.com           Pending
csr-cp94d   3h        system:node:cns-0.scale-ci.example.com           Pending
csr-cspsl   7h        system:node:infra-node-2.scale-ci.example.com    Pending
csr-cw62g   9h        system:node:cns-1.scale-ci.example.com           Pending
csr-cxxqz   2h        system:node:infra-node-0.scale-ci.example.com    Pending
csr-czbd5   5h        system:node:infra-node-2.scale-ci.example.com    Pending
csr-czdnz   1h        system:node:infra-node-2.scale-ci.example.com    Pending
...

After destroying this cluster and re-running the install, I was able to manually approve all CSRs during the install and confirm that we get past the CNS task that normally fails. This points to the pending CSRs preventing something in CNS from completing successfully.

So far we can reproduce this problem 100% of the time using our automation. If anyone needs logs or wants to look at a live system, that can be arranged.
We'll refactor oc_adm_csr to check for NodeReady as the signal that the client certificate has been deployed; a node can't post Ready status if it has no client certificate. We'll use `oc get --raw /api/v1/nodes/${node}/proxy/healthz` as the signal that the server certificate has been deployed successfully.

Example:

# oc get --raw /api/v1/nodes/ose3-node1.example.com/proxy/healthz
Error from server (ServiceUnavailable): the server is currently unable to handle the request

# oc get csr
NAME        AGE       REQUESTOR                             CONDITION
csr-c7j6n   14s       system:node:ose3-node1.example.com    Pending

# oc adm certificate approve csr-c7j6n
certificatesigningrequest.certificates.k8s.io "csr-c7j6n" approved

# oc get --raw /api/v1/nodes/ose3-node1.example.com/proxy/healthz
ok

We'll loop, approving all certificates for ${node}, until the two conditions above are fulfilled.
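A rough shell sketch of that loop, for illustration only (the actual implementation is in the oc_adm_csr Ansible module; the node name is reused from the example above):

node="ose3-node1.example.com"
until [ "$(oc get node "${node}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ] \
      && oc get --raw "/api/v1/nodes/${node}/proxy/healthz" >/dev/null 2>&1; do
  # approve any still-pending CSRs submitted by this node, then re-check both signals
  oc get csr | awk -v n="system:node:${node}" '$3 == n && $4 == "Pending" {print $1}' \
    | xargs -r oc adm certificate approve
  sleep 5
done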
PR in progress: https://github.com/openshift/openshift-ansible/pull/9711
Matt, since you've had the most success reproducing this problem with your CNS installs, can you please give the latest from the master branch a try and see if the issue has been resolved?
In openshift-ansible-3.11.0-0.23.0
Re-tested this with openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch and did not hit the issue. Moving to Verified.
Verified with 3.11.0-0.25.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652