Bug 1571515

Summary:        [3.11] RotateKubeletServerCertificate feature gate sometimes can not return current certificate
Product:        OpenShift Container Platform
Component:      Installer
Version:        3.10.0
Target Release: 3.11.0
Hardware:       Unspecified
OS:             Unspecified
Status:         CLOSED ERRATA
Severity:       high
Priority:       high
Keywords:       Reopened
Type:           Bug
Reporter:       DeShuai Ma <dma>
Assignee:       Michael Gugino <mgugino>
QA Contact:     Johnny Liu <jialiu>
CC:             akostadi, aos-bugs, bleanhar, ccoleman, dmoessne, hongli, jialiu, jiazha, jmalde, jokerman, juriarte, mgugino, mifiedle, mmccomas, rsandu, sgaikwad, vlaad, wjiang, wmeng, xtian
Doc Type:       If docs needed, set a value
Cloned To:      1622710
Bug Blocks:     1622710
Last Closed:    2018-10-11 07:19:10 UTC
Description (DeShuai Ma, 2018-04-25 03:27:00 UTC)
It looks like the RotateKubeletServerCertificate feature gate is not always working:

Apr 25 01:27:09 ip-172-18-7-123.ec2.internal atomic-openshift-node[17928]: I0425 01:27:09.798484 17928 logs.go:49] http: TLS handshake error from 172.18.7.123:52280: no serving certificate available for the kubelet

[root@ip-172-18-7-123 ~]# openssl s_client -connect ip-172-18-7-123.ec2.internal:10250 -cert /etc/origin/master/admin.crt -key /etc/origin/master/admin.key -debug
CONNECTED(00000003)
write to 0x14a4160 [0x14b4b60] (289 bytes => 289 (0x121))
(ClientHello hex dump trimmed)
read from 0x14a4160 [0x14ba0c0] (7 bytes => 7 (0x7))
0000 - 15 03 03 00 02 02 50                              ......P
139824508856208:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:769:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 289 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1524634937
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

And all of the CSRs are stuck in Pending status:
[root@ip-172-18-7-123 ~]# oc get csr
NAME        AGE   REQUESTOR                                   CONDITION
csr-4ngmc   2h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-56r9t   1h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-5hblp   1h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-6hw44   3h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-9vzhn   3h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-9w4fq   1h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-b5q55   24m   system:node:ip-172-18-7-123.ec2.internal    Pending
csr-fhskh   37m   system:node:ip-172-18-7-123.ec2.internal    Pending
csr-h5tgj   3h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-hrhx4   2h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-j8gjw   50m   system:node:ip-172-18-7-123.ec2.internal    Pending
csr-mvxdj   2h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-npc8s   3h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-qg4gj   2h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-sd2p6   2h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-swmdg   3h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-tczzq   11m   system:node:ip-172-18-7-123.ec2.internal    Pending
csr-txmq9   1h    system:node:ip-172-18-7-123.ec2.internal    Pending
csr-v5zs6   1h    system:node:ip-172-18-7-123.ec2.internal    Pending

How was this cluster installed? Ansible has a mode that approves CSRs, or there is an option to set up an auto-approver. Was either of those options used?

(In reply to Jordan Liggitt from comment #3)
> how was this cluster installed?
Via ansible, without openshift_master_bootstrap_auto_approve=true.
> ansible has a mode that approves CSRs, or there is an option to set up an auto-approver. was either of those options used?
We set up the cluster via ansible, and once it was done we found that commands that need a protocol upgrade against a specific node did not work. We then found that the node is not serving with any certificate and all of its CSRs are pending.

If you installed with ansible, there should have been specific tasks that ran, called:
- name: Approve node certificates when bootstrapping
- name: Bootstrap all nodes that were identified with bootstrap metadata
- name: approve nodes
If you had pending CSRs post-installation and any of these tasks ran, then those CSRs should have been approved. Can you verify those ansible tasks ran?

Hit this issue again when running an installation with openshift-ansible-3.10.0-0.31.0.git.0.9f771c7.el7.noarch; from my installation log I did not see any of the above ansible tasks run:
- name: Approve node certificates when bootstrapping
- name: Bootstrap all nodes that were identified with bootstrap metadata
- name: approve nodes
I searched for those tasks in the openshift-ansible code, and it seems a fresh install would not call those tasks:

[jialiu@dhcp-141-223 openshift-ansible]$ git checkout openshift-ansible-3.10.0-0.31.0 -b openshift-ansible-3.10.0-0.31.0
Switched to a new branch 'openshift-ansible-3.10.0-0.31.0'
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "Approve node certificates when bootstrapping" -i *
roles/openshift_node/tasks/upgrade.yml:- name: Approve node certificates when bootstrapping
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "Bootstrap all nodes that were identified with bootstrap metadata" -i *
roles/openshift_gcp/tasks/configure_master_bootstrap.yml:- name: Bootstrap all nodes that were identified with bootstrap metadata
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "approve nodes" * -i
roles/openshift_aws/tasks/accept_nodes.yml:- name: approve nodes
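As a stop-gap on an affected cluster, the pending node CSRs can be approved by hand with the same oc commands shown elsewhere in this bug. A minimal sketch; the grep/awk filtering of the `oc get csr` table output is illustrative only and is not the installer's actual mechanism:

# Approve every CSR that is currently in Pending state
# (the last column of the "oc get csr" table is CONDITION).
for csr in $(oc get csr | awk '$NF == "Pending" {print $1}'); do
  oc adm certificate approve "$csr"
done

New serving CSRs may be filed again later (for example after a node restart), so the listing is worth re-checking until no Pending entries remain.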
After discussing with Jordan and Clayton, we believe the problem here is that we're not waiting for both a client and a server CSR on each node. We need to extend oc_adm_csr to support a list of expected usages and loop until all of those conditions have been fulfilled:

usages: ['client auth', 'server auth']
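For reference, a node's client and server CSRs can be told apart by the usages they request in spec.usages: client certificate requests ask for "client auth", kubelet serving certificate requests ask for "server auth". A quick way to inspect this (illustrative only; the single CSR name is taken from the listing earlier in this bug):

# List every CSR with its requestor and requested usages.
oc get csr -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.username}{"\t"}{.spec.usages}{"\n"}{end}'

# Or inspect one CSR from the listing above.
oc get csr csr-4ngmc -o jsonpath='{.spec.usages}'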
Hit this issue again with v3.10.0-0.50.0.

*** Bug 1582295 has been marked as a duplicate of this bug. ***

*** This bug has been marked as a duplicate of bug 1579267 ***

SVT hit this on multiple clusters on 3.10.15. The symptom is that at least one node is non-functional. Actions against pods running on the node (getting logs, heapster trying to get metrics, etc.) fail with:

remote error: tls: internal error

The node journal on the affected node is full of:

Jul 10 19:58:32 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:58:32.936204 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36604: no serving certificate available for the kubelet
Jul 10 19:58:34 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:58:34.680273 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36614: no serving certificate available for the kubelet
Jul 10 19:59:32 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:59:32.936305 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36776: no serving certificate available for the kubelet
Jul 10 19:59:34 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:59:34.680065 42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36786: no serving certificate available for the kubelet

It seems more prevalent now. Manually approving the pending CSRs and restarting the node seems to fix it, but I am not sure whether the node missed anything in the upgrade because of this.

We also met this bug during OpenShift v3.10.15 on OpenStack integration testing. Raising the severity to high.

Jian, what version was comment #22 using? We've fixed some problems, and we've just fixed https://bugzilla.redhat.com/show_bug.cgi?id=1594341, which is pretty similar.

Scott, the version is OpenShift v3.10.15, from the OpenStack integration testing.

We are seeing a similar CSR problem in OpenShift on OpenStack automation for 3.11. The symptom is that CNS will fail the installer. The error message looks to be heketi related and does not mention CSRs. The task that fails is:

TASK [openshift_storage_glusterfs : Load heketi topology] **********************
... (verbose lines omitted)
"stdout_lines": [
    "Creating cluster ... ID: 495b9db9d299dcf9a8d3e38db62fbb40",
    "\tAllowing file volumes on cluster.",
    "\tAllowing block volumes on cluster.",
    "\tCreating node cns-1.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running",
    "\tCreating node cns-0.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running",
    "\tCreating node cns-2.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running"
]

While trying to debug the CNS problem, we tried rsh and showing logs and got some errors:

root@master-0: ~ # oc rsh --namespace=glusterfs glusterfs-storage-btd25 /bin/bash
Error from server: error dialing backend: remote error: tls: internal error
root@master-0: ~ # oc logs sync-zx2b7 -n openshift-node
Error from server: Get https://cns-0.scale-ci.example.com:10250/containerLogs/openshift-node/sync-zx2b7/sync: remote error: tls: internal error

This led me to check the CSRs, and I saw hundreds of them in Pending state:

root@master-0: ~ # oc get csr
NAME        AGE   REQUESTOR                                        CONDITION
...
csr-c8qph   8h    system:node:infra-node-1.scale-ci.example.com    Pending
csr-c9ckt   6h    system:node:app-node-1.scale-ci.example.com      Pending
csr-cgpb6   7h    system:node:app-node-1.scale-ci.example.com      Pending
csr-ck4mk   8h    system:node:cns-1.scale-ci.example.com           Pending
csr-ckwhf   10h   system:node:cns-1.scale-ci.example.com           Pending
csr-clcjk   7h    system:node:infra-node-2.scale-ci.example.com    Pending
csr-cn4kk   1h    system:node:app-node-1.scale-ci.example.com      Pending
csr-cnkrl   14h   system:node:cns-0.scale-ci.example.com           Pending
csr-cp94d   3h    system:node:cns-0.scale-ci.example.com           Pending
csr-cspsl   7h    system:node:infra-node-2.scale-ci.example.com    Pending
csr-cw62g   9h    system:node:cns-1.scale-ci.example.com           Pending
csr-cxxqz   2h    system:node:infra-node-0.scale-ci.example.com    Pending
csr-czbd5   5h    system:node:infra-node-2.scale-ci.example.com    Pending
csr-czdnz   1h    system:node:infra-node-2.scale-ci.example.com    Pending
...

After destroying this cluster and re-running an install, I was able to manually approve all CSRs during the install and confirm that we can get past the CNS task that normally fails. This points to the pending CSRs preventing something in CNS from completing successfully. So far we can reproduce this problem 100% of the time using our automation. If anyone needs logs or wants to look at a live system, that can be arranged.

We'll refactor oc_adm_csr to check for NodeReady as a signal that the client certificate is deployed; a node cannot post Ready status if it has no client certificate. We'll use `oc get --raw /api/v1/nodes/${node}/proxy/healthz` as the signal that the server certificate has been deployed successfully. Example:

# oc get --raw /api/v1/nodes/ose3-node1.example.com/proxy/healthz
Error from server (ServiceUnavailable): the server is currently unable to handle the request
# oc get csr
NAME        AGE   REQUESTOR                             CONDITION
csr-c7j6n   14s   system:node:ose3-node1.example.com    Pending
# oc adm certificate approve csr-c7j6n
certificatesigningrequest.certificates.k8s.io "csr-c7j6n" approved
# oc get --raw /api/v1/nodes/ose3-node1.example.com/proxy/healthz
ok

We'll loop, approving all certificates for ${node}, until the two conditions above are fulfilled (a rough shell sketch of this flow is included further below).

PR in progress: https://github.com/openshift/openshift-ansible/pull/9711

Matt, since you've had the most success reproducing this problem with your CNS installs, can you please give the latest from the master branch a try and see whether the issue has been resolved?

In openshift-ansible-3.11.0-0.23.0.

Re-tested this with openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch and did not hit the issue. Moving to Verified.
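For reference, the approach described above (treat NodeReady as proof that the client certificate is in place, probe the kubelet's healthz endpoint through the API proxy as proof that the serving certificate is in place, and keep approving the node's pending CSRs until both checks pass) can be sketched in shell roughly as follows. This is an illustration of the logic only, not the actual oc_adm_csr implementation; the node name and retry count are placeholders:

node=ose3-node1.example.com   # placeholder node name

for attempt in $(seq 1 30); do
  # Approve any CSRs this node currently has pending.
  for csr in $(oc get csr | awk -v n="system:node:${node}" '$3 == n && $NF == "Pending" {print $1}'); do
    oc adm certificate approve "$csr"
  done

  # Client certificate is in place once the node reports Ready.
  ready=$(oc get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')

  # Serving certificate is in place once the kubelet healthz answers through the API proxy.
  if [ "$ready" = "True" ] && oc get --raw "/api/v1/nodes/${node}/proxy/healthz" >/dev/null 2>&1; then
    echo "${node}: client and server certificates are in place"
    break
  fi
  sleep 10
done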
Verified with 3.11.0-0.25.0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652