Bug 1571515 - [3.11] RotateKubeletServerCertificate feature gate sometimes can not return current certificate
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Michael Gugino
QA Contact: Johnny Liu
URL:
Whiteboard:
Duplicates: 1582295
Depends On:
Blocks: 1622710
Reported: 2018-04-25 03:27 UTC by DeShuai Ma
Modified: 2018-11-06 15:00 UTC
CC: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1622710
Environment:
Last Closed: 2018-10-11 07:19:10 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2652 None None None 2018-10-11 07:19:39 UTC

Description DeShuai Ma 2018-04-25 03:27:00 UTC
Description of problem:
After installing the environment, `oc exec`/`oc rsh` to the control-plane pods fails.

Version-Release number of the following components:
openshift v3.10.0-0.28.0
kubernetes v1.10.0+b81c8f8
openshift-ansible-3.10.0-0.28.0.git.0.439cb5c.el7.noarch.rpm

How reproducible:
Sometimes

Steps to Reproduce:
1. Check the etcd version in the cluster
[root@ip-172-18-7-123 ~]# oc get po -n default
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-c2hlp    1/1       Running   0          38m
registry-console-1-5jd6l   1/1       Running   0          40m
router-1-7s4ws             1/1       Running   0          39m
router-1-dmrgl             1/1       Running   0          39m
[root@ip-172-18-7-123 ~]# oc rsh -n default docker-registry-1-c2hlp
sh-4.2$ 
sh-4.2$ 
sh-4.2$ exit
exit
[root@ip-172-18-7-123 ~]# oc get po -n kube-system
NAME                                              READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-7-123.ec2.internal           1/1       Running   0          42m
master-controllers-ip-172-18-7-123.ec2.internal   1/1       Running   0          43m
master-etcd-ip-172-18-7-123.ec2.internal          1/1       Running   0          42m
[root@ip-172-18-7-123 ~]# oc rsh -n kube-system master-api-ip-172-18-7-123.ec2.internal
Error from server: error dialing backend: remote error: tls: internal error
[root@ip-172-18-7-123 ~]# 
[root@ip-172-18-7-123 ~]# oc rsh -n kube-system master-api-ip-172-18-7-123.ec2.internal --loglevel=10
Error from server: error dialing backend: remote error: tls: internal error
[root@ip-172-18-7-123 ~]# oc rsh --loglevel=10 -n kube-system master-api-ip-172-18-7-123.ec2.internal
I0424 22:46:03.085323   13959 loader.go:357] Config loaded from file /root/.kube/config
I0424 22:46:03.086802   13959 round_trippers.go:386] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: oc/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8" https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal
I0424 22:46:03.100309   13959 round_trippers.go:405] GET https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal 200 OK in 13 milliseconds
I0424 22:46:03.100349   13959 round_trippers.go:411] Response Headers:
I0424 22:46:03.100361   13959 round_trippers.go:414]     Content-Length: 3253
I0424 22:46:03.100379   13959 round_trippers.go:414]     Date: Wed, 25 Apr 2018 02:46:03 GMT
I0424 22:46:03.100389   13959 round_trippers.go:414]     Cache-Control: no-store
I0424 22:46:03.100397   13959 round_trippers.go:414]     Content-Type: application/json
I0424 22:46:03.100491   13959 request.go:874] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"master-api-ip-172-18-7-123.ec2.internal","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal","uid":"ae0b61fd-482c-11e8-93fa-0e34412527e8","resourceVersion":"1721","creationTimestamp":"2018-04-25T02:02:17Z","labels":{"openshift.io/component":"api","openshift.io/control-plane":"true"},"annotations":{"kubernetes.io/config.hash":"69fd8cf417ec055a66ce2ec660ab3dcc","kubernetes.io/config.mirror":"69fd8cf417ec055a66ce2ec660ab3dcc","kubernetes.io/config.seen":"2018-04-24T21:59:29.915059507-04:00","kubernetes.io/config.source":"file"}},"spec":{"volumes":[{"name":"master-config","hostPath":{"path":"/etc/origin/master/","type":""}},{"name":"master-cloud-provider","hostPath":{"path":"/etc/origin/cloudprovider","type":""}},{"name":"master-data","hostPath":{"path":"/var/lib/origin","type":""}}],"containers":[{"name":"api","image":"registry.reg-aws.openshift.com:443/openshift3/ose-control-plane:v3.10","command":["/bin/bash","-c"],"args":["#!/bin/bash\nset -euo pipefail\nif [[ -f /etc/origin/master/master.env ]]; then\n  set -o allexport\n  source /etc/origin/master/master.env\nfi\nexec openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=${DEBUG_LOGLEVEL:-2}\n"],"resources":{},"volumeMounts":[{"name":"master-config","mountPath":"/etc/origin/master/"},{"name":"master-cloud-provider","mountPath":"/etc/origin/cloudprovider/"},{"name":"master-data","mountPath":"/var/lib/origin/"}],"livenessProbe":{"httpGet":{"path":"healthz","port":8443,"scheme":"HTTPS"},"initialDelaySeconds":45,"timeoutSeconds":1,"periodSeconds":10,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"httpGet":{"path":"healthz/ready","port":8443,"scheme":"HTTPS"},"initialDelaySeconds":10,"timeoutSeconds":1,"periodSeconds":10,"successThreshold":1,"failureThreshold":3},"terminationMessagePath":"/dev/terminat
ion-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent","securityContext":{"privileged":true}}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeName":"ip-172-18-7-123.ec2.internal","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"operator":"Exists","effect":"NoExecute"}]},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-04-25T01:59:33Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-04-25T02:03:12Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-04-25T01:59:33Z"}],"hostIP":"172.18.7.123","podIP":"172.18.7.123","startTime":"2018-04-25T01:59:33Z","containerStatuses":[{"name":"api","state":{"running":{"startedAt":"2018-04-25T02:00:47Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"registry.reg-aws.openshift.com:443/openshift3/ose-control-plane:v3.10","imageID":"docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-control-plane@sha256:1eaa6802938f335271c9835e002374d674e2fbfe46e3436bcd80d15d60b54832","containerID":"docker://4d7601e400e85e2726320cd512a41f2efaa860a8e1c40d78f9a095299ec18556"}],"qosClass":"BestEffort"}}
I0424 22:46:03.102293   13959 round_trippers.go:386] curl -k -v -XPOST  -H "X-Stream-Protocol-Version: v4.channel.k8s.io" -H "X-Stream-Protocol-Version: v3.channel.k8s.io" -H "X-Stream-Protocol-Version: v2.channel.k8s.io" -H "X-Stream-Protocol-Version: channel.k8s.io" -H "User-Agent: oc/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8" https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal/exec?command=%2Fbin%2Fsh&command=-c&command=TERM%3D%22xterm%22+%2Fbin%2Fsh&container=api&container=api&stdin=true&stdout=true&tty=true
I0424 22:46:03.116992   13959 round_trippers.go:405] POST https://ip-172-18-7-123.ec2.internal:8443/api/v1/namespaces/kube-system/pods/master-api-ip-172-18-7-123.ec2.internal/exec?command=%2Fbin%2Fsh&command=-c&command=TERM%3D%22xterm%22+%2Fbin%2Fsh&container=api&container=api&stdin=true&stdout=true&tty=true 500 Internal Server Error in 14 milliseconds
I0424 22:46:03.117036   13959 round_trippers.go:411] Response Headers:
I0424 22:46:03.117052   13959 round_trippers.go:414]     Content-Type: application/json
I0424 22:46:03.117142   13959 round_trippers.go:414]     Date: Wed, 25 Apr 2018 02:46:03 GMT
I0424 22:46:03.117159   13959 round_trippers.go:414]     Content-Length: 149
I0424 22:46:03.117171   13959 round_trippers.go:414]     Cache-Control: no-store
I0424 22:46:03.117614   13959 helpers.go:201] server response object: [{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "error dialing backend: remote error: tls: internal error",
  "code": 500
}]
F0424 22:46:03.117694   13959 helpers.go:119] Error from server: error dialing backend: remote error: tls: internal error

Actual results:
1. [root@ip-172-18-7-123 ~]# oc -n kube-system exec master-etcd-ip-172-18-7-123.ec2.internal -- rpm -q etcd
Error from server: error dialing backend: remote error: tls: internal error

Expected results:
1. The command should succeed.

Additional info:

Comment 1 weiwei jiang 2018-04-25 05:44:51 UTC
Looks like the RotateKubeletServerCertificate feature gate is not always working:

Apr 25 01:27:09 ip-172-18-7-123.ec2.internal atomic-openshift-node[17928]: I0425 01:27:09.798484   17928 logs.go:49] http: TLS handshake error from 172.18.7.123:52280: no serving certificate available for the kubelet

[root@ip-172-18-7-123 ~]# openssl s_client -connect ip-172-18-7-123.ec2.internal:10250 -cert /etc/origin/master/admin.crt -key /etc/origin/master/admin.key  -debug                                                                                                             
CONNECTED(00000003)                                                                                      
write to 0x14a4160 [0x14b4b60] (289 bytes => 289 (0x121))                                                
0000 - 16 03 01 01 1c 01 00 01-18 03 03 f6 e5 4e be a2   .............N..                                
0010 - 94 8c b6 d7 06 84 31 5f-9f 40 ce 14 85 2a 1d 08   ......1_.@...*..                                
0020 - d3 cc 7c b0 37 04 11 b9-e6 71 a9 00 00 ac c0 30   ..|.7....q.....0                                
0030 - c0 2c c0 28 c0 24 c0 14-c0 0a 00 a5 00 a3 00 a1   .,.(.$..........                                
0040 - 00 9f 00 6b 00 6a 00 69-00 68 00 39 00 38 00 37   ...k.j.i.h.9.8.7                                                               
0050 - 00 36 00 88 00 87 00 86-00 85 c0 32 c0 2e c0 2a   .6.........2...*                                
0060 - c0 26 c0 0f c0 05 00 9d-00 3d 00 35 00 84 c0 2f   .&.......=.5.../                                                                                                                                          
0070 - c0 2b c0 27 c0 23 c0 13-c0 09 00 a4 00 a2 00 a0   .+.'.#..........                                                               
0080 - 00 9e 00 67 00 40 00 3f-00 3e 00 33 00 32 00 31   ...g.@.?.>.3.2.1                                                               
0090 - 00 30 00 9a 00 99 00 98-00 97 00 45 00 44 00 43   .0.........E.D.C                                                               
00a0 - 00 42 c0 31 c0 2d c0 29-c0 25 c0 0e c0 04 00 9c   .B.1.-.).%......                                                               
00b0 - 00 3c 00 2f 00 96 00 41-c0 12 c0 08 00 16 00 13   .<./...A........                                                               
00c0 - 00 10 00 0d c0 0d c0 03-00 0a 00 07 c0 11 c0 07   ................                                                               
00d0 - c0 0c c0 02 00 05 00 04-00 ff 01 00 00 43 00 0b   .............C..                                                               
00e0 - 00 04 03 00 01 02 00 0a-00 0a 00 08 00 17 00 19   ................                                                               
00f0 - 00 18 00 16 00 23 00 00-00 0d 00 20 00 1e 06 01   .....#..... ....                                                               
0100 - 06 02 06 03 05 01 05 02-05 03 04 01 04 02 04 03   ................                                                               
0110 - 03 01 03 02 03 03 02 01-02 02 02 03 00 0f 00 01   ................                                                               
0120 - 01                                                .          
read from 0x14a4160 [0x14ba0c0] (7 bytes => 7 (0x7))
0000 - 15 03 03 00 02 02 50                              ......P    
139824508856208:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:769:                           
---                                                 
no peer certificate available                       
---                                                 
No client certificate CA names sent                 
---                                                 
SSL handshake has read 7 bytes and written 289 bytes
---                                                 
New, (NONE), Cipher is (NONE)                       
Secure Renegotiation IS NOT supported               
Compression: NONE                                   
Expansion: NONE                                     
No ALPN negotiated                                  
SSL-Session:                                                        
    Protocol  : TLSv1.2                                             
    Cipher    : 0000                                                
    Session-ID:                                                     
    Session-ID-ctx:                                                 
    Master-Key:                                                     
    Key-Arg   : None                                                
    Krb5 Principal: None                                            
    PSK identity: None                                              
    PSK identity hint: None                                         
    Start Time: 1524634937                                          
    Timeout   : 300 (sec)                                           
    Verify return code: 0 (ok)                                      
---

Comment 2 weiwei jiang 2018-04-25 06:30:22 UTC
And all the CSRs are in Pending status.

[root@ip-172-18-7-123 ~]# oc get csr
NAME        AGE       REQUESTOR                                  CONDITION
csr-4ngmc   2h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-56r9t   1h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-5hblp   1h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-6hw44   3h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-9vzhn   3h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-9w4fq   1h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-b5q55   24m       system:node:ip-172-18-7-123.ec2.internal   Pending
csr-fhskh   37m       system:node:ip-172-18-7-123.ec2.internal   Pending
csr-h5tgj   3h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-hrhx4   2h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-j8gjw   50m       system:node:ip-172-18-7-123.ec2.internal   Pending
csr-mvxdj   2h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-npc8s   3h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-qg4gj   2h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-sd2p6   2h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-swmdg   3h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-tczzq   11m       system:node:ip-172-18-7-123.ec2.internal   Pending
csr-txmq9   1h        system:node:ip-172-18-7-123.ec2.internal   Pending
csr-v5zs6   1h        system:node:ip-172-18-7-123.ec2.internal   Pending
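These pending requests can be cleared by hand as a stopgap (this is a workaround, not the fix that eventually shipped in openshift-ansible). A minimal sketch, assuming `oc` is logged in with rights to approve certificates; the Pending filter is a small function over the tabular output above:

```shell
# Print the NAME column for every CSR whose CONDITION is Pending,
# reading `oc get csr` tabular output (NAME AGE REQUESTOR CONDITION)
# on stdin.
pending_csrs() {
  awk '$NF == "Pending" {print $1}'
}

# Intended use against a live cluster (workaround only):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```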

Comment 3 Jordan Liggitt 2018-04-30 15:26:16 UTC
how was this cluster installed?

Ansible has a mode that approves CSRs, or there is an option to set up an auto-approver. Was either of those options used?

Comment 4 weiwei jiang 2018-05-02 03:15:55 UTC
(In reply to Jordan Liggitt from comment #3)
> how was this cluster installed?

Via ansible without openshift_master_bootstrap_auto_approve=true.

> 
> ansible has a mode that approves CSRs, or there is an option to set up an
> auto-approver. was either of those options used?

We set up via Ansible, and when it was done, we found that commands requiring a protocol upgrade against a specific node did not work; then we found that the node was not serving any certificates, and all CSRs were pending.

Comment 5 Clayton Coleman 2018-05-02 03:24:28 UTC
If you installed with ansible, there should have been specific tasks that ran called:

- name: Approve node certificates when bootstrapping
- name: Bootstrap all nodes that were identified with bootstrap metadata
- name: approve nodes

If you had pending CSRs post installation and any of these tasks ran, then you should have approved CSRs.

Comment 6 Jordan Liggitt 2018-05-02 15:26:43 UTC
Can you verify those ansible tasks ran?

Comment 7 Johnny Liu 2018-05-03 09:43:59 UTC
Hit this issue again when running an installation with openshift-ansible-3.10.0-0.31.0.git.0.9f771c7.el7.noarch; in my installation log, I did not see the above Ansible tasks run.
- name: Approve node certificates when bootstrapping
- name: Bootstrap all nodes that were identified with bootstrap metadata
- name: approve nodes

I searched for those tasks in the openshift-ansible code; it seems like a fresh install would not call those tasks.

[jialiu@dhcp-141-223 openshift-ansible]$ git checkout openshift-ansible-3.10.0-0.31.0 -b openshift-ansible-3.10.0-0.31.0
Switched to a new branch 'openshift-ansible-3.10.0-0.31.0'
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "Approve node certificates when bootstrapping" -i *
roles/openshift_node/tasks/upgrade.yml:- name: Approve node certificates when bootstrapping
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "Bootstrap all nodes that were identified with bootstrap metadata" -i *
roles/openshift_gcp/tasks/configure_master_bootstrap.yml:- name: Bootstrap all nodes that were identified with bootstrap metadata
[jialiu@dhcp-141-223 openshift-ansible]$ grep -r "approve nodes" * -i
roles/openshift_aws/tasks/accept_nodes.yml:- name: approve nodes

Comment 8 Scott Dodson 2018-05-21 19:32:52 UTC
After discussing w/ Jordan and Clayton we believe the problem here is that we're not waiting for both a client and server CSR on each node.

We need to extend oc_adm_csr to support a list of expected usages and loop until all of those conditions have been fulfilled.

usages: ['client auth','server auth']
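The proposed check could look roughly like this (a hypothetical sketch, not the actual oc_adm_csr module; the helper name and the record format are invented for illustration): given one `name<TAB>usages<TAB>condition` record per CSR, succeed only when every expected usage appears on at least one approved request.

```shell
# Exit 0 only if each expected usage (passed as arguments) is covered by
# at least one Approved record on stdin. Records are tab-separated
# "name<TAB>usages<TAB>condition" lines, e.g. assembled with
# `oc get csr -o custom-columns=...` (format assumed for illustration).
usages_fulfilled() {
  local records u
  records=$(cat)
  for u in "$@"; do
    printf '%s\n' "$records" \
      | awk -F'\t' -v u="$u" '$2 ~ u && $3 ~ /Approved/ {found=1} END {exit !found}' \
      || return 1
  done
}

# e.g.:  ... | usages_fulfilled 'client auth' 'server auth'
```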

Comment 9 Johnny Liu 2018-05-23 02:13:31 UTC
Hit such issue again with v3.10.0-0.50.0

Comment 10 Jordan Liggitt 2018-05-25 13:38:21 UTC
*** Bug 1582295 has been marked as a duplicate of this bug. ***

Comment 11 Scott Dodson 2018-05-30 14:03:37 UTC

*** This bug has been marked as a duplicate of bug 1579267 ***

Comment 19 Mike Fiedler 2018-07-10 20:10:33 UTC
SVT hit this on multiple clusters on 3.10.15. The symptom is that at least one node is non-functional. Actions against pods running on the node (getting logs, Heapster trying to get metrics, etc.) get:

remote error: tls: internal error

The node journal on the affected node is full of:


Jul 10 19:58:32 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:58:32.936204   42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36604: no serving certificate available for the kubelet
Jul 10 19:58:34 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:58:34.680273   42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36614: no serving certificate available for the kubelet
Jul 10 19:59:32 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:59:32.936305   42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36776: no serving certificate available for the kubelet
Jul 10 19:59:34 ip-172-31-81-82.us-west-2.compute.internal atomic-openshift-node[42798]: I0710 19:59:34.680065   42798 logs.go:49] http: TLS handshake error from 172.31.49.212:36786: no serving certificate available for the kubelet


It seems more prevalent now. Manually approving the pending CSRs and restarting the node seems to fix it, but we are not sure whether the node missed anything in the upgrade because of this.

Comment 21 DeShuai Ma 2018-07-12 05:45:49 UTC
Also hit this bug when doing OpenShift v3.10.15 OpenStack integration testing. Raising the severity to high.

Comment 23 Scott Dodson 2018-07-18 18:55:42 UTC
Jian,

What version was comment #22 using? We've fixed some problems and we've just fixed https://bugzilla.redhat.com/show_bug.cgi?id=1594341 which is pretty similar.

Comment 25 Jian Zhang 2018-07-20 01:45:36 UTC
Scott,

The version is OpenShift v3.10.15, from the OpenStack integration testing.

Comment 29 Matt Bruzek 2018-08-09 15:21:28 UTC
We are seeing a similar CSR problem in the OpenShift on OpenStack automation for 3.11. The symptom is that CNS fails during the install. The error message looks to be heketi-related and does not mention CSRs.

The task that fails is: 
TASK [openshift_storage_glusterfs : Load heketi topology] **********************
... (verbose lines omitted)
    "stdout_lines": [
        "Creating cluster ... ID: 495b9db9d299dcf9a8d3e38db62fbb40",
        "\tAllowing file volumes on cluster.",
        "\tAllowing block volumes on cluster.",
        "\tCreating node cns-1.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running",
        "\tCreating node cns-0.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running",
        "\tCreating node cns-2.scale-ci.example.com ... Unable to create node: New Node doesn't have glusterd running"
    ]

While trying to debug the CNS problem we tried rsh or showing logs and got some errors:

root@master-0: ~ # oc rsh --namespace=glusterfs glusterfs-storage-btd25 /bin/bash                                                                                                                  
Error from server: error dialing backend: remote error: tls: internal error

root@master-0: ~ # oc logs sync-zx2b7 -n openshift-node
Error from server: Get https://cns-0.scale-ci.example.com:10250/containerLogs/openshift-node/sync-zx2b7/sync: remote error: tls: internal error

This led me to check the CSRs and I saw hundreds of them in pending state:

root@master-0: ~ # oc get csr                                                                                                                                                                      
NAME        AGE       REQUESTOR                                       CONDITION  
...
csr-c8qph   8h        system:node:infra-node-1.scale-ci.example.com   Pending                                                                                                                      
csr-c9ckt   6h        system:node:app-node-1.scale-ci.example.com     Pending                                                                                                                      
csr-cgpb6   7h        system:node:app-node-1.scale-ci.example.com     Pending                                                                                                                      
csr-ck4mk   8h        system:node:cns-1.scale-ci.example.com          Pending                                                                                                                      
csr-ckwhf   10h       system:node:cns-1.scale-ci.example.com          Pending                                                                                                                      
csr-clcjk   7h        system:node:infra-node-2.scale-ci.example.com   Pending                                                                                                                      
csr-cn4kk   1h        system:node:app-node-1.scale-ci.example.com     Pending                                                                                                                      
csr-cnkrl   14h       system:node:cns-0.scale-ci.example.com          Pending                                                                                                                      
csr-cp94d   3h        system:node:cns-0.scale-ci.example.com          Pending                                                                                                                      
csr-cspsl   7h        system:node:infra-node-2.scale-ci.example.com   Pending                                                                                                                      
csr-cw62g   9h        system:node:cns-1.scale-ci.example.com          Pending                                                                                                                      
csr-cxxqz   2h        system:node:infra-node-0.scale-ci.example.com   Pending                                                                                                                      
csr-czbd5   5h        system:node:infra-node-2.scale-ci.example.com   Pending                                                                                                                      
csr-czdnz   1h        system:node:infra-node-2.scale-ci.example.com   Pending 
...

After destroying this cluster and re-running an install I was able to manually approve all CSRs during an install and confirm that we can get past the CNS task that normally fails. This points to the Pending CSRs preventing something in CNS from completing successfully.

So far we can reproduce this problem 100% of the time using our automation. If anyone needs logs or wants to look at a live system that can be arranged.

Comment 34 Scott Dodson 2018-08-17 21:33:51 UTC
We'll refactor oc_adm_csr to check for NodeReady as a signal that the client certificate is deployed, you can't post Ready state if you have no client certificate.

We'll use `oc get --raw /api/v1/nodes/${node}/proxy/healthz` as a signal that the server certificate has been deployed successfully.

Example
# oc get --raw /api/v1/nodes/ose3-node1.example.com/proxy/healthz
Error from server (ServiceUnavailable): the server is currently unable to handle the request
# oc get csr
NAME        AGE       REQUESTOR                            CONDITION
csr-c7j6n   14s       system:node:ose3-node1.example.com   Pending
# oc adm certificate approve csr-c7j6n
certificatesigningrequest.certificates.k8s.io "csr-c7j6n" approved
# oc get --raw /api/v1/nodes/ose3-node1.example.com/proxy/healthz
ok

We'll loop approving all certificates for ${node} until the two conditions above are fulfilled.
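That loop can be sketched in shell (an illustration of the approach described above, not the real oc_adm_csr refactor; the node name and helper names are placeholders, and the pure checks are factored out so the cluster calls stay at the edges):

```shell
# True when the kubelet healthz endpoint answered "ok", i.e. the serving
# certificate has been deployed.
serving_cert_ok() { [ "$1" = "ok" ]; }

# Print names of Pending CSRs filed by the given node, reading
# `oc get csr` tabular output (NAME AGE REQUESTOR CONDITION) on stdin.
pending_csrs_for() {
  awk -v n="system:node:$1" '$3 == n && $NF == "Pending" {print $1}'
}

# Intended loop against a live cluster (comment 34 additionally checks
# NodeReady via `oc get node` as the client-certificate signal):
#   node=ose3-node1.example.com
#   until serving_cert_ok "$(oc get --raw /api/v1/nodes/$node/proxy/healthz 2>/dev/null)"; do
#     oc get csr | pending_csrs_for "$node" | xargs -r oc adm certificate approve
#     sleep 5
#   done
```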

Comment 35 Michael Gugino 2018-08-22 13:58:24 UTC
PR in progress: https://github.com/openshift/openshift-ansible/pull/9711

Comment 36 Scott Dodson 2018-08-24 15:37:45 UTC
Matt,

Since you've had the most success reproducing this problem with your CNS installs. Can you please give the latest from master branch a try and see if the issue has been resolved?

Comment 37 Scott Dodson 2018-08-27 12:56:16 UTC
In openshift-ansible-3.11.0-0.23.0

Comment 38 Johnny Liu 2018-08-29 10:59:54 UTC
Re-tested this with openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch and did not hit the issue. Moving to Verified.

Comment 39 Matt Bruzek 2018-09-10 13:05:56 UTC
Verified with 3.11.0-0.25.0

Comment 43 errata-xmlrpc 2018-10-11 07:19:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

