Description of problem:
After upgrading from 3.9 to 3.10, the API exposed by the kubelet cannot be accessed because the certificate is not correct. This causes "oc exec" to fail when running commands on a pod, and breaks master-api -> node-api communication.

Version-Release number of the following components:
openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install a 3.9 cluster.
2. Run the upgrade playbook.
3. Run the following command:
   curl --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://<node_name>:10250/healthz

Actual results:
curl: (35) Peer reports it experienced an internal error.
Do you have logs from the node? Seth, any ideas?
I was able to reproduce this on one of three hosts.

[root@ose3-master ~]# curl -v --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://ose3-node2.example.com:10250/healthz
* About to connect() to ose3-node2.example.com port 10250 (#0)
*   Trying 192.168.122.118...
* Connected to ose3-node2.example.com (192.168.122.118) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/origin/master/ca.crt
    CApath: none
* NSS error -12188 (SSL_ERROR_INTERNAL_ERROR_ALERT)
* Peer reports it experienced an internal error.
* Closing connection 0
curl: (35) Peer reports it experienced an internal error.

On node 2:

May 17 11:17:06 ose3-node2.example.com atomic-openshift-node[1783]: I0517 11:17:06.845254 1783 logs.go:49] http: TLS handshake error from 192.168.122.52:57126: no serving certificate available for the kubelet

What I find is that there is no /etc/origin/node/certificates/kubelet-server-current.pem on the node, and there are many pending CSRs for this node. I deleted all pending CSRs, a new one was created, I approved it, and functionality was restored.

Now to figure out whether something went wrong over time or whether it has been this way ever since my 3.9 to 3.10 upgrade.
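The workaround above starts by finding the pending CSRs. As a rough sketch only: assuming the CSR list is obtained as JSON (for example from `oc get csr -o json`), a CSR can be treated as pending when it carries neither an `Approved` nor a `Denied` condition. The helper name `pending_csr_names` is hypothetical and is not part of openshift-ansible.

```python
import json


def pending_csr_names(csr_list_json):
    """Return the names of CSRs with no Approved or Denied condition.

    Assumption: csr_list_json is the text of a CSR list in JSON form,
    with items[].metadata.name and items[].status.conditions[].type.
    """
    names = []
    for item in json.loads(csr_list_json).get("items", []):
        conditions = (item.get("status") or {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names
```

The returned names could then be passed to `oc delete csr <name>` or `oc adm certificate approve <name>`, depending on which step of the workaround is being carried out.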
I tried to reproduce this via an upgrade. Now one of my masters generates a CSR, but it is not approved automatically. When I manually approve it, the certificate never gets issued.
This is probably rehashing what is already known, but in case it is not:

The CSRs are approved by this:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml

If the CSRs are being approved but not issued, that would be a certificates controller issue:
https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/certificates
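To tell "not approved" apart from "approved but not issued", one option is to inspect each CSR object. This is a minimal sketch, assuming the CSR is available as a parsed JSON object and that an issued certificate shows up in `status.certificate`; the function name `csr_state` and the state labels are made up for illustration.

```python
def csr_state(csr):
    """Classify a single CSR object (a dict parsed from JSON).

    Returns one of: "denied", "pending", "issued", "approved-not-issued".
    The last state is the one that points at the certificates controller
    rather than at the approver.
    """
    status = csr.get("status") or {}
    types = {c.get("type") for c in status.get("conditions") or []}
    if "Denied" in types:
        return "denied"
    if "Approved" not in types:
        return "pending"
    if status.get("certificate"):
        return "issued"
    return "approved-not-issued"
```

A CSR that stays in "approved-not-issued" would match the symptom where a manually approved request never results in a certificate.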
This is happening because we never wait for the kubelet server CSR to come through. Likely the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1571515

The solution is to loop on CSR approval until we see both a client and a server CSR for each host we care about. This may take 5 minutes or more.
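The loop described above could look roughly like the following. This is an illustrative sketch, not the code from the actual fix: the record shape (`node`, `kind`, `approved`), the function names, and the `fetch_and_approve` callback are all assumptions. The callback stands in for whatever lists CSRs, approves the pending ones, and returns the current set.

```python
import time


def missing_csrs(nodes, csrs):
    """Map each node to the CSR kinds ("client", "server") not yet approved."""
    wanted = {node: {"client", "server"} for node in nodes}
    for csr in csrs:
        if csr["approved"] and csr["node"] in wanted:
            wanted[csr["node"]].discard(csr["kind"])
    return {node: kinds for node, kinds in wanted.items() if kinds}


def wait_for_csrs(nodes, fetch_and_approve, timeout=600, interval=10,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until every node has an approved client and server CSR.

    Returns an empty dict on success, or the still-missing kinds per
    node once the timeout (in seconds) has elapsed.
    """
    deadline = clock() + timeout
    while True:
        missing = missing_csrs(nodes, fetch_and_approve())
        if not missing or clock() >= deadline:
            return missing
        sleep(interval)
```

The default timeout of 600 seconds is a guess chosen to leave headroom over the "5 minutes or more" mentioned above. Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.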
*** Bug 1571515 has been marked as a duplicate of this bug. ***
https://github.com/openshift/openshift-ansible/pull/8578 should fix this
The PR has been merged to 3.10.0-0.56.
Confirmed with the latest OCP; the issue is fixed:

openshift v3.10.0-0.58.0

After upgrading from OCP 3.9:

[root@qe-yinzhou-39-master-etcd-1 ~]# curl --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://qe-yinzhou-39-node-registry-router-1:10250/healthz
ok
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816