Bug 1579267
Summary: | Cannot access node api after upgrade from 3.9 to 3.10 | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Wang Haoran <haowang> |
Component: | Cluster Version Operator | Assignee: | Andrew Butcher <abutcher> |
Status: | CLOSED ERRATA | QA Contact: | Wang Haoran <haowang> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.10.0 | CC: | abutcher, agawand, aos-bugs, bleanhar, dma, ekuric, haowang, jeder, jokerman, mifiedle, mmccomas, vrutkovs, wmeng, wsun, xtian, yinzhou |
Target Milestone: | --- | Keywords: | TestBlocker |
Target Release: | 3.10.0 | Flags: | sdodson: needinfo- |
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | aos-scalability-310 | ||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-07-30 19:15:42 UTC | Type: | Bug |
Description
Wang Haoran
2018-05-17 09:30:35 UTC
Do you have logs from the node? Seth, any ideas?

I was able to reproduce this on one of three hosts.

    [root@ose3-master ~]# curl -v --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://ose3-node2.example.com:10250/healthz
    * About to connect() to ose3-node2.example.com port 10250 (#0)
    * Trying 192.168.122.118...
    * Connected to ose3-node2.example.com (192.168.122.118) port 10250 (#0)
    * Initializing NSS with certpath: sql:/etc/pki/nssdb
    * CAfile: /etc/origin/master/ca.crt CApath: none
    * NSS error -12188 (SSL_ERROR_INTERNAL_ERROR_ALERT)
    * Peer reports it experienced an internal error.
    * Closing connection 0
    curl: (35) Peer reports it experienced an internal error.

On node 2:

    May 17 11:17:06 ose3-node2.example.com atomic-openshift-node[1783]: I0517 11:17:06.845254 1783 logs.go:49] http: TLS handshake error from 192.168.122.52:57126: no serving certificate available for the kubelet

What I find is that there's no /etc/origin/node/certificates/kubelet-server-current.pem. I have pending CSRs, many of them for this node. I deleted all pending CSRs, a new one was created, I approved that, and functionality was restored. Now to figure out whether this went wrong over time or has been that way ever since my 3.9 to 3.10 upgrade.

I tried to reproduce this via an upgrade. Now one of my masters generates a CSR, but it wasn't approved. When I manually approve the cert, it never gets issued.

Probably rehashing what is already known, but in case it is not: the CSR requests are approved by this:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml

If the CSRs are being approved but not issued, that would be a certificates controller issue:
https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/certificates

This is happening because we never wait for the kubelet server CSR to come through.
Likely the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1571515

The solution is to loop on CSR approval until we see both a client and a server CSR for each host we care about. This may take 5 minutes or more.

*** Bug 1571515 has been marked as a duplicate of this bug. ***

https://github.com/openshift/openshift-ansible/pull/8578 should fix this.

The PR has been merged to 3.10.0-0.56.

Confirmed with the latest OCP; the issue has been fixed:
openshift v3.10.0-0.58.0
Upgrade from OCP 3.9.

    [root@qe-yinzhou-39-master-etcd-1 ~]# curl --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://qe-yinzhou-39-node-registry-router-1:10250/healthz
    ok

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:1816
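
For readers who want to see the shape of the proposed fix ("loop on CSR approval until we see both a client and server CSR for each host"), here is a minimal Python sketch of that loop. It is not the code from the openshift-ansible PR, which is Ansible and may be structured differently. The function name `approve_until_complete`, the `list_pending` and `approve` callables, and the CSR record layout (`name`, `host`, `kind`) are all hypothetical; in a real cluster they would have to be backed by whatever lists and approves CSRs and classifies each one as a kubelet client or server request.

```python
import time

# Hypothetical labels: every host needs one CSR of each kind approved.
REQUIRED_KINDS = frozenset({"client", "server"})


def approve_until_complete(hosts, list_pending, approve,
                           timeout=600.0, interval=5.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Approve pending CSRs until each host has had a client and a server CSR.

    list_pending() -> iterable of dicts with "name", "host" and "kind" keys
                      (an assumed record shape, not a Kubernetes API type).
    approve(name)  -> approves the CSR with that name.
    Raises TimeoutError if some host is still incomplete after `timeout` seconds.
    """
    seen = {host: set() for host in hosts}
    deadline = clock() + timeout
    while True:
        for csr in list_pending():
            host = csr["host"]
            if host in seen:  # ignore CSRs from hosts we do not care about
                approve(csr["name"])
                seen[host].add(csr["kind"])
        missing = {h: sorted(REQUIRED_KINDS - kinds)
                   for h, kinds in seen.items() if REQUIRED_KINDS - kinds}
        if not missing:
            return seen
        if clock() >= deadline:
            raise TimeoutError(f"still waiting for CSRs: {missing}")
        sleep(interval)
```

The default timeout of 600 seconds is an assumption chosen to leave headroom over the "5 minutes or more" mentioned above. The point of the sketch is the exit condition: the loop does not stop after the first approved CSR per host, which is the behavior that left nodes without a serving certificate.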