Bug 1579267 - Cannot access node api after upgrade from 3.9 to 3.10
Summary: Cannot access node api after upgrade from 3.9 to 3.10
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.10.0
Assignee: Andrew Butcher
QA Contact: Wang Haoran
Whiteboard: aos-scalability-310
Depends On:
Reported: 2018-05-17 09:30 UTC by Wang Haoran
Modified: 2019-01-24 13:03 UTC (History)
16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-07-30 19:15:42 UTC
Target Upstream Version:
sdodson: needinfo-

Attachments

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:16:03 UTC
Red Hat Bugzilla 1572870 high CLOSED Fail to upgrade against bootstrap ocp due to node can not start 2019-12-06 05:02:59 UTC

Internal Links: 1572870

Description Wang Haoran 2018-05-17 09:30:35 UTC
Description of problem:
After upgrading from 3.9 to 3.10, the API that the kubelet exposes cannot be accessed because its serving certificate is not correct. As a result, "oc exec" fails to run commands in pods, breaking the master-api -> node-api path.
Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version


How reproducible:
Steps to Reproduce:
1. Install the 3.9 cluster
2. Run the upgrade playbook
3. curl --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://<node_name>:10250/healthz
curl: (35) Peer reports it experienced an internal error.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 4 Scott Dodson 2018-05-17 12:34:35 UTC
Do you have logs from the node?

Seth, any ideas?

Comment 5 Scott Dodson 2018-05-17 15:49:34 UTC
I was able to reproduce this on one of three hosts.

[root@ose3-master ~]# curl -v --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://ose3-node2.example.com:10250/healthz
* About to connect() to ose3-node2.example.com port 10250 (#0)
*   Trying
* Connected to ose3-node2.example.com ( port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/origin/master/ca.crt
  CApath: none
* Peer reports it experienced an internal error.
* Closing connection 0
curl: (35) Peer reports it experienced an internal error.

On node 2 --

May 17 11:17:06 ose3-node2.example.com atomic-openshift-node[1783]: I0517 11:17:06.845254    1783 logs.go:49] http: TLS handshake error from no serving certificate available for the kubelet

What I find is that there's no /etc/origin/node/certificates/kubelet-server-current.pem. I have pending CSRs, many of them for this node. I deleted all pending CSRs, a new one was created, I approved that and functionality was restored. Now to figure out if this was something that went wrong over time or if it was that way ever since my 3.9 to 3.10 upgrade.
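The symptom described above can be checked for directly. Below is a minimal sketch, not an official tool: the certificate path is taken from this report, and check_serving_cert is a helper name invented here for illustration.

```shell
#!/bin/sh
# Quick check for the symptom from this comment: the kubelet's serving
# certificate file is missing on the node. The path comes from this
# report; check_serving_cert is a hypothetical helper for this sketch.
cert=/etc/origin/node/certificates/kubelet-server-current.pem

check_serving_cert() {
    if [ -f "$1" ]; then
        echo present
    else
        echo missing
    fi
}

status=$(check_serving_cert "$cert")
echo "kubelet serving cert: $status"
# If missing, inspect the pending CSRs on a master with: oc get csr
```

If the certificate is missing, the recovery described above amounts to deleting the stale pending CSRs for the node (oc delete csr <name>), waiting for a fresh one, and approving it (oc adm certificate approve <name>).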

Comment 6 Scott Dodson 2018-05-18 02:06:32 UTC
Tried to reproduce this via an upgrade; now one of my masters generates a CSR, but it isn't approved. When I manually approve the CSR, the certificate never gets issued.

Comment 7 Seth Jennings 2018-05-18 22:13:50 UTC
Probably rehashing what is already known, but in case it is not:

The CSR requests are approved by this:

If the CSRs are being approved but not issued, that would be a certificates controller issue:

Comment 8 Scott Dodson 2018-05-22 12:31:26 UTC
This is happening because we never wait for the kubelet server CSR to come through. Likely same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1571515

The solution is to loop on CSR approval until we see both a client and a server CSR for each host we care about. This may take 5 minutes or more.
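The approve-and-wait loop described in this comment can be sketched as follows. This is illustrative only: csr_types is a stub standing in for a filtered `oc get csr` query so the control flow can run outside a cluster, and the real change landed in the openshift-ansible pull request linked below.

```shell
#!/bin/sh
# Sketch of the fix from comment 8: keep polling until both a client
# and a server CSR have been issued for the node, bounded at roughly
# 5 minutes. csr_types is a stub (not a real oc subcommand) that
# simulates the server CSR appearing on the third poll.
node="ose3-node2.example.com"
polls=0

csr_types() {
    # Stub: pretend the server CSR only shows up on the third poll.
    if [ "$polls" -ge 3 ]; then
        echo "client server"
    else
        echo "client"
    fi
}

until csr_types | grep -q client && csr_types | grep -q server; do
    polls=$((polls + 1))
    if [ "$polls" -gt 30 ]; then        # ~5 minutes at 10s per poll
        echo "timed out waiting for CSRs for $node" >&2
        exit 1
    fi
    # The real loop would approve pending CSRs here and sleep, e.g.:
    #   oc adm certificate approve <pending-csr-names>; sleep 10
done

echo "client and server CSRs seen for $node after $polls polls"
```

The timeout matters because, as noted above, the server CSR can take several minutes to appear after the client CSR is issued.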

Comment 10 Scott Dodson 2018-05-30 14:03:37 UTC
*** Bug 1571515 has been marked as a duplicate of this bug. ***

Comment 11 Scott Dodson 2018-05-31 12:46:28 UTC
https://github.com/openshift/openshift-ansible/pull/8578 should fix this

Comment 12 Wei Sun 2018-06-01 08:49:37 UTC
The PR has been merged to 3.10.0-0.56.

Comment 13 zhou ying 2018-06-04 05:18:30 UTC
Confirmed with the latest OCP; the issue has been fixed:
openshift v3.10.0-0.58.0

Upgraded from OCP 3.9.

[root@qe-yinzhou-39-master-etcd-1 ~]# curl --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://qe-yinzhou-39-node-registry-router-1:10250/healthz

Comment 17 errata-xmlrpc 2018-07-30 19:15:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

