Bug 1579267

Summary: Cannot access node api after upgrade from 3.9 to 3.10
Product: OpenShift Container Platform
Reporter: Wang Haoran <haowang>
Component: Cluster Version Operator
Assignee: Andrew Butcher <abutcher>
Status: CLOSED ERRATA
QA Contact: Wang Haoran <haowang>
Severity: high
Priority: high
Docs Contact:
Version: 3.10.0
CC: abutcher, agawand, aos-bugs, bleanhar, dma, ekuric, haowang, jeder, jokerman, mifiedle, mmccomas, vrutkovs, wmeng, wsun, xtian, yinzhou
Target Milestone: ---
Keywords: TestBlocker
Target Release: 3.10.0
Flags: sdodson: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-310
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 19:15:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Wang Haoran 2018-05-17 09:30:35 UTC
Description of problem:
After upgrading from 3.9 to 3.10, the API that the kubelet exposes cannot be accessed because its serving certificate is not correct. This causes "oc exec" to fail when running commands in pods and breaks master-api -> node-api communication.
Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch

How reproducible:
always
Steps to Reproduce:
1. Install the 3.9 cluster
2. Run the upgrade playbook
3. curl --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://<node_name>:10250/healthz
curl: (35) Peer reports it experienced an internal error.


Actual results:
curl fails with: curl: (35) Peer reports it experienced an internal error.

Expected results:
curl returns "ok" from the kubelet's /healthz endpoint.

Additional info:

Comment 4 Scott Dodson 2018-05-17 12:34:35 UTC
Do you have logs from the node?

Seth any ideas?

Comment 5 Scott Dodson 2018-05-17 15:49:34 UTC
I was able to reproduce this on one of three hosts.

[root@ose3-master ~]# curl -v --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://ose3-node2.example.com:10250/healthz
* About to connect() to ose3-node2.example.com port 10250 (#0)
*   Trying 192.168.122.118...
* Connected to ose3-node2.example.com (192.168.122.118) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/origin/master/ca.crt
  CApath: none
* NSS error -12188 (SSL_ERROR_INTERNAL_ERROR_ALERT)
* Peer reports it experienced an internal error.
* Closing connection 0
curl: (35) Peer reports it experienced an internal error.

On node 2 --

May 17 11:17:06 ose3-node2.example.com atomic-openshift-node[1783]: I0517 11:17:06.845254    1783 logs.go:49] http: TLS handshake error from 192.168.122.52:57126: no serving certificate available for the kubelet

What I find is that there's no /etc/origin/node/certificates/kubelet-server-current.pem. I have pending CSRs, many of them for this node. I deleted all pending CSRs, a new one was created, I approved that and functionality was restored. Now to figure out if this was something that went wrong over time or if it was that way ever since my 3.9 to 3.10 upgrade.
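The manual recovery in this comment (delete pending CSRs, let the kubelet re-submit, approve the fresh one) can be sketched as a script. This is a hedged sketch, not the fix itself: it assumes cluster-admin credentials and the 3.10 oc client, and the Pending-column filter assumes the condition is the last column of `oc get csr` output.

```shell
#!/bin/sh
# Sketch of the manual recovery from this comment: clear pending CSRs so the
# kubelet re-requests its serving certificate, then approve the fresh CSR.
# Assumes cluster-admin credentials and a logged-in oc client.

recover_kubelet_serving_cert() {
    # Delete every CSR still in Pending state (condition is the last column).
    oc get csr --no-headers | awk '$NF == "Pending" {print $1}' \
        | xargs -r oc delete csr

    # The kubelet re-submits a CSR shortly; approve whatever shows up Pending.
    sleep 30
    oc get csr --no-headers | awk '$NF == "Pending" {print $1}' \
        | xargs -r oc adm certificate approve

    # Then verify on the node that the serving cert was written out:
    #   ls /etc/origin/node/certificates/kubelet-server-current.pem
}

# Only meaningful against a live cluster; otherwise this is just a sketch.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
    recover_kubelet_serving_cert
else
    echo "oc not available or not logged in; sketch only"
fi
```

After the approval, the curl check against https://<node_name>:10250/healthz from the description should return "ok" instead of the TLS error.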

Comment 6 Scott Dodson 2018-05-18 02:06:32 UTC
Tried to reproduce this via an upgrade; now one of my masters generated a CSR but it wasn't approved. When I manually approved the CSR, the certificate was never issued.

Comment 7 Seth Jennings 2018-05-18 22:13:50 UTC
Probably rehashing what is already known, but in case it is not:

The CSR requests are approved by this:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml

If the CSRs are being approved but not issued, that would be a certificates controller issue:
https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/certificates

Comment 8 Scott Dodson 2018-05-22 12:31:26 UTC
This is happening because we never wait for the kubelet server CSR to come through. Likely same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1571515

The solution is to loop on CSR approval until we see both a client and a server CSR for each host we care about. This may take 5 minutes or more.
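A minimal sketch of that wait-and-retry loop is below. It is illustrative only: the helper names and timings are hypothetical, not taken from the actual openshift-ansible fix, and the readiness check simply reuses the curl probe from comment 5.

```shell
#!/bin/sh
# Illustrative sketch of polling until a node's kubelet serving certificate
# is issued. Helper names and retry counts are hypothetical.

# Poll a command until it succeeds: up to $1 attempts, $2 seconds apart.
wait_for() {
    attempts=$1; delay=$2; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Readiness check from comment 5: healthz answers "ok" only once the
# kubelet has a serving certificate.
node_has_server_cert() {
    curl -s --key /etc/origin/master/master.kubelet-client.key \
         --cert /etc/origin/master/master.kubelet-client.crt \
         --cacert /etc/origin/master/ca.crt \
         "https://$1:10250/healthz" | grep -q ok
}

# Usage against a live cluster, allowing the ~5 minutes mentioned above:
#   wait_for 30 10 node_has_server_cert ose3-node2.example.com
```

The actual implementation landed in openshift-ansible (see the PR linked in comment 11); this only shows the shape of the loop.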

Comment 10 Scott Dodson 2018-05-30 14:03:37 UTC
*** Bug 1571515 has been marked as a duplicate of this bug. ***

Comment 11 Scott Dodson 2018-05-31 12:46:28 UTC
https://github.com/openshift/openshift-ansible/pull/8578 should fix this

Comment 12 Wei Sun 2018-06-01 08:49:37 UTC
The PR has been merged to 3.10.0-0.56.

Comment 13 zhou ying 2018-06-04 05:18:30 UTC
Confirmed with the latest OCP that the issue is fixed:
openshift v3.10.0-0.58.0


Upgraded from OCP 3.9.

[root@qe-yinzhou-39-master-etcd-1 ~]# curl --key /etc/origin/master/master.kubelet-client.key --cert /etc/origin/master/master.kubelet-client.crt --cacert /etc/origin/master/ca.crt https://qe-yinzhou-39-node-registry-router-1:10250/healthz
ok

Comment 17 errata-xmlrpc 2018-07-30 19:15:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816