Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1622099

Summary: Client certificates expire in App and Infra nodes and are not rotated
Product: OpenShift Container Platform Reporter: Jon Uriarte <juriarte>
Component: Node Assignee: Ryan Phillips <rphillips>
Status: CLOSED NOTABUG QA Contact: DeShuai Ma <dma>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.10.0 CC: aos-bugs, jokerman, mmccomas, sdodson, sjenning
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-29 16:14:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Logs and inventory file none

Description Jon Uriarte 2018-08-24 12:54:09 UTC
Created attachment 1478526
Logs and inventory file

Description of problem:

In an otherwise successful OCP on OpenStack deployment, the client certificates on the App and Infra nodes expire after some hours and are not rotated.
This causes the atomic-openshift-node service on those nodes to be restarted, but it cannot start successfully and ends up in a restart loop.
The App and Infra nodes remain in NotReady status when this happens.

Version-Release number of the following components:

rpm -q openshift-ansible
openshift-ansible-3.10.34-1.git.0.48df172None.noarch

rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch

ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible: Always, after some hours, once the certificates stop being rotated

Steps to Reproduce:
1. Install OpenStack
2. Install OpenShift via OpenStack playbooks (kuryr enabled)
   ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/install.yml
   Deployed 1 master, 1 infra and 2 app nodes 

Actual results:
The Ansible playbook finishes successfully, with no errors.
The nodes are deployed correctly (in Ready status) and certificate rotation seems to be working.
But after some time, the App and Infra nodes go to NotReady status:
 $ oc get nodes
 NAME                                 STATUS     ROLES     AGE       VERSION
 app-node-0.openshift.example.com     NotReady   compute   1d        v1.10.0+b81c8f8
 app-node-1.openshift.example.com     NotReady   compute   1d        v1.10.0+b81c8f8
 infra-node-0.openshift.example.com   NotReady   infra     1d        v1.10.0+b81c8f8
 master-0.openshift.example.com       Ready      master    1d        v1.10.0+b81c8f8
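
The affected nodes can be picked out of the `oc get nodes` table mechanically. The following is a minimal sketch in Python, assuming the whitespace-separated column layout shown above; the function name `not_ready_nodes` is hypothetical and not part of any OpenShift tooling.

```python
def not_ready_nodes(oc_get_nodes_output):
    """Return the names of nodes whose STATUS column does not include 'Ready'."""
    lines = [ln for ln in oc_get_nodes_output.splitlines() if ln.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    result = []
    for ln in lines[1:]:
        cols = ln.split()
        # STATUS may be a comma-separated list, e.g. "Ready,SchedulingDisabled".
        if "Ready" not in cols[status_idx].split(","):
            result.append(cols[name_idx])
    return result


# Table copied from the report above.
sample = """\
NAME                                 STATUS     ROLES     AGE       VERSION
app-node-0.openshift.example.com     NotReady   compute   1d        v1.10.0+b81c8f8
app-node-1.openshift.example.com     NotReady   compute   1d        v1.10.0+b81c8f8
infra-node-0.openshift.example.com   NotReady   infra     1d        v1.10.0+b81c8f8
master-0.openshift.example.com       Ready      master    1d        v1.10.0+b81c8f8
"""

print(not_ready_nodes(sample))
```

For the table in this report, the sketch lists the two App nodes and the Infra node, and leaves out the master.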

At this point, the atomic-openshift-node service is stopped (see app0-logs.txt) due to client certificate expiration,
and then started again. The service cannot start successfully (see app0-logs2.txt), so it enters a
restart loop.

It seems to be related to certificate rotation, which apparently stops working so no new certificates
are created after the last one expires.

[root@app-node-0 ~]# ls -ltr /etc/origin/node/certificates/
total 96
-rw-------. 1 root root 1171 ago 22 10:29 kubelet-client-2018-08-22-10-29-27.pem
-rw-------. 1 root root 1293 ago 22 10:29 kubelet-server-2018-08-22-10-29-31.pem
-rw-------. 1 root root 1171 ago 22 10:38 kubelet-client-2018-08-22-10-38-55.pem
-rw-------. 1 root root 1293 ago 22 10:40 kubelet-server-2018-08-22-10-40-09.pem
-rw-------. 1 root root 1293 ago 22 10:50 kubelet-server-2018-08-22-10-50-24.pem
-rw-------. 1 root root 1171 ago 22 10:51 kubelet-client-2018-08-22-10-51-38.pem
-rw-------. 1 root root 1293 ago 22 11:01 kubelet-server-2018-08-22-11-01-20.pem
-rw-------. 1 root root 1171 ago 22 11:04 kubelet-client-2018-08-22-11-04-22.pem
-rw-------. 1 root root 1293 ago 22 11:11 kubelet-server-2018-08-22-11-11-12.pem
-rw-------. 1 root root 1171 ago 22 11:14 kubelet-client-2018-08-22-11-14-30.pem
-rw-------. 1 root root 1293 ago 22 11:23 kubelet-server-2018-08-22-11-23-29.pem
-rw-------. 1 root root 1171 ago 22 11:26 kubelet-client-2018-08-22-11-26-58.pem
-rw-------. 1 root root 1293 ago 22 11:33 kubelet-server-2018-08-22-11-33-35.pem
-rw-------. 1 root root 1171 ago 22 11:36 kubelet-client-2018-08-22-11-36-35.pem
-rw-------. 1 root root 1293 ago 22 11:45 kubelet-server-2018-08-22-11-45-09.pem
-rw-------. 1 root root 1171 ago 22 11:46 kubelet-client-2018-08-22-11-46-30.pem
-rw-------. 1 root root 1293 ago 22 11:56 kubelet-server-2018-08-22-11-56-55.pem
-rw-------. 1 root root 1171 ago 22 11:57 kubelet-client-2018-08-22-11-57-19.pem
-rw-------. 1 root root 1293 ago 22 12:08 kubelet-server-2018-08-22-12-08-52.pem
-rw-------. 1 root root 1171 ago 22 12:09 kubelet-client-2018-08-22-12-09-22.pem
-rw-------. 1 root root 1293 ago 22 12:18 kubelet-server-2018-08-22-12-18-23.pem
-rw-------. 1 root root 1171 ago 22 12:19 kubelet-client-2018-08-22-12-19-58.pem
lrwxrwxrwx. 1 root root   68 ago 22 12:29 kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2018-08-22-12-29-40.pem
-rw-------. 1 root root 1171 ago 22 12:29 kubelet-client-2018-08-22-12-29-40.pem
lrwxrwxrwx. 1 root root   68 ago 22 12:30 kubelet-server-current.pem -> /etc/origin/node/certificates/kubelet-server-2018-08-22-12-30-22.pem
-rw-------. 1 root root 1293 ago 22 12:30 kubelet-server-2018-08-22-12-30-22.pem
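
The rotation cadence can be read off the timestamps embedded in the file names above. The following is a minimal sketch in Python, assuming the naming pattern kubelet-<kind>-YYYY-MM-DD-HH-MM-SS.pem seen in the listing; the function name `rotation_gaps` is hypothetical and not part of any OpenShift tooling.

```python
import re
from datetime import datetime

# Timestamp embedded in the certificate file names shown in the listing.
_NAME_RE = re.compile(
    r"kubelet-(client|server)-(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.pem"
)


def rotation_gaps(names, kind="client"):
    """Return the seconds elapsed between consecutive certificates of one kind."""
    stamps = set()  # a set, so a symlink target does not count twice
    for name in names:
        match = _NAME_RE.search(name)
        if match and match.group(1) == kind:
            stamps.add(datetime.strptime(match.group(2), "%Y-%m-%d-%H-%M-%S"))
    ordered = sorted(stamps)
    return [int((b - a).total_seconds()) for a, b in zip(ordered, ordered[1:])]


# First three client certificates from the listing above.
names = [
    "kubelet-client-2018-08-22-10-29-27.pem",
    "kubelet-client-2018-08-22-10-38-55.pem",
    "kubelet-client-2018-08-22-10-51-38.pem",
]

print(rotation_gaps(names))  # [568, 763]
```

For these three files the gaps are 568 and 763 seconds, that is, roughly 9.5 and 12.7 minutes between rotations.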

When the last valid certificate (kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2018-08-22-12-29-40.pem) expires:
            Not Before: Aug 22 16:25:00 2018 GMT
            Not After : Aug 22 16:45:00 2018 GMT

the atomic-openshift-node service restart loop starts. Note that the certificate validity period is 20 minutes.
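
The 20-minute figure follows directly from the two dates quoted above. The following is a minimal sketch of the arithmetic in Python, assuming the date format shown in the certificate dump; the function name `validity_minutes` is hypothetical.

```python
from datetime import datetime

# Format of the "Not Before" / "Not After" fields as quoted in this report.
_FMT = "%b %d %H:%M:%S %Y GMT"


def validity_minutes(not_before, not_after):
    """Return the length of the certificate validity window in minutes."""
    start = datetime.strptime(not_before, _FMT)
    end = datetime.strptime(not_after, _FMT)
    return (end - start).total_seconds() / 60


print(validity_minutes("Aug 22 16:25:00 2018 GMT", "Aug 22 16:45:00 2018 GMT"))  # 20.0
```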

This happens to the App and Infra nodes in the same way.

All the CSRs are in Pending status (see master-csr.txt).

The inventory is in the attached file (see OSEv3.yml).


Expected results:
All nodes are in Ready status, and client certificates are rotated when necessary.

Comment 1 Scott Dodson 2018-08-27 13:24:25 UTC
This appears to be a problem that's occurring after certificate approval and deployment. Moving to Pod team to triage further.