Bug 1738568

Summary: Rotating node serving CSR did not get auto-approved by operator.
Product: OpenShift Container Platform
Reporter: Muhammad Aizuddin Zali <mzali>
Component: Cloud Compute
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED DUPLICATE
QA Contact: Jianwei Hou <jhou>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.1.z
CC: agarcial, aos-bugs, brad.ison, clasohm, maszulik, mfojtik, mzali, nagrawal, rsandu, rsawhill
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-27 10:38:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Muhammad Aizuddin Zali 2019-08-07 13:34:51 UTC
Description of problem:

After a fresh installation of 4.1.8, I left the cluster running for 25 hours and then checked node CSR approval. It appears the CSRs were not auto-approved by the controller.



Version-Release number of selected component (if applicable):
4.1.8 UPI on Libvirt baremetal.

How reproducible:
Install 4.1.8, leave it running for 25 hours, then check the CSRs and the node's journalctl output.

CSR:
I have more than 100 Pending CSRs.

Journalctl:

[root@worker02 ~]# journalctl  | grep expire
Aug 06 13:10:43 localhost.localdomain NetworkManager[873]: <info>  [1565097043.0931] dhcp4 (enp1s0):   expires in 582386604 seconds
Aug 06 13:15:53 worker02 systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
Aug 07 07:00:11 localhost.localdomain NetworkManager[857]: <info>  [1565161211.8523] dhcp4 (enp1s0):   expires in 582322436 seconds
Aug 07 13:00:20 worker02 hyperkube[870]: I0807 13:00:20.154747     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:00:21 worker02 hyperkube[870]: I0807 13:00:21.850513     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:00:22 worker02 hyperkube[870]: I0807 13:00:22.418064     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:00:22 worker02 hyperkube[870]: I0807 13:00:22.816861     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:00:23 worker02 hyperkube[870]: I0807 13:00:23.926010     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:00:25 worker02 hyperkube[870]: I0807 13:00:25.077777     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:00:38 worker02 hyperkube[870]: I0807 13:00:38.196123     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:04 worker02 hyperkube[870]: I0807 13:02:04.828882     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:09 worker02 hyperkube[870]: I0807 13:02:09.321297     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:10 worker02 hyperkube[870]: I0807 13:02:10.468860     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:11 worker02 hyperkube[870]: I0807 13:02:11.669977     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:12 worker02 hyperkube[870]: I0807 13:02:12.116674     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:12 worker02 hyperkube[870]: I0807 13:02:12.299825     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:12 worker02 hyperkube[870]: I0807 13:02:12.449696     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:12 worker02 hyperkube[870]: I0807 13:02:12.614234     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:02:12 worker02 hyperkube[870]: I0807 13:02:12.768592     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:03:26 worker02 hyperkube[870]: I0807 13:03:26.020208     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:03:26 worker02 hyperkube[870]: I0807 13:03:26.035054     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:03:42 worker02 hyperkube[870]: I0807 13:03:42.506574     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:03:42 worker02 hyperkube[870]: I0807 13:03:42.521517     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:03:56 worker02 hyperkube[870]: I0807 13:03:56.020177     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:03:56 worker02 hyperkube[870]: I0807 13:03:56.035186     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:04:12 worker02 hyperkube[870]: I0807 13:04:12.506578     870 certificate_manager.go:213] Current certificate is expired.
Aug 07 13:04:12 worker02 hyperkube[870]: I0807 13:04:12.521522     870 certificate_manager.go:213] Current certificate is expired.
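
To pin down when the kubelet first reported the expired certificate, the journal excerpt above can be summarized with a short script. This is only a sketch: the function name `summarize_expiry` and the `year` parameter are illustrative (journal timestamps in this format omit the year), and the pattern assumes the exact message format shown in this excerpt.

```python
import re
from datetime import datetime

# Matches hyperkube lines as they appear in the excerpt above, for example:
# "Aug 07 13:00:20 worker02 hyperkube[870]: I0807 ... Current certificate is expired."
EXPIRED = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"hyperkube\[\d+\]: .*Current certificate is expired\."
)


def summarize_expiry(lines, year=2019):
    """Return (host, first_seen, count) for expiry messages, or None if there are none.

    `year` is an assumption supplied by the caller because the journal
    timestamp format used here does not include it.
    """
    hits = []
    for line in lines:
        match = EXPIRED.match(line.strip())
        if match:
            # %b parses the English month abbreviation ("Aug") in the default locale.
            when = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
            hits.append((match["host"], when))
    if not hits:
        return None
    # The host is taken from the first matching line; the excerpt covers one node.
    return hits[0][0], min(when for _, when in hits), len(hits)
```

Feeding it the lines from `journalctl | grep expire` gives the host, the first time the message appeared, and how many times it was logged; unrelated lines such as the NetworkManager DHCP lease messages are ignored.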



Actual results:
The node CSRs were not auto-approved, which caused TLS errors once the bootstrapping certificate expired after 24 hours.


Expected results:
The first rotation CSR should be auto-approved by the controller, and the customer should not need to approve it manually.

Additional info:
'oc adm must-gather' output uploaded to dropbox.redhat.com. /incoming/must-gather.local.7787674465236119942.tar.gz


[root@worker02 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac44808ab4dd33b4a01f20102e2ab6af3fc649ef78c91a3bd8bd1e94e8bf072a
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190724.0 (2019-07-24T20:02:52Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)

Comment 1 Muhammad Aizuddin Zali 2019-08-07 13:40:02 UTC
Referring to this URL[1]: "After you approve the initial CSRs, the subsequent node client CSRs are automatically approved by the cluster kube-controller-manager."

I am not sure whether this statement applies only when adding a RHEL compute node to the cluster, but I was also unable to find any information stating that the CSR must be manually approved for the first rotation. (Or I might have missed or overlooked this in our docs.)





[1]: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/machine_management/adding-rhel-compute#installation-approve-csrs_adding-rhel-compute.

Comment 2 Muhammad Aizuddin Zali 2019-08-08 06:48:52 UTC
After going through our documentation[1] again and re-reading the lines below, it seems I was confusing the kubelet client certificate, which is auto-approved by the controller, with the node serving certificate, which is handled by the machine-approver. However, for a better experience, shouldn't this be auto-approved, since the node is already part of the cluster?


"3.1.2.4. Certificate signing requests management
Because your cluster has limited access to automatic machine management when you use infrastructure that you provision, you must provide a mechanism for approving cluster certificate signing requests (CSRs) after installation. The kube-controller-manager only approves the kubelet client CSRs. The machine-approver cannot guarantee the validity of a serving certificate that is requested by using kubelet credentials because it cannot confirm that the correct machine issued the request. You must determine and implement a method of verifying the validity of the kubelet serving certificate requests and approving them."



[1]:https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/installing/installing-on-bare-metal#installing-bare-metal

Comment 3 Ryan Sawhill 2019-08-08 19:23:29 UTC
> However for better experience shouldn't this auto approve since the node already part of the cluster?

Agreed. This is so dumb.

Comment 4 Muhammad Aizuddin Zali 2019-08-08 19:32:58 UTC
(In reply to Ryan Sawhill from comment #3)
> > However for better experience shouldn't this auto approve since the node already part of the cluster?
> 
> Agreed. This is so dumb.

I believe this is a fundamental feature and shouldn't be skipped, even in an MVP.

As a workaround, I had to create a cronjob that approves the serving certificate rotation requests from existing nodes and skips bootstrap node CSR approval requests[1].

[1]:https://github.com/aizuddin85/openshift4/tree/master/serving-cert-approver-workaround
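
The selection logic of such a workaround could look roughly like the sketch below. It is not the code from the linked repository; the function names are hypothetical, and it assumes the input is the parsed output of `oc get csr -o json` plus a set of node names already known to the cluster. It filters only on the requestor username and does not inspect the certificate request itself, so it does not replace the validity checks the documentation asks for, and it does not distinguish serving requests from client renewals that also come from `system:node:<name>`.

```python
NODE_PREFIX = "system:node:"


def serving_csrs_to_approve(csr_list, known_nodes):
    """Return names of pending CSRs requested by nodes already in the cluster.

    csr_list:    parsed JSON from `oc get csr -o json` (assumed input shape).
    known_nodes: set of node names the operator already trusts (assumption).
    """
    names = []
    for csr in csr_list.get("items", []):
        # A CSR with no status conditions has been neither approved nor denied.
        if csr.get("status", {}).get("conditions"):
            continue
        username = csr.get("spec", {}).get("username", "")
        # Bootstrap requests are submitted by a service account rather than by
        # "system:node:<name>", so they are skipped here.
        if not username.startswith(NODE_PREFIX):
            continue
        if username[len(NODE_PREFIX):] in known_nodes:
            names.append(csr["metadata"]["name"])
    return names


def approve_commands(names):
    """Build the `oc adm certificate approve` command line for each CSR name."""
    return [f"oc adm certificate approve {name}" for name in names]
```

A cronjob could run this periodically and execute the resulting commands; printing them first makes it easy to review what would be approved.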

Comment 5 Maciej Szulik 2019-08-19 14:00:33 UTC
Can you provide me with the full output from oc adm must-gather from your cluster?

Comment 6 Muhammad Aizuddin Zali 2019-08-19 14:31:19 UTC
(In reply to Maciej Szulik from comment #5)
> Can you provide me with the full output from oc adm must-gather from your
> cluster?
Due to size constraints, I have already uploaded it to our dropbox.

Additional info:
'oc adm must-gather' output uploaded to dropbox.redhat.com. /incoming/must-gather.local.7787674465236119942.tar.gz

Comment 9 Michal Fojtik 2019-08-26 12:45:57 UTC
The cloud team owns the auto-approver.

Comment 10 Jan Chaloupka 2019-08-26 12:48:15 UTC
Muhammad Aizuddin Zali, can you attach the must-gather tar file (/incoming/must-gather.local.7787674465236119942.tar.gz) to this issue?

Comment 12 Brad Ison 2019-08-27 10:38:12 UTC
This is a known, documented limitation on UPI installs. The cluster-machine-approver relies on data from the machine-api to authorize CSRs. When that data is not available, it doesn't perform the authorization. However, we're exploring ways of handling renewals without the need for the machine-api.
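
The dependency described above can be illustrated with a simplified sketch. This is not the cluster-machine-approver's actual code: the function name, the `machines` mapping, and the address comparison are all assumptions made for illustration. The only point it shows is that, without machine-api data for a node, there is nothing to check the request against, so the CSR stays pending.

```python
def authorize_serving_csr(node_name, requested_addresses, machines):
    """Approve only when machine-api data vouches for the requesting node.

    machines: hypothetical mapping of node name to the addresses recorded
              for it; on a UPI install it is assumed to be empty.
    """
    recorded = machines.get(node_name)
    if recorded is None:
        # No machine record to compare against: leave the CSR pending.
        return False
    # Approve only if every requested address is one the record knows about.
    return set(requested_addresses) <= set(recorded)
```

With an empty `machines` mapping, as on a UPI install, every request returns False, which matches the more than 100 Pending CSRs reported in the description.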

See: https://bugzilla.redhat.com/show_bug.cgi?id=1737611

*** This bug has been marked as a duplicate of bug 1737611 ***