Bug 1717602

Summary: incoming machines make csrs indefinitely if original is not approved
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: Cloud Compute Assignee: Michael Gugino <mgugino>
Status: CLOSED CURRENTRELEASE QA Contact: Jianwei Hou <jhou>
Severity: unspecified Docs Contact:
Priority: high    
Version: 4.1.0 CC: agarcial, bbreard, brad.ison, brad.williams, dustymabe, imcleod, jligon, nstielau
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1746513 (view as bug list) Environment:
Last Closed: 2019-08-28 16:18:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1746513    

Description Justin Pierce 2019-06-05 19:12:34 UTC
Description of problem:
If machinesets are altered to scale up the number of nodes on a cluster, the resulting AWS instances each create a CSR and expect it to be approved/issued. If that CSR is not approved in a timely manner, it appears that additional CSRs are created.

Version-Release number of selected component (if applicable):
v4.1.0

How reproducible:
100%

Steps to Reproduce:
1. Disable bootstrap csr approval
2. Scale up machinesets on the cluster

Actual results:
Observe that new CSRs are created every ~10 minutes for each machine.

Expected results:
No new CSRs should be created. As I understand it, the autoapprover is not going to approve the new CSR (since it requires close correlation between the CSR and machine creation timestamps). This behavior can cause an unlimited number of CSRs to be created if the autoapprover is shut down for any reason.

Additional info:
In one scale-up scenario, I requested > 50 new nodes. Rapidly, 100 CSRs were created and this caused the autoapprover to believe it was being fuzzed, so it stopped approving any CSRs. This led to ~50 nodes creating CSRs periodically. Thousands were created before this condition was detected.
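
To make the arithmetic of this runaway concrete, here is a minimal sketch. It assumes the approver simply stops acting once the pending count reaches a fixed threshold (100, per the observation above) and that every unapproved machine files a new CSR about every 10 minutes. The function and constant names are hypothetical and are not taken from cluster-machine-approver.

```python
# Hypothetical model of the runaway described above; not code from
# cluster-machine-approver. All names and the fixed threshold are assumptions.
MAX_PENDING_CSRS = 100   # pending-CSR count at which approval stopped, per this report
RETRY_MINUTES = 10       # approximate interval between new CSRs per machine


def approver_is_active(pending_csrs, max_pending=MAX_PENDING_CSRS):
    """Return True while the pending count is below the assumed threshold."""
    return pending_csrs < max_pending


def pending_after(minutes, machines, retry_minutes=RETRY_MINUTES):
    """Pending CSRs after `minutes` if nothing is approved.

    Each machine files one CSR immediately and another every retry interval.
    """
    rounds = 1 + minutes // retry_minutes
    return machines * rounds
```

Under these assumptions, 50 machines reach the threshold after a single retry (`pending_after(10, 50)` is 100), and `pending_after(24 * 60, 50)` is 7250, which is in line with the thousands of CSRs observed.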

Comment 4 Jan Chaloupka 2019-07-03 12:22:31 UTC
Either the node has to be modified to generate new CSRs properly, or the CSR approver needs to react properly (e.g. by deleting new requests).

Assigning to the RHCOS team to decide whether this is a node issue or not.

Comment 5 Steve Milner 2019-07-03 16:12:46 UTC
I don't think that RHCOS is the cause here. I could be wrong, but I believe it's much more likely to be within the code generating and/or approving CSRs. The openshift installer does create a unit called `approve-csr.service` (https://github.com/openshift/installer/blob/master/data/data/bootstrap/systemd/units/approve-csr.service) which executes `/usr/local/bin/approve-csr.sh` (https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/approve-csr.sh).

Moving to the installer team for further review.

Comment 6 Abhinav Dahiya 2019-07-03 19:53:55 UTC
I've been told cluster-machine-approver is under the cloud team, so moving.

Comment 7 Alberto 2019-07-05 07:38:13 UTC
Justin Pierce, can we get an answer to "did this happen because bootstrap CSR approval was disabled?" and the machine approver logs? Thanks!

Comment 8 Justin Pierce 2019-07-05 14:55:58 UTC
This did not happen because the CSR approver was disabled. I offered that as a method to reproduce the issue easily. Reproducing it normally requires hitting an AWS limit or scaling machines extremely rapidly.

Comment 10 Michael Gugino 2019-08-27 19:26:37 UTC
There are a couple of issues at play here.

I believe that this bug as originally filed was due to the time-restriction element of CSR approval, as evidenced by comment #8. If a machine object is created, but the machine-controller fails to provision an actual VM due to API quota or another temporary condition, the CSRs will never be approved. This has been (partially) addressed here by moving the time limit to 2 hours instead of 10 minutes: https://github.com/openshift/cluster-machine-approver/pull/37
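
A minimal sketch of the time restriction, for illustration only: it assumes the check compares the CSR creation time against the machine object creation time and rejects anything outside a fixed window. The names are hypothetical and the real approver may also allow for clock skew or apply other checks.

```python
from datetime import datetime, timedelta, timezone

# Window values come from this comment; the names are hypothetical.
OLD_WINDOW = timedelta(minutes=10)
NEW_WINDOW = timedelta(hours=2)


def within_approval_window(machine_created, csr_created, max_delta):
    """True if the CSR was created no more than max_delta after the machine object."""
    delta = csr_created - machine_created
    return timedelta(0) <= delta <= max_delta


# Example: the VM is provisioned 45 minutes late because of an API quota.
machine_created = datetime(2019, 6, 5, 19, 0, tzinfo=timezone.utc)
csr_created = machine_created + timedelta(minutes=45)
late_ok_old = within_approval_window(machine_created, csr_created, OLD_WINDOW)
late_ok_new = within_approval_window(machine_created, csr_created, NEW_WINDOW)
```

In this example the late CSR falls outside the 10 minute window but inside the 2 hour one, which is why the longer limit only partially addresses the problem: a delay beyond 2 hours would still leave the CSRs unapproved.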

Long term, we should try to capture network-address creation time.  This allows for a long window between machine creation and CSR approval, but a very tight window between actual instance provisioning and CSR approval, and would support a variety of use-cases.
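
As a rough sketch of that idea, assuming the network-address creation time were available to the approver: the window durations and all names below are hypothetical placeholders, not values that were proposed or implemented.

```python
from datetime import datetime, timedelta, timezone

# Both durations are illustrative assumptions, not agreed values.
MACHINE_WINDOW = timedelta(hours=24)    # long window since machine object creation
ADDRESS_WINDOW = timedelta(minutes=10)  # tight window since the address appeared


def csr_in_windows(machine_created, address_created, csr_created,
                   machine_window=MACHINE_WINDOW, address_window=ADDRESS_WINDOW):
    """True if the CSR is inside both the long and the tight window."""
    since_machine = csr_created - machine_created
    since_address = csr_created - address_created
    return (timedelta(0) <= since_machine <= machine_window
            and timedelta(0) <= since_address <= address_window)


# Example: provisioning is delayed 6 hours, then the node asks for a cert promptly.
t0 = datetime(2019, 6, 5, 19, 0, tzinfo=timezone.utc)
address_at = t0 + timedelta(hours=6)
prompt_csr = csr_in_windows(t0, address_at, address_at + timedelta(minutes=2))
stale_csr = csr_in_windows(t0, address_at, address_at + timedelta(hours=3))
```

Here the prompt CSR passes even though the machine object is 6 hours old, while a CSR arriving 3 hours after the address was assigned does not.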

The other issue is too many machines being scaled at once; a potential fix was referenced here: https://github.com/openshift/cluster-machine-approver/pull/33

Unfortunately, we didn't come to a consensus on how best to fix the issue.

In any case, I don't believe this issue is a release blocker, as workarounds exist and this is not particular to 4.2.

Comment 11 Brad Ison 2019-08-28 16:11:16 UTC
We think this is fixed in the 4.2 release. I've cloned it for the 4.1.z stream:
https://bugzilla.redhat.com/show_bug.cgi?id=1746513