1717602 – incoming machines make csrs indefinitely if original is not approved

Summary: incoming machines make csrs indefinitely if original is not approved

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Michael Gugino
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1746513

Reported:	2019-06-05 19:12 UTC by Justin Pierce
Modified:	2019-08-28 16:18 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1746513
Environment:
Last Closed:	2019-08-28 16:18:26 UTC
Target Upstream Version:
Embargoed:

Description Justin Pierce 2019-06-05 19:12:34 UTC

Description of problem:
If machinesets are altered to scale up the number of nodes on a cluster, each resulting AWS instance creates a CSR and expects it to be approved/issued. If that CSR is not approved in a timely manner, it appears that additional CSRs are created.

Version-Release number of selected component (if applicable):
v4.1.0

How reproducible:
100%

Steps to Reproduce:
1. Disable bootstrap CSR approval
2. Scale up machinesets on the cluster

Actual results:
New CSRs are created every ~10 minutes for each machine.

Expected results:
No new CSRs should be created. As I understand it, the autoapprover is not going to approve the new CSR (since it requires close correlation between the CSR and machine creation timestamps). This behavior can cause an unlimited number of CSRs to be created if the autoapprover is shut down for any reason.
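
A rough model of the growth described above, as a Python sketch. It assumes each unapproved machine files one CSR at boot and one more per retry interval, and that nothing approves or deletes them. The ~10 minute interval is the one observed in this report; the function name and parameters are made up for illustration.

```python
def total_csrs(machines: int, elapsed_minutes: int, retry_minutes: int = 10) -> int:
    """Estimate CSRs created by machines whose requests are never approved.

    Assumes one initial CSR per machine plus one per completed retry interval.
    """
    retries = elapsed_minutes // retry_minutes
    return machines * (1 + retries)


if __name__ == "__main__":
    # 50 stuck machines, sampled after 1, 6 and 24 hours.
    for hours in (1, 6, 24):
        print(hours, "h:", total_csrs(50, hours * 60))
```

Under these assumptions, 50 stuck machines reach 1850 CSRs after six hours, which is in line with the "thousands" mentioned under Additional info.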

Additional info:
In one scale-up scenario, I requested more than 50 new nodes. 100 CSRs were quickly created, which caused the autoapprover to conclude it was being fuzzed, so it stopped approving any CSRs. This left ~50 nodes creating CSRs periodically. Thousands were created before this condition was detected.
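
The lock-up described above can be sketched as a simple flood guard. This is an illustration only, not the cluster-machine-approver code: the threshold value, the `>` comparison, and the names `MAX_PENDING_CSRS` and `approve_batch` are assumptions.

```python
from typing import List

MAX_PENDING_CSRS = 100  # hypothetical threshold, chosen to match the numbers in this report


def approve_batch(pending: List[str], max_pending: int = MAX_PENDING_CSRS) -> List[str]:
    """Return the CSR names that would be approved in one pass.

    If too many CSRs are pending, the guard assumes abuse and approves nothing.
    """
    if len(pending) > max_pending:
        return []
    return list(pending)


if __name__ == "__main__":
    flood = [f"csr-{i}" for i in range(120)]
    print(len(approve_batch(flood)))      # guard trips, nothing approved
    print(len(approve_batch(flood[:10]))) # small batch is approved
```

Because the nodes keep retrying while nothing is approved, the pending count only grows once the guard trips, so a guard like this cannot recover unless the stale CSRs are removed.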

Comment 4 Jan Chaloupka 2019-07-03 12:22:31 UTC

Either the node has to be modified to generate new CSRs properly, or the CSR approver needs to react properly (e.g. by deleting new requests).

Assigning to the RHCOS team to decide whether this is a node issue.

Comment 5 Steve Milner 2019-07-03 16:12:46 UTC

I don't think that RHCOS is the cause here. I could be wrong, but I believe it's much more likely to be within the code generating and/or approving CSRs. The OpenShift installer does create a unit called `approve-csr.service` (https://github.com/openshift/installer/blob/master/data/data/bootstrap/systemd/units/approve-csr.service) which executes `/usr/local/bin/approve-csr.sh` (https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/approve-csr.sh).

Moving to the installer team for further review.

Comment 6 Abhinav Dahiya 2019-07-03 19:53:55 UTC

I've been told cluster-machine-approver is under the cloud team, so moving.

Comment 7 Alberto 2019-07-05 07:38:13 UTC

Justin Pierce, can we get an answer to "did this happen because bootstrap CSR approval was disabled?" and the machine approval logs? Thanks!

Comment 8 Justin Pierce 2019-07-05 14:55:58 UTC

This did not happen because the CSR approver was disabled. I offered that as a method to reproduce the issue easily. Reproducing it normally requires hitting an AWS limit or scaling machines extremely rapidly.

Comment 9 Alberto 2019-07-26 13:23:09 UTC

https://github.com/openshift/cluster-machine-approver/pull/33

Comment 10 Michael Gugino 2019-08-27 19:26:37 UTC

There are a couple of issues at play here.

I believe that this bug as originally filed was due to the time-restriction element of CSR approval, as evidenced by comment #8. If a machine object is created, but the machine-controller fails to provision an actual VM due to API quota or another temporary condition, the CSRs will never be approved. This has been (partially) addressed here by moving the time limit to 2 hours instead of 10 minutes: https://github.com/openshift/cluster-machine-approver/pull/37
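
To make the effect of the limit change concrete, here is a minimal Python sketch of a time-window check. It assumes the approver simply compares the CSR creation time against the machine creation time; the real check may differ, and `within_window`, `OLD_WINDOW` and `NEW_WINDOW` are names made up for this example.

```python
from datetime import datetime, timedelta

OLD_WINDOW = timedelta(minutes=10)  # limit before the change described above
NEW_WINDOW = timedelta(hours=2)     # limit after the change


def within_window(machine_created: datetime, csr_created: datetime,
                  window: timedelta) -> bool:
    """True if the CSR was created no later than `window` after the machine object."""
    age = csr_created - machine_created
    return timedelta(0) <= age <= window


if __name__ == "__main__":
    machine_created = datetime(2019, 6, 5, 12, 0)
    # Assume provisioning was delayed by a quota error, so the CSR arrives 45 minutes late.
    csr_created = machine_created + timedelta(minutes=45)
    print("10 minute limit:", within_window(machine_created, csr_created, OLD_WINDOW))
    print("2 hour limit:   ", within_window(machine_created, csr_created, NEW_WINDOW))
```

A longer window helps with delayed provisioning, but a machine whose VM shows up after the window has passed is still never approved, which is why the fix is described as partial.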

Long term, we should try to capture network-address creation time.  This allows for a long window between machine creation and CSR approval, but a very tight window between actual instance provisioning and CSR approval, and would support a variety of use-cases.
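
The two-window idea could look roughly like the following Python sketch. Nothing here exists in cluster-machine-approver as far as this bug shows: the `address_assigned` timestamp, both default window sizes, and the function name are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_approve(machine_created: datetime,
                   address_assigned: Optional[datetime],
                   csr_created: datetime,
                   machine_window: timedelta = timedelta(days=1),       # hypothetical long window
                   address_window: timedelta = timedelta(minutes=10)   # hypothetical tight window
                   ) -> bool:
    """Approve only if the CSR is close to provisioning and not too far from machine creation."""
    if address_assigned is None:
        return False  # no network address recorded yet, so the instance is not provisioned
    since_machine = csr_created - machine_created
    since_address = csr_created - address_assigned
    return (timedelta(0) <= since_machine <= machine_window
            and timedelta(0) <= since_address <= address_window)
```

With these defaults, a machine whose instance was provisioned five hours after the machine object was created would still be approved, provided its CSR arrives within ten minutes of the address being assigned.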

The other issue is too many machines being scaled at once; a potential fix was proposed here: https://github.com/openshift/cluster-machine-approver/pull/33

Unfortunately, we didn't come to a consensus on how best to fix the issue.

In any case, I don't believe this issue is a release blocker, as workarounds exist and this is not particular to 4.2.

Comment 11 Brad Ison 2019-08-28 16:11:16 UTC

We think this is fixed in the 4.2 release. I've cloned it for the 4.1.z stream:
https://bugzilla.redhat.com/show_bug.cgi?id=1746513
