Description of problem:
If machinesets are altered to scale up the number of nodes on a cluster, each resulting AWS instance creates a CSR and expects it to be approved/issued. If that CSR is not approved in a timely manner, additional CSRs appear to be created.

Version-Release number of selected component (if applicable):
v4.1.0

How reproducible:
100%

Steps to Reproduce:
1. Disable bootstrap CSR approval
2. Scale up machinesets on the cluster

Actual results:
New CSRs are created every ~10 minutes for each machine.

Expected results:
No new CSRs should be created. As I understand it, the autoapprover is not going to approve the new CSR (since it requires close correlation between the CSR and machine creation timestamps). This behavior can cause an unlimited number of CSRs to be created if the autoapprover is shut down for any reason.

Additional info:
In one scale-up scenario, I requested > 50 new nodes. Rapidly, 100 CSRs were created and this caused the autoapprover to believe it was being fuzzed, so it stopped approving any CSRs. This led to ~50 nodes creating CSRs periodically. Thousands were created before this condition was detected.
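To make the failure mode concrete, here is a minimal Go sketch of a pending-CSR cap of the kind described above. It is not the actual cluster-machine-approver code: the `csr` type, `countPending`, `shouldApprove`, and the `maxPendingCSRs` value of 100 are all assumptions for illustration, with the cap taken only from the observation that approval stopped at around 100 CSRs.

```go
package main

import "fmt"

// maxPendingCSRs is a hypothetical cap. The report only observes that
// approval stopped once roughly 100 CSRs existed.
const maxPendingCSRs = 100

// csr is a simplified stand-in for a certificate signing request.
type csr struct {
	Name     string
	Approved bool
}

// countPending returns how many CSRs have not been approved yet.
func countPending(csrs []csr) int {
	pending := 0
	for _, c := range csrs {
		if !c.Approved {
			pending++
		}
	}
	return pending
}

// shouldApprove reports whether an approver with this kind of guard would
// keep approving: it refuses once the pending count exceeds the limit.
func shouldApprove(csrs []csr, limit int) bool {
	return countPending(csrs) <= limit
}

func main() {
	// 50 new machines, each having re-issued its CSR three times
	// because none of the earlier requests were approved.
	var csrs []csr
	for machine := 0; machine < 50; machine++ {
		for attempt := 0; attempt < 3; attempt++ {
			csrs = append(csrs, csr{Name: fmt.Sprintf("csr-%d-%d", machine, attempt)})
		}
	}
	fmt.Println("pending:", countPending(csrs))
	fmt.Println("approve:", shouldApprove(csrs, maxPendingCSRs))
}
```

Under these assumptions, the periodic re-issuing alone pushes 50 machines past the cap, after which nothing is approved and the backlog only grows.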
Either the node has to be modified to generate new CSRs properly, or the CSR approver needs to react properly (e.g. by deleting new requests). Assigning to the RHCOS team to decide whether this is a node issue or not.
I don't think that RHCOS is the cause here. I could be wrong, but I believe it's much more likely to be within the code generating and/or approving CSRs. The openshift installer does create a unit called `approve-csr.service` (https://github.com/openshift/installer/blob/master/data/data/bootstrap/systemd/units/approve-csr.service) which executes `/usr/local/bin/approve-csr.sh` (https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/approve-csr.sh). Moving to the installer team for further review.
I've been told that cluster-machine-approver falls under the cloud team, so moving.
Justin Pierce, can we get an answer to "did this happen because bootstrap CSR approval was disabled?", along with the machine approver logs? Thanks!
This did not happen because the CSR approver was disabled. I offered that as a method to reproduce the issue easily. Reproducing it normally requires hitting an AWS limit or scaling machines extremely rapidly.
https://github.com/openshift/cluster-machine-approver/pull/33
There are a couple of issues at play here.

I believe that this bug as originally filed was due to the time-restriction element of CSR approval, as evidenced by comment #8. If a machine object is created, but the machine-controller fails to provision an actual VM due to API quota or another temporary condition, the CSRs will never be approved. This has been (partially) addressed by moving the time limit to 2 hours instead of 10 minutes: https://github.com/openshift/cluster-machine-approver/pull/37

Long term, we should try to capture network-address creation time. This allows for a long window between machine creation and CSR approval, but a very tight window between actual instance provisioning and CSR approval, and would support a variety of use cases.

The other issue is too many machines being scaled at once. A potential fix was referenced here: https://github.com/openshift/cluster-machine-approver/pull/33 Unfortunately, we didn't come to a consensus on how best to fix the issue.

In any case, I don't believe this issue is a release blocker, as workarounds exist and this is not particular to 4.2.
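The time-restriction element can be illustrated with a small Go sketch. This is not the approver's actual code: `withinWindow`, the example timestamp, and the 45-minute provisioning delay are assumptions made for illustration, and rejecting a CSR that predates its machine is also an assumption. Only the 10-minute and 2-hour windows come from this thread.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Windows quoted in this thread: 10 minutes before the change, 2 hours after.
	oldWindow = 10 * time.Minute
	newWindow = 2 * time.Hour
	// provisionDelay is a made-up example: the VM comes up 45 minutes after
	// the machine object was created, e.g. because of a temporary API quota.
	provisionDelay = 45 * time.Minute
)

// machineCreated is an arbitrary example creation time for the machine object.
var machineCreated = time.Date(2019, time.June, 1, 12, 0, 0, 0, time.UTC)

// withinWindow reports whether the CSR was created no earlier than the
// machine object and no later than window after it.
func withinWindow(machineAt, csrAt time.Time, window time.Duration) bool {
	delta := csrAt.Sub(machineAt)
	return delta >= 0 && delta <= window
}

func main() {
	// The node can only send its CSR once the VM actually exists.
	csrAt := machineCreated.Add(provisionDelay)
	fmt.Println("10m window:", withinWindow(machineCreated, csrAt, oldWindow)) // 10m window: false
	fmt.Println("2h window:", withinWindow(machineCreated, csrAt, newWindow))  // 2h window: true
}
```

A check keyed on network-address creation time, as suggested above, would pass the provisioning time instead of the machine creation time as the first argument and could then use a much tighter window.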
We think this is fixed in the 4.2 release. I've cloned it for the 4.1.z stream: https://bugzilla.redhat.com/show_bug.cgi?id=1746513