Description of problem:
During large cluster scale-up operations, it is possible for more than 100 CSRs to be created by incoming machines. If this occurs, the autoapprover decides it is being fuzzed in some way and refuses to approve any CSRs thereafter. There is also no alert in Alertmanager indicating that the approval thresholds have been violated.

Version-Release number of selected component (if applicable):
v4.1.0

How reproducible:
Reproducible; the approver stops approving by design.

Steps to Reproduce:
1. Create > 100 CSRs

Actual results:
The autoapprover pod log indicates that approval has been stopped.

Expected results:
- The tolerances for autoapproval should be increased to allow large clusters to be created. A reasonable upper bound might be ~250 nodes.
- Cloud providers may require time to satisfy instance requests due to availability issues. For example, if an AWS region is out of m4.2xlarge instances, it may take hours for a machine-api Machine to provision a corresponding instance and produce its original CSR. Since the approver checks the Machine creation timestamp against the CSR creation timestamp to make sure there is tight correlation, the CSR will never be approved. Ideally, the approver should check whether the CSR timestamp is tightly correlated with the MachineCreated condition of the machine-api Machine (a rough illustrative sketch of this follows below).
- If the approver shuts itself down, a Prometheus alert should be activated to alert operations to the decision.

Additional info:
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1717602

As described above, in one scale-up scenario the existing tolerances shut down the autoapprover, and the behavior of BZ #1717602 caused thousands of Pending CSRs to be created before the issue was detected and manually resolved.
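For reference, a minimal Go sketch of the timestamp-correlation problem described in the second expected result. This is not the actual cluster-machine-approver code: the type names, the ten-minute tolerance, and the instance-ready timestamp are hypothetical stand-ins chosen to match the log excerpt quoted later in this bug.

package main

import (
	"fmt"
	"time"
)

// csrInfo and machineInfo stand in for the real CertificateSigningRequest and
// machine-api Machine objects; only the timestamps relevant to the check are modeled.
type csrInfo struct {
	name    string
	created time.Time
}

type machineInfo struct {
	created       time.Time // Machine object creation timestamp
	instanceReady time.Time // hypothetical: when a MachineCreated-style condition became true
}

// approvalWindow is an assumed tolerance between the reference time and the CSR
// arriving from the booted node (the log shows a window of roughly ten minutes).
const approvalWindow = 10 * time.Minute

// inWindow reports whether the CSR was created within approvalWindow of the
// given reference time.
func inWindow(csr csrInfo, ref time.Time) bool {
	return !csr.created.Before(ref) && csr.created.Before(ref.Add(approvalWindow))
}

func main() {
	// Timestamps loosely based on the log excerpt: the Machine was created around
	// 14:15, but the instance (and therefore the CSR) only appeared at 14:37.
	// The instance-ready time is invented for illustration.
	machine := machineInfo{
		created:       time.Date(2019, 7, 5, 14, 15, 7, 0, time.UTC),
		instanceReady: time.Date(2019, 7, 5, 14, 35, 0, 0, time.UTC),
	}
	csr := csrInfo{name: "csr-4j59g", created: time.Date(2019, 7, 5, 14, 37, 1, 0, time.UTC)}

	// Described behavior: the window is anchored on Machine creation, so a slow
	// cloud provider pushes the CSR outside the window and it is never approved.
	fmt.Printf("%s approved with machine-creation anchor: %v\n", csr.name, inWindow(csr, machine.created))

	// Suggested behavior: anchor the window on when the instance actually came up
	// (the MachineCreated condition), so provisioning delays do not matter.
	fmt.Printf("%s approved with instance-ready anchor: %v\n", csr.name, inWindow(csr, machine.instanceReady))
}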
Do we have logs of the approver?
Justin Pierce, can we get an answer to "did this happen because they had disabled bootstrap CSR approval?", along with the machine approver logs? Thanks!
@Alberto - CSR approval was not disabled at any time. Disabling CSR approval in https://bugzilla.redhat.com/show_bug.cgi?id=1717602 was only mentioned as a means to reproduce that specific problem easily. The behaviors described in this BZ are by design in the approval code itself and are reproducible. Those behaviors are inconsistent with [1] rapid scale-ups, which can create > 100 pending CSRs, and [2] slow scale-ups, where a cloud provider may not be able to supply an instance in a timely manner, so the Machine timestamp and the CSR timestamp end up significantly different. Both issues can be identified in this log snippet:

I0705 14:37:01.863893 1 main.go:164] Error syncing csr csr-4j59g: CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
I0705 14:37:01.944105 1 main.go:107] CSR csr-4j59g added
I0705 14:37:01.985857 1 main.go:132] CSR csr-4j59g not authorized: CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
E0705 14:37:01.985885 1 main.go:174] CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
I0705 14:37:01.985894 1 main.go:175] Dropping CSR "csr-4j59g" out of the queue: CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
I0705 14:37:08.135062 1 main.go:107] CSR csr-qmnck added
I0705 14:37:08.135138 1 main.go:115] ignoring all CSRs as too many recent pending CSRs seen: 101
I0705 14:37:11.790037 1 main.go:107] CSR csr-tcxpt added
I0705 14:37:11.790112 1 main.go:115] ignoring all CSRs as too many recent pending CSRs seen: 102
I0705 14:37:52.393836 1 main.go:107] CSR csr-gk2ql added
I0705 14:37:52.393911 1 main.go:115] ignoring all CSRs as too many recent pending CSRs seen: 103
I0705 14:37:59.393294 1 main.go:107] CSR csr-z9qgx added
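To make the first failure mode concrete, here is a minimal Go sketch, assuming a hard-coded cap of 100, of the kind of guard that produces the "ignoring all CSRs as too many recent pending CSRs seen" messages above, together with one possible way of scaling the cap with the machine count. Names, numbers, and the scaling rule are illustrative only; this is not the actual approver code or the fix in the linked PRs.

package main

import "fmt"

// maxPendingCSRs mirrors the fixed threshold implied by the log above: once
// more than this many pending CSRs are seen, the approver ignores everything.
const maxPendingCSRs = 100

// tooManyPending is the fixed-cap guard: during a large scale-up it trips and
// every subsequent CSR is ignored, by design.
func tooManyPending(pending int) bool {
	return pending > maxPendingCSRs
}

// tooManyPendingScaled is one possible adjustment: derive the cap from the
// number of Machines the cluster is actually asking for (each node requests
// both a client and a serving certificate, hence the illustrative factor of
// two), keeping the old value as a floor.
func tooManyPendingScaled(pending, machines int) bool {
	limit := 2 * machines
	if limit < maxPendingCSRs {
		limit = maxPendingCSRs
	}
	return pending > limit
}

func main() {
	// Scale-up from this report: well over 100 pending CSRs, ~250 desired nodes.
	pending, machines := 103, 250

	fmt.Println("fixed cap trips:", tooManyPending(pending))                  // true: approvals stop
	fmt.Println("scaled cap trips:", tooManyPendingScaled(pending, machines)) // false: approvals continue
}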
Created attachment 1587735: approver logs
https://github.com/openshift/cluster-machine-approver/pull/33
Reopened the PR in https://github.com/openshift/cluster-machine-approver/pull/43
Verified in 4.2.0-0.nightly-2019-09-08-180038. Scaled over 100 nodes and did not see the problem again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922