Bug 1717610 - autoapproval tolerances too strict for some large scale ups
Summary: autoapproval tolerances too strict for some large scale ups
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Michael Gugino
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-06-05 19:44 UTC by Justin Pierce
Modified: 2019-10-16 06:31 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:31:10 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:31:26 UTC

Description Justin Pierce 2019-06-05 19:44:58 UTC
Description of problem:
During large cluster scale-up operations, incoming machines can create more than 100 CSRs. If this occurs, the autoapprover decides it is being fuzzed in some way and refuses to approve any CSR thereafter.

There is also no alert in Alertmanager indicating that the approval thresholds have been violated.

Version-Release number of selected component (if applicable):
v4.1.0

How reproducible:
approver stops approving by design

Steps to Reproduce:
1. Create > 100 CSRs

Actual results:
The autoapprover pod log indicates that approval has been stopped.

Expected results:
- The tolerances for autoapproval should be increased to allow for large clusters to be created. A reasonable upper bound might be ~250 nodes. 

- Cloud providers may need time to satisfy instance requests because of availability issues. For example, if an AWS region is out of m4.2xlarge instances, it may take hours for a machine-api machine to provision a corresponding instance and for that instance to make its original CSR. Because the approver checks the machine creation timestamp against the CSR creation timestamp to make sure they are tightly correlated, such a CSR will never be approved. Ideally, the approver should check whether the CSR timestamp is tightly correlated with the MachineCreated condition of the machine API.

- If the approver shuts itself down, a Prometheus alert should fire to notify operations of the decision.
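
The timestamp check described in the second item can be sketched roughly as follows. This is an illustration in Python, not the approver's actual code: the function name `csr_in_window`, the constants `CLOCK_SKEW` and `MAX_DELTA`, their values, and the `instance_created` timestamp are all assumptions made up for the example. It only shows why comparing against the machine creation time rejects a CSR after a slow provision, while comparing against the time the instance was actually created would not.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerances, chosen only for illustration.
CLOCK_SKEW = timedelta(seconds=10)
MAX_DELTA = timedelta(minutes=10)


def csr_in_window(csr_created, reference, skew=CLOCK_SKEW, max_delta=MAX_DELTA):
    """Return True if csr_created lies strictly inside
    (reference - skew, reference + max_delta)."""
    return reference - skew < csr_created < reference + max_delta


# Hypothetical timeline: the machine object is created at 12:00 UTC, but the
# cloud provider only supplies the instance three hours later.
machine_created = datetime(2019, 6, 5, 12, 0, 0, tzinfo=timezone.utc)
instance_created = machine_created + timedelta(hours=3)
csr_created = instance_created + timedelta(minutes=2)

# Check against the machine creation timestamp (the behavior reported here).
against_machine = csr_in_window(csr_created, machine_created)

# Check against the instance creation time (the behavior requested here).
against_instance = csr_in_window(csr_created, instance_created)

print(against_machine, against_instance)
```

With these made-up values the first check rejects the CSR and the second accepts it, which is the difference the request above is asking for.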

Additional info:
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1717602
As described above, in one scale-up scenario, the existing tolerances shut down the autoapprover, and the behavior of BZ #1717602 caused thousands of Pending CSRs to be created before the issue was detected and manually resolved.

Comment 4 Stefan Schimanski 2019-07-03 08:22:16 UTC
Do we have logs of the approver?

Comment 6 Alberto 2019-07-05 07:38:37 UTC
Justin Pierce, can we get an answer for "did this happen because they have Disabled bootstrap csr approval?" and machine approval logs? Thanks!

Comment 7 Justin Pierce 2019-07-05 14:51:39 UTC
@Alberto - CSR approval was not disabled at any time. Disabling CSR approval in https://bugzilla.redhat.com/show_bug.cgi?id=1717602 was only mentioned as a means to reproduce that specific problem easily. The behaviors described in this BZ are by design in the approval code itself and are reproducible. They are incompatible with [1] rapid scale-ups, which can create more than 100 pending CSRs, and [2] slow scale-ups, where a cloud provider may not be able to supply an instance in a timely manner, so the machine timestamp and the CSR timestamp differ significantly. Both issues can be identified in this log snippet:

I0705 14:37:01.863893       1 main.go:164] Error syncing csr csr-4j59g: CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
I0705 14:37:01.944105       1 main.go:107] CSR csr-4j59g added
I0705 14:37:01.985857       1 main.go:132] CSR csr-4j59g not authorized: CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
E0705 14:37:01.985885       1 main.go:174] CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
I0705 14:37:01.985894       1 main.go:175] Dropping CSR "csr-4j59g" out of the queue: CSR csr-4j59g creation time 2019-07-05 14:37:01 +0000 UTC not in range (2019-07-05 14:15:07 +0000 UTC, 2019-07-05 14:25:17 +0000 UTC)
I0705 14:37:08.135062       1 main.go:107] CSR csr-qmnck added
I0705 14:37:08.135138       1 main.go:115] ignoring all CSRs as too many recent pending CSRs seen: 101
I0705 14:37:11.790037       1 main.go:107] CSR csr-tcxpt added
I0705 14:37:11.790112       1 main.go:115] ignoring all CSRs as too many recent pending CSRs seen: 102
I0705 14:37:52.393836       1 main.go:107] CSR csr-gk2ql added
I0705 14:37:52.393911       1 main.go:115] ignoring all CSRs as too many recent pending CSRs seen: 103
I0705 14:37:59.393294       1 main.go:107] CSR csr-z9qgx added
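
The numbers in the snippet can be checked with a short Python sketch. The timestamps and the count of 101 are taken from the log lines above; the limit of 100 is inferred from the "seen: 101" message, and the names `MAX_PENDING_CSRS`, `ignore_all_csrs`, and `parse` are hypothetical, not identifiers from the approver. The 250 limit is just the upper bound suggested in the description.

```python
from datetime import datetime, timezone

# Limit inferred from the log: approval stops once 101 pending CSRs are seen.
MAX_PENDING_CSRS = 100


def ignore_all_csrs(recent_pending, limit=MAX_PENDING_CSRS):
    """Return True if the approver would ignore every CSR at this count."""
    return recent_pending > limit


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' log timestamp as UTC."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


# Values copied from the "not in range" log lines for csr-4j59g.
csr_created = parse("2019-07-05 14:37:01")
window_start = parse("2019-07-05 14:15:07")
window_end = parse("2019-07-05 14:25:17")

window_width = window_end - window_start  # length of the accepted range
overshoot = csr_created - window_end      # how late the CSR arrived

print("window width:", window_width)
print("CSR created after window end by:", overshoot)
print("ignored at 101 with limit 100:", ignore_all_csrs(101))
print("ignored at 101 with limit 250:", ignore_all_csrs(101, limit=250))
```

The accepted range is only a little over ten minutes wide, and csr-4j59g arrived well after it closed. Separately, a count of 101 pending CSRs is enough to stop all approvals under a limit of 100 but not under a limit of 250.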

Comment 8 Justin Pierce 2019-07-05 14:52:53 UTC
Created attachment 1587735 [details]
approver logs

Comment 11 Alberto 2019-09-04 08:41:03 UTC
Reopened the PR at https://github.com/openshift/cluster-machine-approver/pull/43

Comment 14 Jianwei Hou 2019-09-09 03:56:57 UTC
Verified in 4.2.0-0.nightly-2019-09-08-180038

Scaled over 100 nodes, did not see the problem again.

Comment 16 errata-xmlrpc 2019-10-16 06:31:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

