Description of problem:
At scale (500+ worker nodes), we need to increase the raft election-timer value (in the range of 2 to 60 seconds) to avoid partitioning the Southbound DB Raft cluster. During a leadership change, when a large number of nodes reconnect to the new raft leader, the leader can stay busy for a long duration (several seconds, but still less than the election timer). If the readiness probe fires during that busy window, ovs-appctl can time out (or even crash -- I have seen this in one test), which marks the container Ready:False. That can lead to CNO restarting the pod, followed by yet another leader election.
For example, in one of my scale tests I was scaling from 400 to 500 nodes with the election timer set to 36 seconds. During the scale-up, one of the ovsdb-server instances was busy for 10+ seconds; the readiness probe executed during that window and failed with:
Type     Reason     Age                     From                                               Message
----     ------     ---                     ----                                               -------
Warning  Unhealthy  3m5s (x280 over 6h38m)  kubelet, ip-10-0-174-7.us-west-2.compute.internal  Readiness probe failed: command timed out
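For reference, the election timer is raised with ovs-appctl's cluster/change-election-timer command, and ovsdb-server only accepts an increase of at most double the current value per call, so reaching a large timer from the 1-second default takes several doubling steps. A minimal sketch (the socket path and database name are assumptions for a typical OVN deployment; this loop only prints the commands it would run):

```shell
target_ms=36000
t=1000                         # default raft election timer, in ms
while [ "$t" -lt "$target_ms" ]; do
  next=$((t * 2))
  # The timer can be at most doubled per call, so cap the final step.
  [ "$next" -gt "$target_ms" ] && next=$target_ms
  echo "ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound $next"
  t=$next
done
```

Running this would print the six stepped commands (2000, 4000, 8000, 16000, 32000, 36000 ms), which would then be applied on the current raft leader.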
Version-Release number of selected component (if applicable):

How reproducible:
Easily reproducible at higher scale (300+ worker nodes and election-timer > 20 seconds).
Steps to Reproduce:
1. Deploy openshift cluster.
2. Set the raft election-timer for sb-db cluster to 20 seconds.
3. Scale the worker nodes to 300+ nodes.
4. Monitor the master pods, and you will see readiness probe failing for sb-db container.
(Note: in my environment I observed it with a 36-second election timer and 500+ nodes, but I believe it should be reproducible with election-timer=20 and 300+ nodes.)
The readiness probe should be a bit less aggressive and more adaptive, to avoid these false-positive probe failures.
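A less aggressive probe could, for example, retry the status query a few times instead of failing on the first slow response, so a leader that is busy for a few seconds does not flip the container to Ready:False. A minimal sketch, not the actual fix; the command, socket path, and retry counts below are illustrative assumptions:

```shell
# Retry the health check before reporting failure.
probe() {
  retries=$1
  cmd=$2
  i=0
  while [ "$i" -lt "$retries" ]; do
    # Give each attempt its own generous timeout instead of one hard cutoff.
    if timeout 10 sh -c "$cmd" >/dev/null 2>&1; then
      return 0                  # healthy
    fi
    i=$((i + 1))
    sleep 1                     # back off before the next attempt
  done
  return 1                      # still unresponsive after all retries
}

# Example wiring for the SB DB container's readiness check (hypothetical):
# probe 3 "ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound"
```

The total probe budget (retries x per-attempt timeout) can then be sized relative to the election timer rather than hard-coded.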
The following PR is under review for this bug:
The related PR has been merged.
Verified on release 4.6.0-0.nightly-2020-10-03-051134, on a 500-node AWS cluster with m5.4xlarge masters and m5.large workers. Scaled up 20 nodes at a time to 500 and the cluster remained stable; no failed readiness probes were seen in this cluster.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.