Bug 1835494

Summary: [OVN-Kubernetes SCALE] sb-db readiness probe fails with higher raft election-timer value
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Reporter: Anil Vishnoi <avishnoi>
Assignee: Anil Vishnoi <avishnoi>
QA Contact: Mike Fiedler <mifiedle>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: aconstan, anusaxen, jtaleric, mifiedle, mkarg
Version: 4.5
Target Release: 4.6.0
Hardware: All
OS: All
Type: Bug
Last Closed: 2020-10-27 15:59:18 UTC

Description Anil Vishnoi 2020-05-13 22:47:48 UTC
Description of problem:

At scale (500+ worker nodes), to avoid Southbound DB raft cluster partitions we need to increase the raft election-timer value (in the range of 2 to 60 seconds). During a leadership change, when a large number of nodes connect to the new raft leader, the leader can stay busy for a long duration (seconds, though less than the election timer). If the readiness probe fires during that busy window, ovs-appctl may time out (or even crash -- I have seen this in one test) and mark the container Ready:False, which can lead to CNO restarting the pod, followed by another leader election.

For example, in one of my scale tests I was scaling from 400 nodes to 500 nodes with the election timer set to 36 seconds. During the scale-up, one of the ovsdb-server instances was busy for 10+ seconds; the readiness probe executed during that time and failed with:

```
Type     Reason     Age                     From                                               Message
  ----     ------     ----                    ----                                               -------
  Warning  Unhealthy  3m5s (x280 over 6h38m)  kubelet, ip-10-0-174-7.us-west-2.compute.internal  Readiness probe failed: command timed out
```
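
For reference, the probe boils down to an ovs-appctl query against the SB database's raft control socket. A minimal sketch of the same failure mode, assuming the default OVN socket path (the exact command and path used by the sb-db container may differ):

```
# Query raft status with a short client-side timeout (-T, in seconds).
# If the leader stays busy handling reconnects for longer than this,
# the call fails the same way the readiness probe does.
ovs-appctl -T 5 -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
```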



Version-Release number of selected component (if applicable):


How reproducible:
Easily reproducible at higher scale (300+ worker nodes and an election timer above 20 seconds).

Steps to Reproduce:
1. Deploy openshift cluster.
2. Set the raft election-timer for the sb-db cluster to 20 seconds (see the sketch after this list).
3. Scale the worker nodes to 300+ nodes.
4. Monitor the master pods; you will see the readiness probe failing for the sb-db container.
(Note: in my environment I observed this with a 36-second election timer and 500+ nodes, but I believe it should be reproducible at election-timer=20 and 300+ nodes.)
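
For step 2, a sketch of one way to raise the election timer (the socket path and intermediate values here are illustrative; ovsdb-server only accepts raising the timer to at most double its current value per call, so step up gradually, and the command must run against the raft leader):

```
# Values are in milliseconds; each step at most doubles the previous one.
for t in 2000 4000 8000 16000 20000; do
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl \
      cluster/change-election-timer OVN_Southbound $t
done
```
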
Actual results:


Expected results:
The readiness probe should be a bit less aggressive and more adaptive, to avoid false-positive probe failures.
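
One possible shape for this (a sketch only, not the actual probe shipped by CNO; the socket path is assumed): derive the probe's own timeout from the configured election timer, so a leader that is busy for less than one election interval is not marked unready.

```
#!/bin/bash
# Illustrative sketch, not the shipped probe script.
CTL=/var/run/ovn/ovnsb_db.ctl   # assumed control socket path

# Read the configured election timer (ms) from cluster/status,
# falling back to the 1000 ms default if it cannot be parsed.
timer_ms=$(ovs-appctl -t "$CTL" cluster/status OVN_Southbound \
           | awk '/Election timer:/ {print $3}')
timer_ms=${timer_ms:-1000}

# Give the status query up to one full election interval to answer.
exec ovs-appctl -T $(( timer_ms / 1000 + 1 )) -t "$CTL" \
    cluster/status OVN_Southbound >/dev/null
```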

Additional info:

Comment 3 Anil Vishnoi 2020-05-29 16:44:57 UTC
The following PR is under review for this bug:

https://github.com/openshift/cluster-network-operator/pull/652

Comment 4 Anil Vishnoi 2020-06-19 06:51:27 UTC
The related PR has been merged.

Comment 7 Mike Fiedler 2020-10-09 13:19:18 UTC
Verified on release 4.6.0-0.nightly-2020-10-03-051134 with a 500-node AWS cluster (m5.4xlarge masters, m5.large workers). Scaled up 20 nodes at a time to 500; the cluster was stable and no failed readiness probes were seen.

Comment 9 errata-xmlrpc 2020-10-27 15:59:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196