Bug 1835494 - [OVN-Kubernetes SCALE] sb-db readiness probe fails with higher raft election-timer value
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: All
OS: All
Target Milestone: ---
Target Release: 4.6.0
Assignee: Anil Vishnoi
QA Contact: Mike Fiedler
Depends On:
Reported: 2020-05-13 22:47 UTC by Anil Vishnoi
Modified: 2020-10-27 15:59 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-10-27 15:59:18 UTC
Target Upstream Version:


Links: Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 15:59:36 UTC)

Description Anil Vishnoi 2020-05-13 22:47:48 UTC
Description of problem:

At scale (500+ worker nodes), to avoid southbound DB raft cluster partitions we need to increase the raft election-timer value (in the range of 2 seconds to 60 seconds). During a leadership change, when a high number of nodes connects to the next raft leader, that leader can stay busy for a long duration (seconds, but less than the election timer). If the readiness probe fires during that busy window, ovs-appctl might time out (or crash, as I have seen in one of the tests) and mark the container Ready:False, which can lead to CNO restarting the pod, followed by yet another leader election.
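For context, the probe here reduces to an ovs-appctl query against the southbound database's control socket, and that query is what times out while the server is busy. A minimal sketch of such a check (the socket path, database name, and 5-second timeout are assumptions for illustration, not the exact probe script):

  # Ask the ovsdb-server raft module for cluster status; -T bounds the wait,
  # so a busy server surfaces as a timeout instead of a hang.
  ovs-appctl -T 5 -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

When the leader is saturated with reconnecting nodes, a call like this can exceed its timeout even though the raft member itself is healthy.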

For example, in one of my scale tests I was scaling from 400 nodes to 500 nodes with the election-timer set to 36 seconds. During the scale-up, one of the ovsdb-server instances was busy for 10+ seconds, the readiness probe executed during that time, and it failed with:

Type     Reason     Age                     From                                               Message
  ----     ------     ----                    ----                                               -------
  Warning  Unhealthy  3m5s (x280 over 6h38m)  kubelet, ip-10-0-174-7.us-west-2.compute.internal  Readiness probe failed: command timed out

Version-Release number of selected component (if applicable):

How reproducible:
It is easily reproducible at higher scale (300+ worker nodes and election-timer > 20 seconds).

Steps to Reproduce:
1. Deploy openshift cluster.
2. Set the raft election-timer for the sb-db cluster to 20 seconds (a command sketch follows this list).
3. Scale the worker nodes to 300+ nodes.
4. Monitor the master pods, and you will see readiness probe failing for sb-db container. 
(Note: in my environment I observed it at a 36-second election-timer and 500+ nodes, but I believe it should be reproducible at election-timer=20 and 300+ nodes.)
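As referenced in step 2, a sketch of raising the election timer from inside the sb-db container (the socket path is an assumption; cluster/change-election-timer takes milliseconds, has to be run against the current raft leader, and accepts only up to roughly a doubling of the current value per call, so the value is stepped up):

  # Step the timer up from the 1000 ms default to 20000 ms (20 seconds).
  for t in 2000 4000 8000 16000 20000; do
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound $t
  done
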
Actual results:

Expected results:
The readiness probe should be a bit less aggressive and more adaptive, to avoid these false-positive probe failures.
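For illustration, "less aggressive" could be as simple as widening the probe's timing in the sb-db container spec; a sketch with assumed values (the script name and the numbers are hypothetical, not the shipped configuration):

  readinessProbe:
    exec:
      command: ["/usr/bin/ovnsb-db-check.sh"]  # hypothetical probe script
    periodSeconds: 30     # probe less often
    timeoutSeconds: 30    # tolerate a leader that is busy for tens of seconds
    failureThreshold: 3   # require sustained failure before marking Ready:False

A more adaptive variant could derive timeoutSeconds from the configured election timer instead of hard-coding it.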

Additional info:

Comment 3 Anil Vishnoi 2020-05-29 16:44:57 UTC
The following PR is under review for this bug:


Comment 4 Anil Vishnoi 2020-06-19 06:51:27 UTC
The related PR has been merged.

Comment 7 Mike Fiedler 2020-10-09 13:19:18 UTC
Verified on release 4.6.0-0.nightly-2020-10-03-051134 with a 500 node AWS cluster (m5.4xlarge masters, m5.large workers). Scaled up 20 nodes at a time to 500 and the cluster was stable. No failed readiness probes were seen in this cluster.

Comment 9 errata-xmlrpc 2020-10-27 15:59:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below:

https://access.redhat.com/errata/RHBA-2020:4196

If the solution does not work for you, open a new bug report.

