Bug 1838343 - Improve the sb-db and nb-db readiness check to ensure it fails when cluster is not stable.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: All
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Anil Vishnoi
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 1934652
 
Reported: 2020-05-21 00:43 UTC by Anil Vishnoi
Modified: 2021-03-03 16:05 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1934652 (view as bug list)
Environment:
Last Closed: 2021-03-03 15:52:25 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 655 0 None open Bug 1838343 - Improve the sb-db and nb-db readiness check to ensure it fails when cluster is not stable. 2021-02-03 13:55:22 UTC

Description Anil Vishnoi 2020-05-21 00:43:56 UTC
Description of problem:
The current sb-db/nb-db readiness check does the following:

nb-db:
exec /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound  2>/dev/null | grep ${K8S_NODE_IP} | grep -v Address -q

sb-db:
exec /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound  2>/dev/null | grep ${K8S_NODE_IP} | grep -v Address -q

This does not actually verify that the cluster is in a ready state. For example, in one of the scale cluster tests, the three nodes of the sb-db cluster were in the following state:

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fxf9h -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
9aca
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 9aca (9aca6b5f-bd2f-455c-a98e-19e7a0226913)
Address: ssl:10.0.210.251:9644
Status: cluster member
Role: follower
Term: 273
Leader: unknown
Vote: unknown

Election timer: 50000
Log: [7764, 7855]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: ->1bbc ->6a8c <-6a8c <-1bbc
Servers:
    9aca (9aca at ssl:10.0.210.251:9644) (self)
    1bbc (1bbc at ssl:10.0.181.129:9644)
    6a8c (6a8c at ssl:10.0.148.1:9644)
---------
# oc exec -n openshift-ovn-kubernetes ovnkube-master-k9ph2 -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
1bbc
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 1bbc (1bbc51cb-4b41-4c27-8407-0ae991b07bbc)
Address: ssl:10.0.181.129:9644
Status: cluster member
Role: candidate
Term: 274
Leader: unknown
Vote: self

Election timer: 32000
Log: [6853, 6853]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->9aca ->6a8c <-9aca <-6a8c
Servers:
    9aca (9aca at ssl:10.0.210.251:9644)
    1bbc (1bbc at ssl:10.0.181.129:9644) (self) (voted for 1bbc)
    6a8c (6a8c at ssl:10.0.148.1:9644)
------
# oc exec -n openshift-ovn-kubernetes ovnkube-master-nfsvs -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
6a8c
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 6a8c (6a8c859f-f54a-4b61-a028-3c86e3f1bd62)
Address: ssl:10.0.148.1:9644
Status: cluster member
Role: follower
Term: 272
Leader: 9aca
Vote: 9aca

Election timer: 50000
Log: [7764, 7855]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: ->9aca ->1bbc <-9aca <-1bbc
Servers:
    9aca (9aca at ssl:10.0.210.251:9644)
    1bbc (1bbc at ssl:10.0.181.129:9644)
    6a8c (6a8c at ssl:10.0.148.1:9644) (self)


So the sb-db cluster was completely broken, and for a long duration, but the readiness checks kept passing. We need to improve the readiness check to ensure that the cluster has a leader at any given point in time (no "Leader: unknown" in the cluster/status output). We also need to ensure that the readiness check fails after at least two consecutive failed probes; a rough sketch of what such a check could look like follows.
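
A minimal sketch of a stricter sb-db probe, assuming the same socket path and K8S_NODE_IP variable as the existing check (the nb-db variant would use ovnnb_db.ctl and OVN_Northbound). This is only an illustration of the idea, not the actual change merged in the PR linked below; the "two consecutive failures" part could also be expressed via the probe's failureThreshold and periodSeconds settings instead of in the script itself:

status=$(/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>/dev/null)
# Original check: this member must list its own address in the Servers section.
echo "$status" | grep "${K8S_NODE_IP}" | grep -v Address -q || exit 1
# Additional check: fail if this member does not currently know who the leader is.
echo "$status" | grep -q "Leader: unknown" && exit 1
exit 0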

Version-Release number of selected component (if applicable):


How reproducible:
Easily at a scale of more than 300 nodes.

Steps to Reproduce:
1. Create a cluster with the election timer set to 10 seconds.
2. Scale the cluster up to 300 nodes in increments of 100.
3. Keep checking the status of the sb-db cluster. Once the cluster reaches a state where it cannot elect a leader for a while, you will see that the readiness probes do not fail (one way to watch for this is sketched below).
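
A rough way to watch for that window while scaling, as a sketch only: the app=ovnkube-master label selector is an assumption, while the -c sbdb container name and ovn-appctl invocation match the commands shown above.

while true; do
  date
  for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master -o name); do
    echo "== ${pod}"
    oc exec -n openshift-ovn-kubernetes "${pod#pod/}" -c sbdb -- \
      ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep -E '^(Role|Leader):'
  done
  # Compare against what the readiness probes report for the same pods.
  oc get pods -n openshift-ovn-kubernetes -o wide | grep ovnkube-master
  sleep 30
done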

Actual results:
The readiness check does not fail when the cluster has had no leader for a prolonged time.

Expected results:
The readiness probe should fail if the cluster has had no leader for two consecutive iterations of the readiness check.


Additional info:

Comment 1 Casey Callendrello 2020-05-25 14:35:09 UTC
We need to be careful; readiness checks really should be process-local considerations. Why? Because readiness gates DaemonSet rollouts. That is to say, DaemonSet rollout pauses until all processes are Ready.

So we might find ourselves in a situation where we can never push out updates again: we've lost quorum, and we need to upgrade at least two nodes to get quorum back.

Instead, we should be reporting as Degraded in the CNO if we don't have quorum.

Comment 3 Anil Vishnoi 2020-05-29 17:12:31 UTC
(In reply to Casey Callendrello from comment #1)
> We need to be careful; readiness checks really should be process-local
> considerations. Why? Because readiness gates DaemonSet rollouts. That is to
> say, DaemonSet rollout pauses until all processes are Ready.
> 
Thanks for the comment, Casey; that's really helpful. I think it might end up being a chicken-and-egg problem. The purpose of the readiness check is to report whenever the raft cluster is not in a healthy state, which can also mean that the cluster nodes are running but have not reached consensus on a leader to serve requests. If the readiness probe doesn't catch this state, the CNO won't be able to take any action and the CNI will stay in a bad state until somebody intervenes manually. But, as you mentioned, this scenario can also happen at the time of the initial rollout. I am thinking of checking for the `Leader` state 'unknown' in the probe, which would mark the container unhealthy after 30 seconds if there is no leader for the cluster (one possible shape of such a probe is sketched at the end of this comment). That is the right way to check the status of the cluster, but I now see a possible issue at rollout time. Let me think a bit about how we can avoid it.
> So we might find ourselves where we can never push out updates ever again,
> since we've lost quorum, and need to upgrade at least 2 nodes to get quorum
> again.
I am not sure that's the right state in which to upgrade the cluster. I believe the upgrade should only happen when the cluster is in a healthy state; otherwise the upgrade can cause issues in configuring the network, because you are not sure what state the previous cluster left the network in. But I believe you can hit this rollout issue in other scenarios as well, so irrespective of the upgrade scenario, we need to address it anyway.
> 
> Instead, we should be reporting as Degraded in the CNO if we don't have
> quorum.
If the readiness probe marks the container unhealthy, won't the CNO report it as Degraded?
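
For reference, one possible shape of the probe idea mentioned above, sketched under the assumption that the probe script may keep a small counter file under /var/run/ovn and that the probe period is about 15 seconds (so two consecutive failures roughly matches the 30-second window). This is not the implementation that was merged:

FAIL_COUNT_FILE=/var/run/ovn/sbdb-probe-failures   # assumed writable scratch location
if /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>/dev/null \
     | grep -q "Leader: unknown"; then
  count=$(( $(cat "${FAIL_COUNT_FILE}" 2>/dev/null || echo 0) + 1 ))
  echo "${count}" > "${FAIL_COUNT_FILE}"
  # Two consecutive probe periods without a known leader -> report unhealthy.
  [ "${count}" -ge 2 ] && exit 1
else
  echo 0 > "${FAIL_COUNT_FILE}"
fi
exit 0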

Comment 4 Anil Vishnoi 2020-06-03 00:57:03 UTC
The following PR is in progress for this bug:

https://github.com/openshift/cluster-network-operator/pull/655

Comment 7 Anil Vishnoi 2020-11-16 20:22:47 UTC
I tried and tested the following options to see if they help improve the sb/nb db health check:
1. Check the Role state (follower, candidate, leader)
2. Check the Role state with Leader status
3. The above checks combined with the connection info.

Options 2 and 3 do a better job of determining the state, but unfortunately they can still give false positives (especially in start/restart scenarios).
To solve this problem properly, we need a separate entity that can monitor all of these raft pods, fetch the cluster/status information from each of them, and deduce the overall cluster status. There are two options:
(1) Adding this intelligence to the CNO.
(2) Doing it through a new raft monitoring pod that can monitor and take action.

I am exploring option (1) as of now. I will discuss option (2) upstream to see if that's an acceptable solution; that way we can have the same implementation upstream and downstream. (An illustration of the kind of cluster-wide check such an entity would perform is sketched below.)
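
For illustration only, the kind of cluster-wide deduction such an entity would make can be approximated with a one-off script like the following. The app=ovnkube-master label selector and the majority threshold are assumptions; a real implementation in the CNO or a monitoring pod would talk to the pods directly rather than shelling out to oc exec:

leaders_seen=0; members=0
for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master -o name); do
  members=$((members + 1))
  # Ask each raft member who it currently believes the leader is.
  leader=$(oc exec -n openshift-ovn-kubernetes "${pod#pod/}" -c sbdb -- \
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>/dev/null \
    | awk '/^Leader:/ {print $2}')
  [ -n "${leader}" ] && [ "${leader}" != "unknown" ] && leaders_seen=$((leaders_seen + 1))
done
# Require a majority of members to agree that a leader exists.
if [ "${leaders_seen}" -lt $(( members / 2 + 1 )) ]; then
  echo "sb-db raft cluster unhealthy: only ${leaders_seen}/${members} members see a leader"
  exit 1
fi
echo "sb-db raft cluster looks healthy: ${leaders_seen}/${members} members see a leader"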

