Bug 1838343
| Summary: | Improve the sb-db and nb-db readiness check to ensure it fails when cluster is not stable. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anil Vishnoi <avishnoi> |
| Component: | Networking | Assignee: | Anil Vishnoi <avishnoi> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aconstan, anbhat, bbennett, cdc |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | All | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1934652 (view as bug list) | Environment: | |
| Last Closed: | 2021-03-03 15:52:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1934652 | | |
Description
Anil Vishnoi
2020-05-21 00:43:56 UTC
We need to be careful; readiness checks really should be process-local considerations. Why? Because readiness gates DaemonSet rollouts. That is to say, DaemonSet rollout pauses until all processes are Ready. So we might find ourselves where we can never push out updates ever again, since we've lost quorum, and need to upgrade at least 2 nodes to get quorum again. Instead, we should be reporting as Degraded in the CNO if we don't have quorum.

(In reply to Casey Callendrello from comment #1)
> We need to be careful; readiness checks really should be process-local
> considerations. Why? Because readiness gates DaemonSet rollouts. That is to
> say, DaemonSet rollout pauses until all processes are Ready.

Thanks for the comment Casey, that's really helpful. I think it might end up like a chicken-and-egg problem. The purpose of the readiness check is to report whenever the raft cluster is not in a healthy state; that can also mean the cluster nodes are running but they didn't reach consensus on a leader to serve requests. If the readiness probe doesn't catch this state, CNO won't be able to take any action and CNI will be in a bad state until somebody does a manual intervention. But as you mentioned, this scenario can also happen at the time of the initial rollout.

I am thinking of checking for the `Leader` state 'unknown' in the probe, which will mark the container unhealthy after 30 seconds if there is no leader for the cluster. That is the right way to check the status of the cluster, but I now see a possible issue at the time of rollout. Let me think a bit on how we can avoid the issue.

> So we might find ourselves where we can never push out updates ever again,
> since we've lost quorum, and need to upgrade at least 2 nodes to get quorum
> again.

I am not sure that's the right state in which to upgrade the cluster? I believe the upgrade should only happen when your cluster is in a healthy state; otherwise the upgrade can cause issues in configuring the network, because you are not sure in what state the previous cluster left the network. But I believe you can hit a possible rollout issue in other scenarios as well, so irrespective of the upgrade scenario, we need to address it anyway.

> Instead, we should be reporting as Degraded in the CNO if we don't have
> quorum.

If the readiness probe marks the container unhealthy, won't CNO report it as Degraded?

The following PR is in progress for this bug: https://github.com/openshift/cluster-network-operator/pull/655

I tried and tested the following options to see if they help improve the sb/nb db health check:

1. Check the Role state (follower, candidate, leader)
2. Check the Role state together with the Leader status
3. The above status together with the connections info

Options 2 and 3 do a better job of determining the state, but unfortunately they can still give false positive results (especially in start/restart scenarios). To solve this problem properly, we need a separate entity that can monitor all these raft pods, fetch the cluster/status information from all of them, and deduce the cluster status. There are two options: (1) adding this intelligence to the CNO, or (2) doing it through a new raft monitoring pod that can monitor and take actions. I am exploring option (1) as of now. I will discuss option (2) upstream to see if that's an acceptable solution; that way we can have the same implementation upstream and downstream.
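
For illustration, a minimal sketch in Go of the kind of per-pod check discussed in the comments above: it parses the `Role` and `Leader` fields from the local database's `cluster/status` output and reports not-ready when the server sees no known leader or is still a candidate. The control-socket path, database name, and the `raftcheck` program are assumptions for illustration only; this is not the implementation from the PR referenced above.

```go
// raftcheck: a hypothetical readiness-style check that asks the local
// ovsdb-server for its raft status and exits non-zero if the server
// reports no known leader. Socket path and database name are example
// values, not necessarily what the ovnkube daemonsets use.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Query the local southbound database's raft status.
	out, err := exec.Command("ovs-appctl",
		"-t", "/var/run/ovn/ovnsb_db.ctl",
		"cluster/status", "OVN_Southbound").CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "cluster/status failed: %v\n%s", err, out)
		os.Exit(1)
	}

	// Parse the "Key: value" lines of the cluster/status output.
	status := map[string]string{}
	for _, line := range strings.Split(string(out), "\n") {
		if k, v, ok := strings.Cut(line, ":"); ok {
			status[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}

	role := status["Role"]     // leader, follower, or candidate
	leader := status["Leader"] // "self", a server ID, or "unknown"

	// A candidate role or an unknown leader means the cluster has not
	// (yet) reached consensus; report the container as not ready.
	if role == "candidate" || leader == "unknown" || leader == "" {
		fmt.Fprintf(os.Stderr, "raft not settled: Role=%q Leader=%q\n", role, leader)
		os.Exit(1)
	}
	fmt.Printf("raft ok: Role=%q Leader=%q\n", role, leader)
}
```

As comment #1 points out, gating readiness on cluster-wide state like this is exactly what can pause a DaemonSet rollout when quorum has already been lost, which is why the later comments lean toward moving the judgment out of the per-pod probe.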
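Along the same lines, a sketch of the aggregation idea from the last comment (option 1, putting the intelligence in the CNO): a single observer collects each member's status and decides whether a majority of the cluster agrees on one leader. The `memberStatus` type and `healthy` helper are hypothetical names for illustration, not CNO code.

```go
// clusterview: a hypothetical operator-side aggregation of per-member
// raft status, deciding cluster health centrally instead of per pod.
package main

import "fmt"

// memberStatus holds the fields of interest from one member's
// "cluster/status" output (illustrative only).
type memberStatus struct {
	ServerID string
	Role     string // leader, follower, or candidate
	Leader   string // "self", a server ID, or "unknown"
}

// healthy reports whether a majority of members is reachable and agrees
// on a single leader. Unreachable members are passed as nil entries.
func healthy(members []*memberStatus) bool {
	leaders := map[string]int{}
	reachable := 0
	for _, m := range members {
		if m == nil {
			continue // could not fetch status from this member
		}
		reachable++
		switch {
		case m.Role == "leader":
			leaders[m.ServerID]++
		case m.Leader != "" && m.Leader != "unknown":
			leaders[m.Leader]++
		}
	}
	quorum := len(members)/2 + 1
	if reachable < quorum || len(leaders) != 1 {
		return false
	}
	for _, votes := range leaders {
		if votes < quorum {
			return false
		}
	}
	return true
}

func main() {
	members := []*memberStatus{
		{ServerID: "1b39", Role: "leader", Leader: "self"},
		{ServerID: "8e2a", Role: "follower", Leader: "1b39"},
		nil, // third member unreachable
	}
	fmt.Println("cluster healthy:", healthy(members)) // true: 2 of 3 agree
}
```

The benefit of deciding health in one place is that a broken cluster can be surfaced as Degraded on the operator, as suggested in comment #1, without the per-pod readiness probes blocking the DaemonSet rollout.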