Bug 1889651

Summary: [CNV][Chaos] Ensure MHC is enabled before starting VMs that are required to be highly available
Product: Container Native Virtualization (CNV)
Reporter: Piotr Kliczewski <pkliczew>
Component: Installation
Assignee: Erkan Erol <eerol>
Status: CLOSED WONTFIX
QA Contact: Inbar Rose <irose>
Severity: unspecified
Priority: unspecified
Version: future
CC: aasserzo, abeekhof, cnv-qe-bugs, danken, dvossel, kmajcher, msluiter, racedoro, shardy, stirabos
Target Milestone: ---
Target Release: future
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-04-26 07:52:17 UTC
Type: Bug
Bug Blocks: 1908661    

Description Piotr Kliczewski 2020-10-20 10:15:40 UTC
To make sure chaos scenarios do not affect user workloads, we need to enable MachineHealthCheck (MHC) by default on freshly installed clusters.
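
For reference, a minimal sketch (not a concrete proposal) of what enabling an MHC programmatically could look like, using the Kubernetes dynamic client against the machine.openshift.io/v1beta1 API. The resource name, selector, timeouts, and maxUnhealthy value are illustrative assumptions, not a recommended default policy:

package mhcsetup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// createDefaultMHC creates a MachineHealthCheck covering worker machines.
// All concrete values below (name, selector, timeouts, maxUnhealthy) are
// illustrative placeholders.
func createDefaultMHC(ctx context.Context, cfg *rest.Config) error {
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	gvr := schema.GroupVersionResource{
		Group:    "machine.openshift.io",
		Version:  "v1beta1",
		Resource: "machinehealthchecks",
	}
	mhc := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "machine.openshift.io/v1beta1",
		"kind":       "MachineHealthCheck",
		"metadata": map[string]interface{}{
			"name":      "workers-default", // hypothetical name
			"namespace": "openshift-machine-api",
		},
		"spec": map[string]interface{}{
			"selector": map[string]interface{}{
				"matchLabels": map[string]interface{}{
					"machine.openshift.io/cluster-api-machine-role": "worker",
				},
			},
			"unhealthyConditions": []interface{}{
				map[string]interface{}{"type": "Ready", "status": "False", "timeout": "300s"},
				map[string]interface{}{"type": "Ready", "status": "Unknown", "timeout": "300s"},
			},
			"maxUnhealthy": "40%",
		},
	}}
	_, err = dyn.Resource(gvr).Namespace("openshift-machine-api").Create(ctx, mhc, metav1.CreateOptions{})
	return err
}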

Comment 1 Steven Hardy 2020-10-20 11:54:43 UTC
This was previously discussed (and rejected) via https://github.com/openshift/enhancements/pull/330

It sounds like we need to find a way either to let users easily opt in when they want this behavior, or for CNV to automate that as part of the CNV setup.

In either case the baremetal-operator is the wrong component for this, so I'll defer triaging until we can decide if this bug should be closed or moved to a different component.

Comment 2 Rob Young 2020-10-20 15:26:21 UTC
@shardy we need to revisit the larger platform problem that our current CNV negative flow testing has exposed, specifically around BM deployments. I propose leaving this bug open until we have a proper hearing on the findings and our options for a platform level solution. We don't want to create a one-off solution specific to CNV; the platform behavior and UX should be consistent regardless of the feature that exposes the issue.

Comment 5 Steven Hardy 2020-10-22 14:28:24 UTC
(In reply to Rob Young from comment #2)
> @shardy we need to revisit the larger platform problem that our current CNV
> negative flow testing has exposed, specifically around BM deployments. I
> propose leaving this bug open until we have a proper hearing on the findings
> and our options for a platform level solution. We don't want to create a
> one-off solution specific to CNV; the platform behavior and UX should be
> consistent regardless of the feature that exposes the issue.

Ack, we can leave it open, but we need to find a different component I think - the baremetal-operator is unrelated to MHC and it seems the solution is likely to be tracked via CNV teams?

Comment 8 Piotr Kliczewski 2021-04-23 09:28:03 UTC
I was asked to clarify the consequences of not enabling MHC by default. When we lose a node that is running a workload and MHC is not enabled, we won't see the workload being rescheduled to run on other nodes. No matter what runStrategy is specified for a VM, it won't get rescheduled.

Comment 9 Krzysztof Majcher 2021-04-23 12:34:38 UTC
@dvossel - can you advise whether virt-controller would be aware of a node crashing and handle it accordingly?
Answer to this question will help us prioritize this.

Comment 10 David Vossel 2021-04-23 14:18:46 UTC
> @dvossel - can you advise whether virt-controller would be aware of a node crashing and handle it accordingly?
> Answer to this question will help us prioritize this.

The node's status is irrelevant to virt-controller's monitoring of the VMI; virt-controller only monitors the pod.

If the pod's phase is Running, then virt-controller sees the VMI as running. The only reason virt-controller would react to a node crashing is if that event results in the VMI pod on the failed node transitioning to a finalized state.

This is by design. We don't want the CNV components making decisions about how to handle node failure, because we'll get it wrong. Instead, we're trusting the system to provide us accurate information pertaining to the status of our workload.
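
As an illustration of the point above (a simplified sketch, not the actual virt-controller code): the controller's view of failure is derived from the launcher pod's phase alone, so a NotReady node by itself changes nothing until something (e.g. MHC remediation) finalizes or removes the pod.

package example

import corev1 "k8s.io/api/core/v1"

// vmiLooksFailed mirrors the behaviour described above: only a finalized
// launcher pod phase counts as failure; node conditions are never consulted,
// so a crashed-but-unremediated node leaves the VMI looking "Running".
func vmiLooksFailed(launcherPod *corev1.Pod) bool {
	switch launcherPod.Status.Phase {
	case corev1.PodSucceeded, corev1.PodFailed:
		return true
	default:
		return false
	}
}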

Comment 11 Simone Tiraboschi 2021-04-26 07:52:17 UTC
Thanks David,
based on that I think we should close this as WONTFIX.

Comment 12 Piotr Kliczewski 2021-04-26 07:59:25 UTC
I strongly disagree with closing this bug. I am not arguing about whose responsibility it should be to handle this case, but without it we are developing a toy, not a production-ready product.
I will raise the unpopular opinion that we should enable MHC in HCO.

Comment 13 David Vossel 2021-04-26 13:02:06 UTC
> I strongly disagree with closing this bug. I am not arguing about whose responsibility it should be to handle this case, but without it we are developing a toy, not a production-ready product.
> I will raise the unpopular opinion that we should enable MHC in HCO.

What's the issue with cluster admins opting into MHC when they want MHC-like behavior? Maybe they want that behavior to provide guarantees for a StatefulSet workload, maybe they want it for CNV VMs, maybe they want it for their own custom thing.

Comment 14 Andrew Beekhof 2021-04-26 13:03:19 UTC
Piotr: As long as you're not expecting this to be a one-time operation.

The HCO would need to continuously enforce this condition for it to be a valid assumption by the rest of the codebase.

Which leads to the interesting case of what should happen if the admin disables fencing (perhaps during a maintenance window or in order to perform triage).
Should HCO undo that change?  How would it infer admin intent? 

Rather than trying to configure fencing, I would advise testing to see if it is configured.
Pasting from comment #3...

> Rather than focusing on enabling MHC by default, it would be better to address the current 
> behaviour of blindly starting VMs on the assumption that fencing is always available

> If a valid MHC configuration is a hard requirement for CNV, then it would be appropriate 
> for the HCO to check for it before starting its controllers (and by inference, any VMs).

Or perhaps you only gate starting HA VMs on the presence of fencing. 

>
> This would directly tie the fencing functionality to the component that wishes to consume 
> it, as well as providing actionable feedback before it matters and regardless of the platform 
> or whether the admin is using the CLI or UI console.
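
As a rough sketch of the "check, don't configure" approach suggested above, one could gate HA VM start on the presence of at least one MachineHealthCheck. The function names and the exact gating policy here are assumptions for illustration only:

package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// fencingConfigured reports whether at least one MachineHealthCheck exists.
func fencingConfigured(ctx context.Context, dyn dynamic.Interface) (bool, error) {
	gvr := schema.GroupVersionResource{
		Group:    "machine.openshift.io",
		Version:  "v1beta1",
		Resource: "machinehealthchecks",
	}
	list, err := dyn.Resource(gvr).Namespace("openshift-machine-api").List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	return len(list.Items) > 0, nil
}

// gateHAVMStart is a hypothetical gate: refuse to start an HA VM (and surface
// an actionable error/condition instead) when no MHC is configured.
func gateHAVMStart(ctx context.Context, dyn dynamic.Interface) error {
	ok, err := fencingConfigured(ctx, dyn)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("no MachineHealthCheck found: refusing to start HA VMs")
	}
	return nil
}

Whether such a gate would live in HCO, virt-controller, or an admission check is exactly the open design question in this bug.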

Comment 15 Piotr Kliczewski 2021-04-26 13:35:16 UTC
(In reply to Andrew Beekhof from comment #14)
> 
> Rather than trying to configure fencing, I would advise testing to see if it
> is configured.
> Pasting from comment #3...
> 

I do not want to prescribe a solution. Andrew, your suggestion sounds good to me.
I think we need to handle this case instead of saying it is not my responsibility.