Bug 1889651
Summary: | [CNV][Chaos] Ensure MHC is enabled before starting VMs that are required to be highly available | |
---|---|---|---
Product: | Container Native Virtualization (CNV) | Reporter: | Piotr Kliczewski <pkliczew>
Component: | Installation | Assignee: | Erkan Erol <eerol>
Status: | CLOSED WONTFIX | QA Contact: | Inbar Rose <irose>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | future | CC: | aasserzo, abeekhof, cnv-qe-bugs, danken, dvossel, kmajcher, msluiter, racedoro, shardy, stirabos
Target Milestone: | --- | |
Target Release: | future | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-04-26 07:52:17 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1908661 | |
Description
Piotr Kliczewski
2020-10-20 10:15:40 UTC
This was previously discussed (and rejected) via https://github.com/openshift/enhancements/pull/330

It sounds like we need to find a way to either enable users to easily opt in when they want this behavior, or for CNV to automate it as part of the CNV setup. In either case the baremetal-operator is the wrong component for this, so I'll defer triaging until we can decide whether this bug should be closed or moved to a different component.

@shardy we need to revisit the larger platform problem that our current CNV negative-flow testing has exposed, specifically around BM deployments. I propose leaving this bug open until we have a proper hearing on the findings and our options for a platform-level solution. We don't want to create a one-off solution specific to CNV; the platform behavior and UX should be consistent regardless of which feature exposes the issue.

(In reply to Rob Young from comment #2)
> @shardy we need to revisit the larger platform problem that our current CNV
> negative flow testing has exposed, specifically around BM deployments. I
> propose leaving this bug open until we have a proper hearing on the findings
> and our options for a platform level solution. We don't want to create a
> one-off solution specific to CNV; the platform behavior and UX should be
> consistent regardless of feature that exposes the issue.

Ack, we can leave it open, but I think we need to find a different component - the baremetal-operator is unrelated to MHC, and it seems the solution is likely to be tracked via the CNV teams?

I was asked to clarify the consequences of not enabling MHC by default. When we lose a node that is running a workload and MHC is not enabled, the workload will not be rescheduled to run on other nodes. No matter what runStrategy is specified for a VM, it won't get rescheduled.

@dvossel - can you advise whether virt-controller would be aware of a node crashing and handle it accordingly? The answer to this question will help us prioritize this.
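For context, "opting in" today means the cluster admin creating a MachineHealthCheck resource themselves. A minimal sketch of what that looks like (the name, label selector, and thresholds below are illustrative, following the general machine-api pattern; they are not taken from this bug):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers-mhc            # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
  maxUnhealthy: 40%            # stop remediating if too many machines are unhealthy at once
```

Without such a resource, a crashed node is never remediated, which is the scenario Piotr describes: the VMI pod on the dead node never reaches a finalized state, so nothing is rescheduled.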
> @dvossel - can you advise whether virt-controller would be aware of a node crashing and handle it accordingly?
> The answer to this question will help us prioritize this.

The node's status is irrelevant to virt-controller's monitoring of the VMI; virt-controller only monitors the pod.
If the pod's phase is Running, then virt-controller sees the VMI as running. The only reason virt-controller would react to a node crashing is if that event results in a VMI pod on the failed node transitioning to a finalized state.
This is by design. We don't want the CNV components making decisions about how to handle node failure, because we'll get it wrong. Instead, we're trusting the system to provide us accurate information about the status of our workload.
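This is why runStrategy alone does not help on a silently failed node. Even a VM declared as "keep this always running" is only restarted by virt-controller after the old virt-launcher pod is finalized, which on a dead node requires something external (such as MHC remediation deleting the node) to happen first. An illustrative manifest (names and sizes are hypothetical):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ha-vm                 # hypothetical name
spec:
  runStrategy: Always         # restart the VMI whenever it stops...
  template:                   # ...but only once the old pod reaches a finalized state
    spec:
      domain:
        devices: {}
        resources:
          requests:
            memory: 64Mi
```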
Thanks David, according to that I think we should close this as WONTFIX.

I strongly disagree with closing this bug. I am not arguing about whose responsibility it should be to handle this case, but without it we develop a toy, not a production-ready product. I will raise an unpopular opinion: enable MHC in HCO.

> I strongly disagree with closing this bug. I am not arguing about whose responsibility it should be to handle this case, but without it we develop a toy, not a production-ready product.
> I will raise an unpopular opinion: enable MHC in HCO.

What's the issue with cluster admins opting into the MHC when they want MHC-like behavior? Maybe they want that behavior to provide guarantees for a StatefulSet workload, maybe they want it for CNV VMs, maybe they want it for their own custom thing.

Piotr: As long as you're not expecting this to be a one-time operation. The HCO would need to continuously enforce this condition for it to be a valid assumption by the rest of the codebase. That leads to the interesting question of what should happen if the admin disables fencing (perhaps during a maintenance window, or in order to perform triage). Should HCO undo that change? How would it infer admin intent?

Rather than trying to configure fencing, I would advise testing to see if it is configured. Pasting from comment #3...

> Rather than focusing on enabling MHC by default, it would be better to address the current
> behaviour of blindly starting VMs on the assumption that fencing is always available.

> If a valid MHC configuration is a hard requirement for CNV, then it would be appropriate
> for the HCO to check for it before starting its controllers (and by inference, any VMs).

Or perhaps you only gate starting HA VMs on the presence of fencing.

> This would directly tie the fencing functionality to the component that wishes to consume
> it, as well as providing actionable feedback before it matters, regardless of the platform
> or whether the admin is using the CLI or the UI console.

(In reply to Andrew Beekhof from comment #14)
> Rather than trying to configure fencing, I would advise testing to see if it
> is configured.
> Pasting from comment #3...
> I do not want to suggest the solution.

Andrew, your suggestion sounds good to me, but I do not want to suggest a solution. I think we need to handle this case instead of saying it is not my responsibility.
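Andrew's suggestion ("test for fencing rather than configure it, and gate only HA VMs on it") could be sketched as a pre-flight check. This is a hypothetical illustration, not actual HCO or KubeVirt code: the function names are made up, and the MachineHealthCheck/VM objects are modeled as plain dicts shaped like their manifests.

```python
def fencing_configured(machine_health_checks):
    """Return True if at least one MachineHealthCheck both selects machines
    and defines at least one unhealthy-node condition.

    `machine_health_checks` is a list of dicts shaped like MHC manifests
    (a hypothetical stand-in for listing the CRs via the API server).
    """
    for mhc in machine_health_checks:
        spec = mhc.get("spec", {})
        has_selector = bool(spec.get("selector", {}).get("matchLabels"))
        has_conditions = bool(spec.get("unhealthyConditions"))
        if has_selector and has_conditions:
            return True
    return False


def may_start_vm(vm, machine_health_checks):
    """Gate only highly available VMs (runStrategy: Always) on fencing;
    other run strategies start regardless, since they make no HA promise."""
    run_strategy = vm.get("spec", {}).get("runStrategy")
    if run_strategy == "Always":
        return fencing_configured(machine_health_checks)
    return True
```

The design point is Andrew's: the consumer of the guarantee (the component starting HA VMs) checks for fencing and surfaces actionable feedback, rather than HCO silently creating or re-creating MHC resources and second-guessing admin intent.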