Bug 1887484
| Summary: | [CNV][Chaos] Node disruption does not affect its workload | ||
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Piotr Kliczewski <pkliczew> |
| Component: | Virtualization | Assignee: | sgott |
| Status: | CLOSED NOTABUG | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | ||
| Version: | 2.5.0 | CC: | aasserzo, aos-bugs, cnv-qe-bugs, dvossel, jwang, kbidarka, ycui, zhengwan |
| Target Milestone: | --- | ||
| Target Release: | 4.12.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-09-09 15:59:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1908661 | ||
Description
Piotr Kliczewski
2020-10-12 15:30:51 UTC
The controller manager has a timeout (usually 5 minutes) to wait for a node to come back on its own. If the timeout expires, it reschedules the pods. Did you wait over 5 minutes to see if workloads migrated? Note: DaemonSets and ReplicaSets will not migrate due to how they work.

For context, KubeVirt only reschedules the VM once the VMI's pod has completely terminated and is in a finalized state. If the VMI pods on the restarted node do not transition to a finalized state, the VM controller won't proceed with rescheduling the VM workload somewhere else.

Do you suggest it is a CNV-specific issue? In this scenario we can't assume we will see the VMI pod completely terminated.

> Do you suggest it is a CNV-specific issue? In this scenario we can't assume we will see the VMI pod completely terminated.
This would only be a CNV-specific issue if the Pod reaches a finalized state and the VM controller does not attempt to reschedule the workload.
I believe it's likely that CNV is behaving correctly based on the state of the Pod that it observes. It's up to OCP to determine that the pod has terminated due to node failure and mark it as finalized.
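For reference, here is a minimal sketch (assuming default Kubernetes/OCP behavior, not verified against this cluster) of the tolerations that the admission controller typically injects into pods. Their tolerationSeconds value is what produces the roughly 5-minute window mentioned above before a pod is evicted from an unreachable node.

```yaml
# Illustrative defaults only; actual values depend on cluster configuration.
spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # pod stays bound for ~5 minutes before eviction
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```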
The only way to know for sure is to capture the VM/VMI pod's yaml during the time period in which you'd expect the reschedule to occur. From that we can gain an understanding of how the Pod's status is reported and then infer the correct action CNV should take based on that status.
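A minimal sketch of the kind of status to look for in that captured yaml. The pod name and values here are hypothetical, and the exact fields reported during a node outage depend on the OCP version; per the comment above, the VM controller only acts once the pod reaches a finalized state (a terminal phase such as Succeeded or Failed).

```yaml
# Hypothetical excerpt of a virt-launcher pod captured during the outage window
# (names and values are illustrative, not taken from the actual reproduction).
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-example-vm-abcde   # hypothetical pod name
  namespace: default
status:
  phase: Running        # still non-terminal: the VM controller would not reschedule
  conditions:
  - type: Ready
    status: "False"     # readiness drops when the node is lost, but phase is unchanged
```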
I tested node shutdown with a workload pod and a VM and observed different behavior for the two. After ~5 minutes the pod is rescheduled, whereas the VM stays in Running. Based on those findings I am changing the product to CNV, since it seems to be a CNV-specific issue.

> I tested node shutdown with a workload pod and a VM and observed different behavior for the two. After ~5 minutes the pod is rescheduled, whereas the VM stays in Running. Based on those findings I am changing the product to CNV, since it seems to be a CNV-specific issue.

Try the same experiment with a StatefulSet of size 1. That's what we're modeled after. I believe this works differently than DaemonSets and Deployments [1] with regards to node failure.

1. https://github.com/kubernetes/kubernetes/issues/54368#issuecomment-339537281

Removing the target release from this BZ to ensure we re-triage it.

Re-reading Comment #7, I think this is the ancient confusion of "Running" vs "RunStrategy". Running is a request for a state, not a status field, which is why we renamed it. With that, I am closing this as NOTABUG. Please feel free to re-open if you feel this is in error.
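For readers hitting the same confusion, here is a minimal sketch of how the requested state is expressed on a VirtualMachine, as opposed to a status report. The VM name and template values are hypothetical, and the apiVersion may be kubevirt.io/v1alpha3 on older CNV releases.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm            # hypothetical
spec:
  # "running" / "runStrategy" declare the *desired* state of the VM;
  # the observed state is reported separately in status and on the VMI object.
  runStrategy: Always         # replaces the older boolean field "running: true"
  template:
    spec:
      domain:
        devices: {}
        resources:
          requests:
            memory: 64Mi      # illustrative sizing only
```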