Bug 1886437

Summary: [Doc] Document VM behaviour when a node crashes or dies
Product: Container Native Virtualization (CNV)
Reporter: Jean-Francois Saucier <jsaucier>
Component: Documentation
Assignee: Pan Ousley <pousley>
Status: CLOSED CURRENTRELEASE
QA Contact: Israel Pinto <ipinto>
Severity: high
Docs Contact:
Priority: high
Version: 2.5.0
CC: abeekhof, cnv-qe-bugs, ctomasko, danken, fdeutsch, ipinto, jcoscia, jwang, oramraz, pousley, ycui, zhengwan
Target Milestone: ---   
Target Release: 2.5.0   
Last Closed: 2020-11-19 14:53:30 UTC
Type: Bug

Description Jean-Francois Saucier 2020-10-08 12:58:03 UTC
Our documentation currently covers how to enable live migration for a VM and how to handle normal node maintenance operations.

However, we seem to lack information on what happens (or should happen) when a node in the cluster crashes and goes down completely without warning. What should happen to the VMs when the node goes down? Should they restart, and after what delay?

Also, are there any restrictions on VM availability (based on storage, etc.)?

Comment 4 ctomasko 2020-10-12 14:38:41 UTC
This is a documentation bug only. Known issue: when a node is restarted in a UPI bare-metal deployment, the VM does not automatically restart on another node. Automatic restart is supported in IPI only. See the related Jira issue CNV-7548.

Comment 5 Pan Ousley 2020-10-12 20:46:57 UTC
Hi Jean-Francois, thanks for bringing this to our attention. I own the similar Jira story CNV-7443, where this work is being tracked, so I have taken ownership of this bug as well. In that story, I am documenting how a user can manually ensure that the VMI fails over when a node fails and a machine health check (MHC) is not enabled.

>>>> Also, are there any restrictions on VM availability (based on storage, etc.)?

I do not know the answer to this but would like to document this type of conceptual information alongside the recovery steps, if possible. Fabian, what do you think?
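
The behaviour described in comments 4 and 5 can be summarised as a small decision rule. The sketch below is only an illustration of what this thread states, not product code: the names `InstallType` and `vm_recovery_mode` are hypothetical, and the rule assumes that automatic restart needs both an IPI cluster and an enabled MHC, with every other combination requiring manual failover.

```python
from enum import Enum


class InstallType(Enum):
    """Cluster installation method (hypothetical helper type for this sketch)."""
    IPI = "installer-provisioned"
    UPI = "user-provisioned"


def vm_recovery_mode(install_type: InstallType, mhc_enabled: bool) -> str:
    """Return how a VM is expected to recover after its node fails.

    Assumption drawn from this thread only: automatic restart is supported
    on IPI clusters, and manual failover is needed when MHC is not enabled.
    """
    if install_type is InstallType.IPI and mhc_enabled:
        return "automatic"
    return "manual"


if __name__ == "__main__":
    # Print the expected recovery mode for every combination.
    for install_type in InstallType:
        for mhc_enabled in (True, False):
            mode = vm_recovery_mode(install_type, mhc_enabled)
            print(f"{install_type.name}, MHC enabled={mhc_enabled}: {mode}")
```

Under these assumptions, only the IPI-with-MHC case is reported as automatic. The manual recovery steps themselves are documented in the PR linked in comment 11.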

Comment 7 zhengwan 2020-10-13 01:13:57 UTC
(In reply to ctomasko from comment #4)
> This is a documentation bug, only. Known issue: When a node is restarted in
> UPI baremetal, the VM does not automatically restart on another node. Auto
> restart is supported in IPI only. See related jira CNV-7548

Is there any plan to support this feature, rather than just documenting that UPI is not supported?

Comment 8 Dan Kenigsberg 2020-10-13 06:23:51 UTC
> Is there any plan to support this feature, not just document UPI not supported?

I am afraid there is none. Supporting node recycling on UPI would be a major undertaking for OCP. Maybe Andrew Beekhof can point you to where you can ask for this in OpenShift.

Comment 9 Andrew Beekhof 2020-10-13 12:07:59 UTC
(In reply to Dan Kenigsberg from comment #8)
> > Is there any plan to support this feature, not just document UPI not supported?
> 
> I am afraid there is none. Supporting node recycling on UPI would be a major
> undertaking for OCP. Maybe Andrew Beekhof can point you to where you can ask
> for this in OpenShift.

Alberto Lamela, the Cloud team TL, would be a good place to start.

Comment 10 zhengwan 2020-10-19 12:29:02 UTC
(In reply to Andrew Beekhof from comment #9)
> (In reply to Dan Kenigsberg from comment #8)
> > > Is there any plan to support this feature, not just document UPI not supported?
> > 
> > I am afraid there is none. Supporting node recycling on UPI would be a major
> > undertaking for OCP. Maybe Andrew Beekhof can point you to where you can ask
> > for this in OpenShift.
> 
> Alberto Lamela the Cloud team TL would be a good place to start

Thank you very much. I will try to reach out to him.

Comment 11 Pan Ousley 2020-10-31 21:11:35 UTC
I have a PR ready for review which attempts to cover this BZ and the related Jira issue CNV-7443. I have already requested via Jira that either Fabian or Stu review it:

https://github.com/openshift/openshift-docs/pull/26963

@jsaucier, I would also appreciate your review for the sake of this BZ and the related customer cases. Thanks!

Comment 12 Jean-Francois Saucier 2020-11-02 16:44:29 UTC
@pousley I have added my review to the PR, thanks!

Comment 13 Pan Ousley 2020-11-05 17:02:17 UTC
As I stated on https://issues.redhat.com/browse/CNV-7443, I have merged the PR after applying peer review feedback. Thanks to Fabian and Jean-Francois for your reviews as well.

Vasiliy Sibirskiy had approved the PR from a QE perspective; Israel, can we move this to Verified?

Thanks!