1886437 – [Doc] Document VM behaviour when a node crash or die

Summary: [Doc] Document VM behaviour when a node crash or die

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Documentation
Sub Component:
Version:	2.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	2.5.0
Assignee:	Pan Ousley
QA Contact:	Israel Pinto
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-08 12:58 UTC by Jean-Francois Saucier
Modified:	2023-12-15 19:46 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-11-19 14:53:30 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	CNV-8188	0	None	None	None	2023-12-15 19:46:17 UTC

Description Jean-Francois Saucier 2020-10-08 12:58:03 UTC

Our documentation currently covers how to enable live migration for a VM and how to handle normal node maintenance operations.

However, it seems we lack information on what happens (or should happen) when a node in the cluster crashes and goes down completely without warning. What should happen to the VMs when the node goes down? Should they restart, and after what delay?

Also, are there any restrictions on VM availability (based on storage, etc.)?

Comment 4 ctomasko 2020-10-12 14:38:41 UTC

This is a documentation bug, only. Known issue: When a node is restarted in UPI baremetal, the VM does not automatically restart on another node. Auto restart is supported in IPI only. See related jira CNV-7548

Comment 5 Pan Ousley 2020-10-12 20:46:57 UTC

Hi Jean-Francois, thanks for bringing this to our attention. I own the similar Jira story CNV-7443, where this work is being tracked, so I have chowned this bug as well. In that story, I am documenting how a user can manually ensure that the VMI fails over when a node fails and MHC is not enabled.

>>>> Also, is there any restriction for the VM availability (based on storage, etc).

I do not know the answer to this but would like to document this type of conceptual information alongside the recovery steps, if possible. Fabian, what do you think?
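
Taken together, comments 4 and 5 suggest a simple decision rule for whether a VM is expected to restart on another node after a node failure. The following is a minimal Python sketch of that rule, not product logic. The names `ClusterSetup`, `vm_restarts_automatically`, and `recovery_path` are hypothetical. Requiring both an IPI installation and enabled machine health checks (MHC) for an automatic restart is an assumption inferred from the two comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSetup:
    """Hypothetical description of a cluster; not an OpenShift API object."""
    install_type: str   # "IPI" or "UPI"
    mhc_enabled: bool   # whether machine health checks (MHC) are enabled


def vm_restarts_automatically(setup: ClusterSetup) -> bool:
    # Comment 4: auto restart is supported in IPI only.
    # Comment 5: manual failover is needed when MHC is not enabled.
    # Assumption: both conditions must hold for an automatic restart.
    return setup.install_type.upper() == "IPI" and setup.mhc_enabled


def recovery_path(setup: ClusterSetup) -> str:
    """Return "automatic" or "manual" for a failed node under this sketch."""
    return "automatic" if vm_restarts_automatically(setup) else "manual"


if __name__ == "__main__":
    for setup in (
        ClusterSetup("IPI", True),
        ClusterSetup("IPI", False),
        ClusterSetup("UPI", False),
    ):
        print(setup, "->", recovery_path(setup))
```

Under this sketch, any case that returns "manual" is one where the manual failover steps tracked in CNV-7443 would apply.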

Comment 7 zhengwan 2020-10-13 01:13:57 UTC

(In reply to ctomasko from comment #4)
> This is a documentation bug, only. Known issue: When a node is restarted in
> UPI baremetal, the VM does not automatically restart on another node. Auto
> restart is supported in IPI only. See related jira CNV-7548

Is there any plan to support this feature, not just document UPI not supported?

Comment 8 Dan Kenigsberg 2020-10-13 06:23:51 UTC

> Is there any plan to support this feature, not just document UPI not supported?

I am afraid there is none. Supporting node recycling on UPI would be a major undertaking for OCP. Maybe Andrew Beekhof can point you to where you can ask for this in OpenShift.

Comment 9 Andrew Beekhof 2020-10-13 12:07:59 UTC

(In reply to Dan Kenigsberg from comment #8)
> > Is there any plan to support this feature, not just document UPI not supported?
> 
> I am afraid there is none. Supporting node recycling on UPI would be a major
> undertaking for OCP. Maybe Andrew Beekhof can point you to where you can ask
> for this in OpenShift.

Alberto Lamela, the Cloud team TL, would be a good place to start.

Comment 10 zhengwan 2020-10-19 12:29:02 UTC

(In reply to Andrew Beekhof from comment #9)
> (In reply to Dan Kenigsberg from comment #8)
> > > Is there any plan to support this feature, not just document UPI not supported?
> > 
> > I am afraid there is none. Supporting node recycling on UPI would be a major
> > undertaking for OCP. Maybe Andrew Beekhof can point you to where you can ask
> > for this in OpenShift.
> 
> Alberto Lamela, the Cloud team TL, would be a good place to start.

Thank you very much. I will try to reach out to him.

Comment 11 Pan Ousley 2020-10-31 21:11:35 UTC

I have a PR ready for review which attempts to cover this BZ and the related Jira issue CNV-7443. I have already requested via Jira that either Fabian or Stu review it:

https://github.com/openshift/openshift-docs/pull/26963

@jsaucier, I would also appreciate your review for the sake of this BZ and the related customer cases. Thanks!

Comment 12 Jean-Francois Saucier 2020-11-02 16:44:29 UTC

@pousley I did put my review in the PR, thanks!

Comment 13 Pan Ousley 2020-11-05 17:02:17 UTC

As I stated on https://issues.redhat.com/browse/CNV-7443, I have merged the PR after applying peer review feedback. Thanks to Fabian and Jean-Francois for your reviews as well.

Vasiliy Sibirskiy had approved the PR from a QE perspective; Israel, can we move this to Verified?

Thanks!

Comment 14 Pan Ousley 2020-11-19 14:53:30 UTC

View the published docs here: https://docs.openshift.com/container-platform/4.6/virt/virtual_machines/virt-triggering-vm-failover-resolving-failed-node.html

Closing this bug.