Bug 1946886
| Summary: | VM cloned from a VM (HPP) is stuck at "Starting" | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Guohua Ouyang <gouyang> |
| Component: | Storage | Assignee: | Alexander Wels <awels> |
| Status: | CLOSED NOTABUG | QA Contact: | Ying Cui <ycui> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.8.0 | CC: | aos-bugs, awels, cnv-qe-bugs, gouyang, yzamir |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1940296 | Environment: | |
| Last Closed: | 2021-04-08 11:53:14 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1940296 | | |
| Bug Blocks: | | | |
Description
Guohua Ouyang
2021-04-07 07:02:27 UTC
Cloning the bug to Storage for review. This problem can also be reproduced via these steps:

1. Add a source to a template (HPP).
2. Create a VM from the template.
3. Delete the source from the template.
4. Start the VM.

The VM is stuck at "Starting".

Alexander, can you please take a look? I saw you responding in the linked kubevirt-dev thread. Do we need more information to determine whether there is an issue to fix here?

The HPP storage class is what is called WaitForFirstConsumer (WFFC), and we recently introduced code into CDI and KubeVirt that respects WFFC until the VM runs, so what is happening is fully expected and not a problem. In the flow you describe, the following happens:

1. You create a DV, but the storage is WFFC, so the DV does NOT populate the PVC until a VM runs (that way we populate the PVC on the node the VM is scheduled to run on).
2. The PVC is not bound because it is waiting for its first consumer before binding, which also means the DV is not in the Succeeded phase.
3. Trying to clone a DV that is not in the Succeeded phase will not clone at all, because cloning an incomplete DV makes no sense.
4. Trying to start from the cloned DV (which is also not in the Succeeded phase) will not work, because the VM detects that the DV has not succeeded and will not start.

This worked the way you did it on OCS because OCS is shared storage and therefore uses immediate binding mode. There are two ways to make this work like it did before:

1. After creating the initial DV, start a VM that uses it. This gets the VM scheduled on a node, which in turn triggers CDI to populate the DV. Then stop the VM and do the rest as before. Because the DV is now in the Succeeded phase, the clone will also succeed (once you start the second VM). WFFC is applied again here, so the clone won't actually happen until the VM that uses the cloned DV is started, for the same reason: we don't want to clone until we know where the data is going.
2. Add the cdi.kubevirt.io/storage.bind.immediate.requested annotation to the DV (see the sketch at the end of this comment). This causes CDI to behave like before: it puts the data on a random node (which, by the way, might not be schedulable for a VM) and binds the PVC immediately.

For the second scenario:

> This should be a storage issue; reporting it in console kubevirt for review first, because the flow is quite normal:
> 1. Created an HPP VM; it could run well.
> 2. After some time, cloned a new VM from it.
> 3. Tried to start both VMs.

You cannot clone the disk of a running VM; the VM will actively be modifying it, so you have to shut the VM down first. The target PVC will also have events on it indicating that the source is in use.

Also note that this is completely unrelated to BZ#1924728, because the importer/cloner pod(s) are not even started, so there are no logs to point to. The only thing we could do that we are not doing right now is communicate, in the DV's conditions, the reason we are not doing something.

To recap, all of this is completely expected and correct behavior, because the hostpath storage uses WaitForFirstConsumer binding.
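For context, a minimal sketch of what a WaitForFirstConsumer storage class looks like; the class name and provisioner string here are assumptions and may not match the HPP deployment on the affected cluster:

```yaml
# Illustrative StorageClass with WaitForFirstConsumer binding.
# The name and provisioner are assumed values, not taken from this cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hostpath-provisioner
provisioner: kubevirt.io/hostpath-provisioner
reclaimPolicy: Delete
# PVCs using this class stay Pending until a consuming pod (here, the VM's
# launcher pod) is scheduled, which is why CDI delays import/clone until then.
volumeBindingMode: WaitForFirstConsumer
```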
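And a hedged sketch of workaround 2: a DataVolume carrying the bind-immediate annotation quoted above, so that CDI imports and binds the PVC right away instead of waiting for a consuming VM. The DV name, source URL, storage class name, and size are hypothetical placeholders:

```yaml
# Workaround 2 (sketch): request immediate binding on the DV.
# All names, the URL, and the size below are hypothetical.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: template-source-dv
  annotations:
    # Annotation referenced in the comment above; CDI then populates the PVC
    # on whatever node the importer pod lands on and binds it immediately.
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
spec:
  source:
    http:
      url: "http://example.com/images/disk.qcow2"  # hypothetical image source
  pvc:
    storageClassName: hostpath-provisioner         # assumed HPP class name
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 30Gi
```

With this annotation the DV should reach the Succeeded phase without a VM ever starting, after which the clone flow behaves as it did on immediate-binding storage.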