Bug 1857784 - [v2v][RHV to CNV VM import] Add bound condition in VM import CR to indicate if PVC is bound
Summary: [v2v][RHV to CNV VM import] Add bound condition in VM import CR to indicate i...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: V2V
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 2.5.0
Assignee: Ondra Machacek
QA Contact: Ilanit Stein
URL:
Whiteboard:
Duplicates: 1858548 (view as bug list)
Depends On:
Blocks: 1857786
 
Reported: 2020-07-16 14:54 UTC by Ilanit Stein
Modified: 2020-12-07 09:53 UTC (History)
7 users (show)

Fixed In Version: 2.5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1857786 (view as bug list)
Environment:
Last Closed: 2020-12-07 09:53:19 UTC
Target Upstream Version:
Embargoed:
istein: needinfo+




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt vm-import-operator pull 360 0 None closed Add pending condition 2021-02-12 16:14:30 UTC

Description Ilanit Stein 2020-07-16 14:54:03 UTC
Description of problem:
When running a VM import and there is no storage PV that fits the import needs, the VM import stays forever in an "Importing" state, with the progress bar stuck at 10%.

VM events do not indicate what the problem is.

Navigating to the pods view shows that the importer pod is in a Pending state, and its events show:
"Failed to bind volumes: provisioning failed for PVC "rhel7-vm-9f242f97-38df-42e9-9489-ae975c0e0383"

Cases of not being able to allocate storage:
1. This case happened to me when I tried a VM import using "Standard" Cinder storage, which was picked by default since I had not set the target storage to any storage class.

2. Another case was on bare metal CNV, where the Ceph storage had almost run out of space (it reached its limit and could no longer be used).

Please fail the VM import in such cases, to indicate to the user why the VM import cannot proceed.

Version-Release number of selected component (if applicable):
CNV-2.4

Comment 1 Ilanit Stein 2020-07-16 15:13:02 UTC
Alexander Wels:
It will try forever; remember, this is a declarative system. You ask it to reach a state and it will try to do so. If it can't, it will keep trying in case something changes that allows it to reach the desired state.

Comment 2 Piotr Kliczewski 2020-07-16 15:17:45 UTC
@Peter, what do you think? Should we fail it or let the user cancel it?

Comment 3 Piotr Kliczewski 2020-07-16 15:42:05 UTC
@Alex, this BZ was opened because the PVC won't bind at all due to "Failed to provision volume with StorageClass "standard": invalid AccessModes [ReadWriteMany]: only AccessModes [ReadWriteOnce] are supported". The only way out is to create a new one with a different SC. The user decided to go with either the default storage class or a mapping they created. It seems like we should fail the import due to the user-provided configuration.

Maybe we should check the SC's accessMode before creating DVs.

Comment 4 Piotr Kliczewski 2020-07-16 16:18:58 UTC
I think we need to check whether ReadWriteMany is supported by the SC that the user provided. The question is whether we should time out if the PVC stays Pending for too long.

Comment 5 Alexander Wels 2020-07-16 17:31:04 UTC
So a few things.

You cannot tell from a storage class what accessModes it provides. Also, only admin users will be able to inspect the storage class, as regular users will not have permissions to do so. This is a problem we have been struggling with for a while now. We are working on some tooling to help us tell the user what options are available, but until we have that tooling there is no way to determine whether a PVC you create will actually be able to be bound to a PV.

Let me try to explain the basics of storage in Kubernetes. There is nothing that says a storage class has to exist; you can make Persistent Volumes (PVs) without a storage class. A storage class is mostly just convenient naming for users. I am also going to ignore storage classes without provisioners, like static NFS; for those, PVs are created ahead of time by the administrator. So the below applies to dynamic provisioners like Ceph or the hostpath provisioner.

When a user wants to use some storage, they will make a PersistentVolumeClaim (PVC). The PVC has some attributes that Kubernetes uses to match it with a PV.
1. Size: in order to match a PVC with a PV, the PV capacity must be >= the PVC request size.
2. Access modes: the PV must be able to satisfy the accessModes requested by the PVC.
3. Volume mode: the PV must have the same volumeMode the PVC requests.
4. Storage class name: if the PVC specifies a storage class, then the matched PV must have the same storage class.
  a. A storage class can specify a provisioner; if a matching PV doesn't exist, the provisioner will create a PV that should match the above criteria. The created PV will have an annotation about which provisioner created it.
  b. The annotation on the PV must match the provisioner defined in the storage class (if the storage class defines a provisioner).

If a PV matches all the above with a PVC, then the PVC can be bound to the PV.
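
For illustration, a minimal PVC sketch (name, size, and storage class are hypothetical) annotated with the matching attributes described above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-import-disk        # hypothetical name
spec:
  storageClassName: standard       # (4) the matched PV must carry the same storage class
  accessModes:
  - ReadWriteMany                  # (2) the PV must satisfy the requested access modes
  volumeMode: Filesystem           # (3) the PV must have the same volume mode
  resources:
    requests:
      storage: 10Gi                # (1) the PV capacity must be >= this request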

Now this 'bug' is about the Ceph provisioner being unable to create a PV of the requested size because there was no more storage available. So what happens in that case? The provisioner attempted to create a PV to satisfy the request but couldn't, so no PV is available to be bound to the PVC. So the PVC simply remains pending until a PV becomes available that can bind to it. This is normal Kubernetes behavior. A similar but slightly different problem occurred with the 'standard' storage class. The provisioner for that storage class attempted to create a PV with RWX accessMode; apparently this provisioner does not support RWX, and thus it failed to create a PV for the requested PVC. And again the PVC remains pending until a PV becomes available that it can bind to.

As I started out saying, you cannot check a SC for a particular access mode. In fact it is entirely possible that at one point in time RWX access mode is not available, but the provisioner gets updated and now RWX is available, or vice versa. So you can't even store somewhere saying this SC has these capabilities as they can change at any time.

To wrap up, your code needs to be able to handle PVCs not binding for whatever reason. This is expected behaviour in Kubernetes and runs counter to traditional virtualization, where if something doesn't happen within a certain period, something bad happened. The usual way to handle these situations is to have some kind of condition on your CRD that tells the user the current status of their request. For instance, DataVolumes have a bound condition that is true if the PVC is bound and false if it is not. So the question is not 'do we time out'; that is the wrong question. The question is how do we inform the user of the current state of their request in a way that lets them make an informed decision about what to do next.

Comment 6 Piotr Kliczewski 2020-07-16 17:51:40 UTC
Alex, Thank you for the explanation.

It seems like we need to add a condition stating that the PVC is in a Pending state. Whenever the PVC gets bound, we need to remove that condition and continue with the import.
There is always the possibility to cancel.

Comment 7 Alexander Wels 2020-07-16 17:55:55 UTC
The DataVolume has a condition indicating whether the PVC is bound; you can use that for your condition.

Comment 8 Ilanit Stein 2020-07-16 18:11:07 UTC
VMware to CNV VM import differs from RHV to CNV VM import in the case where there is no match to the available storage.

For RHV to CNV VM import:
The VM state is "Importing", the progress bar shows 10%,
and VM events do not indicate any problem.
The error on the PVC is displayed in the events of the pending importer pod.

For VMware to CNV VM import:
The VM import state will be "Import pending",
and VM events will show the issue.
For example: 
"running "VolumeBinding" filter plugin for pod "kubevirt-v2v-conversion-v2v-rhel8-vm-vng9b": pod has unbound immediate PersistentVolumeClaims"

So it seems that for RHV to CNV VM import, the problem (unbound immediate PersistentVolumeClaims) is not indicated in the VMs view.

This bug asks to fail the VM import from RHV to CNV.
If the decision is not to do so,
please improve the problem indication within the VMs view, so users know what problem is preventing the import from progressing.
It can be similar to what we have for VMware to CNV.

Comment 9 Piotr Kliczewski 2020-07-17 10:24:42 UTC
As Alex suggested, we need to add a condition stating that the PVC is Pending, and we should let the user decide what to do. We will use the DV condition to determine it.

Comment 10 Alexander Wels 2020-07-23 11:38:19 UTC
The new title is not suitable for a declarative system. There is no such thing as "there is no suitable storage available". It is entirely possible something in the system changes that makes it possible for the system to satisfy the request at a later time. Or it could be that the current provisioner can satisfy the request but is just really slow, and it takes a while to satisfy the request. For instance, I was creating a DataVolume on a cloud one time, and the cloud storage took 5 minutes to satisfy my request for a PV. In order to communicate the current state of the import, you should add some conditions to the VM import status object that can tell the user what the current state is. For example, in CDI for an import we have the following information in the DataVolume status object:

- Phase: This is basically a state machine; it goes from pending, to in progress, to complete.
- Conditions:
  - Bound: True if all PVCs required for the import are bound, false otherwise
  - Running: True if the container in the importer pod is running. This can flip from true to false and back in case of a problem and the container exits and restarts.
  - Ready: True if the datavolume is ready to be consumed. False otherwise. Guaranteed to be false if bound is not true or if running is true.

From the phase and the conditions, the user can determine the current state of the import in CDI. A computer can also determine the state based on this information. Note that the conditions are related to the phase but do not follow it; they are orthogonal to each other and to the phase.
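
As a rough illustration (field values are examples and not guaranteed to match CDI's exact reason/message strings), a DataVolume status carrying the phase and the three conditions could look like:

status:
  phase: Pending                   # state machine: pending -> in progress -> complete
  conditions:
  - type: Bound
    status: "False"                # a PVC required for the import is not bound yet
    message: PVC example-dv Pending    # illustrative message text
  - type: Running
    status: "False"                # the importer container is not running
  - type: Ready
    status: "False"                # guaranteed false while Bound is false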

In short, I would change the title to say something like:

"Have a bound condition in the VM import CR to indicate if the PVC is bound." If you use DataVolumes in the VM import, you can have it follow the DV bound condition.

Comment 11 Piotr Kliczewski 2020-07-27 10:25:32 UTC
*** Bug 1858548 has been marked as a duplicate of this bug. ***

Comment 12 Ilanit Stein 2020-08-03 06:23:57 UTC
@Ondra, @Alexander,

Is it possible to add to the "bound condition" the reason the bind cannot be performed at that time?

The "pending PVC to bind" reasons for this can be various. 
For example:
1. VM import with disk larger than available PVs 
2. VM import where there is no "default" storage class, and there is no storage class marked as "no sc"

Therefore, it would be a good user experience to know why the bind is not performed,
in addition to knowing the VM import is pending on the PVC bind.

Comment 13 Piotr Kliczewski 2020-08-03 07:00:32 UTC
@Ilanit,

I think we could copy the condition from the DV with the corresponding message. If we need more information, let's make sure it is available on the DV first.
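
A rough sketch of how the copied condition could surface on the VirtualMachineImport status (this mirrors the shape later shown in the verification in comment 16; the DataVolume name is illustrative):

status:
  conditions:
  - type: Processing
    status: "True"
    reason: Pending
    message: DataVolume example-dv is pending to bound    # message propagated from the DV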

Comment 14 Ilanit Stein 2020-08-03 09:05:09 UTC
@Piotr,
Thanks.
I hope the DV condition reports it.

Comment 15 Alexander Wels 2020-08-03 11:42:07 UTC
The DV doesn't report the message from the PVC. But it is probably something we could do.

Comment 16 Ilanit Stein 2020-10-01 14:20:06 UTC
Verified on CNV-2.5 deployed from OSBS on Sep 30, 2020.
A Fedora 32 VM import using unsupported "standard" storage: the VM import stays forever at 10%, with the message:

Importing (RHV)
v2v-fedora32 is being imported.
Pending: DataVolume v2v-fedora32-8737b4f7-2b6b-4801-abc9-e307f838b337 is pending to bound
....10%

   
VM import resource status:
status:
  conditions:
  - lastHeartbeatTime: "2020-10-01T13:48:34Z"
    lastTransitionTime: "2020-10-01T13:48:34Z"
    message: Validation completed successfully
    reason: ValidationCompleted
    status: "True"
    type: Valid
  - lastHeartbeatTime: "2020-10-01T13:48:34Z"
    lastTransitionTime: "2020-10-01T13:48:34Z"
    message: 'VM specifies IO Threads: 1, VM has NUMA tune mode secified: interleave,
      Interface f9259d37-c965-4d53-9828-fc4ca757fa55 uses profile with a network filter
      with ID: d2370ab4-fee3-11e9-a310-8c1645ce738e.'
    reason: MappingRulesVerificationReportedWarnings
    status: "True"
    type: MappingRulesVerified
  - lastHeartbeatTime: "2020-10-01T14:02:03Z"
    lastTransitionTime: "2020-10-01T13:48:34Z"
    message: DataVolume v2v-fedora32-8737b4f7-2b6b-4801-abc9-e307f838b337 is pending  <------
      to bound
    reason: Pending  <------
    status: "True"
    type: Processing
  dataVolumes:
  - name: v2v-fedora32-8737b4f7-2b6b-4801-abc9-e307f838b337
  targetVmName: v2v-fedora32

