Bug 1901335

Summary:	[CNV][Chaos] Vm is not paused when connection to storage is lost
Product:	Container Native Virtualization (CNV)	Reporter:	Ondra Machacek <omachace>
Component:	Virtualization	Assignee:	lpivarc
Status:	CLOSED ERRATA	QA Contact:	Kedar Bidarkar <kbidarka>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	2.5.0	CC:	aasserzo, cnv-qe-bugs, dfediuck, fdeutsch, kbidarka, omachace, pkliczew, sgott, ycui
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	virt-operator-container-v4.8.0-60 hco-bundle-registry-container-v4.8.0-375	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 14:21:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1908661, 1926746

Description Ondra Machacek 2020-11-24 21:34:15 UTC

Description of problem:
When the connection to storage is lost, the VM is not paused. According to the libvirt documentation [1] for the driver's error_policy attribute, the default is not to pause the VM on an I/O error.

[1] https://libvirt.org/formatdomain.html#driver

Version-Release number of selected component (if applicable):
2.5.0

How reproducible:
always

Steps to Reproduce:
1. Create a VM with a DV on NFS.
2. Block the connection to NFS and change the metadata of the NFS volume (so that a stale file handle error occurs).
3. Re-enable the connection to NFS.
4. oc exec -it virt-launcher-vm-1-nts97 -- virsh domstate default_vm-1
running

Actual results:
Libvirt reports the running state, and the user cannot tell from the VMI status that there is any issue with the VM.

Expected results:
In the VMI status, the user should see that the VMI is paused. Libvirt should report the VM as paused.

Additional info:

When patching virt-launcher as follows:
<----------------8<----------------------8<--------------------------
diff --git a/pkg/virt-launcher/virtwrap/api/converter.go b/pkg/virt-launcher/virtwrap/api/converter.go
index ebfc81f7d..eec4e60ff 100644
--- a/pkg/virt-launcher/virtwrap/api/converter.go
+++ b/pkg/virt-launcher/virtwrap/api/converter.go
@@ -179,9 +179,10 @@ func Convert_v1_Disk_To_api_Disk(diskDevice *v1.Disk, disk *Disk, devicePerBus m
                }
        }
        disk.Driver = &DiskDriver{
+               ErrorPolicy: "stop",
        }
        if numQueues != nil && disk.Target.Bus == "virtio" {
                disk.Driver.Queues = numQueues
----------------8<----------------------8<--------------------------

The qemu VM is created with the args werror=stop,rerror=stop, so libvirt reports the VM as 'paused':

$ kubectl  exec -it virt-launcher-vm-1-r2gvz -- virsh domstate default_vm-1
paused
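
To illustrate how the one-line patch above reaches libvirt, here is a minimal sketch. It assumes a simplified DiskDriver struct: the real KubeVirt type has more fields, and the XML tags and the renderDriver helper below are assumptions modeled on libvirt's <driver> element, not copied from the KubeVirt source. It only shows that a non-empty ErrorPolicy would be serialized as the error_policy attribute of the disk driver.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// DiskDriver is a simplified stand-in for the struct patched in converter.go.
// The XML tags are assumptions based on libvirt's <driver> element.
type DiskDriver struct {
	XMLName     xml.Name `xml:"driver"`
	Name        string   `xml:"name,attr,omitempty"`
	Type        string   `xml:"type,attr,omitempty"`
	ErrorPolicy string   `xml:"error_policy,attr,omitempty"`
}

// renderDriver (hypothetical helper) marshals the driver element roughly
// as it would appear inside the domain XML handed to libvirt.
func renderDriver(d DiskDriver) (string, error) {
	out, err := xml.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	s, err := renderDriver(DiskDriver{Name: "qemu", Type: "raw", ErrorPolicy: "stop"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Leaving ErrorPolicy empty omits the attribute, which corresponds to the libvirt default described in the problem statement.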

But the condition in the VMI is as follows:

    - lastProbeTime: null
      lastTransitionTime: "2020-11-24T21:31:57Z"
      message: 'server error. command SyncVMI failed: "neither found block device
        nor regular file for volume rootdisk"'
      reason: Synchronizing with the Domain failed.
      status: "False"
      type: Synchronized

So the user still can't see the propagated state in the VMI.

So this BZ asks for an API to specify the error_policy of the disk, and also for fixing the state propagation in case of a failure on the disk file.
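
As a rough illustration of the API side of this request, the sketch below validates a user-supplied error policy before it would be passed on to the disk driver. The function name and the idea of treating an empty value as "keep the libvirt default" are assumptions for illustration only; the accepted values are the ones the libvirt documentation [1] lists for error_policy.

```go
package main

import "fmt"

// validErrorPolicies holds the error_policy values documented by libvirt
// for the disk <driver> element.
var validErrorPolicies = map[string]bool{
	"stop":     true,
	"report":   true,
	"ignore":   true,
	"enospace": true,
}

// validateErrorPolicy is a hypothetical API-side check. An empty value is
// treated as "leave libvirt's default in place" (an assumption of this sketch).
func validateErrorPolicy(policy string) error {
	if policy == "" || validErrorPolicies[policy] {
		return nil
	}
	return fmt.Errorf("unsupported error policy %q", policy)
}

func main() {
	for _, p := range []string{"", "stop", "pause"} {
		fmt.Printf("%q -> %v\n", p, validateErrorPolicy(p))
	}
}
```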

Comment 2 Israel Pinto 2020-11-25 09:03:52 UTC

Ondra:
Do you see it also with OCS?

Comment 3 Ondra Machacek 2020-11-25 09:43:29 UTC

The storage is not relevant here. The point here is that in case of any I/O failure, the VM is not paused, because of the default libvirt error policy. I added an example with NFS for simple reproduction steps. With OCS it should be the same.

Comment 4 Doron Fediuck 2020-11-25 11:15:22 UTC

Good catch.
While at it, please consider the resume policy as well.
For reference, see the available RHV options [1].

[1] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/virtual_machine_management_guide/index#Configuring_a_highly_available_virtual_machine

Comment 5 sgott 2020-11-25 13:08:16 UTC

Adam, we're treating this as a Virtualization bug, but just pinging you so you're aware. Feel free to move it to the Storage component if you think that is more appropriate.

Comment 6 lpivarc 2021-03-03 11:05:48 UTC

Hi,

there was an addition of `error_policy=stop` recently in https://github.com/kubevirt/kubevirt/pull/4840. Could you confirm that the only piece we are missing is the propagation to status/conditions?

The `resume` policy is specific to RHV and libvirt doesn't support it. Therefore I would like to ask for another feature request.

Comment 7 Ondra Machacek 2021-03-03 20:32:29 UTC

(In reply to lpivarc from comment #6)
> there was an addition of `error_policy=stop` recently in
> https://github.com/kubevirt/kubevirt/pull/4840. Could you confirm that the
> only piece we are missing is the propagation to status/conditions?
>

Yes, the conditions/status propagation is the last missing piece of this BZ, if it wasn't solved as part of that PR.

Comment 9 sgott 2021-06-07 11:58:06 UTC

To verify, follow the steps to reproduce in the description.

Comment 10 Kedar Bidarkar 2021-06-09 17:31:25 UTC

Tested with
a) virt-operator version 4.8.0-60
b) NFS Storage ( Configured my own NFS storage )

1) With the below config

[root@cnv-qe-01 pv101]# cat /etc/exports
/data/nfs_shares/bm01-cnvqe-rdu2 *(rw,sync,no_wdelay,no_root_squash,insecure)

[kbidarka@localhost ~]$ oc get vmi -o wide
NAME             AGE   PHASE     IP             NODENAME                                  LIVE-MIGRATABLE   PAUSED
vm2-rhel84       45m   Running   xx.yyy.d.s   node-13.redhat.com   True              

The VMI is running successfully.

2) By limiting the NFS export to only the NFS Server itself.

[root@cnv-qe-01 pv101]# cat /etc/exports
/data/nfs_shares/bm01-cnvqe-rdu2 localhost(rw,sync,no_wdelay,no_root_squash,insecure)

(cnv-tests) [kbidarka@localhost ~]$ oc get vmi -o wide
NAME             AGE   PHASE     IP             NODENAME                                  LIVE-MIGRATABLE   PAUSED
vm2-rhel84       46m   Running   xx.yyy.d.s   node-13.redhat.com   True              True

The VMI enters PAUSED state automatically.

~]$ oc rsh virt-launcher-vm2-rhel84-pq9lh
sh-4.4# virsh list
 Id   Name                 State
-----------------------------------
 1    default_vm2-rhel84   paused

Moving this bug to VERIFIED state.

Comment 11 Kedar Bidarkar 2021-06-09 18:16:24 UTC

See the below message:

    Message:               VMI was paused, IO error
    Reason:                PausedIOError
    Status:                True
    Type:                  Paused

---

 Volumes:
    Data Volume:
      Name:  rhel84-dv2
    Name:    datavolumedisk1
    Cloud Init No Cloud:
      User Data:  #cloud-config
password: redhat
chpasswd: { expire: False }
    Name:  cloudinitdisk
Status:
  Active Pods:
    ea34997a-968d-43a2-9ce8-0f0e5547247c:  node-13.redhat.com
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  <nil>
    Status:                True
    Type:                  LiveMigratable
    Last Probe Time:       <nil>
    Last Transition Time:  2021-06-09T16:39:00Z
    Status:                True
    Type:                  Ready
    Last Probe Time:       2021-06-09T16:39:09Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentConnected
    Last Probe Time:       2021-06-09T17:25:23Z
    Last Transition Time:  2021-06-09T17:25:23Z
    Message:               VMI was paused, IO error
    Reason:                PausedIOError
    Status:                True
    Type:                  Paused
  Guest OS Info:
    Id:              rhel
    Kernel Release:  4.18.0-287.el8.dt4.x86_64
    Kernel Version:  #1 SMP Thu Feb 18 13:31:55 EST 2021
    Name:            Red Hat Enterprise Linux
    Pretty Name:     Red Hat Enterprise Linux 8.4 (Ootpa)
    Version:         8.4
    Version Id:      8.4
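
The Paused condition shown above can be read as a mapping from the cause of the pause to a VMI condition. The following is a minimal sketch of that idea, not KubeVirt's actual implementation: the Condition type and pausedCondition function are hypothetical, and only the Type, Reason, and Message strings of the I/O-error case are taken from the output above. The fallback reason is an illustrative placeholder.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for a VMI status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// pausedCondition returns the condition to publish for a paused domain.
// ioError says whether the pause was attributed to a disk I/O error.
func pausedCondition(ioError bool) Condition {
	c := Condition{Type: "Paused", Status: "True"}
	if ioError {
		// Strings as reported in the verification output of this bug.
		c.Reason = "PausedIOError"
		c.Message = "VMI was paused, IO error"
	} else {
		// Hypothetical placeholder for any other pause cause.
		c.Reason = "PausedOther"
		c.Message = "VMI was paused"
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", pausedCondition(true))
}
```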

Comment 14 errata-xmlrpc 2021-07-27 14:21:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920