Description of problem:

When the connection to storage is lost, the VM is not paused. According to the libvirt documentation [1] for the driver's error_policy attribute, the default is not to pause the VM on I/O errors.

[1] https://libvirt.org/formatdomain.html#driver

Version-Release number of selected component (if applicable):
2.5.0

How reproducible:
Always

Steps to Reproduce:
1. Create a VM with a DV on NFS.
2. Block the connection to NFS and change the metadata of the NFS volume (so a stale file handle error occurs).
3. Re-enable the connection to NFS.
4. oc exec -it virt-launcher-vm-1-nts97 -- virsh domstate default_vm-1
   running

Actual results:
Libvirt reports the domain as running, and in the VMI status the user cannot detect that there is any issue with the VM.

Expected results:
In the VMI status the user should see that the VMI is paused. Libvirt should report the VM as paused.

Additional info:
When patching virt-launcher as follows:

<----------------8<----------------------8<--------------------------
diff --git a/pkg/virt-launcher/virtwrap/api/converter.go b/pkg/virt-launcher/virtwrap/api/converter.go
index ebfc81f7d..eec4e60ff 100644
--- a/pkg/virt-launcher/virtwrap/api/converter.go
+++ b/pkg/virt-launcher/virtwrap/api/converter.go
@@ -179,9 +179,10 @@ func Convert_v1_Disk_To_api_Disk(diskDevice *v1.Disk, disk *Disk, devicePerBus m
 		}
 	}
 	disk.Driver = &DiskDriver{
+		ErrorPolicy: "stop",
 	}
 	if numQueues != nil && disk.Target.Bus == "virtio" {
 		disk.Driver.Queues = numQueues
----------------8<----------------------8<--------------------------

the QEMU VM is created with the arguments werror=stop,rerror=stop, so libvirt reports the VM as 'paused':

$ kubectl exec -it virt-launcher-vm-1-r2gvz -- virsh domstate default_vm-1
paused

But the condition in the VMI is as follows:

  - lastProbeTime: null
    lastTransitionTime: "2020-11-24T21:31:57Z"
    message: 'server error. command SyncVMI failed: "neither found block device
      nor regular file for volume rootdisk"'
    reason: Synchronizing with the Domain failed.
    status: "False"
    type: Synchronized

So the user still can't see the propagated state in the VMI.

This BZ is therefore asking for an API to specify the error_policy of the disk, and also for fixing the state propagation in case of a failure on the disk file.
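For illustration only, below is a minimal, self-contained Go sketch of what the patch above effectively does. The DiskDriver/Disk types here are simplified stand-ins for the virtwrap API types, not the actual KubeVirt schema; the point is that putting error_policy="stop" on the <driver> element is what makes libvirt start QEMU with werror=stop,rerror=stop.

package main

import (
	"encoding/xml"
	"fmt"
)

// Simplified stand-ins for the virtwrap API types (not the real KubeVirt
// schema); the attribute names mirror the libvirt <driver> element.
type DiskDriver struct {
	Name        string `xml:"name,attr,omitempty"`
	Type        string `xml:"type,attr,omitempty"`
	ErrorPolicy string `xml:"error_policy,attr,omitempty"`
}

type Disk struct {
	XMLName xml.Name    `xml:"disk"`
	Driver  *DiskDriver `xml:"driver"`
}

func main() {
	// error_policy="stop" tells libvirt to pause the guest on I/O errors,
	// which translates into werror=stop,rerror=stop on the QEMU command line.
	disk := Disk{Driver: &DiskDriver{Name: "qemu", Type: "raw", ErrorPolicy: "stop"}}
	out, _ := xml.MarshalIndent(disk, "", "  ")
	fmt.Println(string(out))
	// Prints roughly:
	// <disk>
	//   <driver name="qemu" type="raw" error_policy="stop"></driver>
	// </disk>
}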
Ondra: Do you see it also with OCS?
The storage backend is not relevant here. The point is that, because of the default libvirt error policy, the VM is not paused on any I/O failure. I added an NFS example because it gives simple reproduction steps; with OCS it should behave the same.
Good catch. While at it, please consider the resume policy as well. For reference, see the available RHV options [1]. [1] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/virtual_machine_management_guide/index#Configuring_a_highly_available_virtual_machine
Adam, we're treating this as a Virtualization bug, but just pinging you so you're aware. Feel free to move it to the Storage component if you think that is more appropriate.
Hi, `error_policy=stop` was recently added in https://github.com/kubevirt/kubevirt/pull/4840. Could you confirm that the only piece we are missing is the propagation to status/conditions? The `resume` policy is specific to RHV and libvirt doesn't support it, so please file that as a separate feature request.
(In reply to lpivarc from comment #6)
> there was an addition of `error_policy=stop` recently in
> https://github.com/kubevirt/kubevirt/pull/4840. Could you confirm that the
> only piece we are missing is the propagation to status/conditions?

Yes, the conditions/status propagation is the last missing piece for this BZ, if it wasn't solved as part of that PR.
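To make the kind of propagation being discussed concrete, here is a purely illustrative Go sketch, using hypothetical simplified types rather than the actual KubeVirt controller code: a domain that libvirt reports as paused because of an I/O error is surfaced as a condition on the VMI status, similar to the PausedIOError condition shown in the verification output below.

package main

import "fmt"

// Hypothetical, simplified types standing in for the domain state reported
// by virt-launcher and for a VMI status condition; these are not the real
// KubeVirt API definitions.
type DomainState struct {
	State  string // e.g. "Paused"
	Reason string // e.g. "IOError"
}

type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// syncPausedCondition sketches how a controller could surface a domain that
// is paused because of an I/O error as a user-visible VMI condition.
func syncPausedCondition(dom DomainState, conds []Condition) []Condition {
	if dom.State == "Paused" && dom.Reason == "IOError" {
		conds = append(conds, Condition{
			Type:    "Paused",
			Status:  "True",
			Reason:  "PausedIOError",
			Message: "VMI was paused, IO error",
		})
	}
	return conds
}

func main() {
	conds := syncPausedCondition(DomainState{State: "Paused", Reason: "IOError"}, nil)
	fmt.Printf("%+v\n", conds)
}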
To verify, follow the steps to reproduce in the description.
Tested with:
a) virt-operator version 4.8.0-60
b) NFS storage (configured my own NFS storage)

1) With the below config:

[root@cnv-qe-01 pv101]# cat /etc/exports
/data/nfs_shares/bm01-cnvqe-rdu2 *(rw,sync,no_wdelay,no_root_squash,insecure)

[kbidarka@localhost ~]$ oc get vmi -o wide
NAME         AGE   PHASE     IP           NODENAME             LIVE-MIGRATABLE   PAUSED
vm2-rhel84   45m   Running   xx.yyy.d.s   node-13.redhat.com   True

The VMI is running successfully.

2) By limiting the NFS export to only the NFS server itself:

[root@cnv-qe-01 pv101]# cat /etc/exports
/data/nfs_shares/bm01-cnvqe-rdu2 localhost(rw,sync,no_wdelay,no_root_squash,insecure)

(cnv-tests) [kbidarka@localhost ~]$ oc get vmi -o wide
NAME         AGE   PHASE     IP           NODENAME             LIVE-MIGRATABLE   PAUSED
vm2-rhel84   46m   Running   xx.yyy.d.s   node-13.redhat.com   True              True

The VMI enters PAUSED state automatically.

~]$ oc rsh virt-launcher-vm2-rhel84-pq9lh
sh-4.4# virsh list
 Id   Name                 State
-----------------------------------
 1    default_vm2-rhel84   paused

Moving this bug to VERIFIED state.
See the below message:

    Message:               VMI was paused, IO error
    Reason:                PausedIOError
    Status:                True
    Type:                  Paused

---

  Volumes:
    Data Volume:
      Name:  rhel84-dv2
    Name:    datavolumedisk1
    Cloud Init No Cloud:
      User Data:  #cloud-config
                  password: redhat
                  chpasswd: { expire: False }
    Name:  cloudinitdisk
Status:
  Active Pods:
    ea34997a-968d-43a2-9ce8-0f0e5547247c:  node-13.redhat.com
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  <nil>
    Status:                True
    Type:                  LiveMigratable
    Last Probe Time:       <nil>
    Last Transition Time:  2021-06-09T16:39:00Z
    Status:                True
    Type:                  Ready
    Last Probe Time:       2021-06-09T16:39:09Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentConnected
    Last Probe Time:       2021-06-09T17:25:23Z
    Last Transition Time:  2021-06-09T17:25:23Z
    Message:               VMI was paused, IO error
    Reason:                PausedIOError
    Status:                True
    Type:                  Paused
  Guest OS Info:
    Id:              rhel
    Kernel Release:  4.18.0-287.el8.dt4.x86_64
    Kernel Version:  #1 SMP Thu Feb 18 13:31:55 EST 2021
    Name:            Red Hat Enterprise Linux
    Pretty Name:     Red Hat Enterprise Linux 8.4 (Ootpa)
    Version:         8.4
    Version Id:      8.4
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920