Description of problem:
I have an HA VM with a lease on an NFSv3 storage domain. When the host running the VM shuts down unexpectedly, the VM gets stuck in Unknown status instead of failing over to a different, active host.

Version-Release number of selected component (if applicable):
rhvm-4.2.2.1-0.1.el7.noarch
vdsm-4.20.19-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a disk on an NFSv3 domain and an HA lease on an NFSv3 domain
2. Start the VM
3. Block the connection between the engine and the host running the VM, and between that same host and the storage domain holding the VM's disk and lease

Actual results:
The VM is stuck in UNKNOWN status.

Expected results:
The VM should start on another active host.

Additional info:
engine.log:
2018-02-27 19:29:46,209+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-4030) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_0_TestCase17618_2719245588 was set to the Unknown status.
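Step 3 can be simulated with firewall rules on the host running the VM. This is only an illustrative config fragment; the IP addresses are placeholders, not values from this environment:

```shell
# Run as root on the host running the VM (addresses are hypothetical).
ENGINE_IP=10.0.0.10    # engine/RHV-M address
STORAGE_IP=10.0.0.20   # NFS server address

# Drop traffic to/from the engine.
iptables -A OUTPUT -d "$ENGINE_IP" -j DROP
iptables -A INPUT  -s "$ENGINE_IP" -j DROP

# Drop traffic to/from the storage domain's NFS server.
iptables -A OUTPUT -d "$STORAGE_IP" -j DROP
iptables -A INPUT  -s "$STORAGE_IP" -j DROP
```

The rules can be removed afterwards with the matching `iptables -D` commands or by flushing the chains.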
vdsm.log:
2018-02-27 19:31:49,389+0200 ERROR (vm/da5b0c4a) [virt.vm] (vmId='da5b0c4a-a7f1-4980-aeff-87a4d1973056') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2813, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor:
2018-02-27T17:31:47.088766Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/d475e955-7b00-4a3a-b28f-a8ae5cd67362/2783ea0f-6292-4fad-b930-e76d9bacbf98,format=qcow2,if=none,id=drive-virtio-disk0,serial=d475e955-7b00-4a3a-b28f-a8ae5cd67362,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-02-27T17:31:47.127937Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/d475e955-7b00-4a3a-b28f-a8ae5cd67362/2783ea0f-6292-4fad-b930-e76d9bacbf98,format=qcow2,if=none,id=drive-virtio-disk0,serial=d475e955-7b00-4a3a-b28f-a8ae5cd67362,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?
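The 'Failed to get "write" lock' message comes from QEMU's image locking: the dead host's QEMU still holds an advisory lock on the image, and on NFSv3 the server does not know the client is gone, so the lock lingers and blocks a restart elsewhere. A minimal sketch of that contention (using `flock` for simplicity; QEMU actually uses OFD `fcntl` locks, and here both "processes" are just two descriptors in one process):

```python
import fcntl
import os
import tempfile

# Scratch file standing in for the qcow2 image.
fd, path = tempfile.mkstemp()
os.close(fd)

holder = open(path, "r+b")      # stands in for the original QEMU process
contender = open(path, "r+b")   # stands in for QEMU on the failover host

# The holder takes an exclusive ("write") lock on the image.
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)

# While the lock is held, the contender cannot acquire it.
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    got_lock = True
except BlockingIOError:
    got_lock = False   # analogous to: Failed to get "write" lock

print(got_lock)  # False

# Once the holder's descriptor is closed the lock is released and the
# contender succeeds. On NFSv3, a crashed client never closes its
# descriptor, so the server-side lock persists until lockd recovery.
holder.close()
fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
contender.close()
os.unlink(path)
```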
Created attachment 1401914 [details] logs
Please also test the case where we power off the host, simulating a power loss.
The root cause seems to be bug 1547095. We plan to work around this issue using the NFS "nolock" mount option. Setting that bug as a dependency until we have a tested patch.
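For reference, with "nolock" the client handles lock requests locally instead of going through the server's lockd, so a crashed host's locks cannot linger on the server. A hypothetical manual mount would look like this (server, export, and mount point are placeholders; in RHV the option would go into the storage domain's mount options rather than a manual mount):

```shell
# Illustrative only - hypothetical server and export.
# "nolock" makes NFS locking local to the client (see nfs(5)),
# so locks die with the client instead of persisting on the server.
mount -t nfs -o vers=3,nolock nfs-server.example.com:/export/domain /mnt/test
```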
Raz, can we test the patch with latest engine 4.2? Note that you will need the fix from bug 1548819 for engine to work with vdsm master with this patch.
(In reply to Nir Soffer from comment #4) > Raz, can we test the patch with latest engine 4.2? > > Note that you will need the fix from bug 1548819 for engine to work with vdsm > master with this patch. Elad, can you take care of that?
(In reply to Nir Soffer from comment #2)
> Please test also the case when we power off the host, simulating power loss.

The results look the same: the VM is stuck in UNKNOWN status, and the vdsm log shows the same error:

2018-03-01 16:07:43,577+0200 ERROR (vm/999c4968) [virt.vm] (vmId='999c4968-9443-4e30-bdc8-13f337efde5b') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2813, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor:
2018-03-01T14:07:41.368118Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/3c161b7a-39f7-4e28-b64d-8da90049c44f/2f131f26-dc85-439b-8089-47b77bc96491,format=qcow2,if=none,id=drive-virtio-disk0,serial=3c161b7a-39f7-4e28-b64d-8da90049c44f,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-03-01T14:07:41.391107Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/3c161b7a-39f7-4e28-b64d-8da90049c44f/2f131f26-dc85-439b-8089-47b77bc96491,format=qcow2,if=none,id=drive-virtio-disk0,serial=3c161b7a-39f7-4e28-b64d-8da90049c44f,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?
Created attachment 1403728 [details] Power off host logs
We no longer depend on bug 1547095.
Nir, can you please add some doctext explaining the change, and how the admin can override it if needed?
Tested according to the steps from the description:
1) Started a VM with a lease and disk on an NFSv3 domain (on host_mixed_3)
2) Blocked the connection from the host (host_mixed_3) running the VM (non-SPM) to the storage, and from the engine to the host (host_mixed_3)

Results: the VM started on another host (host_mixed_2)
=============================================================
VM started on host_mixed_3:
2018-03-15 18:30:10,524+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [235586a8] VM 'bbc69a02-b57c-4c62-8f15-7608d62dae1d'(test1) moved from 'PoweringUp' --> 'Up'

host_mixed_3 becomes unreachable:
2018-03-15 18:33:42,158+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-35) [610e3b8e] Host 'host_mixed_3' is not responding.

VM started on host_mixed_2:
2018-03-15 18:36:27,140+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-9) [] VM 'bbc69a02-b57c-4c62-8f15-7608d62dae1d'(test1) was unexpectedly detected as 'PoweringUp' on VDS '53a89450-a346-4002-8d93-82c266c50c20'(host_mixed_2) (expected on 'babf91ab-84e7-4dce-84a4-5eaf0dd5aaf1')
=============================================================
Used:
rhevm-4.1.10.3-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64
libvirt-3.9.0-14.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
sanlock-3.6.0-1.el7.x86_64