Bug 1550127 - [NFS v3] HA VM is stuck in unknown status after ungraceful shutdown
Summary: [NFS v3] HA VM is stuck in unknown status after ungraceful shutdown
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.2.2
Target Release: ---
Assignee: Nir Soffer
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-28 15:06 UTC by Lilach Zitnitski
Modified: 2018-11-28 13:49 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Qemu takes locks on virtual machine images on NFSv3 shared storage. If a host was disconnected from the network, had a power failure, or suffered another fatal failure, these locks become stale.
Consequence: Starting the virtual machine on another host would fail when qemu tries to acquire the stale locks.
Fix: We now use the "nolock" option for NFSv3 mounts. Qemu then uses local locks, which are effective only on the host running the virtual machine and cannot prevent starting the virtual machine on another host.
Result: The virtual machine can be started on another host after a fatal failure of the original host.
Additional info: If locking on NFSv3 shared storage is desired, and the issue with stale locks does not affect the workload, users can re-enable locking on a storage domain by adding the "lock" option to the "Additional mount options" field.
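
To make the Fix above concrete, here is a minimal sketch of the kind of option handling it describes. The function name, signature, and defaults are assumptions for illustration, not the actual vdsm code:

def normalize_nfs_options(version, user_options=""):
    """
    Build the NFS mount options string for a storage domain.

    version: NFS protocol version string, e.g. "3" or "4.1".
    user_options: the domain's "Additional mount options" field.

    For NFSv3, default to "nolock" so qemu falls back to local POSIX
    locks; a stale NLM lock left by a crashed host can then no longer
    block starting the VM on another host.  A user-supplied "lock"
    (or "nolock") wins over the default.
    """
    options = [opt for opt in user_options.split(",") if opt]
    if version == "3" and "lock" not in options and "nolock" not in options:
        options.append("nolock")
    return ",".join(options)

# Examples (hypothetical values):
assert normalize_nfs_options("3") == "nolock"
assert normalize_nfs_options("3", "soft,timeo=100") == "soft,timeo=100,nolock"
assert normalize_nfs_options("3", "lock") == "lock"      # user opts back in
assert normalize_nfs_options("4.1", "soft") == "soft"    # NFSv4 unaffected

The key design point is that a user-supplied "lock" (or "nolock") in the "Additional mount options" field takes precedence over the new default, which is how the opt-out described under Additional info works.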
Clone Of:
Environment:
Last Closed: 2018-03-22 15:50:33 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2?
rule-engine: blocker+


Attachments
logs (584.71 KB, application/zip)
2018-02-28 15:06 UTC, Lilach Zitnitski
Power off host logs (175.02 KB, application/zip)
2018-03-04 10:00 UTC, Lilach Zitnitski


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1547095 0 unspecified CLOSED QEMU image locking on NFSv3 prevents VMs from getting restarted on different hosts upon an host crash, seen on RHEL 7.5 2021-09-09 13:14:28 UTC
oVirt gerrit 88317 0 master MERGED nfs: Disable NFSv3 locks 2020-10-09 20:30:12 UTC
oVirt gerrit 88333 0 master MERGED nfs: Add tests for NFSv3 locking 2020-10-09 20:30:12 UTC
oVirt gerrit 88488 0 ovirt-4.2 MERGED nfs: Disable NFSv3 locks 2020-10-09 20:30:12 UTC
oVirt gerrit 88489 0 ovirt-4.2 MERGED nfs: Add tests for NFSv3 locking 2020-10-09 20:30:12 UTC
oVirt gerrit 88490 0 ovirt-4.1 MERGED nfs: Disable NFSv3 locks 2020-10-09 20:30:12 UTC
oVirt gerrit 88491 0 ovirt-4.1 MERGED nfs: Add tests for NFSv3 locking 2020-10-09 20:30:22 UTC

Internal Links: 1547095

Description Lilach Zitnitski 2018-02-28 15:06:01 UTC
Description of problem:
I have an HA VM with a lease on an NFS v3 storage domain.
When the host running the VM shuts down unexpectedly, the VM gets stuck in Unknown status instead of failing over to another active host.

Version-Release number of selected component (if applicable):
rhvm-4.2.2.1-0.1.el7.noarch
vdsm-4.20.19-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. create VM with a disk on NFS v3 domain and HA lease on NFS v3 domain
2. start the VM
3. block the connection between the engine and the host running the VM, and between that host and the storage domain holding the VM's disk and lease (a rough scripted equivalent is sketched below)
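
A minimal sketch of how step 3 might be scripted on the host, assuming iptables; the engine and storage hostnames below are hypothetical placeholders, and the reproduction itself was done with the environment's own tooling:

import subprocess

# Hypothetical endpoints; substitute the real engine and storage
# addresses used in the test environment.
ENGINE_ADDR = "engine.example.com"
STORAGE_ADDR = "nfs-storage.example.com"

def block(addr):
    # Drop all traffic to and from the given address on this host.
    subprocess.check_call(["iptables", "-A", "INPUT", "-s", addr, "-j", "DROP"])
    subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", addr, "-j", "DROP"])

# Run on the host that owns the VM: cut it off from both the engine
# and the storage domain backing the VM's disk and lease.
block(ENGINE_ADDR)
block(STORAGE_ADDR)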

Actual results:
VM is stuck in UNKNOWN status

Expected results:
VM should start on another active host

Additional info:

engine.log

2018-02-27 19:29:46,209+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-4030) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_0_TestCase17618_2719245588 was set to the Unknown status.

vdsm.log

2018-02-27 19:31:49,389+0200 ERROR (vm/da5b0c4a) [virt.vm] (vmId='da5b0c4a-a7f1-4980-aeff-87a4d1973056') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2813, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-02-27T17:31:47.088766Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/d475e955-7b00-4a3a-b28f-a8ae5cd67362/2783ea0f-6292-4fad-b930-e76d9bacbf98,format=qcow2,if=none,id=drive-virtio-disk0,serial=d475e955-7b00-4a3a-b28f-a8ae5cd67362,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-02-27T17:31:47.127937Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/d475e955-7b00-4a3a-b28f-a8ae5cd67362/2783ea0f-6292-4fad-b930-e76d9bacbf98,format=qcow2,if=none,id=drive-virtio-disk0,serial=d475e955-7b00-4a3a-b28f-a8ae5cd67362,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?

Comment 1 Lilach Zitnitski 2018-02-28 15:06:45 UTC
Created attachment 1401914 [details]
logs

Comment 2 Nir Soffer 2018-02-28 16:49:08 UTC
Please also test the case where we power off the host, simulating power loss.

Comment 3 Nir Soffer 2018-02-28 16:52:43 UTC
The root cause seems to be bug 1547095. We plan to work around this issue using
the NFS "nolock" mount option. Setting that bug as a dependency until we have a
tested patch.
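
Once such a patch is in place, one hypothetical way to verify on a host that an NFSv3 domain was really mounted without NLM locking is to scan /proc/mounts; the mount prefix and matching logic here are assumptions, and note that kernels typically report the "nolock" option as "local_lock=all":

def nfs_mount_uses_nolock(mount_prefix="/rhev/data-center/mnt/"):
    # Yield (mountpoint, True/False) for each NFSv3 mount under the
    # given prefix, True when only local locking is in effect.
    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype == "nfs" and mountpoint.startswith(mount_prefix):
                opts = options.split(",")
                yield mountpoint, ("nolock" in opts or "local_lock=all" in opts)

for mountpoint, nolock in nfs_mount_uses_nolock():
    print("%s: %s" % (mountpoint, "nolock in effect" if nolock else "NLM locking enabled"))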

Comment 4 Nir Soffer 2018-02-28 23:55:40 UTC
Raz, can we test the patch with latest engine 4.2?

Note that you will need the fix from bug 1548819 for engine to work with vdsm
master with this patch.

Comment 6 Raz Tamir 2018-03-01 08:34:41 UTC
(In reply to Nir Soffer from comment #4)
> Raz, can we test the patch with latest engine 4.2?
> 
> Note that you will need the fix from bug 1548819 for engine to work with vdsm
> master with this patch.

Elad, can you take care of that?

Comment 8 Lilach Zitnitski 2018-03-04 09:59:41 UTC
(In reply to Nir Soffer from comment #2)
> Please test also the case when we power off the host, simulating power loss.

The results look the same - the VM is stuck in UNKNOWN status, and the vdsm log shows the same error:

2018-03-01 16:07:43,577+0200 ERROR (vm/999c4968) [virt.vm] (vmId='999c4968-9443-4e30-bdc8-13f337efde5b') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2813, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-03-01T14:07:41.368118Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/3c161b7a-39f7-4e28-b64d-8da90049c44f/2f131f26-dc85-439b-8089-47b77bc96491,format=qcow2,if=none,id=drive-virtio-disk0,serial=3c161b7a-39f7-4e28-b64d-8da90049c44f,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-03-01T14:07:41.391107Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/3c161b7a-39f7-4e28-b64d-8da90049c44f/2f131f26-dc85-439b-8089-47b77bc96491,format=qcow2,if=none,id=drive-virtio-disk0,serial=3c161b7a-39f7-4e28-b64d-8da90049c44f,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?

Comment 9 Lilach Zitnitski 2018-03-04 10:00:18 UTC
Created attachment 1403728 [details]
Power off host logs

Comment 10 Nir Soffer 2018-03-05 20:34:07 UTC
We no longer depend on bug 1547095.

Comment 11 Allon Mureinik 2018-03-06 08:16:43 UTC
Nir, can you please add some doctext explaining the change, and how the admin can override it if needed?

Comment 14 Elad 2018-03-15 16:50:15 UTC
Tested according to the steps from the description:
1) Started a VM with a lease and disk on an NFSv3 domain (on host_mixed_3)
2) Blocked the connection from the host (host_mixed_3) running the VM (non-SPM) to the storage, and from the engine to the host (host_mixed_3)


Results:
VM started on another host (host_mixed_2)

=============================================================


VM started on host_mixed_3:

2018-03-15 18:30:10,524+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [235586a8] VM 'bbc69a02-b57c-4c62-8f15-7608d62dae1d'(test1) moved from 'PoweringUp' --> 'Up'


host_mixed_3 becomes unreachable:

2018-03-15 18:33:42,158+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-35) [610e3b8e] Host 'host_mixed_3' is not responding.


VM started on host_mixed_2:

2018-03-15 18:36:27,140+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-9) [] VM 'bbc69a02-b57c-4c62-8f15-7608d62dae1d'(test1) was unexpectedly detected as 'PoweringUp' on VDS '53a89450-a346-4002-8d93-82c266c50c20'(host_mixed_2) (expected on 'babf91ab-84e7-4dce-84a4-5eaf0dd5aaf1')


=============================================================

Used:
rhevm-4.1.10.3-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64
libvirt-3.9.0-14.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
sanlock-3.6.0-1.el7.x86_64

