Bug 1550127

Summary: [NFS v3] HA VM is stuck in unknown status after ungraceful shutdown
Product: [oVirt] ovirt-engine
Reporter: Lilach Zitnitski <lzitnits>
Component: BLL.Storage
Assignee: Nir Soffer <nsoffer>
Status: CLOSED UPSTREAM
QA Contact: Elad <ebenahar>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.2.0
CC: amureini, bugs, ebenahar, lzitnits, michal.skrivanek, nsoffer, ratamir
Target Milestone: ovirt-4.2.2
Keywords: Automation, Regression
Target Release: ---
Flags: rule-engine: ovirt-4.2?
       rule-engine: blocker+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Qemu takes locks on virtual machine images on NFSv3 shared storage. If a host is disconnected from the network, loses power, or hits another fatal failure, these locks become stale. Consequence: Starting the virtual machine on another host fails when qemu tries to acquire the stale locks. Fix: We now use the "nolock" option for NFSv3 mounts. With "nolock", qemu uses local locks, which are effective only on the host running the virtual machine and cannot prevent starting the virtual machine on another host. Result: The virtual machine can be started on another host after a fatal failure of the original host. Additional info: If locking on NFSv3 shared storage is desired, and the issue with stale locks does not affect the workload, users can re-enable locking on a storage domain by adding the "lock" option to the "Additional mount options" field.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-22 15:50:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
logs (flags: none)
Power off host logs (flags: none)
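
To illustrate the override mentioned in the Doc Text above, here is a minimal sketch, assuming the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, and storage connection ID are placeholders, and the exact service and field names (storage_connections_service, mount_options) should be verified against the SDK documentation for the version in use.

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Placeholder engine API endpoint and credentials.
    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,
    )

    # Placeholder ID of the NFSv3 storage connection to modify.
    sc_service = (connection.system_service()
                  .storage_connections_service()
                  .storage_connection_service('STORAGE-CONNECTION-ID'))

    # Re-enable remote NFS locking by adding "lock" as an additional mount
    # option (the same effect as filling the "Additional mount options"
    # field in the Administration Portal).
    sc_service.update(types.StorageConnection(mount_options='lock'))

    connection.close()

In practice the storage domain typically has to be in maintenance before its connection's mount options can be changed.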

Description Lilach Zitnitski 2018-02-28 15:06:01 UTC
Description of problem:
I have an HA VM with a lease on an NFS v3 storage domain.
When the host running the VM shuts down unexpectedly, the VM gets stuck in Unknown status instead of failing over to another active host.

Version-Release number of selected component (if applicable):
rhvm-4.2.2.1-0.1.el7.noarch
vdsm-4.20.19-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a disk and an HA lease on an NFS v3 storage domain
2. Start the VM
3. Block the connection between the engine and the host running the VM, and between that host and the storage domain holding the VM's disk and lease (see the sketch below)
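
A rough sketch of the blocking in step 3, assuming iptables is used on the host running the VM; the engine and storage addresses are placeholders, and the automated test may use a different mechanism.

    import subprocess

    ENGINE_IP = '10.0.0.10'    # placeholder: engine address
    STORAGE_IP = '10.0.0.20'   # placeholder: NFS server holding the VM's disk and lease

    def block(ip):
        # Drop all traffic to and from the given address on this host.
        subprocess.check_call(['iptables', '-A', 'INPUT', '-s', ip, '-j', 'DROP'])
        subprocess.check_call(['iptables', '-A', 'OUTPUT', '-d', ip, '-j', 'DROP'])

    for ip in (ENGINE_IP, STORAGE_IP):
        block(ip)

    # To restore connectivity afterwards, repeat the same rules with '-D' instead of '-A'.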

Actual results:
VM is stuck in UNKNOWN status

Expected results:
VM should start on another active host

Additional info:

engine.log

2018-02-27 19:29:46,209+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-4030) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_0_TestCase17618_2719245588 was set to the Unknown status.

vdsm.log

2018-02-27 19:31:49,389+0200 ERROR (vm/da5b0c4a) [virt.vm] (vmId='da5b0c4a-a7f1-4980-aeff-87a4d1973056') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2813, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-02-27T17:31:47.088766Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/d475e955-7b00-4a3a-b28f-a8ae5cd67362/2783ea0f-6292-4fad-b930-e76d9bacbf98,format=qcow2,if=none,id=drive-virtio-disk0,serial=d475e955-7b00-4a3a-b28f-a8ae5cd67362,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-02-27T17:31:47.127937Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/d475e955-7b00-4a3a-b28f-a8ae5cd67362/2783ea0f-6292-4fad-b930-e76d9bacbf98,format=qcow2,if=none,id=drive-virtio-disk0,serial=d475e955-7b00-4a3a-b28f-a8ae5cd67362,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?

Comment 1 Lilach Zitnitski 2018-02-28 15:06:45 UTC
Created attachment 1401914 [details]
logs

Comment 2 Nir Soffer 2018-02-28 16:49:08 UTC
Please also test the case where we power off the host, simulating power loss.

Comment 3 Nir Soffer 2018-02-28 16:52:43 UTC
It seems that the root cause is bug 1547095. We plan to work around this issue
using the NFS "nolock" mount option. Setting that bug as a dependency until we
have a tested patch.
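
For reference, a minimal stdlib-only sketch (not part of the planned fix) that checks on a host whether an NFS mount is using local locks; the mount point below is taken from the qemu error in the description and is only an example.

    MOUNT_POINT = ('/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:'
                   '_Storage__NFS_storage__jenkins__ge5__nfs__0')

    def uses_local_locks(mount_point):
        # /proc/mounts columns: device, mount point, fstype, options, dump, pass.
        # Depending on the kernel, "nolock" may be reported as "local_lock=all".
        with open('/proc/mounts') as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 4 and fields[1] == mount_point and fields[2].startswith('nfs'):
                    opts = fields[3].split(',')
                    return 'nolock' in opts or 'local_lock=all' in opts
        return None  # mount point not found

    print(uses_local_locks(MOUNT_POINT))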

Comment 4 Nir Soffer 2018-02-28 23:55:40 UTC
Raz, can we test the patch with latest engine 4.2?

Note that you will need the fix from bug 1548819 for engine to work with vdsm
master with this patch.

Comment 6 Raz Tamir 2018-03-01 08:34:41 UTC
(In reply to Nir Soffer from comment #4)
> Raz, can we test the patch with latest engine 4.2?
> 
> Note that you will need the fix from bug 1548819 for engine to work with vdsm
> master with this patch.

Elad, can you take care of that?

Comment 8 Lilach Zitnitski 2018-03-04 09:59:41 UTC
(In reply to Nir Soffer from comment #2)
> Please also test the case where we power off the host, simulating power loss.

The results look the same: the VM is stuck in UNKNOWN status, and the vdsm log shows the same error.

2018-03-01 16:07:43,577+0200 ERROR (vm/999c4968) [virt.vm] (vmId='999c4968-9443-4e30-bdc8-13f337efde5b') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2813, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-03-01T14:07:41.368118Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/3c161b7a-39f7-4e28-b64d-8da90049c44f/2f131f26-dc85-439b-8089-47b77bc96491,format=qcow2,if=none,id=drive-virtio-disk0,serial=3c161b7a-39f7-4e28-b64d-8da90049c44f,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-03-01T14:07:41.391107Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__0/53086230-5e7b-4f3c-93fd-9de67a417699/images/3c161b7a-39f7-4e28-b64d-8da90049c44f/2f131f26-dc85-439b-8089-47b77bc96491,format=qcow2,if=none,id=drive-virtio-disk0,serial=3c161b7a-39f7-4e28-b64d-8da90049c44f,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?

Comment 9 Lilach Zitnitski 2018-03-04 10:00:18 UTC
Created attachment 1403728 [details]
Power off host logs

Comment 10 Nir Soffer 2018-03-05 20:34:07 UTC
We no longer depend on bug 1547095.

Comment 11 Allon Mureinik 2018-03-06 08:16:43 UTC
Nir, can you please add some doctext explaining the change, and how the admin can override it if needed?

Comment 14 Elad 2018-03-15 16:50:15 UTC
Tested according to the steps from the description:
1) Started a VM with a lease and a disk on an NFSv3 domain (on host_mixed_3)
2) Blocked the connection from the host (host_mixed_3, non-SPM) running the VM to the storage, and from the engine to the host (host_mixed_3)


Results:
VM started on another host (host_mixed_2)

=============================================================


VM started on host_mixed_3:

2018-03-15 18:30:10,524+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [235586a8] VM 'bbc69a02-b57c-4c62-8f15-7608d62dae1d'(test1) moved from 'PoweringUp' --> 'Up'


host_mixed_3 becomes unreachable:

2018-03-15 18:33:42,158+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-35) [610e3b8e] Host 'host_mixed_3' is not responding.


VM started on host_mixed_2:

2018-03-15 18:36:27,140+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-9) [] VM 'bbc69a02-b57c-4c62-8f15-7608d62dae1d'(test1) was unexpectedly detected as 'PoweringUp' on VDS '53a89450-a346-4002-8d93-82c266c50c20'(host_mixed_2) (expected on 'babf91ab-84e7-4dce-84a4-5eaf0dd5aaf1')


=============================================================

Used:
rhevm-4.1.10.3-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64
libvirt-3.9.0-14.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
sanlock-3.6.0-1.el7.x86_64