Bug 1471907 - Could not achieve VM high availability with the VM storage lease option enabled on FC-based storage domains
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: sanlock
Version: 4.1.3
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Tomas Jelinek
QA Contact: Pavel Stehlik
Depends On:
Blocks:
Reported: 2017-07-17 11:58 EDT by Anuvrat Sharma
Modified: 2017-07-18 05:18 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-18 02:47:06 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Anuvrat Sharma 2017-07-17 11:58:05 EDT
Description of problem:

Highly available virtual machines are not restarted after a RHV 4.1 hypervisor host goes into a non-responsive state. Once the host becomes non-responsive, the VM state changes to unknown (?) and the RHV Manager tries to restart the VM, but the HA VM is never restarted on the target host.

Version-Release number of selected component (if applicable):

RHV Manager            =   4.1.3.5-0.1.el7
RHV Hypervisor Host    =   Red Hat Virtualization Host 4.1 (el7.3)
Kernel Version         =   3.10.0-514.16.1.el7.x86_64

How reproducible:

(1) 

Steps to Reproduce:

1. Enable High Availability for the VM, select a target storage domain for the VM lease from the drop-down menu, and start the VM (see the SDK sketch after these steps).
2. Make the hypervisor host running the VM non-responsive by bringing down the ovirtmgmt bridge on the host, or by bringing down the underlying physical interfaces attached to the bridge.
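
For reference, the same HA + lease configuration from step 1 can also be applied through the API. Below is a minimal, untested sketch using the oVirt/RHV Python SDK (ovirtsdk4); the engine URL, credentials, VM name (Test_HA) and storage domain name (FC_SD) are placeholders, and the exact type/field names should be checked against the 4.1 SDK documentation:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: engine URL, credentials, VM and storage domain names.
connection = sdk.Connection(
    url='https://rhvm.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,  # lab use only; pass ca_file in production
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=Test_HA')[0]
sd = connection.system_service().storage_domains_service().list(
    search='name=FC_SD')[0]

# Mark the VM highly available and place its lease on the chosen (FC) storage domain.
vms_service.vm_service(vm.id).update(
    types.Vm(
        high_availability=types.HighAvailability(enabled=True),
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(id=sd.id),
        ),
    ),
)

connection.close()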


Actual results:

After the Manager marks the hypervisor host as non-responsive, the VM state changes to unknown (?) in the Manager portal. An event is also logged as "Trying to restart VM Test_HA on Host <chosen host>", but the HA VM is never restarted. On the vdsm side we see the following logs:

2017-07-17 20:13:02,407+0530 ERROR (vm/2393c0c9) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') The vm start process failed (vm:632)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 563, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 2021, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3777, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: resource busy: Failed to acquire lock: error -243
2017-07-17 20:13:02,409+0530 INFO  (vm/2393c0c9) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') Changed state to Down: resource busy: Failed to acquire lock: error -243 (code=1) (vm:1222)
2017-07-17 20:13:02,409+0530 INFO  (vm/2393c0c9) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') Stopping connection (guestagent:430)
2017-07-17 20:13:02,633+0530 INFO  (jsonrpc/3) [vdsm.api] START destroy(gracefulAttempts=1) (api:39)
2017-07-17 20:13:02,633+0530 INFO  (jsonrpc/3) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') Release VM resources (vm:4231)
2017-07-17 20:13:02,633+0530 WARN  (jsonrpc/3) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') trying to set state to Powering down when already Down (vm:362)
2017-07-17 20:13:02,633+0530 INFO  (jsonrpc/3) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') Stopping connection (guestagent:430)

2017-07-17 20:13:02,843+0530 INFO  (jsonrpc/3) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') Stopping connection (guestagent:430)
2017-07-17 20:13:02,844+0530 WARN  (jsonrpc/3) [root] File: /var/lib/libvirt/qemu/channels/2393c0c9-ec04-4f96-8aba-081a37e9adf8.com.redhat.rhevm.vdsm already removed (utils:120)
2017-07-17 20:13:02,844+0530 WARN  (jsonrpc/3) [root] File: /var/lib/libvirt/qemu/channels/2393c0c9-ec04-4f96-8aba-081a37e9adf8.org.qemu.guest_agent.0 already removed (utils:120)
2017-07-17 20:13:02,844+0530 WARN  (jsonrpc/3) [virt.vm] (vmId='2393c0c9-ec04-4f96-8aba-081a37e9adf8') timestamp already removed from stats cache (vm:1744)
2017-07-17 20:13:02,845+0530 INFO  (jsonrpc/3) [dispatcher] Run and protect: inappropriateDevices(thiefId=u'2393c0c9-ec04-4f96-8aba-081a37e9adf8') (logUtils:51)


Expected results:

The HA VM should be restarted by the RHV Manager on another active hypervisor host.

Additional info:
Comment 1 Anuvrat Sharma 2017-07-17 12:04:17 EDT
Anything that makes the qemu-kvm process end will cause the VM to be restarted elsewhere, but the problem arises when the host cannot be rebooted/fenced.
Comment 2 Tomas Jelinek 2017-07-18 02:47:06 EDT
Maybe I'm missing something, but this is exactly what the VM lease functionality is supposed to do. If the VM is actually still running, it will not be started on another host.
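
The "Failed to acquire lock: error -243" in the vdsm log above is sanlock refusing the acquire because the lease on storage is still held by the VM that is still running on the unreachable host. To confirm this from a surviving host, one can read the lease owner directly from the storage domain's xleases volume, where 4.1 stores VM leases. The following is a rough, untested sketch using the python-sanlock binding (the same one vdsm uses); the device path and offset are placeholders, since vdsm allocates the per-VM lease offset internally:

# Rough sketch only: the xleases device path and the per-lease offset below are
# placeholders; vdsm assigns the real offset when the lease is created.
import sanlock

LEASE_PATH = '/dev/<sd_uuid>/xleases'   # on a block (FC) domain: the xleases special LV
LEASE_OFFSET = 3145728                  # placeholder offset of this VM's lease slot

# Read the resource record written on storage for this lease slot.
res = sanlock.read_resource(LEASE_PATH, LEASE_OFFSET)
print(res)  # expected keys include 'lockspace', 'resource', 'version'

# Ask sanlock which host_id (if any) currently owns the lease.
owners = sanlock.read_resource_owners(
    res['lockspace'], res['resource'], [(LEASE_PATH, LEASE_OFFSET)])
for owner in owners:
    print('lease held by host_id', owner['host_id'])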
