Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1547033

Summary: [RHEL 7.5] Agent cannot restart HE VM installed on NFSv3 on another HE host, after disaster on the host with HE VM
Product: [oVirt] ovirt-hosted-engine-ha
Component: Agent
Version: 2.2.1
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Reporter: Artyom <alukiano>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Artyom <alukiano>
Docs Contact:
CC: alukiano, bugs, dfediuck, mavital
Target Milestone: ---
Target Release: ---
Keywords: Regression
Flags: mavital: blocker?
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-26 11:52:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1547095
Bug Blocks:
Attachments:
agent, broker and vdsm logs (flags: none)

Description Artyom 2018-02-20 11:40:32 UTC
Description of problem:
The HE agent cannot restart the HE VM on another HE host when the host running the HE VM crashes.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
libvirt-daemon-3.9.0-13.el7.x86_64
qemu-kvm-common-rhev-2.10.0-20.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
qemu-img-rhev-2.10.0-20.el7.x86_64
qemu-kvm-rhev-2.10.0-20.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure HE environment with at least two hosts
2. Run # poweroff -f on the host running the HE VM
3.

Actual results:
The second host cannot start the HE VM
2018-02-20 13:28:31,557+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-02-20 13:28:31,560+0200 INFO  (jsonrpc/1) [api.host] START getCapabilities() from=::1,47952 (api:46)
2018-02-20 13:28:31,604+0200 ERROR (vm/28f35d31) [virt.vm] (vmId='28f35d31-d0fc-4902-a44c-9d2251f09e21') The vm start process failed (vm:939)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 868, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2774, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-02-20T11:28:30.780244Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-02-20T11:28:30.829712Z qemu-kvm: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1: Failed to get "write" lock
Is another process using the image?
2018-02-20 13:28:31,604+0200 INFO  (vm/28f35d31) [virt.vm] (vmId='28f35d31-d0fc-4902-a44c-9d2251f09e21') Changed state to Down: internal error: process exited while connecting to monitor: 2018-02-20T11:28:30.780244Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-02-20T11:28:30.829712Z qemu-kvm: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1: Failed to get "write" lock
Is another process using the image? (code=1) (vm:1658)
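The "Failed to get \"write\" lock" message comes from the image locking that qemu introduced in 2.10: the image file is still locked on behalf of the crashed host, and on NFSv3 such locks are only released once the dead client comes back, which would explain why powering the crashed host back on recovers the environment (see "Additional info" below). A minimal local sketch of the failure mode (illustrative only; qemu internally uses fcntl OFD locks, plain flock is used here just to keep the example self-contained):

```python
import fcntl
import tempfile

# Two openers of the same "image" file; the second cannot take the
# exclusive ("write") lock while the first still holds it.
with tempfile.NamedTemporaryFile() as image:
    first = open(image.name, "r+b")   # first "qemu" holds the image open
    fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)

    second = open(image.name, "r+b")  # second "qemu" tries to start the VM
    try:
        fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        conflict = False
    except BlockingIOError:
        conflict = True  # "Is another process using the image?"

    print(conflict)  # → True
    first.close()
    second.close()
```

On local storage the first lock disappears as soon as the holder dies; the NFSv3 case differs because the server keeps the crashed client's locks alive.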


Expected results:
The second host must succeed in starting the HE VM.

Additional info:
In my case, I can bring my HE environment back only after powering the problematic host back on via power management.

Comment 1 Artyom 2018-02-20 11:44:03 UTC
Created attachment 1398213 [details]
agent, broker and vdsm logs

You can search for the following line in the vdsm logs:
2018-02-20 13:28:31,604+0200 ERROR (vm/28f35d31) [virt.vm] (vmId='28f35d31-d0fc-4902-a44c-9d2251f09e21') The vm start process failed (vm:939)

I ran # poweroff -f on the alma07 host.

# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : master-vds10.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : d12626ed
local_conf_timestamp               : 73451
Host timestamp                     : 73451
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73451 (Tue Feb 20 13:42:55 2018)
        host-id=1
        score=0
        vm_conf_refresh_time=73451 (Tue Feb 20 13:42:55 2018)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Thu Jan  1 22:29:20 1970


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : alma07.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : eb802593
local_conf_timestamp               : 71946
Host timestamp                     : 71945
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=71945 (Tue Feb 20 13:17:00 2018)
        host-id=2
        score=3400
        vm_conf_refresh_time=71946 (Tue Feb 20 13:17:00 2018)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 52610974
local_conf_timestamp               : 5672
Host timestamp                     : 5671
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=5671 (Tue Feb 20 13:43:03 2018)
        host-id=3
        score=0
        vm_conf_refresh_time=5672 (Tue Feb 20 13:43:04 2018)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Thu Jan  1 03:39:49 1970
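The per-host state is easiest to compare via the "Extra metadata" key=value pairs above (e.g. hosts 1 and 3 are stuck at score=0 in EngineUnexpectedlyDown). A quick illustrative parser for those lines (a hypothetical helper, not part of the ovirt-hosted-engine-ha tooling):

```python
# Parse "Extra metadata" key=value lines from `hosted-engine --vm-status`
# output into a dict; sample text is taken from Host 2 above.
sample = """\
metadata_parse_version=1
metadata_feature_version=1
timestamp=71945 (Tue Feb 20 13:17:00 2018)
host-id=2
score=3400
state=EngineUp
stopped=False
"""

def parse_extra_metadata(text):
    meta = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        # Drop trailing human-readable annotations like "(Tue Feb 20 ...)".
        meta[key.strip()] = value.split(" (")[0].strip()
    return meta

meta = parse_extra_metadata(sample)
print(meta["score"], meta["state"])  # → 3400 EngineUp
```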

Comment 2 Martin Sivák 2018-02-20 15:03:18 UTC
Artyom, how long did you wait after the host was killed? Sanlock takes time before it releases the lease, and during that time the VM will not be started anywhere.

- Please provide better timing information
- Please provide the sanlock client status output from all hosts
- Please indicate which host was used to get the --vm-status too

Comment 3 Artyom 2018-02-20 15:19:52 UTC
- I waited half an hour, but I believe it will never start the VM until I power on the problematic host via power management
- I can provide sanlock information only from the two hosts that are up
master-vds10 ~]# sanlock client status
daemon 9b9a32e1-a7a8-403f-86da-ae11006dfccc.master-vds
p -1 helper
p -1 listener
p 9340 
p -1 status
s 8519b4d5-831f-4469-b012-5247e16b1985:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__1/8519b4d5-831f-4469-b012-5247e16b1985/dom_md/ids:0
s 3f2f464e-8c7f-46bb-8f3c-40032e465729:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__2/3f2f464e-8c7f-46bb-8f3c-40032e465729/dom_md/ids:0
s 2a7334e7-c1d3-41fd-9552-2aacbfa4f9af:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/dom_md/ids:0
s 36b82654-72fe-4ad8-8893-8bd2b8c879d3:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__0/36b82654-72fe-4ad8-8893-8bd2b8c879d3/dom_md/ids:0
s hosted-engine:1:/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/fcfef94d-74f6-4bc3-b775-7803b5d85361/fa8f88d7-78fc-4bbc-8a74-a4108ab773a3:0
r 36b82654-72fe-4ad8-8893-8bd2b8c879d3:SDM:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__0/36b82654-72fe-4ad8-8893-8bd2b8c879d3/dom_md/leases:1048576:9 p 9340

# sanlock client status
daemon ba2c7ee8-67f7-4235-8914-6e95177c98e4.rose05.qa.
p -1 helper
p -1 listener
p -1 status
s hosted-engine:3:/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/fcfef94d-74f6-4bc3-b775-7803b5d85361/fa8f88d7-78fc-4bbc-8a74-a4108ab773a3:0
s 8519b4d5-831f-4469-b012-5247e16b1985:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__1/8519b4d5-831f-4469-b012-5247e16b1985/dom_md/ids:0
s 3f2f464e-8c7f-46bb-8f3c-40032e465729:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__2/3f2f464e-8c7f-46bb-8f3c-40032e465729/dom_md/ids:0
s 36b82654-72fe-4ad8-8893-8bd2b8c879d3:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__0/36b82654-72fe-4ad8-8893-8bd2b8c879d3/dom_md/ids:0
s 2a7334e7-c1d3-41fd-9552-2aacbfa4f9af:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/dom_md/ids:0

- the --vm-status output above was taken from master-vds10
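For readers decoding the sanlock output above: each "s" record is a lockspace in the form name:host_id:path:offset, with colons inside NFS mount paths escaped as "\:". An illustrative parser (a hypothetical helper, not shipped with sanlock):

```python
import re

# One of the "s" records from the rose05 output above.
line = (r"s hosted-engine:3:/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-"
        r"2aacbfa4f9af/fcfef94d-74f6-4bc3-b775-7803b5d85361/"
        r"fa8f88d7-78fc-4bbc-8a74-a4108ab773a3:0")

def parse_lockspace(record):
    body = record[2:]  # drop the leading "s "
    # Split on ":" only when it is not escaped as "\:" (as in NFS paths).
    name, host_id, path, offset = re.split(r"(?<!\\):", body)
    return {"name": name, "host_id": int(host_id),
            "path": path.replace("\\:", ":"), "offset": int(offset)}

ls = parse_lockspace(line)
print(ls["name"], ls["host_id"])  # → hosted-engine 3
```

The host_id field matches the "Host ID" shown by hosted-engine --vm-status, which makes it easy to check which host still owns the hosted-engine lockspace.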

Simone did some nice investigation, so I believe we need to wait for input on bug https://bugzilla.redhat.com/show_bug.cgi?id=1547095

Comment 4 Artyom 2018-02-25 19:03:42 UTC
I can confirm that on iSCSI and NFSv4 everything works as expected, so we can reduce the bug priority and not set the blocker flag.

Comment 5 Simone Tiraboschi 2018-02-26 11:52:45 UTC
It's a RHEL 7.5 qemu issue on NFSv3.
Nothing specific to HE.

*** This bug has been marked as a duplicate of bug 1547095 ***