Bug 1547033
| Summary: | [RHEL 7.5] Agent cannot restart HE VM installed on NFSv3 on another HE host, after disaster on the host with HE VM | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Artyom <alukiano> |
| Component: | Agent | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED DUPLICATE | QA Contact: | Artyom <alukiano> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.2.1 | CC: | alukiano, bugs, dfediuck, mavital |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | Flags: | mavital: blocker? |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-26 11:52:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1547095 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Created attachment 1398213 [details]
agent, broker and vdsm logs

You can search the vdsm logs for lines like:

```
2018-02-20 13:28:31,604+0200 ERROR (vm/28f35d31) [virt.vm] (vmId='28f35d31-d0fc-4902-a44c-9d2251f09e21') The vm start process failed (vm:939)
```

I ran `poweroff -f` on the alma07 host.
```
# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : master-vds10.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : d12626ed
local_conf_timestamp               : 73451
Host timestamp                     : 73451
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=73451 (Tue Feb 20 13:42:55 2018)
	host-id=1
	score=0
	vm_conf_refresh_time=73451 (Tue Feb 20 13:42:55 2018)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUnexpectedlyDown
	stopped=False
	timeout=Thu Jan  1 22:29:20 1970

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : alma07.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : eb802593
local_conf_timestamp               : 71946
Host timestamp                     : 71945
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=71945 (Tue Feb 20 13:17:00 2018)
	host-id=2
	score=3400
	vm_conf_refresh_time=71946 (Tue Feb 20 13:17:00 2018)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 52610974
local_conf_timestamp               : 5672
Host timestamp                     : 5671
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=5671 (Tue Feb 20 13:43:03 2018)
	host-id=3
	score=0
	vm_conf_refresh_time=5672 (Tue Feb 20 13:43:04 2018)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUnexpectedlyDown
	stopped=False
	timeout=Thu Jan  1 03:39:49 1970
```
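For reference, the per-host `Engine status` value in the output above is a JSON blob (or the literal string `unknown stale-data` when a host's metadata is stale). A small illustrative parser (my own sketch, not part of ovirt-hosted-engine-ha) shows how those states can be pulled out programmatically:

```python
import json

# Sample lines in the shape printed by `hosted-engine --vm-status` above.
STATUS_LINES = [
    'Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}',
    'Engine status : unknown stale-data',
]

def parse_engine_status(line):
    """Return the decoded status dict, or None for stale/non-JSON statuses."""
    # Split on the first ":" only; the JSON value itself contains colons.
    _, _, value = line.partition(':')
    value = value.strip()
    if not value.startswith('{'):
        return None  # e.g. "unknown stale-data" from a crashed host
    return json.loads(value)

statuses = [parse_engine_status(line) for line in STATUS_LINES]
```

With the lines above, the first entry decodes to a dict whose `vm` field is `down_unexpected`, while the stale host yields `None`.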
Artyom, how long did you wait after the host was killed? Sanlock takes time before it releases the lease, and during that time the VM will not be started anywhere.

- Please provide better timing information
- Please provide the `sanlock client status` output from all hosts
- Please indicate which host was used to get the --vm-status too

I waited half an hour, but I believe it will never start the VM until I power on the problematic host via power management. I can provide sanlock information only from the two hosts that are up.

From master-vds10:

```
# sanlock client status
daemon 9b9a32e1-a7a8-403f-86da-ae11006dfccc.master-vds
p -1 helper
p -1 listener
p 9340
p -1 status
s 8519b4d5-831f-4469-b012-5247e16b1985:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__1/8519b4d5-831f-4469-b012-5247e16b1985/dom_md/ids:0
s 3f2f464e-8c7f-46bb-8f3c-40032e465729:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__2/3f2f464e-8c7f-46bb-8f3c-40032e465729/dom_md/ids:0
s 2a7334e7-c1d3-41fd-9552-2aacbfa4f9af:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/dom_md/ids:0
s 36b82654-72fe-4ad8-8893-8bd2b8c879d3:1:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__0/36b82654-72fe-4ad8-8893-8bd2b8c879d3/dom_md/ids:0
s hosted-engine:1:/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/fcfef94d-74f6-4bc3-b775-7803b5d85361/fa8f88d7-78fc-4bbc-8a74-a4108ab773a3:0
r 36b82654-72fe-4ad8-8893-8bd2b8c879d3:SDM:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__0/36b82654-72fe-4ad8-8893-8bd2b8c879d3/dom_md/leases:1048576:9 p 9340
```

From rose05:

```
# sanlock client status
daemon ba2c7ee8-67f7-4235-8914-6e95177c98e4.rose05.qa.
p -1 helper
p -1 listener
p -1 status
s hosted-engine:3:/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/fcfef94d-74f6-4bc3-b775-7803b5d85361/fa8f88d7-78fc-4bbc-8a74-a4108ab773a3:0
s 8519b4d5-831f-4469-b012-5247e16b1985:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__1/8519b4d5-831f-4469-b012-5247e16b1985/dom_md/ids:0
s 3f2f464e-8c7f-46bb-8f3c-40032e465729:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__2/3f2f464e-8c7f-46bb-8f3c-40032e465729/dom_md/ids:0
s 36b82654-72fe-4ad8-8893-8bd2b8c879d3:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_GE_compute-he-1_nfs__0/36b82654-72fe-4ad8-8893-8bd2b8c879d3/dom_md/ids:0
s 2a7334e7-c1d3-41fd-9552-2aacbfa4f9af:3:/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com\:_Compute__NFS_alukiano_compute-ge-he-1/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/dom_md/ids:0
```

Simone did some nice investigation, so I believe we need to wait for some input on bug https://bugzilla.redhat.com/show_bug.cgi?id=1547095.

I can confirm that on iSCSI and NFSv4 all works as expected, so we can reduce the bug priority and not set the blocker flag.

It's a RHEL 7.5 qemu issue on NFSv3. Nothing specific to HE.

*** This bug has been marked as a duplicate of bug 1547095 ***
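The `s ...` lines in the sanlock output above describe lockspaces in the form `NAME:HOST_ID:PATH:OFFSET`, with literal colons inside the path escaped as `\:`. A rough sketch of a parser for these lines (my own helper, assuming that field layout) could look like:

```python
import re

def parse_lockspace(line):
    """Parse an 's NAME:HOST_ID:PATH:OFFSET' line from `sanlock client status`."""
    assert line.startswith('s ')
    body = line[2:]
    # Split on ":" only when it is not preceded by a backslash,
    # so escaped colons inside NFS mount paths stay intact.
    name, host_id, path, offset = re.split(r'(?<!\\):', body)
    return {
        'lockspace': name,
        'host_id': int(host_id),
        'path': path.replace('\\:', ':'),  # unescape for readability
        'offset': int(offset),
    }

line = ('s hosted-engine:3:/var/run/vdsm/storage/2a7334e7-c1d3-41fd-9552-2aacbfa4f9af/'
        'fcfef94d-74f6-4bc3-b775-7803b5d85361/fa8f88d7-78fc-4bbc-8a74-a4108ab773a3:0')
info = parse_lockspace(line)
```

Here `info['host_id']` comes out as 3, matching the rose05 host ID in the status output above.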
Description of problem:
The HE agent cannot restart the HE VM on another HE host when the host running the HE VM crashes.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
libvirt-daemon-3.9.0-13.el7.x86_64
qemu-kvm-common-rhev-2.10.0-20.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
qemu-img-rhev-2.10.0-20.el7.x86_64
qemu-kvm-rhev-2.10.0-20.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure an HE environment with at least two hosts
2. Run `poweroff -f` on the host with the HE VM

Actual results:
The second host cannot start the HE VM:

```
2018-02-20 13:28:31,557+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-02-20 13:28:31,560+0200 INFO  (jsonrpc/1) [api.host] START getCapabilities() from=::1,47952 (api:46)
2018-02-20 13:28:31,604+0200 ERROR (vm/28f35d31) [virt.vm] (vmId='28f35d31-d0fc-4902-a44c-9d2251f09e21') The vm start process failed (vm:939)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 868, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2774, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-02-20T11:28:30.780244Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-02-20T11:28:30.829712Z qemu-kvm: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1: Failed to get "write" lock
Is another process using the image?
2018-02-20 13:28:31,604+0200 INFO  (vm/28f35d31) [virt.vm] (vmId='28f35d31-d0fc-4902-a44c-9d2251f09e21') Changed state to Down: internal error: process exited while connecting to monitor: 2018-02-20T11:28:30.780244Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-02-20T11:28:30.829712Z qemu-kvm: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1: Failed to get "write" lock
Is another process using the image? (code=1) (vm:1658)
```

Expected results:
The second host must succeed in starting the HE VM.

Additional info:
In my case, I can bring my HE environment back after I power on the problematic host via power management.
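The `Failed to get "write" lock` error comes from qemu's image locking (introduced in qemu 2.10): the crashed host's lock on the HE disk image is not released promptly on NFSv3, so the restarting qemu cannot acquire it. As a minimal local illustration of that contention, here is a sketch using BSD `flock` on a plain file as a stand-in for qemu's actual OFD byte-range locks (and a local filesystem rather than NFS):

```python
import fcntl
import os
import tempfile

# Stand-in for the HE VM disk image.
fd, path = tempfile.mkstemp()
os.close(fd)

# "Crashed" host's qemu process: still holds an exclusive advisory lock.
holder = open(path, 'rb')
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)

# Second host's qemu process: tries to take the same lock and fails,
# mirroring 'Failed to get "write" lock / Is another process using the image?'.
contender = open(path, 'rb')
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    locked = True
except OSError:  # EWOULDBLOCK: another open file description holds the lock
    locked = False
```

On a local filesystem the kernel drops the lock as soon as the holder exits; the NFSv3 failure mode here is that the crashed client's lock lingers on the server until the client comes back, which is why powering the problematic host back on recovers the environment.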