Bug 1759933 - After Read Only disk live migration fails due to engine-host communication issues, VM moves to non-responsive and then to powering-down forever - VM cannot be powered down even after host is up
Summary: After Read Only disk live migration failed causing engine-host communication ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.3.6.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.1
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Duplicates: 1738520 1776003
Depends On: 1821627 1832204 1835662
Blocks:
 
Reported: 2019-10-09 12:51 UTC by Avihai
Modified: 2020-07-08 08:26 UTC (History)
9 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-07-08 08:26:27 UTC
oVirt Team: Storage
Embargoed:
bzlotnik: needinfo-
pm-rhel: ovirt-4.4+
aoconnor: blocker+


Attachments
first_time_VmPowerOff_failure_logs (5.03 MB, application/gzip)
2019-10-09 12:51 UTC, Avihai
host_up_VmPowerOff_failure_logs (986.58 KB, application/gzip)
2019-10-09 12:52 UTC, Avihai
logs (1.47 MB, application/zip)
2019-11-24 15:02 UTC, Evelina Shames
libvirt_qemu_logs (324.01 KB, application/zip)
2019-11-28 13:48 UTC, Evelina Shames


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 108067 0 master ABANDONED core: fallback to cold move for r/o disks LSM 2021-02-15 22:13:57 UTC

Description Avihai 2019-10-09 12:51:31 UTC
Created attachment 1623794 [details]
first_time_VmPowerOff_failure_logs

Description of problem:
After a disk live migration failed due to engine-to-host communication issues, the VM moves to non-responsive, VM power-off fails with errors, and the VM is stuck in the 'Powering Down' state.

Also, after the host went up I tried again to power off the VM manually, but I still get the same errors (see below).

Only restarting libvirtd changes the VM state to Down.
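
For illustration only, a minimal python-libvirt sketch of the failing destroy path. The domain name is taken from this bug, but the connection URI and the simplified single destroy() call are assumptions (VDSM's real flow first attempts a graceful destroy), so treat this as a sketch rather than VDSM's actual implementation:

import libvirt

# Connect to the same local libvirtd that VDSM talks to.
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm_TestCase4925_0819323921')

try:
    # While the block-job abort left over from the failed LSM still holds the
    # domain job lock, libvirt cannot start a destroy job and times out.
    dom.destroy()
except libvirt.libvirtError as e:
    # Expected message, matching the libvirt log below:
    # "Timed out during operation: cannot acquire state change lock
    #  (held by monitor=remoteDispatchDomainBlockJobAbort)"
    print(e.get_error_message())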

VDSM error:
Virtual machine destroy error', 'code': 42


VDSM:
2019-10-08 19:39:17,535+0300 INFO  (jsonrpc/5) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.46.16.248,52740, flow_id=vms_syncAction_b679707a-a717-483b, vmId=72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7 (api:48)
2019-10-08 19:39:47,537+0300 INFO  (jsonrpc/5) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:10.46.16.248,52740, flow_id=vms_syncAction_b679707a-a717-483b, vmId=72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7 (api:54)

Libvirt:
2019-10-08 16:39:23.885+0000: 5661: warning : qemuDomainObjBeginJobInternal:6722 : Cannot start job (query, none, none) for domain vm_TestCase4925_0819323921; current job is (modify, none, none) owned by (5662 remoteDispatchDomainBlockJobAbort, 0 <null>, 0 <null> (flags=0x0)) for (222s, 0s, 0s)
2019-10-08 16:39:23.885+0000: 5661: error : qemuDomainObjBeginJobInternal:6744 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockJobAbort)
2019-10-08 16:39:23.885+0000: 5661: info : virObjectUnref:344 : OBJECT_UNREF: obj=0x7f4d74120270
2019-10-08 16:39:23.885+0000: 5661: error : virFileReadAll:1460 : Failed to open file '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d68\x2dvmTestCase49250819323921.scope/cpuacct.usage': No such file or directory
2019-10-08 16:39:23.885+0000: 5661: error : virCgroupGetValueStr:836 : Unable to read from '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d68\x2dvmTestCase49250819323921.scope/cpuacct.usage': No such file or directory
2019-10-08 16:39:23.885+0000: 5661: error : virFileReadAll:1460 : Failed to open file '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d68\x2dvmTestCase49250819323921.scope/cpuacct.stat': No such file or directory
2019-10-08 16:39:23.885+0000: 5661: error : virCgroupGetValueStr:836 : Unable to read from '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d68\x2dvmTestCase49250819323921.scope/cpuacct.stat': No such file or directory
2019-10-08 16:39:23.885+0000: 5661: warning : qemuGetProcessInfo:1404 : cannot parse process status data
2019-10-08 16:39:23.885+0000: 5661: error : virNetDevTapInterfaceStats:764 : internal error: /proc/net/dev: Interface not found
2019-10-08 16:39:47.536+0000: 5660: warning : qemuDomainObjBeginJobInternal:6722 : Cannot start job (destroy, none, none) for domain vm_TestCase4925_0819323921; current job is (modify, none, none) owned by (5662 remoteDispatchDomainBlockJobAbort, 0 <null>, 0 <null> (flags=0x0)) for (246s, 0s, 0s)
2019-10-08 16:39:47.536+0000: 5660: error : qemuDomainObjBeginJobInternal:6744 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockJobAbort)


Engine:
2019-10-08 19:39:17,535+03 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] Running command: StopVmCommand internal: false. Entities affected :  ID: 72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7 Type: VMAction group STOP_VM with role type USER
2019-10-08 19:39:17,539+03 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='9e10e475-33fe-4aa6-8df2-5e6bfeb94356', vmId='72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 446f4cc5
2019-10-08 19:39:17,541+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] START, DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='9e10e475-33fe-4aa6-8df2-5e6bfeb94356', vmId='72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 5d75f260
2019-10-08 19:39:29,737+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-28) [] VM '72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7'(vm_TestCase4925_0819323921) moved from 'NotResponding' --> 'PoweringDown'
2019-10-08 19:39:38,462+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [ead2c4d] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9e1df8ec-2ac3-4496-81af-c9c76b839ac2=PROVIDER]', sharedLocks=''}'
2019-10-08 19:39:38,475+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [ead2c4d] Running command: SyncNetworkProviderCommand internal: true.
2019-10-08 19:39:38,627+03 INFO  [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-83) [] User admin@internal successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access
2019-10-08 19:39:38,803+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [ead2c4d] Lock freed to object 'EngineLock:{exclusiveLocks='[9e1df8ec-2ac3-4496-81af-c9c76b839ac2=PROVIDER]', sharedLocks=''}'
2019-10-08 19:39:47,548+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] Failed in 'DestroyVDS' method
2019-10-08 19:39:47,553+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-84) [vms_syncAction_b679707a-a717-483b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command DestroyVDS failed: Virtual machine destroy error
2019-10-08 19:39:47,553+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=42, message=Virtual machine destroy error]]'
2019-10-08 19:39:47,553+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] HostName = host_mixed_3
2019-10-08 19:39:47,554+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] Command 'DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='9e10e475-33fe-4aa6-8df2-5e6bfeb94356', vmId='72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2019-10-08 19:39:47,554+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] FINISH, DestroyVDSCommand, return: , log id: 5d75f260
2019-10-08 19:39:47,554+03 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] Failed to destroy VM '72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7' in VDS = '9e10e475-33fe-4aa6-8df2-5e6bfeb94356' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2019-10-08 19:39:47,554+03 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] FINISH, DestroyVmVDSCommand, return: , log id: 446f4cc5
2019-10-08 19:39:47,554+03 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (default task-84) [vms_syncAction_b679707a-a717-483b] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2019-10-08 19:39:47,564+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-84) [vms_syncAction_b679707a-a717-483b] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM vm_TestCase4925_0819323921 (Host: host_mixed_3, User: admin@internal-authz).
2019-10-08 19:39:47,569+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-84) [] Operation Failed: [Virtual machine destroy error]


Version-Release number of selected component (if applicable):


How reproducible:
Not a planned test, happened once due to a network issue between engine->host.

Steps to Reproduce:
TestCase4925.test_live_migrate_RO_disk[iscsi] ran, and a network issue occurred during the live migration phase:

1) Started live disk migration from iscsi_0 SD to iscsi_1 

2) Right after that, communication issues between engine and host occur and the VM goes to NotResponding:
2019-10-08 19:36:44,539+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-21) [] VM '72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7'(vm_TestCase4925_0819323921) moved from 'Up' --> 'NotResponding'

3) Live migration fails:
2019-10-08 19:38:40,226+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_c0dd7821-d829-4b85] Failed VmReplicateDiskFinish (Disk 'cc33e230-a2a6-48cd-a67c-0dc531ef4fdb', VM '72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7')

4) Host goes up:
2019-10-08 19:38:47,157+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [551ea79f] EVENT_ID: VDS_DETECTED(13), Status of host host_mixed_3 was set to Up.

5) Trying to power down the VM fails:
VDSM logs show that the destroy started at a time when the host was still not up:
2019-10-08 19:39:17,535+0300 INFO  (jsonrpc/5) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.46.16.248,52740, flow_id=vms_syncAction_b679707a-a717-483b, vmId=72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7 (api:48)
2019-10-08 19:39:47,537+0300 INFO  (jsonrpc/5) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:10.46.16.248,52740, flow_id=vms_syncAction_b679707a-a717-483b, vmId=72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7 (api:54)

6) Hours later I powered up the host and tried to power down the VM again.

Still, I got the same results and the VM stays in 'Powering Down' state forever.


Engine:
2019-10-09 15:06:17,026+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [603c2621] EVENT_ID: VDS_DETECTED(13), Status of host host_mixed_3 was set to Up.
2019-10-09 15:06:17,030+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [603c2621] EVENT_ID: VDS_ALERT_FENCE_IS_NOT_CONFIGURED(9,000), Failed to verify Power Management configuration for Host host_mixed_3.
2019-10-09 15:06:17,065+03 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [31e97303] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 9e10e475-33fe-4aa6-8df2-5e6bfeb94356 Type: VDS
2019-10-09 15:06:25,090+03 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] Running command: StopVmCommand internal: false. Entities affected :  ID: 72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7 Type: VMAction group STOP_VM with role type USER
2019-10-09 15:06:25,094+03 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{hostId='9e10e475-33fe-4aa6-8df2-5e6bfeb94356', vmId='72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 2b8455f4
2019-10-09 15:06:25,094+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] START, DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='9e10e475-33fe-4aa6-8df2-5e6bfeb94356', vmId='72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 42dc40d2
2019-10-09 15:06:55,102+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] Failed in 'DestroyVDS' method
2019-10-09 15:06:55,116+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command DestroyVDS failed: Virtual machine destroy error
2019-10-09 15:06:55,116+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=42, message=Virtual machine destroy error]]'
2019-10-09 15:06:55,116+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] HostName = host_mixed_3
2019-10-09 15:06:55,116+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] Command 'DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='9e10e475-33fe-4aa6-8df2-5e6bfeb94356', vmId='72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2019-10-09 15:06:55,116+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] FINISH, DestroyVDSCommand, return: , log id: 42dc40d2
2019-10-09 15:06:55,116+03 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] Failed to destroy VM '72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7' in VDS = '9e10e475-33fe-4aa6-8df2-5e6bfeb94356' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2019-10-09 15:06:55,116+03 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] FINISH, DestroyVmVDSCommand, return: , log id: 2b8455f4
2019-10-09 15:06:55,116+03 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2019-10-09 15:06:55,127+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-50940) [0a60c03e-33ad-42d5-a020-c6693790f8a8] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM vm_TestCase4925_0819323921 (Host: host_mixed_3, User: admin@internal-authz).

VDSM:

2019-10-09 15:06:55,096+0300 WARN  (jsonrpc/2) [virt.vm] (vmId='72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7') Failed to destroy VM '72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7' gracefully (error=68) (vm:5230)
2019-10-09 15:06:55,097+0300 INFO  (jsonrpc/2) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:10.46.16.248,52740, flow_id=0a60c03e-33ad-42d5-a020-c6693790f8a8, vmId=72a8ad35-7a5e-4b8a-9d1e-9d9ea9d1d5a7 (api:54)
2019-10-09 15:06:55,097+0300 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.destroy failed (error 42) in 30.01 seconds (__init__:312)

Actual results:
VM power-off fails and the VM stays in 'Powering Down' state forever, even after the host is back up.

Expected results:
It's expected that LSM fails (due to the communication issues) and that the VM moves to non-responsive, but after the host goes up the VM should be able to change state and power down successfully.


Additional info:
Restarting libvirtd on the host where the VM resides solved the issue - the VM went to Down state.
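
A hedged sketch of that recovery path (assumptions: a systemd-managed libvirtd on the host and the placeholder domain name below; whether the transient domain is still defined after the restart depends on the qemu process state):

import subprocess
import libvirt

# The workaround used here: restart libvirtd on the host running the stuck VM.
subprocess.run(['systemctl', 'restart', 'libvirtd'], check=True)

# After the restart the stale job lock is gone; the VM should now report as
# shut off, or (being a transient domain) no longer be defined at all, which
# lets the engine move it to Down.
conn = libvirt.open('qemu:///system')
try:
    dom = conn.lookupByName('vm_TestCase4925_0819323921')
    print(dom.state())
except libvirt.libvirtError:
    print('domain no longer present')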

Comment 1 Avihai 2019-10-09 12:52:33 UTC
Created attachment 1623795 [details]
host_up_VmPowerOff_failure_logs

Comment 2 Avihai 2019-10-09 13:15:27 UTC
Also, another impact of this issue is that as long as the host has a VM which is not in 'Down' state, it cannot be deactivated.

Comment 3 Avihai 2019-10-09 14:59:24 UTC
Version-Release number of selected component (if applicable):

ovirt-engine-4.3.6.7-0.1.el7.noarch
vdsm-4.30.33-1.el7ev.x86_64
libvirt-4.5.0-23.el7_7.1.x86_64
qemu-img-rhev-2.12.0-33.el7_7.4.x86_64

Comment 4 Evelina Shames 2019-11-24 15:02:15 UTC
I saw this issue again:

Description of problem:
VM became unresponsive after VmReplicateDiskFinishVDSCommand, which caused an LSM failure and a VM destroy error.

Engine log:
2019-11-20 08:20:19,230+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='666bdfa6-a483-431e-891a-1e9eb2197278', vmId='cfda2240-2db8-430b-9a09-2d37b97e33d6', storagePoolId='1b798999-f94b-46f3-850a-b002ab228fd6', srcStorageDomainId='288b173a-cb9e-444d-be42-43ec6aa73d5f', targetStorageDomainId='56c56243-35c2-4027-b095-0d013c12e1dd', imageGroupId='dd1b4ba9-3a01-4eba-b8bb-aeec6f1e048a', imageId='ba60a7f5-9893-474c-b7a4-ce67088a82de'}), log id: 27ed9a7a
2019-11-20 08:20:47,343+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-80) [] Setting new tasks map. The map contains now 2 tasks
2019-11-20 08:21:11,485+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] VM 'cfda2240-2db8-430b-9a09-2d37b97e33d6'(vm_TestCase4925_2008172480) moved from 'Up' --> 'NotResponding'
2019-11-20 08:21:11,566+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] EVENT_ID: VM_NOT_RESPONDING(126), VM vm_TestCase4925_2008172480 is not responding.
2019-11-20 08:21:17,344+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] Setting new tasks map. The map contains now 0 tasks
2019-11-20 08:21:17,344+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] Cleared all tasks of pool '1b798999-f94b-46f3-850a-b002ab228fd6'.
2019-11-20 08:23:19,442+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command VmReplicateDiskFinishVDS failed: Message timeout which can be caused by communication issues
2019-11-20 08:23:19,443+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=5022, message=Message timeout which can be caused by communication issues]]'
2019-11-20 08:23:19,443+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] HostName = host_mixed_2
2019-11-20 08:23:19,443+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='666bdfa6-a483-431e-891a-1e9eb2197278', vmId='cfda2240-2db8-430b-9a09-2d37b97e33d6', storagePoolId='1b798999-f94b-46f3-850a-b002ab228fd6', srcStorageDomainId='288b173a-cb9e-444d-be42-43ec6aa73d5f', targetStorageDomainId='56c56243-35c2-4027-b095-0d013c12e1dd', imageGroupId='dd1b4ba9-3a01-4eba-b8bb-aeec6f1e048a', imageId='ba60a7f5-9893-474c-b7a4-ce67088a82de'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2019-11-20 08:23:19,443+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 27ed9a7a
2019-11-20 08:23:19,443+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-27924) [disks_syncAction_202f4230-8e2e-4b9b] Clearing domains data for host host_mixed_2
2019-11-20 08:23:19,443+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-27924) [disks_syncAction_202f4230-8e2e-4b9b] Host 'host_mixed_2' is not responding.
2019-11-20 08:23:19,449+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-27924) [disks_syncAction_202f4230-8e2e-4b9b] EVENT_ID: VDS_HOST_NOT_RESPONDING(9,027), Host host_mixed_2 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
2019-11-20 08:23:19,464+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] Failed VmReplicateDiskFinish (Disk 'dd1b4ba9-3a01-4eba-b8bb-aeec6f1e048a', VM 'cfda2240-2db8-430b-9a09-2d37b97e33d6')
2019-11-20 08:23:19,464+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] Command 'LiveMigrateDisk' id: 'cfb39447-69cb-489a-b1d1-10c634e76f3a' with children [8a6ee64e-1f55-4d05-9742-f1a8a6510603, da329ab8-6758-4e69-a24a-00e36500fc06, 88a452ba-e71a-4d18-bac1-6e0680d80c15] failed when attempting to perform the next operation, marking as 'ACTIVE'
2019-11-20 08:23:19,464+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] EngineException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.EngineException: EngineException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
	at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.replicateDiskFinish(LiveMigrateDiskCommand.java:429) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.completeLiveMigration(LiveMigrateDiskCommand.java:393) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.performNextOperation(LiveMigrateDiskCommand.java:246) [bll.jar:]
	at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) [bll.jar:]
	at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:77) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109) [bll.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_232]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_232]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_232]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_232]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_232]
	at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]

2019-11-20 08:23:19,465+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-12) [disks_syncAction_202f4230-8e2e-4b9b] Command 'LiveMigrateDisk' id: 'cfb39447-69cb-489a-b1d1-10c634e76f3a' child commands '[8a6ee64e-1f55-4d05-9742-f1a8a6510603, da329ab8-6758-4e69-a24a-00e36500fc06, 88a452ba-e71a-4d18-bac1-6e0680d80c15]' executions were completed, status 'FAILED'
2019-11-20 08:23:20,661+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [disks_syncAction_202f4230-8e2e-4b9b] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2019-11-20 08:23:20,662+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [disks_syncAction_202f4230-8e2e-4b9b] Failed during live storage migration of disk 'dd1b4ba9-3a01-4eba-b8bb-aeec6f1e048a' of vm 'cfda2240-2db8-430b-9a09-2d37b97e33d6', attempting to end replication before deleting the target disk
2019-11-20 08:23:50,670+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [disks_syncAction_202f4230-8e2e-4b9b] Failed in 'VmReplicateDiskFinishVDS' method
2019-11-20 08:23:50,681+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [disks_syncAction_202f4230-8e2e-4b9b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command VmReplicateDiskFinishVDS failed: General Exception: ('Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockJobAbort)',)
2019-11-20 08:23:50,682+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [disks_syncAction_202f4230-8e2e-4b9b] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='666bdfa6-a483-431e-891a-1e9eb2197278', vmId='cfda2240-2db8-430b-9a09-2d37b97e33d6', storagePoolId='1b798999-f94b-46f3-850a-b002ab228fd6', srcStorageDomainId='288b173a-cb9e-444d-be42-43ec6aa73d5f', targetStorageDomainId='288b173a-cb9e-444d-be42-43ec6aa73d5f', imageGroupId='dd1b4ba9-3a01-4eba-b8bb-aeec6f1e048a', imageId='ba60a7f5-9893-474c-b7a4-ce67088a82de'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = General Exception: ('Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockJobAbort)',), code = 100
2019-11-20 08:23:50,682+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [disks_syncAction_202f4230-8e2e-4b9b] Replication end of disk 'dd1b4ba9-3a01-4eba-b8bb-aeec6f1e048a' in vm 'cfda2240-2db8-430b-9a09-2d37b97e33d6' back to the source failed, skipping deletion of the target disk
2019-11-20 08:23:50,798+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [disks_syncAction_202f4230-8e2e-4b9b] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk disk_virtio_scsicow_2008174840 to domain iscsi_1.
2019-11-20 08:23:53,762+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-86) [] Operation Failed: [Cannot move Virtual Disk. The Virtual Machine should be either in Up or Paused status.]
2019-11-20 08:23:56,669+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-71) [] VM 'cfda2240-2db8-430b-9a09-2d37b97e33d6'(vm_TestCase4925_2008172480) moved from 'NotResponding' --> 'PoweringDown'
2019-11-20 08:24:24,349+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] Failed in 'DestroyVDS' method
2019-11-20 08:24:24,355+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command DestroyVDS failed: Virtual machine destroy error
2019-11-20 08:24:24,356+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] Command 'DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{hostId='666bdfa6-a483-431e-891a-1e9eb2197278', vmId='cfda2240-2db8-430b-9a09-2d37b97e33d6', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2019-11-20 08:24:24,356+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] FINISH, DestroyVDSCommand, return: , log id: 25d9bfc5
2019-11-20 08:24:24,356+02 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] Failed to destroy VM 'cfda2240-2db8-430b-9a09-2d37b97e33d6' in VDS = '666bdfa6-a483-431e-891a-1e9eb2197278' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2019-11-20 08:24:24,356+02 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] FINISH, DestroyVmVDSCommand, return: , log id: 2ca3017f
2019-11-20 08:24:24,356+02 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2019-11-20 08:24:24,367+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-86) [vms_syncAction_158dc9fe-bc60-4869] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM vm_TestCase4925_2008172480 (Host: host_mixed_2, User: admin@internal-authz).
2019-11-20 08:24:24,371+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-86) [] Operation Failed: [Virtual machine destroy error]


VDSM:
2019-11-20 08:23:50,668+0200 ERROR (jsonrpc/7) [api] FINISH diskReplicateFinish error=Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockJobAbort) (api:134)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 580, in diskReplicateFinish
    return self.vm.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4650, in diskReplicateFinish
    blkJobInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 108, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockJobAbort)

Version-Release number of selected component (if applicable):
vdsm-4.30.38-1.el7ev.x86_64
ovirt-engine-4.3.7.2-0.1.el7.noarch

How reproducible:
Once (automation TC)

Steps to Reproduce:
1. Create a VM with an OS
2. Attach read-only disks to the VM (all possible permutations)
3. Activate the disks
4. Try to move them (LSM) to another storage domain (a scripted sketch of these steps follows below)
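
A hedged sketch of these steps with the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, VM/disk/domain names and the VIRTIO_SCSI interface are placeholders chosen for the example, not values taken from this bug:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
conn = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                      username='admin@internal', password='secret', insecure=True)

vms_service = conn.system_service().vms_service()
vm = vms_service.list(search='name=vm_TestCase4925')[0]

# Attach an existing disk to the VM as read-only and activate it.
attachments_service = vms_service.vm_service(vm.id).disk_attachments_service()
attachment = attachments_service.add(types.DiskAttachment(
    disk=types.Disk(id='<existing-disk-uuid>'),
    interface=types.DiskInterface.VIRTIO_SCSI,
    read_only=True,
    active=True,
))

# With the VM running, trigger live storage migration of the disk to another domain.
disk_service = conn.system_service().disks_service().disk_service(attachment.disk.id)
disk_service.move(storage_domain=types.StorageDomain(name='iscsi_1'))

conn.close()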

Actual results:
Operation failed.

Expected results:
Operation should succeed.

Additional info:
Relevant logs are attached

Comment 5 Evelina Shames 2019-11-24 15:02:46 UTC
Created attachment 1639220 [details]
logs

Comment 6 Evelina Shames 2019-11-24 15:03:33 UTC
*** Bug 1776003 has been marked as a duplicate of this bug. ***

Comment 7 Avihai 2019-11-28 08:53:27 UTC
VM power-off failed with errors and the VM is stuck in the 'Powering Down' state forever (until libvirtd is restarted manually).

Also, after the host went up I tried again to power off the VM manually, but I still get the same errors and the VM is stuck in the 'Powering Down' state.

This issue causes many TCs to fail, as the host (which this VM resides on) cannot be deactivated or moved to maintenance for the rest of the test suite.

Adding this as an automation blocker and raising severity to high.
Please consider fixing this sooner, on 4.3.X.

Comment 9 RHEL Program Management 2019-11-28 10:11:02 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 10 Evelina Shames 2019-11-28 13:48:51 UTC
Created attachment 1640357 [details]
libvirt_qemu_logs

Manual reproducing steps:
 - Create VM with OS
 - Add new R/O disk d1
 - Start VM
 - Try to move (LSM) d1 to another storage domain - Operation fails
 - VM enters a non-responding state; try to power off the VM - the VM stays in 'Powering Down' state forever.

* After libvirtd restart, the VM moves to 'Down' state.

libvirt and qemu logs are attached.
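
For reference, a minimal python-libvirt sketch of the check VDSM performs in diskReplicateFinish (see the traceback in comment 4); the domain and drive names below are placeholders, and the call is simplified compared to VDSM's wrapped virdomain access:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm_TestCase4925_2008172480')   # placeholder domain name

try:
    # Same libvirt call VDSM makes in vm.py's diskReplicateFinish; with the
    # block-job abort stuck on the QEMU monitor, it raises the state-change-lock
    # timeout instead of returning the block job status.
    print(dom.blockJobInfo('sda', 0))                    # 'sda' is a placeholder drive
except libvirt.libvirtError as e:
    print(e.get_error_message())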

Comment 11 Avihai 2019-12-12 12:56:32 UTC
Benny, can you please take a look at this? This reproduces 100% of the time with a read-only disk.

Effects:
- LSM fails
- VM stays in 'Powering Down' state forever
- This blocks the host from going to maintenance (and moving to another cluster), as the VM is not in Down state
- Many tests fail (those that depend on the host going to maintenance / moving to another cluster) due to this failure

Comment 12 Benny Zlotnik 2019-12-12 13:24:49 UTC
(In reply to Avihai from comment #11)
> Benny, can you please take a look at this? this reproduces 100% of the time
> with read only disk.
> 
> Effects:
> - LSM fails
> - VM stays in 'powering down' state forever 
> - This Block host to go to maintenance(and move to another cluster) as VM is
> not in down state.
> - Many tests fails(that depends on the host going to maintenance/moving to
> another cluster) due to this failure
Does it reproduce on an el8 host?

The error is similar to this one: https://bugzilla.redhat.com/show_bug.cgi?id=1687176
And the bug being worked on is: https://bugzilla.redhat.com/show_bug.cgi?id=1655367

Comment 13 Avihai 2019-12-17 15:51:32 UTC
(In reply to Benny Zlotnik from comment #12)
> (In reply to Avihai from comment #11)
> > Benny, can you please take a look at this? this reproduces 100% of the time
> > with read only disk.
> > 
> > Effects:
> > - LSM fails
> > - VM stays in 'powering down' state forever 
> > - This Block host to go to maintenance(and move to another cluster) as VM is
> > not in down state.
> > - Many tests fails(that depends on the host going to maintenance/moving to
> > another cluster) due to this failure
> Does it reproduce on an el8 host?
Evelina can you please check the same manual scenario from comment 10 in an 4.4 ENV and update?

> The error is similar to this one:
> https://bugzilla.redhat.com/show_bug.cgi?id=1687176
This bug is on RHEL7.6 and it's closed-deferred, meaning nothing will be done with it.
We see this issue on RHV 4.3, meaning RHEL7.7 hosts, and we started seeing it only on rhv-4.3.6/7 (it was not seen before). It's weird, as RHEL7.7 hosts already existed in RHV 4.3.5 and this was not seen there.

> And the bug being worked on is:
> https://bugzilla.redhat.com/show_bug.cgi?id=1655367
This is a bug for RHEL8, not for RHEL7.7, meaning it will only be fixed in RHV 4.4 and not for customers on 4.3.
I think this bug should also be fixed in RHEL7.7, as we have been seeing this all the time since 4.3.6.

Comment 14 Benny Zlotnik 2019-12-17 16:10:37 UTC
(In reply to Avihai from comment #13)
> (In reply to Benny Zlotnik from comment #12)
> > (In reply to Avihai from comment #11)
> > > Benny, can you please take a look at this? this reproduces 100% of the time
> > > with read only disk.
> > > 
> > > Effects:
> > > - LSM fails
> > > - VM stays in 'powering down' state forever 
> > > - This Block host to go to maintenance(and move to another cluster) as VM is
> > > not in down state.
> > > - Many tests fails(that depends on the host going to maintenance/moving to
> > > another cluster) due to this failure
> > Does it reproduce on an el8 host?
> Evelina can you please check the same manual scenario from comment 10 in an
> 4.4 ENV and update?
> 
> > The error is similar to this one:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1687176
> This bug is on RHEL7.6 and it's closed-deferred meaning nothing will be done
> with it.
> We see this issue on RHV4.3 meaning RHEL7.7 hosts and we started seeing this
> only on rhv-4.3.6/7 (was not seen before), it's weird as RHEL7.7 hosts were
> existing in 4.3.5 RHV and was not 
> 
> > And the bug being worked on is:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1655367
> This is a bug for RHEL8, not for RHEL7.7 meaning this bug will only be fixed
> in RHV4.4 and not for customers in 4.3.
> I think this bug should also be fixed in RHEL7.7 as we are seeing this all
> the time since 4.3.6.

From what I checked now, it seems live merge doesn't work at all with read-only disks:
libvirt.libvirtError: internal error: unable to execute QEMU command 'block-commit': Block node is read-only

I have never used this flow previously so I don't know if it used to work in the past. But if live merge doesn't work, LSM can't work either.
So perhaps the solution is to actually block these operations?

And I don't get the lock issue you see on an el8 host, just the live merge error.
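
To illustrate the live-merge failure described above, a hedged python-libvirt sketch; the domain name, drive name and the active-commit flag are assumptions for the example, not values taken from this bug:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm_with_ro_disk')   # hypothetical domain with an R/O disk

try:
    # Active commit of the top (snapshot) layer back into the base image, the
    # operation live merge relies on; for a read-only disk qemu rejects it with:
    # "internal error: unable to execute QEMU command 'block-commit':
    #  Block node is read-only"
    dom.blockCommit('sdb', None, None, 0,
                    libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
except libvirt.libvirtError as e:
    print(e.get_error_message())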

Comment 15 Evelina Shames 2019-12-24 07:09:56 UTC
I tried the flow in a 4.4.0-0.13 environment; the VM doesn't enter a non-responding state, but the LSM fails.

Comment 16 RHEL Program Management 2020-03-20 10:26:51 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 17 Lukas Svaty 2020-03-20 11:47:25 UTC
Can we recheck this?

Comment 18 Avihai 2020-03-21 18:46:12 UTC
Evelina, can you please recheck this on the latest 4.4 ds?
The engine changed to RHEL8.2 as well, with other changes...

Comment 19 Evelina Shames 2020-03-24 15:35:20 UTC
(In reply to Avihai from comment #18)
> Evelina, can you please recheck this at latest 4.4 ds ?
> Engine changed to RHEL8.2 as well with other changes...

Yes, the VM doesn't enter a non-responding state, but the LSM fails.

Comment 21 Benny Zlotnik 2020-03-30 08:53:15 UTC
I'm trying to understand when this flow worked.
It looks like libvirt blocks snapshot creation for read-only disks entirely (which makes sense in a way):
libvirt.libvirtError: internal error: unable to execute QEMU command 'block-commit': Block node is read-only <-- When did live migrating a read-only disk work?
If it didn't use to work, then we should just block this; if it did, then it's a regression in libvirt/qemu.

Comment 22 Evelina Shames 2020-03-30 09:01:28 UTC
(In reply to Benny Zlotnik from comment #21)
> I'm trying to understand when did this flow work
> It looks like libvirt blocks snapshot creation for read-only disks entirely
> (which makes sense in a way)
> libvirt.libvirtError: internal error: unable to execute QEMU command
> 'block-commit': Block node is read-only <-- When did live migrating a
> read-only disk work?
> If it didn't used to work, then we should just block this, if it did then
> it's a regression in libvirt/qemu

We have been testing this flow since 2014; it did use to work.

Comment 23 Benny Zlotnik 2020-03-30 13:11:42 UTC
(In reply to Evelina Shames from comment #22)
> (In reply to Benny Zlotnik from comment #21)
> > I'm trying to understand when did this flow work
> > It looks like libvirt blocks snapshot creation for read-only disks entirely
> > (which makes sense in a way)
> > libvirt.libvirtError: internal error: unable to execute QEMU command
> > 'block-commit': Block node is read-only <-- When did live migrating a
> > read-only disk work?
> > If it didn't used to work, then we should just block this, if it did then
> > it's a regression in libvirt/qemu
> 
> We test this flow since 2014, it did used to work.

I see, so to fix this we need fixes in both qemu and libvirt (which added a fairly recent validation that blocks external snapshot creation for r/o disks).
I guess it would be simpler to just fall back to a regular move when dealing with r/o disks, so I'll add this workaround.
Is there a bug for live snapshot creation for r/o disks?

Comment 24 Lukas Svaty 2020-04-01 10:12:48 UTC
Moving to 4.4.0, due to Regression keyword

Comment 25 Benny Zlotnik 2020-04-02 07:23:50 UTC
It's more complicated than I initially thought.

Hi Peter (not sure who is the right person to ask),
Is there a way to restore the old behaviour? Blocking the snapshot breaks live storage migration of read-only disks for us; I assume this was introduced here [1].
I tried to find a way around this without the snapshot, but I get "Block node is read-only" for blockCopy, which is coming from qemu.
 

[1] https://www.redhat.com/archives/libvir-list/2018-October/msg01322.html
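
For context on the blockCopy attempt mentioned above, a hedged sketch; the destination path, drive name and destination XML are assumptions for illustration, not the exact call that was tried:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm_with_ro_disk')   # hypothetical domain

# Destination volume description for a copy that skips the snapshot step.
dest_xml = """
<disk type='file'>
  <source file='/var/lib/libvirt/images/ro_disk_copy.qcow2'/>
  <driver type='qcow2'/>
</disk>
"""

try:
    # Direct block copy of the read-only disk to the new volume; as noted above,
    # qemu currently rejects this with "Block node is read-only".
    dom.blockCopy('sdb', dest_xml)
except libvirt.libvirtError as e:
    print(e.get_error_message())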

Comment 26 Peter Krempa 2020-04-02 07:33:18 UTC
So is the read-only snapshot used only for the 'live storage migration' case? I think it would be straightforward with the new blockdev infrastructure to allow direct copy without the need to do a snapshot.

The same is theoretically also possible for snapshots, but if that's not needed we might not want to go that way.

After we clarify what's really needed, we can then open a libvirt bug/RFE for this.

Comment 28 Michal Skrivanek 2020-04-02 13:22:01 UTC
*** Bug 1738520 has been marked as a duplicate of this bug. ***

Comment 29 Evelina Shames 2020-04-07 09:12:28 UTC
(In reply to Benny Zlotnik from comment #23)
> (In reply to Evelina Shames from comment #22)
> > (In reply to Benny Zlotnik from comment #21)
> > > I'm trying to understand when did this flow work
> > > It looks like libvirt blocks snapshot creation for read-only disks entirely
> > > (which makes sense in a way)
> > > libvirt.libvirtError: internal error: unable to execute QEMU command
> > > 'block-commit': Block node is read-only <-- When did live migrating a
> > > read-only disk work?
> > > If it didn't used to work, then we should just block this, if it did then
> > > it's a regression in libvirt/qemu
> > 
> > We test this flow since 2014, it did used to work.
> 
> I see, so to fix this we need fixes in both qemu and libvirt (which added a
> fairly recent validation to block external snapshot creation for r/o disks).
> I guess it would be simpler to just fallback to regular move when dealing
> with r/o disks, so I'll add this workaround.
> Is there a bug for live snapshot creation for r/o disks?

There wasn't; I opened one: Bug 1821627

Comment 34 Evelina Shames 2020-06-28 13:35:08 UTC
Verified on rhv-4.4.1-5

Comment 35 Sandro Bonazzola 2020-07-08 08:26:27 UTC
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

