Created attachment 1425844 [details]
Relevant engine and SPM vdsm logs

Description of problem:
During an automation test (TestCase19007) that cold moves the VM's disks (4 disks) and, during the move operation, blocks the connection (using iptables) from the SPM host to the target SD, I see that some disks fail to move as expected, BUT some disks are stuck in a locked state.

I saw this issue twice:
- 1st time: 2 disks got locked
- 2nd time: 1 disk got locked

Detailed info is given below for the 2nd occurrence (1 disk locked):
Locked disk ID = 739443d2-87b5-4f4f-b36e-2448f2155e38
Locked disk name = disk_virtiocow_2407441898
Target SD = 'test_gluster_1' with ID = '6c863436-8535-4615-98a2-310420c8bd7b'

Engine:
The move of the disk to a different SD named 'test_gluster_1' with ID = '6c863436-8535-4615-98a2-310420c8bd7b' starts:

2018-04-24 07:47:40,776+03 INFO [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-9) [disks_syncAction_8b902e8a-e50e-405a] Running command: MoveDiskCommand internal: false. Entities affected : ID: 739443d2-87b5-4f4f-b36e-2448f2155e38 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER, ID: 6c863436-8535-4615-98a2-310420c8bd7b Type: StorageAction group CREATE_DISK with role type USER

Right after the move starts, the connection (no network connectivity) from the SPM (host_mixed_1) to the target SDs (test_gluster0, test_gluster_1, test_gluster_2) is blocked.

2018-04-24 07:47:41,159+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-9) [disks_syncAction_8b902e8a-e50e-405a] EVENT_ID: USER_MOVED_DISK(2,008), User admin@internal-authz moving disk disk_virtiocow_2407441898 to domain test_gluster_1.

From the host_mixed_2 (SPM) log:

- Move started (starting with createVolume on the target SD):

2018-04-24 07:47:42,466+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='2c7abda6-5390-4167-ba9b-670cc77a91d5') Task.run: running job 0: createVolume: <bound method StoragePool.createVolume of <vdsm.storage.sp.StoragePool object at 0x7f136c1536d0>> (args: ('6c863436-8535-4615-98a2-310420c8bd7b', '739443d2-87b5-4f4f-b36e-2448f2155e38', 2097152, 4, 2, 'DATA', 'f07dd93f-e76f-4d54-86df-21bbc62b04cd', None, '00000000-0000-0000-0000-000000000000', '00000000-0000-0000-0000-000000000000', None) kwargs: {}) (task:914)

- createVolume on the target SD succeeded:

2018-04-24 07:47:46,700+0300 DEBUG (jsonrpc/2) [storage.TaskManager] Return: {'1833a9ea-6de8-4189-bd06-37b09fbe489c': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '1833a9ea-6de8-4189-bd06-37b09fbe489c'}, '96bccb80-d99c-4c1a-8c58-15bd46041b5a': {'code': 0, 'message': 'running job 1 of 1', 'taskState': 'running', 'taskResult': '', 'taskID': '96bccb80-d99c-4c1a-8c58-15bd46041b5a'}, '182b83b0-a60d-4d72-b40b-b0f8291ce267': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '182b83b0-a60d-4d72-b40b-b0f8291ce267'}, '2c7abda6-5390-4167-ba9b-670cc77a91d5': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '2c7abda6-5390-4167-ba9b-670cc77a91d5'}} (taskManager:124)

- Lease acquired successfully on the target SD:

2018-04-24 07:47:54,483+0300 INFO (tasks/2) [storage.SANLock] Acquiring Lease(name='f07dd93f-e76f-4d54-86df-21bbc62b04cd', path=u'/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__1/6c863436-8535-4615-98a2-310420c8bd7b/images/739443d2-87b5-4f4f-b36e-2448f2155e38/f07dd93f-e76f-4d54-86df-21bbc62b04cd.lease', offset=0) for host id 2 (clusterlock:377)
2018-04-24 07:47:54,547+0300 INFO (tasks/2) [storage.SANLock] Successfully acquired Lease(name='f07dd93f-e76f-4d54-86df-21bbc62b04cd', path=u'/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__1/6c863436-8535-4615-98a2-310420c8bd7b/images/739443d2-87b5-4f4f-b36e-2448f2155e38/f07dd93f-e76f-4d54-86df-21bbc62b04cd.lease', offset=0) for host id 2 (clusterlock:415)

- sdm_copy_data starts but DOES NOT CONTINUE:

2018-04-24 07:47:53,967+0300 INFO (jsonrpc/7) [vdsm.api] START sdm_copy_data(job_id='e4327ecf-1f62-43b4-a327-64cf71484939', source={'img_id': '739443d2-87b5-4f4f-b36e-2448f2155e38', 'sd_id': '1e71d289-d3a2-47ad-ac4b-03c6983c73ab', 'endpoint_type': 'div', 'vol_id': 'f07dd93f-e76f-4d54-86df-21bbc62b04cd'}, destination={'generation': 0, 'img_id': '739443d2-87b5-4f4f-b36e-2448f2155e38', 'sd_id': '6c863436-8535-4615-98a2-310420c8bd7b', 'endpoint_type': 'div', 'vol_id': 'f07dd93f-e76f-4d54-86df-21bbc62b04cd'}) from=::ffff:10.35.162.7,49170, flow_id=disks_syncAction_8b902e8a-e50e-405a, task_id=643b16ec-7f8c-4b2a-848a-8f2030055ebf (api:46)
2018-04-24 07:47:53,968+0300 DEBUG (jsonrpc/7) [storage.TaskManager] scheduled job copy_data for task 643b16ec-7f8c-4b2a-848a-8f2030055ebf (taskManager:79)
2018-04-24 07:47:53,969+0300 INFO (jsonrpc/7) [vdsm.api] FINISH sdm_copy_data return=None from=::ffff:10.35.162.7,49170, flow_id=disks_syncAction_8b902e8a-e50e-405a, task_id=643b16ec-7f8c-4b2a-848a-8f2030055ebf (api:52)

- The task is stopped as the SPM goes DOWN:

2018-04-24 07:49:20,862+0300 DEBUG (MainThread) [storage.TaskManager] Request to stop all tasks (taskManager:91)
2018-04-24 07:49:20,866+0300 INFO (MainThread) [storage.TaskManager] 643b16ec-7f8c-4b2a-848a-8f2030055ebf (taskManager:97)

- From that point onward the SPM checks the volume status every couple of minutes:

2018-04-24 07:49:58,477+0300 INFO (jsonrpc/6) [vdsm.api] START getVolumeInfo(sdUUID='6c863436-8535-4615-98a2-310420c8bd7b', spUUID='5f741a3c-905e-4695-aee9-34a24410b83f', imgUUID='739443d2-87b5-4f4f-b36e-2448f2155e38', volUUID='f07dd93f-e76f-4d54-86df-21bbc62b04cd', options=None) from=::ffff:10.35.162.7,55172, flow_id=disks_syncAction_8b902e8a-e50e-405a, task_id=fb5af342-b693-4002-ba34-ebe41da4c73d (api:46)

Version-Release number of selected component (if applicable):

How reproducible:
Not 100%; happened twice so far (ran the same scenario 5 times).

Steps to Reproduce:
1) Move 4 disks from SD1 (an iSCSI SD) to the target SD (a gluster SD).
2) Block the connection from the SPM to the target storage domain (a minimal blocking sketch is shown below, after "Expected results").
3) For some reason the SPM goes to an unresponsive state.

Actual results:
1 disk (or more) stays locked.
Locked disk ID = 739443d2-87b5-4f4f-b36e-2448f2155e38
Locked disk name = disk_virtiocow_2407441898
Target SD = 'test_gluster_1' with ID = '6c863436-8535-4615-98a2-310420c8bd7b'

Expected results:
Moved disks should either complete the move or fail; they should not stay locked.
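For reference, a minimal sketch of how the blocking in step 2 can be done on the SPM host. The server name below is taken from the gluster lease path in the logs above; the exact address and the use of an iptables OUTPUT DROP rule are assumptions about the automation suite, not a confirmed part of it.

#!/usr/bin/env python
# Sketch only (assumed approach): block/unblock traffic from the SPM host
# to the target storage server with iptables. GLUSTER_SERVER is copied
# from the lease path in the logs; adjust for a different environment.
import subprocess

GLUSTER_SERVER = "gluster01.scl.lab.tlv.redhat.com"


def block_storage(server=GLUSTER_SERVER):
    """Drop all packets sent from this host to the storage server."""
    subprocess.check_call(
        ["iptables", "-I", "OUTPUT", "-d", server, "-j", "DROP"])


def unblock_storage(server=GLUSTER_SERVER):
    """Remove the DROP rule added by block_storage()."""
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", server, "-j", "DROP"])


if __name__ == "__main__":
    block_storage()  # run on the SPM host right after the disk move starts

Calling block_storage() right after MoveDiskCommand is issued reproduces the "no network connectivity to the target SD" condition; unblock_storage() restores connectivity.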
Additional info:

The task finished on vdsm:

2018-04-24 07:49:19,118+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='643b16ec-7f8c-4b2a-848a-8f2030055ebf') moving from state running -> state finished (task:602)
2018-04-24 07:49:19,119+0300 DEBUG (tasks/2) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:910)
2018-04-24 07:49:19,119+0300 DEBUG (tasks/2) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:947)
2018-04-24 07:49:19,120+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='643b16ec-7f8c-4b2a-848a-8f2030055ebf') Task.run: exit - success: result (task:925)
2018-04-24 07:49:19,120+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='643b16ec-7f8c-4b2a-848a-8f2030055ebf') ref 0 aborting False (task:1002)
2018-04-24 07:49:19,121+0300 INFO (tasks/2) [storage.ThreadPool.WorkerThread] FINISH task 643b16ec-7f8c-4b2a-848a-8f2030055ebf (threadPool:210)

The engine does not end the job because the lease is not free:

2018-04-24 08:51:01,188+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] START, GetHostJobsVDSCommand(HostName = host_mixed_2, GetHostJobsVDSCommandParameters:{hostId='14ddbea5-dd3b-44ae-9f39-c137b8d03344', type='storage', jobIds='[e4327ecf-1f62-43b4-a327-64cf71484939]'}), log id: 3e557bb0
2018-04-24 08:51:01,196+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] FINISH, GetHostJobsVDSCommand, return: {}, log id: 3e557bb0
2018-04-24 08:51:01,200+03 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] Command CopyData id: 'c14ec1b2-21f8-479c-b720-b7c8c1584543': attempting to determine the job status by polling the entity.
2018-04-24 08:51:01,212+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] START, GetVolumeInfoVDSCommand(HostName = host_mixed_1, GetVolumeInfoVDSCommandParameters:{hostId='ede6a614-2f97-4d39-97f2-04eafb0a6abb', storagePoolId='5f741a3c-905e-4695-aee9-34a24410b83f', storageDomainId='6c863436-8535-4615-98a2-310420c8bd7b', imageGroupId='739443d2-87b5-4f4f-b36e-2448f2155e38', imageId='f07dd93f-e76f-4d54-86df-21bbc62b04cd'}), log id: 4e245932
2018-04-24 08:51:01,306+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] FINISH, GetVolumeInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@4a6adbe7, log id: 4e245932
2018-04-24 08:51:01,306+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.VdsmImagePoller] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] Command CopyData id: 'c14ec1b2-21f8-479c-b720-b7c8c1584543': the volume lease is not FREE - the job is running

I assume we don't see the job polling in the vdsm logs because the SPM was re-elected and we don't have the new SPM logs. Regardless, this looks like:
https://bugzilla.redhat.com/show_bug.cgi?id=1565040
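For completeness, a minimal sketch (not part of the test) of how the on-disk sanlock lease of the stuck volume can be inspected from a host to confirm whether it is really still held. The lease path is the one from the SANLock log lines above; interpreting the dump output (non-zero timestamp/owner meaning the lease is held) is my assumption about sanlock's output format.

#!/usr/bin/env python
# Sketch only: dump the raw sanlock lease record of the stuck volume.
# The path is copied from the SANLock log lines above; a non-zero
# timestamp/owner in the output indicates the lease is still held,
# i.e. not FREE.
import subprocess

LEASE_PATH = (
    "/rhev/data-center/mnt/glusterSD/"
    "gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__1/"
    "6c863436-8535-4615-98a2-310420c8bd7b/images/"
    "739443d2-87b5-4f4f-b36e-2448f2155e38/"
    "f07dd93f-e76f-4d54-86df-21bbc62b04cd.lease"
)


def dump_lease(path=LEASE_PATH):
    """Print the lease record using 'sanlock direct dump'."""
    subprocess.check_call(["sanlock", "direct", "dump", path])


if __name__ == "__main__":
    dump_lease()

This matches what VdsmImagePoller reports via GetVolumeInfoVDSCommand: as long as the volume lease is not FREE, the engine keeps treating the copy job as running and the disk stays locked.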
This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing 4.3.0 tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Moving to ON_QA as it seems like a dup of bug 1565040
Verified on ovirt-engine 4.3.2-0.1.
This bugzilla is included in the oVirt 4.3.1 release, published on February 28th 2019. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.