Bug 1571142 - Blocking connection from SPM to target SD during cold migration causes some of the moved disks to get locked
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.1
Assignee: Benny Zlotnik
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-24 07:42 UTC by Avihai
Modified: 2019-03-13 16:40 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-13 16:40:28 UTC
oVirt Team: Storage
rule-engine: ovirt-4.3+


Attachments (Terms of Use)
relevant engine and SPM vdsm logs (1.86 MB, application/x-gzip)
2018-04-24 07:42 UTC, Avihai

Description Avihai 2018-04-24 07:42:53 UTC
Created attachment 1425844 [details]
relevant engine and SPM vdsm logs

Description of problem:
During an automation test (TestCase19007) that cold moves a VM's disks (4 disks), the connection from the SPM host to the target SD is blocked (using iptables) while the move is in progress. Some disks fail to move, as expected, BUT some disks get stuck in a locked state.

I saw this issue twice:
- 1st time: 2 disks got locked
- 2nd time: 1 disk got locked

Detailed info is given for the 2nd occurrence, in which 1 disk got locked:

Locked Disk ID = 739443d2-87b5-4f4f-b36e-2448f2155e38
Locked Name = disk_virtiocow_2407441898
Target SD = 'test_gluster_1' with ID = '6c863436-8535-4615-98a2-310420c8bd7b'

Engine:
Start moving the disk to a different SD named 'test_gluster_1' with ID = '6c863436-8535-4615-98a2-310420c8bd7b':
2018-04-24 07:47:40,776+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-9) [disks_syncAction_8b902e8a-e50e-405a] Running command: MoveDiskCommand internal: false. Entities affected :  ID: 739443d2-87b5-4f4f-b36e-2448f2155e38 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER,  ID: 6c863436-8535-4615-98a2-310420c8bd7b Type: StorageAction group CREATE_DISK with role type USER

Right after the move starts, blocking of the connection (no network connectivity) from the SPM (host_mixed_1) to the target SDs (test_gluster0, test_gluster_1, test_gluster_2) begins.

2018-04-24 07:47:41,159+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-9) [disks_syncAction_8b902e8a-e50e-405a] EVENT_ID: USER_MOVED_DISK(2,008), User admin@internal-authz moving disk disk_virtiocow_2407441898 to domain test_gluster_1.

From host_mixed_2 (SPM) log:
Move started (starting with createVolume on the target SD):
2018-04-24 07:47:42,466+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='2c7abda6-5390-4167-ba9b-670cc77a91d5') Task.run: running job 0: createVolume: <bound method StoragePool.createVolume of <vdsm.storag
e.sp.StoragePool object at 0x7f136c1536d0>> (args: ('6c863436-8535-4615-98a2-310420c8bd7b', '739443d2-87b5-4f4f-b36e-2448f2155e38', 2097152, 4, 2, 'DATA', 'f07dd93f-e76f-4d54-86df-21bbc62b04cd', None, '00000000-
0000-0000-0000-000000000000', '00000000-0000-0000-0000-000000000000', None) kwargs: {}) (task:914)

- createVolume on target SD - succeeded
2018-04-24 07:47:46,700+0300 DEBUG (jsonrpc/2) [storage.TaskManager] Return: {'1833a9ea-6de8-4189-bd06-37b09fbe489c': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult'
: 'success', 'taskID': '1833a9ea-6de8-4189-bd06-37b09fbe489c'}, '96bccb80-d99c-4c1a-8c58-15bd46041b5a': {'code': 0, 'message': 'running job 1 of 1', 'taskState': 'running', 'taskResult': '', 'taskID': '96bccb80-
d99c-4c1a-8c58-15bd46041b5a'}, '182b83b0-a60d-4d72-b40b-b0f8291ce267': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '182b83b0-a60d-4d72-b40
b-b0f8291ce267'}, '2c7abda6-5390-4167-ba9b-670cc77a91d5': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '2c7abda6-5390-4167-ba9b-670cc77a91d5
'}} (taskManager:124)

- Lease acquired successfully on target SD:
2018-04-24 07:47:54,483+0300 INFO  (tasks/2) [storage.SANLock] Acquiring Lease(name='f07dd93f-e76f-4d54-86df-21bbc62b04cd', path=u'/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__1/6c863436-8535-4615-98a2-310420c8bd7b/images/739443d2-87b5-4f4f-b36e-2448f2155e38/f07dd93f-e76f-4d54-86df-21bbc62b04cd.lease', offset=0) for host id 2 (clusterlock:377)
2018-04-24 07:47:54,547+0300 INFO  (tasks/2) [storage.SANLock] Successfully acquired Lease(name='f07dd93f-e76f-4d54-86df-21bbc62b04cd', path=u'/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__1/6c863436-8535-4615-98a2-310420c8bd7b/images/739443d2-87b5-4f4f-b36e-2448f2155e38/f07dd93f-e76f-4d54-86df-21bbc62b04cd.lease', offset=0) for host id 2 (clusterlock:415)

- sdm_copy_data starts but DOES NOT CONTINUE!
2018-04-24 07:47:53,967+0300 INFO  (jsonrpc/7) [vdsm.api] START sdm_copy_data(job_id='e4327ecf-1f62-43b4-a327-64cf71484939', source={'img_id': '739443d2-87b5-4f4f-b36e-2448f2155e38', 'sd_id': '1e71d289-d3a2-47ad-ac4b-03c6983c73ab', 'endpoint_type': 'div', 'vol_id': 'f07dd93f-e76f-4d54-86df-21bbc62b04cd'}, destination={'generation': 0, 'img_id': '739443d2-87b5-4f4f-b36e-2448f2155e38', 'sd_id': '6c863436-8535-4615-98a2-310420c8bd7b', 'endpoint_type': 'div', 'vol_id': 'f07dd93f-e76f-4d54-86df-21bbc62b04cd'}) from=::ffff:10.35.162.7,49170, flow_id=disks_syncAction_8b902e8a-e50e-405a, task_id=643b16ec-7f8c-4b2a-848a-8f2030055ebf (api:46)

2018-04-24 07:47:53,968+0300 DEBUG (jsonrpc/7) [storage.TaskManager] scheduled job copy_data for task 643b16ec-7f8c-4b2a-848a-8f2030055ebf  (taskManager:79)
2018-04-24 07:47:53,969+0300 INFO  (jsonrpc/7) [vdsm.api] FINISH sdm_copy_data return=None from=::ffff:10.35.162.7,49170, flow_id=disks_syncAction_8b902e8a-e50e-405a, task_id=643b16ec-7f8c-4b2a-848a-8f2030055ebf (api:52)

- Task is stopped as SPM goes DOWN:
2018-04-24 07:49:20,862+0300 DEBUG (MainThread) [storage.TaskManager] Request to stop all tasks (taskManager:91)
2018-04-24 07:49:20,866+0300 INFO  (MainThread) [storage.TaskManager] 643b16ec-7f8c-4b2a-848a-8f2030055ebf (taskManager:97)


- From that point onward the SPM checks the volume status every couple of minutes:
2018-04-24 07:49:58,477+0300 INFO  (jsonrpc/6) [vdsm.api] START getVolumeInfo(sdUUID='6c863436-8535-4615-98a2-310420c8bd7b', spUUID='5f741a3c-905e-4695-aee9-34a24410b83f', imgUUID='739443d2-87b5-4f4f-b36e-2448f2155e38', volUUID='f07dd93f-e76f-4d54-86df-21bbc62b04cd', options=None) from=::ffff:10.35.162.7,55172, flow_id=disks_syncAction_8b902e8a-e50e-405a, task_id=fb5af342-b693-4002-ba34-ebe41da4c73d (api:46)

Version-Release number of selected component (if applicable):


How reproducible:
Not 100%; it happened twice so far (ran the same scenario 5 times).

Steps to Reproduce:
1) Move 4 disks from SD1 (an iSCSI SD) to a target gluster SD.
2) Block the connection from the SPM to the target storage domain.
3) For some reason the SPM goes to an unresponsive state.
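Step 2 above can be sketched with iptables on the SPM host. The target SD address below is a placeholder, not taken from this report; the commands are printed as a dry run rather than executed:

```shell
# Sketch of step 2: drop outbound traffic from the SPM host to the target SD.
# TARGET_SD_IP is a placeholder; substitute the real gluster server address.
TARGET_SD_IP="10.35.x.x"
BLOCK_CMD="iptables -A OUTPUT -d $TARGET_SD_IP -j DROP"
UNBLOCK_CMD="iptables -D OUTPUT -d $TARGET_SD_IP -j DROP"
# Dry run: print the commands; run them as root on the SPM host to apply.
echo "$BLOCK_CMD"
echo "$UNBLOCK_CMD"
```

Remember to run the unblock command afterwards, otherwise the SPM will stay cut off from the target SD.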


Actual results:
1 disk (or more) stays locked.


Locked Disk ID = 739443d2-87b5-4f4f-b36e-2448f2155e38
Locked Name = disk_virtiocow_2407441898
Target SD = 'test_gluster_1' with ID = '6c863436-8535-4615-98a2-310420c8bd7b'
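As a workaround, disks left in this locked state can usually be released with the engine's dbutils script. The invocation below is a hypothetical sketch (path and flags are assumptions; check `unlock_entity.sh -h` on the engine machine), shown as a dry run:

```shell
# Sketch: query and release the stuck-locked disk via the engine dbutils script.
# The -t/-q flags are assumptions; verify with `unlock_entity.sh -h`.
DISK_ID="739443d2-87b5-4f4f-b36e-2448f2155e38"
UNLOCK="/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh"
QUERY_CMD="$UNLOCK -t disk -q"          # list disks in locked state
UNLOCK_CMD="$UNLOCK -t disk $DISK_ID"   # release this specific disk
# Dry run: print the commands instead of executing them.
echo "$QUERY_CMD"
echo "$UNLOCK_CMD"
```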


Expected results:
Moved disks should either move successfully or fail; they should not get locked.

Additional info:

Comment 1 Benny Zlotnik 2018-04-24 12:31:30 UTC
Task finished on vdsm

2018-04-24 07:49:19,118+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='643b16ec-7f8c-4b2a-848a-8f2030055ebf') moving from state running -> state finished (task:602)
2018-04-24 07:49:19,119+0300 DEBUG (tasks/2) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:910)
2018-04-24 07:49:19,119+0300 DEBUG (tasks/2) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:947)
2018-04-24 07:49:19,120+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='643b16ec-7f8c-4b2a-848a-8f2030055ebf') Task.run: exit - success: result  (task:925)
2018-04-24 07:49:19,120+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='643b16ec-7f8c-4b2a-848a-8f2030055ebf') ref 0 aborting False (task:1002)
2018-04-24 07:49:19,121+0300 INFO  (tasks/2) [storage.ThreadPool.WorkerThread] FINISH task 643b16ec-7f8c-4b2a-848a-8f2030055ebf (threadPool:210)

Engine does not end the job because the lease is not free
2018-04-24 08:51:01,188+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] START, GetHostJobsVDSCommand(HostName = host_mixed_2, GetHostJobsVDSCommandParameters:{hostId='14ddbea5-dd3b-44ae-9f39-c137b8d03344', type='storage', jobIds='[e4327ecf-1f62-43b4-a327-64cf71484939]'}), log id: 3e557bb0
2018-04-24 08:51:01,196+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] FINISH, GetHostJobsVDSCommand, return: {}, log id: 3e557bb0
2018-04-24 08:51:01,200+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] Command CopyData id: 'c14ec1b2-21f8-479c-b720-b7c8c1584543': attempting to determine the job status by polling the entity.
2018-04-24 08:51:01,212+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] START, GetVolumeInfoVDSCommand(HostName = host_mixed_1, GetVolumeInfoVDSCommandParameters:{hostId='ede6a614-2f97-4d39-97f2-04eafb0a6abb', storagePoolId='5f741a3c-905e-4695-aee9-34a24410b83f', storageDomainId='6c863436-8535-4615-98a2-310420c8bd7b', imageGroupId='739443d2-87b5-4f4f-b36e-2448f2155e38', imageId='f07dd93f-e76f-4d54-86df-21bbc62b04cd'}), log id: 4e245932
2018-04-24 08:51:01,306+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] FINISH, GetVolumeInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@4a6adbe7, log id: 4e245932
2018-04-24 08:51:01,306+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.VdsmImagePoller] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [disks_syncAction_8b902e8a-e50e-405a] Command CopyData id: 'c14ec1b2-21f8-479c-b720-b7c8c1584543': the volume lease is not FREE - the job is running

I assume we don't see the job polling in the vdsm logs because SPM was re-elected and we don't have the new SPM logs.
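The engine-side lease check above (GetVolumeInfoVDSCommand) can be reproduced manually from a host with vdsm-client, using the IDs from this report. Parameter names below are assumptions about the vdsm 4.x API; verify with `vdsm-client Volume getInfo -h`. Shown as a dry run:

```shell
# Sketch: manually query the volume the engine is polling, to inspect its
# lease status. IDs are taken from the logs in this report.
SP="5f741a3c-905e-4695-aee9-34a24410b83f"     # storage pool
SD="6c863436-8535-4615-98a2-310420c8bd7b"     # target storage domain
IMG="739443d2-87b5-4f4f-b36e-2448f2155e38"    # locked disk (image)
VOL="f07dd93f-e76f-4d54-86df-21bbc62b04cd"    # volume
CMD="vdsm-client Volume getInfo storagepoolID=$SP storagedomainID=$SD imageID=$IMG volumeID=$VOL"
# Dry run: print the command; run it on a host that sees the domain.
echo "$CMD"
```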

Regardless, this looks like: https://bugzilla.redhat.com/show_bug.cgi?id=1565040

Comment 2 Sandro Bonazzola 2019-01-28 09:34:08 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 3 Tal Nisan 2019-03-01 03:57:26 UTC
Moving to ON_QA as it seems like a dup of bug 1565040

Comment 4 Avihai 2019-03-08 07:27:17 UTC
Verified on ovirt-engine 4.3.2-0.1

Comment 5 Sandro Bonazzola 2019-03-13 16:40:28 UTC
This bugzilla is included in the oVirt 4.3.1 release, published on February 28th 2019.

Since the problem described in this bug report should be
resolved in the oVirt 4.3.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

