Bug 1517540 - Uncleaned leftovers after live storage migration failure during VmReplicateDiskFinish
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.2
Target Release: 4.2.2
Assignee: Benny Zlotnik
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Depends On: 1542423
Blocks:
 
Reported: 2017-11-26 15:24 UTC by Yosi Ben Shimon
Modified: 2018-03-29 10:54 UTC
CC List: 7 users

Fixed In Version: ovirt-engine-4.2.2
Doc Type: Bug Fix
Doc Text:
Cause: VDSM sent the same error both when the replication had already been aborted and when there was a genuine problem with it.
Consequence: The oVirt engine could not distinguish between the reasons for failure and, as a result, could not clean up the target even when it was possible.
Fix: A separate error was introduced for the case where the replication is not in progress, and the engine now performs a cleanup in that case.
Result: When the replication is aborted as a result of a failure in live storage migration, the oVirt engine cleans up the target storage domain, making it possible to retry the operation without manually cleaning up the target storage domain.
Note: This behavior is relevant only for DC level >= 4.2 and ovirt-engine >= 4.2.
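A minimal sketch of the error split described in the Doc Text, with hypothetical class names and codes (the real VDSM error classes, codes and call sites differ):

# Hypothetical sketch -- real VDSM error classes, codes and call sites differ.
class ReplicationError(Exception):
    """Generic 'Drive replication error' (code 55), used for every failure before the fix."""
    code = 55

class ReplicationNotInProgress(ReplicationError):
    """Dedicated error for 'nothing to stop', so the engine knows cleanup is safe."""
    code = 56  # hypothetical code

def disk_replicate_finish(replication_active):
    if not replication_active:
        # Before the fix this raised the generic ReplicationError, so the engine
        # could not tell "already aborted" apart from a genuine replication failure.
        raise ReplicationNotInProgress()
    ...  # finish the replication and pivot to the destination volume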
Clone Of:
Environment:
Last Closed: 2018-03-29 10:54:13 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: exception+


Attachments (Terms of Use)
logs (802.81 KB, application/x-gzip)
2017-11-26 15:24 UTC, Yosi Ben Shimon
vdsm log file with relevant error (15.06 MB, text/plain)
2017-12-13 09:48 UTC, Yosi Ben Shimon


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 87132 0 master MERGED core: handle replication not in progress 2020-10-27 12:21:05 UTC
oVirt gerrit 87266 0 ovirt-engine-4.2 MERGED core: handle replication not in progress 2020-10-27 12:21:05 UTC

Description Yosi Ben Shimon 2017-11-26 15:24:28 UTC
Created attachment 1359162 [details]
logs

Description of problem:
After a live storage migration failure due to lack of storage space, I tried again (after adding storage space) and the retry failed because the volume already existed in the destination VG.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Have a VM with a disk on a block-based domain with little available space left
2. Start the VM
3. In the VM, start a dd operation on the disk that resides on the block-based domain (a sketch is shown after the steps)
4. Start live storage migration while the dd operation runs, until there is no free space left in the destination storage domain
5. Live storage migration fails (it needs to fail during VmReplicateDiskFinish). The VM enters a paused state.


SRC domain - 2aff9356-1bff-4202-a5ea-6681e131a713
DST domain - ef80aae1-de8a-4187-93bc-a949b4399fa3


Failure:


2017-11-26 13:29:40,367+0200 ERROR (mailbox-spm/2) [storage.SPM.Messages.Extend] processRequest: Exception caught while trying to extend volume: 73b90832-dc94-480c-940e-0c03d10be178 in domain: ef80aae1-de8a-4187-93bc-a949b4399fa3 (mailbox:172)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 166, in processRequest
    pool.extendVolume(volume['domainID'], volume['volumeID'], size)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1413, in extendVolume
    sdCache.produce(sdUUID).extendVolume(volumeUUID, size, isShuttingDown)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1536, in extendVolume
    lvm.extendLV(self.sdUUID, volumeUUID, size)  # , isShuttingDown)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1240, in extendLV
    % (vgName, lvName, free_extents, needed_extents))
VolumeGroupSizeError: Volume Group not big enough: ('Not enough free extents for extending LV ef80aae1-de8a-4187-93bc-a949b4399fa3/73b90832-dc94-480c-940e-0c03d10be178 (free=4, needed=16)',)
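For context, the "free=4, needed=16" numbers come from a free-extent check: the requested extension is rounded up to whole VG extents and compared with the free extents in the VG. A simplified sketch of that kind of check (names, extent size and rounding are assumptions, not the actual vdsm/storage/lvm.py code):

# Simplified sketch of the free-extent check behind the error above; the real
# code lives in vdsm/storage/lvm.py and differs in names and details.
EXTENT_SIZE_MB = 128  # typical oVirt block-domain extent size -- an assumption here

def extents_needed(extend_by_mb):
    return -(-extend_by_mb // EXTENT_SIZE_MB)  # ceiling division

def check_extend(free_extents, extend_by_mb):
    needed = extents_needed(extend_by_mb)
    if free_extents < needed:
        raise RuntimeError("Volume Group not big enough: free=%d, needed=%d"
                           % (free_extents, needed))
    return needed

# With the numbers above: a 2 GiB extension needs 16 extents (16 * 128 MiB),
# but only 4 extents (512 MiB) were free in the destination VG.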


2017-11-26 13:31:34,131+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [674d808a] Command 'VmReplicateDiskFinishVDSCommand(HostName = master-vds12, VmReplicateDiskParameters:{hostId='52beba9f-874e-46fa-a625-f14b27683bd4', vmId='40da4047-2d8d-4c03-8b59-d8a34ac5af41', storagePoolId='36f529a0-8260-4f5c-9b75-963855fe8e21', srcStorageDomainId='2aff9356-1bff-4202-a5ea-6681e131a713', targetStorageDomainId='2aff9356-1bff-4202-a5ea-6681e131a713', imageGroupId='bca679ad-82c4-4ac4-9bea-d9c29a0dad80', imageId='73b90832-dc94-480c-940e-0c03d10be178'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55

6. Extend the storage domain
7. Shut down and start the VM
8. Perform live storage migration again (the same source and destination domains, in the same direction)
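For step 3, a hedged sketch of exhausting the disk from inside the guest while the migration runs (the device path and dd parameters are assumptions; adjust them to the actual guest disk):

# Hedged sketch of step 3: fill the thin-provisioned disk from inside the guest.
# The device path (/dev/sdb, as seen later in the vdsm log) and parameters are assumptions.
import subprocess

subprocess.run(
    ["dd", "if=/dev/zero", "of=/dev/sdb", "bs=1M", "oflag=direct"],
    check=False,  # dd exits non-zero once the volume can no longer be extended
)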

Actual results:

Second live storage migration fails with:

2017-11-26 14:52:14,572+0200 ERROR (tasks/6) [storage.TaskManager.Task] (Task='a4e58d63-ed51-46b2-abbd-9beaef58909b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1726, in cloneImageStructure
    img.cloneStructure(sdUUID, imgUUID, dstSdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 628, in cloneStructure
    self._createTargetImage(sdCache.produce(dstSdUUID), sdUUID, imgUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 403, in _createTargetImage
    srcVolUUID=volParams['parent'])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 770, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1156, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 501, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1161, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=ef80aae1-de8a-4187-93bc-a949b4399fa3 lvname=21a60b0f-5485-44da-89d0-ab0cb69bbf94 err=[\'  Logical Volume "21a60b0f-5485-44da-89d0-ab0cb69bbf94" already exists in volume group "ef80aae1-de8a-4187-93bc-a949b4399fa3"\']'
2017-11-26 14:52:14,669+0200 INFO  (tasks/6) [storage.Volume] startCreateVolumeRollback: sdUUID=ef80aae1-de8a-4187-93bc-a949b4399fa3 imgUUID=bca679ad-82c4-4ac4-9bea-d9c29a0dad80 volUUID=21a60b0f-5485-44da-89d0-ab0cb69bbf94  (volume:1045)
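For reference, the leftover that breaks the retry is the uninitialised LV on the destination VG. A hedged way to check for it from a host that sees the VG (assumes the lvm2 CLI is installed; the UUIDs are the ones from this bug, and this only inspects, it does not remove anything):

# Hedged sketch: check whether the leftover LV is still present on the destination VG.
# Assumes the lvm2 CLI is installed and the script is run as root on a host that sees the VG.
import subprocess

DST_VG = "ef80aae1-de8a-4187-93bc-a949b4399fa3"        # destination storage domain VG
LEFTOVER_LV = "21a60b0f-5485-44da-89d0-ab0cb69bbf94"   # LV reported as "already exists"

out = subprocess.run(
    ["lvs", "--noheadings", "-o", "lv_name", DST_VG],
    capture_output=True, text=True, check=True,
).stdout.split()

if LEFTOVER_LV in out:
    print("leftover LV still present on the destination domain")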



Expected results:
After a live storage migration failure during VmReplicateDiskFinish due to lack of space in the destination storage domain, there should not be any leftovers in the destination domain.


Additional info:
logs

Comment 1 Allon Mureinik 2017-11-27 08:12:31 UTC
Benny - please take a look.
The error message looks like the error in bug 1497931. Perhaps the added information in this flow can explain that bug?

I'm thinking of something like an LSM failing, reporting the cleanup failed too early and then the user retries but fails since the cleanup hasn't completed and the LVs still exist on the target?

Comment 2 Benny Zlotnik 2017-12-07 15:09:14 UTC
Yosi, the attached vdsm log doesn't seem to contain the relevant errors (cannot find the diskReplicateFinish error for instance), can you please attach the correct log (preferably in DEBUG level)?

Comment 3 Yosi Ben Shimon 2017-12-13 09:47:22 UTC
Benny,
I have included the relevant log lines below, from both the engine and vdsm, and attached the vdsm log file.

********** engine log **********

Before 'VmReplicateDiskStartVDSCommand', I got the following low-disk-space error from the engine side:

2017-11-26 13:26:52,884+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [] EVENT_ID: IRS_DISK_SPACE_LOW_ERROR(201), Critical, Low disk space. Yosi_New_StorageDomain_iSCSI domain has 2 GB of free space.

'VmReplicateDiskStartVDSCommand' method starts here:

2017-11-26 13:26:58,053+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [674d808a] START, VmReplicateDiskStartVDSCommand(HostName = master-vds12, VmReplicateDiskParameters:{hostId='52beba9f-874e-46fa-a625-f14b27683bd4', vmId='40da4047-2d8d-4c03-8b59-d8a34ac5af41', storagePoolId='36f529a0-8260-4f5c-9b75-963855fe8e21', srcStorageDomainId='2aff9356-1bff-4202-a5ea-6681e131a713', targetStorageDomainId='ef80aae1-de8a-4187-93bc-a949b4399fa3', imageGroupId='bca679ad-82c4-4ac4-9bea-d9c29a0dad80', imageId='73b90832-dc94-480c-940e-0c03d10be178'}), log id: 275e09e0

********** vdsm log **********

From the vdsm side I can see the 'diskReplicateStart' call:

2017-11-26 13:26:58,057+0200 INFO  (jsonrpc/7) [api.virt] START diskReplicateStart(srcDisk={'device': 'disk', 'poolID': '36f529a0-8260-4f5c-9b75-963855fe8e21', 'volumeID': '73b90832-dc94-480c-940e-0c03d10be178', 'domainID': '2aff9356-1bff-4202-a5ea-6681e131a713', 'imageID': 'bca679ad-82c4-4ac4-9bea-d9c29a0dad80'}, dstDisk={'device': 'disk', 'poolID': '36f529a0-8260-4f5c-9b75-963855fe8e21', 'volumeID': '73b90832-dc94-480c-940e-0c03d10be178', 'domainID': 'ef80aae1-de8a-4187-93bc-a949b4399fa3', 'imageID': 'bca679ad-82c4-4ac4-9bea-d9c29a0dad80'}) from=::ffff:10.35.163.132,46486, flow_id=674d808a (api:46)

Here is the 'sendExtendMsg' method (after ~2min):

2017-11-26 13:28:53,233+0200 INFO  (periodic/0) [vdsm.api] START sendExtendMsg(spUUID='36f529a0-8260-4f5c-9b75-963855fe8e21', volDict={'newSize': 5368709120, 'domainID': 'ef80aae1

And here is the 'Volume extension failed' error:

2017-11-26 13:29:00,029+0200 INFO  (mailbox-hsm/3) [vdsm.api] FINISH getVolumeSize return={'truesize': '3221225472', 'apparentsize': '3221225472'} from=internal, task_id=e42f0059-cae1-43a4-bc82-abe79bbb2860 (api:52)
2017-11-26 13:29:00,030+0200 ERROR (mailbox-hsm/3) [storage.TaskManager.Task] (Task='d8987efa-0d0f-4687-b4f3-e659e693ac5a') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1284, in __afterReplicaExtension
    self.__verifyVolumeExtension(volInfo)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1273, in __verifyVolumeExtension
    (volInfo['name'], volInfo['domainID'], volInfo['volumeID']))
RuntimeError: Volume extension failed for sdb (domainID: ef80aae1-de8a-4187-93bc-a949b4399fa3, volumeID: 73b90832-dc94-480c-940e-0c03d10be178)

Comment 4 Yosi Ben Shimon 2017-12-13 09:48:22 UTC
Created attachment 1367216 [details]
vdsm log file with relevant error

Comment 5 Allon Mureinik 2018-01-11 14:16:30 UTC
After talking to Benny:
When LSM fails (e.g., runs out of space), the engine attempts to stop the replication and delete the target. Stopping the replication fails (since it isn't running), which breaks the flow.
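A hedged sketch of that flow with hypothetical names (not the actual engine code): before the fix, any replication error aborted the rollback before the target image was deleted; with a dedicated "not in progress" error the cleanup can continue.

# Illustrative sketch of the flow from comment 5 -- hypothetical names, not engine code.
class ReplicationError(Exception):
    """Generic VDSM 'Drive replication error' (code 55)."""

class ReplicationNotInProgress(ReplicationError):
    """Dedicated error introduced by the fix for 'no replication is running'."""

def stop_replication(replication_running):
    if not replication_running:
        raise ReplicationNotInProgress()

def delete_target_image():
    print("target image removed, no leftovers on the destination domain")

def rollback_pre_fix(replication_running):
    # Before the fix: any replication error aborts the rollback here, so
    # delete_target_image() never runs and leftovers remain on the target SD.
    stop_replication(replication_running)
    delete_target_image()

def rollback_fixed(replication_running):
    try:
        stop_replication(replication_running)
    except ReplicationNotInProgress:
        pass  # nothing to stop, so it is safe to continue with the cleanup
    delete_target_image()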

Comment 6 Allon Mureinik 2018-02-07 10:01:33 UTC
Benny, according to our f2f discussion, this would only work with a patched engine and a patched vdsm - i.e., engine 4.2 and VMs running in a 4.2 DC.
Please add this info (or correct it as required) to the doctext.

Comment 8 Yosi Ben Shimon 2018-02-27 12:07:28 UTC
Tested using:
ovirt-engine-4.2.2.1-0.1.el7.noarch

Actual result:
When trying to perform live storage migration and the destination SD doesn't have enough space, an error with the reason is displayed.

Moving to VERIFIED

Comment 9 Sandro Bonazzola 2018-03-29 10:54:13 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

