Created attachment 1359162 [details]
logs

Description of problem:
After a live storage migration failure due to lack of storage space, I tried again (after adding storage space), and the migration failed because the target LV already existed in the destination VG.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Have a VM with a disk on a block-based domain with little available space left
2. Start the VM
3. In the VM, start a dd operation to the disk residing on the block-based domain
4. Start live storage migration while the dd operation runs, until there is no free space in the storage domain
5. Live storage migration fails (it needs to fail during VmReplicateDiskFinish). The VM enters a paused state.

SRC domain - 2aff9356-1bff-4202-a5ea-6681e131a713
DST domain - ef80aae1-de8a-4187-93bc-a949b4399fa3

Failure:

2017-11-26 13:29:40,367+0200 ERROR (mailbox-spm/2) [storage.SPM.Messages.Extend] processRequest: Exception caught while trying to extend volume: 73b90832-dc94-480c-940e-0c03d10be178 in domain: ef80aae1-de8a-4187-93bc-a949b4399fa3 (mailbox:172)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 166, in processRequest
    pool.extendVolume(volume['domainID'], volume['volumeID'], size)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1413, in extendVolume
    sdCache.produce(sdUUID).extendVolume(volumeUUID, size, isShuttingDown)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1536, in extendVolume
    lvm.extendLV(self.sdUUID, volumeUUID, size)  # , isShuttingDown)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1240, in extendLV
    % (vgName, lvName, free_extents, needed_extents))
VolumeGroupSizeError: Volume Group not big enough: ('Not enough free extents for extending LV ef80aae1-de8a-4187-93bc-a949b4399fa3/73b90832-dc94-480c-940e-0c03d10be178 (free=4, needed=16)', )

2017-11-26 13:31:34,131+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [674d808a] Command 'VmReplicateDiskFinishVDSCommand(HostName = master-vds12, VmReplicateDiskParameters:{hostId='52beba9f-874e-46fa-a625-f14b27683bd4', vmId='40da4047-2d8d-4c03-8b59-d8a34ac5af41', storagePoolId='36f529a0-8260-4f5c-9b75-963855fe8e21', srcStorageDomainId='2aff9356-1bff-4202-a5ea-6681e131a713', targetStorageDomainId='2aff9356-1bff-4202-a5ea-6681e131a713', imageGroupId='bca679ad-82c4-4ac4-9bea-d9c29a0dad80', imageId='73b90832-dc94-480c-940e-0c03d10be178'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55

6. Extend the storage domain
7. Shut down and start the VM
8. Perform live storage migration again (the same source and destination domains, in the same direction)

Actual results:
The second live storage migration fails with:

2017-11-26 14:52:14,572+0200 ERROR (tasks/6) [storage.TaskManager.Task] (Task='a4e58d63-ed51-46b2-abbd-9beaef58909b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1726, in cloneImageStructure
    img.cloneStructure(sdUUID, imgUUID, dstSdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 628, in cloneStructure
    self._createTargetImage(sdCache.produce(dstSdUUID), sdUUID, imgUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 403, in _createTargetImage
    srcVolUUID=volParams['parent'])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 770, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1156, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 501, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1161, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=ef80aae1-de8a-4187-93bc-a949b4399fa3 lvname=21a60b0f-5485-44da-89d0-ab0cb69bbf94 err=[\'  Logical Volume "21a60b0f-5485-44da-89d0-ab0cb69bbf94" already exists in volume group "ef80aae1-de8a-4187-93bc-a949b4399fa3"\']'

2017-11-26 14:52:14,669+0200 INFO (tasks/6) [storage.Volume] startCreateVolumeRollback: sdUUID=ef80aae1-de8a-4187-93bc-a949b4399fa3 imgUUID=bca679ad-82c4-4ac4-9bea-d9c29a0dad80 volUUID=21a60b0f-5485-44da-89d0-ab0cb69bbf94 (volume:1045)

Expected results:
After a live storage migration failure during VmReplicateDiskFinish due to lack of space in the destination storage domain, there shouldn't be leftovers in the destination domain.

Additional info:
logs
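One way to confirm the leftover on the destination domain before retrying is sketched below. This is a minimal standalone sketch, not vdsm code - the lv_exists helper and the direct lvs invocation are mine; the UUIDs are taken from the logs above. It checks the exact condition that makes the retried createLV fail, since lvcreate refuses to create an LV whose name already exists in the VG:

import os
import subprocess

VG = 'ef80aae1-de8a-4187-93bc-a949b4399fa3'  # destination storage domain VG
LV = '21a60b0f-5485-44da-89d0-ab0cb69bbf94'  # LV left over from the failed LSM


def lv_exists(vg_name, lv_name):
    # 'lvs VG/LV' exits with a non-zero status when the LV does not exist.
    with open(os.devnull, 'w') as devnull:
        rc = subprocess.call(['lvs', '%s/%s' % (vg_name, lv_name)],
                             stdout=devnull, stderr=devnull)
    return rc == 0


if lv_exists(VG, LV):
    # This is the state after the failed first attempt: a retried LSM will
    # fail in createLV with 'Logical Volume ... already exists'.
    print('leftover LV %s still present in VG %s' % (LV, VG))

(Needs lvm2 and root on the host that sees the domain's PVs.)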
Benny - please take a look. The error message looks like the error in bug 1497931. Perhaps the added information in this flow can explain that bug? I'm thinking of something like an LSM failing, the cleanup being reported as finished too early, and then the user retrying but failing because the cleanup hasn't actually completed and the LVs still exist on the target?
Yosi, the attached vdsm log doesn't seem to contain the relevant errors (I cannot find the diskReplicateFinish error, for instance). Can you please attach the correct log (preferably at DEBUG level)?
Benny, I commented on the relevant log lines, both engine and vdsm, and attached the vdsm log file.

********** engine log **********

Before 'VmReplicateDiskStartVDSCommand' I got the following warning from the engine side regarding the disk space:

2017-11-26 13:26:52,884+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [] EVENT_ID: IRS_DISK_SPACE_LOW_ERROR(201), Critical, Low disk space. Yosi_New_StorageDomain_iSCSI domain has 2 GB of free space.

The 'VmReplicateDiskStartVDSCommand' method starts here:

2017-11-26 13:26:58,053+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [674d808a] START, VmReplicateDiskStartVDSCommand(HostName = master-vds12, VmReplicateDiskParameters:{hostId='52beba9f-874e-46fa-a625-f14b27683bd4', vmId='40da4047-2d8d-4c03-8b59-d8a34ac5af41', storagePoolId='36f529a0-8260-4f5c-9b75-963855fe8e21', srcStorageDomainId='2aff9356-1bff-4202-a5ea-6681e131a713', targetStorageDomainId='ef80aae1-de8a-4187-93bc-a949b4399fa3', imageGroupId='bca679ad-82c4-4ac4-9bea-d9c29a0dad80', imageId='73b90832-dc94-480c-940e-0c03d10be178'}), log id: 275e09e0

********** vdsm log **********

On the vdsm side I can see the 'diskReplicateStart' method:

2017-11-26 13:26:58,057+0200 INFO (jsonrpc/7) [api.virt] START diskReplicateStart(srcDisk={'device': 'disk', 'poolID': '36f529a0-8260-4f5c-9b75-963855fe8e21', 'volumeID': '73b90832-dc94-480c-940e-0c03d10be178', 'domainID': '2aff9356-1bff-4202-a5ea-6681e131a713', 'imageID': 'bca679ad-82c4-4ac4-9bea-d9c29a0dad80'}, dstDisk={'device': 'disk', 'poolID': '36f529a0-8260-4f5c-9b75-963855fe8e21', 'volumeID': '73b90832-dc94-480c-940e-0c03d10be178', 'domainID': 'ef80aae1-de8a-4187-93bc-a949b4399fa3', 'imageID': 'bca679ad-82c4-4ac4-9bea-d9c29a0dad80'}) from=::ffff:10.35.163.132,46486, flow_id=674d808a (api:46)

Here is the 'sendExtendMsg' method (after ~2 min):

2017-11-26 13:28:53,233+0200 INFO (periodic/0) [vdsm.api] START sendExtendMsg(spUUID='36f529a0-8260-4f5c-9b75-963855fe8e21', volDict={'newSize': 5368709120, 'domainID': 'ef80aae1

And here is the 'Volume extension failed' error:

2017-11-26 13:29:00,029+0200 INFO (mailbox-hsm/3) [vdsm.api] FINISH getVolumeSize return={'truesize': '3221225472', 'apparentsize': '3221225472'} from=internal, task_id=e42f0059-cae1-43a4-bc82-abe79bbb2860 (api:52)
2017-11-26 13:29:00,030+0200 ERROR (mailbox-hsm/3) [storage.TaskManager.Task] (Task='d8987efa-0d0f-4687-b4f3-e659e693ac5a') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1284, in __afterReplicaExtension
    self.__verifyVolumeExtension(volInfo)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1273, in __verifyVolumeExtension
    (volInfo['name'], volInfo['domainID'], volInfo['volumeID']))
RuntimeError: Volume extension failed for sdb (domainID: ef80aae1-de8a-4187-93bc-a949b4399fa3, volumeID: 73b90832-dc94-480c-940e-0c03d10be178)
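For context on the traceback above: the flow extends the replica via the SPM mailbox and then verifies the LV reached the requested size. Here the apparent size stayed at 3221225472 (3 GiB) while the requested newSize was 5368709120 (5 GiB), because the SPM-side extend had failed on free extents. A rough sketch of that verify step (illustrative names, not the actual vdsm code):

def verify_volume_extension(vol_info, required_size, get_volume_size):
    # get_volume_size stands in for the getVolumeSize call seen in the log;
    # it returns the sizes as strings, e.g. {'apparentsize': '3221225472'}.
    sizes = get_volume_size(vol_info['domainID'], vol_info['volumeID'])
    if int(sizes['apparentsize']) < required_size:
        raise RuntimeError(
            'Volume extension failed for %s (domainID: %s, volumeID: %s)'
            % (vol_info['name'], vol_info['domainID'], vol_info['volumeID']))

# With the values from the log this raises, matching the traceback:
verify_volume_extension(
    {'name': 'sdb',
     'domainID': 'ef80aae1-de8a-4187-93bc-a949b4399fa3',
     'volumeID': '73b90832-dc94-480c-940e-0c03d10be178'},
    required_size=5368709120,
    get_volume_size=lambda sd, vol: {'apparentsize': '3221225472'})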
Created attachment 1367216 [details] vdsm log file with relevant error
After talking to Benny: when LSM fails (e.g., the destination runs out of space), the engine attempts to stop the replication and delete the target volume. Stopping the replication fails (since it is no longer running), which breaks the cleanup flow and leaves the target volume behind.
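If that's the failure mode, the cleanup presumably needs to treat "replication not running" as non-fatal so the target volume still gets removed. A rough sketch of that idea - the class, method names, and calls here are hypothetical, not the actual engine/vdsm interfaces; code 55 is the 'Drive replication error' seen in the logs above:

class VdsmError(Exception):
    """Placeholder for a vdsm call failure carrying an error code."""
    def __init__(self, code, msg):
        Exception.__init__(self, msg)
        self.code = code


REPLICATION_ERROR = 55  # error code reported by VmReplicateDiskFinishVDS above


def cleanup_failed_lsm(vdsm, vm_id, src_disk, dst_disk):
    try:
        # Abort the replication by "finishing" back onto the source disk.
        vdsm.diskReplicateFinish(vm_id, src_disk, src_disk)
    except VdsmError as e:
        if e.code != REPLICATION_ERROR:
            raise
        # The replication is already stopped (e.g., the VM paused on
        # ENOSPC); don't let that abort the rest of the cleanup.
    # Remove the half-created volume on the destination domain either way,
    # so a retried LSM doesn't hit 'Logical Volume ... already exists'.
    vdsm.deleteImage(dst_disk['domainID'], dst_disk['imageID'])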
Benny, according to our face-to-face discussion, this would only work with a patched engine and a patched vdsm - i.e., engine 4.2 and VMs running in a 4.2 DC. Please add this info (or correct it as required) to the doc text.
Tested using:
ovirt-engine-4.2.2.1-0.1.el7.noarch

Actual result:
When trying to perform live storage migration and the destination SD doesn't have enough space, an error with the reason is displayed.

Moving to VERIFIED
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th, 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.