Bug 2102149 - Engine doesn't clean up the SD target after first LSM try failed due to insufficient free space on the target SD
Summary: Engine doesn't clean up the SD target after first LSM try failed due to insuf...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.5.1.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.2
Target Release: ---
Assignee: Pavel Bar
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-29 11:56 UTC by sshmulev
Modified: 2022-08-30 08:47 UTC
CC List: 4 users

Fixed In Version: ovirt-engine-4.5.2.2
Clone Of:
Environment:
Last Closed: 2022-08-30 08:47:42 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?
ahadas: blocker-


Attachments


Links:
Github: oVirt ovirt-engine pull 568 (Merged): "Perform cleanup after insufficient free space error during LSM", last updated 2022-08-08 09:14:18 UTC
Red Hat Issue Tracker: RHV-46733, last updated 2022-06-29 12:09:50 UTC

Description sshmulev 2022-06-29 11:56:47 UTC
Description of problem:
The issue was found in an automation test. The test first checks a negative flow in which LSM is expected to fail because the target SD does not have enough free space.
After that attempt, we extend the LUN so that there is enough space for the disk to migrate to the target SD.
The disk fails to migrate on this second attempt as well, although there is now enough space in the target SD.


2022-06-29 11:47:22,424+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: value=Cannot create Logical Volume: 'cmd=[\'/sbin/lvm\', \'lvcreate\', \'--devices\', \'/dev/mapper/3600a09803830447a4f244c4657612f68,/dev/mapper/3600a09803830447a4f244c4657612f69,/dev/mapper/3600a09803830447a4f244c4657612f6a,/dev/mapper/3600a09803830447a4f244c4657612f6b,/dev/mapper/3600a09803830447a4f244c4657612f6c,/dev/mapper/3600a09803830447a4f244c4657612f6d,/dev/mapper/3600a09803830447a4f244c4657623647\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3    hints="none"  obtain_device_list_from_udev=0 } global {  prioritise_write_locks=1  wait_for_locks=1  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--autobackup\', \'n\', \'--contiguous\', \'n\', \'--size\', \'8448m\', \'--wipesignatures\', \'n\', \'--addtag\', \'OVIRT_VOL_INITIALIZING\', \'--name\', \'0caead62-1995-409f-8086-5cf343d3a528\', \'090d90bf-e690-46a1-87d6-68ebe50179c5\'] rc=5 out=[] err=[\'  Volume group "090d90bf-e690-46a1-87d6-68ebe50179c5" has insufficient free space (53 extents): 66 required.\']' abortedcode=550
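
For context: block storage domains in oVirt are LVM volume groups that typically use a 128 MiB physical extent size (an assumption here; the extent size is not stated in the log), which makes the numbers in the error add up:

  8448 MiB requested / 128 MiB per extent = 66 extents required
  53 free extents x 128 MiB = 6784 MiB (~6.6 GiB) actually free

The ~6.6 GiB of remaining free space is consistent with the "left with a 6G free space size" observation in the actual results below.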

2022-06-29 11:47:22,434+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [] BaseAsyncTask::logEndTaskFailure: Task 'fac97aa3-abc9-49bb-b752-5031e90c3c3a' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:

2022-06-29 11:47:22,445+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (EE-ManagedThreadFactory-engine-Thread-92712) [disks_syncAction_69ee0cd6-4776-458a] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand' with failure.

2022-06-29 11:47:29,618+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CloneImageGroupVolumesStructureCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [disks_syncAction_69ee0cd6-4776-458a] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CloneImageGroupVolumesStructureCommand' with failure.

2022-06-29 11:47:30,699+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [disks_syncAction_69ee0cd6-4776-458a] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.

2022-06-29 11:47:30,915+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz has failed to move disk disk_TestCase10144_2911461916 to domain sd_TestCase10144_2911405538.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1.2-0.11.el8ev.noarch
vdsm-4.50.1.4-1.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI SD with 12G of free space.
2. Create a VM from a template, attach a new preallocated disk of 14G to it (on some other iSCSI SD), and run the VM.
3. LSM the newly created disk of the VM to the new target SD (12G). The action should fail, as expected, because there is no room in that SD for 14G.
4. Extend that SD so it has 20G of free space.
5. LSM the same disk (14G) again to the extended iSCSI SD (a rough SDK sketch of this step follows the list).
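
For reference, a minimal sketch of step 5 using the Python SDK (ovirtsdk4). The engine URL, credentials, and the disk/SD names are placeholders, and it assumes that moving a disk of a running VM via the disks API triggers Live Storage Migration:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details; adjust for the environment under test.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
try:
    system = connection.system_service()
    # Locate the 14G preallocated disk that is attached to the running VM.
    disk = system.disks_service().list(search='name=lsm_test_disk')[0]
    disk_service = system.disks_service().disk_service(disk.id)
    # Moving a disk of a running VM triggers LSM to the given target SD.
    disk_service.move(storage_domain=types.StorageDomain(name='extended_iscsi_sd'))
finally:
    connection.close()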

Actual results:
LSM fails due to insufficient free space on the target SD.
The target SD is left with only 6G of free space, even though the disk failed to migrate to it.

Expected results:
LSM of the 14G disk should succeed after extending the target SD.

Comment 2 sshmulev 2022-06-29 13:08:35 UTC
This issue was not seen in RHV-4.4, adding regression keyword

Comment 3 Arik 2022-06-30 09:15:03 UTC
(In reply to sshmulev from comment #2)
> This issue was not seen in RHV-4.4, adding regression keyword

Does it reproduce if you change the test to have let's say not 20G free but 23G free at step 4?

Comment 4 Arik 2022-06-30 09:20:32 UTC
(In reply to Arik from comment #3)
> (In reply to sshmulev from comment #2)
> > This issue was not seen in RHV-4.4, adding regression keyword
> 
> Does it reproduce if you change the test to have let's say not 20G free but
> 23G free at step 4?

OK, I wrote that because the primary suspect is the allocation of 3 chunks on the destination now, but that's probably not the right question to ask - need to take a better look at the logs.
Leaving this to Pavel.

Comment 5 RHEL Program Management 2022-06-30 09:22:04 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 6 Avihai 2022-07-04 08:24:25 UTC
Cannot mark blocker "-" -> not a blocker bug.

Comment 7 Benny Zlotnik 2022-07-14 08:38:21 UTC
The fix is likely to remove this check [1]; at this point the disk has already been created on the target.


[1] https://github.com/oVirt/ovirt-engine/blob/03cab35b639f5d2b1a4722ccd31eb46c85dbc798/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/lsm/LiveMigrateDiskCommand.java#L767

Comment 8 Benny Zlotnik 2022-07-27 12:28:43 UTC
The root cause is that the disk is not removed from the target SD after the failure.
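
This is not the engine's actual code (the real change is the Java fix in the GitHub pull request linked above); the following is only a toy Python model of the missing cleanup-on-failure behaviour, using sizes roughly matching this report (20G target, 14G already allocated when the next allocation fails):

class StorageError(Exception):
    pass

class StorageDomain:
    """Toy model of a block SD: tracks free space and allocated volumes."""
    def __init__(self, free_gib):
        self.free_gib = free_gib
        self.volumes = []

    def create_volume(self, size_gib):
        if size_gib > self.free_gib:
            raise StorageError('insufficient free space')
        self.free_gib -= size_gib
        self.volumes.append(size_gib)
        return len(self.volumes) - 1

    def remove_volume(self, index):
        self.free_gib += self.volumes[index]
        self.volumes[index] = 0

def migrate_to(target, volume_sizes_gib):
    created = []
    try:
        for size in volume_sizes_gib:
            created.append(target.create_volume(size))
        # ... data copy and VM switch-over would follow here ...
    except StorageError:
        # The missing piece this bug is about: undo the partial allocation
        # on the target SD before propagating the failure.
        for index in reversed(created):
            target.remove_volume(index)
        raise

target = StorageDomain(free_gib=20)
try:
    migrate_to(target, [14, 8])  # the second volume does not fit
except StorageError:
    pass
print(target.free_gib)  # 20 with the cleanup; 6 without it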

Comment 12 Pavel Bar 2022-08-08 09:02:07 UTC
QE testing instructions:
Option 1:
Use the automation script that you already have; it performs additional steps, most of which are not important here.

Option 2:
1) Create a disk on iSCSI Storage Domain #1 with size X GiB.
2) Have an iSCSI Storage Domain #2 with available space X+3 GiB.
3) LSM the above disk from SD1 to SD2.

Result:
  Expected:
    The operation should have succeeded (SD2 has 3 GiB more space than required for the operation).
  In practice:
    The operation fails (that is another bug, #2116309), and in addition no cleanup is performed: SD2 is shown with 3 GiB of available space instead of the X+3 GiB one would expect, given that the LSM has failed...

Note: this bug handles the cleanup part only.
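
A minimal verification sketch for Option 2 above with the Python SDK (ovirtsdk4), again with placeholder names and credentials. It assumes StorageDomain.available is reported in bytes, and that a real test would poll the disk/job status until the LSM attempt actually finishes before re-reading the value:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

GiB = 1024 ** 3

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
try:
    system = connection.system_service()
    sds_service = system.storage_domains_service()

    def available_gib(sd_name):
        sd = sds_service.list(search='name=%s' % sd_name)[0]
        return sd.available / GiB

    before = available_gib('sd2')

    disk = system.disks_service().list(search='name=lsm_test_disk')[0]
    try:
        system.disks_service().disk_service(disk.id).move(
            storage_domain=types.StorageDomain(name='sd2'))
    except sdk.Error:
        pass  # the LSM failure itself is tracked in bug 2116309

    # In a real test, wait here for the move to finish (or fail) before checking.
    after = available_gib('sd2')
    print('sd2 available: %.1f GiB before, %.1f GiB after' % (before, after))
    # With the cleanup fix, "after" should return to roughly "before".
finally:
    connection.close()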

Comment 13 sshmulev 2022-08-15 07:30:53 UTC
Verified.
Cleanup works after the failed migration attempt; the SD is left as it was before the migration.

Versions:
ovirt-engine-4.5.2.2-0.1.el8ev
vdsm-4.50.2.2-1.el8ev

Comment 14 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bug is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

