Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2102149

Summary: Engine doesn't clean up the SD target after first LSM try failed due to insufficient free space on the target SD
Product: [oVirt] ovirt-engine
Reporter: sshmulev
Component: BLL.Storage
Assignee: Pavel Bar <pbar>
Status: CLOSED CURRENTRELEASE
QA Contact: sshmulev
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5.1.2
CC: ahadas, bugs, bzlotnik, dfodor
Target Milestone: ovirt-4.5.2
Keywords: Automation, Regression, ZStream
Target Release: ---
Flags: pm-rhel: ovirt-4.5?, ahadas: blocker-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.5.2.2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-30 08:47:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description sshmulev 2022-06-29 11:56:47 UTC
Description of problem:
The issue was found in an automation test. The test first checks a negative flow in which the LSM is expected to fail because the target SD does not have enough free space.
After that attempt, we extend the LUN so there is enough space for the disk to migrate to the target SD.
The disk then fails to migrate on this second attempt as well, even though there is now enough free space on the target SD.


2022-06-29 11:47:22,424+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: value=Cannot create Logical Volume: 'cmd=[\'/sbin/lvm\', \'lvcreate\', \'--devices\', \'/dev/mapper/3600a09803830447a4f244c4657612f68,/dev/mapper/3600a09803830447a4f244c4657612f69,/dev/mapper/3600a09803830447a4f244c4657612f6a,/dev/mapper/3600a09803830447a4f244c4657612f6b,/dev/mapper/3600a09803830447a4f244c4657612f6c,/dev/mapper/3600a09803830447a4f244c4657612f6d,/dev/mapper/3600a09803830447a4f244c4657623647\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3    hints="none"  obtain_device_list_from_udev=0 } global {  prioritise_write_locks=1  wait_for_locks=1  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--autobackup\', \'n\', \'--contiguous\', \'n\', \'--size\', \'8448m\', \'--wipesignatures\', \'n\', \'--addtag\', \'OVIRT_VOL_INITIALIZING\', \'--name\', \'0caead62-1995-409f-8086-5cf343d3a528\', \'090d90bf-e690-46a1-87d6-68ebe50179c5\'] rc=5 out=[] err=[\'  Volume group "090d90bf-e690-46a1-87d6-68ebe50179c5" has insufficient free space (53 extents): 66 required.\']' abortedcode=550
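The numbers in the lvcreate error add up if the VG uses whole physical extents. A minimal sketch of that arithmetic, assuming a 128 MiB physical extent size (oVirt's default for block storage domains; the log itself does not state the extent size):

```python
# Hypothetical reconstruction of the extent math behind the lvcreate failure.
# Assumption: the VG uses 128 MiB physical extents (oVirt's default for block
# storage domains); this is not stated in the log itself.

EXTENT_MIB = 128

def extents_needed(size_mib: int, extent_mib: int = EXTENT_MIB) -> int:
    """Round the requested LV size up to whole physical extents."""
    return -(-size_mib // extent_mib)  # ceiling division

requested_mib = 8448  # from '--size', '8448m' in the lvcreate command above
required = extents_needed(requested_mib)
free = 53             # from 'insufficient free space (53 extents)'

print(required)                         # 66, matching "66 required"
print((required - free) * EXTENT_MIB)   # 1664 MiB shortfall on the VG
```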

2022-06-29 11:47:22,434+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [] BaseAsyncTask::logEndTaskFailure: Task 'fac97aa3-abc9-49bb-b752-5031e90c3c3a' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:

2022-06-29 11:47:22,445+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (EE-ManagedThreadFactory-engine-Thread-92712) [disks_syncAction_69ee0cd6-4776-458a] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand' with failure.

2022-06-29 11:47:29,618+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CloneImageGroupVolumesStructureCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [disks_syncAction_69ee0cd6-4776-458a] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CloneImageGroupVolumesStructureCommand' with failure.

2022-06-29 11:47:30,699+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [disks_syncAction_69ee0cd6-4776-458a] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.

2022-06-29 11:47:30,915+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz has failed to move disk disk_TestCase10144_2911461916 to domain sd_TestCase10144_2911405538.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1.2-0.11.el8ev.noarch
vdsm-4.50.1.4-1.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI SD with 12G of free space.
2. Create a VM from a template, attach a new preallocated disk of 14G to it (on some other iSCSI SD), and run the VM.
3. LSM the VM's newly created disk to the new target SD (12G). The action should fail as expected because there is no room for 14G on that SD.
4. Extend that SD so it has 20G of free space.
5. LSM the same disk (14G) again to the extended iSCSI SD.

Actual results:
LSM fails again due to insufficient free space on the target SD.
The target SD is left with only 6G of free space, even though the disk failed to migrate to it.

Expected results:
LSM of the 14G disk should succeed after extending the target SD
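The free-space figures above are consistent with the failed destination volume never being removed: 20G free minus the 14G volume leaves exactly the 6G observed. A minimal bookkeeping sketch of the scenario (sizes in GiB are taken from the reproduction steps; the `StorageDomain` model is hypothetical, not engine code):

```python
# Hypothetical bookkeeping model of the target SD's free space across the two
# LSM attempts; sizes (GiB) come from the reproduction steps above.

class StorageDomain:
    def __init__(self, free_gib: float):
        self.free_gib = free_gib

    def allocate(self, size_gib: float) -> bool:
        """Reserve space for the destination volume; fail if it does not fit."""
        if size_gib > self.free_gib:
            return False
        self.free_gib -= size_gib
        return True

    def release(self, size_gib: float):
        """Cleanup path: give the space back after a failed migration."""
        self.free_gib += size_gib

sd = StorageDomain(free_gib=12)  # step 1: target SD with 12G free
assert not sd.allocate(14)       # step 3: first LSM is rejected up front

sd.free_gib = 20                 # step 4: SD extended to 20G free
sd.allocate(14)                  # step 5: second LSM creates the 14G volume...
# ...the migration then fails, but the buggy engine never calls sd.release(14),
# so the SD is left with only 6G free, exactly as in the actual results.
print(sd.free_gib)               # 6
```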

Comment 2 sshmulev 2022-06-29 13:08:35 UTC
This issue was not seen in RHV-4.4, adding regression keyword

Comment 3 Arik 2022-06-30 09:15:03 UTC
(In reply to sshmulev from comment #2)
> This issue was not seen in RHV-4.4, adding regression keyword

Does it reproduce if you change the test to have let's say not 20G free but 23G free at step 4?

Comment 4 Arik 2022-06-30 09:20:32 UTC
(In reply to Arik from comment #3)
> (In reply to sshmulev from comment #2)
> > This issue was not seen in RHV-4.4, adding regression keyword
> 
> Does it reproduce if you change the test to have let's say not 20G free but
> 23G free at step 4?

OK, I wrote that because the primary suspect is the allocation of 3 chunks on the destination now, but that's probably not the right question to ask - need to take a better look at the logs.
Leaving this to Pavel.

Comment 5 RHEL Program Management 2022-06-30 09:22:04 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 6 Avihai 2022-07-04 08:24:25 UTC
Marking blocker "-" -> not a blocker bug.

Comment 7 Benny Zlotnik 2022-07-14 08:38:21 UTC
The fix is likely to remove this check [1]; at this point the disk has already been created on the target.


[1] https://github.com/oVirt/ovirt-engine/blob/03cab35b639f5d2b1a4722ccd31eb46c85dbc798/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/lsm/LiveMigrateDiskCommand.java#L767

Comment 8 Benny Zlotnik 2022-07-27 12:28:43 UTC
The root cause is that the disk is not removed from the target after failure
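The shape of the fix described in comments 7 and 8 can be sketched roughly as follows. This is a hypothetical Python model, not the actual Java engine code; `SD` and `live_migrate` are made up for illustration:

```python
# Hypothetical illustration of the fix's general shape: free space is validated
# only BEFORE the destination volume is created; once it exists, the failure
# path must remove it unconditionally rather than re-run checks that can no
# longer pass. Not actual ovirt-engine code.

from dataclasses import dataclass

@dataclass
class SD:
    free_gib: float

def live_migrate(dst: SD, disk_gib: float) -> bool:
    """Sketch of an LSM toward dst whose copy phase always fails (this bug)."""
    if disk_gib > dst.free_gib:   # space check happens only before creation
        return False
    dst.free_gib -= disk_gib      # destination volume structure is created
    copy_succeeded = False        # simulate the copy phase failing
    if not copy_succeeded:
        # Fix: unconditional cleanup -- no further space checks here, the
        # volume already exists on dst and must simply be removed.
        dst.free_gib += disk_gib
        return False
    return True

dst = SD(free_gib=20)
assert live_migrate(dst, 14) is False
print(dst.free_gib)  # 20 -- the space is returned despite the failure
```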

Comment 12 Pavel Bar 2022-08-08 09:02:07 UTC
QE testing instructions:
Option 1:
Use the automation script that you already have, which performs additional steps, most of which are not important here.

Option 2:
1) Create a disk on iSCSI Storage Domain #1 with size X GiB.
2) Have an iSCSI Storage Domain #2 with available space X+3 GiB.
3) LSM the above disk from SD1 to SD2.

Result:
  Expected:
    The operation should have succeeded (SD2 has 3 GiB more space than required for the operation).
  In practice:
    The operation fails (that's another bug, bug #2116309) and, in addition, no cleanup is performed: SD2 is shown with 3 GiB of available space instead of the X+3 GiB one would expect, since the LSM has failed...

Note: this bug handles the cleanup part only.

Comment 13 sshmulev 2022-08-15 07:30:53 UTC
Verified.
Cleanup works after the failed migration attempt; the SD remains as it was before the migration failed.

Versions:
ovirt-engine-4.5.2.2-0.1.el8ev
vdsm-4.50.2.2-1.el8ev

Comment 14 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bug is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in oVirt 4.5.2, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.