Bug 2110186 - Restart of ovirt-engine while LSM is running causes LSM to get stuck
Summary: Restart of ovirt-engine while LSM is running causes LSM to get stuck
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.5.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.3
Assignee: Mark Kemel
QA Contact: Ilia Markelov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-24 10:10 UTC by Evelina Shames
Modified: 2022-10-03 19:00 UTC
CC: 6 users

Fixed In Version: ovirt-engine-4.5.3.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-03 19:00:53 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Attachments
engine.log (3.62 MB, text/plain)
2022-07-24 10:10 UTC, Evelina Shames


Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-engine pull 670 0 None open LSM: fix locking on restart 2022-09-22 15:07:33 UTC
Red Hat Issue Tracker RHV-47754 0 None None None 2022-07-24 10:23:23 UTC

Description Evelina Shames 2022-07-24 10:10:04 UTC
Created attachment 1899034 [details]
engine.log

Description of problem:
While verifying bug 2107985, ovirt-engine service restarts successfully but LSM gets stuck and disk remains locked.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.2-0.3.el8ev


How reproducible:
Always

Steps to Reproduce:
Restart ovirt-engine service during LSM operation


Actual results:
LSM gets stuck and disk remains locked

Expected results:
LSM should finish successfully and disk should not be locked

Additional info:
Attaching engine.log

Comment 1 Benny Zlotnik 2022-07-24 10:49:51 UTC
It looks like the root cause is that the lock is reacquired after restart by LiveMigrateDiskCommand. This can probably be resolved either by overriding reacquireLocks so that the command does not lock again if snapshot creation has already started, or by removing the command locks entirely, since their acquisition is handled in MoveDiskCommand.
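The first option above (overriding reacquireLocks) can be illustrated with a minimal, self-contained sketch. The class and field names below are illustrative assumptions, not the real ovirt-engine API; only the reacquireLocks hook name comes from the comment above.

```java
// Simplified model of the "override reacquireLocks" idea: on engine
// restart, a restored command normally re-takes its locks; this sketch
// skips that step when the snapshot-creation child step already started,
// since the lock from the original run would otherwise be taken twice.
abstract class CommandSketch {
    // Called for commands restored after an engine restart; by default it
    // simply acquires the lock again.
    boolean reacquireLocks() {
        return acquireLock();
    }

    abstract boolean acquireLock();
}

class LiveMigrateDiskCommandSketch extends CommandSketch {
    private final boolean snapshotCreateStarted;
    private boolean lockHeld;

    LiveMigrateDiskCommandSketch(boolean snapshotCreateStarted) {
        this.snapshotCreateStarted = snapshotCreateStarted;
    }

    @Override
    boolean acquireLock() {
        lockHeld = true;
        return true;
    }

    @Override
    boolean reacquireLocks() {
        // If snapshot creation already started before the restart, do not
        // lock again; report success so the flow can continue polling.
        if (snapshotCreateStarted) {
            return true;
        }
        return super.reacquireLocks();
    }

    boolean isLockHeld() {
        return lockHeld;
    }
}
```

The second option (dropping the command locks because MoveDiskCommand already acquires them) removes the double-acquisition problem at the source and is the direction the fix ultimately took (see comment 5).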

Comment 2 Arik 2022-07-25 14:09:07 UTC
We suspect this is not a new issue; rather, the timing window of restarting the engine during create-snapshot, which is a fairly quick operation compared to copying the disk, made us miss this before.

Comment 3 Pavel Bar 2022-09-01 13:15:50 UTC
I also suggest updating the log message.
The current message, "Failed to acquire VM lock, will retry on the next polling cycle", is a little confusing.
For example, in this case the actual failure is acquiring a disk lock (exclusive), not a VM lock (shared).
I suggest logging the exact failure, as it can be retrieved from the "LockingResult" returned by the "acquireLock()" call in the "LiveDiskMigrateStage.LIVE_MIGRATE_DISK_EXEC_COMPLETED" phase.
Just a thought :)
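The logging suggestion above can be sketched as follows. LockingResultSketch and its fields are assumptions for illustration; the real ovirt-engine LockingResult API may expose the failure details differently.

```java
// Sketch of comment 3's suggestion: report which lock actually failed
// (and how) instead of a generic "Failed to acquire VM lock" message.
class LockingResultSketch {
    final boolean acquired;
    final String failedLockId;   // e.g. the busy disk or VM id
    final String failedLockType; // e.g. "DISK (exclusive)" or "VM (shared)"

    LockingResultSketch(boolean acquired, String failedLockId, String failedLockType) {
        this.acquired = acquired;
        this.failedLockId = failedLockId;
        this.failedLockType = failedLockType;
    }
}

class LockLogMessage {
    static String describe(LockingResultSketch result) {
        if (result.acquired) {
            return "Lock acquired";
        }
        // Name the exact lock that failed, rather than always blaming the VM lock.
        return String.format(
                "Failed to acquire %s lock on '%s', will retry on the next polling cycle",
                result.failedLockType, result.failedLockId);
    }
}
```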

Comment 4 Casper (RHV QE bot) 2022-09-01 13:35:05 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align the relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 5 Arik 2022-09-22 15:07:34 UTC
(In reply to Benny Zlotnik from comment #1)
> It looks like the root cause is that the lock is reacquired after restart by
> LiveMigrateDiskCommand. This can probably be resolved either by overriding
> reacquireLocks so that the command does not lock again if snapshot creation
> has already started, or by removing the command locks entirely, since their
> acquisition is handled in MoveDiskCommand.

Right, we chose to go with the latter.

Comment 6 Casper (RHV QE bot) 2022-10-03 19:00:53 UTC
This bug has low overall severity and passed an automated regression suite, and is not going to be further verified by QE. If you believe special care is required, feel free to re-open to ON_QA status.

