Bug 2110186 - Restart of ovirt-engine while LSM is running causes LSM to get stuck
Summary: Restart of ovirt-engine while LSM is running causes LSM to get stuck
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.5.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.3
Assignee: Mark Kemel
QA Contact: Ilia Markelov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-24 10:10 UTC by Evelina Shames
Modified: 2022-10-03 19:00 UTC
CC: 6 users

Fixed In Version: ovirt-engine-4.5.3.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-03 19:00:53 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Attachments
engine.log (3.62 MB, text/plain)
2022-07-24 10:10 UTC, Evelina Shames


Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-engine pull 670 0 None open LSM: fix locking on restart 2022-09-22 15:07:33 UTC
Red Hat Issue Tracker RHV-47754 0 None None None 2022-07-24 10:23:23 UTC

Description Evelina Shames 2022-07-24 10:10:04 UTC
Created attachment 1899034 [details]
engine.log

Description of problem:
While verifying bug 2107985, ovirt-engine service restarts successfully but LSM gets stuck and disk remains locked.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.2-0.3.el8ev


How reproducible:
Always

Steps to Reproduce:
Restart ovirt-engine service during LSM operation


Actual results:
LSM gets stuck and disk remains locked

Expected results:
LSM should finish successfully and disk should not be locked

Additional info:
Attaching engine.log

Comment 1 Benny Zlotnik 2022-07-24 10:49:51 UTC
It looks like the root cause is that the lock is reacquired after restart by LiveMigrateDiskCommand. This can probably be resolved either by overriding reacquireLocks so that the command does not lock again if snapshot creation has already started, or by removing the command locks entirely, since their acquisition is handled in MoveDiskCommand.
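The first option above (overriding reacquireLocks) can be illustrated with a minimal, self-contained sketch. The class and field names below are illustrative assumptions, not the real ovirt-engine API; only the reacquireLocks hook name comes from the comment above.

```java
// Simplified model of the "override reacquireLocks" idea: on engine
// restart, a restored command normally re-takes its locks; this sketch
// skips that step when the snapshot-creation child step already started,
// since the lock from the original run would otherwise be taken twice.
abstract class CommandSketch {
    // Called for commands restored after an engine restart; by default it
    // simply acquires the lock again.
    boolean reacquireLocks() {
        return acquireLock();
    }

    abstract boolean acquireLock();
}

class LiveMigrateDiskCommandSketch extends CommandSketch {
    private final boolean snapshotCreateStarted;
    private boolean lockHeld;

    LiveMigrateDiskCommandSketch(boolean snapshotCreateStarted) {
        this.snapshotCreateStarted = snapshotCreateStarted;
    }

    @Override
    boolean acquireLock() {
        lockHeld = true;
        return true;
    }

    @Override
    boolean reacquireLocks() {
        // If snapshot creation already started before the restart, do not
        // lock again; report success so the flow can continue polling.
        if (snapshotCreateStarted) {
            return true;
        }
        return super.reacquireLocks();
    }

    boolean isLockHeld() {
        return lockHeld;
    }
}
```

The second option (dropping the command locks because MoveDiskCommand already acquires them) removes the double-acquisition problem at the source and is the direction the fix ultimately took (see comment 5).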

Comment 2 Arik 2022-07-25 14:09:07 UTC
We suspect this is not a new issue; rather, the timing window of restarting the engine during create-snapshot, which is a fairly quick operation compared to copying the disk, made us miss this before.

Comment 3 Pavel Bar 2022-09-01 13:15:50 UTC
I also suggest updating the log message.
The current message, "Failed to acquire VM lock, will retry on the next polling cycle", is a little confusing.
For example, in this case the actual failure is acquiring a disk lock (exclusive), not a VM lock (shared).
I suggest logging the exact failure, as it can be retrieved from the "LockingResult" returned by the "acquireLock()" call in the "LiveDiskMigrateStage.LIVE_MIGRATE_DISK_EXEC_COMPLETED" phase.
Just a thought :)
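The logging suggestion above can be sketched as follows. LockingResultSketch and its fields are assumptions for illustration; the real ovirt-engine LockingResult API may expose the failure details differently.

```java
// Sketch of comment 3's suggestion: report which lock actually failed
// (and how) instead of a generic "Failed to acquire VM lock" message.
class LockingResultSketch {
    final boolean acquired;
    final String failedLockId;   // e.g. the busy disk or VM id
    final String failedLockType; // e.g. "DISK (exclusive)" or "VM (shared)"

    LockingResultSketch(boolean acquired, String failedLockId, String failedLockType) {
        this.acquired = acquired;
        this.failedLockId = failedLockId;
        this.failedLockType = failedLockType;
    }
}

class LockLogMessage {
    static String describe(LockingResultSketch result) {
        if (result.acquired) {
            return "Lock acquired";
        }
        // Name the exact lock that failed, rather than always blaming the VM lock.
        return String.format(
                "Failed to acquire %s lock on '%s', will retry on the next polling cycle",
                result.failedLockType, result.failedLockId);
    }
}
```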

Comment 4 Casper (RHV QE bot) 2022-09-01 13:35:05 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align the relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 5 Arik 2022-09-22 15:07:34 UTC
(In reply to Benny Zlotnik from comment #1)
> It looks like the root cause is that the lock is reacquired after restart by
> LiveMigrateDiskCommand. This can probably be resolved either by overriding
> reacquireLocks so that the command does not lock again if snapshot creation
> has already started, or by removing the command locks entirely, since their
> acquisition is handled in MoveDiskCommand.

Right, we chose to go with the latter.

Comment 6 Casper (RHV QE bot) 2022-10-03 19:00:53 UTC
This bug has low overall severity and passed an automated regression suite, and is not going to be further verified by QE. If you believe special care is required, feel free to re-open to ON_QA status.

