Bug 2110186

Summary:

Restart of ovirt-engine while LSM is running causes LSM to get stuck

Product:

[oVirt] ovirt-engine

Reporter:

Evelina Shames <eshames>

Component:

BLL.Storage

Assignee:

Mark Kemel <mkemel>

Status:

CLOSED NEXTRELEASE

QA Contact:

Ilia Markelov <imarkelo>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

4.5.2

CC:

ahadas, bugs, bzlotnik, dfodor, pbar, sfishbai

Target Milestone:

ovirt-4.5.3

Flags:

pm-rhel: ovirt-4.5?

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

ovirt-engine-4.5.3.1

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2022-10-03 19:00:53 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
engine.log	none

Description Evelina Shames 2022-07-24 10:10:04 UTC

Created attachment 1899034
engine.log

Description of problem:
While verifying bug 2107985, the ovirt-engine service restarts successfully, but LSM (live storage migration) gets stuck and the disk remains locked.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.2-0.3.el8ev


How reproducible:
Always

Steps to Reproduce:
Restart ovirt-engine service during LSM operation


Actual results:
LSM gets stuck and disk remains locked

Expected results:
LSM should finish successfully and disk should not be locked

Additional info:
Attaching engine.log

Comment 1 Benny Zlotnik 2022-07-24 10:49:51 UTC

It looks like the root cause is that the lock is reacquired by LiveMigrateDiskCommand after the restart. This can probably be resolved either by overriding reacquireLocks so that it does not lock again if snapshot creation has already started, or by removing the command locks entirely, since their acquisition is handled in MoveDiskCommand.
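
To illustrate the suspected failure mode, here is a minimal toy model, not the engine's actual code. It assumes a non-reentrant, in-memory exclusive lock and a later stage that polls for the same disk lock. ToyLockManager, pollsUntilStageLock and the "disk-1" id are hypothetical names made up for this sketch.

```java
import java.util.HashSet;
import java.util.Set;

final class LockReacquireSketch {

    /** Hypothetical stand-in: non-reentrant exclusive locks keyed by entity id. */
    static final class ToyLockManager {
        private final Set<String> held = new HashSet<>();

        /** Returns true only if the entity was not locked before this call. */
        boolean tryAcquire(String entityId) {
            return held.add(entityId);
        }
    }

    /**
     * Simulates the flow after an engine restart.
     * Returns the poll number on which the stage obtained the disk lock,
     * or -1 if it never did within maxPolls.
     */
    static int pollsUntilStageLock(boolean reacquireOnRestart, int maxPolls) {
        // In-memory lock state is empty right after the restart.
        ToyLockManager locks = new ToyLockManager();
        if (reacquireOnRestart) {
            // Assumption: the command-level lock is taken again on startup.
            locks.tryAcquire("disk-1");
        }
        for (int poll = 1; poll <= maxPolls; poll++) {
            if (locks.tryAcquire("disk-1")) {
                return poll;
            }
            // Nothing releases the reacquired lock, so the next poll fails too.
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(pollsUntilStageLock(true, 5));  // prints -1
        System.out.println(pollsUntilStageLock(false, 5)); // prints 1
    }
}
```

In this model, dropping the reacquisition lets the polling stage get the lock on its first attempt; whether the real flow behaves the same way depends on details the sketch leaves out.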

Comment 2 Arik 2022-07-25 14:09:07 UTC

We suspect this is not a new issue: the engine has to be restarted during create-snapshot, which is a fairly quick operation compared to copying the disk, so the timing made us miss this before.

Comment 3 Pavel Bar 2022-09-01 13:15:50 UTC

I also suggest updating the log message.
The current "Failed to acquire VM lock, will retry on the next polling cycle" is a little confusing.
For example, in this case the actual failure is acquiring the disk lock (exclusive), not the VM lock (shared).
I suggest logging the exact failure, which can be retrieved from the "LockingResult" returned by the "acquireLock()" call in the "LiveDiskMigrateStage.LIVE_MIGRATE_DISK_EXEC_COMPLETED" phase.
Just a thought :)
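
A rough sketch of what such a message could look like, assuming the lock result exposes a success flag and a list of conflict descriptions. ToyLockingResult and describe are hypothetical stand-ins for illustration only; they do not mirror the engine's real LockingResult API.

```java
import java.util.List;

final class LockFailureLogSketch {

    /** Hypothetical stand-in for a lock result: success flag plus conflict reasons. */
    static final class ToyLockingResult {
        final boolean acquired;
        final List<String> conflicts;

        ToyLockingResult(boolean acquired, List<String> conflicts) {
            this.acquired = acquired;
            this.conflicts = conflicts;
        }
    }

    /** Builds a log line that names the lock that actually failed. */
    static String describe(ToyLockingResult result) {
        if (result.acquired) {
            return "Lock acquired";
        }
        String reason = result.conflicts.isEmpty()
                ? "unknown reason"
                : String.join(", ", result.conflicts);
        return "Failed to acquire lock (" + reason
                + "), will retry on the next polling cycle";
    }

    public static void main(String[] args) {
        ToyLockingResult failed = new ToyLockingResult(
                false, List.of("disk 'disk-1' is exclusively locked"));
        System.out.println(describe(failed));
    }
}
```

The point is only that the message is derived from the result of the failed call rather than hard-coding "VM lock".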

Comment 4 Casper (RHV QE bot) 2022-09-01 13:35:05 UTC

This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above verification threshold (1000).

Comment 5 Arik 2022-09-22 15:07:34 UTC

(In reply to Benny Zlotnik from comment #1)
> It looks like the root cause is the lock is reacquired after restart by
> LiveMigrateDiskCommand, this can probably be resolved by either overriding
> reacquireLocks and not locking again if snapshot create has already started.
> Or by removing the command locks entirely since their acquisition is handled
> in MoveDiskCommand

Right, we chose to go with the latter.

Comment 6 Casper (RHV QE bot) 2022-10-03 19:00:53 UTC

This bug has low overall severity and passed an automated regression suite, and is not going to be further verified by QE. If you believe special care is required, feel free to re-open to ON_QA status.