Bug 2137207

Summary: The RemoveDisk job finishes before the disk was removed from the DB
Product: Red Hat Enterprise Virtualization Manager
Reporter: sshmulev
Component: ovirt-engine
Assignee: Mark Kemel <mkemel>
Status: CLOSED ERRATA
QA Contact: Shir Fishbain <sfishbai>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.5.3
CC: ahadas, mperina, sfishbai
Target Milestone: ovirt-4.5.3
Keywords: AutomationBlocker, Regression, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.5.3.2
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-16 12:17:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description sshmulev 2022-10-24 07:32:07 UTC
Description of problem:
The RemoveDisk job finishes before the disk is actually removed from the DB, which leads to spurious errors for anything that relies on the job status.
Some tests in our automation remove a VM right after this operation, and they fail because the disk is still locked.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.3.1-2.el8ev
vdsm-4.50.3.4-1.el8ev


How reproducible:
100%

Steps to Reproduce:
1. Create a VM and attach a disk to it
2. Remove the disk
3. Poll the jobs in the DB and, as soon as the RemoveDisk job is reported as done, remove the VM.
(Reproducing this may require running the steps from an automation flow.)
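
The ordering problem behind these steps can be illustrated with a toy in-memory model. This is only a sketch under assumptions: ToyEngine, its status strings, and tick() are hypothetical names made up for illustration, and none of this is ovirt-engine code. The model marks the job as finished one step before it deletes the disk row, which is the behavior reported here.

```python
class ToyEngine:
    """Toy in-memory stand-in for the engine; not oVirt code."""

    def __init__(self):
        self.disk_status = {"disk-1": "OK"}  # disk id -> status
        self.job_status = {}                 # job name -> status
        self._pending = []                   # deferred steps, run by tick()

    def remove_disk(self, disk_id):
        self.disk_status[disk_id] = "LOCKED"
        self.job_status["RemoveDisk"] = "STARTED"
        # Ordering reported in this bug: the job is marked finished
        # first, and the disk row is deleted in a later step.
        self._pending = [
            lambda: self.job_status.update(RemoveDisk="FINISHED"),
            lambda: self.disk_status.pop(disk_id),
        ]

    def tick(self):
        """Run the next deferred step, if any."""
        if self._pending:
            self._pending.pop(0)()

    def remove_vm(self):
        locked = [d for d, s in self.disk_status.items() if s == "LOCKED"]
        if locked:
            raise RuntimeError(f"cannot remove VM, disk locked: {locked}")
        return "VM removed"


def reproduce():
    engine = ToyEngine()
    engine.remove_disk("disk-1")
    # Step 3: poll the job table until RemoveDisk reports FINISHED.
    while engine.job_status["RemoveDisk"] != "FINISHED":
        engine.tick()
    # The job says FINISHED, but the disk row is still LOCKED.
    try:
        return engine.remove_vm()
    except RuntimeError as err:
        return f"failed: {err}"
```

In this model reproduce() ends in the failure branch, and one more tick() before remove_vm() is enough for the removal to succeed, which matches the small window described in the report.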

Actual results:
As a result of the fix for bug 1836318 (https://github.com/oVirt/ovirt-engine/pull/656), removing the VM fails because the disk is still locked: the disk removal is still in progress even though the DB already reports the job as done.

In our tier2 runs we have 161 failures, and 145 in tier3, due to this issue.
As a result we have many leftovers and none of the tests are valid for verification, which blocks us from delivering the version.

We tried adding a sleep after the remove-disk operation, but a fixed sleep is not reliable: the tests differ in disk size and flow, so it could lead to other failures, automation bugs, refactoring, and stabilization work. Since this is a global function used by different teams in RHV QE, changing it could also cause conflicts.
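
For reference, the usual alternative to a fixed sleep is to poll the resource itself with a timeout. The following is a minimal sketch, not what our automation does today: wait_for is a hypothetical helper, and the condition passed to it (for example a check that the disk no longer exists) would have to be supplied by the caller.

```python
import time


def wait_for(condition, timeout=60.0, interval=1.0):
    """Poll condition() until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would use it as wait_for(lambda: not disk_exists(disk_id), timeout=300), where disk_exists is an assumed helper that queries the engine. This adapts to disk size and flow, but every caller that relies on the job status would still need the extra check, so it does not replace a fix in the engine.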

In addition, we can't assume that a customer is unlikely to hit the same issue, because we can't know which flow they run after removing a disk.

Expected results:
When the RemoveDisk job is reported as done in the DB, the disk should be unlocked as well.

Comment 8 Arik 2022-10-28 20:31:53 UTC
From a functional point of view, the severity of this bug is rather low: the disk was removed a few milliseconds after the job completed, so it is unlikely to affect user flows. Setting the severity accordingly. However, this bug was prioritized because many test cases in our automation failed due to it, and adjusting those test cases would have been complicated.

Comment 9 sshmulev 2022-10-30 11:01:50 UTC
Verified.

Tier2 and tier3 have stabilized after this fix.
Test cases that failed because of this bug now pass again.

Versions:
rhv-4.5.3-4
ovirt-engine-4.5.3.2-1.el8ev.noarch
vdsm-4.50.3.4-1.el8ev.x86_64

Comment 13 errata-xmlrpc 2022-11-16 12:17:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.3] bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8502