Bug 2023224

Summary: multipath -f fails with "map in use" error while removing the LUNs using "ovirt_remove_stale_lun"
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: ovirt-ansible-collection
Assignee: Vojtech Juranek <vjuranek>
Status: CLOSED ERRATA
QA Contact: Amit Sharir <asharir>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.4.8
CC: aefrat, ahadas, apinnick, ddacosta, gveitmic, lsvaty, mgandhi, michal.skrivanek, mperina, sfishbai, vjuranek
Target Milestone: ovirt-4.4.10
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: ovirt-ansible-collection-1.6.6
Doc Type: Enhancement
Doc Text:
Previously, when running the 'ovirt_remove_stale_lun' Ansible role, removal of the multipath device map could fail because of a conflict with a concurrent VGS scan. In the current release, the 'ovirt_remove_stale_lun' role retries the multipath map removal up to six times, allowing the removal to succeed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-02-08 10:07:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description nijin ashok 2021-11-15 09:07:15 UTC
Description of problem:

The vdsm storage monitor threads run the "vgs" command every 5 minutes for all SDs, with a whitelist filter covering all multipath LUNs. If the "multipath -f" issued by the playbook executes at the same time as vdsm's "vgs" command, it fails with the error "map in use", since LVM is holding the LUN open. This is observed frequently in a large environment (60+ hosts, 20+ storage domains, and 100+ LUNs) when the customer runs the role to remove the LUNs.
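The race can be illustrated with a toy sketch that involves no real multipath or LVM: a shared file lock stands in for LVM holding the LUN during a "vgs" scan, and a failed exclusive lock stands in for "multipath -f" reporting "map in use". All paths here are placeholders.

```shell
# Toy illustration of the race (no real devices involved). The
# background shared lock plays the role of the "vgs" scan holding the
# LUN; the foreground non-blocking exclusive lock plays the role of
# "multipath -f" trying to flush the map while it is held.
LOCK=/tmp/fake_lun.lock          # placeholder path
touch "$LOCK"
flock -s "$LOCK" sleep 2 &       # the "vgs" scan, holding the device
sleep 0.2
STATUS=removed
if ! flock -n -x "$LOCK" true; then
    STATUS="map in use"          # what "multipath -f" would report
fi
echo "$STATUS"
wait
```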


Version-Release number of selected component (if applicable):

rhvm-4.4.8.6-0.1.el8ev.noarch

How reproducible:

Hit intermittently in a large environment while removing LUNs from 60+ hosts.

Steps to Reproduce:

1. Use ovirt_remove_stale_lun to remove the LUNs. If the vdsm storage monitoring thread runs "vgs" at the same time as "multipath -f", the removal fails with the error "map in use".
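The eventual fix retries the map removal a fixed number of times. A minimal shell sketch of such a retry helper (hypothetical, not the actual role code) under the assumption that the "map in use" failure is transient:

```shell
# Hypothetical retry helper (not the actual role code): retry a
# command up to a given number of times with a short pause, the way
# the fixed role retries the multipath map removal so a transient
# "map in use" error does not abort the LUN removal.
retry() {
    tries=$1
    shift
    i=1
    while [ "$i" -le "$tries" ]; do
        "$@" && return 0         # command succeeded, stop retrying
        i=$((i + 1))
        sleep 1                  # brief pause before the next attempt
    done
    return 1                     # all attempts failed
}

# Real usage would be along the lines of: retry 6 multipath -f "$wwid"
```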


Actual results:

multipath -f fails with "map in use" error while removing the LUNs using "ovirt_remove_stale_lun"

Expected results:

ovirt_remove_stale_lun should be able to remove the LUNs.

Additional info:

Comment 3 Amit Sharir 2021-11-17 13:11:41 UTC
Can you please supply the verification flow required to verify this bug?
We want a flow that resembles the customer's flow as closely as possible.

To be more specific - please update on the following:

1. How to create the stale LUNs in the test setup.
2. Where the ansible script was executed from on the customer side (from the engine?).
3. Does the customer modify the ansible script in some way before running it?
4. The relevant commands that were used in the process/flow.
5. Is there some way to reproduce this error in a smaller environment? (QE doesn't have an environment with so many resources - 60+ hosts, 20+ storage domains, and 100+ LUNs). 

Thanks.

Comment 5 nijin ashok 2021-11-18 03:33:33 UTC
(In reply to Amit Sharir from comment #3)

> 1. How to create the stale luns in the setup of the test.

You can try to remove any LUNs that are mapped to the hosts but are not used by a storage domain or VM.

> 2. Where the ansible script was executed from on the customer side (from the
> engine?).

engine.

> 3. Does the customer modify the ansible script in some way before running it?

No.

> 4. The relevant commands that were used in the process/flow.

Used the example yml https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/remove_stale_lun/examples/remove_stale_lun.yml and changed the values to match with the environment.

> 5. Is there some way to reproduce this error in a smaller environment? (QE
> doesn't have an environment with so many resources - 60+ hosts, 20+ storage
> domains, and 100+ LUNs). 

We can make vdsm monitor SDs more aggressively by setting the values below in the vdsm configuration, so that it runs "vgs" every 2 seconds.

[irs]
repo_stats_cache_refresh_timeout=2
sd_health_check_delay=1

I hit the issue on 1 out of 5 runs in my test environment after setting the above values. The environment had 2 hosts, 2 SDs, and 3 LUNs.
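One way to apply the [irs] settings above is via a vdsm config drop-in (a sketch: the drop-in directory path and file name are assumptions about the installation, and the service restart is left commented out):

```shell
# Sketch of applying the aggressive-monitoring settings via a vdsm
# drop-in config file. The default path /etc/vdsm/vdsm.conf.d is an
# assumption about the installation; CONF_DIR can be overridden.
CONF_DIR=${CONF_DIR:-/etc/vdsm/vdsm.conf.d}
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/99-aggressive-monitoring.conf" <<'EOF'
[irs]
repo_stats_cache_refresh_timeout=2
sd_health_check_delay=1
EOF
# systemctl restart vdsmd   # uncomment on a real host
```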


Comment 8 Michal Skrivanek 2021-11-24 08:00:14 UTC
Fixed in the nightly quay.io/ovirt/el8stream-ansible-executor:latest as of today; it can be used with Ansible 2.11.

Comment 16 Amit Sharir 2021-12-28 09:50:30 UTC
Following #c14 and #c10, moving to verified.

Comment 21 errata-xmlrpc 2022-02-08 10:07:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages [ovirt-4.4.10]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0463

Comment 22 Shir Fishbain 2022-03-02 19:50:05 UTC
According to the specific verification flow in comment 10: https://bugzilla.redhat.com/show_bug.cgi?id=2023224#c10
Some steps in the verification flow require operations on the LUNs from the "NetApp System Manager" UI, and we can't add them to our automation. These operations use the initiator-mapping option in the "NetApp System Manager" UI.

There is a TC in our automation that covers removing a stale LUN from the hypervisor (TestCase27720).