Bug 2023224 - multipath -f fails with "map in use" error while removing the LUNs using "ovirt_remove_stale_lun"
Summary: multipath -f fails with "map in use" error while removing the LUNs using "ovi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-ansible-collection
Version: 4.4.8
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.4.10
Assignee: Vojtech Juranek
QA Contact: Amit Sharir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-15 09:07 UTC by nijin ashok
Modified: 2022-03-02 19:50 UTC
CC List: 11 users

Fixed In Version: ovirt-ansible-collection-1.6.6
Doc Type: Enhancement
Doc Text:
Previously, when running the 'ovirt_remove_stale_lun' Ansible role, the removal of the multipath device map could fail because of a conflict with a VGS scan. In the current release, the 'ovirt_remove_stale_lun' role retries the removal of the multipath device map up to six times, allowing the removal to succeed.
Clone Of:
Environment:
Last Closed: 2022-02-08 10:07:34 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-ansible-collection pull 382 0 None Merged ovirt_remove_stale_lun: Retry "multipath -f" while removing the LUNs 2021-11-23 09:45:40 UTC
Github oVirt ovirt-ansible-collection pull 389 0 None Merged Backport: ovirt_remove_stale_lun: Retry "multipath -f" while removing the LUNs 2021-11-23 12:54:58 UTC
Red Hat Issue Tracker RHV-44011 0 None None None 2021-11-15 09:09:53 UTC
Red Hat Product Errata RHBA-2022:0463 0 None None None 2022-02-08 10:07:39 UTC

Description nijin ashok 2021-11-15 09:07:15 UTC
Description of problem:

The vdsm storage monitor threads run the "vgs" command every 5 minutes for all SDs, with a whitelist filter containing all multipath LUNs. If the "multipath -f" from the playbook executes at the same time as vdsm runs the "vgs" command, it fails with the error "map in use" because LVM is still holding the LUN. This is observed frequently in a large environment (60+ hosts, 20+ storage domains, and 100+ LUNs) when the customer runs the role to remove the LUNs.


Version-Release number of selected component (if applicable):

rhvm-4.4.8.6-0.1.el8ev.noarch

How reproducible:

Hit intermittently in a large environment while removing LUNs from 60+ hosts.

Steps to Reproduce:

1. Use ovirt_remove_stale_lun to remove the LUNs. If the vdsm storage monitoring thread runs "vgs" at the same time as "multipath -f", the removal fails with the error "map in use".


Actual results:

multipath -f fails with "map in use" error while removing the LUNs using "ovirt_remove_stale_lun"

Expected results:

ovirt_remove_stale_lun should be able to remove the LUNs.

Additional info:

Comment 3 Amit Sharir 2021-11-17 13:11:41 UTC
Can you please supply the verification flow required to verify this bug?
We want a flow that resembles the customer's flow as closely as possible.

To be more specific - please update on the following:

1. How to create the stale luns in the setup of the test.
2. Where the ansible script was executed from on the customer side (from the engine?).
3. Does the customer modify the ansible script in some way before running it?
4. The relevant commands that were used in the process/flow.
5. Is there some way to reproduce this error in a smaller environment? (QE doesn't have an environment with so many resources - 60+ hosts, 20+ storage domains, and 100+ LUNs). 

Thanks.

Comment 5 nijin ashok 2021-11-18 03:33:33 UTC
(In reply to Amit Sharir from comment #3)

> 1. How to create the stale luns in the setup of the test.

You can try removing any LUNs that are mapped to the hosts but are not used by a storage domain or a VM.

> 2. Where the ansible script was executed from on the customer side (from the
> engine?).

engine.

> 3. Does the customer modify the ansible script in some way before running it?

No.

> 4. The relevant commands that were used in the process/flow.

Used the example yml https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/remove_stale_lun/examples/remove_stale_lun.yml and changed the values to match the environment.
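
For completeness, the invocation is essentially that example with site-specific values filled in; a rough sketch follows (the variable names below are illustrative only; check the linked example for the exact variables the role expects):

---
- name: Remove stale LUNs from all hosts in a data center
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    engine_fqdn: rhvm.example.com                     # assumption: engine address
    engine_user: admin@internal
    engine_password: "{{ vault_engine_password }}"    # assumption: taken from a vault
    data_center: golf                                 # assumption: variable name may differ
    lun_wwids:                                        # assumption: variable name may differ
      - 36001405aabbccdd0000000000000000001
  roles:
    - ovirt.ovirt.remove_stale_lun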

> 5. Is there some way to reproduce this error in a smaller environment? (QE
> doesn't have an environment with so many resources - 60+ hosts, 20+ storage
> domains, and 100+ LUNs). 

We can ask vdsm to monitor SDs more aggressively by setting the values below in the vdsm configuration so that it runs "vgs" every 2 seconds.

[irs]
repo_stats_cache_refresh_timeout=2
sd_health_check_delay=1

I hit the issue on 1 out of 5 runs in my test environment after setting the above values. It was an environment with 2 hosts, 2 SDs, and 3 LUNs.
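
If it helps to automate this in a test environment, the two settings can be pushed to the hosts and vdsmd restarted with a couple of tasks. A rough sketch, assuming /etc/vdsm/vdsm.conf is the active configuration file and that restarting vdsmd on a test host is acceptable:

- name: Make vdsm storage monitoring run vgs roughly every 2 seconds (test hosts only)
  hosts: hypervisors                 # assumption: inventory group name
  become: true
  tasks:
    - name: Set aggressive monitoring intervals in the [irs] section
      community.general.ini_file:
        path: /etc/vdsm/vdsm.conf
        section: irs
        option: "{{ item.option }}"
        value: "{{ item.value }}"
      loop:
        - { option: repo_stats_cache_refresh_timeout, value: "2" }
        - { option: sd_health_check_delay, value: "1" }

    - name: Restart vdsmd so the new intervals take effect
      ansible.builtin.service:
        name: vdsmd
        state: restarted

Restarting vdsmd interrupts storage monitoring on that host, so this is only meant for a disposable test setup.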


Comment 8 Michal Skrivanek 2021-11-24 08:00:14 UTC
Fixed in the nightly quay.io/ovirt/el8stream-ansible-executor:latest as of today; it can be used with Ansible 2.11.
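
For reference, the change merged upstream (pull requests 382 and 389 in the Links section) makes the role retry "multipath -f" instead of failing on the first "map in use" hit. A minimal sketch of such a retried task, written here with illustrative names rather than the exact task from the role:

# Sketch only -- the real task lives in the upstream remove_stale_lun role;
# the variable name, delay, and failure handling here are assumptions.
- name: Flush stale multipath map, retrying while LVM briefly holds the LUN
  ansible.builtin.command: "multipath -f {{ lun_wwid }}"
  register: multipath_flush
  until: multipath_flush.rc == 0
  retries: 6        # matches the "retried six times" behaviour in the Doc Text
  delay: 5          # seconds between attempts; value is an assumption
  become: true

With a do/until loop like this, an attempt that collides with the periodic "vgs" scan is simply retried a few seconds later, by which time LVM has usually released the device.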

Comment 16 Amit Sharir 2021-12-28 09:50:30 UTC
Following #c14 and #c10, moving to verified.

Comment 21 errata-xmlrpc 2022-02-08 10:07:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages [ovirt-4.4.10]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0463

Comment 22 Shir Fishbain 2022-03-02 19:50:05 UTC
According to the specific verification flow in comment 10: https://bugzilla.redhat.com/show_bug.cgi?id=2023224#c10
Some steps in the verification flow require operations on the LUNs from the "NetApp System Manager" UI (they use the initiator mapping option), and we cannot add those steps to our automation.

There is a test case in our automation that covers removing a stale LUN from the hypervisor (TestCase27720).

