Description of problem:
The vdsm storage monitor threads run the "vgs" command every 5 minutes for all storage domains, with a whitelist filter covering all multipath LUNs. If the "multipath -f" issued by the playbook executes at the same time as vdsm's "vgs", it fails with the error "map in use", because LVM is holding the LUN open. This is observed frequently in a large environment (60+ hosts, 20+ storage domains, and 100+ LUNs) when the customer runs the role to remove the LUNs.

Version-Release number of selected component (if applicable):
rhvm-4.4.8.6-0.1.el8ev.noarch

How reproducible:
Intermittent; hit in the large environment while removing LUNs from 60+ hosts.

Steps to Reproduce:
1. Use ovirt_remove_stale_lun to remove the LUNs. If the vdsm storage monitoring thread runs "vgs" at the same time as "multipath -f", the removal fails with the error "map in use".

Actual results:
multipath -f fails with a "map in use" error while removing the LUNs using "ovirt_remove_stale_lun".

Expected results:
ovirt_remove_stale_lun should be able to remove the LUNs.

Additional info:
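Since the "map in use" failure is transient (it clears once the concurrent "vgs" scan releases the device), a client-side retry loop is one way to tolerate the race. This is a minimal sketch; the remove_map_with_retry helper and its attempt/delay defaults are illustrative, not part of the role:

```shell
#!/usr/bin/env bash
# Sketch: retry "multipath -f <map>" a few times, since the "map in use"
# failure is transient and clears once the concurrent LVM scan finishes.
# remove_map_with_retry and its defaults are hypothetical, not from the role.
remove_map_with_retry() {
    local map=$1 attempts=${2:-5} delay=${3:-3} i
    for ((i = 1; i <= attempts; i++)); do
        # "multipath -f" flushes (removes) the given multipath map
        multipath -f "$map" && return 0
        sleep "$delay"
    done
    return 1
}
```

A later fix in the collection took a similar approach; the sketch above only illustrates the retry idea, not the actual patch.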
Can you please supply the verification flow required to verify this bug? We want a flow that resembles the customer's flow as closely as possible. To be more specific, please update on the following:
1. How to create the stale LUNs in the test setup.
2. Where was the ansible script executed on the customer side (from the engine?)?
3. Did the customer modify the ansible script in some way before running it?
4. The relevant commands that were used in the process/flow.
5. Is there some way to reproduce this error in a smaller environment? (QE doesn't have an environment with so many resources: 60+ hosts, 20+ storage domains, and 100+ LUNs.)
Thanks.
(In reply to Amit Sharir from comment #3)

> 1. How to create the stale luns in the setup of the test.

You can try to remove any LUNs that are mapped to the hosts but are not used by a storage domain or VM.

> 2. Where the ansible script was executed from on the customer side (from the engine?).

The engine.

> 3. Does the customer modify the ansible script in some way before running it?

No.

> 4. The relevant commands that were used in the process/flow.

The customer used the example playbook https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/remove_stale_lun/examples/remove_stale_lun.yml and changed the values to match the environment.

> 5. Is there some way to reproduce this error in a smaller environment? (QE doesn't have an environment with so many resources - 60+ hosts, 20+ storage domains, and 100+ LUNs).

We can make vdsm monitor the storage domains more aggressively by setting the values below in the vdsm configuration, so that it runs "vgs" every 2 seconds:

[irs]
repo_stats_cache_refresh_timeout = 2
sd_health_check_delay = 1

With these values I hit the issue in 1 out of 5 runs in my test environment (2 hosts, 2 SDs, 3 LUNs).

> Thanks.
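Independent of the vdsm tuning above, the "map in use" condition can be observed directly through device-mapper's open (reference) count, which is non-zero while LVM holds the LUN. This is a hypothetical pre-check sketch; map_is_free is illustrative and not part of the remove_stale_lun role:

```shell
#!/usr/bin/env bash
# Sketch: query device-mapper's open count for a multipath map.
# "multipath -f" fails with "map in use" whenever this count is non-zero.
# map_is_free is a hypothetical helper, not part of the role.
map_is_free() {
    local map=$1 open
    # "dmsetup info -c -o open" prints only the open (reference) count
    open=$(dmsetup info -c --noheadings -o open "$map" 2>/dev/null) || return 1
    [ "${open//[[:space:]]/}" = "0" ]
}
```

Running such a check just before "multipath -f" narrows, but does not eliminate, the race window; a retry is still needed for full reliability.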
Fixed in the nightly quay.io/ovirt/el8stream-ansible-executor:latest image as of today; it can be used with ansible 2.11.
Following #c14 and #c10, moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Engine and Host Common Packages [ovirt-4.4.10]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0463
According to the specific verification flow in comment 10 (https://bugzilla.redhat.com/show_bug.cgi?id=2023224#c10), some steps of the verification flow require operations on the LUNs from the "NetApp System Manager" UI (using the initiator mapping option), and we can't add those to our automation. There is a test case in our automation that covers removing a stale LUN from the hypervisor (TestCase27720).