Description of problem:
The vdsm storage monitor threads run the "vgs" command every 5 minutes for all storage domains, with a whitelist filter covering all multipath LUNs. If the "multipath -f" issued by the playbook executes at the same time as vdsm's "vgs", it fails with the error "map in use", because LVM is holding the LUN open. This is observed frequently in a large environment (60+ hosts, 20+ storage domains, and 100+ LUNs) when the customer runs the role to remove the LUNs.

Version-Release number of selected component (if applicable):
rhvm-4.4.8.6-0.1.el8ev.noarch

How reproducible:
Intermittent; hit in the large environment while removing LUNs from 60+ hosts.

Steps to Reproduce:
1. Use ovirt_remove_stale_lun to remove the LUNs. If the vdsm storage monitoring thread runs "vgs" at the same time as "multipath -f", the removal fails with the error "map in use".

Actual results:
multipath -f fails with a "map in use" error while removing the LUNs using "ovirt_remove_stale_lun".

Expected results:
ovirt_remove_stale_lun should be able to remove the LUNs.

Additional info:
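Since the "map in use" failure is transient (it clears once the concurrent "vgs" scan releases the device), a client-side retry loop is one way to tolerate the race. This is a minimal sketch; the remove_map_with_retry helper and its attempt/delay defaults are illustrative, not part of the role:

```shell
#!/usr/bin/env bash
# Sketch: retry "multipath -f <map>" a few times, since the "map in use"
# failure is transient and clears once the concurrent LVM scan finishes.
# remove_map_with_retry and its defaults are hypothetical, not from the role.
remove_map_with_retry() {
    local map=$1 attempts=${2:-5} delay=${3:-3} i
    for ((i = 1; i <= attempts; i++)); do
        # "multipath -f" flushes (removes) the given multipath map
        multipath -f "$map" && return 0
        sleep "$delay"
    done
    return 1
}
```

A later fix in the collection took a similar approach; the sketch above only illustrates the retry idea, not the actual patch.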
Can you please supply the verification flow required to verify this bug? We want a flow that resembles the customer's flow as closely as possible. To be more specific, please update on the following:
1. How to create the stale LUNs in the test setup.
2. Where was the ansible script executed on the customer side (from the engine?)?
3. Did the customer modify the ansible script in some way before running it?
4. The relevant commands that were used in the process/flow.
5. Is there some way to reproduce this error in a smaller environment? (QE doesn't have an environment with so many resources: 60+ hosts, 20+ storage domains, and 100+ LUNs.)
Thanks.
(In reply to Amit Sharir from comment #3)

> 1. How to create the stale luns in the setup of the test.

You can try to remove any LUNs that are mapped to the hosts but are not used by a storage domain or VM.

> 2. Where the ansible script was executed from on the customer side (from the engine?).

The engine.

> 3. Does the customer modify the ansible script in some way before running it?

No.

> 4. The relevant commands that were used in the process/flow.

The customer used the example playbook https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/remove_stale_lun/examples/remove_stale_lun.yml and changed the values to match the environment.

> 5. Is there some way to reproduce this error in a smaller environment? (QE doesn't have an environment with so many resources - 60+ hosts, 20+ storage domains, and 100+ LUNs).

We can make vdsm monitor the storage domains more aggressively by setting the values below in the vdsm configuration, so that it runs "vgs" every 2 seconds:

[irs]
repo_stats_cache_refresh_timeout = 2
sd_health_check_delay = 1

With these values I hit the issue in 1 out of 5 runs in my test environment (2 hosts, 2 SDs, 3 LUNs).

> Thanks.
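Independent of the vdsm tuning above, the "map in use" condition can be observed directly through device-mapper's open (reference) count, which is non-zero while LVM holds the LUN. This is a hypothetical pre-check sketch; map_is_free is illustrative and not part of the remove_stale_lun role:

```shell
#!/usr/bin/env bash
# Sketch: query device-mapper's open count for a multipath map.
# "multipath -f" fails with "map in use" whenever this count is non-zero.
# map_is_free is a hypothetical helper, not part of the role.
map_is_free() {
    local map=$1 open
    # "dmsetup info -c -o open" prints only the open (reference) count
    open=$(dmsetup info -c --noheadings -o open "$map" 2>/dev/null) || return 1
    [ "${open//[[:space:]]/}" = "0" ]
}
```

Running such a check just before "multipath -f" narrows, but does not eliminate, the race window; a retry is still needed for full reliability.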
Fixed in the nightly quay.io/ovirt/el8stream-ansible-executor:latest image as of today; it can be used with ansible 2.11.
Following #c14 and #c10, moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Engine and Host Common Packages [ovirt-4.4.10]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0463
According to the specific verification flow in comment 10 (https://bugzilla.redhat.com/show_bug.cgi?id=2023224#c10), some steps of the verification flow require operations on the LUNs from the "NetApp System Manager" UI (using the initiator mapping option), and we can't add those to our automation. There is a test case in our automation that covers removing a stale LUN from the hypervisor (TestCase27720).