Description of problem:

After removing a multipath FCP-based storage domain in RHEV-M, some hosts still see the device as in use and 'multipath -f' fails to flush the device. This happens on the hosts other than the one selected in RHEV-M for the removal operation.

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.13.el6ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Put the selected SD in maintenance and then detach it in RHEV-M, selecting host-1 to perform the operation.
2. Log on host-1 and flush the related multipath device with 'multipath -f <device>'; this works as expected.
3. Log on host-2, part of the same cluster, and try to flush the same multipath device:

   multipath -f /dev/dm-11
   /dev/dm-11: map in use

Actual results:
The multipath device is still in use on all remaining nodes (apart from host-1).

Expected results:
The device shouldn't be in use and 'multipath -f' should succeed on all cluster nodes.

Additional info:
The ID of the affected LUN is 360000970000295900156533030444430

[root@host-2 ~]# multipath -ll | grep -A4 360000970000295900156533030444430
360000970000295900156533030444430 dm-11 EMC,SYMMETRIX
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:0:10 sdl  8:176  active ready running
  `- 2:0:0:10 sdaf 65:240 active ready running

[root@host-2 ~]# ls /sys/block/dm-11/holders/
dm-88  dm-89  dm-90  dm-91  dm-92  dm-93

[root@host-2 ~]# dmsetup ls --tree | grep -A3 -B1 360000970000295900156533030444430
b169031a--669c--42fd--8053--5d43da49e3e9-master (253:93)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-outbox (253:92)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-ids (253:90)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-inbox (253:91)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-leases (253:89)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-metadata (253:88)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)

[root@host-2 ~]# vgs -v | grep b169031a-669c-42fd-8053-5d43da49e3e9
(nothing found)

[root@host-2 ~]# pvs -a -v | grep 60000970000295900156533030343436
  /dev/mapper/360000970000295900156533030343436      lvm2 a--  60.01g 60.01g 60.01g UGnCpQ-crwt-jiq6-j1xR-iuaC-wdJV-L9JeiB
(VG field is empty)

After removing the stale devices with:

[root@host-2 ~]# dmsetup remove /dev/dm-88
...
[root@host-2 ~]# dmsetup remove /dev/dm-93

'multipath -f' worked like a charm.
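The manual cleanup above can be sketched as a small script. This is a hedged sketch, not a supported procedure: the list_holders helper is mine (not a standard tool), and it assumes the stale special LVs are the only holders of the map.

```shell
# list_holders SYSFS_DIR
# Print the names of the dm devices that hold the given block device,
# read from its sysfs holders/ directory (e.g. /sys/block/dm-11).
list_holders() {
    for h in "$1"/holders/dm-*; do
        [ -e "$h" ] || continue
        basename "$h"
    done
}

# Usage sketch on an affected host (run as root; the WWID below is the one
# from this report -- substitute your own). Remove each stale holder, then
# flush the multipath map:
#
#   DM=$(basename "$(readlink -f /dev/mapper/360000970000295900156533030444430)")
#   for holder in $(list_holders "/sys/block/$DM"); do
#       dmsetup remove "$holder"
#   done
#   multipath -f 360000970000444430
```

Note the usage part is commented out on purpose: it acts on real device-mapper state and should only be run once you have confirmed the holders really are leftovers of the removed domain.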
Please describe how this affects the usage of the system.
(In reply to Nir Soffer from comment #2)
> Please describe how this affects the usage of the system.

This prevents the admin from easily freeing up LUNs that need to be reused. The obvious workaround would be to put the hosts in maintenance one by one and reboot all of them, but depending on the use conditions this can have a significant impact on a production infrastructure. Also, the workaround I pointed out in c#0 requires some work on each host, and moreover it doesn't appear to be documented anywhere.
Removing the device may be only a temporary fix, since vdsm refreshes the connection to the storage every 5 minutes (or on some events), so the device may appear again later.

Can you try this, evaluating your suggested workaround:

1. Remove a domain using host 1
2. Log on to host 2 and remove the device manually
3. Wait 5 minutes
4. Check if the removed device appears again on host 1 and host 2
(In reply to Nir Soffer from comment #4)
> Removing the device may be only a temporary fix, since vdsm refreshes
> the connection to the storage every 5 minutes (or on some events), so the
> device may appear again later.
>
> Can you try this, evaluating your suggested workaround:
>
> 1. Remove a domain using host 1
> 2. Log on to host 2 and remove the device manually
> 3. Wait 5 minutes
> 4. Check if the removed device appears again on host 1 and host 2

Nir, I cannot test this myself, but my customer's experience is as follows:

- After removing the device, it doesn't reappear automatically without admin intervention.

Also:

- When a new LUN is made visible (presented) to the hosts during the night, the morning after the new device is still not visible, and the admin needs to trigger a "create new SD" operation for each host in order to make it visible.
- My impression is that having vdsm re-add the device automatically every 5 minutes would create a race condition between an admin trying to remove the LUN from several hosts and vdsm re-adding them before they are un-presented.

Does that make sense to you?
(In reply to Luca Villa from comment #5)
> (In reply to Nir Soffer from comment #4)
> - My impression is that having vdsm re-add the device automatically every
> 5 minutes would create a race condition between an admin trying to remove
> the LUN from several hosts and vdsm re-adding them before they are
> un-presented.

Exactly. Since we refresh storage connections regularly, devices removed on a host may appear back on the machine. However, the issue here seems to be that we don't deactivate the LVs after removing a storage domain.

You said:

Steps to Reproduce:
1. Put the selected SD in maintenance and then detach it in RHEV-M, selecting host-1 to perform the operation.

It is not clear what the operation is (remove storage domain?).

Please specify the exact steps you perform in the engine to trigger this issue, so we can reproduce it here.

Also, we need logs from the engine, from the vdsm host performing the operation, and from one of the other hosts.
(In reply to Nir Soffer from comment #6)
> Please specify the exact steps you perform in the engine to trigger this
> issue, so we can reproduce it here.
>
> Also, we need logs from the engine, from the vdsm host performing the
> operation, and from one of the other hosts.

More detailed steps are as follows:

1) From RHEV-M, deactivate the disk of the VM you want to remove
2) From RHEV-M, remove the disk of the VM
3) Put the Storage Domain containing the removed disk in maintenance
4) Detach the Storage Domain
5) Right-click on the unattached Storage Domain --> Remove, and choose node#1
6) Log in on node#1 (selected for the removal above) and run "multipath -f /dev/dm-11" -> OK
7) Log in on node#2 (NOT selected for the removal above) and run "multipath -f /dev/dm-11" -> map in use, KO

I'm putting together all the requested logs and I'll make them available ASAP.
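The "map in use" failure in step 7 can be checked directly from sysfs: if anything still holds the dm device, 'multipath -f' will refuse to flush it. A minimal sketch (the map_in_use helper name is mine, not a standard tool):

```shell
# map_in_use SYSFS_DIR
# Return success (0) if the device's sysfs holders/ directory is non-empty,
# i.e. some dm device (here the domain's special LVs) still holds the map.
map_in_use() {
    # Expand the glob into the positional parameters; if nothing matched,
    # $1 is the literal pattern and the -e test fails.
    set -- "$1"/holders/*
    [ -e "$1" ]
}

# Usage sketch on node#2 (as root), to explain the "map in use" failure:
#
#   if map_in_use /sys/block/dm-11; then
#       echo "dm-11 still held by:" /sys/block/dm-11/holders/*
#   fi
```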
Thanks for the additional data, Luca. Currently we don't need more info.
Testing shows that we do not deactivate the special LVs when moving a host to maintenance. Federico and I think that this is the root cause of this issue. Running 'vgchange -an vgname' after a host is put into maintenance, or after a storage domain is deactivated, should fix this.
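The suggested fix above can be sketched as a tiny helper around the vgchange command from this comment. The --dry-run flag of the helper is hypothetical scaffolding so the command can be previewed without touching real devices; the vgchange invocation itself is the one named above.

```shell
# deactivate_sd_vg VG [--dry-run]
# Deactivate all LVs of the given VG so the special LVs (metadata, ids,
# inbox, outbox, leases, master) release their hold on the multipath map.
# With --dry-run, only print the command that would run.
deactivate_sd_vg() {
    vg="$1"
    if [ "$2" = "--dry-run" ]; then
        echo "vgchange -an $vg"   # preview only
    else
        vgchange -an "$vg"        # deactivate all LVs in the VG
    fi
}

# Preview for the VG seen in comment #0 (drop --dry-run to run as root):
deactivate_sd_vg b169031a-669c-42fd-8053-5d43da49e3e9 --dry-run
# -> vgchange -an b169031a-669c-42fd-8053-5d43da49e3e9
```

After the VG is deactivated, the map should have no holders left and 'multipath -f' should succeed.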
Looks like a duplicate of bug 880738.
(In reply to Nir Soffer from comment #11)
> Looks like a duplicate of bug 880738.

Moving to ON_QA to be verified together.
Eyal, this bug is targeted for 3.5 downstream. Since there is no such build, can we verify it on ovirt-3.5 RC1?

Also, please add 'fixed in version'.
(In reply to Elad from comment #13)
> Eyal, this bug is targeted for 3.5 downstream.
> Since there is no such build, can we verify it on ovirt-3.5 RC1?

Yes please.
Did the customer unmap the LUN from the hosts and then try to flush the device from the hosts using 'multipath -f'?
(In reply to Elad from comment #15)
> Did the customer unmap the LUN from the hosts and then try to flush the
> device from the hosts using 'multipath -f'?

Sorry for the late answer, but I was out of office until today.

No, they didn't un-present the LUs before flushing them with 'multipath -f', so the LUs were physically visible to the hosts all the time.

I hope I interpreted your question correctly. Thanks.
The behavior on vt3.1 is exactly as noted in comment #0.

vdsm-4.16.3-3.el6ev.beta.x86_64
rhevm-3.5.0-0.12.beta.el6ev.noarch

root@camel-vdsc ~ # multipath -f 360060160f4a03000fe65675991dbe311
Sep 15 12:29:11 | 360060160f4a03000fe65675991dbe311: map in use

root@camel-vdsb ~ # multipath -f 360060160f4a03000fe65675991dbe311
root@camel-vdsb ~ #

Both hosts are in the same cluster (using different devices), even though no iSCSI domains currently exist on the setup.

root@camel-vdsb ~ # multipath -ll
360060160f4a030007eeed85291dbe311 dm-2
size=30G features='0' hwhandler='1 emc' wp=rw

root@camel-vdsc ~ # multipath -ll
360060160f4a03000fe65675991dbe311 dm-7
size=30G features='0' hwhandler='1 emc' wp=rw
360060160f4a03000fb65675991dbe311 dm-5
size=30G features='0' hwhandler='1 emc' wp=rw
Nir, please take a look?
I already looked at it, see comment 10.

This will not be an easy fix; given the severity, let's move this to the next version.
(In reply to Nir Soffer from comment #19)
> I already looked at it, see comment 10.
>
> This will not be an easy fix; given the severity, let's move this to the
> next version.

Talked f2f - this requires a major code overhaul.
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015. Please review this bug and, if it is not a blocker, please postpone it to a later release.

All bugs not postponed on the GA release will be automatically re-targeted to:
- 3.6.1 if severity >= high
- 4.0 if severity < high
Based on comment 27, I think we can close this.
Closing based on comment 27