Bug 1119790

Summary: [BLOCKED] Devices not removed on other hosts after removing a Storage Domain on RHEV-M on one host
Product: Red Hat Enterprise Virtualization Manager
Reporter: Luca Villa <luvilla>
Component: vdsm
Assignee: Ala Hino <ahino>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.3.0
CC: acanan, adevolder, ahino, amureini, bazulay, eedri, fsimonce, gklein, kgoldbla, lpeer, luvilla, nsoffer, sbonazzo, scohen, srevivo, tnisan, ylavi
Target Milestone: ovirt-4.1.2
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-05 10:37:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1163890
Bug Blocks:

Description Luca Villa 2014-07-15 14:01:51 UTC
Description of problem:
After removing a multipath FCP-based storage domain in RHEV-M, some hosts still see the device as in use and multipath -f fails to flush the device.
This happens on hosts other than the one selected in RHEV-M for the removal operation.

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.13.el6ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. put the selected SD in maintenance and then detach it in RHEV-M, selecting host-1 to perform the operation.
2. log on host-1 and flush the related multipath device with 

  multipath -f <device> 

  this works as expected
3. log on host-2, part of the same cluster, and try to flush the same multipath device:

  multipath -f /dev/dm-11 
  /dev/dm-11: map in use

Actual results:
The multipath device still appears to be in use on all remaining nodes (apart from host-1).

Expected results:
The device shouldn't be in use and 'multipath -f' should succeed on all cluster nodes.

Additional info:
The ID of the affected LUN is 360000970000295900156533030444430

[root@host-2 ~]# multipath -ll | grep -A4 360000970000295900156533030444430
360000970000295900156533030444430 dm-11 EMC,SYMMETRIX
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:0:10 sdl  8:176  active ready running
  `- 2:0:0:10 sdaf 65:240 active ready running

[root@host-2 ~]# ls /sys/block/dm-11/holders/
dm-88  dm-89  dm-90  dm-91  dm-92  dm-93

[root@host-2 ~]# dmsetup ls --tree | grep -A3 -B1 360000970000295900156533030444430
b169031a--669c--42fd--8053--5d43da49e3e9-master (253:93)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-outbox (253:92)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-ids (253:90)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-inbox (253:91)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-leases (253:89)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)
--
b169031a--669c--42fd--8053--5d43da49e3e9-metadata (253:88)
 `-360000970000295900156533030444430 (253:11)
    |- (65:240)
    `- (8:176)

[root@host-2 ~]# vgs -v |grep b169031a-669c-42fd-8053-5d43da49e3e9

(nothing found)

[root@host-2 ~]# pvs -a -v |grep 60000970000295900156533030343436

  /dev/mapper/360000970000295900156533030343436          lvm2 a--   60.01g  60.01g   60.01g UGnCpQ-crwt-jiq6-j1xR-iuaC-wdJV-L9JeiB

(VG field is empty)

After removing the stale devices with:

[root@host-2 ~]# dmsetup remove /dev/dm-88
...
[root@host-2 ~]# dmsetup remove /dev/dm-93

...'multipath -f' worked like a charm.
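
For reference, a minimal sketch of that manual cleanup as a loop; it assumes dm-11 is the multipath device of the removed domain and that its only holders are the domain's stale LVs, as listed above:

  for holder in /sys/block/dm-11/holders/dm-*; do
      # remove each stale device-mapper node holding the multipath map
      dmsetup remove "/dev/$(basename "$holder")"
  done
  # the map can then be flushed
  multipath -f /dev/dm-11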

Comment 2 Nir Soffer 2014-07-16 07:36:36 UTC
Please describe how this affects the usage of the system.

Comment 3 Luca Villa 2014-07-16 07:46:51 UTC
(In reply to Nir Soffer from comment #2)
> Please describe how this affects the usage of the system.

This prevents the admin from easily freeing up LUNs that have to be reused.
The obvious workaround would be to put the hosts into maintenance one by one and reboot all of them, but, depending on the usage conditions, this can have a significant impact on a production infrastructure.
Also, the workaround I pointed out in comment #0 requires manual work on each host and, moreover, does not appear to be documented anywhere.

Comment 4 Nir Soffer 2014-07-16 07:53:29 UTC
Removing the device may be only a temporary fix, since vdsm refreshes the connection to the storage every 5 minutes (or on certain events), so the device may appear again later.

Can you try the following, to evaluate your suggested workaround:

1. Remove a domain using host 1
2. log to host 2 and remove the device manually
3. wait 5 minutes
4. check if the removed device appears again in host 1 and host 2.
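
A hedged sketch of the checks in steps 3-4, assuming the WWID from comment #0 (device names will differ on other setups):

  # wait roughly one vdsm refresh interval after the manual removal,
  # then check on each host whether the map came back
  sleep 300
  multipath -ll 360000970000295900156533030444430
  dmsetup ls | grep 360000970000295900156533030444430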

Comment 5 Luca Villa 2014-07-16 09:15:02 UTC
(In reply to Nir Soffer from comment #4)
> Removing the device may be only a temporary fix, since vdsm is refreshing
> the connection to the storage every 5 minutes (or in some events), so the
> device may appear again later.
> 
> Can you try this, evaluating your suggested workaround:
> 
> 1. Remove a domain using host 1
> 2. log to host 2 and remove the device manually
> 3. wait 5 minutes
> 4. check if the removed device appears again in host 1 and host 2.

Nir, 
I cannot test this myself but my customer experience is as follows:

- after removing the device it doesn't reappear automatically without admin intervention.

Also:

- when a new LUN is presented to the hosts during the night, the next morning the new device is still not visible, and the admin needs to trigger a "create new SD" operation for each host in order to make it visible.

- also, my impression is that having vdsm re-add the device automatically every 5 minutes would create a race condition between an admin trying to remove the LUN from several hosts and vdsm re-adding it before it is un-presented.

Does it make sense to you?

Comment 6 Nir Soffer 2014-07-16 10:05:09 UTC
(In reply to Luca Villa from comment #5)
> (In reply to Nir Soffer from comment #4)
> - also my impression is that having vdsm that re-adds the device
> automatically every 5 minutes would create a race condition between an admin
> trying to remove the LUN from several hosts and vdsm that re-adds them
> before they are un-presented.

Exactly. Since we refresh storage connections regularly, devices still presented by the server may appear back on the host.

However, the issue here seems to be that we do not deactivate the LVs after removing a storage domain.

You said:

    Steps to Reproduce:
    1. put the selected SD in maintenance and then detach it in RHEV-M,
       selecting host-1 to perform the operation.

It is not clear what the operation is (remove storage domain?).

Please specify the exact steps you perform in the engine to trigger this issue, so we can reproduce it here.

Also, we need logs from the engine, from the vdsm host that performed the operation, and from one of the other hosts.

Comment 7 Luca Villa 2014-07-16 10:38:56 UTC
(In reply to Nir Soffer from comment #6)

> You said:
> 
>     Steps to Reproduce:
>     1. put the selected SD in maintenance and then detach it in RHEV-M,
>        selecting host-1 to perform the operation.
> 
> It is not clear what is the operation (remove storage domain?)
> 
> Please specify exact steps you do in the engine to trigger this issue, so we
> can reproduce this issue here.
> 
> Also, we need logs from engine, vdsm performing the operation, and one of
> the other hosts.

More detailed steps are as follows:

1) From RHEV-M, deactivate the disk of the VM you want to remove
2) From RHEV-M, remove the disk of the VM
3) Put the Storage Domain containing the removed disk into maintenance
4) Detach the Storage Domain
5) Right-click on the unattached Storage Domain --> Remove, and choose node#1
6) Log in on node#1, selected for the removal above, and run "multipath -f /dev/dm-11" -> OK
7) Log in on node#2, NOT selected for the removal above, and run "multipath -f /dev/dm-11" -> "map in use", KO
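
For step 7, a hedged way to see which device-mapper nodes (the removed domain's LVs) still hold the map on node#2; dm-11 is taken from the description above and may differ:

  ls /sys/block/dm-11/holders/
  # resolve each holder to its LV name
  for holder in /sys/block/dm-11/holders/dm-*; do
      dmsetup info -c --noheadings -o name "/dev/$(basename "$holder")"
  done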

I'm putting together all the requested logs and will make them available ASAP.

Comment 9 Nir Soffer 2014-07-24 08:40:46 UTC
Thanks for the additional data, Luca. Currently we don't need more info.

Comment 10 Nir Soffer 2014-07-27 10:53:38 UTC
Testing shows that we do not deactivate the special LVs when moving a host to maintenance. Federico and I think that this is the root cause of this issue.

Running 'vgchange -an <vgname>' after a host is put into maintenance, or after a storage domain is deactivated, should fix this.
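
A minimal sketch of that per-host cleanup, run while the domain's VG is still visible; the VG name equals the storage domain UUID, and the values below are taken from comment #0 for illustration:

  # deactivate every LV of the domain's VG so nothing holds the multipath map
  vgchange -an b169031a-669c-42fd-8053-5d43da49e3e9
  # the map can then be flushed
  multipath -f 360000970000295900156533030444430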

Comment 11 Nir Soffer 2014-08-05 20:35:13 UTC
Looks like a duplicate of bug 880738.

Comment 12 Allon Mureinik 2014-08-13 08:15:45 UTC
(In reply to Nir Soffer from comment #11)
> Looks like a duplicate of bug 880738.
Moving to ON_QA to be verified together.

Comment 13 Elad 2014-08-14 10:48:08 UTC
Eyal, this bug is targeted for 3.5 downstream. 
Since there is no such build, can we verify it on ovirt-3.5 RC1?

Also, please add 'fixed in version'

Comment 14 Allon Mureinik 2014-08-14 10:52:34 UTC
(In reply to Elad from comment #13)
> Eyal, this bug is targeted for 3.5 downstream. 
> Since there is no such build, Can we verify it on ovirt-3.5 RC1?
Yes please.

Comment 15 Elad 2014-08-14 14:34:44 UTC
Did the customer unmap the LUN from the hosts and then try to flush the device using 'multipath -f'?

Comment 16 Luca Villa 2014-09-01 08:07:43 UTC
(In reply to Elad from comment #15)
> Did the customer unmap the LUN from the hosts and then try to flush the
> device using 'multipath -f'?

Sorry for the late answer, but I was out of office till today.
No, they didn't un-present the LUNs before flushing them with 'multipath -f', so the LUNs were physically visible to the hosts the whole time.
I hope I interpreted your question correctly. Thanks.

Comment 17 Ori Gofen 2014-09-15 09:36:37 UTC
The behavior on vt3.1 is exactly as noted in comment #0.

vdsm-4.16.3-3.el6ev.beta.x86_64
rhevm-3.5.0-0.12.beta.el6ev.noarch

root@camel-vdsc ~ # multipath -f 360060160f4a03000fe65675991dbe311
Sep 15 12:29:11 | 360060160f4a03000fe65675991dbe311: map in use

root@camel-vdsb ~ # multipath -f 360060160f4a03000fe65675991dbe311   
root@camel-vdsb ~ # 

Both hosts are in the same cluster and show different devices, even though no iSCSI domain currently exists on the setup.

root@camel-vdsb ~ # multipath -ll                                  
360060160f4a030007eeed85291dbe311 dm-2 
size=30G features='0' hwhandler='1 emc' wp=rw

root@camel-vdsc ~ # multipath -ll
360060160f4a03000fe65675991dbe311 dm-7 
size=30G features='0' hwhandler='1 emc' wp=rw
360060160f4a03000fb65675991dbe311 dm-5 
size=30G features='0' hwhandler='1 emc' wp=rw

Comment 18 Allon Mureinik 2014-09-15 09:43:51 UTC
Nir, please take a look?

Comment 19 Nir Soffer 2014-09-15 11:38:31 UTC
I already looked at it, see comment 10.

This will not be an easy fix. Given the severity, let's move this to the next version.

Comment 20 Allon Mureinik 2014-09-16 15:44:50 UTC
(In reply to Nir Soffer from comment #19)
> I already looked at it, see comment 10.
> 
> This will not be an easy fix. Given the severity, let's move this to the
> next version.
Talked f2f - this requires a major code overhaul.

Comment 22 Sandro Bonazzola 2015-10-26 12:46:32 UTC
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015.
Please review this bug and, if it is not a blocker, postpone it to a later release.
All bugs not postponed by GA will be automatically re-targeted to:

- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 29 Nir Soffer 2017-03-05 09:10:18 UTC
Based on comment 27 I think we can close this.

Comment 30 Tal Nisan 2017-03-05 10:37:39 UTC
Closing based on comment 27.