Bug 631009
Summary: | removal of all SCSI devices related to an unmapped LUN doesn't remove the multipath device mapping | | | |
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Yoni Tsafir <tsafir> | |
Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> | |
Status: | CLOSED ERRATA | QA Contact: | Lin Li <lilin> | |
Severity: | medium | Docs Contact: | Steven J. Levine <slevine> | |
Priority: | low | |||
Version: | 7.0 | CC: | agk, batkisso, bdonahue, bmarzins, christophe.varoqui, dmoessne, dwysocha, egoggin, fge, heinzm, iheim, jbrassow, junichi.nomura, kueda, lilin, lilu, lmb, msnitzer, nobody, pavel, pep, prajnoha, prockai, pzhukov, soc, tlavigne, tranlan, tvvcox, yanwang | |
Target Milestone: | rc | Keywords: | Triaged | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | device-mapper-multipath-0.4.9-78.el7 | Doc Type: | Enhancement | |
Doc Text: |
The "deferred_remove" option has been added to the multipath.conf file. When set to "yes", the multipathd service performs a deferred remove operation when deleting the last path device; the last device is removed after the user closes the device. The default "deferred_remove" value is "no".
|
Story Points: | --- | |
Clone Of: | ||||
: | 1257704 (view as bug list) | Environment: | ||
Last Closed: | 2015-11-19 12:56:01 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 620148, 645519, 730389, 756082, 952099, 1113511, 1133060, 1205790, 1257704 |
Description
Yoni Tsafir
2010-09-07 15:26:16 UTC
Can you give me more information about this? The only thing that has the multipath device open is blkid, correct? Why is udev calling blkid on the multipath device when a SCSI device is getting removed? Were these SCSI devices not working to start with? Is this blkid perhaps hung from the time when this multipath device was first created? Can you please attach the results of running

```
# multipath -ll
```

both before and after removing the SCSI devices?

Hi Ben, terribly sorry for the delay in answering this one. I have no idea why udev is calling blkid on the device. These SCSI devices were working, and blkid isn't hung from the time when the device was first created, because it doesn't run until I perform the operation described above (deleting all relevant /dev/sgXX devices).

###### Mapped a new volume ######

```
[root@rhel6Beta ~]# lsof /dev/mapper/mpatha    # blkid isn't running just after mapping the volume
[root@rhel6Beta ~]# multipath -ll
mpathb (20017380000161f91) dm-0 IBM,2810XIV
size=144G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:1:1 sdb 8:16  active ready running
  |- 1:0:2:1 sdc 8:32  active ready running
  |- 1:0:3:1 sdd 8:48  active ready running
  |- 2:0:1:1 sde 8:64  active ready running
  |- 2:0:2:1 sdf 8:80  active ready running
  `- 2:0:3:1 sdg 8:96  active ready running
mpatha (20017380000163dcd) dm-6 IBM,2810XIV
size=16G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:3:2 sdj 8:144 active ready running
  |- 1:0:1:2 sdh 8:112 active ready running
  |- 1:0:2:2 sdi 8:128 active ready running
  |- 2:0:3:2 sdm 8:192 active ready running
  |- 2:0:2:2 sdl 8:176 active ready running
  `- 2:0:1:2 sdk 8:160 active ready running
```

###### Un-mapped the volume ######

```
[root@rhel6Beta ~]# lsof /dev/mapper/mpatha    # blkid still isn't running
[root@rhel6Beta ~]# multipath -ll
mpathb (20017380000161f91) dm-0 IBM,2810XIV
size=144G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:1:1 sdb 8:16  active ready running
  |- 1:0:2:1 sdc 8:32  active ready running
  |- 1:0:3:1 sdd 8:48  active ready running
  |- 2:0:1:1 sde 8:64  active ready running
  |- 2:0:2:1 sdf 8:80  active ready running
  `- 2:0:3:1 sdg 8:96  active ready running
mpatha (20017380000163dcd) dm-6 IBM,2810XIV
size=16G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 1:0:3:4 sdj 8:144 active faulty running
  |- 1:0:1:4 sdh 8:112 active faulty running
  |- 1:0:2:4 sdi 8:128 active faulty running
  |- 2:0:3:4 sdm 8:192 active faulty running
  |- 2:0:2:4 sdl 8:176 active faulty running
  `- 2:0:1:4 sdk 8:160 active faulty running
[root@rhel6Beta ~]# xiv_fc_admin -R    # rescan; this deletes the faulty /dev/sgXX devices as described above
[root@rhel6Beta ~]# lsof /dev/mapper/mpatha
COMMAND   PID USER FD TYPE DEVICE    SIZE/OFF   NODE NAME
blkid   18308 root  3r  BLK  253,6 0x3ffff0000 107910 /dev/mapper/../dm-6
[root@rhel6Beta ~]# multipath -ll
mpathb (20017380000161f91) dm-0 IBM,2810XIV
size=144G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:1:1 sdb 8:16  active ready running
  |- 1:0:2:1 sdc 8:32  active ready running
  |- 1:0:3:1 sdd 8:48  active ready running
  |- 2:0:1:1 sde 8:64  active ready running
  |- 2:0:2:1 sdf 8:80  active ready running
  `- 2:0:3:1 sdg 8:96  active ready running
mpatha (20017380000163dcd) dm-6 ,
size=16G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- #:#:#:# -   #:#   failed faulty running
```

Hope this helps. Sorry about the LUN change (from 2 to 4) in the middle; the output is mixed from different runs, but I assure you that apart from that the results are the same.

My best guess at what's happening is this: when the SCSI devices are deleted, multipath needs to reload the device without that path.
When this happens, a change uevent is triggered, which causes the 13-dm-disk.rules udev rule to call blkid. When you are deleting all the SCSI devices:

1. multipathd receives a remove uevent for one of the SCSI devices that make up a multipath device.
2. multipathd removes the SCSI device and reloads the multipath device without it.
3. Reloading the multipath device causes a change uevent to be sent for it.
4. 13-dm-disk.rules calls blkid on the multipath device that was reloaded, which hangs because there are no working paths and the multipath device is currently set to queue_if_no_path.
5. multipathd receives a remove uevent for the last SCSI device that makes up a multipath device; however, the device cannot be removed, since blkid has it open.

In your case:

6. The no_path_retry timeout expires, multipathd fails the blkid IO and closes the device. However, the uevent that should have removed the device has come and gone.

For devices that don't set a timeout for no_path_retry, the IO will never be failed, and without manual intervention blkid will never complete. We need to try to avoid that blkid call. Also, I believe that udev sends out unmount messages. It would be nice for multipathd to remember when it failed to remove the device, so that on unmount it can try again. This wouldn't help your specific issue, since blkid doesn't have the device mounted, but when all paths are lost and the device is mounted, it would be nice if multipathd cleaned it up on unmount. Possibly, though, this should be handled in the kernel so we can catch all closes, not just the unmounts.

Hi Ben,

What you said makes sense; however, we don't have the 13-dm-disk.rules file you talked about, and we couldn't find any other udev rule that calls blkid. So where do you think that blkid call is coming from? In general, when are you planning to resolve this issue? Until then, is there a workaround we can use? Thanks!
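The remove/change uevent sequence described in steps 1-5 can be watched live while the paths are being deleted. This is an illustrative transcript, assuming a root shell on the affected host; the event lines are abbreviated and the device paths follow the session above:

```
[root@rhel6Beta ~]# udevadm monitor --kernel --udev
KERNEL[...] remove /devices/.../1:0:1:2/block/sdh (block)      # path device goes away
KERNEL[...] change /devices/virtual/block/dm-6 (block)         # multipathd reloads the map
UDEV  [...] change /devices/virtual/block/dm-6 (block)         # udev disk rules (blkid) run here
```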
(In reply to comment #6)
> Hi Ben,
>
> What you said makes sense, however we don't have the 13-dm-disk.rules file you
> talked about and we couldn't find any other udev rule that calls blkid.
> So where do you think that blkid call is coming from?

Huh? You don't have /lib/udev/rules.d/13-dm-disk.rules? What release are you using?

> In general, when are you guys planning to resolve this issue?
> Until then - is there a workaround we can do?

Ideally, multipath would have some mechanism for processes to request that their IO not be queued, even if the device was set to queue_if_no_path. Also ideally, multipath would be able to remove devices on the last close if the device had no paths. The second one might be possible in the near term, but I'm doubtful. The first one seems pretty unlikely to happen at all, unless there turns out to be an easy way to co-opt something like O_NONBLOCK to do this, but I don't think so.

A shorter-term solution would be to make sure that blkid isn't called on the device in this case. Another solution would be for multipathd to occasionally check devices with no paths to see if they can be deleted. This would also handle the case where the device was intentionally open when all the paths were lost. However, it wouldn't help in the blkid case if the device was set to queue forever when there are no paths, and there would still be the lag waiting for blkid to fail in your case. Possibly the best answer is to do both: keep blkid from running on change events where we are simply removing paths, and make multipathd occasionally check devices with no paths to see if they can be removed.

As for a workaround: disable queue_if_no_path in /etc/multipath.conf by setting

```
no_path_retry fail
```

in your devices section, or if you don't have one, in your defaults section. This should work around the problem, although there is a race, so it's possible that blkid won't have closed the device by the time the last path is removed.
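As a minimal /etc/multipath.conf sketch of that workaround (section placement as suggested above; a matching entry in a devices section would override this per array type):

```
defaults {
    # Disable queue_if_no_path behavior: fail IO immediately when no
    # paths remain, so a hung blkid can't keep the device open forever.
    no_path_retry fail
}
```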
However, this workaround will fail IOs whenever all of your paths are down. You can avoid this by instead setting

```
flush_on_last_del yes
```

in your defaults section. This will turn off queueing when the last path is deleted. It won't guarantee that the multipath device will be deleted, but the race will be much tighter with this method. Also, in RHEL 6 the SCSI layer will automatically delete devices that have been failed for dev_loss_tmo seconds. To make sure that you don't lose queueing while all your paths are down, you can set

```
dev_loss_tmo 60
fast_io_fail_tmo 5
```

in your defaults section. This will cause the SCSI layer to return IO from failed paths after 5 seconds, and remove the device after 60 seconds. This means that flush_on_last_del will fire after the last path has been down for 60 seconds. Since it appears that your setup is already stopping queueing after 30 seconds, this shouldn't cause any change in how quickly your paths fail back the queued IO. Let me know if either of these helps.

*** Bug 649508 has been marked as a duplicate of this bug. ***

> Huh? you don't have
>
> /lib/udev/rules.d/13-dm-disk.rules
>
> what release are you using?

Oops, turns out I do have it; I looked in the wrong place.

> no_path_retry fail
> flush_on_last_del yes
> dev_loss_tmo 60
> fast_io_fail_tmo 5
>
> Let me know if either of these helps.

OK, so when setting all four of these, everything works fine. But when setting only the bottom three, leaving 'no_path_retry 5', the same problem still happens, which as you said means a change in behavior when there are no paths available, and we don't want that... Any suggestions?

> But when setting only the bottom three, leaving 'no_path_retry 5', the same
> problem still happens, which as you said means change in behavior when there
> are no paths available, and we don't want that...
>
> Any suggestions?
In that case, until I write some code to make multipathd automatically prune these devices every so often, you'll need to manually run

```
# multipath -f
```

after the device has stopped queuing. This bug is scheduled to be fixed in 6.1.
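For example, using the mpatha map from the transcript earlier (the device name is illustrative; substitute your own map):

```
[root@rhel6Beta ~]# multipath -ll mpatha    # confirm the map has no usable paths left
[root@rhel6Beta ~]# multipath -f mpatha     # flush the stale map by hand
```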
Does the blkid call actually change anything, or does it just refresh information it already cached with identical information? (Flags can be set on a reload to control which subset of udev rules is run.) Peter? And should blkid be changed to check that a device is accessible before trying to read from it?

(In reply to comment #11)
> Does the blkid call actually change anything, or just refresh information it
> already cached with identical information? (Flags can be set on a reload to
> control which subset of udev rules is run.)

blkid is just called on every change uevent unless it is flagged out explicitly by DM_UDEV_DISABLE_DISK_RULES_FLAG. I don't think there's any more information that blkid can acquire in this particular situation (if blkid hangs, other tools can't make any changes either, so blkid would not gain any benefit from that scan, I think). Then it's about identifying such a situation directly in the rules somehow (as we already catch a few situations in 10-dm.rules), or, if possible, setting the DM_UDEV_DISABLE_DISK_RULES_FLAG flag through libdevmapper on the device reload which generates the uevent, as you mention (so identifying that the last usable path has just been removed).

Multipath now sets DM_UDEV_DISABLE_DISK_RULES_FLAG when it reloads the table after a path has been deleted. This keeps blkid from firing at all in these cases. There are still cases that could benefit from multipathd occasionally trying to remove devices that have no paths (or possibly monitoring closes with inotify), but that work isn't happening for 6.1.

The fix for this bug caused a regression (Bug 677937). The problem is that if a multipath or kpartx device was created in the initramfs, the udev disk rules need to be run again after the actual root device is mounted, to set up all the symlinks. However, since the devices have already been created, this fix always sets DM_UDEV_DISABLE_DISK_RULES_FLAG when it reloads them.
To solve this, I need an option for kpartx and multipath that lets them override this behavior. This option will be used by rc.sysinit when it calls multipath and kpartx.

Any news about this? I see RHEL 6.1 is out and this wasn't fixed yet...

multipath and kpartx now have a -u option that will force the udev dm-disk rules to be run for reloads of existing devices.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Whenever a multipath device table was reloaded, udev would regather information about the device with blkid. If the device had no usable paths and was set to queue IO, this would cause blkid to hang forever, keeping the device open. Reloading an already existing multipath device no longer triggers these udev rules, so blkid no longer keeps failed devices open.

Even with the -u option, the fix for this caused yet another regression. Apparently, if blkid doesn't run every time the device gets a change event, that information is removed from the udev database. This causes utilities that rely on the udev database for information about the device to not work correctly. So I'm backing this fix out. It should be possible to have 13-dm-disk.rules fix this by calling IMPORT{db} to repopulate the udev database when a change event comes in with DM_UDEV_DISABLE_DISK_RULES_FLAG set.

The occurrence of this issue has been greatly reduced. Fixing it completely involves having the waiter daemon occasionally check devices with no paths to see if they can be removed. This is work that should get done in RHEL 7 first, and then possibly backported to RHEL 6. This will actually get handled using the new DM_DEFERRED_REMOVE flag.

http://www.redhat.com/archives/dm-devel/2013-September/msg00074.html

This request was resolved in Red Hat Enterprise Linux 7.0.
Contact your manager or support representative in case you have further questions about the request.

The comment above is incorrect. The correct version is below. I'm sorry for any inconvenience.

---------------------------------------------------------------

This request was NOT resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you need to escalate this bug.

Fixed this using the new DM_DEFERRED_REMOVE flag. When deferred_remove is set, multipathd will now use deferred removes when the last path is deleted. If a new path is added before the deferred remove completes, it is cancelled.

Hi. I had the same case before. You need to restart the multipath service. After restarting it, you should see the multipath disk device/LUN information removed from the "multipath -ll" output. Then you should be able to run "blkid" successfully without the system hanging.

I want to correct my message in comment #36. Because I couldn't find a way to edit comment #36, I am writing this as a new comment. If you UNMAP LUNs/disk devices, you should clean up afterwards with "dmsetup remove <device path>":

```
# service multipathd stop
# dmsetup remove /dev/mapper/mpathep1
# dmsetup remove /dev/mapper/mpathep3
# dmsetup remove /dev/mapper/mpathep2
# dmsetup remove /dev/mapper/mpathe
# service multipathd restart
```

That's all.

Change to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2132.html
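The deferred_remove option introduced by the fix can be enabled with a minimal /etc/multipath.conf fragment. This is a sketch based on the Doc Text above (the default value is "no"):

```
defaults {
    # Perform a deferred remove when deleting the last path device;
    # the multipath device is then removed once its last opener closes it.
    deferred_remove yes
}
```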