Bug 2125357 - when running iscsiadm login and a quick logout, the logout didn't run as expected
Summary: when running iscsiadm login and a quick logout, the logout didn't run as expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: device-mapper-multipath
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Ben Marzinski
QA Contact: Lin Li
URL:
Whiteboard:
Depends On: 2110485
Blocks:
 
Reported: 2022-09-08 17:43 UTC by Ben Marzinski
Modified: 2023-05-09 10:14 UTC
CC: 10 users

Fixed In Version: device-mapper-multipath-0.8.7-13.el9
Doc Type: Bug Fix
Doc Text:
Cause: multipathd wasn't respecting the flush_on_last_del and deferred_remove configuration parameters if a multipath device lost all of its paths before it received the uevent for the device being created.
Consequence: If paths were quickly added and removed from a system, the multipath devices could fail to be removed, even if flush_on_last_del and deferred_remove were set.
Fix: If multipathd notices that all the paths have been removed when it is finalizing device setup after receiving the uevent for the multipath device's creation, it now attempts to remove the device, just as if the paths were removed after the device was fully set up.
Result: Multipath devices are correctly removed, even when paths are rapidly added and removed.
Clone Of: 2110485
Environment:
Last Closed: 2023-05-09 08:14:07 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-133574 0 None None None 2022-09-08 17:50:05 UTC
Red Hat Product Errata RHSA-2023:2459 0 None None None 2023-05-09 08:14:30 UTC

Description Ben Marzinski 2022-09-08 17:43:06 UTC
+++ This bug was initially created as a clone of Bug #2110485 +++

Description of problem:
When running iscsiadm login and a quick logout, the logout didn't run as expected.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux release 8.6 (Ootpa)
iscsi-initiator-utils-6.2.1.4-4.git095f59c.el8.x86_64
libiscsi-1.18.0-8.module+el8.6.0+14480+c0a3aa0f.x86_64

How reproducible:
30%

Steps to Reproduce:
1. Make sure the package "device-mapper-multipath" is installed
2. Run: mpathconf --enable
3. Run the command(s): 
lsblk; iscsiadm --mode discoverydb --type sendtargets --portal <iscsi_server_ip> --discover; iscsiadm --mode node --login; iscsiadm --mode session --logout; iscsiadm --mode node -o delete; lsblk

Actual results:
After a few iterations you will see that one LUN didn't log out as expected, e.g.:
NAME                              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                 8:0    0     xG  0 disk  
├─sda1                              8:1    0     1G  0 part  /boot
└─sda2                              8:2    0     yG  0 part  
  ├─vg-root                       253:0    0     zG  0 lvm   /
3600a098xxxxxxxxxxxxxxxxxxxxxxxxx 253:9    0     nG  0 mpath 

This situation causes various commands to get stuck, for example:
1.  blkid /dev/mapper/3600a098xxxxxxxxxxxxxxxxxxxxxxxxx
2.  ansible localhost -m setup
When running automation, it causes the whole flow to get stuck.

Expected results:
The result above should not occur: every LUN should log out and no stale multipath device should remain.

Additional info:

--- Additional comment from Martin Hoyer on 2022-07-25 14:02:35 UTC ---

Unable to reproduce this myself, but looking at the description, is the multipath device part of the LVM volume group for / ?

--- Additional comment from Kobi Hakimi on 2022-07-25 14:42:09 UTC ---

(In reply to Martin Hoyer from comment #1)
> Unable to reproduce this myself, but looking at the description, is the
> multipath device part of the LVM volume group for / ?
It looks like that because I cut out all the subdevices of sda2.
Maybe this is better:

└─sda2                              8:2    0     yG  0 part  
  ├─vg-root                       253:0    0     zG  0 lvm   /
...
...
3600a098xxxxxxxxxxxxxxxxxxxxxxxxx 253:9    0     nG  0 mpath 

As for reproducing it, I did so in a few environments.
Just rerun the long command(s) until you see an lsblk result like the one above.
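
A minimal loop to automate that rerun, as a sketch (the commands are the ones from the reproducer above; <iscsi_server_ip> is a placeholder, and the grep is just a hypothetical check for a leftover multipath device in the lsblk output):

# rerun the reproducer until a stale multipath device is left behind
while true; do
    iscsiadm --mode discoverydb --type sendtargets --portal <iscsi_server_ip> --discover
    iscsiadm --mode node --login
    iscsiadm --mode session --logout
    iscsiadm --mode node -o delete
    lsblk --output NAME,TYPE | grep -w mpath && break   # stale device found, stop
done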

--- Additional comment from Martin Hoyer on 2022-07-26 15:06:41 UTC ---

Thanks for the explanation. I've tried on another machine with a different setup, but still can't reproduce it.
Can you share your /etc/multipath.conf? Is this a vanilla OS, or one with some changes? Are you using multiple network paths to the target?

In any case, this is not necessarily an iscsi-initiator-utils bug.
Ben, is this something you can take a look at?


--- Additional comment from Kobi Hakimi on 2022-07-31 09:00:43 UTC ---

We tried to work around this issue in our automation by running the commands:
1. "multipath -ll" and print it
2. "lsblk" and print it

but we still see this issue from time to time.

Do you have an idea of how to work around it?

--- Additional comment from Ben Marzinski on 2022-08-02 18:31:56 UTC ---

I've been using a system that apparently can reproduce this. I can see multipath devices sitting around after all their paths have been deleted, but I can't get commands to hang. This might come down to what kind of array is being used.  The machine I'm using has NetApp LUNs. The default configuration for these devices sets "flush_on_last_del yes". This means that when the last path to a multipath device is deleted, queueing is disabled on the device, so no IO to that device will hang anymore.
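
To confirm the effective values on a given setup, something like this should work (a sketch; "multipathd show config" prints the merged built-in, default, and /etc/multipath.conf settings):

multipathd show config | grep -E 'flush_on_last_del|deferred_remove'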

When I see the devices left behind, I can see "multipath -l" output like this:

3600a09803830447a4f244c4657595038 dm-28 ##,##
size=50G features='0' hwhandler='0' wp=rw

This means that there is a multipath device without any paths. When the last path is removed from a multipath device, multipathd will try to remove the multipath device. If flush_on_last_del is set, it will first disable queueing so that any stalled IO can fail. However, if the multipath device is in use, then multipathd will fail at removing it, and it will remain behind like you see. As long as queue_if_no_path is not set in the "features" output, all IO to the device will fail. If it is, all IO to the device will queue.

These devices are generally in use, but as the paths get removed, the device gets reloaded with fewer paths. This triggers a uevent, and some udev rules can open the device while they run. If multipathd sees the last path removed while a udev rule that opens the device is still running, it can fail to remove the device. To avoid this issue, you can set "deferred_remove yes" in /etc/multipath.conf. When this option is set and the last path is removed, multipathd will try a deferred remove. If the device cannot be removed immediately, then device-mapper will monitor it in the kernel. When the last opener closes the device, it will be removed. If a path comes back before the last opener closes the device, the deferred remove will be cancelled. You should note that while a deferred remove is waiting on the last opener to close the multipath device, new attempts to open the device will fail.
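
A minimal /etc/multipath.conf fragment enabling this, as a sketch (placing it in the defaults section applies it to all multipath devices; it can also be set per device):

defaults {
    deferred_remove yes
}

After editing the file, the running daemon needs to pick up the change, e.g. with "multipathd reconfigure".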

Does this work around the issue for you?

--- Additional comment from Ben Marzinski on 2022-08-02 18:34:14 UTC ---

Oh, also, when I reproduced the issue, I could see lines like these in the logs:

multipathd[123307]: 3600a09803830447a4f244c4657595038 Last path deleted, disabling queueing
multipathd[123307]: 3600a09803830447a4f244c4657595038: map in use
multipathd[123307]: 3600a09803830447a4f244c4657595038: can't flush 
multipathd[123307]: 3600a09803830447a4f244c4657595038: queueing disabled
multipathd[123307]: 3600a09803830447a4f244c4657595038: load table [0 104857600 multipath 0 0 0 0]

This shows multipathd disabling queueing, trying to remove the device but failing, and finally reloading the device with no paths.
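
To watch for these messages while reproducing, something like the following works (a sketch; it assumes multipathd logs to the systemd journal, as it does by default on RHEL 8 and 9):

journalctl -f -u multipathd | grep -E 'Last path deleted|map in use|can.t flush'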

--- Additional comment from Kobi Hakimi on 2022-08-02 20:25:32 UTC ---

(In reply to Ben Marzinski from comment #6)
> You should note that while a deferred remove is waiting on the last opener
> to close the multipath device, new attempts to open the device will fail.
>
> Does this work around the issue for you?

I am afraid that it will not be good enough for us,
because our automation does discovery/login/logout many times.
As you said, "we can fail in the next attempt to open the device",
so IMHO that would just move the issue to the next step.
Thanks Ben!!

--- Additional comment from Ben Marzinski on 2022-08-02 21:47:11 UTC ---

(In reply to Kobi Hakimi from comment #8)
> (In reply to Ben Marzinski from comment #6)
> > You should note that while a deferred remove is waiting on the
> > last opener to close the multipath device, new attempts to open the device
> > will fail.
> > 
> > Does this work around the issue for you?
> 
> I am afraid that it will not be good enough for us,
> because our automation does discovery/login/logout many times.
> As you said, "we can fail in the next attempt to open the device",
> so IMHO that would just move the issue to the next step.
> Thanks Ben!!

On the machine that you loaned me, do you still see an issue with "deferred_remove yes" added to /etc/multipath.conf? I'm not sure how this would interfere with your discovery/login/logout loop. When you logout, if the multipath device is being held open, it will remain until it gets closed. Either you will log back in before the device is removed, in which case the deferred remove will be cancelled, or whatever is keeping the multipath device open (probably udev from the last reload) will close it before you login again, and the multipath device will get removed.

The way that refusing to open the device could cause problems would be if you intentionally had the multipath device open while you logged out of the iscsi session, actually wanted to open the multipath device for something new while you were still logged out, and had flush_on_last_del turned off, so that IO to the multipath device hung instead of failing. In this case, opening the device and attempting to use it would hang until a new path appeared, and then would continue to work. If flush_on_last_del is turned on, like it is on the setup you loaned me, trying to use the device would fail anyway, most likely causing the device to get closed again. But you aren't trying to use the multipath devices when you've logged out of the iscsi session.

with "deferred_remove yes", if you check the multipath devices immediately after logging out, you might still see a multipath with no paths, if you check before whatever has the device has closed it, but assuming that the thing is udev, it should be cleared up within a second.

--- Additional comment from Kobi Hakimi on 2022-08-02 21:53:13 UTC ---

(In reply to Ben Marzinski from comment #9)
> On the machine that you loaned me, do you still see an issue with
> "deferred_remove yes" added to /etc/multipath.conf?

Even after you set "deferred_remove yes", I still reproduced the same issue manually.

--- Additional comment from Ben Marzinski on 2022-08-04 20:06:25 UTC ---

Got it. The issue is that the multipath device is still being set up when the paths get removed. The LVM udev rules require that the device is not suspended when they run for the first time. This can be a problem for multipath devices, since a new path can appear immediately after the device is created, causing it to suspend to update its table. To deal with this, when multipathd creates a new multipath device, it won't add any new paths until it receives the uevent for the multipath device being added. Once it gets that uevent, it will reload the device with all the paths.

What's happening here is that between when multipathd creates the device and when it gets the uevent for that creation, all the paths are deleted. When the uevent for the device's creation comes in, multipathd reloads the device, but there are no longer any paths, so it loads the device without any. This is a different code path than the one multipathd goes through when it is processing the remove uevents for the paths. When it is reloading the device to do the work it delayed, it needs to deal with the case where there are no paths just like how it deals with having the paths removed, i.e. it should disable queueing if configured to, and it should attempt to remove the device (including setting up a deferred remove). None of this currently happens.

So you can work around this by not immediately deleting all your paths after creating them. Putting a small wait between the login and logout should avoid this. For instance, this avoided the issue for me:

# iscsiadm --mode discoverydb --type sendtargets --portal <iscsi_server_ip> --discover; iscsiadm --mode node --login; sleep 5; iscsiadm --mode session --logout; iscsiadm --mode node -o delete; lsblk

I'll work on getting this fixed so that the sleep is unnecessary.

--- Additional comment from Martin Hoyer on 2022-08-09 09:31:44 UTC ---

Thank You Ben!
From your comment, it sounds like this is not related to iscsi-initiator-utils, but rather udev or dm-multipath. Would you be ok with changing the component field?

--- Additional comment from Ben Marzinski on 2022-08-09 15:35:01 UTC ---

(In reply to Martin Hoyer from comment #12)
> Thank You Ben!
> From your comment, it sounds like this is not related to
> iscsi-initiator-utils, but rather udev or dm-multipath. Would you be ok with
> changing the component field?

Oops. I meant to change that when I took the bug. Done.

--- Additional comment from Ben Marzinski on 2022-08-10 19:58:58 UTC ---

Test packages that should fix this issue are available here:

https://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL8/2110485/

Kobi, are you able to verify that these packages resolve the issue?

--- Additional comment from Ben Marzinski on 2022-08-17 23:51:07 UTC ---

RHEL-8.6 test packages are available here (the other ones were built against rhel-8.7):

https://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL8/2110485/rhel-8.6/

Comment 6 errata-xmlrpc 2023-05-09 08:14:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: device-mapper-multipath security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2459

