1544958 – multipathd reconfigure times out with large number of luns

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1544958 - multipathd reconfigure times out with large number of luns

Summary: multipathd reconfigure times out with large number of luns

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	device-mapper-multipath
Sub Component:
Version:	7.4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Ben Marzinski
QA Contact:	Lin Li
Docs Contact:	Marek Suchánek
URL:
Whiteboard:
Depends On:
Blocks:	1546181
TreeView+	depends on / blocked

Reported:	2018-02-13 21:05 UTC by John Pittman
Modified:	2021-09-03 12:04 UTC (History)
CC List:	10 users (show)
Fixed In Version:	device-mapper-multipath-0.4.9-120.el7
Doc Type:	Release Note
Doc Text:	DM Multipath is much less likely to output an incorrect timeout error Previously, Device Mapper Multipath (DM Multipath) printed an error message after reconfiguring devices for more than 10 seconds. The error was displayed even if the reconfigure was progressing successfully. As a consequence, it sometimes seemed that the reconfigure failed when reconfiguring a large number of devices. With this update, the timeout limit has been increased from 10 to 60 seconds, and DM Multipath is now much less likely to print incorrect timeout errors when reconfiguring a large number of devices.
Clone Of:
Environment:
Last Closed:	2018-10-30 11:27:28 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:3236	0	None	None	None	2018-10-30 11:28:35 UTC

Description John Pittman 2018-02-13 21:05:34 UTC

Description of problem:

multipathd will time out with a large number of luns.  If the command takes over 10 seconds to complete, a socket error will be received, "error receiving packet".

Version-Release number of selected component (if applicable):

device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64
device-mapper-multipath-0.4.9-111.el7_4.2.x86_64
device-mapper-1.02.140-8.el7.x86_64
kernel-3.10.0-693.17.1.el7.x86_64

How reproducible:

[root@localhost ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=500
[root@localhost ~]# multipathd -k"reconfigure"
error receiving packet

Actual results:

Timeout

Expected results:

Command should not timeout if still working.

Comment 2 John Pittman 2018-02-13 21:16:47 UTC

I noticed in /usr/lib/systemd/system/multipathd.service, ExecReload is '/sbin/multipathd reconfigure'.

The reload command completes successfully, is this a workaround?  Or failing on the backend where we can't see?

[root@localhost multipath-tools]# multipathd -k"reconfigure"
error receiving packet
[root@localhost multipath-tools]# multipathd reconfigure
error receiving packet
[root@localhost multipath-tools]# systemctl reload multipathd.service
[root@localhost multipath-tools]# echo $?
0

Comment 3 Jon Magrini 2018-02-13 21:29:14 UTC

The issue seems to be with poll() timing out as we can see from a strace.  The timeout occurs with either multipathd -k reconfigured or /sbin/multipathd/reconfigure:

31183 16:23:56.966192 socket(PF_LOCAL, SOCK_STREAM, 0) = 3 <0.001124>
31183 16:23:56.967338 connect(3, {sa_family=AF_LOCAL, sun_path=@"/org/kernel/linux/storage/multipathd"}, 39) = 0 <0.000713>
31183 16:23:56.968079 rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0 <0.000007>
31183 16:23:56.968112 write(3, "\16\0\0\0\0\0\0\0", 8) = 8 <0.000010>
31183 16:23:56.968136 write(3, "reconfigure \n\0", 14) = 14 <0.000008>
31183 16:23:56.968156 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000007>
31183 16:23:56.968179 poll([{fd=3, events=POLLIN}], 1, 10000) = 0 (Timeout) <10.010081>

The threshold seems to be 10s: 

# time multipathd -kreconfigure
ok

real	0m9.512s
user	0m0.000s
sys	0m0.003s

# time multipathd -kreconfigure
error receiving packet

real	0m10.006s
user	0m0.001s
sys	0m0.003s

No Drops and the receive buffer seems large enough: 

sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
ffff88ff23a6b000 15  3322   00000002 0        0        0 2        0        30769
UNCONN     0      0      uevent:multipathd/3322           *
         skmem:(r0,rb268435456,t0,tb212992,f0,w0,o96,bl0)
--
UNCONN     0      0      uevent:multipathd/3322           *
         skmem:(r0,rb268435456,t0,tb212992,f0,w0,o96,bl0)

Comment 4 John Pittman 2018-02-14 14:37:10 UTC

Have not been able to reproduce this issue at F27 levels, the command returns much quicker.

device-mapper-multipath-0.7.1-9.git847cc43.fc27
device-mapper-1.02.144-1.fc27
kernel-4.14.18-300.fc27

$ multipath -ll | grep -c dm-
1003
$ time multipathd -k"reconfigure"
ok

real   0m0.088s
user   0m0.004s
sys    0m0.004s

Comment 6 Ben Marzinski 2018-02-14 17:16:37 UTC

f27 does reconfiguration completely differently. That command returns as soon as multipathd starts doing the reconfiguration, instead of when it finishes. Multipathd continues to use the old configuration until reconfiguration completes, and then it switches over. So this shouldn't be an issue in f27 and beyond. For rhel7, this is probably simply a matter of increasing the timeout. Reconfiguration will complete at any rate, the multipathd command client simply times out waiting for a reply from multipathd. I don't think this is exception or blocker worthy (which means it will miss rhel-7.5), since the reconfiguration does happen, and most users use systemctl to reconfigure multipathd anyway, and won't even notice an issue.

Comment 7 John Pittman 2018-02-14 17:24:01 UTC

Thanks a lot Ben, that's great news that the socket error does not mean a failed command.  I saw that a uxsock_timeout keyword is implemented in later versions, 96e8b90227ce (Implement 'uxsock_timeout' keyword).  In theory, would this parameter, if set, have been a workaround for the error?

Comment 8 Ben Marzinski 2018-02-14 18:05:08 UTC

Yes. uxsock_timeout controls how long multipathd will wait for a message from a command client, and how long the client will wait for a reply from multipathd. So increasing that would have solved the issue.

Comment 9 John Pittman 2018-02-14 18:33:17 UTC

Thanks again Ben.  Decreasing sev/prio.

Comment 11 Ben Marzinski 2018-06-13 21:14:23 UTC

I've increased the timeout from 10 seconds to 60 seconds.

Comment 17 Lin Li 2018-08-01 17:47:17 UTC

Reproduced on device-mapper-multipath-0.4.9-111
1, [root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-0.4.9-111.el7.x86_64
device-mapper-multipath-libs-0.4.9-111.el7.x86_64

2,[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=500
[root@storageqe-06 ~]# multipathd -k"reconfigure"
ok

3,[root@storageqe-06 ~]# modprobe -r scsi_debug
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=600
[root@storageqe-06 ~]# time multipathd -k"reconfigure"
ok

real	0m2.549s
user	0m0.001s
sys	0m0.003s

4,[root@storageqe-06 ~]# modprobe -r scsi_debug
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=700
[root@storageqe-06 ~]# time multipathd -k"reconfigure"
error receiving packet  <----------------------------

real	0m10.014s
user	0m0.002s
sys	0m0.002s


Verified on device-mapper-multipath-0.4.9-121.el7
1,[root@storageqe-06 ~]# rpm -qa | grep multipath 
device-mapper-multipath-0.4.9-121.el7.x86_64
device-mapper-multipath-libs-0.4.9-121.el7.x86_64
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=700

2,[root@storageqe-06 ~]# time multipathd -k"reconfigure"
ok

real	0m13.626s
user	0m0.001s
sys	0m0.003s

3,[root@storageqe-06 ~]# modprobe -r scsi_debug
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=800
[root@storageqe-06 ~]# time multipathd -k"reconfigure"
ok

real	0m0.130s
user	0m0.002s
sys	0m0.002s



Test result: Both attempts were successful over 10s threshold

Comment 20 errata-xmlrpc 2018-10-30 11:27:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3236

Note You need to log in before you can comment on or make changes to this bug.