Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1544958 - multipathd reconfigure times out with large number of luns
multipathd reconfigure times out with large number of luns
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: device-mapper-multipath (Show other bugs)
7.4
All Linux
medium Severity medium
: rc
: ---
Assigned To: Ben Marzinski
Lin Li
Marek Suchanek
:
Depends On:
Blocks: 1546181
  Show dependency treegraph
 
Reported: 2018-02-13 16:05 EST by John Pittman
Modified: 2018-10-30 07:28 EDT (History)
10 users (show)

See Also:
Fixed In Version: device-mapper-multipath-0.4.9-120.el7
Doc Type: Release Note
Doc Text:
DM Multipath is much less likely to output an incorrect timeout error Previously, Device Mapper Multipath (DM Multipath) printed an error message after reconfiguring devices for more than 10 seconds. The error was displayed even if the reconfigure was progressing successfully. As a consequence, it sometimes seemed that the reconfigure failed when reconfiguring a large number of devices. With this update, the timeout limit has been increased from 10 to 60 seconds, and DM Multipath is now much less likely to print incorrect timeout errors when reconfiguring a large number of devices.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-30 07:27:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3236 None None None 2018-10-30 07:28 EDT

  None (edit)
Description John Pittman 2018-02-13 16:05:34 EST
Description of problem:

multipathd will time out with a large number of luns.  If the command takes over 10 seconds to complete, a socket error will be received, "error receiving packet".

Version-Release number of selected component (if applicable):

device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64
device-mapper-multipath-0.4.9-111.el7_4.2.x86_64
device-mapper-1.02.140-8.el7.x86_64
kernel-3.10.0-693.17.1.el7.x86_64

How reproducible:

[root@localhost ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=500
[root@localhost ~]# multipathd -k"reconfigure"
error receiving packet

Actual results:

Timeout

Expected results:

Command should not timeout if still working.
Comment 2 John Pittman 2018-02-13 16:16:47 EST
I noticed in /usr/lib/systemd/system/multipathd.service, ExecReload is '/sbin/multipathd reconfigure'.

The reload command completes successfully, is this a workaround?  Or failing on the backend where we can't see?

[root@localhost multipath-tools]# multipathd -k"reconfigure"
error receiving packet
[root@localhost multipath-tools]# multipathd reconfigure
error receiving packet
[root@localhost multipath-tools]# systemctl reload multipathd.service
[root@localhost multipath-tools]# echo $?
0
Comment 3 Jon Magrini 2018-02-13 16:29:14 EST
The issue seems to be with poll() timing out as we can see from a strace.  The timeout occurs with either multipathd -k reconfigured or /sbin/multipathd/reconfigure:

31183 16:23:56.966192 socket(PF_LOCAL, SOCK_STREAM, 0) = 3 <0.001124>
31183 16:23:56.967338 connect(3, {sa_family=AF_LOCAL, sun_path=@"/org/kernel/linux/storage/multipathd"}, 39) = 0 <0.000713>
31183 16:23:56.968079 rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0 <0.000007>
31183 16:23:56.968112 write(3, "\16\0\0\0\0\0\0\0", 8) = 8 <0.000010>
31183 16:23:56.968136 write(3, "reconfigure \n\0", 14) = 14 <0.000008>
31183 16:23:56.968156 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000007>
31183 16:23:56.968179 poll([{fd=3, events=POLLIN}], 1, 10000) = 0 (Timeout) <10.010081>

The threshold seems to be 10s: 

# time multipathd -kreconfigure
ok

real	0m9.512s
user	0m0.000s
sys	0m0.003s

# time multipathd -kreconfigure
error receiving packet

real	0m10.006s
user	0m0.001s
sys	0m0.003s

No Drops and the receive buffer seems large enough: 

sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
ffff88ff23a6b000 15  3322   00000002 0        0        0 2        0        30769
UNCONN     0      0      uevent:multipathd/3322           *
         skmem:(r0,rb268435456,t0,tb212992,f0,w0,o96,bl0)
--
UNCONN     0      0      uevent:multipathd/3322           *
         skmem:(r0,rb268435456,t0,tb212992,f0,w0,o96,bl0)
Comment 4 John Pittman 2018-02-14 09:37:10 EST
Have not been able to reproduce this issue at F27 levels, the command returns much quicker.

device-mapper-multipath-0.7.1-9.git847cc43.fc27
device-mapper-1.02.144-1.fc27
kernel-4.14.18-300.fc27

$ multipath -ll | grep -c dm-
1003
$ time multipathd -k"reconfigure"
ok

real   0m0.088s
user   0m0.004s
sys    0m0.004s
Comment 6 Ben Marzinski 2018-02-14 12:16:37 EST
f27 does reconfiguration completely differently. That command returns as soon as multipathd starts doing the reconfiguration, instead of when it finishes. Multipathd continues to use the old configuration until reconfiguration completes, and then it switches over. So this shouldn't be an issue in f27 and beyond. For rhel7, this is probably simply a matter of increasing the timeout. Reconfiguration will complete at any rate, the multipathd command client simply times out waiting for a reply from multipathd. I don't think this is exception or blocker worthy (which means it will miss rhel-7.5), since the reconfiguration does happen, and most users use systemctl to reconfigure multipathd anyway, and won't even notice an issue.
Comment 7 John Pittman 2018-02-14 12:24:01 EST
Thanks a lot Ben, that's great news that the socket error does not mean a failed command.  I saw that a uxsock_timeout keyword is implemented in later versions, 96e8b90227ce (Implement 'uxsock_timeout' keyword).  In theory, would this parameter, if set, have been a workaround for the error?
Comment 8 Ben Marzinski 2018-02-14 13:05:08 EST
Yes. uxsock_timeout controls how long multipathd will wait for a message from a command client, and how long the client will wait for a reply from multipathd. So increasing that would have solved the issue.
Comment 9 John Pittman 2018-02-14 13:33:17 EST
Thanks again Ben.  Decreasing sev/prio.
Comment 11 Ben Marzinski 2018-06-13 17:14:23 EDT
I've increased the timeout from 10 seconds to 60 seconds.
Comment 17 Lin Li 2018-08-01 13:47:17 EDT
Reproduced on device-mapper-multipath-0.4.9-111
1, [root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-0.4.9-111.el7.x86_64
device-mapper-multipath-libs-0.4.9-111.el7.x86_64

2,[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=500
[root@storageqe-06 ~]# multipathd -k"reconfigure"
ok

3,[root@storageqe-06 ~]# modprobe -r scsi_debug
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=600
[root@storageqe-06 ~]# time multipathd -k"reconfigure"
ok

real	0m2.549s
user	0m0.001s
sys	0m0.003s

4,[root@storageqe-06 ~]# modprobe -r scsi_debug
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=700
[root@storageqe-06 ~]# time multipathd -k"reconfigure"
error receiving packet  <----------------------------

real	0m10.014s
user	0m0.002s
sys	0m0.002s


Verified on device-mapper-multipath-0.4.9-121.el7
1,[root@storageqe-06 ~]# rpm -qa | grep multipath 
device-mapper-multipath-0.4.9-121.el7.x86_64
device-mapper-multipath-libs-0.4.9-121.el7.x86_64
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=700

2,[root@storageqe-06 ~]# time multipathd -k"reconfigure"
ok

real	0m13.626s
user	0m0.001s
sys	0m0.003s

3,[root@storageqe-06 ~]# modprobe -r scsi_debug
[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=800
[root@storageqe-06 ~]# time multipathd -k"reconfigure"
ok

real	0m0.130s
user	0m0.002s
sys	0m0.002s



Test result: Both attempts were successful over 10s threshold
Comment 20 errata-xmlrpc 2018-10-30 07:27:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3236

Note You need to log in before you can comment on or make changes to this bug.