Bug 1544958
Summary: | multipathd reconfigure times out with large number of luns | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | John Pittman <jpittman> |
Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
Status: | CLOSED ERRATA | QA Contact: | Lin Li <lilin> |
Severity: | medium | Docs Contact: | Marek Suchánek <msuchane> |
Priority: | medium | ||
Version: | 7.4 | CC: | agk, bmarzins, heinzm, jbrassow, jmagrini, lilin, loberman, msnitzer, prajnoha, rhandlin |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | device-mapper-multipath-0.4.9-120.el7 | Doc Type: | Release Note |
Doc Text: |
DM Multipath is much less likely to output an incorrect timeout error
Previously, Device Mapper Multipath (DM Multipath) printed an error message after reconfiguring devices for more than 10 seconds. The error was displayed even if the reconfigure was progressing successfully. As a consequence, it sometimes seemed that the reconfigure failed when reconfiguring a large number of devices.
With this update, the timeout limit has been increased from 10 to 60 seconds, and DM Multipath is now much less likely to print incorrect timeout errors when reconfiguring a large number of devices.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-10-30 11:27:28 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1546181 |
Description
John Pittman
2018-02-13 21:05:34 UTC
I noticed in /usr/lib/systemd/system/multipathd.service, ExecReload is '/sbin/multipathd reconfigure'. The reload command completes successfully, is this a workaround? Or failing on the backend where we can't see? [root@localhost multipath-tools]# multipathd -k"reconfigure" error receiving packet [root@localhost multipath-tools]# multipathd reconfigure error receiving packet [root@localhost multipath-tools]# systemctl reload multipathd.service [root@localhost multipath-tools]# echo $? 0 The issue seems to be with poll() timing out as we can see from a strace. The timeout occurs with either multipathd -k reconfigured or /sbin/multipathd/reconfigure: 31183 16:23:56.966192 socket(PF_LOCAL, SOCK_STREAM, 0) = 3 <0.001124> 31183 16:23:56.967338 connect(3, {sa_family=AF_LOCAL, sun_path=@"/org/kernel/linux/storage/multipathd"}, 39) = 0 <0.000713> 31183 16:23:56.968079 rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0 <0.000007> 31183 16:23:56.968112 write(3, "\16\0\0\0\0\0\0\0", 8) = 8 <0.000010> 31183 16:23:56.968136 write(3, "reconfigure \n\0", 14) = 14 <0.000008> 31183 16:23:56.968156 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000007> 31183 16:23:56.968179 poll([{fd=3, events=POLLIN}], 1, 10000) = 0 (Timeout) <10.010081> The threshold seems to be 10s: # time multipathd -kreconfigure ok real 0m9.512s user 0m0.000s sys 0m0.003s # time multipathd -kreconfigure error receiving packet real 0m10.006s user 0m0.001s sys 0m0.003s No Drops and the receive buffer seems large enough: sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode ffff88ff23a6b000 15 3322 00000002 0 0 0 2 0 30769 UNCONN 0 0 uevent:multipathd/3322 * skmem:(r0,rb268435456,t0,tb212992,f0,w0,o96,bl0) -- UNCONN 0 0 uevent:multipathd/3322 * skmem:(r0,rb268435456,t0,tb212992,f0,w0,o96,bl0) Have not been able to reproduce this issue at F27 levels, the command returns much quicker. device-mapper-multipath-0.7.1-9.git847cc43.fc27 device-mapper-1.02.144-1.fc27 kernel-4.14.18-300.fc27 $ multipath -ll | grep -c dm- 1003 $ time multipathd -k"reconfigure" ok real 0m0.088s user 0m0.004s sys 0m0.004s f27 does reconfiguration completely differently. That command returns as soon as multipathd starts doing the reconfiguration, instead of when it finishes. Multipathd continues to use the old configuration until reconfiguration completes, and then it switches over. So this shouldn't be an issue in f27 and beyond. For rhel7, this is probably simply a matter of increasing the timeout. Reconfiguration will complete at any rate, the multipathd command client simply times out waiting for a reply from multipathd. I don't think this is exception or blocker worthy (which means it will miss rhel-7.5), since the reconfiguration does happen, and most users use systemctl to reconfigure multipathd anyway, and won't even notice an issue. Thanks a lot Ben, that's great news that the socket error does not mean a failed command. I saw that a uxsock_timeout keyword is implemented in later versions, 96e8b90227ce (Implement 'uxsock_timeout' keyword). In theory, would this parameter, if set, have been a workaround for the error? Yes. uxsock_timeout controls how long multipathd will wait for a message from a command client, and how long the client will wait for a reply from multipathd. So increasing that would have solved the issue. Thanks again Ben. Decreasing sev/prio. I've increased the timeout from 10 seconds to 60 seconds. Reproduced on device-mapper-multipath-0.4.9-111 1, [root@storageqe-06 ~]# rpm -qa | grep multipath device-mapper-multipath-0.4.9-111.el7.x86_64 device-mapper-multipath-libs-0.4.9-111.el7.x86_64 2,[root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=500 [root@storageqe-06 ~]# multipathd -k"reconfigure" ok 3,[root@storageqe-06 ~]# modprobe -r scsi_debug [root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=600 [root@storageqe-06 ~]# time multipathd -k"reconfigure" ok real 0m2.549s user 0m0.001s sys 0m0.003s 4,[root@storageqe-06 ~]# modprobe -r scsi_debug [root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=700 [root@storageqe-06 ~]# time multipathd -k"reconfigure" error receiving packet <---------------------------- real 0m10.014s user 0m0.002s sys 0m0.002s Verified on device-mapper-multipath-0.4.9-121.el7 1,[root@storageqe-06 ~]# rpm -qa | grep multipath device-mapper-multipath-0.4.9-121.el7.x86_64 device-mapper-multipath-libs-0.4.9-121.el7.x86_64 [root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=700 2,[root@storageqe-06 ~]# time multipathd -k"reconfigure" ok real 0m13.626s user 0m0.001s sys 0m0.003s 3,[root@storageqe-06 ~]# modprobe -r scsi_debug [root@storageqe-06 ~]# modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=800 [root@storageqe-06 ~]# time multipathd -k"reconfigure" ok real 0m0.130s user 0m0.002s sys 0m0.002s Test result: Both attempts were successful over 10s threshold Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3236 |