Description of problem: wait_for_file() times out on the environment with a large number of multipath devices and high load. customer's environment: +-- Server ---+ +--- Disk Array ---+ | SAS-HBA+------+ LUN * 256 | | SAS-HBA+------+ | +-------------+ +------------------+ * Load: 32 dd commands were issued to each LUN at the same time. On this environment, customer unplugged a cable and plugged it again for testing. And, several paths were not recovered. For each pathes, the following errors were outputed in debug logfile of multipathd. multipathd-v3-d.log --------------------------- Aug 11 10:49:08 | failed to open /dev/sddw Aug 11 10:49:08 | sddw: failed to get path uid Aug 11 10:49:08 | uevent trigger error Aug 11 10:55:28 | failed to open /dev/sdfc Aug 11 10:55:28 | sdfc: failed to get path uid Aug 11 10:55:28 | uevent trigger error Aug 11 10:57:11 | failed to open /dev/sdfg Aug 11 10:57:11 | sdfg: failed to get path uid Aug 11 10:57:11 | uevent trigger error Aug 11 11:02:31 | failed to open /dev/sdgf Aug 11 11:02:31 | sdgf: failed to get path uid Aug 11 11:02:31 | uevent trigger error This means that the device files didn't exist in /dev. It should be created by udevd. And, multipathd waits the completion in wait_for_file() for 60 seconds. But, those device files were created after that. /var/log/messages with "udevcontrol log_priority=debug" -------------------------------- Aug 11 10:49:13 RX300S3-K udevd-event[29057]: udev_node_add: creating device node '/dev/sddw', major = '71', minor = '224', mode = '0640', uid = '0', gid = '6' Aug 11 10:49:15 RX300S3-K udevd-event[29057]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d00000000000000007d0000' to '../../sddw' Aug 11 10:55:31 RX300S3-K udevd-event[30135]: udev_node_add: creating device node '/dev/sdfc', major = '129', minor = '224', mode = '0640', uid = '0', gid = '6' Aug 11 10:55:35 RX300S3-K udevd-event[30135]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d00000000000000009d0000' to '../../sdfc' Aug 11 10:57:14 RX300S3-K udevd-event[30220]: udev_node_add: creating device node '/dev/sdfg', major = '130', minor = '32', mode = '0640', uid = '0', gid = '6' Aug 11 10:57:14 RX300S3-K udevd-event[30220]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d0000000000000000a10000' to '../../sdfg' Aug 11 11:02:17 RX300S3-K udevd-event[31241]: udev_node_add: creating device node '/dev/sdgf', major = '131', minor = '176', mode = '0640', uid = '0', gid = '6' Aug 11 11:02:39 RX300S3-K udevd-event[31241]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d0000000000000000ba0000' to '../../sdgf' I created a patch to extend the wait time into 90 seconds. This problem didn't occur with this patch on their environment. I'll attach the proposed patch. Version-Release number of selected component (if applicable): device-mapper-multipath-0.4.7-34.el5.x86_64 How reproducible: Always on high load Steps to Reproduce: 1. Prepare a large number of multipath devices 2. Create highload on the multipath devices 3. Pull out a cable 4. Insert the cable again Actual results: some mpath devices of 256 didn't recover after connecting their paths under a high load situation. Expected results: All devices should recover after connecting their paths even if it's under a high load situation. Additional info:
Seems reasonable.
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
Patch applied. Thanks.
Actually, I've changed the solution to this bug. The default timeout is still 90 second, but it is now configurable. There is a new /etc/multipath.conf defaults parameter, "file_timeout". It defaults to 90 seconds, but it can be set to any positive value.
*** Bug 674366 has been marked as a duplicate of this bug. ***
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds
Confirm patch checked in device-mapper-multipath-0.4.7-46.el5
Ben, When I use "file_timeout 0" in configuration, multipathd -k'reconfig' will got this: ============= *** glibc detected *** multipathd: double free or corruption (out): 0x00007fff6419c110 *** ======= Backtrace: ========= /lib64/libc.so.6[0x31a867245f] /lib64/libc.so.6(cfree+0x4b)[0x31a86728bb] multipathd(xfree+0x27)[0x41b1d1] multipathd[0x408b46] multipathd[0x408b98] multipathd[0x4082ab] /lib64/libc.so.6(__libc_start_main+0xf4)[0x31a861d994] multipathd[0x403c09] ======= Memory map: ======== 00400000-00442000 r-xp 00000000 fd:00 16449547 /sbin/multipathd 00642000-00646000 rw-p 00042000 fd:00 16449547 /sbin/multipathd 00646000-00647000 rw-p 00646000 00:00 0 00845000-00847000 rw-p 00045000 fd:00 16449547 /sbin/multipathd 07a6d000-07a8e000 rw-p 07a6d000 00:00 0 [heap] 31a8200000-31a821c000 r-xp 00000000 fd:00 7143741 /lib64/ld-2.5.so 31a841c000-31a841d000 r--p 0001c000 fd:00 7143741 /lib64/ld-2.5.so 31a841d000-31a841e000 rw-p 0001d000 fd:00 7143741 /lib64/ld-2.5.so 31a8600000-31a874e000 r-xp 00000000 fd:00 7143742 /lib64/libc-2.5.so 31a874e000-31a894e000 ---p 0014e000 fd:00 7143742 /lib64/libc-2.5.so 31a894e000-31a8952000 r--p 0014e000 fd:00 7143742 /lib64/libc-2.5.so 31a8952000-31a8953000 rw-p 00152000 fd:00 7143742 /lib64/libc-2.5.so 31a8953000-31a8958000 rw-p 31a8953000 00:00 0 31a8a00000-31a8a02000 r-xp 00000000 fd:00 7143743 /lib64/libdl-2.5.so 31a8a02000-31a8c02000 ---p 00002000 fd:00 7143743 /lib64/libdl-2.5.so 31a8c02000-31a8c03000 r--p 00002000 fd:00 7143743 /lib64/libdl-2.5.so 31a8c03000-31a8c04000 rw-p 00003000 fd:00 7143743 /lib64/libdl-2.5.so 31a8e00000-31a8e16000 r-xp 00000000 fd:00 7143751 /lib64/libpthread-2.5.so 31a8e16000-31a9015000 ---p 00016000 fd:00 7143751 /lib64/libpthread-2.5.so 31a9015000-31a9016000 r--p 00015000 fd:00 7143751 /lib64/libpthread-2.5.so 31a9016000-31a9017000 rw-p 00016000 fd:00 7143751 /lib64/libpthread-2.5.so 31a9017000-31a901b000 rw-p 31a9017000 00:00 0 31a9200000-31a9221000 r-xp 00000000 fd:00 7143641 /lib64/libdevmapper.so.1.02 31a9221000-31a9421000 ---p 00021000 fd:00 7143641 /lib64/libdevmapper.so.1.02 31a9421000-31a9423000 rw-p 00021000 fd:00 7143641 /lib64/libdevmapper.so.1.02 31a9600000-31a960a000 r-xp 00000000 fd:00 1331528 /usr/lib64/libsysfs.so.2.0.1 31a960a000-31a980a000 ---p 0000a000 fd:00 1331528 /usr/lib64/libsysfs.so.2.0.1 31a980a000-31a980b000 rw-p 0000a000 fd:00 1331528 /usr/lib64/libsysfs.so.2.0.1 31a9a00000-31a9a15000 r-xp 00000000 fd:00 7143750 /lib64/libselinux.so.1 31a9a15000-31a9c15000 ---p 00015000 fd:00 7143750 /lib64/libselinux.so.1 31a9c15000-31a9c17000 rw-p 00015000 fd:00 7143750 /lib64/libselinux.so.1 31a9c17000-31a9c18000 rw-p 31a9c17000 00:00 0 31aa200000-31aa23b000 r-xp 00000000 fd:00 7143749 /lib64/libsepol.so.1 31aa23b000-31aa43b000 ---p 0003b000 fd:00 7143749 /lib64/libsepol.so.1 31aa43b000-31aa43c000 rw-p 0003b000 fd:00 7143749 /lib64/libsepol.so.1 31aa43c000-31aa446000 rw-p 31aa43c000 00:00 0 31aa600000-31aa635000 r-xp 00000000 fd:00 1318090 /usr/lib64/libreadline.so.5.1 31aa635000-31aa834000 ---p 00035000 fd:00 1318090 /usr/lib64/libreadline.so.5.1 31aa834000-31aa83c000 rw-p 00034000 fd:00 1318090 /usr/lib64/libreadline.so.5.1 31aa83c000-31aa83d000 rw-p 31aa83c000 00:00 0 31aaa00000-31aaa4f000 r-xp 00000000 fd:00 1319216 /usr/lib64/libncurses.so.5.5 31aaa4f000-31aac4e000 ---p 0004f000 fd:00 1319216 /usr/lib64/libncurses.so.5.5 31aac4e000-31aac5c000 rw-p 0004e000 fd:00 1319216 /usr/lib64/libncurses.so.5.5 31aac5c000-31aac5d000 rw-p 31aac5c000 00:00 0 31ac600000-31ac60d000 r-xp 00000000 fd:00 7143756 /lib64/libgcc_s-4.1.2-20080825.so.1 31ac60d000-31ac80d000 ---p 0000d000 fd:00 7143756 /lib64/libgcc_s-4.1.2-20080825.so.1 31ac80d000-31ac80e000 rw-p 0000d000 fd:00 7143756 /lib64/libgcc_s-4.1.2-20080825.so.1 2ab5e330b000-2ab5e330d000 rw-p 2ab5e330b000 00:00 0 2ab5e331e000-2ab5e3322000 rw-p 2ab5e331e000 00:00 0 7fff64189000-7fff6419e000 rw-p 7ffffffe9000 00:00 0 [stack] 7fff641fd000-7fff64200000 r-xp 7fff641fd000 00:00 0 [vdso] ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vsyscall] Aborted =========== Do I need a file a bug for this or you want to fix it in this bug? BTW, I opened a Bug #715524 for document update issue.
Um... 0 is obviously not a valid value, and this will only happen because of user error. I'm not sure that this reaches the level of blocker, so opening a new bug makes the most sense, unless you have any serious objections. Good catch, though. Thanks.
I have verified negative number it will revert back to default value just like code said. Try to create 256 LUNs via 8 paths each. Each mpath have dt with 16 thread running against it, but still cannot trigger udev timeout. As this bug just change a hard coded value into configeable, I am OK for verify it. For 0 value of file_time, it goes to another bug.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds+multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds. It's know issue that 'file_time 0' will casue mutipathd crash, please refer to Bug 716329.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds. It's know issue that 'file_time 0' will casue mutipathd crash, please refer to Bug 716329.+multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds. It's know issue that 'file_timeout 0' will casue mutipathd crash, please refer to Bug 716329.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1032.html