Bug 627911 - wait_for_file() times out on the environment with a large number of multipath devices and high load.
Summary: wait_for_file() times out on the environment with a large number of multipath...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.5
Hardware: All
OS: Linux
medium
medium
Target Milestone: rc
: ---
Assignee: Ben Marzinski
QA Contact: Gris Ge
URL:
Whiteboard:
: 674366 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-08-27 11:27 UTC by Masahiro Matsuya
Modified: 2018-11-14 19:13 UTC (History)
19 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds. It's know issue that 'file_timeout 0' will casue mutipathd crash, please refer to Bug 716329.
Clone Of:
Environment:
Last Closed: 2011-07-21 08:21:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:1032 0 normal SHIPPED_LIVE device-mapper-multipath bug fix and enhancement update 2011-07-20 15:44:06 UTC

Description Masahiro Matsuya 2010-08-27 11:27:50 UTC
Description of problem:

wait_for_file() times out on the environment with a large number of multipath devices and high load.

customer's environment:

  +-- Server ---+      +--- Disk Array ---+
  |      SAS-HBA+------+  LUN * 256       |
  |      SAS-HBA+------+                  |
  +-------------+      +------------------+

 * Load: 32 dd commands were issued to each LUN at the same time.

On this environment, customer unplugged a cable and plugged it again for testing.
And, several paths were not recovered.

For each pathes, the following errors were outputed in debug logfile of multipathd.

multipathd-v3-d.log
---------------------------
Aug 11 10:49:08 | failed to open /dev/sddw
Aug 11 10:49:08 | sddw: failed to get path uid
Aug 11 10:49:08 | uevent trigger error

Aug 11 10:55:28 | failed to open /dev/sdfc
Aug 11 10:55:28 | sdfc: failed to get path uid
Aug 11 10:55:28 | uevent trigger error

Aug 11 10:57:11 | failed to open /dev/sdfg
Aug 11 10:57:11 | sdfg: failed to get path uid
Aug 11 10:57:11 | uevent trigger error

Aug 11 11:02:31 | failed to open /dev/sdgf
Aug 11 11:02:31 | sdgf: failed to get path uid
Aug 11 11:02:31 | uevent trigger error

This means that the device files didn't exist in /dev. It should be created by udevd. And, multipathd waits the completion in wait_for_file() for 60 seconds.

But, those device files were created after that.


/var/log/messages with "udevcontrol log_priority=debug"
--------------------------------
Aug 11 10:49:13 RX300S3-K udevd-event[29057]: udev_node_add: creating device node '/dev/sddw', major = '71', minor = '224', mode = '0640', uid = '0', gid = '6'
Aug 11 10:49:15 RX300S3-K udevd-event[29057]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d00000000000000007d0000' to '../../sddw'

Aug 11 10:55:31 RX300S3-K udevd-event[30135]: udev_node_add: creating device node '/dev/sdfc', major = '129', minor = '224', mode = '0640', uid = '0', gid = '6'
Aug 11 10:55:35 RX300S3-K udevd-event[30135]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d00000000000000009d0000' to '../../sdfc'

Aug 11 10:57:14 RX300S3-K udevd-event[30220]: udev_node_add: creating device node '/dev/sdfg', major = '130', minor = '32', mode = '0640', uid = '0', gid = '6'
Aug 11 10:57:14 RX300S3-K udevd-event[30220]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d0000000000000000a10000' to '../../sdfg'

Aug 11 11:02:17 RX300S3-K udevd-event[31241]: udev_node_add: creating device node '/dev/sdgf', major = '131', minor = '176', mode = '0640', uid = '0', gid = '6'
Aug 11 11:02:39 RX300S3-K udevd-event[31241]: udev_node_add: creating symlink '/dev/disk/by-id/scsi-3600000e00d0000000000000000ba0000' to '../../sdgf'


I created a patch to extend the wait time into 90 seconds. This problem didn't occur with this patch on their environment.

I'll attach the proposed patch.

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.7-34.el5.x86_64

How reproducible:
Always on high load

Steps to Reproduce:
1. Prepare a large number of multipath devices
2. Create highload on the multipath devices
3. Pull out a cable
4. Insert the cable again


Actual results:
some mpath devices of 256 didn't recover after connecting their paths under a high load situation.

Expected results:
All devices should recover after connecting their paths even if it's under a high load situation.

Additional info:

Comment 5 Ben Marzinski 2010-08-27 17:12:21 UTC
Seems reasonable.

Comment 6 RHEL Program Management 2011-01-11 20:50:43 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 7 RHEL Program Management 2011-01-11 22:27:16 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 9 Ben Marzinski 2011-03-07 15:45:54 UTC
Patch applied. Thanks.

Comment 11 Ben Marzinski 2011-04-07 21:09:12 UTC
Actually, I've changed the solution to this bug.  The default timeout is still 90 second, but it is now configurable. There is a new /etc/multipath.conf defaults parameter, "file_timeout". It defaults to 90 seconds, but it can be set to any positive value.

Comment 12 Ben Marzinski 2011-04-07 21:10:05 UTC
*** Bug 674366 has been marked as a duplicate of this bug. ***

Comment 13 Ben Marzinski 2011-04-08 02:11:55 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds

Comment 15 Chao Ye 2011-06-15 07:06:00 UTC
Confirm patch checked in device-mapper-multipath-0.4.7-46.el5

Comment 16 Gris Ge 2011-06-23 08:34:35 UTC
Ben,

When I use "file_timeout 0" in configuration, multipathd -k'reconfig' will got this:
=============
*** glibc detected *** multipathd: double free or corruption (out): 0x00007fff6419c110 ***
======= Backtrace: =========
/lib64/libc.so.6[0x31a867245f]
/lib64/libc.so.6(cfree+0x4b)[0x31a86728bb]
multipathd(xfree+0x27)[0x41b1d1]
multipathd[0x408b46]
multipathd[0x408b98]
multipathd[0x4082ab]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x31a861d994]
multipathd[0x403c09]
======= Memory map: ========
00400000-00442000 r-xp 00000000 fd:00 16449547                           /sbin/multipathd
00642000-00646000 rw-p 00042000 fd:00 16449547                           /sbin/multipathd
00646000-00647000 rw-p 00646000 00:00 0 
00845000-00847000 rw-p 00045000 fd:00 16449547                           /sbin/multipathd
07a6d000-07a8e000 rw-p 07a6d000 00:00 0                                  [heap]
31a8200000-31a821c000 r-xp 00000000 fd:00 7143741                        /lib64/ld-2.5.so
31a841c000-31a841d000 r--p 0001c000 fd:00 7143741                        /lib64/ld-2.5.so
31a841d000-31a841e000 rw-p 0001d000 fd:00 7143741                        /lib64/ld-2.5.so
31a8600000-31a874e000 r-xp 00000000 fd:00 7143742                        /lib64/libc-2.5.so
31a874e000-31a894e000 ---p 0014e000 fd:00 7143742                        /lib64/libc-2.5.so
31a894e000-31a8952000 r--p 0014e000 fd:00 7143742                        /lib64/libc-2.5.so
31a8952000-31a8953000 rw-p 00152000 fd:00 7143742                        /lib64/libc-2.5.so
31a8953000-31a8958000 rw-p 31a8953000 00:00 0 
31a8a00000-31a8a02000 r-xp 00000000 fd:00 7143743                        /lib64/libdl-2.5.so
31a8a02000-31a8c02000 ---p 00002000 fd:00 7143743                        /lib64/libdl-2.5.so
31a8c02000-31a8c03000 r--p 00002000 fd:00 7143743                        /lib64/libdl-2.5.so
31a8c03000-31a8c04000 rw-p 00003000 fd:00 7143743                        /lib64/libdl-2.5.so
31a8e00000-31a8e16000 r-xp 00000000 fd:00 7143751                        /lib64/libpthread-2.5.so
31a8e16000-31a9015000 ---p 00016000 fd:00 7143751                        /lib64/libpthread-2.5.so
31a9015000-31a9016000 r--p 00015000 fd:00 7143751                        /lib64/libpthread-2.5.so
31a9016000-31a9017000 rw-p 00016000 fd:00 7143751                        /lib64/libpthread-2.5.so
31a9017000-31a901b000 rw-p 31a9017000 00:00 0 
31a9200000-31a9221000 r-xp 00000000 fd:00 7143641                        /lib64/libdevmapper.so.1.02
31a9221000-31a9421000 ---p 00021000 fd:00 7143641                        /lib64/libdevmapper.so.1.02
31a9421000-31a9423000 rw-p 00021000 fd:00 7143641                        /lib64/libdevmapper.so.1.02
31a9600000-31a960a000 r-xp 00000000 fd:00 1331528                        /usr/lib64/libsysfs.so.2.0.1
31a960a000-31a980a000 ---p 0000a000 fd:00 1331528                        /usr/lib64/libsysfs.so.2.0.1
31a980a000-31a980b000 rw-p 0000a000 fd:00 1331528                        /usr/lib64/libsysfs.so.2.0.1
31a9a00000-31a9a15000 r-xp 00000000 fd:00 7143750                        /lib64/libselinux.so.1
31a9a15000-31a9c15000 ---p 00015000 fd:00 7143750                        /lib64/libselinux.so.1
31a9c15000-31a9c17000 rw-p 00015000 fd:00 7143750                        /lib64/libselinux.so.1
31a9c17000-31a9c18000 rw-p 31a9c17000 00:00 0 
31aa200000-31aa23b000 r-xp 00000000 fd:00 7143749                        /lib64/libsepol.so.1
31aa23b000-31aa43b000 ---p 0003b000 fd:00 7143749                        /lib64/libsepol.so.1
31aa43b000-31aa43c000 rw-p 0003b000 fd:00 7143749                        /lib64/libsepol.so.1
31aa43c000-31aa446000 rw-p 31aa43c000 00:00 0 
31aa600000-31aa635000 r-xp 00000000 fd:00 1318090                        /usr/lib64/libreadline.so.5.1
31aa635000-31aa834000 ---p 00035000 fd:00 1318090                        /usr/lib64/libreadline.so.5.1
31aa834000-31aa83c000 rw-p 00034000 fd:00 1318090                        /usr/lib64/libreadline.so.5.1
31aa83c000-31aa83d000 rw-p 31aa83c000 00:00 0 
31aaa00000-31aaa4f000 r-xp 00000000 fd:00 1319216                        /usr/lib64/libncurses.so.5.5
31aaa4f000-31aac4e000 ---p 0004f000 fd:00 1319216                        /usr/lib64/libncurses.so.5.5
31aac4e000-31aac5c000 rw-p 0004e000 fd:00 1319216                        /usr/lib64/libncurses.so.5.5
31aac5c000-31aac5d000 rw-p 31aac5c000 00:00 0 
31ac600000-31ac60d000 r-xp 00000000 fd:00 7143756                        /lib64/libgcc_s-4.1.2-20080825.so.1
31ac60d000-31ac80d000 ---p 0000d000 fd:00 7143756                        /lib64/libgcc_s-4.1.2-20080825.so.1
31ac80d000-31ac80e000 rw-p 0000d000 fd:00 7143756                        /lib64/libgcc_s-4.1.2-20080825.so.1
2ab5e330b000-2ab5e330d000 rw-p 2ab5e330b000 00:00 0 
2ab5e331e000-2ab5e3322000 rw-p 2ab5e331e000 00:00 0 
7fff64189000-7fff6419e000 rw-p 7ffffffe9000 00:00 0                      [stack]
7fff641fd000-7fff64200000 r-xp 7fff641fd000 00:00 0                      [vdso]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vsyscall]
Aborted
===========
Do I need a file a bug for this or you want to fix it in this bug?

BTW, I opened a Bug #715524 for document update issue.

Comment 17 Ben Marzinski 2011-06-23 17:19:56 UTC
Um... 0 is obviously not a valid value, and this will only happen because of user error.  I'm not sure that this reaches the level of blocker, so opening a new bug makes the most sense, unless you have any serious objections. Good catch, though. Thanks.

Comment 18 Gris Ge 2011-06-24 01:41:13 UTC
I have verified negative number it will revert back to default value just like code said.

Try to create 256 LUNs via 8 paths each. Each mpath have dt with 16 thread running against it, but still cannot trigger udev timeout.

As this bug just change a hard coded value into configeable, I am OK for verify it.

For 0 value of file_time, it goes to another bug.

Comment 19 Gris Ge 2011-06-28 08:47:32 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds+multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds. It's know issue that 'file_time 0' will casue mutipathd crash, please refer to Bug 716329.

Comment 20 Gris Ge 2011-06-28 08:48:50 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds. It's know issue that 'file_time 0' will casue mutipathd crash, please refer to Bug 716329.+multipath.conf has a new defaults section parameter, file_timeout. This parameter controls how many seconds multipathd will wait for a necessary file to appear while setting up a multipath device. The default is 90 seconds. It's know issue that 'file_timeout 0' will casue mutipathd crash, please refer to Bug 716329.

Comment 21 errata-xmlrpc 2011-07-21 08:21:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1032.html


Note You need to log in before you can comment on or make changes to this bug.