Description of problem:
I have a system configured with 1024 LUNs and 4 partitions per LUN. When dm-multipath tries to start up, it eventually faults because too many files are open. ulimit is set to unlimited.

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.7-10.el5
2.6.18-34.el5 #1 SMP

How reproducible:
Every time.

Steps to Reproduce:
1. Configure 1K LUNs
2. Put 4 partitions on each LUN
3. multipath -v2

Actual results:
The multipath daemon can't start up. Errors:

error calling out /sbin/mpath_prio_emc /dev/sddy
error calling out /sbin/mpath_prio_emc /dev/sdel
Cannot open bindings file [/var/lib/multipath/bindings] : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaaa
360060160a3261100d90092f0a190d911: failed to access path sdaaa
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaac
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaad
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaae
DM message failed [queue_if_no_path]
...etc.
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdahe
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdahf
DM message failed [queue_if_no_path]

Expected results:
The multipath daemon should start up without error.

Additional info:
xeon3 is configured in this state now.
I found the file descriptor leak (it wasn't hard) and installed a patched version of multipath on xeon3 that seems to work. However, the fix is truly a hack. It seems like a fundamental part of the design of the device-mapper-multipath tools is that it will hold an open file descriptor on every SCSI disk. Can you retest everything with the (currently installed) tools and verify that it's still working? Chip
Created attachment 160974 [details] plug a file descriptor leak This is the patch that seems to fix the problem (but might introduce others).
Chip - did you see this bz of mine: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217130

I am playing with "ulimit -n" - at one point this made a big difference in my setup, but right now I'm having trouble reproducing it. I'm wondering if putting a "ulimit -n" setting in /etc/dev.d/block/multipath.dev would help solve the problems with the fds - multipath needs a crazy number of fds to run. Since they are short-lived commands, fd leaks probably aren't a big deal, right?
I wouldn't characterize what the patch in comment #2 does as fixing a file descriptor leak. It closes a file descriptor that would otherwise be held open, but there is still a pointer to this file descriptor, and it gets reused.

The reason for this design is that it can be problematic to reopen that file descriptor when, say, you have just lost your last path to your root filesystem. Also, when you lose access to paths, memory pressure can increase rapidly, making opening a file descriptor impossible, even when it would still be possible to send IO to that file descriptor if it were already open. However, this function gets called in many code paths, and in some of them you can probably guarantee that closing the file descriptor will not cause any problems. It is worth looking into closing file descriptors in callers that don't need to keep them open. Unfortunately, I'm not sure that this will get you any benefit with multipathd.

I would be very interested in having you try something like Dave's ideas from comment #3 and bz #217130. Also, if you are able to create all the multipaths, please send me the output from

# multipath -l

I don't have the hardware to recreate this directly. I am currently trying with gnbd devices, but it would be easier if I could see exactly what your setup looked like. If I could have access to the machine that you are working on, that would work too.
Created attachment 161066 [details] output of "multipath -l" This is the output of "multipath -l" with the patch installed.
(In reply to comment #4)
> The reason for this is that it is problematic to open that file descriptor, say
> you have just lost your last path to your root filesystem.

Interesting point.

> I would be very interested in having you try
> something like Daves ideas from Comment #3 and bz #217130

"ulimit -n 4096" solved an earlier, similar problem on the same system with vgscan. Obviously, that won't work with 1024 LUNs and 4 partitions each (if you have at least one more fd open).

Chip
Have you tried "ulimit -n 8192" to see if that works?

By the way, multipath doesn't do anything with partitions, so the number of partitions you create on your LUNs shouldn't affect multipath itself. You can see in your attachment that multipath only creates 1024 devices; these are the only devices that multipathd operates on. I'm not saying that partitioning your devices doesn't affect anything, but it would have to be through device-mapper itself, since multipath doesn't know or care at all about the partitions.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
There is now a max_fds parameter to /etc/multipath.conf. If it's not set, multipathd simply uses the regular max open fds limit for the shell. You can set it to a number or unlimited. It works exactly like ulimit -n. I also fixed a bug that was keeping all of the fds open unnecessarily for multipath. With this option, I set up a system with 1025 devices with 4 paths each.
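The new option described above might be set like this in /etc/multipath.conf (a sketch; the value 8192 is purely illustrative):

```
defaults {
        # cap multipathd's open file descriptors; works like "ulimit -n"
        max_fds 8192
}
```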
*** Bug 280621 has been marked as a duplicate of this bug. ***
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot1, available now on partners.redhat.com. Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add a *keyword* of PartnerVerified (leaving the existing keywords unmodified).

If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you.

If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

Thank you
Hi,

The RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained. A mockup of the RHEL5.2 release notes can be viewed at the following link:

http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

Please use the aforementioned link to verify whether your bugzilla is already in the release notes (if it needs to be). Each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number.

Cheers,
Don
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot3, available now on partners.redhat.com. Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add a *keyword* of PartnerVerified (leaving the existing keywords unmodified).

If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you.

If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

Thank you
I just tested this on Snapshot3, and setting the limit to "unlimited" is not working, but setting max_fds to a sufficiently big number (tried with 10000) solves the issue. I have mapped 256 LUNs to the host, with 4 paths each (1024 paths in total).

[root@lnx199-115 ~]# uname -a
Linux lnx199-115.lab.eng.btc.netapp.in 2.6.18-87.el5 #1 SMP Tue Mar 25 17:28:02 EDT 2008 i686 i686 i386 GNU/Linux
[root@lnx199-115 ~]# /etc/init.d/multipathd restart
Stopping multipathd daemon: [ OK ]
Starting multipathd daemon: [ OK ]

After this step, I see the following logs in /var/log/messages:

------------- /var/log/messages -------------------
Apr 7 22:56:40 lnx199-115 multipathd: can't set open fds limit to -1 : Operation not permitted
Apr 7 22:56:40 lnx199-115 multipathd: cannot open /sbin/dasd_id : No such file or directory
Apr 7 22:56:40 lnx199-115 multipathd: cannot open /sbin/gnbd_import : No such file or directory
Apr 7 22:56:40 lnx199-115 multipathd: [copy.c] cannot open /sbin/dasd_id
Apr 7 22:56:40 lnx199-115 multipathd: cannot copy /sbin/dasd_id in ramfs : No such file or directory
Apr 7 22:56:40 lnx199-115 multipathd: [copy.c] cannot open /sbin/gnbd_import
Apr 7 22:56:40 lnx199-115 multipathd: cannot copy /sbin/gnbd_import in ramfs : No such file or directory
Apr 7 22:56:46 lnx199-115 multipathd: error calling out /sbin/mpath_prio_ontap /dev/sdzt
Apr 7 22:56:46 lnx199-115 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdzt
Apr 7 22:56:46 lnx199-115 multipathd: error calling out /sbin/mpath_prio_ontap /dev/sdzu
Apr 7 22:56:46 lnx199-115 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdzu
------------- End of /var/log/messages -------------------

And the status of multipathd is as below:

[root@lnx199-115 ~]# /etc/init.d/multipathd status
multipathd dead but pid file exists

My multipath.conf looks as below:
[root@lnx199-115 ~]# cat /etc/multipath.conf
defaults {
        user_friendly_names yes
        max_fds unlimited
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^(hd|xvd)[a-z]*"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}
devices {
        device {
                vendor "NETAPP"
                product "LUN"
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout "/sbin/mpath_prio_ontap /dev/%n"
                features "1 queue_if_no_path"
                hardware_handler "0"
                path_grouping_policy group_by_prio
                failback immediate
                rr_weight uniform
                rr_min_io 128
                path_checker directio
        }
}

Other observations:
1. max_fds 10000 works fine; I can restart multipathd. But "unlimited" is not working.
2. SELinux does not matter; the behavior is the same regardless of the SELinux setting.
3. We wrote a simple program to set the limit to RLIM_INFINITY using setrlimit(). This also failed with the same error (Operation not permitted). So this seems to be a generic issue with setting the resource limit, not something specific to multipathd.
Unfortunately, "unlimited" does seem like it will not work, due to an in-kernel limit of 1048576 open file descriptors. For all practical purposes, though, this limit shouldn't be a big problem. I'll remove the "unlimited" option in future releases. It might be best to replace it with a "max" option that simply sets the limit to the in-kernel maximum.
This bug has been tagged for inclusion in the RHEL5.2 release notes. Please post the necessary content for it. Thanks!
Don - I think the release note should say something like: In /etc/multipath.conf, the 'max_fds unlimited' option should not be used. Please use a sufficiently high value instead of 'unlimited'. This will be addressed in a future minor release.
revising as follows: <quote> In /etc/multipath.conf, setting max_fds to unlimited will prevent the multipathd daemon from starting up properly. As such, you should use a sufficiently high value instead for this setting. </quote> please advise if any further revisions are required. thanks!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0337.html
Ben,

In comment #20 you mentioned that you would be adding a "max" option for the max_fds parameter. When is this fix expected? Or is it already available in the RHEL 4.7 snapshots?

Thank you,
Raj
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes. This Release Note is currently located in the Known Issues section.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.