Bug 251346 - dm-multipath fails to start up on a system with 1K LUNs and 4 partitions/LUN with a too many files error.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.1
Hardware: i386 Linux
Priority: medium
Severity: high
Assigned To: Ben Marzinski
QA Contact: Corey Marthaler
Keywords: OtherQA
Duplicates: 280621
Blocks: 217208 RHEL5u2_relnotes RHEL5u3_relnotes 457226
Reported: 2007-08-08 11:14 EDT by Barry Donahue
Modified: 2010-03-14 17:31 EDT
Fixed In Version: RHBA-2008-0337
Doc Type: Bug Fix
Doc Text:
(all architectures) In /etc/multipath.conf, setting max_fds to unlimited will prevent the multipathd daemon from starting up properly. As such, you should use a sufficiently high value instead for this setting.
Last Closed: 2008-05-21 11:35:25 EDT


Attachments
plug a file descriptor leak (369 bytes, patch)
2007-08-09 09:18 EDT, Chip Coldwell
output of "multipath -l" (198.44 KB, text/plain)
2007-08-10 13:54 EDT, Chip Coldwell

Description Barry Donahue 2007-08-08 11:14:43 EDT
Description of problem: I have a system configured with 1024 LUNs and 4
partitions per LUN. When dm-multipath tries to start up, it eventually faults
because too many files are open. ulimit is set to unlimited.


Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.7-10.el5
2.6.18-34.el5 #1 SMP

How reproducible: Every time.


Steps to Reproduce:
1. Configure 1K LUNs
2. Put 4 partitions on each LUN
3. Run multipath -v2
  
Actual results: The multipath daemon can't start up.
Errors:

error calling out /sbin/mpath_prio_emc /dev/sddy
error calling out /sbin/mpath_prio_emc /dev/sdel
Cannot open bindings file [/var/lib/multipath/bindings] : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaaa
360060160a3261100d90092f0a190d911: failed to access path sdaaa
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaac
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaad
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdaae
DM message failed [queue_if_no_path]
...ETC
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdahe
DM message failed [queue_if_no_path]
Cannot dup bindings file descriptor : Too many open files
error calling out /sbin/mpath_prio_emc /dev/sdahf
DM message failed [queue_if_no_path]


Expected results: The multipath daemon should start up without error.


Additional info: xeon3 is configured in this state now.
Comment 1 Chip Coldwell 2007-08-08 16:20:14 EDT
I found the file descriptor leak (it wasn't hard) and installed a patched
version of multipath on xeon3 that seems to work.  However, the fix is truly a hack.

It seems like a fundamental part of the design of the device-mapper-multipath
tools is that it will hold an open file descriptor on every SCSI disk.  Can you
retest everything with the (currently installed) tools and verify that it's
still working?

Chip
Comment 2 Chip Coldwell 2007-08-09 09:18:30 EDT
Created attachment 160974
plug a file descriptor leak

This is the patch that seems to fix the problem (but might introduce others).
Comment 3 Dave Wysochanski 2007-08-09 10:50:57 EDT
Chip - did you see this bz I have:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217130

I am playing with "ulimit -n" - at one point this made a big difference in my
setup but right now I'm having a problem with reproducing it.

I'm wondering if putting a "ulimit -n" setting in /etc/dev.d/block/multipath.dev
will help solve the problems with the fds - multipath needs a crazy # of fds to
run.  Since they are short-lived commands, fd leaks probably aren't a big deal, right?
Comment 4 Ben Marzinski 2007-08-10 13:10:14 EDT
I wouldn't characterize what the patch in comment #2 does as fixing a file
descriptor leak.  It closes a file descriptor that would otherwise be held open,
but there is still a pointer to this file descriptor, and it gets reused.

The reason for this is that it is problematic to open that file descriptor, say
you have just lost your last path to your root filesystem.  Also, when you lose
access to paths, memory pressure can increase rapidly, making opening a file
descriptor impossible, even if it would be possible to send IO to that file
descriptor if it was open.

However, this function gets called in many code paths, and in some of them you
can probably guarantee that closing the file descriptor will not cause any
problems.  It is worth looking into closing file descriptors in callers that
don't need to keep them open.   Unfortunately, I'm not sure that this will get
you any benefit with multipathd.  I would be very interested in having you try
something like Dave's ideas from Comment #3 and bz #217130

Also, if you are able to create all the multipaths, please send me the output from
# multipath -l

I don't have the hardware to recreate this directly. I am currently
trying with gnbd devices, but it would be easier if I could see exactly what
your setup looked like.  If I could have access to the machine that you are
working on, that would work too.
Comment 5 Chip Coldwell 2007-08-10 13:54:18 EDT
Created attachment 161066
output of "multipath -l"

This is the output of "multipath -l" with the patch installed.
Comment 6 Chip Coldwell 2007-08-10 14:00:38 EDT
(In reply to comment #4)
> 
> The reason for this is that it is problematic to open that file descriptor, say
> you have just lost your last path to your root filesystem.

Interesting point.

> I would be very interested in having you try
> something like Dave's ideas from Comment #3 and bz #217130

"ulimit -n 4096" solved an earlier, similar problem on the same system with
vgscan.  Obviously, that won't work with 1024 LUNs and 4 partitions each (if you
have at least one more fd open).
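
As a rough check of the arithmetic behind this remark, the sketch below assumes one descriptor held per partition plus one extra. That is only the reading implied here, not a confirmed description of how multipath allocates descriptors, and the function name fds_needed is hypothetical.

```python
def fds_needed(luns, partitions_per_lun, extra_fds=1):
    """Descriptors required if every partition holds one fd (an assumption)."""
    return luns * partitions_per_lun + extra_fds


limit = 4096                     # value passed to "ulimit -n"
needed = fds_needed(1024, 4)     # 1024 LUNs, 4 partitions each, 1 extra fd
print(needed, needed > limit)    # 4097 True
```

Under that assumption the 4096 limit is exceeded by exactly one descriptor.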

Chip
Comment 7 Ben Marzinski 2007-08-10 14:31:53 EDT
Have you tried "ulimit -n 8192" to see if that works? By the way, multipath
doesn't do anything with partitions, so the number of partitions you create on
your LUNs shouldn't affect multipath itself.  You can see in your attachment
that multipath only creates 1024 devices.  These are the only devices that
multipathd operates on.

I'm not saying that partitioning your devices doesn't affect anything.  But it
would have to be through device-mapper itself, since multipath doesn't know or
care at all about the partitions.
Comment 9 RHEL Product and Program Management 2007-11-14 16:44:20 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 10 Ben Marzinski 2008-01-14 21:01:40 EST
There is now a max_fds parameter to /etc/multipath.conf. If it's not set,
multipathd simply uses the regular max open fds limit for the shell. You can set
it to a number or unlimited. It works exactly like ulimit -n.  I also fixed a
bug that was keeping all of the fds open unnecessarily for multipath.  With this
option, I set up a system with 1025 devices with 4 paths each.
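
For readers unfamiliar with what "works exactly like ulimit -n" implies for a numeric value, here is a minimal sketch using Python's standard resource module. It is not the multipathd code; the helper name raise_nofile_limit is hypothetical, and the sketch assumes the process only adjusts its soft limit within the existing hard limit.

```python
import resource


def raise_nofile_limit(wanted):
    """Set the soft RLIMIT_NOFILE to `wanted`, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot exceed its hard limit.
        wanted = min(wanted, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    # Report the soft limit that is actually in effect afterwards.
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling raise_nofile_limit(10000) would be the rough analogue of a numeric max_fds setting; the return value shows the limit that actually took effect, which may be lower if the hard limit is smaller.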
Comment 11 Ben Marzinski 2008-01-14 21:13:10 EST
*** Bug 280621 has been marked as a duplicate of this bug. ***
Comment 15 John Poelstra 2008-03-20 23:59:48 EDT
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot1--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you
Comment 16 Don Domingo 2008-04-01 22:17:32 EDT
Hi,
the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.

a mockup of the RHEL5.2 release notes can be viewed at the following link:
http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don
Comment 17 John Poelstra 2008-04-02 17:40:27 EDT
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot3--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you
Comment 19 Rajashekhar M A 2008-04-09 14:33:46 EDT
I just tested this on Snapshot3, and setting the limit to "unlimited" does not
work. But setting max_fds to a sufficiently big number (I tried 10000)
solves the issue.

I have mapped 256 LUNs to the host, with 4 paths each (1024 paths in total).

[root@lnx199-115 ~]#
[root@lnx199-115 ~]# uname -a
Linux lnx199-115.lab.eng.btc.netapp.in 2.6.18-87.el5 #1 SMP Tue Mar 25 17:28:02 EDT 2008 i686 i686 i386 GNU/Linux
[root@lnx199-115 ~]#
[root@lnx199-115 ~]# /etc/init.d/multipathd restart
Stopping multipathd daemon:                                [  OK  ]
Starting multipathd daemon:                                [  OK  ]
[root@lnx199-115 ~]#

After this step, I see the following logs in /var/log/messages -

------------- /var/log/messages -------------------

Apr  7 22:56:40 lnx199-115 multipathd: can't set open fds limit to -1 : Operation not permitted
Apr  7 22:56:40 lnx199-115 multipathd: cannot open /sbin/dasd_id : No such file or directory
Apr  7 22:56:40 lnx199-115 multipathd: cannot open /sbin/gnbd_import : No such file or directory
Apr  7 22:56:40 lnx199-115 multipathd: [copy.c] cannot open /sbin/dasd_id
Apr  7 22:56:40 lnx199-115 multipathd: cannot copy /sbin/dasd_id in ramfs : No such file or directory
Apr  7 22:56:40 lnx199-115 multipathd: [copy.c] cannot open /sbin/gnbd_import
Apr  7 22:56:40 lnx199-115 multipathd: cannot copy /sbin/gnbd_import in ramfs : No such file or directory
Apr  7 22:56:46 lnx199-115 multipathd: error calling out /sbin/mpath_prio_ontap /dev/sdzt
Apr  7 22:56:46 lnx199-115 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdzt
Apr  7 22:56:46 lnx199-115 multipathd: error calling out /sbin/mpath_prio_ontap /dev/sdzu
Apr  7 22:56:46 lnx199-115 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdzu

------------- End of /var/log/messages -------------------

And the status of multipathd is as below -

[root@lnx199-115 ~]#
[root@lnx199-115 ~]# /etc/init.d/multipathd status
multipathd dead but pid file exists
[root@lnx199-115 ~]#

My multipath.conf looks as below -

[root@lnx199-115 ~]#
[root@lnx199-115 ~]# cat /etc/multipath.conf
defaults {
        user_friendly_names yes
        max_fds         unlimited
}

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^(hd|xvd)[a-z]*"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

devices {
        device {
                vendor                       "NETAPP"
                product                      "LUN"
                getuid_callout               "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout                 "/sbin/mpath_prio_ontap /dev/%n"
                features                     "1 queue_if_no_path"
                hardware_handler             "0"
                path_grouping_policy         group_by_prio
                failback                     immediate
                rr_weight                    uniform
                rr_min_io                    128
                path_checker                 directio
        }
}
[root@lnx199-115 ~]#


Other Observations -

1. max_fds 10000 works fine; I can restart multipathd. But unlimited does not
work.
2. SELinux does not matter. The behavior is the same regardless of the SELinux
setting.
3. We wrote a simple program to set the limit to RLIM_INFINITY using setrlimit().
This also failed with the same error (Operation not permitted). So this seems to
be a generic issue with setting the resource limit, not something specific to
multipathd.
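
The test program itself was not attached. A comparable probe, sketched in Python with the standard resource module, might look like the following; the function name try_unlimited_nofile is hypothetical, and the exact message returned depends on the platform and on the privileges of the calling process.

```python
import resource


def try_unlimited_nofile():
    """Attempt RLIMIT_NOFILE = RLIM_INFINITY; return (succeeded, message)."""
    original = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, unlimited)
    except (ValueError, OSError) as exc:
        # Python reports EPERM/EINVAL from setrlimit() as ValueError.
        return False, str(exc)
    # Put the previous limits back so the probe has no lasting effect.
    resource.setrlimit(resource.RLIMIT_NOFILE, original)
    return True, "limit accepted"


ok, message = try_unlimited_nofile()
print(ok, message)
```

A False result with a "not allowed" style message would match the "Operation not permitted" failure described in observation 3.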
Comment 20 Ben Marzinski 2008-04-22 13:42:13 EDT
Unfortunately, it does seem that unlimited will not work, due to an in-kernel
limit of 1048576 open file descriptors. However, for all practical purposes,
this limit shouldn't be a big problem.  I'll remove the "unlimited" option in
future releases. It might be best to replace it with a "max" option that simply
sets the limit to the in-kernel max.
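
One way such a "max" option could be resolved is sketched below in Python. This only illustrates the proposal and is not the multipath implementation: resolve_max_fds and kernel_max_fds are hypothetical names, and the sketch assumes the kernel exposes its per-process ceiling in /proc/sys/fs/nr_open, a file found on newer Linux kernels that may be absent on the 2.6.18 kernel discussed here. When the file cannot be read, it falls back to the 1048576 figure quoted above.

```python
from pathlib import Path

FALLBACK_NR_OPEN = 1048576  # in-kernel limit quoted in this comment


def kernel_max_fds(path="/proc/sys/fs/nr_open"):
    """Return the kernel's per-process fd ceiling, or the fallback value."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing or unparsable: assume the limit quoted in this bug.
        return FALLBACK_NR_OPEN


def resolve_max_fds(setting):
    """Map a max_fds-style setting ("max" or a number) to an integer."""
    if setting == "max":
        return kernel_max_fds()
    return int(setting)
```

With this shape, a numeric setting such as "10000" passes through unchanged, while "max" never asks for more than the kernel is willing to grant.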
Comment 21 Don Domingo 2008-04-22 20:32:36 EDT
this bug has been tagged for inclusion in the RHEL5.2 release notes. please post
the necessary content for it. thanks!
Comment 22 Andrius Benokraitis 2008-04-24 10:23:57 EDT
Don - I think the release note should say something like:

In /etc/multipath.conf, the 'max_fds unlimited' option should not be used.
Please use a sufficiently high value instead of 'unlimited'. This will be
addressed in a future minor release.
Comment 23 Don Domingo 2008-04-27 19:36:13 EDT
revising as follows:

<quote>
In /etc/multipath.conf, setting max_fds to unlimited will prevent the multipathd
daemon from starting up properly. As such, you should use a sufficiently high
value instead for this setting.
</quote>

please advise if any further revisions are required. thanks!
Comment 24 errata-xmlrpc 2008-05-21 11:35:25 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0337.html
Comment 25 Rajashekhar M A 2008-06-26 07:55:33 EDT
Ben,

In comment #20, you had mentioned that you would be adding an option "max" for
the parameter max_fds. When is this fix expected?
Or is this fix already available in RHEL 4.7 Snapshots?

Thank you,
Raj
Comment 26 Ryan Lerch 2008-08-10 21:25:20 EDT
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes. 

This Release Note is currently located in the Known Issues section.
Comment 27 Ryan Lerch 2008-08-10 21:25:20 EDT
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.
