Bug 478643 - multipath test causes memory leak and eventual system deadlock
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
Version: 5.3
Hardware: ia64 Linux
Priority: urgent  Severity: urgent
Target Milestone: rc
Target Release: 5.4
Assigned To: Ben Marzinski
QA Contact: Red Hat Kernel QE team
Keywords: ZStream
Depends On:
Blocks: 461676 480170 481576 483701 483784 485920
Reported: 2009-01-02 13:38 EST by Jamie Wellnitz
Modified: 2010-03-14 17:30 EDT (History)
25 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
It has been determined that 1024-byte objects in the kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel update. To work around this issue, ensure that the process calling do_pipe has adequate file descriptors allocated. This issue has been observed with multipathd in particular. To avoid the problem with multipathd, calculate the number of file descriptors (FDs) required using the formula: "FDs = Number of Paths + 32". If the result is greater than the default value of 1024, then the max_fds value in the defaults section of multipath.conf should be set to the previously calculated value. For example, if there are 255 LUNs with 8 paths each, the line to be added to the defaults section of multipath.conf would be: max_fds 2072
Story Points: ---
Clone Of:
: 480170
Environment:
Last Closed: 2009-09-02 04:03:32 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
output of multipath -ll (140.81 KB, application/octet-stream)
2009-01-05 17:03 EST, Jamie Wellnitz
multipath.conf file from failing system (4.31 KB, application/octet-stream)
2009-01-07 19:57 EST, Jamie Wellnitz
systemtap script to find leaker (165 bytes, application/octet-stream)
2009-01-09 18:02 EST, Ben Marzinski
Patch to fix the leak during pipe() failure (716 bytes, patch)
2009-01-15 12:03 EST, Ben Marzinski

Description Jamie Wellnitz 2009-01-02 13:38:51 EST
Description of problem: When running dt to a raw partition (no filesystem) on multipath devices, the kernel slab for 1024-byte objects grows indefinitely and causes an out-of-memory condition.  We ran the test without device-mapper installed and that test ran without issue.  This issue is reproducible in a DM environment with Emulex or QLogic Fibre Channel HBAs.  We've reproduced this on ia64 systems only at this point.



Version-Release number of selected component (if applicable):
kernel-2.6.18-125.el5
device-mapper-1.02.28-2.el5
device-mapper-multipath-0.4.7-22.el5
device-mapper-event-1.02.28-2.el5

How reproducible:  System deadlocks roughly 16 hours into the test.


Steps to Reproduce:
1. Run dt with 255 raw LUNs on an HP EVA-5000 with 2 HBAs, each with 4 paths to the target (8 paths total to each LUN)
2. Run for >16 hours
3.
  
Actual results: Memory leak leading to system deadlock.


Expected results: System should not deadlock.


Additional info: This problem does not occur on RHEL 5.2 (kernel 2.6.18-92.el5) in the same test setup.
Comment 2 Ben Marzinski 2009-01-05 15:49:18 EST
Can you please run

# multipath -ll

and place the output in the bugzilla.  Depending on the firmware in the EVA-5000, you may or may not be using a hardware handler.  The product information in the multipath -ll output will show if you are using the new firmware.
Comment 3 Jamie Wellnitz 2009-01-05 17:03:18 EST
Created attachment 328236 [details]
output of multipath -ll
Comment 4 Tom Coughlan 2009-01-06 10:40:51 EST
(In reply to comment #0)

> Version-Release number of selected component (if applicable):
> kernel-2.6.18-125.el5

We removed scsi_dh_alua in kernel-2.6.18-126. 

I can't tell for sure, but I suspect that is relevant in this case. Please check that, and re-test with the latest kernel.
Comment 6 Jamie Wellnitz 2009-01-06 12:18:37 EST
We'll retest with 2.6.18-128.el5
Comment 7 Andrius Benokraitis 2009-01-07 10:54:53 EST
Emulex - Any updates on testing?
Comment 8 laurie barry 2009-01-07 11:01:52 EST
We're working to resolve a kernel panic that we saw while trying to boot from a freshly installed RHEL 5.3 RC1 OS last night. Three attempts failed in the same fashion:

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0).

We're doing everything possible to get this environment back up to verify that RH has indeed fixed this problem in the RC, and it takes ~17 hours to repro.  We'll give you 2x daily status until this completes.
Comment 9 Jamie Wellnitz 2009-01-07 11:19:35 EST
The "new" panic we're hitting is obviously the standard kernel can't find it's root partition - i.e. the system's boot config is broken.  We'll get that fixed and proceed with the testing on 2.6.18-128.el5.
Comment 10 Mike Gahagan 2009-01-07 11:29:41 EST
It sounds like you might be hitting bz 477576, a recently fixed package-ordering bug that caused the initrd to not be created. There are a few workarounds listed in that bug.
Comment 11 Ben Marzinski 2009-01-07 11:46:47 EST
Do you need to wait the entire 17 hours?  Have you tried running slabtop?  It should be able to show you whether your size-1024 slab cache is constantly growing.  If, for some reason, there is a massive memory leak that happens all of a sudden after 16 hours, then this won't help.  But if there is a steady leak over the entire time, and the size-1024 slab cache keeps increasing over the first half hour or so, you can be pretty certain that you will eventually run out of memory.

Another question. You said that you have only reproduced this on ia64 systems thus far.  Does that mean you haven't tried this test on other architectures, or that you have tried it and you couldn't reproduce the problem?
Comment 12 Jamie Wellnitz 2009-01-07 12:21:10 EST
Re: the boot issue - We do see the initrd not being created with RHEL 5.3 RC1.  I'll check bz 477576.  We've got RHEL 5.3 snapshot 6 running now (we'll put -128.el5 on it in a minute).

Re: comment #11 and the 17 hours - we can check the 1024-byte slab for unbounded growth, so no, we should have results sooner (maybe 8 hours instead of 17).

Re: other arches - We did run this test successfully (it did not reproduce the 1024 byte slab leak) on x86_64 with 5.3 snapshot 4 or 5.
Comment 13 Mike Christie 2009-01-07 13:15:22 EST
Can we narrow this down to the kernel or userspace by running the kernel from 5.2 with the userspace tools from 5.3?
Comment 14 Jamie Wellnitz 2009-01-07 13:24:08 EST
We can try - we're just getting the retest on RHEL 5.3 RC1 (-128.el5) running now.  Depending on how that turns out, we'll have to decide what the next test should be.
Comment 15 Mike Christie 2009-01-07 13:50:40 EST
For your testing, are you using the dt that comes with RHEL or some other dt?

For the tests, are you just running dt, or do you create dm devices, run it, destroy the devices, and then repeat?

While the test is running, are there any failovers or path failures occurring?
Comment 16 Mike Christie 2009-01-07 13:51:40 EST
(In reply to comment #14)
> We can try - we're just getting the retest on RHEL 5.3 RC1 (-128.el5) running
> now.  Depending on how that turns out, we'll have to decide what the next test
> should be.

Ok, if you are not creating and destroying devices and there are no events going on, then my suggestion is probably a waste of time.
Comment 17 Jamie Wellnitz 2009-01-07 14:28:23 EST
There is no error injection in this test (no failover, etc.), just data.  The dm devices are created at the start and never destroyed.
Comment 18 Jamie Wellnitz 2009-01-07 16:38:07 EST
It appears that we don't even need to run dt.

I brought up the system with 4 paths to each LUN (instead of 8) and let the system sit for an hour.  The size-1024 slab is about 77MB (72k objects) - it seems to grow at around 1600 objects per minute.
Comment 19 Jamie Wellnitz 2009-01-07 16:43:14 EST
Tom,

(In reply to comment #4)
> (In reply to comment #0)
> > Version-Release number of selected component (if applicable):
> > kernel-2.6.18-125.el5
> We removed scsi_dh_alua in kernel-2.6.18-126. 
> I can't tell for sure, but I suspect that is relevant in this case. Please
> check that, and re-test with the latest kernel.

Just reviewing the comments on this bug.

Were you asking us to check something other than simply retesting with 2.6.18-128.el5?
Comment 20 Tom Coughlan 2009-01-07 17:11:57 EST
(In reply to comment #19)
> Tom,
> 
> (In reply to comment #4)
> > (In reply to comment #0)
> > > Version-Release number of selected component (if applicable):
> > > kernel-2.6.18-125.el5
> > We removed scsi_dh_alua in kernel-2.6.18-126. 
> > I can't tell for sure, but I suspect that is relevant in this case. Please
> > check that, and re-test with the latest kernel.
> 
> Just reviewing the comments on this bug.
> 
> Were you asking us to check something other than simply retesting with
> 2.6.18-128.el5?

An lsmod would show whether the scsi_dh_alua module was in use (that is, a "used by" greater than zero). 

Your current results, with a kernel that no longer has that module, suggest that the module was not the cause. 

The best course of action now would be to backtrack and determine which 5.3 kernel version introduced the problem. Can you try some earlier 5.3 kernels to narrow this down?  (When you do go back to earlier kernels, you should try lsmod, and confirm the scsi_dh_alua use count is zero.) 

Ben tried to reproduce this on an ia64, with different storage, and did not see the problem. We are setting up the hardware here in Westford to try to reproduce this, hopefully tomorrow morning. Can someone at HP set this up and try it?
Comment 21 Mike Christie 2009-01-07 17:22:36 EST
(In reply to comment #20)
> to narrow this down?  (When you do go back to earlier kernels, you should try
> lsmod, and confirm the scsi_dh_alua use count is zero.) 
> 

Just to avoid any confusion. If you do not even see scsi_dh_alua loaded do not worry. It might not even be getting loaded and we do not want it loaded.
Comment 22 Jamie Wellnitz 2009-01-07 18:03:05 EST
Is there a place I can download kernels before 2.6.18-125.el5?  I haven't kept the older RHEL 5.3 snapshots.
Comment 23 Ben Marzinski 2009-01-07 18:15:05 EST
Just to make sure, dt is actually successfully doing IO to the devices until you run out of memory, correct?

Also, are you using the default configuration for your device, or did you manually configure it in /etc/multipath.conf?

Finally, this seems really unlikely, but does anyone think it would be worthwhile to set up linear dm devices on top of these LUNs and run the test on them, just to verify that this is in the multipath code?  I would think that any generic device-mapper memory leaks would have been caught by now.  But you never really know until you try.
Comment 24 Jamie Wellnitz 2009-01-07 19:57:47 EST
Created attachment 328429 [details]
multipath.conf file from failing system

Here's the multipath.conf file that's in use on the failing ia64 system.  This file is from HP to support their arrays.
Comment 25 Ben Marzinski 2009-01-07 20:09:38 EST
Mike had a really good idea.  Can you try stopping multipathd and seeing if the size-1024 slab stops growing? multipathd checks the paths by sending an SG_IO every twenty seconds. If there is a memory leak there, the cache size could keep increasing with nothing but multipathd running.
Comment 26 Jamie Wellnitz 2009-01-07 20:10:25 EST
(In reply to comment #23)
> Just to make sure, dt is actually successfully doing IO to the devices until
> you run out of memory, correct?
> 

dt isn't running at all now (it turns out it's not needed to reproduce the size-1024 leak), but when it was running, yes, it was successfully completing IOs.

I'm also seeing these messages in the syslog at the rate of several per second:

Jan  7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdach 
Jan  7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaci 
Jan  7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdacj 
Jan  7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdajo

If I try running "/sbin/mpath_prio_alua /dev/sdajo" manually it works (and prints 50).  Any other sd devices that I've tried work as well.

This is a big config (running 4 paths to each of 255 LUNs right now).  With the full config (8 paths to each LUN), multipathd has to be started with ulimit -n 2048 to avoid running out of file handles.  That might be related to the "error calling out" messages.
Comment 27 Jamie Wellnitz 2009-01-07 20:40:30 EST
Stopped multipathd and the size-1024 slab stopped growing (size-1024 number of objects and number of slabs from /proc/slabinfo, every 5 minutes):

Wed Jan 7 19:50:57 EST 2009: 157560 10504
Wed Jan 7 19:55:57 EST 2009: 165030 11002
Wed Jan 7 20:00:57 EST 2009: 172695 11513
Wed Jan 7 20:05:57 EST 2009: 180180 12012
Wed Jan 7 20:10:57 EST 2009: 187350 12490
Wed Jan 7 20:15:57 EST 2009: 187350 12490
Wed Jan 7 20:20:57 EST 2009: 187340 12490
Wed Jan 7 20:25:57 EST 2009: 187340 12490
Wed Jan 7 20:30:57 EST 2009: 187350 12490
Wed Jan 7 20:35:57 EST 2009: 187350 12490
Comment 28 Tom Coughlan 2009-01-08 09:13:50 EST
(In reply to comment #22)
> Is there a place I can download kernels before 2.6.18-125.el5?  I haven't kept
> the older RHEL 5.3 snapshots.

Start with these. I can upload some more as needed. 

http://people.redhat.com/coughlan/el5/

kernel-2.6.18-92.el5.ia64.rpm   5.2 kernel
kernel-2.6.18-118.el5.ia64.rpm  5.3 alpha
kernel-2.6.18-120.el5.ia64.rpm  5.3 beta

Thanks for doing this.
Comment 29 Jamie Wellnitz 2009-01-08 10:59:00 EST
(In reply to comment #28)
> kernel-2.6.18-92.el5.ia64.rpm   5.2 kernel
> kernel-2.6.18-118.el5.ia64.rpm  5.3 alpha
> kernel-2.6.18-120.el5.ia64.rpm  5.3 beta

Thanks.

-92.el5 does NOT leak size-1024 objects
-120.el5 DOES appear to leak size-1024 objects

I'm trying -118.el5 now.
Comment 30 Mike Christie 2009-01-08 11:15:44 EST
Tom, could you put up 115? This is the kernel before all the scsi changes were done.
Comment 31 Ben Marzinski 2009-01-08 11:40:29 EST
Comment #27 narrows the memory leak location way down. I'd like to try narrowing it down further.  Can you please edit /etc/multipath.conf, and in the device section for the EVA-5000 (product  "HSV1[01]1|HSV2[01]0|HSV300"),  change

path_checker   tur

to

path_checker   readsector0

Once you change this, you need to start up or restart multipathd. This path checker also sends a SG_IO ioctl to the devices, but instead of doing a Test Unit Ready, it reads from sector 0 of the device.  This should help to narrow down where in the SG_IO ioctl handling code the leak is.

Thanks.
Comment 32 Tom Coughlan 2009-01-08 13:09:12 EST
(In reply to comment #30)
> Tom, could you put up 115? This is the kernel before all the scsi changes were
> done.

I uploaded -113. (I am sticking with kernels that went to QA here, to avoid any that may just not work.)

It would be best to try Ben's path_checker test first, then, if -118 fails, try -113.
Comment 33 Jamie Wellnitz 2009-01-08 15:45:01 EST
Running 2.6.18-128.el5 with path_checker set to readsector0 in /etc/multipath.conf, it still leaks size-1024 objects.

The results with 2.6.18-118.el5 were inconclusive, but I may have been impatient getting the test running, so I'll try that one again.
Comment 37 Jamie Wellnitz 2009-01-08 16:43:00 EST
2.6.18-118.el5 leaks, too.  Trying -113 next ...
Comment 38 Suzanne Yeghiayan 2009-01-08 18:56:15 EST
RC2 has been posted to the partner list.
Please retest with full iso.
Comment 40 Jamie Wellnitz 2009-01-09 10:32:55 EST
2.6.18-113.el5 does NOT appear to exhibit the size-1024 leak.  I asked our QA guy to start the original full test to verify this lack-of-a-leak.  Hopefully we'll have an update on that by the end of today.
Comment 42 Jamie Wellnitz 2009-01-09 15:19:55 EST
We've run the original dt with multipath (8 paths x 255 LUNs) with 2.6.18-113.el5 for 5 hours without the size-1024 growth in slabtop.
Comment 43 Ben Marzinski 2009-01-09 15:23:28 EST
Just to be extra sure about this, it would be nice to just make sure that it is the SG_IO call that is doing this. In the device section for the EVA 5000, could you change

prio_callout   "/sbin/mpath_prio_alua /dev/%n"

to

prio_callout   "none"

and change

path_checker   tur

to

path_checker   directio

And restart multipathd.  With these changes, multipathd shouldn't send any SG_IO ioctls to your devices, and you shouldn't see any leak, even on the recent kernels.  Actually, if you could do those changes one at a time, and see if the leak stops with 

prio_callout    "none"
path_checker    "tur"

That would be helpful too.
Comment 44 Jamie Wellnitz 2009-01-09 15:45:36 EST
Okay, I'll give those a try on Monday.  (Our lab is going offline soon for some  weekend maintenance.)
Comment 45 Tom Coughlan 2009-01-09 17:17:38 EST
I have uploaded the next step in the bisection, kernel -116. Try this when the time is right (i.e., there are no other bright ideas...). 

Barry tried to reproduce this here on an ia64 with a multipath connection to an EVA. So far, no memory leak is visible. He'll let it run over the weekend. 

Jamie, it would be a good idea for you to run sosreport and post the result, so we can look at the differences between our systems.
Comment 46 Ben Marzinski 2009-01-09 18:02:45 EST
Created attachment 328603 [details]
systemtap script to find leaker

This systemtap script should print a backtrace for the process that grows the size-1024 slab.  Unfortunately, this won't necessarily be the leaker.  It will simply be whatever process uses up the last object in the cache.  However, since the leaks come in big bursts when multipathd is checking the paths, given enough rounds, the leaker should end up taking the last object and growing the cache.

To setup and use this script you need the following packages:
crash
systemtap-runtime
systemtap
kernel-debuginfo* 
kernel-debuginfo-common*
kernel-devel*

* For your running kernel (which should obviously be one that leaks).

Once you have all of those packages, the first thing you need to do is edit the script. Start up crash by running (if you are using the -125.el5 kernel):

# crash /usr/lib/debug/lib/modules/2.6.18-125.el5/vmlinux

At the crash prompt, run:

crash> kmem -s size-1024

You should see something like:

CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
e0000040ffcd8480 size-1024               1024        374       405     27    16k

Take the number under CACHE (0xe0000040ffcd8480), and copy it into the "if" statement in the systemtap script, cache_check.stp

So, with the above number, the "if" statement in my script would look like

        if ($cachep == 0xe0000040ffcd8480) {

This will make the script only print backtraces for the size-1024 slab. Then you can start the script with

# stap cache_check.stp

and start up multipathd. Hopefully, the script will start printing process info and backtraces whenever a process grows the slab.  It may simply crash your kernel; systemtap is finicky like that.  But it works for me, and it should tell us what allocation isn't being freed.
Comment 47 Ben Marzinski 2009-01-09 18:07:21 EST
Bah! Bugzilla wrapped the crash output.

It should be:

CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS
e0000040ffcd8480 size-1024               1024        374       405     27


With:

SSIZE
16k

stuck on at the end of the line. At any rate, the only thing that is important is the number under CACHE.
Comment 48 Jamie Wellnitz 2009-01-13 10:02:49 EST
Sorry for the delay - our weekend update lasted well into Monday.

The stap script periodically prints out the same stack trace:

multipathd(12588) growing size-1024 cache
 0xa00000010015e5c0 : cache_grow+0x0/0xa20 [kernel]
 0xa00000010015f790 : cache_alloc_refill+0x3d0/0x4c0 [kernel]
 0xa000000100160160 : kmem_cache_zalloc+0x120/0x1c0 [kernel]
 0xa000000100190830 : alloc_pipe_info+0x30/0xc0 [kernel]
 0xa000000100190960 : create_write_pipe+0xa0/0x3e0 [kernel]
 0xa000000100192240 : do_pipe+0x20/0x120 [kernel]
 0xa000000100035fb0 : sys_pipe+0x30/0x80 [kernel]
 0xa00000010000bdd0 : __ia64_trace_syscall+0xd0/0x140 [kernel]
 0xa000000000010620
multipathd(12588) growing size-1024 cache
 0xa00000010015e5c0 : cache_grow+0x0/0xa20 [kernel]
 0xa00000010015f790 : cache_alloc_refill+0x3d0/0x4c0 [kernel]
 0xa000000100160160 : kmem_cache_zalloc+0x120/0x1c0 [kernel]
 0xa000000100190830 : alloc_pipe_info+0x30/0xc0 [kernel]
 0xa000000100190960 : create_write_pipe+0xa0/0x3e0 [kernel]
 0xa000000100192240 : do_pipe+0x20/0x120 [kernel]
 0xa000000100035fb0 : sys_pipe+0x30/0x80 [kernel]
 0xa00000010000bdd0 : __ia64_trace_syscall+0xd0/0x140 [kernel]
 0xa000000000010620

so it looks like something in sys_pipe().
Comment 49 Jamie Wellnitz 2009-01-13 10:07:16 EST
(In reply to comment #43)
> prio_callout   "/sbin/mpath_prio_alua /dev/%n"
> 
> to
> 
> prio_callout   "none"

This change stops the leak.  I haven't had a chance to check all permutations.
Comment 50 Ben Marzinski 2009-01-13 10:54:55 EST
Well, that would certainly explain why we couldn't find the leak in the scsi ioctl handling code :)

There's no need to check the other permutations.  Multipathd doesn't call pipe when it's running the path checkers, just the priority callouts.
Comment 51 Ben Marzinski 2009-01-13 11:43:10 EST
Well, there's definitely a bug in multipathd. After it creates the pipe, if the fork() call fails, multipathd doesn't clean up the pipe.  I don't know what could have changed in the kernel that would cause the fork() call to fail on newer kernels. Also, if this were the case, you should be bumping against the open file descriptor limit before you run out of memory, unless ulimit -n really is set much higher than 2048.  Another option is that the pipe() call itself is failing but not cleaning up after itself. However, after a quick check, it doesn't look like the sys_pipe() call path has changed between -113 and -118.  The last option is that the pipe isn't being cleaned up correctly when the mpath_prio_alua program exits.  Multipathd is correctly closing down both ends of its pipe (if the fork call works correctly), but exit() should be cleaning up the pipe for mpath_prio_alua.

If this is the first problem, then killing multipathd should give you all of your memory back.

I'm going to make a new multipathd package that correctly closes the pipe when fork() fails, and also adds more error messages so that we know exactly where multipathd is failing in the prio callout function.
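
For illustration, here is a minimal C sketch of the fork-failure cleanup described above (this is not the multipath-tools source; the helper name and messages are made up):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Run a callout helper with its stdout captured through a pipe.
 * The point of the sketch: if fork() fails, BOTH pipe fds must be
 * closed, otherwise each failed callout leaks a descriptor pair
 * (and the kernel-side pipe object) for the life of the daemon. */
static int run_callout(const char *path, const char *arg)
{
    int fds[2];
    pid_t pid;

    if (pipe(fds) < 0) {
        fprintf(stderr, "error creating pipe for callout: %s\n",
                strerror(errno));
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        /* the fix: close both ends before bailing out */
        close(fds[0]);
        close(fds[1]);
        fprintf(stderr, "error forking for callout: %s\n",
                strerror(errno));
        return -1;
    }

    if (pid == 0) {
        /* child: send stdout into the pipe and exec the helper */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execl(path, path, arg, (char *)NULL);
        _exit(255);
    }

    /* parent: close the write end, read the helper's output from
     * fds[0] here, then reap the child and close the read end */
    close(fds[1]);
    waitpid(pid, NULL, 0);
    close(fds[0]);
    return 0;
}

int main(void)
{
    /* /bin/true stands in for a real callout such as mpath_prio_alua */
    return run_callout("/bin/true", "ignored") < 0;
}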

Another question. On the kernels where there is no memory leak, are you still seeing the: 

Jan  7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua
/dev/sdach

messages.
Comment 52 Ben Marzinski 2009-01-13 15:54:14 EST
You can download the new multipath test packages at:

http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/device-mapper-multipath-0.4.7-24.el5.ia64.rpm
http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/device-mapper-multipath-debuginfo-0.4.7-24.el5.ia64.rpm
http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/kpartx-0.4.7-24.el5.ia64.rpm 

These will print out some more verbose error messages, and correctly close the pipe fds if fork fails.

Let me know what messages get printed out to /var/log/messages by multipathd with these packages installed.
Comment 53 Jamie Wellnitz 2009-01-14 09:19:45 EST
We ran this version of multipathd (from the packages in Comment #52).  It still seems to leak objects from the size-1024 slab.  I see a lot of items in /var/log/messages like:

Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaab 
Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaac 
Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaad 
Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaae 
Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaaf 

And one section of this:

Jan 13 17:39:33 dhcp-101-132 multipathd: /sbin/mpath_prio_alua exitted with 255 
Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdblq 
Jan 13 17:39:33 dhcp-101-132 multipathd: /sbin/scsi_id exitted with 255 
Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdblq 
Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdblr 
Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdblr 
Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdbls 
Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files 
Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdbls
Comment 54 Alasdair Kergon 2009-01-14 09:46:36 EST
I don't know if we have this already, but obviously it would be good to see full ps and lsof output corresponding to that situation.
Comment 55 Ben Marzinski 2009-01-14 10:53:44 EST
Here's a quick guess.  This looks like the second possibility listed above.  The sys_pipe call is failing and not cleaning itself up.  It appears that it may have allocated the memory for the pipe but then could not find a free file descriptor. Try increasing your ulimit -n to something higher, like ulimit -n 4096, and see if this fixes it. I'll go check and see if I can find the leak in the kernel code.  I'll also go look through multipathd and figure out the maximum number of file descriptors you could possibly need at one time for a given number of multipath devices and paths.
Comment 56 Ben Marzinski 2009-01-14 12:12:56 EST
Also, for anyone trying to reproduce this: In the defaults section of multipath.conf, try setting max_fds equal to exactly the number of paths that you have.

For example if you have 32 LUNs with 4 paths each, use

defaults {
    ...
    max_fds  128
}
Comment 57 Ben Marzinski 2009-01-14 13:12:26 EST
Pretty sure I found it!

Comment #55 was pretty much right on.  do_pipe() allocs the pipe files, and then tries to get free file descriptors for them.  When the process is out of file descriptors, it fails.  When the read pipe gets freed, it doesn't do a dput on the dentry.  This means that there is still a reference to the pipe dentry, and the inode never gets cleaned up, including the pipe_inode_info, which is what the systemtap script was pointing to.

I checked upstream and this is fixed there.  The odd thing is, this isn't fixed in the earlier kernels, like -92.el5 and -113.el5.  So why you can't see it there is a mystery.  But increasing the ulimit -n until you don't see those "Too many open files" messages should work around the problem.

I'll go hunt down the upstream kernel commit, and we'll get you a patched kernel to prove that this does actually fix the issue.
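
For anyone who wants to exercise the failing path from user space, a minimal sketch (my own illustration, not a reproducer attached to this bug) is to pin the process's fd limit low, burn the remaining descriptors, and then call pipe() in a loop so that every call fails with EMFILE. Based on the analysis above, each failed call on an affected kernel should strand one size-1024 object (watch the size-1024 line in /proc/slabinfo while it runs), while on a fixed kernel the cache should stay flat:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl = { 64, 64 };
    int fds[2];
    long i, failed = 0;

    /* keep the fd table small so pipe() runs out of descriptors */
    if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
        perror("setrlimit");
        return 1;
    }

    /* burn the remaining descriptors */
    while (open("/dev/null", O_RDONLY) >= 0)
        ;

    /* every pipe() call from here on should fail with EMFILE; on a
     * kernel with the do_pipe() error-path bug, each failure is
     * expected to leave one size-1024 object behind */
    for (i = 0; i < 100000; i++) {
        if (pipe(fds) < 0 && errno == EMFILE)
            failed++;
    }

    fprintf(stderr, "%ld pipe() calls failed with EMFILE\n", failed);
    return 0;
}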
Comment 58 Ben Marzinski 2009-01-14 23:59:54 EST
Looking through the multipathd code, and just counting the number of fds that are always open, plus the maximum number of fds that could be used by each thread at any time, without considering whether it would actually be possible to use them all at the same time, I get:


18 fds + 1 per path + 1 per interactive multipathd client.

Since only one thread services the multipathd interactive clients (what you get by running multipathd -k), there's no really good reason to allow multiple clients to connect at once.  Just allowing one client and making SIGHUP force the connection to close seems like it would work just fine. It seems like a good rule of thumb would be

32 fds + 1 per path

could you try this and make sure that it doesn't get any of the "Too many open files" messages.
Comment 59 Tom Coughlan 2009-01-15 08:34:17 EST
A kernel with a proposed fix for the leak is available at:

http://people.redhat.com/coughlan/el5/bz478643/

Maybe after you run the test in comment 58, you can try this kernel with all default settings, and the stock 5.3 dm-multipath, to be sure we nailed it. After that try a run with the updated dm-multipath in comment 52. That right Ben?

Tom
Comment 60 Jamie Wellnitz 2009-01-15 09:30:08 EST
(In reply to comment #58)
> Just allowing one client and making SIGHUP force
> the connection to close seems like it would work just fine. It seems like a
> good rule of thumb would be
> 
> 32 fds + 1 per path
> 
> could you try this and make sure that it doesn't get any of the "Too many open
> files" messages.

This setup has 8 paths to each of 255 LUNs, so I'll try:
32 fds + 8 * 255 = 2072 for the open files limit.

Then I'll try Tom's kernel from Comment #59.  We should have results by 1 pm or so.
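
Spelling that calculation out in the same form as the example in comment 56, the workaround for this 255-LUN, 8-path configuration would be the following entry in /etc/multipath.conf:

defaults {
    ...
    max_fds  2072
}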
Comment 61 Tom Coughlan 2009-01-15 10:26:12 EST
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
We need a 5.3 release note for this. Here is a draft. Ben, please review and improve as needed.

It has been determined that 1024 byte objects in kernel slab may be lost when a call to do_pipe() fails. The problem occurs because do_pipe() allocates pipe files, and then tries to get free file descriptors for them.  If the process is out of file descriptors, do_pipe fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel. 

A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors. 

This problem has been observed with multipathd in particular. To avoid the problem with multipahtd, you should set max_fds in the defaults section of multipath.conf to 

32 fds + 1 fd per path

For example if you have 32 LUNs with 4 paths each, use

defaults {
    ...
    max_fds  160
}
Comment 62 Ben Marzinski 2009-01-15 12:03:49 EST
Created attachment 329108 [details]
Patch to fix the leak during pipe() failure

This fix is a port of:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ccac88eeb5659c716af8e695e2943509c80d172
Comment 63 Ben Marzinski 2009-01-15 12:20:49 EST
Just a clarification to Comment #59.  When you try to test the workaround, you should be running with the multipath packages from Comment #52.  These are necessary to print out the useful error messages when the callout program fails.
Comment 64 Tom Coughlan 2009-01-15 14:20:04 EST
Ben updated the release note as follows: 

I changed do_pipe() to pipe() since I assume our customers don't care
what internal kernel function is failing.  They just want to know what system
call could cause them problems.  I also changed the instructions so that users
know that they only need to change max_fds if it needs to be above 1024.  The
vast majority of users don't need to do anything.
Comment 65 Tom Coughlan 2009-01-15 14:20:04 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,6 +1,6 @@
 We need a 5.3 release note for this. Here is a draft. Ben, please review and improve as needed.
 
-It has been determined that 1024 byte objects in kernel slab may be lost when a call to do_pipe() fails. The problem occurs because do_pipe() allocates pipe files, and then tries to get free file descriptors for them.  If the process is out of file descriptors, do_pipe fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel. 
+It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them.  If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel. 
 
 A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors. 
 
@@ -8,9 +8,9 @@
 
 32 fds + 1 fd per path
 
-For example if you have 32 LUNs with 4 paths each, use
+if this number is greater than the default of 1024. For example if you have 255 LUNs with 8 paths each, use
 
 defaults {
     ...
-    max_fds  160
+    max_fds  2072
 }
Comment 67 Suzanne Yeghiayan 2009-01-15 14:26:03 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,10 +1,8 @@
-We need a 5.3 release note for this. Here is a draft. Ben, please review and improve as needed.
-
 It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them.  If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel. 
 
 A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors. 
 
-This problem has been observed with multipathd in particular. To avoid the problem with multipahtd, you should set max_fds in the defaults section of multipath.conf to 
+This problem has been observed with multipathd in particular. To avoid the problem with multipathd, you should set max_fds in the defaults section of multipath.conf to 
 
 32 fds + 1 fd per path
Comment 68 Ryan Lerch 2009-01-15 17:53:00 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,14 +1,9 @@
-It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them.  If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel. 
+It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them.  If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel update. 
 
-A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors. 
+To workaround this issue, ensure that the process calling do_pipe has adequate file descriptors allocated.
 
-This problem has been observed with multipathd in particular. To avoid the problem with multipathd, you should set max_fds in the defaults section of multipath.conf to 
+This issue has been observed with multipathd in particular. To avoid the problem with multipathd, calculate the number of file descriptors (FDs) required using the formula: "FDs = Number of Paths + 32". If the result is greater than the default value of 1024, then the max_fds value in the defaults section of multipath.conf should be set to the previously calculated value.
 
-32 fds + 1 fd per path
+For example, if there are 255 LUNs with 8 paths each, the line to be added to the defaults section of multipath.conf would be:
-
+ 
-if this number is greater than the default of 1024. For example if you have 255 LUNs with 8 paths each, use
+max_fds  2072-
-defaults {
-    ...
-    max_fds  2072
-}
Comment 69 Jamie Wellnitz 2009-01-15 21:31:54 EST
(In reply to comment #63)
> Just a clarification to Comment #59.  When you try to test the workaround, you
> should be running with the multipath packages from Comment #52.  These are
> necessary to print out the useful error messages when the callout program
> fails.

With the modified multipath packages from Comment #52 and max_fds = 2072, we don't see the size-1024 slab growing, so that config option does seem to work around the leak issue.

I don't see any "Too many open files" messages in /var/log/messages, either.

(Sorry for the slow updates - unfortunately, we're still having trouble with our network in the lab where this machine & array are located.)
Comment 70 Jamie Wellnitz 2009-01-19 15:37:25 EST
(In reply to comment #59)
> A kernel with a proposed fix for the leak is available at:
> 
> http://people.redhat.com/coughlan/el5/bz478643/
> 
> Maybe after you run the test in comment 58, you can try this kernel with all
> default settings, and the stock 5.3 dm-multipath, to be sure we nailed it.
> After that try a run with the updated dm-multipath in comment 52. That right
> Ben?
> 
> Tom

We tried this kernel, 2.6.18-128.el5.bz478643, with the stock device-mapper-multipath package and max_fds set to 2048 (high enough for multipathd to start up but not high enough to avoid the "error calling out /sbin/mpath_prio_alua" messages).

This got frequent "error calling out" messages, but the size-1024 slab didn't grow.  So I think that kernel fix works.
Comment 73 Don Zickus 2009-02-02 14:48:01 EST
in kernel-2.6.18-130.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 74 RHEL Product and Program Management 2009-02-16 10:17:33 EST
Updating PM score.
Comment 77 Jan Tluka 2009-07-21 08:42:10 EDT
Flipping back to ON_QA, since I saw that the test /kernel/errata/5.3.z/481576 failed. The machine is nec-tx7-1.rhts.bos.redhat.com. RHTS logs can be found here: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=9138721

I'm not sure if this is reproducible, as there were no issues seen in previous test runs. I launched 4 runs of this test on the same machine to see if it occurs again. RHTS job: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=74878.
Comment 78 Jan Tluka 2009-07-22 12:27:37 EDT
(In reply to comment #77)
> Flipping back to ON_QA, since I saw that the test /kernel/errata/5.3.z/481576
> failed. The machine is nec-tx7-1.rhts.bos.redhat.com. RHTS logs can be found
> here: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=9138721
> 
> I'm not sure if this is reproducible, as there were no issues seen in previous
> test runs. I launched 4 runs of this test on the same machine to see if it
> occurs again. RHTS job: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=74878.

Okay, there were two failures of this test claiming that there are still leak problems, so I'd like to check with dev whether these test failures are real.
Comment 85 errata-xmlrpc 2009-09-02 04:03:32 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
