Bug 478643
| Summary: | multipath test causes memory leak and eventual system deadlock | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Jamie Wellnitz <jamie.wellnitz> | ||||||||||
| Component: | kernel | Assignee: | Ben Marzinski <bmarzins> | ||||||||||
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||||||||
| Severity: | urgent | Docs Contact: | |||||||||||
| Priority: | urgent | ||||||||||||
| Version: | 5.3 | CC: | agk, andriusb, bdonahue, benl, bino.sebastian, bmarzins, coughlan, dchapman, dhoward, dmair, dwysocha, dzickus, emcnabb, jpirko, jplans, jtluka, laurie.barry, mbroz, mchristi, phinchman, rick.hester, senthil-kumar.veluswamy, syeghiay, vaios.papadimitriou, vijayakumar | ||||||||||
| Target Milestone: | rc | Keywords: | ZStream | ||||||||||
| Target Release: | 5.4 | ||||||||||||
| Hardware: | ia64 | ||||||||||||
| OS: | Linux | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
| Doc Text: |
It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel update.
To workaround this issue, ensure that the process calling do_pipe has adequate file descriptors allocated.
This issue has been observed with multipathd in particular. To avoid the problem with multipathd, calculate the number of file descriptors (FDs) required using the formula: "FDs = Number of Paths + 32". If the result is greater than the default value of 1024, then the max_fds value in the defaults section of multipath.conf should be set to the previously calculated value.
For example, if there are 255 LUNs with 8 paths each, the line to be added to the defaults section of multipath.conf would be:
max_fds 2072
|
Story Points: | --- | ||||||||||
| Clone Of: | |||||||||||||
| : | 480170 (view as bug list) | Environment: | |||||||||||
| Last Closed: | 2009-09-02 08:03:32 UTC | Type: | --- | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Embargoed: | |||||||||||||
| Bug Depends On: | |||||||||||||
| Bug Blocks: | 461676, 480170, 481576, 483701, 483784, 485920 | ||||||||||||
| Attachments: |
|
||||||||||||
|
Description
Jamie Wellnitz
2009-01-02 18:38:51 UTC
Can you please run # multipath -ll and place the output in the bugzilla. Depending on the firmware in the EVA-5000, you may or may not be using a hardware handler. The product information in the multipath -ll output will show if you are using the new firmware. Created attachment 328236 [details]
output of multipath -ll
(In reply to comment #0) > Version-Release number of selected component (if applicable): > kernel-2.6.18-125.el5 We removed scsi_dh_alua in kernel-2.6.18-126. I can't tell for sure, but I suspect that is relevant in this case. Please check that, and re-test with the latest kernel. We'll retest with 2.6.18-128.el5 Emulex - Any updates on testing? We're working to resolve a kernel panic that we saw while trying to boot from a freshly installed RHEL5.3 Rc1 OS last night. Three attempts failed in the same fashion: Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0). We're doing everything possible to get this environment back up to verify RH has indeed fixed this problem in RC and it takes ~17 hours to repro. We'll give you 2x daily status until this completes. The "new" panic we're hitting is obviously the standard kernel can't find it's root partition - i.e. the system's boot config is broken. We'll get that fixed and proceed with the testing on 2.6.18-128.el5. It sounds like you might be hitting bz 477576 which was a recently fixed package ordering bug that would cause the initrd to not be created. There are a few workarounds listed in that bug. Do you need to wait the entire 17 hours? Have you tried running slabtop? It should be able to show you if your size-1024 slab cache is constantly growing. If for some reason, there is a massive memory leak that happens all of a sudden after 16 hours, then this won't help. But if there is a steady leak over the entire time, you should be able to be pretty certain that you are going to eventually run out of memory if the size-1024 slab cache size keeps increasing over the first half hour or so. Another question. You said that you have only reproduced this on ia64 systems thus far. Does that mean you haven't tried this test on other architectures, or that you have tried it and you couldn't reproduce the problem? Re: the boot issue - We do see the initrd not being created with RHEL 5.3 RC1. I'll check bz 477576. We've got RHEL 5.3 snapshot 6 running now (we'll put -128.el5 on it in a minute). Re: comment #11 the 17 hours - we can check the 1024 byte slab for unbound growth, so no, we should have results sooner (maybe 8 hours instead of 17). Re: other arches - We did run this test successfully (it did not reproduce the 1024 byte slab leak) on x86_64 with 5.3 snapshot 4 or 5. Can we narrow this down to the kernel or userspace by running the kernel from 5.2 with the userspace tools from 5.3? We can try - we're just getting the retest on RHEL 5.3 RC1 (-128.el5) running now. Depending on how that turns out, we'll have to decide what the next test should be. For your testing, are you using the dt that comes with RHEL or some other dt? For the tests are you just running dt, or do you create dm devices, run it, destroy devices then repeat. While the test is running is there any failovers or path failures occurring? (In reply to comment #14) > We can try - we're just getting the retest on RHEL 5.3 RC1 (-128.el5) running > now. Depending on how that turns out, we'll have to decide what the next test > should be. Ok, if you are not creating and destroying devices and there is no events going on then my suggestion is probably a waste of time. There is no error-injection in this test (no failover, etc.) just data. The dm devices are created at the start and never destroyed. It appears that we don't even need to run dt. I brought up the system with 4 paths to each LUN (instead of 8) and let the system sit for an hour. The size-1024 slab is about 77MB (72k objects) - it seems to grow at around 1600 objects per minute. Tom, (In reply to comment #4) > (In reply to comment #0) > > Version-Release number of selected component (if applicable): > > kernel-2.6.18-125.el5 > We removed scsi_dh_alua in kernel-2.6.18-126. > I can't tell for sure, but I suspect that is relevant in this case. Please > check that, and re-test with the latest kernel. Just reviewing the comments on this bug. Were you asking us to check something other than simply retesting with 2.6.18-128.el5? (In reply to comment #19) > Tom, > > (In reply to comment #4) > > (In reply to comment #0) > > > Version-Release number of selected component (if applicable): > > > kernel-2.6.18-125.el5 > > We removed scsi_dh_alua in kernel-2.6.18-126. > > I can't tell for sure, but I suspect that is relevant in this case. Please > > check that, and re-test with the latest kernel. > > Just reviewing the comments on this bug. > > Were you asking us to check something other than simply retesting with > 2.6.18-128.el5? An lsmod would show whether the scsi_dh_alua module was in use (that is, a "used by" greater than zero). Your current results, with a kernel that no longer has that module, suggests that the module was not the cause. The best course of action now would be to backtrack and determine which 5.3 kernel version introduced the problem. Can you try some earlier 5.3 kernels to to narrow this down? (When you do go back to earlier kernels, you should try lsmod, and confirm the scsi_dh_alua use count is zero.) Ben tried to reproduce this on an ia64, with different storage, and did not see the problem. We are setting up the hardware here in Westford to try to reproduce this, hopefully tomorrow morning. Can someone at HP set this up and try it? (In reply to comment #20) > to narrow this down? (When you do go back to earlier kernels, you should try > lsmod, and confirm the scsi_dh_alua use count is zero.) > Just to avoid any confusion. If you do not even see scsi_dh_alua loaded do not worry. It might not even be getting loaded and we do not want it loaded. Is there a place I can download kernels before 2.6.18-125.el5? I haven't kept the older RHEL 5.3 snapshots. Just to make sure, dt is actually successfully doing IO to the devices until you run out of memory, correct? Also, are you using the default configuration for your device, or did you manually configure it in /etc/multipath.conf? Finally, this seems really unlikely, but does anyone think it would be worthwhile to set up linear dm devices on top off these LUNs and run the test on them, just to verify that this is in the multipath code. I would think that any gerneric device-mapper memory leaks would have been caught by now. But you never really know until you try. Created attachment 328429 [details]
multipath.conf file from failing system
Here's the multipath.conf file that's in use on the failing ia64 system. This file is from HP to support their arrays.
Mike had a really good idea. Can you try stopping multipathd and seeing if the size-1024 slab stops growing. multipathd checks the paths using by sending a SG_IO every twenty seconds. If there is a memory leak there, the cache size could keep increasing without anything but multipath running. (In reply to comment #23) > Just to make sure, dt is actually successfully doing IO to the devices until > you run out of memory, correct? > dt isn't running at all now (it turns out it's not needed to reproduce the size-1024 leak), but when it was running, yes, it was successfully completing IOs. I'm also seeing these messages in the syslog at the rate of several per second: Jan 7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdach Jan 7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaci Jan 7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdacj Jan 7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdajo If I try running "/sbin/mpath_prio_alua /dev/sdajo" manually it works (and prints 50). Any other sd devices that I've tried work as well. This is a big config (running 4 paths to each of 255 LUNs right now). With the full config (8 paths to each LUN), multipathd has to be started with ulimit -n 2048 to avoid running out of file handles. That might be related to the "error calling out" messages. Stopped multipathd and the size-1024 slab stopped growing (size-1024 number of objects and number of slabs from /proc/slabinfo, every 5 minutes): Wed Jan 7 19:50:57 EST 2009: 157560 10504 Wed Jan 7 19:55:57 EST 2009: 165030 11002 Wed Jan 7 20:00:57 EST 2009: 172695 11513 Wed Jan 7 20:05:57 EST 2009: 180180 12012 Wed Jan 7 20:10:57 EST 2009: 187350 12490 Wed Jan 7 20:15:57 EST 2009: 187350 12490 Wed Jan 7 20:20:57 EST 2009: 187340 12490 Wed Jan 7 20:25:57 EST 2009: 187340 12490 Wed Jan 7 20:30:57 EST 2009: 187350 12490 Wed Jan 7 20:35:57 EST 2009: 187350 12490 (In reply to comment #22) > Is there a place I can download kernels before 2.6.18-125.el5? I haven't kept > the older RHEL 5.3 snapshots. Start with these. I can upload some more as needed. http://people.redhat.com/coughlan/el5/ kernel-2.6.18-92.el5.ia64.rpm 5.2 kernel kernel-2.6.18-118.el5.ia64.rpm 5.3 alpha kernel-2.6.18-120.el5.ia64.rpm 5.3 beta Thanks for doing this. (In reply to comment #28) > kernel-2.6.18-92.el5.ia64.rpm 5.2 kernel > kernel-2.6.18-118.el5.ia64.rpm 5.3 alpha > kernel-2.6.18-120.el5.ia64.rpm 5.3 beta Thanks. -92.el5 does NOT leak size-1024 objects -120.el5 DOES appear to leak size-1024 objects I'm trying -118.el5 now. Tom, could you put up 115? This is the kernel before all the scsi changes were done. Comment #27 narrows the memory leak location way down. I'd like to try narrowing it down further. Can you please edit /etc/multipath.conf, and in the device section for the EVA-5000 (product "HSV1[01]1|HSV2[01]0|HSV300"), change path_checker tur to path_checker readsector0 Once you change this, you need to start up or restart multipathd. This path checker also sends a SG_IO ioctl to the devices, but instead of doing a Test Unit Ready, it reads from sector 0 of the device. This should help to narrow down where in the SG_IO ioctl handling code the leak is. Thanks. (In reply to comment #30) > Tom, could you put up 115? This is the kernel before all the scsi changes were > done. I uploaded -113. (I am sticking with kernels that went to QA here, to avoid any that may just not work.) It would be best to try Ben's path_checker test first, then, if -118 fails, try -113. Running 2.6.18-128.el5 and pathchecker set to readsector0 in /etc/multipath.conf, it still leaks size-1024 objects. The results with 2.6.18-118.el5 were inconclusive, but I may have been impatient getting the test running, so I'll try that one again. 2.6.18-118.el5 leaks, too. Trying -113 next ... RC2 has been posted to the partner list. Please retest with full iso. 2.6.18-113.el5 does NOT appear to exhibit the size-1024 leak. I asked our QA guy to start the original full test to verify this lack-of-a-leak. Hopefully we'll have an update on that by the end of today. We've run the original dt with multipath (8 paths x 255 LUNs) with 2.6.18-113.el5 for 5 hours without the size-1024 growth in slabtop. Just to be extra sure about this, it would be nice to just make sure that it is the SG_IO call that is doing this. In the device section for the EVA 5000, could you change prio_callout "/sbin/mpath_prio_alua /dev/%n" to prio_callout "none" and change path_checker tur to path_checker directio And restart multipathd. With these changes, multipathd shouldn't send any SG_IO ioctls to your devices, and you shouldn't see any leak, even on the recent kernels. Actually, if you could do those changes one at a time, and see if the leak stops with prio_callout "none" path_checker "tur" That would be helpful too. Okay, I'll give those a try on Monday. (Our lab is going offline soon for some weekend maintenance.) I have uploaded the next step in the bisection, kernel -116. Try this when the time is right (i.e., there are no any other bright ideas...). Barry tried to reproduce this here on an ia64 with a multipath connection to an EVA. So far, no memory leak is visible. He'll let run over the weekend. Jamie, it would be a good idea for you to run sosreport and post the result, so we can at look at the differences between our systems. Created attachment 328603 [details]
systemtap script to find leaker
This systemtap script should print a backtrace for the process that grows the size-1024 slab. Unfortunately, this won't necessarily be the leaker. It will simply be whatever process uses up the last object in the cache. However, since the leaks come in big bursts, when multipathd is checking the paths, given enough times, the leaker should end up taking the last object and growing the cache.
To setup and use this script you need the following packages:
crash
systemtap-runtime
systemtap
kernel-debuginfo*
kernel-debuginfo-common*
kernel-devel*
* For your running kernel (which should obviously be one that leaks).
Once you have all of those pacakges, the first thing you need to do is edit the script. Start up crash by running (if you are using the -125.el5 kernel):
# crash /usr/lib/debug/lib/modules/2.6.18-125.el5/vmlinux
At the crash prompt, run:
crash> kmem -s size-1024
You should see something like:
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
e0000040ffcd8480 size-1024 1024 374 405 27 16k
Take the number under CACHE (0xe0000040ffcd8480), and copy it into the "if" statement in the systemtap script, cache_check.stp
So, with the above number, the "if" statement in my script would look like
if ($cachep == 0xe0000040ffcd8480) {
This will make the script only print backtraces for the size-1024 slab. Then you can start the script with
# stap cache_check.stp
and start up multipathd. Hopefully, the script will start printing process info, and backtraces whenever a process grows the slab. It may simply crash your kernel. systemtap is finicky like that. But it works for me, and it should tell us what allocation isn't being freed.
Bah! Bugzilla wrapped the crash output. It should be: CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS e0000040ffcd8480 size-1024 1024 374 405 27 With: SSIZE 16k stuck on at the end of the line. At any rate, the only thing that is important is the number under CACHE. Sorry for the delay - our weekend update lasted well into Monday. The stap script periodically prints out the same stack trace: multipathd(12588) growing size-1024 cache 0xa00000010015e5c0 : cache_grow+0x0/0xa20 [kernel] 0xa00000010015f790 : cache_alloc_refill+0x3d0/0x4c0 [kernel] 0xa000000100160160 : kmem_cache_zalloc+0x120/0x1c0 [kernel] 0xa000000100190830 : alloc_pipe_info+0x30/0xc0 [kernel] 0xa000000100190960 : create_write_pipe+0xa0/0x3e0 [kernel] 0xa000000100192240 : do_pipe+0x20/0x120 [kernel] 0xa000000100035fb0 : sys_pipe+0x30/0x80 [kernel] 0xa00000010000bdd0 : __ia64_trace_syscall+0xd0/0x140 [kernel] 0xa000000000010620 multipathd(12588) growing size-1024 cache 0xa00000010015e5c0 : cache_grow+0x0/0xa20 [kernel] 0xa00000010015f790 : cache_alloc_refill+0x3d0/0x4c0 [kernel] 0xa000000100160160 : kmem_cache_zalloc+0x120/0x1c0 [kernel] 0xa000000100190830 : alloc_pipe_info+0x30/0xc0 [kernel] 0xa000000100190960 : create_write_pipe+0xa0/0x3e0 [kernel] 0xa000000100192240 : do_pipe+0x20/0x120 [kernel] 0xa000000100035fb0 : sys_pipe+0x30/0x80 [kernel] 0xa00000010000bdd0 : __ia64_trace_syscall+0xd0/0x140 [kernel] 0xa000000000010620 so it looks like something in sys_pipe(). (In reply to comment #43) > prio_callout "/sbin/mpath_prio_alua /dev/%n" > > to > > prio_callout "none" This change stops the leak. I haven't had a chance to check all permutations. Well, that would certainly explain why we couldn't find the leak in the scsi ioctl handling code :) There's no need to check the other permutations. Multipathd doesn't call pipe when it's running the path checkers, just the priority callouts. Well there's definitely a bug in multipathd. After it creates the pipe, if the fork() call fails, multipathd doesn't clean up the pipe. I don't know what could have changed in the kernel that would cause the fork() call to fail on newer kernels. Also, if this is the case, you should be bumping against the open file descriptors limit before you run out of memory, unless ulimit -n really is set much higher that 2048. Another option is that it's the pipe() call itself is failing but not cleaning itself up. However, after a quick check, it doesn't look like the sys_pipe() call path has changed between -113 and -118. The last option is that the pipe isn't being clean up correctly when the mpath_prio_alua program exits. Multipathd is correctly closing down both ends of its pipe (if the fork call works correctly). But exit() should be cleaning up the pipe for mpath_prio_alua. If this is the first problem, then killing multipathd should give you all of your memory back. I'm going to make a new multipathd package that correctly closes the pipe when fork() fails, and also adds more error messages so that we know exactly where multipathd is failing in the prio callout function. Another question. On the kernels where there is no memory leak, are you still seeing the: Jan 7 20:02:22 dhcp-74-153 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdach messages. You can download the new multipath test packages at: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/device-mapper-multipath-0.4.7-24.el5.ia64.rpm http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/device-mapper-multipath-debuginfo-0.4.7-24.el5.ia64.rpm http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/kpartx-0.4.7-24.el5.ia64.rpm These will print out some more verbose error messages, and correctly close the pipe fds if fork fails. Let me know what messages get printed out to /var/log/messages by multipathd with these packages installed. We ran this version of multipathd (from the packages in Comment #52). It still seems to leak objects from the size-1024 slab. I see a lot of items in /var/log/messages like: Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaab Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaac Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaad Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaae Jan 13 17:31:39 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:31:39 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaaf And one section of this: Jan 13 17:39:33 dhcp-101-132 multipathd: /sbin/mpath_prio_alua exitted with 255 Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdblq Jan 13 17:39:33 dhcp-101-132 multipathd: /sbin/scsi_id exitted with 255 Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdblq Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdblr Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdblr Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdbls Jan 13 17:39:33 dhcp-101-132 multipathd: error creating pipe for callout: Too many open files Jan 13 17:39:33 dhcp-101-132 multipathd: error calling out /sbin/scsi_id -g -u -s /block/sdbls I don't know if we have this already, but obviously it would be good to see full ps and lsof output corresponding to that situation. Here's a quick guess. This looks like the second possibility listed above. The sys_pipe call is failing and not cleaning itself up. It appears that it may have allocated the memory for the pipe but then could not find a free file descriptor. Try increasing your ulimit -n to something higher, like umlimit -n 4096, and see if this fixes it. I'll go check and see if I can find the leak in the kernel code. I'll also go look through multipathd and figure out what the most number of file descriptors you could possibly need at one time for a given number of multipath devices and paths. Also, for anyone trying to reproduce this: In the defaults section of multipath.conf, try setting max_fds equal to exactly the number of paths that you have.
For example if you have 32 LUNs with 4 paths each, use
defaults {
...
max_fds 128
}
Pretty sure I found it! Comment #55 was pretty much right on. do_pipe() allocs the pipe files, and then tries to get free file descriptors for them. When the process is out of file descriptors, it fails. When the read pipe gets freed, it doesn't do a dput on the dentry. This means that there is still a reference to the pipe dentry, and the inode never gets cleaned up, including the pipe_inode_info, which is what the systemtap script was pointing to. I checked upstream and this is fixed there. The odd thing is, this isn't fixed in the earlier kernels, like -92.el5 and -113.el5. So why you can't see it there is a mystery. But increasing the ulimit -n until you don't see those "Too many open files" messages should work around the problem. I'll go hunt down the upstream kernel commit, and we'll get you a patched kernel to prove that this does actually fix the issue. Looking through the multipathd code, and just counting the number of fds that are always open, plus the maximum number of fds that could be used by each thread at any time, without any consideration of whether it would actually be possible to actually use them all at the same time, I get: 18 fds + 1 per path + 1 per interactive multipathd client. Since only one thread services the multipathd interactive clients (what you get by running multipathd -k), there's no real good reason to allow multiple clients to connect at once. Just allowing one client and making SIGHUP force the connection to close seems like it would work just fine. It seems like a good rule of thumb would be 32 fds + 1 per path could you try this and make sure that it doesn't get any of the "Too many open files" messages. A kernel with a proposed fix for the leak is available at: http://people.redhat.com/coughlan/el5/bz478643/ Maybe after you run the test in comment 58, you can try this kernel with all default settings, and the stock 5.3 dm-multipath, to be sure we nailed it. After that try a run with the updated dm-multipath in comment 52. That right Ben? Tom (In reply to comment #58) > Just allowing one client and making SIGHUP force > the connection to close seems like it would work just fine. It seems like a > good rule of thumb would be > > 32 fds + 1 per path > > could you try this and make sure that it doesn't get any of the "Too many open > files" messages. This setup has 8 paths to each of 255 LUNs, so I'll try: 32 fds + 8 * 255 = 2072 for the open files limit. Then I'll try Tom's kernel from Comment #59. We should have results by 1 pm or so.
Release note added. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.
New Contents:
We need a 5.3 release note for this. Here is a draft. Ben, please review and improve as needed.
It has been determined that 1024 byte objects in kernel slab may be lost when a call to do_pipe() fails. The problem occurs because do_pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, do_pipe fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel.
A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors.
This problem has been observed with multipathd in particular. To avoid the problem with multipahtd, you should set max_fds in the defaults section of multipath.conf to
32 fds + 1 fd per path
For example if you have 32 LUNs with 4 paths each, use
defaults {
...
max_fds 160
}
Created attachment 329108 [details] Patch to fix the leak during pipe() failure This fix is a port of: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ccac88eeb5659c716af8e695e2943509c80d172 Just a clarification to Comment #59. When you try to test the workaround, you should be running with the multipath packages from Comment #52. These are necessary to print out the useful error messages when the callout program fails. Ben updated the release note as follows: I changed do_pipe() to pipe() since I assume our customers don't care what internal kernel function is failing. They just want to know what system call could cause them problems. I also changed the instructions so that users know that they only need to change max_fds if it needs to be above 1024. The vast majority of users don't need to do anything.
Release note updated. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1,6 +1,6 @@
We need a 5.3 release note for this. Here is a draft. Ben, please review and improve as needed.
-It has been determined that 1024 byte objects in kernel slab may be lost when a call to do_pipe() fails. The problem occurs because do_pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, do_pipe fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel.
+It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel.
A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors.
@@ -8,9 +8,9 @@
32 fds + 1 fd per path
-For example if you have 32 LUNs with 4 paths each, use
+if this number is greater than the default of 1024. For example if you have 255 LUNs with 8 paths each, use
defaults {
...
- max_fds 160
+ max_fds 2072
}
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1,10 +1,8 @@ -We need a 5.3 release note for this. Here is a draft. Ben, please review and improve as needed. - It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel. A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors. -This problem has been observed with multipathd in particular. To avoid the problem with multipahtd, you should set max_fds in the defaults section of multipath.conf to +This problem has been observed with multipathd in particular. To avoid the problem with multipathd, you should set max_fds in the defaults section of multipath.conf to 32 fds + 1 fd per path
Release note updated. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1,14 +1,9 @@
-It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel.
+It has been determined that 1024 byte objects in kernel slab may be lost when a call to pipe() fails. The problem occurs because pipe() allocates pipe files, and then tries to get free file descriptors for them. If the process is out of file descriptors, pipe() fails, but it does not clean up properly. A fix for this problem is planned for a forthcoming 5.3 kernel update.
-A workaround to avoid this problem is to ensure that the process calling do_pipe has adequate file descriptors.
+To workaround this issue, ensure that the process calling do_pipe has adequate file descriptors allocated.
-This problem has been observed with multipathd in particular. To avoid the problem with multipathd, you should set max_fds in the defaults section of multipath.conf to
+This issue has been observed with multipathd in particular. To avoid the problem with multipathd, calculate the number of file descriptors (FDs) required using the formula: "FDs = Number of Paths + 32". If the result is greater than the default value of 1024, then the max_fds value in the defaults section of multipath.conf should be set to the previously calculated value.
-32 fds + 1 fd per path
+For example, if there are 255 LUNs with 8 paths each, the line to be added to the defaults section of multipath.conf would be:
-
+
-if this number is greater than the default of 1024. For example if you have 255 LUNs with 8 paths each, use
+max_fds 2072-
-defaults {
- ...
- max_fds 2072
-}
(In reply to comment #63) > Just a clarification to Comment #59. When you try to test the workaround, you > should be running with the multipath packages from Comment #52. These are > necessary to print out the useful error messages when the callout program > fails. With the modified multipath packages from Comment #52 and max_fds = 2072, we don't see the size-1024 slab growing, so that config option does seem to work around the leak issue. I don't see any "Too many open files" messages in /var/log/messsages, either. (Sorry for the slow updates - unfortunately, we're still having trouble with our network in the lab where this machine & array are located.) (In reply to comment #59) > A kernel with a proposed fix for the leak is available at: > > http://people.redhat.com/coughlan/el5/bz478643/ > > Maybe after you run the test in comment 58, you can try this kernel with all > default settings, and the stock 5.3 dm-multipath, to be sure we nailed it. > After that try a run with the updated dm-multipath in comment 52. That right > Ben? > > Tom We tried this kernel, 2.6.18-128.el5.bz478643, with the stock device-mapper-multipath package and max_fds set to 2048 (high enough for multipathd to start up but not high enough to avoid the "error calling out /sbin/mpath_prio_alua" messages). This got frequent "error calling out" messages, but the size-1024 slab doesn't grow. So I think that kernel fix works. in kernel-2.6.18-130.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Updating PM score. Flipping back to ON_QA, since I saw that the test /kernel/errata/5.3.z/481576 failed. The machine is nec-tx7-1.rhts.bos.redhat.com. RHTS logs could be found here http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=9138721 I'm not sure if this is reproducable as there were no issues seen in previous test runs. I launched 4 runs of this test on same machine to see if it occurs once again. RHTS job http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=74878. (In reply to comment #77) > Flipping back to ON_QA, since I saw that the test /kernel/errata/5.3.z/481576 > failed. The machine is nec-tx7-1.rhts.bos.redhat.com. RHTS logs could be found > here http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=9138721 > > I'm not sure if this is reproducable as there were no issues seen in previous > test runs. I launched 4 runs of this test on same machine to see if it occurs > once again. RHTS job http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=74878. Okay, there were two failures of this test that claims that there are still leak problems, so I'd like to check with dev if these test failures are real. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html |