Bug 119351
Description
Jim Richard
2004-03-29 19:33:51 UTC
Created attachment 98943 [details]
LARRD Chart of memory usage during incidents described earlier
Chart documents memory utilization during OOM incidents.
It's not so much a matter of total free memory, but most likely a lowmem (<1GB) memory deficit that is the problem here. This issue will need retesting when the U2 beta becomes available for external sites (probably within a week?). In the U2 kernel the default page reclamation aggressiveness has been re-tuned, which may address the problem in this case. In the interim, please try the following:

$ echo 30 > /proc/sys/vm/inactive_clean_percent

It is currently set to 5 percent by default, but will be set to 30 in the U2 kernel. (It can be set as high as 100, but we'd like to know whether the new default alleviates your particular problem.) If that does not help, the U2 kernel will dump a complete "Alt-sysrq-m" type output with each OOM kill. Without that debug data, it's impossible to know what VM circumstances precipitated each OOM kill.

Is there any way for me to make my current kernel provide Alt-sysrq-m output? I'm at the latest post-U1 errata for AS 3. Also, are there any other diagnostics I can turn on that will assist in problem resolution? BTW, after the last crash, my Google research indicated that I should update /proc/sys/vm/inactive_clean_percent to 30, and I've already done that. But thanks for the confirmation. I was also wondering if the problem might be caused by the fact that I've only got 8G of swap on a 16G system. I've not had much luck in locating recommendations for swap on large-memory systems. I originally configured the system this way due to a shortage of OS disk space. I've resolved that issue, and if I need another 8G of swap I can add it.

Unfortunately there is no way to make your current kernel provide Alt-sysrq-m data precisely at the time of the OOM kill, which is what is required. You mention that you've updated the inactive_clean_percent "after the last crash". Does this mean that the 30% setting has resulted in no further OOM kills? As to the swap question, since your free data shows no problem with swap utilization, that should not be an issue.

Dave, Thanks for your response. Can you let me know where I can pick up the U2 beta? We have not experienced further OOM events since the update, but then again, we haven't attempted to reproduce the error either. About the swap space, I was concerned about this since I'm not intimate with all the Linux VMM internals; some systems are very good about this kind of arrangement, and others I've worked with are not so graceful. When the first occurrence happened, we identified a 3rd party app that was abusing memory, fixed the 3rd party app, and applied the last security kernel, which had some VMM fixes. We ran in test for 2 weeks and called it good. The last 2 errors happened during final testing prior to release into production, so we fell back to the server this new one is replacing. We are in the process of doing a full review of memory utilization by db2 in an attempt to better understand what is using what, before we attempt to reproduce the error. We will probably begin testing activities tomorrow morning. The plan will include running with both 5 and 30 set in /proc/sys/vm/inactive_clean_percent. Thanks again for your help with this. Jim
> Can you let me know where I can pick up the U2 beta?
I will do that, it should be available in a few days...
Dave
Dave, FYI: We had also opened a PR with IBM, in case db2 was doing something untoward with memory. They have reviewed our configuration and the DB2 crash diagnostics and feel they are clean. They did provide some feedback regarding our db2 configuration but nothing significant related to this discussion. They have closed the ticket, but will re-open it if we identify something suspicious in db2's behavior. This seems reasonable since the crash was caused externally (kill -9). Thought you should know. Jim
Created attachment 99100 [details]
Memory Chart during reproduction of problem.
Dave,
I've reproduced the problem using inactive_clean_percent set to 5 30 90. When
set to 60, page cleaning kicked in, in time to prevent the problem. I've
identified the source of the pressure on the active pages, and I've also identified
the source of pressure on the overall cache. The cache is polluted by our backup
process (we use Mondo Archive). The pressure on active memory was caused by a
script we implemented to alleviate the random number generator being run out of
entropy to the point where it never recovers (documented in BUG #s 117218 and
119526). The script randomly calls misc commands designed to generate I/O in
order to stir in entropy. When entropy becomes chronically exhausted the script
begins to run almost continuously; when this happens active pages begin to
rise steadily until either OOM starts or the dead pages are cleaned from the
active pool.
I am attempting to recreate using inactive_clean_percent=100. Meanwhile I've
attached my messages.log from yesterday and today, the script used to generate
entropy, and the memory utilization chart from yesterday.
Please let me know if I can provide any other information, or if you have
suggestions on other vm parameters that could use tweaking. We can live without
the GenEntropy script, but the backups must obviously continue. But I'm
concerned about other script based daemons that could contribute to active page
utilization. For instance, we use BigBrother, which is mostly script based;
granted, it only runs every 5 minutes, but over extended periods of time it could
present similar problems. We also have some application specific scripts that
run. I'd like to find a way to prevent these tasks from polluting the Active
Memory pages, once their terminate normally.
Created attachment 99101 [details]
messages.log from 4/3/2004 gzipped
gzipped messages.log from 4/3/2004
Created attachment 99102 [details]
Gzipped messages.log from 4/4/2004
Gzipped messages.log from 20040404
Created attachment 99103 [details]
Script that puts pressure on active page utilization
Script designed to stir entropy, that puts pressure on active page utilization
Typo in last line of my last comment... Should read: ...I'd like to find a way to prevent these tasks from polluting the Active Memory pages, once their children terminate normally.

Setting inactive_clean_percent to 100 is the best course of action for now. However, if you can gather Alt-sysrq-m data when the system is just about to start OOM-killing, there may be enough info to help.
Jim,
Re-reading your post, now I'm a bit confused:
> I've reproduced the problem using inactive_clean_percent set
> to 5 30 90. When set to 60, page cleaning kicked in, in time to
> prevent the problem.
/proc/sys/vm/inactive_clean_percent is a single percentage value.
When you say "5 30 90", are you referring to /proc/sys/vm/pagecache?
Exactly what did you set to "60"? The pagecache max percent? And
if so, what was the inactive_clean_percent value at that time?
Dave, Sorry, I should have been clearer. I reproduced the problem multiple times using different settings of inactive_clean_percent. So at this point I've produced the problem on 5 occasions, each using a different setting for inactive_clean_percent. BTW, last night I re-created the situation using inactive_clean_percent set to 100. I haven't tried any other settings at this point.

The funny thing about this whole thing is that the system is reporting between 2 and 5 gigs of active memory when there is nowhere near this much being used by any process. Typically there is less than 1 gig in use by processes and IPC shared memory that I can tell (DB2 makes extensive use of shared memory). The GenEntropy script runs bunches of commands that just terminate normally, yet this appears to be what is driving the active memory counter up over time. When the script is not running I see no rise in active memory.

Unfortunately it will be very difficult to catch the Alt-sysrq-m data since the timing of the event is not predictable. It can run for hours at a specific level of utilization before the problem occurs. For this, I think we'll have to wait for the new kernel.

It seems to me the kernel is keeping pages marked active when the process that owned them is long gone. If you review the chart I sent last night, note the sudden drops in memory utilization at ~17:30 and 7:30 am. The drop at 17:30 happened while running inactive_clean_percent=60. This drop happened without any external events, no process terminations or OOM events. Something just woke up and released almost 5 gigs of memory. I'm not sure I understand the behavior or what drove the release. The release at 7:30 am happened in the middle of multiple OOM events. The OOM events began at 5:00 and continued until the release. A total of 139 processes were killed.

Are there any knobs we can turn to force more frequent release of "Active" pages back to the inactive pool, since holding pages active seems to be the problem? Is there something we can do with /proc/sys/vm/pagecache? I've seen references to /proc/sys/vm/freepages (it doesn't appear in my /proc, though); is this an option? Thanks again for your help with this. Jim

I guess I don't understand what you mean exactly by "active memory", and how you determine what it is. The Alt-sysrq-m output shows exactly what the page counts are on each list in each zone, and in particular, it gives the exact breakdown of the currently-active pages, be they either (1) pagecache pages or (2) anonymous memory pages used by processes. For example, here are the first few lines of an Alt-Sysrq-m output:

    SysRq : Show Memory
    Mem-info:
    Zone:DMA freepages: 2902 min: 0 low: 0 high: 0
    Zone:Normal freepages:176074 min: 1279 low: 4544 high: 6304
    Zone:HighMem freepages:2118244 min: 255 low: 34304 high: 51456
    Free pages: 2297220 (2118244 HighMem)
    ( Active: 11955/1050, inactive_laundry: 317, inactive_clean: 0, free: 2297220 )
    aa:0 ac:0 id:0 il:0 ic:0 fr:2902
    aa:0 ac:3926 id:1 il:0 ic:0 fr:176072
    aa:2483 ac:5548 id:1049 il:317 ic:0 fr:2118244
    ...

The lines above starting with "aa:" give the page counts per zone, first the DMA zone, then the Normal zone, and last the Highmem zone. The letters are shorthand for:

    aa: active anonymous memory pages
    ac: active pagecache pages
    id: inactive_dirty pages
    il: inactive_laundry pages
    ic: inactive_clean pages
    fr: free pages

The id, il, ic page lists contain combined anonymous/pagecache pages, but the "active" page list is broken down into two sub-lists, the aa: and ac: lists.

We're guessing that the pagecache is being flooded and not flushed quickly enough to avoid OOM kills when a user process is attempting to allocate a page. Setting inactive_clean_percent to 100 is the most important tuning knob to keep page reclamation going as aggressively as possible.
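As an illustration of the per-zone counters described above, the following Python sketch tallies the aa: through fr: lines of one Alt-sysrq-m dump. It is not part of the original exchange. The function names are hypothetical, and the sketch assumes each dump carries exactly three counter lines in DMA, Normal, HighMem order, as in the example above.

```python
import re

FIELDS = ("aa", "ac", "id", "il", "ic", "fr")
ZONES = ("DMA", "Normal", "HighMem")  # order given in the thread; assumed fixed

# Matches a counter line anywhere in a log line, so a syslog prefix is tolerated.
_COUNTERS = re.compile(
    r"aa:\s*(\d+)\s+ac:\s*(\d+)\s+id:\s*(\d+)\s+il:\s*(\d+)\s+ic:\s*(\d+)\s+fr:\s*(\d+)"
)


def parse_zone_counts(text):
    """Return {zone: {field: pages}} for the first three counter lines found."""
    rows = []
    for line in text.splitlines():
        match = _COUNTERS.search(line)
        if match:
            rows.append(dict(zip(FIELDS, map(int, match.groups()))))
    return dict(zip(ZONES, rows))


def zone_total(counts):
    """Sum of aa: through fr: for one zone."""
    return sum(counts[field] for field in FIELDS)


def pagecache_share(counts):
    """Fraction of a zone's active pages that are pagecache (ac) pages."""
    active = counts["aa"] + counts["ac"]
    return counts["ac"] / active if active else 0.0


# Counter lines copied from the example output above.
SAMPLE = """\
aa:0 ac:0 id:0 il:0 ic:0 fr:2902
aa:0 ac:3926 id:1 il:0 ic:0 fr:176072
aa:2483 ac:5548 id:1049 il:317 ic:0 fr:2118244
"""

zones = parse_zone_counts(SAMPLE)
```

Calling zone_total(zones["Normal"]) gives the Normal-zone sum of the aa: through fr: counters, and pagecache_share(zones["HighMem"]) gives the share of active Highmem pages that belong to the pagecache.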
Another thing you could try is tinkering with /proc/sys/vm/pagecache values, specifically the third ("max") value, which is set to 100 (percent) by default. If the percentage of active pages that are being used by the pagecache goes above that max percentage value, then only pagecache pages will be reclaimed, and anonymous memory pages will be left alone. Since its default value is 100, the active page list is allowed to be totally consumed with pagecache pages. So, if you set it to a lower value, pagecache pages will be selected for reclamation in preference to anonymous memory pages. It's not a hard limit, but it does influence page reclamation, and will keep user process memory around longer. But it's impossible to predict whether it will help. With the U2 kernel, the Alt-sysrq-m will show exactly the page count state that precipitated the OOM kill; it's strictly a matter of numbers at that point in time.

Dave, Thanks for the response, I'll give the pagecache knob a try. It looks like I can script the Alt-sysrq-m by running the following cmd:

echo "m" > /proc/sysrq-trigger

When I did this it wrote the info to messages.log. Do you think once a minute would be frequent enough, or do you think more frequent recording is in order? Let me know your thoughts and I'll set up another test. Thanks again! Jim

It might be helpful, although unfortunately the problem with doing what you propose is that the bash shell process running the "echo" script may need memory, but probably won't be able to get any when the system gets into the memory-starved state. By the time it does run, the OOM kill has happened, or the memory has been made available, etc...
Created attachment 99121 [details]
Gzipped Messages.log With AltSysRq-m output
Dave,
I thought it'd be interesting, and ran it anyway. The extract of our
messages.log is attached. Also since echo is a built-in I don't think it'll
require memory since any library code should already be in main storage... If
you think it'd be better I could run it under busybox or some other statically
linked shell, with all required functionality implemented in the shell.
Regardless I have AltSysRq-m data gotten 20 seconds before db2sysc was killed
by OOM. Another one 17 seconds before db2bp (db2 commandline) was killed by
OOM, and another 5 seconds before another db2bp got the ax... There's more in
there but these are the closest dumps of memory info to OOM events.
Let me know if anything leaps out at you.
Thanks,
Jim
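For reference, the periodic sampling discussed in the comments above could be sketched in Python as follows. This is an illustration only, not the script used in the thread. The function names and the count and interval parameters are hypothetical; only the /proc/sysrq-trigger path and the "m" command come from the comments above. Dave's caveat applies at least as strongly here: an interpreter needs more memory than a shell built-in, so a sampler like this may not get to run once the system is starved.

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # path quoted in the thread; writing needs root


def request_memory_dump(trigger_path=SYSRQ_TRIGGER):
    """Write 'm' to the trigger file, asking the kernel to log its memory summary."""
    with open(trigger_path, "w") as trigger:
        trigger.write("m")


def sample(count, interval_seconds, trigger_path=SYSRQ_TRIGGER, sleep=time.sleep):
    """Request `count` dumps, pausing `interval_seconds` between consecutive ones."""
    for index in range(count):
        request_memory_dump(trigger_path)
        if index + 1 < count:
            sleep(interval_seconds)
```

For example, sample(60, 60) would request one dump a minute for an hour. The dumps themselves land wherever the kernel log is written, which in this thread is messages.log.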
Yes, something does leap out... The page counts for the Normal zone show a remarkably small number of pages being cycled through the pagecache/anonymous-memory reclamation process, typically around 5000 pages (total the aa: through fr: counts for the Normal zone in any of the sysrq-m outputs). The Low/Normal zones in your system start with 896MB of memory, or about 225,000 pages. Subtract from that the kernel's text and data, most notably the mem_map array taking 60 bytes per page of physical memory (~65,000 pages), and the remaining amount of available Normal memory pages would be roughly 160,000 pages. These pages are made available for the free page lists, but also for kernel memory allocations that must come from low memory, such as for the kmalloc() slab cache. And that's where the problem is here -- an unusually large number of pages (~149000 in each sysrq-m output) are consumed by the slab cache. So what we need now is a dump of /proc/slabinfo during the "problem time" to see where it's all being allocated.
Created attachment 99130 [details]
OOM with AltSysRq-M and Slabinfo dump
Dave,
Ok. Here's one for sendmail, with AltSysRq-m and slabinfo from 3 seconds prior
to OOM event.
Let me know where we go from here.
Thanks
Jim
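The lowmem arithmetic in Dave's analysis above can be replayed with a few lines of Python. This is a sketch for checking the numbers, not anything posted in the thread. The helper names are hypothetical, and the 4096-byte page size and an exact 16 GiB of RAM are assumptions; the 896MB, 60-byte, 225,000, 65,000 and 149,000 figures are taken from the comment.

```python
PAGE_SIZE = 4096  # bytes; assumed 32-bit x86 page size, not stated in the thread


def bytes_to_pages(nbytes):
    """Whole pages that fit in nbytes."""
    return nbytes // PAGE_SIZE


def mem_map_pages(ram_bytes, bytes_per_page_struct=60):
    """Lowmem pages taken by mem_map at 60 bytes per page of physical memory."""
    return bytes_to_pages(bytes_to_pages(ram_bytes) * bytes_per_page_struct)


def lowmem_budget(lowmem_pages, kernel_static_pages, slab_pages):
    """Split lowmem into what the kernel pins, what slab holds, and what is left."""
    available = lowmem_pages - kernel_static_pages
    return {
        "available": available,
        "left_for_reclaim": available - slab_pages,
        "slab_fraction": slab_pages / available,
    }


# Rounded figures exactly as quoted in the analysis.
THREAD_BUDGET = lowmem_budget(225_000, 65_000, 149_000)

# Exact values under the stated assumptions, for comparison with the rounding.
EXACT_LOWMEM_PAGES = bytes_to_pages(896 * 1024 * 1024)
EXACT_MEM_MAP_PAGES = mem_map_pages(16 * 1024 ** 3)
```

With the rounded figures, about 11,000 Normal-zone pages remain for everything other than the slab cache, which is the same order of magnitude as the roughly 5000 pages seen cycling through reclamation. The exact values (229,376 lowmem pages, 61,440 mem_map pages) suggest the quoted 225,000 and ~65,000 are round estimates.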
Ernie, Thanks, I've located it and will download it tonight. I should be able to get it installed tomorrow. If the qla2xxx.conf module has been included, I should be ready to test tomorrow afternoon. If not I should have it ready tomorrow night. Thanks again to all for your responsiveness in this matter; you're making a happy customer here. Jim

As suspected, the problem here is the enormous size of the pagecache, which can grow extremely large because of the 16GB of RAM in this system. The 5,000,000+ buffer_head structures -- which consume 130,000+ pages of lowmem slab cache memory -- are associated with those pagecache pages and filesystem metadata, which are all located in the Highmem zone.

The state of the Highmem zone is fairly healthy, even at the times when OOM kills occur. The free page count ranges from slightly below, to well above, the "low" watermark of 64000 pages. When it does drop below the low value, the free pages are being replenished with no problem. The combined number of inactive_laundry and inactive_clean pages is staying equal to the number of inactive_dirty pages, so the "inactive_clean_percent" setting of 100 is doing its job. So, the page reclamation process sees no need to be any more aggressive in flushing Highmem pagecache pages to disk.

Have you considered the use of the hugemem kernel? It exists for situations like this to avoid lowmem exhaustion. In the standard kernels, the 4GB virtual address space is split between user and kernel virtual address spaces, with the lower 3GB given to user space, and the upper 1GB used for kernel virtual address space. Of this 1GB of kernel virtual address space, 896MB is unity-mapped, and that memory is used by lowmem (DMA/Normal) zones. The hugemem kernel splits the address space into two, with 4GB being given to each of the user and kernel virtual address spaces. This will increase the kernel's lowmem zone to ~4GB, and therefore alleviates the type of lowmem exhaustion that you are seeing.
You do pay for this split, however, because a TLB flush will be done on every entry into the kernel. So the hugemem kernel is only to be used for cases where the extra lowmem requirement offsets the extra kernel overhead.

As far as lowering the /proc/sys/vm/pagecache max value from 100 down to a lower value, although the sysrq-m output shows that the Highmem zone's active pages typically are between 70-80% pagecache pages, the sysrq-m output shows literally no swapping of anonymous memory going on at all. So the inactivation of pages is already selecting only pagecache pages as it is, and lowering pagecache max won't accomplish anything.

There's little else to be done with the kernel as it is. There are potential tests that could be done with instrumented kernels, but that would require your being willing to try some test kernels, with no guarantee that the problem can be easily overcome.

Dave, Thanks for the analysis. After seeing the numbers from last night's test I've been coming to the same conclusions myself. I believe I'll give the HugeMem kernel a shot. I do have a couple of concerns here: the first is how much performance degradation I should expect, and the second is the JVM incompatibilities noted in the release notes. Are there any user experience documents on either of these items available for review? Tomorrow we'll re-run the offending workload against HugeMem. Assuming that goes well, we'll run some benchmarks to identify the performance hit, then re-test our Java based applications using SetArch to get around the 3 G address space limitations of the JVM(s). Thanks again for all your help with this. I'll update this report as soon as I have any additional findings. Jim

A patch that fixes try_to_reclaim_buffers() in the manner suggested in Comment #29 has been queued for inclusion in RHEL3-U5. (i.e., it changes the 10% test to use "nr_used_buffer_heads" instead of "nr_unused_buffer_heads")

I'm not sure what happened but comments 28 and 29 are missing here. I filed a comment some time ago indicating that switching to the hugemem kernel along with elimination of the script that was polluting the memory worked around the problem. But I'm glad to see that a fix is in the works.

Jim -- sorry, #28 and #29 were Red Hat private comments. Here's the part of try_to_reclaim_buffers() that is the problem:

    /*
     * Since removing buffer heads can be bad for performance, we
     * don't bother reclaiming any if the buffer heads take up less
     * than 10% of pageable low memory.
     */
    if (nr_unused_buffer_heads * sizeof(struct buffer_head) * 10 <
                            freeable_lowmem() * PAGE_SIZE)
            return 0;

So in your case, even though the buffer_head slab cache was using over 50% of freeable low memory, the test above would forestall the function from trying to reclaim them. The nr_unused_buffer_heads counter is not allowed to exceed 1600 buffer_heads on an x86 (~42 pages), so this function would always just return 0. The fix changes it to test "nr_used_buffer_heads", which in your test case would reflect the ~5000000 in-use buffer_heads.

The fix for this problem was committed to the RHEL3 U5 patch pool Monday evening (in kernel version 2.4.21-25.1.EL).

That's wonderful, thanks again for all the help with this! I'll look forward to U5.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html
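The effect of the 10% test quoted above from try_to_reclaim_buffers() can be reproduced with plain arithmetic. The Python sketch below is an illustration, not kernel code. The function name is hypothetical, the 107-byte buffer_head size is merely derived from the "1600 buffer_heads (~42 pages)" figure with an assumed 4096-byte page, and 160,000 freeable lowmem pages is a stand-in borrowed from the earlier estimate of available Normal-zone pages; the real freeable_lowmem() value is not given in this report.

```python
PAGE_SIZE = 4096  # bytes; assumed x86 page size

# Derived from "1600 buffer_heads (~42 pages)"; an approximation, not sizeof().
BUFFER_HEAD_SIZE = 42 * PAGE_SIZE // 1600

FREEABLE_LOWMEM_PAGES = 160_000  # hypothetical stand-in for freeable_lowmem()


def skips_reclaim(buffer_head_count, freeable_lowmem_pages=FREEABLE_LOWMEM_PAGES):
    """True when the counted buffer heads occupy under 10% of freeable lowmem,
    i.e. when the quoted test would return 0 without reclaiming anything."""
    return (buffer_head_count * BUFFER_HEAD_SIZE * 10
            < freeable_lowmem_pages * PAGE_SIZE)


# Original test: counts unused buffer heads, which are capped at 1600.
original_behaviour = skips_reclaim(1600)

# Fixed test: counts in-use buffer heads, about 5,000,000 in this report.
fixed_behaviour = skips_reclaim(5_000_000)
```

Under these assumptions original_behaviour is True (reclaim is always skipped, because the capped counter can never reach 10% of lowmem) and fixed_behaviour is False (reclaim proceeds), which matches the explanation given in the comment.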