Bug 134829
Summary: | kernel not "un-caching" and thinks it is running out of memory
---|---
Product: | Red Hat Enterprise Linux 3
Component: | kernel
Version: | 3.0
Hardware: | All
OS: | Linux
Status: | CLOSED NOTABUG
Severity: | medium
Priority: | medium
Reporter: | Joshua Jensen <joshua>
Assignee: | Larry Woodman <lwoodman>
QA Contact: | Brian Brock <bbrock>
CC: | petrides, riel
Doc Type: | Bug Fix
Last Closed: | 2004-10-14 17:44:00 UTC
Description
Joshua Jensen
2004-10-06 15:45:25 UTC
Joshua, the problem here is that the normal memory zone is exhausted.
Of the ~225000 pages the system started with there are only ~5000 left
and there are ~90000 in the slabcache, leaving ~120000 lowmem pages
unaccounted for. This is typically due to a buggy driver that
allocates lowmem and never frees it or a huge number of bounce buffers
outstanding. Can you get vmstat, iostat and "cat /proc/slabinfo"
outputs while the system is running so I can see where the memory is
going?
Also, please use the latest RHEL3-U4 candidate kernel to get this
info; it is located here:
>>>http://people.redhat.com/~lwoodman/.RHEL3/
Larry
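
The lowmem accounting in the comment above can be reproduced as a quick calculation. This is only a sketch: the page counts are the rounded figures quoted in the comment, the 4 KiB page size is an assumption for 32-bit x86, and the function names are made up for illustration. With these rounded inputs the remainder comes out near 130,000 pages, the same order as the ~120,000 quoted.

```python
PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages on 32-bit x86


def unaccounted_lowmem(total_pages, free_pages, slab_pages):
    """Return lowmem pages that are neither free nor held in the slab cache."""
    return total_pages - free_pages - slab_pages


def pages_to_mib(pages):
    """Convert a page count to MiB using the assumed page size."""
    return pages * PAGE_SIZE_KIB / 1024


# Rounded figures from the comment: ~225000 total, ~5000 free, ~90000 in slab.
missing = unaccounted_lowmem(225_000, 5_000, 90_000)
print(f"unaccounted: {missing} pages (~{pages_to_mib(missing):.0f} MiB)")
```

Pages that show up in neither the free count nor the slab cache are the ones worth chasing, which is why the vmstat, iostat and /proc/slabinfo samples are requested.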
Ok then... just loaded the latest kernel and it doesn't solve this problem by itself. I'll attach the vmstat, iostat, and cat /proc/slabinfo output. These readings were taken for 10 minutes or so, up until the box essentially locks up.

Created attachment 104896 [details]
vmstat taken every 2 seconds
Created attachment 104897 [details]
iostat taken every 2 seconds
Created attachment 104898 [details]
slabinfo taken every 2 seconds
And did you still get OOM kills? If so, please attach the dmesg output as well.
Larry

No OOM kill messages, though I was watching dmesg. The box just locked up this time after I saw:
ENOMEM in journal_alloc_journal_head, retrying.
However, from /var/log/messages I see this after a power reset:
Oct 6 19:02:51 foxhound kernel: ENOMEM in journal_alloc_journal_head, retrying.
Oct 6 19:03:02 foxhound kernel: Mem-info:
Oct 6 19:03:02 foxhound kernel: Zone:DMA freepages: 1293 min: 0 low: 0 high: 0
Oct 6 19:03:02 foxhound kernel: Zone:Normal freepages: 638 min: 1279 low: 4544 high: 6304
Thoughts?

Joshua, can you retry it and see if you get more data from AltSysrq M?
Larry

I have access to a serial console that goes to a concentrator that I can telnet to, an HP ILO "console", and an ssh session. I can't get any of them to work with AltSysrq M (yes, I enabled it). Are they supposed to work, or do I have to have a direct serial or keyboard connection?
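
The Mem-info lines quoted above show the Normal zone with fewer free pages than its min watermark (638 against 1279). A minimal sketch of how such lines could be checked automatically follows; the regular expression and the function name are illustrative assumptions based only on the two zone lines in this report, not on a documented log format.

```python
import re

# Pattern modelled on the two "Zone:" lines quoted in this report (assumption).
ZONE_RE = re.compile(
    r"Zone:(?P<zone>\w+)\s+freepages:\s*(?P<free>\d+)\s+min:\s*(?P<min>\d+)"
    r"\s+low:\s*(?P<low>\d+)\s+high:\s*(?P<high>\d+)"
)


def zones_below_min(log_text):
    """Return the names of zones whose free page count is under 'min'."""
    starved = []
    for m in ZONE_RE.finditer(log_text):
        if int(m.group("free")) < int(m.group("min")):
            starved.append(m.group("zone"))
    return starved


sample = """\
Oct 6 19:03:02 foxhound kernel: Zone:DMA freepages: 1293 min: 0 low: 0 high: 0
Oct 6 19:03:02 foxhound kernel: Zone:Normal freepages: 638 min: 1279 low: 4544 high: 6304
"""
print(zones_below_min(sample))
```

Run against the excerpt, the check flags only the Normal zone, which matches the earlier diagnosis that the normal memory zone is exhausted.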
Joshua, this latest hang that you saw was my bug. Can you grab the
appropriate kernel from here and give it a try?
>>>http://people.redhat.com/coughlan/RHEL3-perf-test/
Larry
When booting from this kernel, I see this in dmesg:
scsi::resize_dma_pool: WARNING, dma_sectors=19632, wanted=26160, scaling WARNING, not enough memory, pool not expanded
Not sure if this has anything to do with anything though.

Hmmm... just before the scsi::resize_dma_pool message I see about 100 of these lines:
Oct 13 18:06:23 foxhound kernel: Unable to attach sg device <3, 0, 0, 229> type=0, minor number exceed 255
Oct 13 18:06:23 foxhound kernel: Unable to attach sg device <3, 0, 0, 230> type=0, minor number exceed 255
Oct 13 18:06:23 foxhound kernel: Unable to attach sg device <3, 0, 0, 231> type=0, minor number exceed 255
Oct 13 18:06:23 foxhound kernel: Unable to attach sg device <3, 0, 0, 232> type=0, minor number exceed 255
Yes, this is attached to a massive SAN setup. Don't know if this is related though.

Created attachment 105166 [details]
OOM messages
Ok... with the new perf-test kernel, I can still get dd to crash things... and this time it gave lots of OOM messages. See the attached file I just created in Comment #14.

Wait a minute here, you are running an SMP kernel on a 32GB system! We don't support that; more than half of lowmem is consumed by the mem_map at boot time. Please grab the hugemem kernel and run with that and let me know if that runs without problems ASAP.
Larry

Back to the original kernel we started off with, 15.0.4, in hugemem form. Everything is working fine:
             total       used       free     shared    buffers     cached
Mem:         32277      23090       9186          0        113      20161
-/+ buffers/cache:       2815      29462
Swap:         4094          0       4094
What is the line between the smp and hugemem kernel... is it 16 gigs or more of memory?

Yes, up to 16GB the smp kernel should be used and above 16GB the hugemem kernel should be used. BTW, the installation procedure should follow these rules; if it does not, please open a bug for that.

Thanks for your help Joshua,
Larry Woodman
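
The rule stated in the last comment (smp up to 16GB, hugemem above 16GB) can be written down as a small check. This is a sketch only: the function name is hypothetical, the input is assumed to be total memory in MB as in the free output above, and it covers just the two kernel variants discussed in this report.

```python
SMP_LIMIT_GB = 16  # threshold stated in the comment above


def recommended_kernel(mem_total_mb):
    """Pick 'smp' or 'hugemem' from total RAM given in MB."""
    if mem_total_mb > SMP_LIMIT_GB * 1024:
        return "hugemem"
    return "smp"


# Total from the free output in this report (32277 MB).
print(recommended_kernel(32277))
```

For the machine in this report the check selects the hugemem kernel, in line with the resolution of the bug.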