Bug 149088 - kswapd0: page allocation failure. order:0, mode:0x50
Status: CLOSED DUPLICATE of bug 193542
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium Severity: medium
Assigned To: Larry Woodman
QA Contact: Brian Brock
Depends On:
Blocks: 176344
Reported: 2005-02-18 11:29 EST by Jeff Burke
Modified: 2007-11-30 17:07 EST (History)
Doc Type: Bug Fix
Last Closed: 2006-12-08 08:05:10 EST
Attachments
Ouput from console when running stresstest on kernel 2.6.9-22.0.1.ELsmp (59.31 KB, text/plain)
2005-11-22 05:52 EST, Trond H. Amundsen
Log output from page allocation failure (12.77 KB, text/plain)
2006-11-04 07:51 EST, Steve Bergman
Log output about OOM-Killer activity (2.43 KB, text/plain)
2006-11-04 07:55 EST, Steve Bergman
Description Jeff Burke 2005-02-18 11:29:43 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
While running the stress kernel regression suite, the system becomes unusable. The first observation is a message scrolling on all of the virtual terminals: "EXT3-fs error (device dm-0) in start_transaction: Journal has aborted".

LTP is a sub-component of the stress kernel test suite. The problem happens during the memory test phase of LTP.

Version-Release number of selected component (if applicable):
stress-kernel-1.3.0-26.src.rpm

How reproducible:
Always

Steps to Reproduce:
1. Install the stress-kernel rpm
2. Configure hell-hound.sh to run all tests (except for the destructive test)

  

Actual Results:
Feb 17 04:49:19 dhcp83-94 kernel: kswapd0: page allocation failure. order:0, mode:0x50
Feb 17 04:52:03 dhcp83-94 kernel:  [<c0145f97>] __alloc_pages+0x28b/0x298
Feb 17 04:52:15 dhcp83-94 kernel:  [<c0145fbc>] __get_free_pages+0x18/0x24
Feb 17 04:52:16 dhcp83-94 kernel:  [<c0149436>] kmem_getpages+0x15/0x94
Feb 17 04:52:16 dhcp83-94 kernel:  [<c014a0f6>] cache_grow+0x10a/0x236
Feb 17 04:52:16 dhcp83-94 kernel:  [<c014a419>] cache_alloc_refill+0x1f7/0x227
Feb 17 04:52:16 dhcp83-94 kernel:  [<c014a68b>] kmem_cache_alloc+0x46/0x4c
Feb 17 04:52:16 dhcp83-94 kernel:  [<c01673cc>] alloc_buffer_head+0xd/0x22
Feb 17 04:52:16 dhcp83-94 kernel:  [<c0164f3c>] create_buffers+0x21/0x8b
Feb 17 04:52:16 dhcp83-94 kernel:  [<c0165813>] create_empty_buffers+0x10/0x127
Feb 17 04:52:16 dhcp83-94 kernel:  [<d08a3712>] ext3_ordered_writepage+0x95/0x13a [ext3]
Feb 17 04:52:16 dhcp83-94 kernel:  [<c014c9b8>] pageout+0x88/0xc5
Feb 17 04:52:16 dhcp83-94 kernel:  [<c014cbff>] shrink_list+0x20a/0x4eb
Feb 17 04:52:50 dhcp83-94 kernel:  [<c0301b60>] common_interrupt+0x18/0x20
Feb 17 04:52:50 dhcp83-94 kernel:  [<c014d0df>] shrink_cache+0x1ff/0x454
Feb 17 04:52:50 dhcp83-94 kernel:  [<c014da99>] shrink_zone+0x8f/0x9e
Feb 17 04:52:51 dhcp83-94 kernel:  [<c014dde4>] balance_pgdat+0x188/0x2b5
Feb 17 04:52:51 dhcp83-94 kernel:  [<c014dfca>] kswapd+0xb9/0xbb
Feb 17 04:52:51 dhcp83-94 kernel:  [<c011d043>] autoremove_wake_function+0x0/0x2d
Feb 17 04:52:51 dhcp83-94 kernel:  [<c030193a>] ret_from_fork+0x6/0x14
Feb 17 04:52:51 dhcp83-94 kernel:  [<c011d043>] autoremove_wake_function+0x0/0x2d
Feb 17 04:52:51 dhcp83-94 kernel:  [<c014df11>] kswapd+0x0/0xbb
Feb 17 04:52:52 dhcp83-94 kernel:  [<c01041d9>] kernel_thread_helper+0x5/0xb
Feb 17 04:55:33 dhcp83-94 kernel: kswapd0: page allocation failure. order:0, mode:0x50
Feb 17 04:55:49 dhcp83-94 kernel:  [<c0145f97>] __alloc_pages+0x28b/0x298
Feb 17 04:56:55 dhcp83-94 kernel:  [<c0145fbc>] __get_free_pages+0x18/0x24
Feb 17 04:56:57 dhcp83-94 kernel:  [<c0149436>] kmem_getpages+0x15/0x94
Feb 17 04:56:57 dhcp83-94 kernel:  [<c014a0f6>] cache_grow+0x10a/0x236
Feb 17 04:56:57 dhcp83-94 kernel:  [<c014a419>] cache_alloc_refill+0x1f7/0x227
Feb 17 04:56:57 dhcp83-94 kernel:  [<c014a68b>] kmem_cache_alloc+0x46/0x4c
Feb 17 04:56:57 dhcp83-94 kernel:  [<c01673cc>] alloc_buffer_head+0xd/0x22
Feb 17 04:56:57 dhcp83-94 kernel:  [<c0164f3c>] create_buffers+0x21/0x8b
Feb 17 04:56:57 dhcp83-94 kernel:  [<c0165813>] create_empty_buffers+0x10/0x127
Feb 17 04:56:57 dhcp83-94 kernel:  [<d08a3712>] ext3_ordered_writepage+0x95/0x13a [ext3]
Feb 17 04:56:57 dhcp83-94 kernel:  [<c014c9b8>] pageout+0x88/0xc5
Feb 17 04:56:57 dhcp83-94 kernel:  [<c014cbff>] shrink_list+0x20a/0x4eb
Feb 17 04:57:27 dhcp83-94 kernel:  [<c0108622>] do_IRQ+0x239/0x242
Feb 17 04:57:28 dhcp83-94 kernel:  [<c0301b60>] common_interrupt+0x18/0x20
Feb 17 04:57:28 dhcp83-94 kernel:  [<c014d0df>] shrink_cache+0x1ff/0x454
Feb 17 04:57:28 dhcp83-94 kernel:  [<c014da99>] shrink_zone+0x8f/0x9e
Feb 17 04:57:29 dhcp83-94 kernel:  [<c014dde4>] balance_pgdat+0x188/0x2b5
Feb 17 04:57:29 dhcp83-94 kernel:  [<c014dfca>] kswapd+0xb9/0xbb
Feb 17 04:57:29 dhcp83-94 kernel:  [<c011d043>] autoremove_wake_function+0x0/0x2d
Feb 17 04:57:29 dhcp83-94 kernel:  [<c030193a>] ret_from_fork+0x6/0x14
Feb 17 04:57:29 dhcp83-94 kernel:  [<c011d043>] autoremove_wake_function+0x0/0x2d
Feb 17 04:57:30 dhcp83-94 kernel:  [<c014df11>] kswapd+0x0/0xbb
Feb 17 04:57:30 dhcp83-94 kernel:  [<c01041d9>] kernel_thread_helper+0x5/0xb


Expected Results:  Test would proceed without failure

Additional info:

people.redhat.com:/var/data/home/jburke/public_html/.test/stress-kernel-1.3.0-26.src.rpm

The system is a single-CPU Celeron 533MHz with the minimum memory configuration of 256 MB.

This same system passes this test with RHEL3 U4.
Comment 3 Dave Jones 2005-02-23 23:22:10 EST
Jeff, which kernel version was this? A number of leaks have now been plugged
in the beta U1 kernels in dist-4E-U1.
Comment 4 Jeff Burke 2005-02-24 08:50:52 EST
Dave,
    This was the Day0 (2.6.9-5.0.3) i686 kernel.

    I am planning on starting the stress kernel suite on RHEL 4 Beta1 
U1 kernel on Friday afternoon.
Comment 5 Larry Woodman 2005-02-24 14:05:45 EST
Jeff, please grab me an AltSysrq-M output when this happens.  The stack
traceback is nice but I also need a show_mem() output in order to debug this
problem.

Thanks, Larry Woodman
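
For machines where the keyboard combination is not practical (for example a remote or serial-console box), the same show_mem() dump can be requested by writing the letter "m" to /proc/sysrq-trigger as root. The following is only a minimal sketch: the helper name `request_show_mem` is hypothetical, and the path is a parameter so the function can be tried against an ordinary file first.

```python
from pathlib import Path


def request_show_mem(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel for the same memory dump that Alt-SysRq-M produces.

    Writing the single character "m" to the trigger file requests the dump;
    the output lands in the kernel log (dmesg, /var/log/messages).
    Writing to the real /proc file requires root.
    """
    Path(trigger_path).write_text("m")


# Typical use, as root, right after the allocation failure appears:
#   request_show_mem()
```

The dump then needs to be copied out of the kernel log and attached to the bug.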
Comment 6 Dave Jones 2005-02-24 14:38:45 EST
Larry, your patch to show_mem() at oom_kill time was integrated into the U1
kernels, so you will get some extra info when Jeff tries that one.
Comment 7 Larry Woodman 2005-02-25 09:15:10 EST
Jeff, can you grab the latest kernel and see if this still happens?

Larry
Comment 8 Jeff Burke 2005-02-25 11:26:14 EST
Larry,
   I have installed the 2.6.9-6.7 EL kernel and started the tests. I will get the
AltSysrq-M output when the problem occurs.

Jeff
Comment 9 Jeff Burke 2005-03-01 13:57:08 EST
Larry,
   With 2.6.9-6.7 the issue still happens. I missed getting the AltSysRq+M
output, but here is the failure.

   Also, I have moved to the 2.6.9-6.11 kernel and restarted the tests. Hopefully
I will get it this time.

Feb 28 21:00:46 dhcp83-94 kernel: kswapd0: page allocation failure. order:0, mode:0x50
Feb 28 21:00:46 dhcp83-94 kernel:  [<c013f96f>] __alloc_pages+0x28b/0x298
Feb 28 21:00:46 dhcp83-94 kernel:  [<c013f994>] __get_free_pages+0x18/0x24
Feb 28 21:00:46 dhcp83-94 kernel:  [<c01422d0>] kmem_getpages+0x1c/0xbb
Feb 28 21:00:46 dhcp83-94 kernel:  [<c0142e21>] cache_grow+0xae/0x136
Feb 28 21:00:46 dhcp83-94 kernel:  [<c014300e>] cache_alloc_refill+0x165/0x19d
Feb 28 21:00:46 dhcp83-94 kernel:  [<c0143209>] kmem_cache_alloc+0x51/0x57
Feb 28 21:00:46 dhcp83-94 kernel:  [<c0159ec0>] alloc_buffer_head+0xd/0x34
Feb 28 21:01:10 dhcp83-94 kernel:  [<c0157b8a>] create_buffers+0x21/0x8b
Feb 28 21:01:11 dhcp83-94 kernel:  [<c0158329>] create_empty_buffers+0x11/0x70
Feb 28 21:01:11 dhcp83-94 kernel:  [<d087b202>] ext3_ordered_writepage+0x95/0x13a [ext3]
Feb 28 21:01:11 dhcp83-94 kernel:  [<c0144ebd>] pageout+0x8d/0xcc
Feb 28 21:01:11 dhcp83-94 kernel:  [<c0145104>] shrink_list+0x208/0x3ee
Feb 28 21:01:11 dhcp83-94 kernel:  [<c02c793c>] common_interrupt+0x18/0x20
Feb 28 21:01:11 dhcp83-94 kernel:  [<c01454c7>] shrink_cache+0x1dd/0x34d
Feb 28 21:01:12 dhcp83-94 kernel:  [<c0145b85>] shrink_zone+0xa7/0xb6
Feb 28 21:01:12 dhcp83-94 kernel:  [<c0145f28>] balance_pgdat+0x1b6/0x2f8
Feb 28 21:01:12 dhcp83-94 kernel:  [<c011f5a9>] prepare_to_wait+0x12/0x4c
Feb 28 21:01:12 dhcp83-94 kernel:  [<c0146134>] kswapd+0xca/0xcc
Feb 28 21:01:12 dhcp83-94 kernel:  [<c011f67e>] autoremove_wake_function+0x0/0x2d
Feb 28 21:01:12 dhcp83-94 kernel:  [<c02c6e9e>] ret_from_fork+0x6/0x14
Feb 28 21:01:12 dhcp83-94 kernel:  [<c011f67e>] autoremove_wake_function+0x0/0x2d
Feb 28 21:01:12 dhcp83-94 kernel:  [<c014606a>] kswapd+0x0/0xcc
Feb 28 21:01:12 dhcp83-94 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb


Comment 10 Jeff Burke 2005-03-04 08:28:42 EST
Larry,
    With your test kernel .18 I was able to get the info when it happened. Here
you go.

Mar  4 07:42:58 dhcp83-94 kernel: kswapd0: page allocation failure. order:0, mode:0x50
Mar  4 07:42:58 dhcp83-94 kernel:  [<c013fa3d>] __alloc_pages+0x28b/0x29d
Mar  4 07:42:58 dhcp83-94 kernel:  [<c011f722>] autoremove_wake_function+0x0/0x2d
Mar  4 07:42:58 dhcp83-94 kernel:  [<c013fa67>] __get_free_pages+0x18/0x24
Mar  4 07:42:58 dhcp83-94 kernel:  [<c01423a8>] kmem_getpages+0x1c/0xbb
Mar  4 07:42:58 dhcp83-94 kernel:  [<c0142ef9>] cache_grow+0xae/0x136
Mar  4 07:42:58 dhcp83-94 kernel:  [<c01430e6>] cache_alloc_refill+0x165/0x19d
Mar  4 07:42:58 dhcp83-94 kernel:  [<c01432e1>] kmem_cache_alloc+0x51/0x57
Mar  4 07:42:58 dhcp83-94 kernel:  [<d0837946>] journal_alloc_journal_head+0x10/0x5d [jbd]
Mar  4 07:42:58 dhcp83-94 kernel:  [<d08379b9>] journal_add_journal_head+0x1a/0xe6 [jbd]
Mar  4 07:42:58 dhcp83-94 kernel:  [<d0832019>] journal_dirty_data+0x31/0x1b2 [jbd]
Mar  4 07:42:58 dhcp83-94 kernel:  [<d087be25>] ext3_journal_dirty_data+0xc/0x2a [ext3]
Mar  4 07:42:58 dhcp83-94 kernel:  [<d087bcbb>] walk_page_buffers+0x62/0x87 [ext3]
Mar  4 07:42:59 dhcp83-94 kernel:  [<d087c25d>] ext3_ordered_writepage+0xee/0x13a [ext3]
Mar  4 07:42:59 dhcp83-94 kernel:  [<d087c15d>] journal_dirty_data_fn+0x0/0x12 [ext3]
Mar  4 07:42:59 dhcp83-94 kernel:  [<c0144fb3>] pageout+0x8f/0xce
Mar  4 07:42:59 dhcp83-94 kernel:  [<c01451fd>] shrink_list+0x20b/0x3f5
Mar  4 07:42:59 dhcp83-94 kernel:  [<c0144470>] __pagevec_release+0x15/0x1d
Mar  4 07:42:59 dhcp83-94 kernel:  [<c01455c4>] shrink_cache+0x1dd/0x34d
Mar  4 07:42:59 dhcp83-94 kernel:  [<c0145c82>] shrink_zone+0xa7/0xb6
Mar  4 07:42:59 dhcp83-94 kernel:  [<c0146025>] balance_pgdat+0x1b6/0x2f8
Mar  4 07:42:59 dhcp83-94 kernel:  [<c011f64d>] prepare_to_wait+0x12/0x4c
Mar  4 07:42:59 dhcp83-94 kernel:  [<c0146231>] kswapd+0xca/0xcc
Mar  4 07:42:59 dhcp83-94 kernel:  [<c011f722>] autoremove_wake_function+0x0/0x2d
Mar  4 07:42:59 dhcp83-94 kernel:  [<c02c708a>] ret_from_fork+0x6/0x14
Mar  4 07:42:59 dhcp83-94 kernel:  [<c011f722>] autoremove_wake_function+0x0/0x2d
Mar  4 07:42:59 dhcp83-94 kernel:  [<c0146167>] kswapd+0x0/0xcc
Mar  4 07:42:59 dhcp83-94 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Mar  4 07:42:59 dhcp83-94 kernel: Mem-info:
Mar  4 07:42:59 dhcp83-94 kernel: DMA per-cpu:
Mar  4 07:42:59 dhcp83-94 kernel: cpu 0 hot: low 2, high 6, batch 1
Mar  4 07:42:59 dhcp83-94 kernel: cpu 0 cold: low 0, high 2, batch 1
Mar  4 07:42:59 dhcp83-94 kernel: Normal per-cpu:
Mar  4 07:42:59 dhcp83-94 kernel: cpu 0 hot: low 28, high 84, batch 14
Mar  4 07:42:59 dhcp83-94 kernel: cpu 0 cold: low 0, high 28, batch 14
Mar  4 07:42:59 dhcp83-94 kernel: HighMem per-cpu: empty
Mar  4 07:42:59 dhcp83-94 kernel:
Mar  4 07:43:00 dhcp83-94 kernel: Free pages:           0kB (0kB HighMem)
Mar  4 07:43:00 dhcp83-94 kernel: Active:33150 inactive:22347 dirty:206 writeback:3861 unstable:0 free:0 slab:3366 mapped:45788 pagetables:1045
Mar  4 07:43:00 dhcp83-94 kernel: DMA free:0kB min:28kB low:56kB high:84kB active:6584kB inactive:5568kB present:16384kB
Mar  4 07:43:00 dhcp83-94 kernel: protections[]: 0 0 0
Mar  4 07:43:00 dhcp83-94 kernel: Normal free:0kB min:476kB low:952kB high:1428kB active:126016kB inactive:83820kB present:245744kB
Mar  4 07:43:00 dhcp83-94 kernel: protections[]: 0 0 0
Mar  4 07:43:00 dhcp83-94 kernel: HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
Mar  4 07:43:00 dhcp83-94 kernel: protections[]: 0 0 0
Mar  4 07:43:00 dhcp83-94 kernel: DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Mar  4 07:43:00 dhcp83-94 kernel: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Mar  4 07:43:00 dhcp83-94 kernel: HighMem: empty
Mar  4 07:43:00 dhcp83-94 kernel: Swap cache: add 640173, delete 638147, find 122421/152956, race 3+2
Mar  4 07:43:00 dhcp83-94 kernel: Free swap:       470928kB
Mar  4 07:43:00 dhcp83-94 kernel: 65532 pages of RAM
Mar  4 07:43:00 dhcp83-94 kernel: 0 pages of HIGHMEM
Mar  4 07:43:00 dhcp83-94 kernel: 1820 reserved pages
Mar  4 07:43:00 dhcp83-94 kernel: 53597 pages shared
Mar  4 07:43:00 dhcp83-94 kernel: 2026 pages swap cached
Mar  4 07:43:00 dhcp83-94 kernel: ENOMEM in journal_alloc_journal_head, retrying.
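
The per-zone lines in the Mem-info dump above can be checked mechanically: a populated zone whose "free" value is under its "min" watermark has nothing left to hand out. The sketch below is only an illustration of that comparison. The function name `zones_below_min` and the regular expression are assumptions made for this example, not part of any kernel tool, and the sample text is copied from the dump in this comment.

```python
import re

# Matches the zone summary lines printed by show_mem() in this report.
ZONE_RE = re.compile(
    r"(?P<zone>DMA|Normal|HighMem) free:(?P<free>\d+)kB min:(?P<min>\d+)kB"
    r".*?present:(?P<present>\d+)kB"
)


def zones_below_min(text):
    """Return [(zone, free_kB, min_kB)] for populated zones with free < min."""
    flat = " ".join(text.split())  # undo any line wrapping from the paste
    hits = []
    for m in ZONE_RE.finditer(flat):
        free, low_mark, present = int(m["free"]), int(m["min"]), int(m["present"])
        if present > 0 and free < low_mark:  # skip empty zones such as HighMem here
            hits.append((m["zone"], free, low_mark))
    return hits


sample = """
DMA free:0kB min:28kB low:56kB high:84kB
active:6584kB inactive:5568kB present:16384kB
Normal free:0kB min:476kB low:952kB
high:1428kB active:126016kB inactive:83820kB present:245744kB
HighMem free:0kB min:128kB low:256kB
high:384kB active:0kB inactive:0kB present:0kB
"""
print(zones_below_min(sample))  # [('DMA', 0, 28), ('Normal', 0, 476)]
```

For this dump both populated zones are flagged, which matches the "Free pages: 0kB" line.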
Comment 11 Larry Woodman 2005-04-06 14:12:50 EDT
Jeff, can you grab the latest kernel and try to reproduce this problem with it?

Thanks, Larry
Comment 12 Jeff Burke 2005-04-07 09:42:13 EDT
Larry,
  Do you mean the latest DaveJ built kernel 2.6.9-6.37 or the latest Larry
kernel 2.6.9-6.39?

FYI I will start with the DaveJ kernel if that is incorrect let me know.

Thanks Jeff
Comment 13 Jeff Burke 2005-04-11 13:44:57 EDT
Larry,
     With my current testing of 2.6.9-37, I do _not_ see those messages being
printed any longer.

Comment 14 Jeff Burke 2005-04-30 11:11:21 EDT
Larry,
      I am seeing many more of these with the current kernel, 2.6.9-6.43.
For example, this is from the /var/log/messages file on pe2850he.lab:

Apr 28 12:26:36 pe2850he kernel: printk: 907405 messages suppressed.
Apr 28 12:26:36 pe2850he kernel: kswapd0: page allocation failure. order:0, mode:0x50
Apr 28 12:26:37 pe2850he kernel: Call Trace: \
                         <ffffffff80157ebe>{__alloc_pages+768} \
                         <ffffffff8016cddc>{alloc_page_interleave+61} \
                         <ffffffff80154450>{find_or_create_page+53} \
                         <ffffffff80176632>{__getblk_slow+237} \
                         <ffffffff80176789>{__getblk+60} \
                         <ffffffff8017679e>{__bread+6} \
                         <ffffffffa00790b8>{:ext3:read_block_bitmap+50} \
                         <ffffffffa007a1ac>{:ext3:ext3_new_block+629} \
                         <ffffffffa007c396>{:ext3:ext3_alloc_block+7} \
                         <ffffffffa007df69>{:ext3:ext3_get_block_handle+863} \
                         <ffffffff801757e7>{__block_write_full_page+198} \
                         <ffffffffa007e3be>{:ext3:ext3_get_block+0} \
                         <ffffffffa007cb26>{:ext3:ext3_ordered_writepage+245} \
                         <ffffffff8015ee5b>{shrink_zone+3095} \
                         <ffffffff80133926>{autoremove_wake_function+0} \
                         <ffffffff8015f773>{balance_pgdat+506} \
                         <ffffffff8015f9bd>{kswapd+252} \
                         <ffffffff80133926>{autoremove_wake_function+0} \
                         <ffffffff80130e75>{finish_task_switch+55} \
                         <ffffffff80133926>{autoremove_wake_function+0} \
                         <ffffffff80130ec4>{schedule_tail+11} \
                         <ffffffff80110c8f>{child_rip+8} \
                         <ffffffff8015f8c1>{kswapd+0} \
                         <ffffffff80110c87>{child_rip+0}
Apr 28 12:27:06 pe2850he kernel: Mem-info:
Apr 28 12:27:12 pe2850he kernel: Node 0 DMA per-cpu:
Apr 28 12:27:18 pe2850he kernel: cpu 0 hot: low 2, high 6, batch 1
Apr 28 12:27:42 pe2850he kernel: cpu 0 cold: low 0, high 2, batch 1
Apr 28 12:30:23 pe2850he kernel: cpu 1 hot: low 2, high 6, batch 1
Apr 28 12:30:42 pe2850he kernel: cpu 1 cold: low 0, high 2, batch 1
Apr 28 12:33:07 pe2850he kernel: cpu 2 hot: low 2, high 6, batch 1
Apr 28 12:33:08 pe2850he kernel: cpu 2 cold: low 0, high 2, batch 1
Apr 28 12:33:08 pe2850he kernel: cpu 3 hot: low 2, high 6, batch 1
Apr 28 12:33:09 pe2850he kernel: cpu 3 cold: low 0, high 2, batch 1
Apr 28 12:33:09 pe2850he kernel: Node 0 Normal per-cpu:
Apr 28 12:33:09 pe2850he kernel: cpu 0 hot: low 32, high 96, batch 16
Apr 28 12:33:09 pe2850he kernel: cpu 0 cold: low 0, high 32, batch 16
Apr 28 12:33:09 pe2850he kernel: cpu 1 hot: low 32, high 96, batch 16
Apr 28 12:33:09 pe2850he kernel: cpu 1 cold: low 0, high 32, batch 16
Apr 28 12:33:09 pe2850he kernel: cpu 2 hot: low 32, high 96, batch 16
Apr 28 12:33:15 pe2850he kernel: cpu 2 cold: low 0, high 32, batch 16
Apr 28 12:33:15 pe2850he kernel: cpu 3 hot: low 32, high 96, batch 16
Apr 28 12:33:15 pe2850he kernel: cpu 3 cold: low 0, high 32, batch 16
Apr 28 12:33:15 pe2850he kernel: Node 0 HighMem per-cpu: empty
Apr 28 12:33:15 pe2850he kernel:
Apr 28 12:33:15 pe2850he kernel: Free pages:       11620kB (0kB HighMem)
Apr 28 12:33:15 pe2850he kernel: Active:791995 inactive:196868 dirty:5715 writeback:19509 unstable:0 free:2905 slab:11521 mapped:960429 pagetables:2765
Apr 28 12:33:15 pe2850he kernel: Node 0 DMA free:11620kB min:4kB low:8kB high:12kB active:0kB inactive:0kB present:16384kB pages_scanned:6034 all_unreclaimable? yes
Apr 28 12:33:15 pe2850he kernel: protections[]: 0 0 0
Apr 28 12:33:15 pe2850he kernel: Node 0 Normal free:0kB min:2276kB low:4552kB high:6828kB active:3167980kB inactive:787472kB present:5226496kB pages_scanned:99 all_unreclaimable? no
Apr 28 12:33:15 pe2850he kernel: protections[]: 0 0 0
Apr 28 12:33:15 pe2850he kernel: Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Apr 28 12:33:15 pe2850he kernel: protections[]: 0 0 0
Apr 28 12:33:15 pe2850he kernel: Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 11620kB
Apr 28 12:33:15 pe2850he kernel: Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Apr 28 12:33:15 pe2850he kernel: Node 0 HighMem: empty
Apr 28 12:33:15 pe2850he kernel: Swap cache: add 8378150, delete 8373743, find 2968285/3565680, race 0+15
Apr 28 12:33:15 pe2850he kernel: Free swap:       3070772kB
Apr 28 12:33:15 pe2850he kernel: 1310720 pages of RAM
Apr 28 12:33:15 pe2850he kernel: 301180 reserved pages
Apr 28 12:33:15 pe2850he kernel: 88599 pages shared
Apr 28 12:33:15 pe2850he kernel: 4555 pages swap cached
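
As a sanity check on the dump above, the free-list line for a zone should add up to the total printed after the "=" sign, and that total should agree with the global "free:" page count. The sketch below redoes that arithmetic for the DMA zone. The helper name `buddy_total_kb` is hypothetical, and the 4 kB page size used for the second check is an assumption about this machine.

```python
import re

TERM_RE = re.compile(r"(\d+)\*(\d+)kB")  # one "count*sizekB" term


def buddy_total_kb(line):
    """Return (sum of count*size terms, total reported after '=') in kB."""
    computed = sum(int(count) * int(size) for count, size in TERM_RE.findall(line))
    reported = int(re.search(r"=\s*(\d+)kB", line).group(1))
    return computed, reported


dma_line = ("Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB "
            "0*512kB 1*1024kB 1*2048kB 2*4096kB = 11620kB")
print(buddy_total_kb(dma_line))  # (11620, 11620)

# Cross-check against "free:2905" from the same dump, assuming 4 kB pages.
print(2905 * 4)  # 11620
```

In other words, every free page on this system sits in the DMA zone, while the Normal zone, where the failing allocations are made, has none.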

Comment 15 Larry Woodman 2005-05-01 23:18:04 EDT
Exactly when did this start happening?  Based on comment #13 the .37 kernel ran
OK, but based on comment #14 the .43 kernel started printing out a lot more
allocation failures.

These are the changes:

* Sat Apr 30 2005 Dave Jones <davej@redhat.com> [2.6.9-6.43]
- CAN-2005-0136 ptrace corner cases on ia64. (#155283)
- Redo fix to avoid sleep in timer context in E1000. (#154880,#154944,#154951)
- Fix sctp sendbuffer accounting. (#146797)

* Wed Apr 27 2005 Dave Jones <davej@redhat.com> [2.6.9-6.42]
- Fix oopsable locking in NFS. (#152557)
- Fix incorrect use of memset/memcpy in ia64 ia32 signal handling.
- Cope with faults in iret/exec-shield bug (#154221, #154972)
- Potential NULL mm in swap token code. (#154639)

* Wed Apr 27 2005 Dave Jones <davej@redhat.com> [2.6.9-6.41]
- Fix up tty locking. (#152600, #155765)

* Fri Apr 22 2005 Dave Jones <davej@redhat.com> [2.6.9-6.40]
- Update the PCI EEH error recovery patch.

* Thu Apr 14 2005 Dave Jones <davej@redhat.com> [2.6.9-6.39]
- ata_piix: broken BIOS AHCI BAR setup work-around. (#154712)

* Wed Apr 13 2005 Dave Jones <davej@redhat.com> [2.6.9-6.38]
- Fix possible corruption in mmap over NFS. (#151284)
- Don't cache /proc/pid dentry for dead processes. (#147832)
- Add Itanium ZX2 chipset identification (#150110)
- Don't enable NUMA by default on dual core systems.

* Wed Mar 30 2005 Dave Jones <davej@redhat.com>
- Really apply patch for PCI EEH error recovery on ppc64.
Comment 16 Jeff Burke 2005-05-02 08:38:31 EDT
Larry,
Comment #4 has this starting on the Day 0 kernel. You are correct that in
comment #13 I did not see the issue. But perhaps my sample set was not as large
as I needed it to be.

When I tested .37 for the issue, I tested the machine on which I had originally
seen this problem. Comment #14 is from a different system, pe2850he.lab. The
second system has 4GB of RAM whereas the first system has 384MB of RAM.
Comment 17 Martin Svensson 2005-05-03 05:54:53 EDT
We are seeing the same thing here on a RHELES 4 running 2.6.9-5.0.3.ELsmp.
I have also had the ext3 journal problems described in previous posts.

kswapd0: page allocation failure. order:0, mode:0x50
 [<c013f1ab>] __alloc_pages+0x28b/0x298
 [<c013f1d0>] __get_free_pages+0x18/0x24
 [<c0141b0c>] kmem_getpages+0x1c/0xbb
 [<c014265d>] cache_grow+0xae/0x136
 [<c014284a>] cache_alloc_refill+0x165/0x19d
 [<c0142a45>] kmem_cache_alloc+0x51/0x57
 [<c01596d8>] alloc_buffer_head+0xd/0x34
 [<c01573a2>] create_buffers+0x21/0x8b
 [<c0157b41>] create_empty_buffers+0x11/0x70
 [<f889d202>] ext3_ordered_writepage+0x95/0x13a [ext3]
 [<c01446d5>] pageout+0x8d/0xcc
 [<c014491c>] shrink_list+0x208/0x3ee
 [<c0144cdf>] shrink_cache+0x1dd/0x34d
 [<c014539d>] shrink_zone+0xa7/0xb6
 [<c0145740>] balance_pgdat+0x1b6/0x2f8
 [<c014594c>] kswapd+0xca/0xcc
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c02c5fca>] ret_from_fork+0x6/0x14
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c0145882>] kswapd+0x0/0xcc
 [<c01041f1>] kernel_thread_helper+0x5/0xb
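
When comparing the i686 traces posted in this bug, it can help to reduce each one to its symbol names and owning modules, so that the shared path (kswapd reclaim calling into ext3 writepage and then into the allocator) stands out regardless of addresses. The sketch below is one way to do that. The helper name `frames` and the regular expression are assumptions for illustration only, and frames without a module tag are labelled "kernel" here purely as a convention of this example.

```python
import re

# "[<address>] symbol+0xoff/0xlen" with an optional trailing "[module]".
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def frames(trace):
    """Return [(symbol, module)] in the order the frames were printed."""
    return [(sym, mod or "kernel") for _addr, sym, mod in FRAME_RE.findall(trace)]


trace = """
 [<c013f1ab>] __alloc_pages+0x28b/0x298
 [<f889d202>] ext3_ordered_writepage+0x95/0x13a [ext3]
 [<c01446d5>] pageout+0x8d/0xcc
 [<c0145882>] kswapd+0x0/0xcc
"""
for symbol, module in frames(trace):
    print(f"{module:8s} {symbol}")
```

The same function works on lines that still carry the syslog timestamp prefix, since the pattern only looks at the bracketed address and what follows it.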
Comment 18 Jeff Burke 2005-05-03 13:49:54 EDT
After speaking with Larry, this issue is going to be worked as part of the
RHEL4 U2 release.

The messages that are being printed to the /var/log/messages file may look
detrimental but are in fact informational. In RHEL3 the same code exists but the
printk is commented out. In RHEL4 the printk was uncommented and additional
debug code was added to capture the memory settings at the time the alloc_page
failure occurred. Subsequent page allocations may be successful if the
application retries alloc_page.

As for the original post of this BZ, I had commented: "The first observation
is a message scrolling on all of the virtual terminals "EXT3-fs error (device
dm-0) in start_transaction: Journal has aborted"".
That error was due to a failing hard disk. Once the drive was replaced I never
saw that message again.
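
The "mode:" value in these messages is the allocation's flag bitmask, so it can be decoded to see what kind of request failed. The sketch below does this for the two values that appear in this bug (0x50 and 0x850). Note the assumption: the bit values in the table are as recalled from the 2.6.9-era include/linux/gfp.h and should be checked against the source of the kernel actually in use. The name `decode_gfp` is hypothetical.

```python
# ASSUMPTION: bit values as recalled from the 2.6.9-era include/linux/gfp.h.
# Verify against the kernel source you are running before relying on them.
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x800: "__GFP_NOFAIL",
}


def decode_gfp(mode):
    """Return the flag names set in mode; unknown bits are reported as hex."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    known = 0
    for bit in GFP_BITS:
        known |= bit
    leftover = mode & ~known
    if leftover:
        names.append(hex(leftover))
    return names


print(decode_gfp(0x50))   # ['__GFP_WAIT', '__GFP_IO']
print(decode_gfp(0x850))  # ['__GFP_WAIT', '__GFP_IO', '__GFP_NOFAIL']
```

Under those assumed values, 0x50 is a request that may sleep and start I/O but has the filesystem bit clear, which fits an allocation made from inside the ext3 writepage path.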
Comment 19 Jeremy Sanders 2005-08-24 11:03:18 EDT
We have a computer running kernel-2.6.9-11.EL (i686, Athlon) which reproducibly
produces this error when running an application:

Aug 19 14:37:26 xpc12 kernel: kswapd0: page allocation failure. order:0, mode:0x850
Aug 19 14:37:26 xpc12 kernel:  [<c0146e13>] __alloc_pages+0x28b/0x29d
Aug 19 14:37:26 xpc12 kernel:  [<c0146e3d>] __get_free_pages+0x18/0x24
Aug 19 14:37:26 xpc12 kernel:  [<c014a2ca>] kmem_getpages+0x15/0x94
Aug 19 14:37:26 xpc12 kernel:  [<c014af8a>] cache_grow+0x10a/0x236
Aug 19 14:37:26 xpc12 kernel:  [<c014b2ad>] cache_alloc_refill+0x1f7/0x227
Aug 19 14:37:26 xpc12 kernel:  [<c014b7e2>] __kmalloc+0x6b/0x7d
Aug 19 14:37:26 xpc12 kernel:  [<e084a522>] __jbd_kmalloc+0x16/0x17 [jbd]
Aug 19 14:37:26 xpc12 kernel:  [<e0841833>] journal_get_undo_access+0x58/0x122 [jbd]
Aug 19 14:37:26 xpc12 kernel:  [<e0876d09>] ext3_try_to_allocate_with_rsv+0x40/0x358 [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<c0301d54>] __cond_resched+0x14/0x3b
Aug 19 14:37:26 xpc12 kernel:  [<e0877309>] ext3_new_block+0x260/0x581 [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<c01664e6>] bh_lru_install+0x8b/0x93
Aug 19 14:37:26 xpc12 kernel:  [<e08794d8>] ext3_alloc_block+0x9/0xb [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<e08797ea>] ext3_alloc_branch+0x4a/0x25a [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<e0879d21>] ext3_get_block_handle+0x1b7/0x276 [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<e0879e44>] ext3_get_block+0x64/0x6c [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<c0166a01>] __block_write_full_page+0xd0/0x2a6
Aug 19 14:37:26 xpc12 kernel:  [<e0879de0>] ext3_get_block+0x0/0x6c [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<c0167dcf>] block_write_full_page+0xa4/0xad
Aug 19 14:37:26 xpc12 kernel:  [<e0879de0>] ext3_get_block+0x0/0x6c [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<e087a74b>] ext3_ordered_writepage+0xce/0x13a [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<e087a65f>] bget_one+0x0/0x6 [ext3]
Aug 19 14:37:26 xpc12 kernel:  [<c014d89c>] pageout+0x88/0xc5
Aug 19 14:37:26 xpc12 kernel:  [<c014dae3>] shrink_list+0x20a/0x4eb
Aug 19 14:37:26 xpc12 kernel:  [<c014dfc3>] shrink_cache+0x1ff/0x454
Aug 19 14:37:26 xpc12 kernel:  [<c014e97d>] shrink_zone+0x8f/0x9e
Aug 19 14:37:26 xpc12 kernel:  [<c014ed13>] balance_pgdat+0x188/0x2b5
Aug 19 14:37:26 xpc12 kernel:  [<c011bdb9>] recalc_task_prio+0x128/0x133
Aug 19 14:37:26 xpc12 kernel:  [<c014eef9>] kswapd+0xb9/0xbb
Aug 19 14:37:26 xpc12 kernel:  [<c011deb4>] autoremove_wake_function+0x0/0x2d
Aug 19 14:37:26 xpc12 kernel:  [<c03034ce>] ret_from_fork+0x6/0x14
Aug 19 14:37:26 xpc12 kernel:  [<c011deb4>] autoremove_wake_function+0x0/0x2d
Aug 19 14:37:26 xpc12 kernel:  [<c014ee40>] kswapd+0x0/0xbb
Aug 19 14:37:26 xpc12 kernel:  [<c01041d9>] kernel_thread_helper+0x5/0xb
Aug 19 14:37:26 xpc12 kernel: Mem-info:
Aug 19 14:37:26 xpc12 kernel: DMA per-cpu:
Aug 19 14:37:26 xpc12 kernel: cpu 0 hot: low 2, high 6, batch 1
Aug 19 14:37:26 xpc12 kernel: cpu 0 cold: low 0, high 2, batch 1
Aug 19 14:37:26 xpc12 kernel: Normal per-cpu:
Aug 19 14:37:26 xpc12 kernel: cpu 0 hot: low 32, high 96, batch 16
Aug 19 14:37:26 xpc12 kernel: cpu 0 cold: low 0, high 32, batch 16
Aug 19 14:37:26 xpc12 kernel: HighMem per-cpu: empty
Aug 19 14:37:26 xpc12 kernel: 
Aug 19 14:37:26 xpc12 kernel: Free pages:           0kB (0kB HighMem)
Aug 19 14:37:26 xpc12 kernel: Active:91730 inactive:31462 dirty:42727 writeback:0 unstable:0 free:0 slab:3413 mapped:36331 pagetables:747
Aug 19 14:37:26 xpc12 kernel: DMA free:0kB min:20kB low:40kB high:60kB active:4416kB inactive:8232kB present:16384kB pages_scanned:0 all_unreclaimable? no
Aug 19 14:37:26 xpc12 kernel: protections[]: 0 0 0
Aug 19 14:37:26 xpc12 kernel: Normal free:0kB min:700kB low:1400kB high:2100kB active:362504kB inactive:117616kB present:507824kB pages_scanned:0 all_unreclaimable? no
Aug 19 14:37:26 xpc12 kernel: protections[]: 0 0 0
Aug 19 14:37:26 xpc12 kernel: HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Aug 19 14:37:26 xpc12 kernel: protections[]: 0 0 0
Aug 19 14:37:26 xpc12 kernel: DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Aug 19 14:37:26 xpc12 kernel: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Aug 19 14:37:26 xpc12 kernel: HighMem: empty
Aug 19 14:37:26 xpc12 kernel: Swap cache: add 44, delete 44, find 0/0, race 0+0
Aug 19 14:37:26 xpc12 kernel: Free swap:       4016372kB
Aug 19 14:37:26 xpc12 kernel: 131052 pages of RAM
Aug 19 14:37:26 xpc12 kernel: 0 pages of HIGHMEM
Aug 19 14:37:26 xpc12 kernel: 2258 reserved pages
Aug 19 14:37:26 xpc12 kernel: 100667 pages shared
Aug 19 14:37:26 xpc12 kernel: 0 pages swap cached
Aug 19 14:37:26 xpc12 kernel: journal_get_undo_access: No memory for committed data
Aug 19 14:37:26 xpc12 kernel: ext3_try_to_allocate_with_rsv: aborting transaction: Out of memory in __ext3_journal_get_undo_access
Aug 19 14:37:26 xpc12 kernel: EXT3-fs error (device hda3) in ext3_new_block: Out of memory
Aug 19 14:37:26 xpc12 kernel: Aborting journal on device hda3.
Aug 19 14:37:26 xpc12 kernel: ext3_abort called.
Aug 19 14:37:26 xpc12 kernel: EXT3-fs error (device hda3): ext3_journal_start_sb: Detected aborted journal
Aug 19 14:37:26 xpc12 kernel: Remounting filesystem read-only
Aug 19 14:37:26 xpc12 kernel: EXT3-fs error (device hda3) in ext3_ordered_writepage: Out of memory
Aug 19 14:37:26 xpc12 kernel: EXT3-fs error (device hda3) in start_transaction: Journal has aborted
Aug 19 14:37:28 xpc12 kernel: EXT3-fs error (device hda3) in start_transaction: Journal has aborted

The ext3 journal aborts and the filesystem switches into read only mode.

We tried copying the files to another disk on the same computer and the program
worked. However the program fails if the original disk is plugged into a
different computer with the same kernel.

We rebuilt and used the FC4 kernel kernel-2.6.12-1.1398 instead of the EL
kernel, and this solves the problem! There must be a subtle bug hidden in
kernel-2.6.9-11.EL. This does not look to be a hardware issue.
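
Since several reporters see only the informational kswapd message while others also lose the filesystem, it may be useful to check a log for the full sequence seen in this comment: allocation failure, then journal abort, then read-only remount. The sketch below records the first line index of each of those events. The marker strings are taken from the log in this comment; the function name `first_occurrences` and the event labels are made up for this example.

```python
# Substrings copied from the log in this comment; labels are illustrative only.
MARKERS = [
    ("alloc_failure", "page allocation failure"),
    ("journal_abort", "Aborting journal on device"),
    ("remount_ro", "Remounting filesystem read-only"),
]


def first_occurrences(lines):
    """Map each event label to the index of the first line containing its marker."""
    seen = {}
    for index, line in enumerate(lines):
        for label, marker in MARKERS:
            if marker in line and label not in seen:
                seen[label] = index
    return seen


log = [
    "kernel: kswapd0: page allocation failure. order:0, mode:0x850",
    "kernel: journal_get_undo_access: No memory for committed data",
    "kernel: Aborting journal on device hda3.",
    "kernel: Remounting filesystem read-only",
]
print(first_occurrences(log))
# {'alloc_failure': 0, 'journal_abort': 2, 'remount_ro': 3}
```

A log that contains only the "alloc_failure" key corresponds to the harmless case described earlier in this bug; one that contains all three corresponds to the failure reported here.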
Comment 20 Trond H. Amundsen 2005-11-22 05:50:22 EST
Just a "me too". We've seen this problem during a stress test on two identical
servers running the kernel 2.6.9-22.0.1.EL. I'll attach the console output.
Comment 21 Trond H. Amundsen 2005-11-22 05:52:33 EST
Created attachment 121338
Ouput from console when running stresstest on kernel 2.6.9-22.0.1.ELsmp
Comment 22 Joseph Chiu 2006-01-03 10:25:26 EST
Hi,
We have 5 machines (out of 20 servers) running RHEL4 Update 2 (2.6.9-22.Elsmp)
that have this problem. It never happened before the recent upgrade to this
version (about 1 month ago). We upgraded from Update 0 to Update 2 with
"rpm -Fvh *.rpm". Could you please let us know whether this will be fixed in the
RHEL4 Update 3 release, or whether you have an RPM for it?

Here is my error
"
EXT3-fs error (device dm-5) in start_transaction: Journal has aborted
EXT3-fs error (device dm-5) in start_transaction: Journal has aborted
.
.
.
"

Red Hat is telling me "(kswapd message) These messages are not harmful they are
informational." but what does this message mean and how can I fix it?
These are actual problems for us. We cannot have a read-only root. Could
SOMEONE at least please tell me how to resolve this (turning the root from
read-only back to read-write)? I need a workaround so that I can at least still
write to the root device. If I run e2fsck on /dev/root after rebooting from a
CD, will that fix it for now? (But I don't want to reboot the server.)

Will this bug be fixed in Update 4?

Thanks.
Comment 23 Ryan Phillips 2006-03-19 17:35:29 EST
Has this problem been fixed in RHEL4 Update3?
Comment 24 Jeff Burke 2006-03-20 08:23:48 EST
Ryan, Trond and Jeremy;
   Which problem are you all referring to? Are you all referring to the
"kswapd0: page allocation failure. order:0, mode:0x50" message?

   When I initially opened the BZ there were two issues: the "kswap" issue and the
"EXT3-fs error (device dm-5) in start_transaction: Journal has aborted" error. The
second of the two is just an effect of the first (the kswap issue). That may or may
not be an actual issue. We are still discussing it internally.

   As far as the kswap fix goes, Larry does have a test patch that makes things a
lot better. We did not include it in RHEL4 U3. It was felt that we need a full
testing cycle to ensure that we do not introduce any issues with the change.

   With that said, if any or all of you have a system you could run this test
kernel patch on, please let me know the arch. I will build a kernel for you.
The more data/test points we have the better the chances we have of this being
included in RHEL4 U4. Please do not test this kernel on production machines.
This kernel will not be supported by our support staff here at Red Hat. This
will be for testing purposes only.
  
Thanks,
Jeff
Comment 25 Doug Chapman 2006-03-20 10:44:02 EST
FYI, I just hit the "page allocation failure. order:0, mode:0x50" issue in the
latest development kernel: 2.6.9-34.5.EL on 2 of my ia64 systems.  I _should_ be
able to reproduce these so please let me know when Larry's fix is submitted and
I will re-run to help verify.

Comment 26 Doug Chapman 2006-05-01 10:55:42 EDT
I hit this again in 2.6.9-34.26.EL.  Was this fix ever submitted into the RHEL4
pool?
Comment 27 Ahnjoan Amous 2006-05-11 14:33:42 EDT
I would like to give the patch a shot.  I hope this will address problems I have
listed under bug #183366.  If you all think it might, I'd be happy to attempt
debugging or whatever else could help.  I have four identical hosts and each of
them has crashed with similar errors.  One I use more than the others and can
count on it to crash at least once a week.  This is real world behavior and not
caused by a test case scenario.  I am running the x86_64 kernel on Intel.

Linux wks01 2.6.9-22.0.2.ELsmp #1 SMP Thu Jan 5 17:11:56 EST 2006
x86_64 x86_64 x86_64 GNU/Linux

Side note, above is the output from `uname -a`.  Hope this will help to know
what type of test kernel I might need.

Thanks
Ahnjoan
Comment 28 Larry Woodman 2006-05-11 14:50:15 EDT
Ahnjoan, are these messages causing your system to crash?  We do see them here
on occasion under very heavy load but they do not cause any real problems.  They
simply print out this message and the system continues on.  They only occur on
small systems under very heavy loads.

Larry Woodman
Comment 29 Jeff Burke 2006-05-11 14:55:21 EDT
As per comment #24

A couple of questions.
    1. Can you test with the RHEL4-U4 beta candidate? It would be easier
for me to provide a kernel to you.
    2. What are the issues you are seeing?
    3. Are these test case scenarios or is it real world behavior that
causes these issues?
    4. When you are running the kernel x86_64 is it on AMD or Intel
(em64T/ia32e)?

    Just to give you a little background for the context of my question.
We did not include this patch in RHEL4-U4 even though I and several others
have done extensive testing with the patch. The reason is that no customer
has yet opened a BZ based on a real world situation where this causes an
actual problem.
Comment 30 Jeff Burke 2006-05-11 15:00:35 EDT
Also, the patch does not solve the issue; it just makes the printing of the
informational messages less frequent. Based on the kernel behavior
Comment 31 Ahnjoan Amous 2006-05-11 15:03:41 EDT
Larry - I'm not sure if you would really call it a crash.  All of my file
systems turn read only.  X doesn't crash, I can still use the bash shell within
an xterm.  Syslog continues to send messages and I assume lots of other stuff
continues to work.  After I configured syslog to point at another machine I was
able to get the messages that are listed in bug #183366.  I end up having to
crash my machine to get it working again.  I'm sure someone smarter might have a
better way to do it but when I type in reboot, or init 0, the machine ends up
hanging while trying to shut down.  I assume this is because EXT3 remounts
everything read-only.

It takes about 2 hours to fsck everything back to happiness and I'm back in
business.

Hope that helps.

Thanks
Ahnjoan
Comment 32 Ahnjoan Amous 2006-05-11 15:30:03 EDT
Jeff - I don't know enough to answer question one of comment #29.  I purchased a
couple of Dell hosts and they came with RedHat CDs.  From time to time I run
up2date and include kernel updates.  I don't know what # update I'm running in
terms of RHEL U?

Thanks
Ahnjoan
Comment 33 Jeff Burke 2006-05-11 15:33:29 EDT
If you cat /etc/redhat-release that should tell you the product version/update
information
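
A minimal sketch of pulling just the update number out of that file. The helper name rhel_update is hypothetical, and the sample release string in the usage comment is an assumed example of the RHEL4 format, not output taken from any machine in this report.

```shell
# rhel_update is a hypothetical helper, not a Red Hat tool.
# $1 = a release string such as the contents of /etc/redhat-release.
# Prints the number following "Update", or nothing if there is no
# "Update N" token in the string (for example on an initial release).
rhel_update() {
    printf '%s\n' "$1" | sed -n 's/.*Update \([0-9][0-9]*\).*/\1/p'
}

# Usage on a live system:
#   rhel_update "$(cat /etc/redhat-release)"
# Usage with an assumed example string:
#   rhel_update 'Red Hat Enterprise Linux AS release 4 (Nahant Update 2)'
```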
Comment 34 Ahnjoan Amous 2006-05-11 15:36:10 EDT
Mine says Update 2.  If you sent an Update 4 kernel, would it work with
Update 2?  If so I would be more than happy to test.
Comment 35 Jeff Burke 2006-05-11 15:42:17 EDT
We maintain ABI throughout the release, so it should work fine. When the kernel
is ready I or Larry will post a link.

Also I am not sure I actually saw the answer to this question: What are you
running or doing when the issue mentioned in comment #31 occurs? Are you
running some test or a real user type application?

Thanks,
Jeff
Comment 36 Ahnjoan Amous 2006-05-11 15:51:18 EDT
Just working really.  I almost always have the following things up.  4 xterms
and one mozilla browser in each of two different virtual windows, one vnc
session in another virtual window, and exported from another machine is HP's NNM
in a fourth virtual window.  I don't know that I do one certain thing to make
this happen but it does happen on a fairly regular basis.  I have never run any
stress tests.

Thanks
Ahnjoan
Comment 37 Larry Woodman 2006-05-11 16:28:09 EDT
Ahnjoan, have you tried the RHEL4-U4 test kernel?  It includes this patch that
prevents IO queue stalls:

--- linux-2.6.9/drivers/block/cfq-iosched.c.larry
+++ linux-2.6.9/drivers/block/cfq-iosched.c
@@ -381,7 +381,7 @@ static int cfq_dispatch_requests(request
 restart:
        good_queues = 0;
        list_for_each_safe(entry, tmp, &cfqd->rr_list) {
-               cfqq = list_entry_cfqq(cfqd->rr_list.next);
+               cfqq = list_entry_cfqq(entry);

                BUG_ON(RB_EMPTY(&cfqq->sort_list));

Comment 38 Ahnjoan Amous 2006-05-11 18:09:23 EDT
Larry - I haven't tried that, I'm happy to give it a go round but it will take
me a little while to figure out how to do this. (get source, patch, compile,
install)  I'll report back when completed.

Thanks
Ahnjoan
Comment 39 Jeff Burke 2006-05-12 08:32:26 EDT
Larry,
    Which test kernel are you referring to? Can you please denote with a build
number. 

Jeff
Comment 40 Larry Woodman 2006-05-12 11:36:29 EDT
Ahnjoan, can you try out the appropriate kernel at this location:

http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/kernel-2.6.9-36.EL.i686.rpm


Larry Woodman
Comment 41 Ahnjoan Amous 2006-05-12 16:01:31 EDT
Larry - I tried the following.

Downloaded and installed the rpm.
Then rebooted and selected the 2.6.9-36 kernel.
The host hangs at "Booting the kernel"

Thanks
Comment 42 Ahnjoan Amous 2006-05-12 16:10:49 EDT
I'm guessing I should have looked in the entire directory and chosen the right
rpm.  I have downloaded kernel-smp for x86_64 and will give it a shot.  I'm
sorry for all the traffic.

Thanks
Comment 43 Larry Woodman 2006-05-12 16:15:53 EDT
Actually, can you grab the correct kernel from here:

>>>http://people.redhat.com/~jbaron/rhel4/RPMS.kernel

The one I referred you to was the UP kernel.

Larry Woodman


Comment 44 Ahnjoan Amous 2006-05-12 16:31:19 EDT
Sure, will do, wonder if you might point me in the right direction for the
source?  When my machine boots it won't load some modules complaining about
missing source.

Thanks
Comment 45 Ahnjoan Amous 2006-05-15 14:17:48 EDT
I have tried the following without success.

1. Download kernel-smp-2.6.9-36.EL.x86_64.rpm
  Boot error complains about missing source when loading nvidia module.
2. Download kernel-devel-2.6.9-36.EL.x86_64.rpm
  Same error
3. Download kernel-2.6.9-36.EL.src.rpm
  Same error
4. Downloaded latest OEM nvidia driver, boot "36" kernel, attempt install of driver
  Driver install complains about missing source.
5. Attempt to find where kernel-2.6.9-36.EL.src.rpm installs so that I can pass
the nvidia driver a parameter named --kernel-source-path.
  Couldn't figure this out, probably because I don't know what I'm doing.

Anyone have any ideas on what I might be doing incorrectly?

Thanks
Comment 48 Ahnjoan Amous 2006-05-23 12:14:58 EDT
The .37 kernel is installed and has been running for just over a day.  I did
have one question.  Are the changes in .37, that pertain to this issue, going to
fix the page allocation issue or do they only stop the file system from locking
itself?

FYI – The two previous posts I submitted look to have been addressed with the
latest (37) kernel.  I no longer get errors about missing source when the kernel
modules for sound or video attempt to load.  These errors only happened with the
.36 kernel.
Comment 49 Doug Chapman 2006-06-05 13:58:06 EDT
FYI, I still see these messages on a regular basis on my ia64 systems running
the 2.6.9-37.EL kernel.  This is seen with an HP virtual memory test suite. 
Please let me know if you would like any details from my system.  Also, I can
easily reproduce this if you want me to run any experimental kernels.

- Doug
Comment 54 Benedikt Schaefer 2006-08-17 02:26:57 EDT
Dear all,

I have several machines which show the described page allocation error in a
production environment.
For details please see bugreport #202205.

Do you have a solution for this problem?
We urgently need a solution.

Thanks.
Benedikt Schaefer (NEC HPCE GmbH)
Comment 55 Larry Woodman 2006-08-17 12:07:20 EDT
Benedikt, I cannot seem to reproduce this behavior on any of the RHEL4 systems
we have here.  Any reproducer you can give me would be a huge help here.  In the
meantime, can you experiment with increasing
/proc/sys/vm/lower_zone_protection to 100 and increasing the value currently in
/proc/sys/vm/min_free_kbytes?  Try doubling it until this problem subsides.  Both
of these should increase the size of the free lists so that kswapd kicks in
earlier and has more time to recover from the memory exhaustion.

Thanks, Larry Woodman
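
The two adjustments described above could be sketched roughly as follows. This is an illustration under assumptions, not a supported procedure: tune_lowmem is a hypothetical helper name, the directory argument exists only so the function can be tried against scratch files, and on a real system it has to run as root against /proc/sys/vm.

```shell
# tune_lowmem is a hypothetical helper, not a Red Hat tool.
# $1 = directory holding the tunables (defaults to /proc/sys/vm).
# Sets lower_zone_protection to 100 and doubles min_free_kbytes once,
# then prints the new min_free_kbytes value.
tune_lowmem() {
    vm_dir=${1:-/proc/sys/vm}

    # lower_zone_protection exists on RHEL4-era 2.6.9 kernels;
    # skip it quietly where the file is absent.
    if [ -e "$vm_dir/lower_zone_protection" ]; then
        echo 100 > "$vm_dir/lower_zone_protection" || return 1
    fi

    current=$(cat "$vm_dir/min_free_kbytes") || return 1
    echo $(( current * 2 )) > "$vm_dir/min_free_kbytes" || return 1
    cat "$vm_dir/min_free_kbytes"
}

# Usage (as root): run once, re-test the workload, and repeat only if
# the allocation failures continue:
#   tune_lowmem
```

Values written this way do not survive a reboot, so note the original min_free_kbytes before the first run.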
Comment 56 Benedikt Schaefer 2006-08-28 09:11:20 EDT
Our only way to reproduce this for the moment is to run Linpack, but this is not
successful every time.
We asked the customer to try the kernel parameters you mentioned.
Comment 57 Steve Bergman 2006-11-04 07:47:33 EST
I can reproduce this on a 2.6.9-42.0.3.ELsmp CentOS 4.4 system by simply writing
a very large file (multi-gigabyte).

The system is running smp with 2 3.2GHz Xeons and 4GB memory.

I have also seen OOM-Killer activity with plenty of memory available.

Increasing lower_zone_protection to 150 seems to help.

Running:

dd if=/dev/zero of=testfile bs=1M count=8192

still triggered the OOM-Killer with lower_zone_protection = 100



Comment 58 Steve Bergman 2006-11-04 07:51:37 EST
Created attachment 140354 [details]
Log output from page allocation failure
Comment 59 Steve Bergman 2006-11-04 07:55:31 EST
Created attachment 140355 [details]
Log output about OOM-Killer activity
Comment 60 Steve Bergman 2006-11-04 08:00:55 EST
(In reply to comment #57)

BTW, there is no NFS involved.  (I've read in other bug reports that one
might have to increase lower_zone_protection when copying large files via NFS.)
This is simply a copy of a large file on a local filesystem to another local
location.

Perhaps a little lower_zone_protection should be enabled by default, to be safe?
Comment 63 Larry Woodman 2006-11-21 14:21:34 EST
Also, you should probably lower /proc/sys/vm/dirty_ratio and
/proc/sys/vm/dirty_background_ratio.  Please set them to 1/2 of whatever they
are by default and see if this prevents the problem from occurring.  Basically
these control the % of memory that can be dirty before the offending process
and pdflush are forced to start writing it back to disk.  Since this is an x86
system lowmem gets unfairly picked on due to highmem fallover, bounce buffers
and general kernel data structure/slabcache allocation.

Larry Woodman
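
A rough sketch of halving both ratios, assuming it is run as root on the affected machine. The helper name halve_dirty_ratios is hypothetical, and the directory argument is only there so the function can be exercised against scratch files instead of the live /proc/sys/vm.

```shell
# halve_dirty_ratios is a hypothetical helper, not a Red Hat tool.
# $1 = directory holding the tunables (defaults to /proc/sys/vm).
# Halves dirty_ratio and dirty_background_ratio (never below 1) and
# prints the old and new value of each.
halve_dirty_ratios() {
    vm_dir=${1:-/proc/sys/vm}
    for knob in dirty_ratio dirty_background_ratio; do
        old=$(cat "$vm_dir/$knob") || return 1
        new=$(( old / 2 ))
        if [ "$new" -lt 1 ]; then
            new=1
        fi
        echo "$new" > "$vm_dir/$knob" || return 1
        echo "$knob: $old -> $new"
    done
}

# Usage (as root):
#   halve_dirty_ratios
```

The printed old values are worth keeping so the defaults can be restored if the change makes no difference.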
Comment 64 Steve Bergman 2006-11-21 14:38:05 EST
lower_zone_protection = 150 completely eliminates the problem even in the case of:

dd if=/dev/zero of=tempfile bs=1M

(In this test, the free lowmem drops to 13MB before the value stabilizes.)

However, I end up with ~150MB of memory that can't be used for cache.

Should I be able to lower that if I also lower these parameters, and would that
be preferable?

FWIW, 150MB is about what gets reserved by default on a 4GB system by 2.6.16,
which does not use lower_zone_protection, but lowmem_reserve_ratio, which
autotunes reserved memory at boot time based upon ram size.
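
For anyone repeating this test, free lowmem can be watched from a second terminal while dd runs. The sketch below assumes a 32-bit highmem kernel, where /proc/meminfo carries a LowFree line; low_free_kb is a hypothetical helper name.

```shell
# low_free_kb is a hypothetical helper, not a standard tool.
# $1 = meminfo-format file (defaults to /proc/meminfo).
# Prints the LowFree value in kB, or nothing when the file has no
# LowFree line (for example on kernels built without highmem).
low_free_kb() {
    awk '/^LowFree:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Usage while the dd test runs in another terminal:
#   while sleep 5; do echo "$(date +%T) LowFree: $(low_free_kb) kB"; done
```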
Comment 65 Larry Woodman 2006-12-08 08:05:10 EST

*** This bug has been marked as a duplicate of 193542 ***
Comment 66 Larry Woodman 2006-12-11 14:09:14 EST
Steve, FYI, /proc/sys/vm/min_free_kbytes has the same effect.  If you increase
it to 4 times the default value it will also match the upstream defaults and
prevent the page allocation failure.

Larry Woodman
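
One possible way to keep a 4x setting across reboots is an /etc/sysctl.conf entry, sketched here under assumptions. Persisting the value is not something this report prescribes, min_free_sysctl_line is a hypothetical helper, and the default must be read from the machine itself before any manual tuning has changed it.

```shell
# min_free_sysctl_line is a hypothetical helper, not a Red Hat tool.
# $1 = the machine's untouched default min_free_kbytes value.
# Prints a sysctl.conf line that sets four times that value.
min_free_sysctl_line() {
    default_kb=$1
    echo "vm.min_free_kbytes = $(( default_kb * 4 ))"
}

# Usage (as root, before changing the value by hand):
#   min_free_sysctl_line "$(cat /proc/sys/vm/min_free_kbytes)" >> /etc/sysctl.conf
#   sysctl -p
```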
Comment 68 Duncan Hill 2007-02-14 07:22:29 EST
I'm seeing similar issues on a dual quad-core Dell 1950 with 8 GB of RAM, 
using software RAID 1 with 4 disks (2 pairs).  When creating a 10 GB disk 
image in VMware server, or attempting to build a system using that disk image, 
I stand a very good chance of having to send an ipmi command to reboot the 
server, because it just doesn't come back to life (and it's 50 miles away).

In fact, as I type this, it looks like the server has done it again, and all 
it was doing was a staggered boot of several VM images.  I'll be trying 
the /proc/sys/vm/lower_zone_protection method when it comes back from the 
power reset.

Kernel is 2.6.9-42.0.3.ELsmp, yum informs me 8.ELsmp is available (CentOS 4.4 
server).
