Bug 159271 - kernel oops when system is under high load
Summary: kernel oops when system is under high load
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Larry Woodman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-06-01 09:50 UTC by Martijn Brizee
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-19 19:01:13 UTC
Target Upstream Version:
Embargoed:


Description Martijn Brizee 2005-06-01 09:50:30 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
When the system has been under high load for some time, the kernel oopses and the system hangs.  This only occurs when the load is high and there is I/O on the SCSI interface.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-27.0.4.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Boot the system
2. Start the simulation
3. Start a restore from tape
  

Actual Results:  system crashes

Expected Results:  system should run fine

Additional info:

The system was running fine with RH9. The problems started after upgrading to RHEL 3. The system is a single-CPU AMD Athlon running an SMP kernel. The SMP kernel was selected by the RHEL 3 installer.

Comment 1 Martijn Brizee 2005-06-01 09:51:46 UTC
oops log from netdump:

Oops: 0000
ppp_async ppp_generic slhc nfsd lockd sunrpc netconsole autofs4 via-rhine mii
crc32 e1000 e100 ide-scsi ide-cd cdrom st
ext3 jbd raid1 DAC960 sym53c8xx sd_mod
CPU:    0
EIP:    0060:[<c0142ec5>]    Not tainted
EFLAGS: 00010292

EIP is at __remove_inode_page [kernel] 0x15 (2.4.21-27.0.4.ELsmp/athlon)
eax: 39794000   ebx: c16fe008   ecx: 00000000   edx: 00000000
esi: c3847583   edi: 00000b7a   ebp: c0396410   esp: c39cfcdc
ds: 0068   es: 0068   ss: 0068
Process kswapd (pid: 6, stackpage=c39cf000)
Stack: 00000000 c16fe008 c0395a40 c014f593 c16fe008 c0395a40 c1038030 c0396418
       00000282 ffffffff 00000000 c198e280 0000000a c0395a40 00000000 c0153cca
       c0153f3c c0396fc0 00000000 00000001 00000000 c0396fc4 00000000 00000030
Call Trace:   [<c014f593>] reclaim_page [kernel] 0x313 (0xc39cfce8)
[<c0153cca>] fixup_freespace [kernel] 0x2a (0xc39cfd18)
[<c0153f3c>] __alloc_pages [kernel] 0x10c (0xc39cfd1c)
[<c0153eb6>] __alloc_pages [kernel] 0x86 (0xc39cfd3c)
[<c01cc4fb>] elevator_linus_merge [kernel] 0x27b (0xc39cfd48)
[<c01ca008>] locate_hd_struct [kernel] 0x38 (0xc39cfd4c)
[<c01ca1a7>] req_new_io [kernel] 0x67 (0xc39cfd64)
[<c01541ac>] __get_free_pages [kernel] 0x1c (0xc39cfd80)
[<c014d052>] kmem_cache_grow [kernel] 0xc2 (0xc39cfd84)
[<c01ca8f1>] __make_request [kernel] 0x4f1 (0xc39cfd8c)
[<c014de8d>] __kmem_cache_alloc [kernel] 0x6d (0xc39cfdac)
[<f89e2362>] raid1_alloc_r1bh [raid1] 0x92 (0xc39cfdcc)
[<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfde8)
[<f89e2b71>] raid1_make_request [raid1] 0x41 (0xc39cfe14)
[<c020b126>] md_make_request [kernel] 0x76 (0xc39cfe64)
[<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfe78)
[<c01cac59>] submit_bh_rsector [kernel] 0x49 (0xc39cfea4)
[<c01637d2>] brw_page [kernel] 0xb2 (0xc39cfec0)
[<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cfedc)
[<c015316f>] rw_swap_page_base [kernel] 0xaf (0xc39cfee4)
[<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cff34)
[<c0153263>] rw_swap_page [kernel] 0x43 (0xc39cff3c)
[<c0154bb8>] swap_writepage [kernel] 0x38 (0xc39cff50)
[<c014fec9>] launder_page [kernel] 0x709 (0xc39cff60)
[<c0151702>] rebalance_dirty_zone [kernel] 0xa2 (0xc39cff84)
[<c0151cdb>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc39cffac)
[<c0151e08>] kswapd [kernel] 0x68 (0xc39cffd0)
[<c0151da0>] kswapd [kernel] 0x0 (0xc39cffe4)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cfff0)

Code: 8b 50 28 85 d2 75 5f 8b 43 04 8b 13 89 42 04 89 10 c7 43 04

CPU#0 is executing netdump.
< netdump activated - performing handshake with the client. >

Pid/TGid: 6/6, comm:               kswapd
EIP: 0060:[<c0142ec5>] CPU: 0
EIP is at __remove_inode_page [kernel] 0x15 (2.4.21-27.0.4.ELsmp)
 ESP: e008:00000000 EFLAGS: 00010292    Not tainted
EAX: 39794000 EBX: c16fe008 ECX: 00000000 EDX: 00000000
ESI: c3847583 EDI: 00000b7a EBP: c0396410 DS: 0068 ES: 0068 FS: 0000 GS: 0000
CR0: 8005003b CR2: 39794028 CR3: 00101000 CR4: 000006d0
Call Trace:   [<c014f593>] reclaim_page [kernel] 0x313 (0xc39cfce8)
[<c0153cca>] fixup_freespace [kernel] 0x2a (0xc39cfd18)
[<c0153f3c>] __alloc_pages [kernel] 0x10c (0xc39cfd1c)
[<c0153eb6>] __alloc_pages [kernel] 0x86 (0xc39cfd3c)
[<c01cc4fb>] elevator_linus_merge [kernel] 0x27b (0xc39cfd48)
[<c01ca008>] locate_hd_struct [kernel] 0x38 (0xc39cfd4c)
[<c01ca1a7>] req_new_io [kernel] 0x67 (0xc39cfd64)
[<c01541ac>] __get_free_pages [kernel] 0x1c (0xc39cfd80)
[<c014d052>] kmem_cache_grow [kernel] 0xc2 (0xc39cfd84)
[<c01ca8f1>] __make_request [kernel] 0x4f1 (0xc39cfd8c)
[<c014de8d>] __kmem_cache_alloc [kernel] 0x6d (0xc39cfdac)
[<f89e2362>] raid1_alloc_r1bh [raid1] 0x92 (0xc39cfdcc)
[<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfde8)
[<f89e2b71>] raid1_make_request [raid1] 0x41 (0xc39cfe14)
[<c020b126>] md_make_request [kernel] 0x76 (0xc39cfe64)
[<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfe78)
[<c01cac59>] submit_bh_rsector [kernel] 0x49 (0xc39cfea4)
[<c01637d2>] brw_page [kernel] 0xb2 (0xc39cfec0)
[<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cfedc)
[<c015316f>] rw_swap_page_base [kernel] 0xaf (0xc39cfee4)
[<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cff34)
[<c0153263>] rw_swap_page [kernel] 0x43 (0xc39cff3c)
[<c0154bb8>] swap_writepage [kernel] 0x38 (0xc39cff50)
[<c014fec9>] launder_page [kernel] 0x709 (0xc39cff60)
[<c0151702>] rebalance_dirty_zone [kernel] 0xa2 (0xc39cff84)
[<c0151cdb>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc39cffac)
[<c0151e08>] kswapd [kernel] 0x68 (0xc39cffd0)
[<c0151da0>] kswapd [kernel] 0x0 (0xc39cffe4)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cfff0)


                         free                        sibling
  task             PC    stack   pid father child younger older
init          S C0424180     0     1      0     3       2       (NOTLB)
Call Trace:   [<c0153f08>] __alloc_pages [kernel] 0xd8 (0xf7fa1ea4)
[<c0121d76>] schedule [kernel] 0x176 (0xf7fa1eb8)
[<c0132825>] schedule_timeout [kernel] 0x65 (0xf7fa1ee0)
[<c01541ac>] __get_free_pages [kernel] 0x1c (0xf7fa1ee8)
[<c0173201>] __pollwait [kernel] 0x31 (0xf7fa1eec)
[<c01327b0>] process_timeout [kernel] 0x0 (0xf7fa1f00)
[<c017348e>] do_select [kernel] 0x11e (0xf7fa1f18)
[<c0173932>] sys_select [kernel] 0x352 (0xf7fa1f5c)

migration/0   S C0424180  5492     2      0                   1 (L-TLB)
Call Trace:   [<c0121d76>] schedule [kernel] 0x176 (0xc39c7f78)
[<c0123560>] migration_task [kernel] 0x0 (0xc39c7f90)
[<c01237f0>] migration_task [kernel] 0x290 (0xc39c7fa0)
[<c0123560>] migration_task [kernel] 0x0 (0xc39c7fe0)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39c7ff0)

keventd       S C0424180     0     3      1             4       (L-TLB)
Call Trace:   [<c0121d76>] schedule [kernel] 0x176 (0xc37c5f64)
[<c0139947>] context_thread [kernel] 0x117 (0xc37c5f8c)
[<c0139830>] context_thread [kernel] 0x0 (0xc37c5fe0)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc37c5ff0)

kapmd         S C0424180     0     4      1             5     3 (L-TLB)
Call Trace:   [<c0121d76>] schedule [kernel] 0x176 (0xc37c3f20)
[<c0132825>] schedule_timeout [kernel] 0x65 (0xc37c3f48)
[<c01327b0>] process_timeout [kernel] 0x0 (0xc37c3f68)
[<c011a940>] apm_mainloop [kernel] 0x60 (0xc37c3f80)
[<c011b244>] apm [kernel] 0x1e4 (0xc37c3fcc)
[<c011b060>] apm [kernel] 0x0 (0xc37c3fe4)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc37c3ff0)

ksoftirqd/0   S C0424180     0     5      1             8     4 (L-TLB)
Call Trace:   [<c0121d76>] schedule [kernel] 0x176 (0xc37c1fa4)
[<c012d9ff>] ksoftirqd [kernel] 0xaf (0xc37c1fcc)
[<c012d950>] ksoftirqd [kernel] 0x0 (0xc37c1fe0)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc37c1ff0)

bdflush       S C0424180  4468     8      1             6     5 (L-TLB)
Call Trace:   [<c0121d76>] schedule [kernel] 0x176 (0xc39cbf7c)
[<c0122485>] interruptible_sleep_on [kernel] 0x55 (0xc39cbfa4)
[<c012d941>] __run_task_queue [kernel] 0x61 (0xc39cbfbc)
[<c0164227>] bdflush [kernel] 0xe7 (0xc39cbfd4)
[<c0164140>] bdflush [kernel] 0x0 (0xc39cbfe4)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cbff0)

kswapd        R current   2932     6      1             7     8 (L-TLB)
Call Trace:   [<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cff34)
[<c0153263>] rw_swap_page [kernel] 0x43 (0xc39cff3c)
[<c0154bb8>] swap_writepage [kernel] 0x38 (0xc39cff50)
[<c014fec9>] launder_page [kernel] 0x709 (0xc39cff60)
[<c0151702>] rebalance_dirty_zone [kernel] 0xa2 (0xc39cff84)
[<c0151cdb>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc39cffac)
[<c0151e08>] kswapd [kernel] 0x68 (0xc39cffd0)
[<c0151da0>] kswapd [kernel] 0x0 (0xc39cffe4)
[<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cfff0)
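
Note on reading the trace: every call-trace line in the log above has the same layout (return address, symbol, module, offset into the symbol, stack slot). The following is a minimal Python sketch for pulling those fields out when comparing oops logs. The regular expression and the `parse_frames` helper are illustrative assumptions, not part of netdump or any Red Hat tool, and the sample text is three lines copied from the log above.

```python
import re

# One frame as printed in this log, e.g.
#   [<c014f593>] reclaim_page [kernel] 0x313 (0xc39cfce8)
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{8})>\]\s+"      # return address
    r"(?P<symbol>\S+)\s+"                  # symbol name
    r"\[(?P<module>[^\]]+)\]\s+"           # owning module, or "kernel"
    r"0x(?P<offset>[0-9a-f]+)\s+"          # offset into the symbol
    r"\(0x(?P<stack>[0-9a-f]{8})\)"        # stack slot holding the address
)


def parse_frames(text):
    """Return one dict per call-trace frame found in text (hypothetical helper)."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "addr": int(m.group("addr"), 16),
            "symbol": m.group("symbol"),
            "module": m.group("module"),
            "offset": int(m.group("offset"), 16),
            "stack": int(m.group("stack"), 16),
        })
    return frames


SAMPLE = """\
Call Trace:   [<c014f593>] reclaim_page [kernel] 0x313 (0xc39cfce8)
[<c0153cca>] fixup_freespace [kernel] 0x2a (0xc39cfd18)
[<f89e2362>] raid1_alloc_r1bh [raid1] 0x92 (0xc39cfdcc)
"""

frames = parse_frames(SAMPLE)
for f in frames:
    print(f"{f['symbol']}+{f['offset']:#x} [{f['module']}]")
```

Filtering the result on `module != "kernel"` is a quick way to see which loadable modules (here raid1) appear on the faulting path.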

Comment 2 Martijn Brizee 2005-06-01 09:53:19 UTC
contents of /proc/cpuinfo:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1200.073
cache size      : 256 KB
physical id     : 0
siblings        : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 2392.06
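
Note: the report describes a single-CPU machine running an SMP kernel. A minimal Python sketch for confirming that from the two pieces of data in this bug is shown below. The `count_processors` helper is hypothetical, the cpuinfo sample is abbreviated from the output above, and treating a release string ending in "smp" as an SMP build is an assumption based on the "2.4.21-27.0.4.ELsmp" string in the oops.

```python
def count_processors(cpuinfo_text):
    """Count 'processor' entries in /proc/cpuinfo-style text (hypothetical helper)."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key == "processor":
            count += 1
    return count


# Abbreviated from the /proc/cpuinfo output in this comment.
CPUINFO = """\
processor       : 0
vendor_id       : AuthenticAMD
model name      : AMD Athlon(tm) Processor
siblings        : 1
"""

# Kernel release string as it appears in the oops log.
KERNEL_RELEASE = "2.4.21-27.0.4.ELsmp"

n_cpus = count_processors(CPUINFO)
is_smp_kernel = KERNEL_RELEASE.endswith("smp")  # assumption: suffix marks SMP build
print(f"logical CPUs: {n_cpus}, SMP kernel: {is_smp_kernel}")
```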

Comment 3 Martijn Brizee 2005-06-01 09:54:23 UTC
Memory dump is also available but is ~700 MB when zipped.

Comment 4 Larry Woodman 2005-06-17 14:03:09 UTC
This problem appears to be page list corruption.  We have fixed a memory
corruption problem that caused corruption of the page lists in RHEL3-U6 and
in an RHEL3-U5 security errata.  You should really be running that kernel.
Please grab that kernel and let me know whether it does indeed fix this problem.

Larry Woodman
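
Note: the register dump in Comment 1 can be cross-checked against this diagnosis. The sketch below is illustrative only, written in Python. It assumes that the "Code:" bytes start at the faulting EIP (as the matching CR2 value suggests) and that this kernel uses the default i386 3G/1G split, so `PAGE_OFFSET` is 0xC0000000. The `decode_mov_disp8` helper is hypothetical and handles only the one instruction form needed here.

```python
# Values copied from the oops in Comment 1.
EAX = 0x39794000
CR2 = 0x39794028                      # faulting address reported by the CPU
CODE = bytes.fromhex("8b 50 28")      # first instruction on the "Code:" line

PAGE_OFFSET = 0xC0000000              # assumption: default 3G/1G split on i386

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_disp8(code):
    """Decode 'mov r32, [r32 + disp8]' (opcode 0x8b, ModRM mod=01, no SIB).

    Returns (dest_reg, base_reg, disp); raises ValueError for other forms.
    """
    opcode, modrm, disp = code[0], code[1], code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 1 or rm == 4:
        raise ValueError("not a simple mov r32, [r32 + disp8]")
    if disp >= 0x80:                  # disp8 is sign-extended
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


dest, base, disp = decode_mov_disp8(CODE)
fault_addr = EAX + disp               # base register is eax in this oops

print(f"mov {dest}, [{base}+{disp:#x}]")
print(f"computed fault address: {fault_addr:#010x}")
print(f"matches CR2:            {fault_addr == CR2}")
print(f"below PAGE_OFFSET:      {fault_addr < PAGE_OFFSET}")
```

Under those assumptions the load reads through EAX at offset 0x28, and EAX holds a value below `PAGE_OFFSET`, which is not a valid kernel pointer. That is consistent with a corrupted pointer being followed in `__remove_inode_page`, though it does not by itself identify what corrupted it.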


Comment 5 RHEL Program Management 2007-10-19 19:01:13 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.

