From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
When the system is under high load for some time, the kernel oopses and the system hangs. It only occurs when the load is high and there is I/O on the SCSI interface.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.20-27.0.4

How reproducible:
Sometimes

Steps to Reproduce:
1. Boot the system
2. Start the simulation
3. Start a restore from tape

Actual Results: The system crashes.

Expected Results: The system should run fine.

Additional info:
The system was running fine with RH9. Problems started after upgrading to RHEL 3. The system is a single-CPU AMD Athlon running an SMP kernel. The RHEL 3 installer selected this kernel.
oops log from netdump:

Oops: 0000
ppp_async ppp_generic slhc nfsd lockd sunrpc netconsole autofs4 via-rhine mii crc32 e1000 e100 ide-scsi ide-cd cdrom st ext3 jbd raid1 DAC960 sym53c8xx sd_mod
CPU: 0
EIP: 0060:[<c0142ec5>] Not tainted
EFLAGS: 00010292

EIP is at __remove_inode_page [kernel] 0x15 (2.4.21-27.0.4.ELsmp/athlon)
eax: 39794000 ebx: c16fe008 ecx: 00000000 edx: 00000000
esi: c3847583 edi: 00000b7a ebp: c0396410 esp: c39cfcdc
ds: 0068 es: 0068 ss: 0068
Process kswapd (pid: 6, stackpage=c39cf000)
Stack: 00000000 c16fe008 c0395a40 c014f593 c16fe008 c0395a40 c1038030 c0396418
       00000282 ffffffff 00000000 c198e280 0000000a c0395a40 00000000 c0153cca
       c0153f3c c0396fc0 00000000 00000001 00000000 c0396fc4 00000000 00000030
Call Trace:
  [<c014f593>] reclaim_page [kernel] 0x313 (0xc39cfce8)
  [<c0153cca>] fixup_freespace [kernel] 0x2a (0xc39cfd18)
  [<c0153f3c>] __alloc_pages [kernel] 0x10c (0xc39cfd1c)
  [<c0153eb6>] __alloc_pages [kernel] 0x86 (0xc39cfd3c)
  [<c01cc4fb>] elevator_linus_merge [kernel] 0x27b (0xc39cfd48)
  [<c01ca008>] locate_hd_struct [kernel] 0x38 (0xc39cfd4c)
  [<c01ca1a7>] req_new_io [kernel] 0x67 (0xc39cfd64)
  [<c01541ac>] __get_free_pages [kernel] 0x1c (0xc39cfd80)
  [<c014d052>] kmem_cache_grow [kernel] 0xc2 (0xc39cfd84)
  [<c01ca8f1>] __make_request [kernel] 0x4f1 (0xc39cfd8c)
  [<c014de8d>] __kmem_cache_alloc [kernel] 0x6d (0xc39cfdac)
  [<f89e2362>] raid1_alloc_r1bh [raid1] 0x92 (0xc39cfdcc)
  [<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfde8)
  [<f89e2b71>] raid1_make_request [raid1] 0x41 (0xc39cfe14)
  [<c020b126>] md_make_request [kernel] 0x76 (0xc39cfe64)
  [<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfe78)
  [<c01cac59>] submit_bh_rsector [kernel] 0x49 (0xc39cfea4)
  [<c01637d2>] brw_page [kernel] 0xb2 (0xc39cfec0)
  [<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cfedc)
  [<c015316f>] rw_swap_page_base [kernel] 0xaf (0xc39cfee4)
  [<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cff34)
  [<c0153263>] rw_swap_page [kernel] 0x43 (0xc39cff3c)
  [<c0154bb8>] swap_writepage [kernel] 0x38 (0xc39cff50)
  [<c014fec9>] launder_page [kernel] 0x709 (0xc39cff60)
  [<c0151702>] rebalance_dirty_zone [kernel] 0xa2 (0xc39cff84)
  [<c0151cdb>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc39cffac)
  [<c0151e08>] kswapd [kernel] 0x68 (0xc39cffd0)
  [<c0151da0>] kswapd [kernel] 0x0 (0xc39cffe4)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cfff0)

Code: 8b 50 28 85 d2 75 5f 8b 43 04 8b 13 89 42 04 89 10 c7 43 04

CPU#0 is executing netdump.
< netdump activated - performing handshake with the client. >

Pid/TGid: 6/6, comm: kswapd
EIP: 0060:[<c0142ec5>] CPU: 0
EIP is at __remove_inode_page [kernel] 0x15 (2.4.21-27.0.4.ELsmp)
ESP: e008:00000000 EFLAGS: 00010292 Not tainted
EAX: 39794000 EBX: c16fe008 ECX: 00000000 EDX: 00000000
ESI: c3847583 EDI: 00000b7a EBP: c0396410 DS: 0068 ES: 0068 FS: 0000 GS: 0000
CR0: 8005003b CR2: 39794028 CR3: 00101000 CR4: 000006d0
Call Trace:
  [<c014f593>] reclaim_page [kernel] 0x313 (0xc39cfce8)
  [<c0153cca>] fixup_freespace [kernel] 0x2a (0xc39cfd18)
  [<c0153f3c>] __alloc_pages [kernel] 0x10c (0xc39cfd1c)
  [<c0153eb6>] __alloc_pages [kernel] 0x86 (0xc39cfd3c)
  [<c01cc4fb>] elevator_linus_merge [kernel] 0x27b (0xc39cfd48)
  [<c01ca008>] locate_hd_struct [kernel] 0x38 (0xc39cfd4c)
  [<c01ca1a7>] req_new_io [kernel] 0x67 (0xc39cfd64)
  [<c01541ac>] __get_free_pages [kernel] 0x1c (0xc39cfd80)
  [<c014d052>] kmem_cache_grow [kernel] 0xc2 (0xc39cfd84)
  [<c01ca8f1>] __make_request [kernel] 0x4f1 (0xc39cfd8c)
  [<c014de8d>] __kmem_cache_alloc [kernel] 0x6d (0xc39cfdac)
  [<f89e2362>] raid1_alloc_r1bh [raid1] 0x92 (0xc39cfdcc)
  [<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfde8)
  [<f89e2b71>] raid1_make_request [raid1] 0x41 (0xc39cfe14)
  [<c020b126>] md_make_request [kernel] 0x76 (0xc39cfe64)
  [<c01cabb7>] generic_make_request [kernel] 0xe7 (0xc39cfe78)
  [<c01cac59>] submit_bh_rsector [kernel] 0x49 (0xc39cfea4)
  [<c01637d2>] brw_page [kernel] 0xb2 (0xc39cfec0)
  [<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cfedc)
  [<c015316f>] rw_swap_page_base [kernel] 0xaf (0xc39cfee4)
  [<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cff34)
  [<c0153263>] rw_swap_page [kernel] 0x43 (0xc39cff3c)
  [<c0154bb8>] swap_writepage [kernel] 0x38 (0xc39cff50)
  [<c014fec9>] launder_page [kernel] 0x709 (0xc39cff60)
  [<c0151702>] rebalance_dirty_zone [kernel] 0xa2 (0xc39cff84)
  [<c0151cdb>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc39cffac)
  [<c0151e08>] kswapd [kernel] 0x68 (0xc39cffd0)
  [<c0151da0>] kswapd [kernel] 0x0 (0xc39cffe4)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cfff0)

free sibling
task PC stack pid father child younger older

init S C0424180 0 1 0 3 2 (NOTLB)
Call Trace:
  [<c0153f08>] __alloc_pages [kernel] 0xd8 (0xf7fa1ea4)
  [<c0121d76>] schedule [kernel] 0x176 (0xf7fa1eb8)
  [<c0132825>] schedule_timeout [kernel] 0x65 (0xf7fa1ee0)
  [<c01541ac>] __get_free_pages [kernel] 0x1c (0xf7fa1ee8)
  [<c0173201>] __pollwait [kernel] 0x31 (0xf7fa1eec)
  [<c01327b0>] process_timeout [kernel] 0x0 (0xf7fa1f00)
  [<c017348e>] do_select [kernel] 0x11e (0xf7fa1f18)
  [<c0173932>] sys_select [kernel] 0x352 (0xf7fa1f5c)

migration/0 S C0424180 5492 2 0 1 (L-TLB)
Call Trace:
  [<c0121d76>] schedule [kernel] 0x176 (0xc39c7f78)
  [<c0123560>] migration_task [kernel] 0x0 (0xc39c7f90)
  [<c01237f0>] migration_task [kernel] 0x290 (0xc39c7fa0)
  [<c0123560>] migration_task [kernel] 0x0 (0xc39c7fe0)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39c7ff0)

keventd S C0424180 0 3 1 4 (L-TLB)
Call Trace:
  [<c0121d76>] schedule [kernel] 0x176 (0xc37c5f64)
  [<c0139947>] context_thread [kernel] 0x117 (0xc37c5f8c)
  [<c0139830>] context_thread [kernel] 0x0 (0xc37c5fe0)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc37c5ff0)

kapmd S C0424180 0 4 1 5 3 (L-TLB)
Call Trace:
  [<c0121d76>] schedule [kernel] 0x176 (0xc37c3f20)
  [<c0132825>] schedule_timeout [kernel] 0x65 (0xc37c3f48)
  [<c01327b0>] process_timeout [kernel] 0x0 (0xc37c3f68)
  [<c011a940>] apm_mainloop [kernel] 0x60 (0xc37c3f80)
  [<c011b244>] apm [kernel] 0x1e4 (0xc37c3fcc)
  [<c011b060>] apm [kernel] 0x0 (0xc37c3fe4)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc37c3ff0)

ksoftirqd/0 S C0424180 0 5 1 8 4 (L-TLB)
Call Trace:
  [<c0121d76>] schedule [kernel] 0x176 (0xc37c1fa4)
  [<c012d9ff>] ksoftirqd [kernel] 0xaf (0xc37c1fcc)
  [<c012d950>] ksoftirqd [kernel] 0x0 (0xc37c1fe0)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc37c1ff0)

bdflush S C0424180 4468 8 1 6 5 (L-TLB)
Call Trace:
  [<c0121d76>] schedule [kernel] 0x176 (0xc39cbf7c)
  [<c0122485>] interruptible_sleep_on [kernel] 0x55 (0xc39cbfa4)
  [<c012d941>] __run_task_queue [kernel] 0x61 (0xc39cbfbc)
  [<c0164227>] bdflush [kernel] 0xe7 (0xc39cbfd4)
  [<c0164140>] bdflush [kernel] 0x0 (0xc39cbfe4)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cbff0)

kswapd R current 2932 6 1 7 8 (L-TLB)
Call Trace:
  [<c0154b80>] swap_writepage [kernel] 0x0 (0xc39cff34)
  [<c0153263>] rw_swap_page [kernel] 0x43 (0xc39cff3c)
  [<c0154bb8>] swap_writepage [kernel] 0x38 (0xc39cff50)
  [<c014fec9>] launder_page [kernel] 0x709 (0xc39cff60)
  [<c0151702>] rebalance_dirty_zone [kernel] 0xa2 (0xc39cff84)
  [<c0151cdb>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc39cffac)
  [<c0151e08>] kswapd [kernel] 0x68 (0xc39cffd0)
  [<c0151da0>] kswapd [kernel] 0x0 (0xc39cffe4)
  [<c01093ed>] kernel_thread_helper [kernel] 0x5 (0xc39cfff0)
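
The faulting access can be cross-checked from the register dump in the oops above. The sketch below is only an illustration, not part of the original report. It assumes that the "Code:" bytes begin at the faulting EIP (as 2.4 oops output usually prints them) and that the kernel uses the standard i386 3G/1G split with PAGE_OFFSET at 0xC0000000. The helper name decode_mov_load_disp8 is hypothetical and handles only the one instruction form needed here.

```python
# Register values and code bytes copied from the oops above.
EAX = 0x39794000
CR2 = 0x39794028  # faulting address reported in the netdump register dump
CODE = bytes.fromhex("8b 50 28 85 d2 75 5f 8b 43 04 8b 13 89 42 04 89 10 c7 43 04")

# Assumption: standard i386 3G/1G split, so kernel addresses start here.
PAGE_OFFSET = 0xC0000000

# 32-bit general-purpose registers in ModR/M encoding order.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_load_disp8(code):
    """Decode `8b /r` (mov r32, r/m32) in its [reg + disp8] form only."""
    opcode, modrm, disp = code[0], code[1], code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # mod == 01 selects an 8-bit displacement; rm == 100 would mean a SIB byte.
    if opcode != 0x8B or mod != 0b01 or rm == 0b100:
        raise ValueError("not a simple mov r32, [reg + disp8]")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


dest, base, disp = decode_mov_load_disp8(CODE)
fault_address = EAX + disp  # the base register decodes to eax for these bytes

print(f"mov {disp:#x}(%{base}),%{dest}")
print(f"computed fault address: {fault_address:#010x}, CR2: {CR2:#010x}")
print("base pointer is a kernel address:", EAX >= PAGE_OFFSET)
```

Under those assumptions, the first instruction is a load from offset 0x28 of the pointer held in eax, and the computed address equals the CR2 value in the dump. The pointer itself lies below PAGE_OFFSET, so kswapd was following a value that is not a kernel address.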
contents of /proc/cpuinfo:

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1200.073
cache size : 256 KB
physical id : 0
siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 2392.06
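
As a quick check of the single-CPU remark in the report, the /proc/cpuinfo output can be parsed into one record per processor. This is a minimal sketch: the function name parse_cpuinfo is hypothetical, and the sample text is an excerpt of the listing above rather than the full dump.

```python
# Excerpt of the /proc/cpuinfo listing above (not the full dump).
CPUINFO = """\
processor : 0
vendor_id : AuthenticAMD
model name : AMD Athlon(tm) Processor
cpu MHz : 1200.073
physical id : 0
siblings : 1
"""


def parse_cpuinfo(text):
    """Return a list of dicts, one per 'processor' entry."""
    cpus = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # blank separator lines between processors
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpus.append({})  # each 'processor' line starts a new record
        if cpus:
            cpus[-1][key] = value
    return cpus


cpus = parse_cpuinfo(CPUINFO)
print("logical processors:", len(cpus))
print("siblings on processor 0:", cpus[0]["siblings"])
```

On a live system the same function could be fed the contents of /proc/cpuinfo directly. One processor entry with one sibling is consistent with a uniprocessor machine that happens to be running the SMP kernel.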
Memory dump is also available but is ~700 MB when zipped.
This problem appears to be page list corruption. We have fixed a memory corruption problem that did cause corruption of the page lists, both in RHEL3-U6 and in an RHEL3-U5 security errata. You should really be running that kernel. Please grab it and let me know whether it does indeed fix this problem.

Larry Woodman
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.