Bug 55521

Summary: Oops in kernel-2.4.9-6enterprise. Hardware issue?
Product: [Retired] Red Hat Linux
Reporter: Need Real Name <robert_macaulay>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED ERRATA
QA Contact: Brock Organ <borgan>
Severity: medium
Priority: medium
Version: 7.3
CC: anwarpp, michael_e_brown
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2001-12-11 16:40:02 UTC
Attachments (Description, Flags):
SysrqM output with wait_on_irq dump    none
SysrqT output with wait_on_irq dump    none
System.map file from 2.4.9-0.18        none

Description Need Real Name 2001-11-01 16:17:49 UTC
Spin off of Bug 54700

We hit an oops after running 2.4.9-0.18 for about 14 days. The 
questionable memory has been swapped out with known good memory. We have 
not upgraded to 2.4.9-6 yet.  Does 2.4.9-6 have a fix for this oops? We 
will be upgrading it today most likely.

ksymoops 2.4.3 on i686 2.4.9-0.18enterprise.  

Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.9-0.18enterprise/ (default)
     -m /boot/System.map-2.4.9-0.18enterprise (specified)

Error (expand_objects): cannot stat(/lib/qla2x00.o) for qla2x00
Error (expand_objects): cannot stat(/lib/megaraid.o) for megaraid
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Error (pclose_local): find_objects pclose failed 0x100
Warning (compare_maps): mismatch on symbol partition_name  , ksyms_base says c01dfbf0, System.map says c0174810.  Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol socket_file_ops_R__ver_socket_file_ops not found in System.map.  Ignoring ksyms_base entry
Oops: 0000
CPU:    4
EIP:    0010:[<c013f587>]    Tainted: P
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010207
eax: 00000000   ebx: c4a6f974   ecx: c0292964   edx: db191eb0
esi: c4a6f958   edi: c0292964   ebp: 000564ae   esp: d0c27d90
ds: 0018   es: 0018   ss: 0018
Process db-unload (pid: 3662, stackpage=d0c27000)
Stack: 00000007 000000cf 00000001 00000001 000000d2 00000000 000000d2 c0140201
       000000d2 00000000 d0c26000 00000001 c0140388 000000d2 00000001 d0c26000
       c0141131 000000d2 00000001 000000d2 d056f450 00000000 c7e66c30 c01411ef
Call Trace: [<c0140201>] deactivate_page_Rsmp_f16413ac [] 0x2261
[<c0140388>] deactivate_page_Rsmp_f16413ac [] 0x23e8
[<c0141131>] _alloc_pages_Rsmp_686e8739 [] 0x1e1
[<c01411ef>] __alloc_pages_Rsmp_867830d7 [] 0xf
[<c01198e0>] do_BUG_Rsmp_577f4bff [] 0x110
[<c0136e9e>] filemap_nopage_Rsmp_ecbf8c86 [] 0x12e
[<c01574fc>] path_walk_Rsmp_931832bf [] 0xafc
[<c01198e0>] do_BUG_Rsmp_577f4bff [] 0x110
[<c0131a34>] vmtruncate_Rsmp_4f09e0fd [] 0x9e4
[<c01198e0>] do_BUG_Rsmp_577f4bff [] 0x110
[<c0131c0b>] vmtruncate_Rsmp_4f09e0fd [] 0xbbb
[<c015b01c>] vfs_follow_link_Rsmp_98ff6cea [] 0x17c
[<c0160de5>] dput_Rsmp_9946dfc7 [] 0x35
[<c01198e0>] do_BUG_Rsmp_577f4bff [] 0x110
[<c0119b16>] do_BUG_Rsmp_577f4bff [] 0x346
[<c0136867>] do_generic_file_read_Rsmp_051d766f [] 0x867
[<c016565a>] update_atime_Rsmp_82f08dd4 [] 0x4a
[<c013678f>] do_generic_file_read_Rsmp_051d766f [] 0x78f
[<c0160de5>] dput_Rsmp_9946dfc7 [] 0x35
[<c014a7a7>] fput_Rsmp_25a50b3a [] 0x77
[<c01291c2>] del_timer_sync_Rsmp_a0201047 [] 0xb02
[<c0114f69>] smp_call_function_Rsmp_0014bfd1 [] 0x929
[<c01198e0>] do_BUG_Rsmp_577f4bff [] 0x110
[<c0107878>] sys_sigaltstack_Rsmp_ab65536b [] 0x10c8
Code: f7 40 18 06 00 00 00 75 f0 8b 40 28 39 d0 75 f0 31 d2 85 d2


>>EIP; c013f586 <page_launder+206/a10>   <=====
Trace; c0140200 <do_try_to_free_pages+10/50>
Trace; c0140388 <try_to_free_pages+28/40>
Trace; c0141130 <_wrapped_alloc_pages+1c0/270>
Trace; c01411ee <__alloc_pages+e/a0>
Trace; c01198e0 <do_page_fault+0/5d0>
Trace; c0136e9e <filemap_nopage+12e/570>
Trace; c01574fc <path_walk+afc/bf0>
Trace; c01198e0 <do_page_fault+0/5d0>
Trace; c0131a34 <do_no_page+c4/1d0>
Trace; c01198e0 <do_page_fault+0/5d0>
Trace; c0131c0a <handle_mm_fault+ca/1c0>
Trace; c015b01c <vfs_follow_link+17c/1f0>
Trace; c0160de4 <dput+34/230>
Trace; c01198e0 <do_page_fault+0/5d0>
Trace; c0119b16 <do_page_fault+236/5d0>
Trace; c0136866 <file_read_actor+c6/100>
Trace; c016565a <update_atime+4a/50>
Trace; c013678e <do_generic_file_read+78e/7a0>
Trace; c0160de4 <dput+34/230>
Trace; c014a7a6 <fput+76/140>
Trace; c01291c2 <run_local_timers+a2/1a0>
Trace; c0114f68 <smp_apic_timer_interrupt+118/130>
Trace; c01198e0 <do_page_fault+0/5d0>
Trace; c0107878 <error_code+38/40>
Code;  c013f586 <page_launder+206/a10>
00000000 <_EIP>:
Code;  c013f586 <page_launder+206/a10>   <=====
   0:   f7 40 18 06 00 00 00      testl  $0x6,0x18(%eax)   <=====
Code;  c013f58c <page_launder+20c/a10>
   7:   75 f0                     jne    fffffff9 <_EIP+0xfffffff9> c013f57e <page_launder+1fe/a10>
Code;  c013f58e <page_launder+20e/a10>
   9:   8b 40 28                  mov    0x28(%eax),%eax
Code;  c013f592 <page_launder+212/a10>
   c:   39 d0                     cmp    %edx,%eax
Code;  c013f594 <page_launder+214/a10>
   e:   75 f0                     jne    0 <_EIP>
Code;  c013f596 <page_launder+216/a10>
  10:   31 d2                     xor    %edx,%edx
Code;  c013f598 <page_launder+218/a10>
  12:   85 d2                     test   %edx,%edx


2 warnings and 5 errors issued.  Results may not be reliable.
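
A note on reading the decode above: the flagged instruction is `testl $0x6,0x18(%eax)` and the register dump shows `eax: 00000000`, so the load targets address 0x18, which looks like a NULL pointer dereferenced at a field offset of 0x18. That matches the "virtual address 00000018" reported with the later oops in Comment 6. The arithmetic can be sketched in Python as below. The helper names `parse_registers` and `effective_address` are illustrative only and are not part of ksymoops, and the sketch handles only the simple `disp(%reg)` operand form seen here.

```python
import re

# Register dump and faulting instruction copied from the oops above.
REG_DUMP = """
eax: 00000000   ebx: c4a6f974   ecx: c0292964   edx: db191eb0
esi: c4a6f958   edi: c0292964   ebp: 000564ae   esp: d0c27d90
"""
FAULTING_INSN = "testl  $0x6,0x18(%eax)"


def parse_registers(dump):
    """Return {register name: value} from 'name: hexvalue' pairs."""
    pairs = re.findall(r"(\w+):\s+([0-9a-f]{8})", dump)
    return {name: int(value, 16) for name, value in pairs}


def effective_address(insn, regs):
    """Compute disp + base for an AT&T-syntax 'disp(%reg)' memory operand.

    Hypothetical helper: index/scale operands are not handled.
    """
    m = re.search(r"(0x[0-9a-f]+)?\(%(\w+)\)", insn)
    disp = int(m.group(1), 16) if m.group(1) else 0
    return (disp + regs[m.group(2)]) & 0xFFFFFFFF


regs = parse_registers(REG_DUMP)
fault_addr = effective_address(FAULTING_INSN, regs)
print(hex(fault_addr))
```

A result this close to zero suggests a NULL structure pointer rather than a wild address, but it does not by itself distinguish a software bug from corrupted memory.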

Comment 1 Need Real Name 2001-11-01 16:43:34 UTC
Created attachment 36042 [details]
SysrqM output with wait_on_irq dump

Comment 2 Need Real Name 2001-11-01 16:44:38 UTC
Created attachment 36043 [details]
SysrqT output with wait_on_irq dump

Comment 3 Need Real Name 2001-11-01 16:45:20 UTC
Created attachment 36044 [details]
System.map file from 2.4.9-0.18

Comment 4 Need Real Name 2001-11-02 10:45:51 UTC
The box crashed again tonight. We loaded 2.4.9-6enterprise on the box after last 
night's crash. The oops has now changed to the BUG posted below. Do you have any 
suggestions on what the cause could be? This same kernel has worked without any 
problems on other machines of the same class, RAM size, and controller cards. 
The only thing this machine has that the others don't is more swap
(9GB over 5 partitions). 

Could this be a hardware problem in the SCSI controller (qla2x00)? 

ksymoops 2.4.3 on i686 2.4.9-6enterprise.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.9-6enterprise/ (default)
     -m /boot/System.map-2.4.9-6enterprise (specified)

Error (expand_objects): cannot stat(/lib/qla2x00.o) for qla2x00
Error (expand_objects): cannot stat(/lib/megaraid.o) for megaraid
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Error (pclose_local): find_objects pclose failed 0x100
Warning (compare_maps): mismatch on symbol partition_name  , ksyms_base says c01c3b90, System.map says c01626f0.  Ignoring ksyms_base entry
kernel BUG at /usr/src/build/47107-i686/BUILD/kernel-2.4.9/linux/include/asm/pci.h:145!
invalid operand: 0000
CPU:    2
EIP:    0010:[<f882f7e8>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010086
eax: 00000058   ebx: 00000000   ecx: c02fe164   edx: 00005d92
esi: 00000002   edi: 00000002   ebp: 00000001   esp: ccca9ea8
ds: 0018   es: 0018   ss: 0018
Process bdflush (pid: 13, stackpage=ccca9000)
Stack: f88328c0 00000091 c040d600 00000000 f6232e00 00000046 00004000 00004000
       f6232f60 c041f760 f67a007c f67a007c f882b1ef f67a007c f6232f60 00000001
       000080e0 000080e0 f8830ffe f67a007c c041f760 f67a8160 00000000 000080e0
Call Trace: [<f88328c0>] __insmod_qla2x00_S.rodata_L8384 [qla2x00] 0x120
[<f882b1ef>] qla2100_next [qla2x00] 0x4f
[<f8830ffe>] qla2100_restart_queues [qla2x00] 0xce
[<f8829ddc>] qla2100_queuecommand [qla2x00] 0x1dc
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<f88006cc>] scsi_release_command_Rsmp_394750ec [scsi_mod] 0x2fc
[<f8806cf0>] scsi_sleep_Rsmp_35962bf8 [scsi_mod] 0x1680
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<f8808783>] scsi_io_completion_Rsmp_adfcb909 [scsi_mod] 0x803
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<c018e4ab>] generic_unplug_device [kernel] 0x2b
[<c011fffd>] __run_task_queue [kernel] 0x5d
[<c01433b1>] bdflush [kernel] 0xc1
[<c0105000>] stext [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0105866>] kernel_thread [kernel] 0x26
[<c01432f0>] bdflush [kernel] 0x0
Code: 0f 0b 58 5a 8b 14 24 8b 04 1a 85 c0 74 11 05 00 00 00 40 31

>>EIP; f882f7e8 <[qla2x00]qla2100_32bit_start_scsi+d8/6f0>   <=====
Trace; f88328c0 <[qla2x00].rodata.start+120/20be>
Trace; f882b1ee <[qla2x00]qla2100_next+4e/c0>
Trace; f8830ffe <[qla2x00]qla2100_restart_queues+ce/250>
Trace; f8829ddc <[qla2x00]qla2100_queuecommand+1dc/1f0>
Trace; f881cd60 <[sd_mod]sd_template+0/0>
Trace; f88006cc <[scsi_mod]__kstrtab_scsi_deregister_blocked_host+c/40>
Trace; f8806cf0 <[scsi_mod]scsi_old_done+0/680>
Trace; f881cd60 <[sd_mod]sd_template+0/0>
Trace; f8808782 <[scsi_mod]scsi_request_fn+312/370>
Trace; f881cd60 <[sd_mod]sd_template+0/0>
Trace; c018e4aa <generic_unplug_device+2a/40>
Trace; c011fffc <__run_task_queue+5c/70>
Trace; c01433b0 <bdflush+c0/e0>
Trace; c0105000 <_stext+0/0>
Trace; c0105000 <_stext+0/0>
Trace; c0105866 <kernel_thread+26/30>
Trace; c01432f0 <bdflush+0/e0>
Code;  f882f7e8 <[qla2x00]qla2100_32bit_start_scsi+d8/6f0>
00000000 <_EIP>:
Code;  f882f7e8 <[qla2x00]qla2100_32bit_start_scsi+d8/6f0>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  f882f7ea <[qla2x00]qla2100_32bit_start_scsi+da/6f0>
   2:   58                        pop    %eax
Code;  f882f7ea <[qla2x00]qla2100_32bit_start_scsi+da/6f0>
   3:   5a                        pop    %edx
Code;  f882f7ec <[qla2x00]qla2100_32bit_start_scsi+dc/6f0>
   4:   8b 14 24                  mov    (%esp,1),%edx
Code;  f882f7ee <[qla2x00]qla2100_32bit_start_scsi+de/6f0>
   7:   8b 04 1a                  mov    (%edx,%ebx,1),%eax
Code;  f882f7f2 <[qla2x00]qla2100_32bit_start_scsi+e2/6f0>
   a:   85 c0                     test   %eax,%eax
Code;  f882f7f4 <[qla2x00]qla2100_32bit_start_scsi+e4/6f0>
   c:   74 11                     je     1f <_EIP+0x1f> f882f806 <[qla2x00]qla2100_32bit_start_scsi+f6/6f0>
Code;  f882f7f6 <[qla2x00]qla2100_32bit_start_scsi+e6/6f0>
   e:   05 00 00 00 40            add    $0x40000000,%eax
Code;  f882f7fa <[qla2x00]qla2100_32bit_start_scsi+ea/6f0>
  13:   31 00                     xor    %eax,(%eax)


1 warning and 5 errors issued.  Results may not be reliable.
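
To see at a glance which module each frame of a decoded trace belongs to, the `>>EIP;` and `Trace;` lines can be parsed mechanically. The following Python sketch does that for a subset of the lines above. The regular expression and the `parse_frames` helper are assumptions made for illustration, not a ksymoops feature, and frames without a `[module]` prefix are counted as "kernel". Since ksymoops could not stat the module objects, the module symbol names it printed may themselves be unreliable.

```python
import re
from collections import Counter

# A subset of the decoded lines from the BUG report above.
SAMPLE = """\
>>EIP; f882f7e8 <[qla2x00]qla2100_32bit_start_scsi+d8/6f0>   <=====
Trace; f882b1ee <[qla2x00]qla2100_next+4e/c0>
Trace; f8830ffe <[qla2x00]qla2100_restart_queues+ce/250>
Trace; f8829ddc <[qla2x00]qla2100_queuecommand+1dc/1f0>
Trace; f8808782 <[scsi_mod]scsi_request_fn+312/370>
Trace; c018e4aa <generic_unplug_device+2a/40>
Trace; c011fffc <__run_task_queue+5c/70>
Trace; c01433b0 <bdflush+c0/e0>
"""

# address <[module]symbol+offset/size>, with the [module] part optional.
LINE_RE = re.compile(
    r"^(?:>>EIP|Trace);\s+([0-9a-f]{8})\s+"
    r"<(?:\[(\w+)\])?([\w.]+)\+([0-9a-f]+)/([0-9a-f]+)>"
)


def parse_frames(text):
    """Return one dict per decoded frame, in the order they appear."""
    frames = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        addr, module, symbol, offset, size = m.groups()
        frames.append({
            "addr": int(addr, 16),
            "module": module or "kernel",
            "symbol": symbol,
            "offset": int(offset, 16),
            "size": int(size, 16),
        })
    return frames


frames = parse_frames(SAMPLE)
by_module = Counter(frame["module"] for frame in frames)
for module, count in by_module.most_common():
    print(module, count)
```

For this sample the faulting address and the three innermost frames all fall in qla2x00, which is why the driver and its card came under suspicion.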


Comment 5 Need Real Name 2001-11-12 22:13:50 UTC
Another box has crashed with the 2.4.9-6 kernel. It didn't have a serial 
console, and died with an "Aiee - in interrupt handler" message.

The keyboard that is plugged in does not generate the proper PageUp code, so I 
could not scroll back to get the complete oops. The lines on the screen 
indicated that it was in something related to qla2100_* from the qla2x00 driver. 

There was nothing in /var/log/messages. This machine has the same qlogic 2200 
cards in it. Sorry this is not much to go on. I'll attach a serial console to 
this box in case it dies again.


Comment 6 Need Real Name 2001-11-15 06:43:46 UTC
This has become most frustrating. I am becoming more and more convinced it's a 
hardware problem. We've already replaced the RAM modules. Can you validate my 
suspicions? The box crashed again tonight. The oops is at the bottom of this 
comment. (Forget about the dev box from the comment above; it's OK now, and 
does have a serial console just in case.) Since the last oops on this box, we 
have swapped out one of the 2 qlogic cards, because that oops occurred in 
qla2100_32bit_start_scsi.

We had to choose between the 2 cards. Do the errors reported above seem to have 
any source other than dying hardware? The crash takes 12-14 days to 
occur. The 2 hardware targets left for me to replace would be the 
remaining qlogic card and the actual memory cards (not the RAM). The OS disk 
controller is barely used during the time of the crashes. 

Any guesses are appreciated. Thanks a lot for your work on this.

ksymoops 2.4.3 on i686 2.4.9-6enterprise.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.9-6enterprise/ (default)
     -m /boot/System.map-2.4.9-6enterprise (specified)

Error (expand_objects): cannot stat(/lib/qla2x00.o) for qla2x00
Error (expand_objects): cannot stat(/lib/megaraid.o) for megaraid
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Error (pclose_local): find_objects pclose failed 0x100
Warning (compare_maps): mismatch on symbol partition_name  , ksyms_base says c01c3b90, System.map says c01626f0.  Ignoring ksyms_base entry
Unable to handle kernel NULL pointer dereference at virtual address 00000018
c0135ec7
*pde = 2d5ed001
Oops: 0000
CPU:    7
EIP:    0010:[<c0135ec7>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010207
eax: 00000000   ebx: c470ec98   ecx: c02ffb20   edx: ed86ea40
esi: c470ec7c   edi: 00000000   ebp: 00000007   esp: d18b5ea4
ds: 0018   es: 0018   ss: 0018
Process unitool (pid: 18091, stackpage=d18b5000)
Stack: 00000000 fffffdbe 000007aa 0009c98a 00000001 00000001 000000d2 00000000
       000000d2 c01369b1 000000d2 00000000 d18b4000 00000001 c0136b48 000000d2
       00000001 d18b4000 c0137851 000000d2 00000001 000000d2 f710b34c 00000000
Call Trace: [<c01369b1>] do_try_to_free_pages [kernel] 0x11
[<c0136b48>] try_to_free_pages [kernel] 0x28
[<c0137851>] _wrapped_alloc_pages [kernel] 0x1c1
[<c013790f>] __alloc_pages [kernel] 0xf
[<c013112c>] generic_file_write [kernel] 0x35c
[<c013e786>] sys_write [kernel] 0x96
[<c010719b>] system_call [kernel] 0x33
Code: f7 40 18 06 00 00 00 75 f0 8b 40 28 39 d0 75 f0 31 d2 85 d2

>>EIP; c0135ec6 <page_launder+216/940>   <=====
Trace; c01369b0 <do_try_to_free_pages+10/50>
Trace; c0136b48 <try_to_free_pages+28/40>
Trace; c0137850 <_wrapped_alloc_pages+1c0/270>
Trace; c013790e <__alloc_pages+e/a0>
Trace; c013112c <generic_file_write+35c/660>
Trace; c013e786 <sys_write+96/d0>
Trace; c010719a <system_call+32/38>
Code;  c0135ec6 <page_launder+216/940>
00000000 <_EIP>:
Code;  c0135ec6 <page_launder+216/940>   <=====
   0:   f7 40 18 06 00 00 00      testl  $0x6,0x18(%eax)   <=====
Code;  c0135ecc <page_launder+21c/940>
   7:   75 f0                     jne    fffffff9 <_EIP+0xfffffff9> c0135ebe <page_launder+20e/940>
Code;  c0135ece <page_launder+21e/940>
   9:   8b 40 28                  mov    0x28(%eax),%eax
Code;  c0135ed2 <page_launder+222/940>
   c:   39 d0                     cmp    %edx,%eax
Code;  c0135ed4 <page_launder+224/940>
   e:   75 f0                     jne    0 <_EIP>
Code;  c0135ed6 <page_launder+226/940>
  10:   31 d2                     xor    %edx,%edx
Code;  c0135ed8 <page_launder+228/940>
  12:   85 d2                     test   %edx,%edx
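
The `Code:` bytes of this oops are identical to those of the 2.4.9-0.18 oops in the Description, even though the offset within page_launder differs between the two kernels (+206/a10 there, +216/940 here). A small Python sketch can confirm the match and decode the first instruction by hand. The variable names are illustrative, and the decoding covers only this one encoding (opcode 0xf7 with /0, i.e. TEST r/m32, imm32, with an 8-bit displacement), not general x86.

```python
# "Code:" lines copied from the two page_launder oopses in this report.
CODE_2_4_9_0_18 = "f7 40 18 06 00 00 00 75 f0 8b 40 28 39 d0 75 f0 31 d2 85 d2"
CODE_2_4_9_6 = "f7 40 18 06 00 00 00 75 f0 8b 40 28 39 d0 75 f0 31 d2 85 d2"


def to_bytes(hex_dump):
    """Convert a space-separated hex dump into a bytes object."""
    return bytes.fromhex(hex_dump.replace(" ", ""))


old_code = to_bytes(CODE_2_4_9_0_18)
new_code = to_bytes(CODE_2_4_9_6)
same_code = old_code == new_code

# Hand-decode the first instruction: f7 40 18 06 00 00 00
opcode, modrm, disp8 = new_code[0], new_code[1], new_code[2]
mod = modrm >> 6           # 1 means an 8-bit displacement follows
reg = (modrm >> 3) & 0x7   # 0 selects TEST within the 0xf7 opcode group
rm = modrm & 0x7           # 0 means the base register is %eax
imm32 = int.from_bytes(new_code[3:7], "little")

print(same_code, hex(opcode), mod, reg, rm, hex(disp8), hex(imm32))
```

Both kernels therefore faulted on the same `testl $0x6,0x18(%eax)` with `eax` equal to zero, which points at the same code path in page_launder on both occasions.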
 


Comment 7 Arjan van de Ven 2001-11-15 10:54:50 UTC
It could be a hardware bug, but it could just as well be a qlogic driver bug.

Comment 8 Need Real Name 2001-11-15 17:57:39 UTC
Tomorrow, we will be shutting down this box and bringing up another 8-way in 
its place. It is a totally separate system with the exception of the FC disk 
array. If the replacement runs fine, it's a HW problem. If it crashes, it's the 
kernel. This should help nail the problem down.

Comment 9 Michael K. Johnson 2001-12-06 17:04:41 UTC
What was the result of the experiment?

Comment 10 Need Real Name 2001-12-06 17:51:04 UTC
We haven't hit a streak of 14+ days yet. The box we swapped with had slower 
processors in it causing the app to take too long to run, missing SLAs. We 
upgraded the CPUs in the box. After this, we ran into a tcp bug, which I will 
document in an additional bug, posted here soon. Currently, the box is running 
2.4.15aa1, because this fixed the tcp bug.

Comment 11 Need Real Name 2001-12-06 18:09:40 UTC
TCP bug is Bug #57189

Comment 12 Matt Domsch 2001-12-11 16:34:27 UTC
Robert, it sounds like you can close this now, right?


Comment 13 Need Real Name 2001-12-11 16:39:57 UTC
We can close this, since the box in question is running a new custom kernel. 
Other machines running 2.4.9-6 seem stable, but they really don't utilize the 
qla2x00 driver as hard as the machine in question does.



Comment 14 Need Real Name 2002-05-16 20:09:06 UTC
Newer errata kernels do not exhibit this behaviour.