Bug 55521
| Field | Value |
|---|---|
| Summary | Oops in kernel-2.4.9-6enterprise. Hardware issue? |
| Product | [Retired] Red Hat Linux |
| Component | kernel |
| Version | 7.3 |
| Hardware | i386 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Need Real Name <robert_macaulay> |
| Assignee | Arjan van de Ven <arjanv> |
| QA Contact | Brock Organ <borgan> |
| CC | anwarpp, michael_e_brown |
| Doc Type | Bug Fix |
| Last Closed | 2001-12-11 16:40:02 UTC |
Description
Need Real Name
2001-11-01 16:17:49 UTC
Created attachment 36042 [details]
SysrqM output with wait_on_irq dump
Created attachment 36043 [details]
SysrqT output with wait_on_irq dump
Created attachment 36044 [details]
System.map file from 2.4.9-0.18
The box crashed again tonight. We loaded 2.4.9-6enterprise on the box after last
night's crash. This time the oops changed to a BUG, posted below. Do you have any
suggestions on what the cause could be? This same kernel has worked on other
machines of the same class, RAM size, and controller cards without any
problems. The only thing this machine has that the others don't is more swap
(9GB over 5 partitions).
Could this be a hardware problem in the SCSI controller that uses the qla2x00 driver?
ksymoops 2.4.3 on i686 2.4.9-6enterprise. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.9-6enterprise/ (default)
-m /boot/System.map-2.4.9-6enterprise (specified)
Error (expand_objects): cannot stat(/lib/qla2x00.o) for qla2x00
Error (expand_objects): cannot stat(/lib/megaraid.o) for megaraid
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Error (pclose_local): find_objects pclose failed 0x100
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says
c01c3b90, System.map says c01626f0. Ignoring ksyms_base entry
kernel BUG at /usr/src/build/47107-i686/BUILD/kernel-2.4.9/linux/include/asm/pci.h:145!
invalid operand: 0000
CPU: 2
EIP: 0010:[<f882f7e8>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010086
eax: 00000058 ebx: 00000000 ecx: c02fe164 edx: 00005d92
esi: 00000002 edi: 00000002 ebp: 00000001 esp: ccca9ea8
ds: 0018 es: 0018 ss: 0018
Process bdflush (pid: 13, stackpage=ccca9000)
Stack: f88328c0 00000091 c040d600 00000000 f6232e00 00000046 00004000 00004000
f6232f60 c041f760 f67a007c f67a007c f882b1ef f67a007c f6232f60 00000001
000080e0 000080e0 f8830ffe f67a007c c041f760 f67a8160 00000000 000080e0
Call Trace: [<f88328c0>] __insmod_qla2x00_S.rodata_L8384 [qla2x00] 0x120
[<f882b1ef>] qla2100_next [qla2x00] 0x4f
[<f8830ffe>] qla2100_restart_queues [qla2x00] 0xce
[<f8829ddc>] qla2100_queuecommand [qla2x00] 0x1dc
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<f88006cc>] scsi_release_command_Rsmp_394750ec [scsi_mod] 0x2fc
[<f8806cf0>] scsi_sleep_Rsmp_35962bf8 [scsi_mod] 0x1680
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<f8808783>] scsi_io_completion_Rsmp_adfcb909 [scsi_mod] 0x803
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<c018e4ab>] generic_unplug_device [kernel] 0x2b
[<c011fffd>] __run_task_queue [kernel] 0x5d
[<c01433b1>] bdflush [kernel] 0xc1
[<c0105000>] stext [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0105866>] kernel_thread [kernel] 0x26
[<c01432f0>] bdflush [kernel] 0x0
Code: 0f 0b 58 5a 8b 14 24 8b 04 1a 85 c0 74 11 05 00 00 00 40 31
>>EIP; f882f7e8 <[qla2x00]qla2100_32bit_start_scsi+d8/6f0> <=====
Trace; f88328c0 <[qla2x00].rodata.start+120/20be>
Trace; f882b1ee <[qla2x00]qla2100_next+4e/c0>
Trace; f8830ffe <[qla2x00]qla2100_restart_queues+ce/250>
Trace; f8829ddc <[qla2x00]qla2100_queuecommand+1dc/1f0>
Trace; f881cd60 <[sd_mod]sd_template+0/0>
Trace; f88006cc <[scsi_mod]__kstrtab_scsi_deregister_blocked_host+c/40>
Trace; f8806cf0 <[scsi_mod]scsi_old_done+0/680>
Trace; f881cd60 <[sd_mod]sd_template+0/0>
Trace; f8808782 <[scsi_mod]scsi_request_fn+312/370>
Trace; f881cd60 <[sd_mod]sd_template+0/0>
Trace; c018e4aa <generic_unplug_device+2a/40>
Trace; c011fffc <__run_task_queue+5c/70>
Trace; c01433b0 <bdflush+c0/e0>
Trace; c0105000 <_stext+0/0>
Trace; c0105000 <_stext+0/0>
Trace; c0105866 <kernel_thread+26/30>
Trace; c01432f0 <bdflush+0/e0>
Code; f882f7e8 <[qla2x00]qla2100_32bit_start_scsi+d8/6f0>
00000000 <_EIP>:
Code; f882f7e8 <[qla2x00]qla2100_32bit_start_scsi+d8/6f0> <=====
0: 0f 0b ud2a <=====
Code; f882f7ea <[qla2x00]qla2100_32bit_start_scsi+da/6f0>
2: 58 pop %eax
Code; f882f7ea <[qla2x00]qla2100_32bit_start_scsi+da/6f0>
3: 5a pop %edx
Code; f882f7ec <[qla2x00]qla2100_32bit_start_scsi+dc/6f0>
4: 8b 14 24 mov (%esp,1),%edx
Code; f882f7ee <[qla2x00]qla2100_32bit_start_scsi+de/6f0>
7: 8b 04 1a mov (%edx,%ebx,1),%eax
Code; f882f7f2 <[qla2x00]qla2100_32bit_start_scsi+e2/6f0>
a: 85 c0 test %eax,%eax
Code; f882f7f4 <[qla2x00]qla2100_32bit_start_scsi+e4/6f0>
c: 74 11 je 1f <_EIP+0x1f> f882f806 <[qla2x00]qla2100_32bit_start_scsi+f6/6f0>
Code; f882f7f6 <[qla2x00]qla2100_32bit_start_scsi+e6/6f0>
e: 05 00 00 00 40 add $0x40000000,%eax
Code; f882f7fa <[qla2x00]qla2100_32bit_start_scsi+ea/6f0>
13: 31 00 xor %eax,(%eax)
1 warning and 5 errors issued. Results may not be reliable.
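As a reading aid for the trace above, here is a minimal sketch that groups the raw "Call Trace" frames by the module named in square brackets. The helper name `parse_frames` and the regular expression are illustrative assumptions, not part of ksymoops. The sketch also checks one piece of arithmetic: the second word of the stack dump, 00000091, is 145 in decimal, which is the line number in the BUG message. Whether that word really is a leftover argument of the message print is an assumption, not something the report confirms.

```python
import re
from collections import Counter

# Raw "Call Trace" lines copied from the first oops in this report.
CALL_TRACE = """\
[<f88328c0>] __insmod_qla2x00_S.rodata_L8384 [qla2x00] 0x120
[<f882b1ef>] qla2100_next [qla2x00] 0x4f
[<f8830ffe>] qla2100_restart_queues [qla2x00] 0xce
[<f8829ddc>] qla2100_queuecommand [qla2x00] 0x1dc
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<f88006cc>] scsi_release_command_Rsmp_394750ec [scsi_mod] 0x2fc
[<f8806cf0>] scsi_sleep_Rsmp_35962bf8 [scsi_mod] 0x1680
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<f8808783>] scsi_io_completion_Rsmp_adfcb909 [scsi_mod] 0x803
[<f881cd60>] __insmod_sd_mod_S.rodata_L1920 [sd_mod] 0x8e0
[<c018e4ab>] generic_unplug_device [kernel] 0x2b
[<c011fffd>] __run_task_queue [kernel] 0x5d
[<c01433b1>] bdflush [kernel] 0xc1
[<c0105000>] stext [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0105866>] kernel_thread [kernel] 0x26
[<c01432f0>] bdflush [kernel] 0x0
"""

# Hypothetical pattern for one frame: [<address>] symbol [module] offset
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[(\w+)\]\s+(0x[0-9a-f]+)"
)


def parse_frames(text):
    """Return a list of (address, symbol, module, offset) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            addr, symbol, module, offset = match.groups()
            frames.append((int(addr, 16), symbol, module, int(offset, 16)))
    return frames


frames = parse_frames(CALL_TRACE)
per_module = Counter(module for _, _, module, _ in frames)

# Second word of the "Stack:" dump, compared with the line in "pci.h:145".
second_stack_word = int("00000091", 16)
bug_line = 145

for module, count in per_module.most_common():
    print(f"{module}: {count} frame(s)")
print("stack word matches BUG line:", second_stack_word == bug_line)
```

Of the 17 frames, 4 resolve into qla2x00 and sit at the top of the trace, consistent with the EIP in qla2100_32bit_start_scsi. This narrows where the BUG fired, but it does not by itself distinguish a driver bug from failing hardware.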
Another box has crashed with the 2.4.9-6 kernel. It didn't have a serial console, and died with an "Aiee - in interrupt handler" message. The keyboard that is plugged in does not generate the proper PageUp code, so I could not scroll back to get the complete oops. The lines on the screen indicated it was in something related to qla2100_* from the qla2x00 driver. There was nothing in /var/log/messages. This machine has the same QLogic 2200 cards in it. Sorry this is not much to go on. I'll attach a serial console to this box in case it dies again.
This has become most frustrating. I am becoming more and more convinced it's a
hardware problem. We've already replaced the RAM modules. Can you validate my
suspicions? The box crashed again tonight. The oops is at the bottom of this
comment. Since the last oops on this box (forget about the dev box from the
comment above; it's OK now, and does have a serial console just in case), we
have swapped out one of the 2 QLogic cards in the box, because the last oops
occurred in qla2100_32bit_start_scsi.
We had to choose between the 2 cards. Do the errors reported above seem to have
any source other than dying hardware? The crash takes 12-14 days to
occur. The 2 hardware targets left for me to replace would be the
remaining QLogic card and the actual memory cards (not the RAM). The OS disk
controller is barely used during the time of the crashes.
Any guesses are appreciated. Thanks a lot for your work on this.
ksymoops 2.4.3 on i686 2.4.9-6enterprise. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.9-6enterprise/ (default)
-m /boot/System.map-2.4.9-6enterprise (specified)
Error (expand_objects): cannot stat(/lib/qla2x00.o) for qla2x00
Error (expand_objects): cannot stat(/lib/megaraid.o) for megaraid
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Error (pclose_local): find_objects pclose failed 0x100
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says
c01c3b90, System.map says c01626f0. Ignoring ksyms_base entry
Unable to handle kernel NULL pointer dereference at virtual address 00000018
c0135ec7
*pde = 2d5ed001
Oops: 0000
CPU: 7
EIP: 0010:[<c0135ec7>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010207
eax: 00000000 ebx: c470ec98 ecx: c02ffb20 edx: ed86ea40
esi: c470ec7c edi: 00000000 ebp: 00000007 esp: d18b5ea4
ds: 0018 es: 0018 ss: 0018
Process unitool (pid: 18091, stackpage=d18b5000)
Stack: 00000000 fffffdbe 000007aa 0009c98a 00000001 00000001 000000d2 00000000
000000d2 c01369b1 000000d2 00000000 d18b4000 00000001 c0136b48 000000d2
00000001 d18b4000 c0137851 000000d2 00000001 000000d2 f710b34c 00000000
Call Trace: [<c01369b1>] do_try_to_free_pages [kernel] 0x11
[<c0136b48>] try_to_free_pages [kernel] 0x28
[<c0137851>] _wrapped_alloc_pages [kernel] 0x1c1
[<c013790f>] __alloc_pages [kernel] 0xf
[<c013112c>] generic_file_write [kernel] 0x35c
[<c013e786>] sys_write [kernel] 0x96
[<c010719b>] system_call [kernel] 0x33
Code: f7 40 18 06 00 00 00 75 f0 8b 40 28 39 d0 75 f0 31 d2 85 d2
>>EIP; c0135ec6 <page_launder+216/940> <=====
Trace; c01369b0 <do_try_to_free_pages+10/50>
Trace; c0136b48 <try_to_free_pages+28/40>
Trace; c0137850 <_wrapped_alloc_pages+1c0/270>
Trace; c013790e <__alloc_pages+e/a0>
Trace; c013112c <generic_file_write+35c/660>
Trace; c013e786 <sys_write+96/d0>
Trace; c010719a <system_call+32/38>
Code; c0135ec6 <page_launder+216/940>
00000000 <_EIP>:
Code; c0135ec6 <page_launder+216/940> <=====
0: f7 40 18 06 00 00 00 testl $0x6,0x18(%eax) <=====
Code; c0135ecc <page_launder+21c/940>
7: 75 f0 jne fffffff9 <_EIP+0xfffffff9> c0135ebe <page_launder+20e/940>
Code; c0135ece <page_launder+21e/940>
9: 8b 40 28 mov 0x28(%eax),%eax
Code; c0135ed2 <page_launder+222/940>
c: 39 d0 cmp %edx,%eax
Code; c0135ed4 <page_launder+224/940>
e: 75 f0 jne 0 <_EIP>
Code; c0135ed6 <page_launder+226/940>
10: 31 d2 xor %edx,%edx
Code; c0135ed8 <page_launder+228/940>
12: 85 d2 test %edx,%edx
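The fault address in this second oops can be cross-checked against the register dump. The sketch below assumes the faulting instruction is the one ksymoops marks with `<=====`, `testl $0x6,0x18(%eax)`, and that the registers are as printed in the oops. The helper `effective_address` is a hypothetical name used only for illustration.

```python
# Register values copied from the second oops in this report.
registers = {
    "eax": 0x00000000,
    "ebx": 0xC470EC98,
    "ecx": 0xC02FFB20,
    "edx": 0xED86EA40,
}


def effective_address(base_reg, displacement, regs):
    """Address of a disp(%reg) memory operand, wrapped to 32 bits."""
    return (regs[base_reg] + displacement) & 0xFFFFFFFF


# testl $0x6,0x18(%eax): the memory operand is at eax + 0x18.
fault = effective_address("eax", 0x18, registers)
reported = int("00000018", 16)  # "virtual address 00000018" in the oops

print(f"computed fault address: {fault:08x}")
print("matches reported address:", fault == reported)
```

With eax equal to zero, the operand address is 0x18, matching the reported "NULL pointer dereference at virtual address 00000018". In other words, page_launder read a field at offset 0x18 through a NULL pointer. That shows what faulted, not why the pointer was NULL.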
It can be a hardware bug, but it could just as well be a qlogic driver bug.
Tomorrow, we will be shutting down this box and bringing up another 8-way in its place. It is a totally separate system with the exception of the FC disk array. If this runs fine, it's a HW problem. If it crashes, it's the kernel. This will hopefully help nail this down better.
What was the result of the experiment?
We haven't hit a streak of 14+ days yet. The box we swapped in had slower processors, which caused the app to take too long to run and miss its SLAs. We upgraded the CPUs in the box. After this, we ran into a TCP bug, which I will document in an additional bug, posted here soon. Currently, the box is running 2.4.15aa1, because this fixed the TCP bug.
The TCP bug is Bug #57189.
Robert, it sounds like you can close this now, right?
We can close this, since the box in question is running a new custom kernel. Other machines running 2.4.9-6 seem stable, but they really don't utilize the qla2x00 driver as hard as the machine in question does.
Newer errata kernels do not exhibit this behaviour.