Bug 438878 - megaraid SAS driver corruption on 2.6.24.3-29.el5rt.i386 kernel
Summary: megaraid SAS driver corruption on 2.6.24.3-29.el5rt.i386 kernel
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michal Schmidt
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-03-25 18:06 UTC by Gurhan Ozen
Modified: 2008-05-27 22:04 UTC

Fixed In Version: 2.6.24.7-47.el5rt
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-27 22:04:26 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg messages (18.53 KB, text/plain)
2008-03-25 18:06 UTC, Gurhan Ozen
ioapic quirk ported to i386 (5.16 KB, patch)
2008-03-27 15:46 UTC, Michal Schmidt
ioapic quirk for i386 (5.63 KB, patch)
2008-04-14 11:34 UTC, Michal Schmidt

Description Gurhan Ozen 2008-03-25 18:06:50 UTC
Description of problem:
This bug is eerily similar to #250266. When running 2.6.24.3-29.el5rt.i386 on a
dell-pe1950 machine, the megasas driver gets corrupted after the machine does some
work. You can make it happen with:
cat /dev/sdX > /dev/null 

If the kernel is booted with noapic option, this doesn't happen. I am attaching
portions of dmesg with the relevant messages.
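
When comparing runs, it may help to confirm whether the running kernel was actually booted with noapic. The following is only a sketch: has_boot_option is a hypothetical helper name, not something shipped with the kernel or the distribution, and it assumes the option appears as a separate space-delimited word on the command line.

```shell
# Hypothetical helper (not part of any package): succeeds if the kernel
# command line contains the given option as a separate word.
# $1 = option name, $2 = command line string (defaults to /proc/cmdline).
has_boot_option() {
    opt=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $opt "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "has_boot_option noapic && echo 'booted with noapic'" reports on the running kernel.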


Version-Release number of selected component (if applicable):
2.6.24.3-29.el5rt.i386

How reproducible:
Very
Steps to Reproduce:
1. Boot the 2.6.24.3-29.el5rt.i386 kernel on hardware that has a megasas
controller.
2. Run "cat /dev/sdX > /dev/null".
  
Actual results:


Expected results:


Additional info:

Comment 1 Gurhan Ozen 2008-03-25 18:06:50 UTC
Created attachment 299054
dmesg messages

Comment 2 Michal Schmidt 2008-03-25 18:34:16 UTC
Note to myself: port preempt-irqs-x86-64-ioapic-mask-quirk-jcm.patch to i386 and
see if it fixes it.

Comment 3 Michal Schmidt 2008-03-27 15:46:58 UTC
Created attachment 299347
ioapic quirk ported to i386

Straightforward port of preempt-irqs-x86-64-ioapic-mask-quirk-jcm.patch for
32-bit kernels. I'm testing on dell-pe1850-01 in RHTS (I could not find a
pe1950).

With 2.6.24.4-30.el5rt I was getting:
megaraid: aborting-6695 cmd=2a <c=1 t=0 l=0>
megaraid abort: 6695:13[255:128], fw owner
megaraid: aborting-6696 cmd=2a <c=1 t=0 l=0>
megaraid abort: 6696:9[255:128], fw owner
megaraid: aborting-6699 cmd=2a <c=1 t=0 l=0>
megaraid abort: 6699:0[255:128], fw owner
megaraid: aborting-6700 cmd=28 <c=1 t=0 l=0>
megaraid abort: 6700:18[255:128], fw owner
megaraid: 4 outstanding commands. Max wait 300 sec
megaraid mbox: Wait for 2 commands to complete:300
megaraid mbox: reset sequence completed sucessfully

With the patch it hasn't happened yet. Scratch kernel build with the patch:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1231815
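
To compare a patched and an unpatched run, the abort messages like the ones quoted above can be counted in a saved dmesg capture. This is a rough sketch only: count_megaraid_aborts is a made-up helper name, and it assumes the log was saved to a plain-text file and that the driver's abort lines contain the string "megaraid abort:" as in the output quoted in this comment.

```shell
# Hypothetical helper: print the number of "megaraid abort:" lines in a
# saved log file ($1). The "megaraid: aborting-..." lines are not counted.
count_megaraid_aborts() {
    grep -c 'megaraid abort:' "$1"
}
```

Typical use would be "dmesg > /tmp/dmesg.txt" followed by "count_megaraid_aborts /tmp/dmesg.txt"; a count of 0 after a long read test is what a fixed kernel should show.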

Comment 4 Gurhan Ozen 2008-03-28 02:41:49 UTC
Verified this with 2.6.24.4-30.m1.el5rt.i386 kernel ...

Comment 5 Michal Schmidt 2008-04-14 11:34:58 UTC
Created attachment 302329
ioapic quirk for i386

We still do not have the patch applied in CVS. Here's the patch against the
current 2.6.24.4-41. jcm noticed that previously the ioapic quirks in
arch/x86/kernel/quirks.c were used only on x86_64. This version of the patch
removes the '#ifdef CONFIG_X86_64' so the automatic quirks are used on both
archs.

(Gurhan, I think setting this bug to VERIFIED after testing the scratch build
was a mistake. In the usual BZ workflow it means the fix was applied and passed
QA.)

Comment 6 Gurhan Ozen 2008-04-15 05:57:41 UTC
(In reply to comment #5)
> Created an attachment (id=302329)
> ioapic quirk for i386
> 
> We still do not have the patch applied in CVS. Here's the patch against the
> current 2.6.24.4-41. jcm noticed that previously the ioapic quirks in
> arch/x86/kernel/quirks.c were used only on x86_64. This version of the patch
> removes the '#ifdef CONFIG_X86_64' so the automatic quirks are used on both
> archs.
> 
> (Gurhan, I think setting this bug to VERIFIED after testing the scratch build
> was a mistake. In the usual BZ workflow it means the fix was applied and passed
> QA.)

  Eek. Sorry for the confusion; I guess I still don't know the small corner cases
of the BZ workflow such as this one. Anyway, feel free to change the status of the
bug and just let me know when it goes into the main -rt tree and I'll test it.

Comment 7 Michal Schmidt 2008-04-15 09:15:00 UTC
It's now in kernel-rt-2.6.24.4-42.el5rt.

Comment 8 Gurhan Ozen 2008-04-15 18:11:34 UTC
(In reply to comment #7)
> It's now in kernel-rt-2.6.24.4-42.el5rt.

OK, in that case I'll just revive this bug. There are some issues with -42; I
can't even get this tested. Here are some backtraces from dmesg:

BUG: scheduling with irqs disabled: auditd/0x00000000/12737
caller is rt_spin_lock_slowlock+0xcf/0x14f
Pid: 12737, comm: auditd Not tainted 2.6.24.4-42.el5rt #1
 [<c0633b59>] schedule+0x8e/0x114
 [<c06343ef>] rt_spin_lock_slowlock+0xcf/0x14f
 [<c0634a65>] __rt_spin_lock+0x4c/0x4e
 [<c0634a6f>] rt_spin_lock+0x8/0xa
 [<c04770bf>] page_address+0x4d/0x80
 [<c047730d>] kmap_high+0xf0/0x463
 [<c0403226>] ? __switch_to+0xa3/0x125
 [<c042a515>] ? finish_task_switch+0x29/0xc4
 [<c042c6cd>] ? __wake_up+0x34/0x4f
 [<c0477191>] ? kunmap_high+0x9f/0xa1
 [<c0424140>] ? kunmap+0x52/0x54
 [<c0424180>] kmap+0x3e/0x49
 [<c0421398>] kmap_atomic_func+0x12/0x15
 [<c0423802>] gup_pte_range+0x4f/0x135
 [<c0423a86>] fast_gup+0x19e/0x264
 [<c0449c96>] get_futex_key+0x70/0xa2
 [<c044b1e5>] do_futex+0x383/0xa4b
 [<c048e0bd>] ? do_readv_writev+0x16d/0x178
 [<c044b994>] sys_futex+0xe7/0xfa
 [<c048e5d8>] ? sys_writev+0x58/0x8f
 [<c0404226>] syscall_call+0x7/0xb
 =======================
WARNING: at arch/x86/kernel/smp_32.c:580 native_smp_call_function_mask()
Pid: 13593, comm: automount Not tainted 2.6.24.4-42.el5rt #1
 [<c0419d61>] ? do_flush_tlb_all+0x0/0x3f
 [<c0419f4e>] native_smp_call_function_mask+0x4c/0x12a
 [<c0419d61>] ? do_flush_tlb_all+0x0/0x3f
 [<c0476f78>] ? __set_page_address+0x8b/0x95
 [<c0419d61>] ? do_flush_tlb_all+0x0/0x3f
 [<c0419d61>] ? do_flush_tlb_all+0x0/0x3f
 [<c041b4e5>] smp_call_function+0x1e/0x22
 [<c0434059>] on_each_cpu+0x24/0x5c
 [<c0419c75>] flush_tlb_all+0x1e/0x20
 [<c04774ba>] kmap_high+0x29d/0x463
 [<c0424180>] kmap+0x3e/0x49
 [<c0421398>] kmap_atomic_func+0x12/0x15
 [<c0423802>] gup_pte_range+0x4f/0x135
 [<c047bb7e>] ? handle_mm_fault+0xb7a/0xbb3
 [<c0423a86>] fast_gup+0x19e/0x264
 [<c0449c96>] get_futex_key+0x70/0xa2
 [<c044a302>] futex_wake+0x38/0xbe
 [<c044aee7>] do_futex+0x85/0xa4b
 [<c0493d0a>] ? pipe_write+0x36f/0x3c4
 [<c046fcf8>] ? __pagevec_free+0x17/0x1e
 [<c048d9fb>] ? do_sync_write+0xc5/0x102
 [<c044b994>] sys_futex+0xe7/0xfa
 [<c0468c99>] ? __delayacct_add_tsk+0x205/0x210
 [<c042cdd0>] mm_release+0x84/0x8b
 [<c0430c01>] exit_mm+0x15/0xfd
 [<c0432128>] do_exit+0x213/0x716
 [<c04070a8>] ? do_syscall_trace+0x14c/0x198
 [<c04326bb>] complete_and_exit+0x0/0x16
 [<c0404226>] syscall_call+0x7/0xb
 =======================



Comment 9 Michal Schmidt 2008-04-15 18:41:44 UTC
I opened a new bug 442595 for the BUG in -42, as it does not look related to the
ioapic quirk.

Comment 10 Clark Williams 2008-04-23 20:51:16 UTC
So does your 32-bit ioapic quirk patch fix this bug?

Clark


Comment 11 Michal Schmidt 2008-04-24 15:26:50 UTC
Gurhan, does -47 boot on the machine? And does the ioapic quirk help?

Comment 12 Gurhan Ozen 2008-04-24 18:22:18 UTC
(In reply to comment #11)
> Gurhan, does -47 boot on the machine? And does the ioapic quirk help?

  I've never tried the -47 kernel on this machine. I'll be on PTO until April 29th
and back in the office on the 30th. I doubt I'll get a chance to try this until
then, especially given that the machine has no remote management console set
up. I will update the BZ once I try it.


Comment 13 Gurhan Ozen 2008-05-05 19:41:57 UTC
Update in response to Clark's email on the rt list: this issue is still open. I
didn't get a chance to try the -47 kernel before the office move. The box is not
hooked up in the new lab yet; I am hoping that it will be today. Once it's online
I'll test this one.

Comment 14 Gurhan Ozen 2008-05-07 15:16:51 UTC
The -47 kernel booted fine, and I ran "cat /dev/sda > /dev/null" 50 times without
any damage.
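
For anyone repeating this check, the repeated read can be scripted roughly as below. This is a sketch under assumptions, not the exact procedure used here: repeat_read is a hypothetical helper name, the device path and run count are passed in by the caller, and reading a raw block device normally requires root.

```shell
# Hypothetical helper: read a file or block device ($1) a given number of
# times ($2), discarding the data; stop at the first failed read.
repeat_read() {
    target=$1
    runs=$2
    i=1
    while [ "$i" -le "$runs" ]; do
        cat "$target" > /dev/null || { echo "read $i of $target failed" >&2; return 1; }
        i=$((i + 1))
    done
    echo "completed $runs reads of $target"
}
```

Something like "repeat_read /dev/sda 50" would then approximate the test described in this comment; checking dmesg afterwards for megaraid abort messages is still needed, since a successful read alone does not prove the driver stayed healthy.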

Comment 15 Clark Williams 2008-05-27 22:04:26 UTC
closing

