Bug 28029

Summary: 3w-xxxx module fails to load in SMP kernel
Product: [Retired] Red Hat Linux Reporter: Roger Moore <rwmoore>
Component: kernel Assignee: Ben LaHaise <bcrl>
Status: CLOSED RAWHIDE QA Contact: Brock Organ <borgan>
Severity: high Docs Contact:
Priority: high    
Version: 7.1 CC: dr, zaitcev
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2001-03-20 23:08:42 UTC Type: ---
Attachments:
Description                                              Flags
ksymoops output from stack trace                         none
Workaround for 3ware crash on boot                       none
Modified 3w-xxxx patch to include declaration of 'ret'   none

Description Roger Moore 2001-02-16 19:24:13 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (X11; U; Linux 2.2.16-22smp i686)


Attempting to load the 3w-xxxx module for a 3ware Escalade 6xxx (8-port) IDE
RAID controller causes the modprobe program to die with a segmentation fault.
Following this the machine behaves in strange, unpredictable ways.
In particular, the keyboard often behaves badly.

'lsmod' shows that the module is still initializing, and I am unable to
'rmmod' it. I also have to use the reset button to reboot, since the machine
refuses to respond to the usual three-fingered salute or the shutdown
command.

Using the 2.4.0 UP kernel everything behaves normally.

Reproducible: Always
Steps to Reproduce:
1. Boot SMP machine containing Escalade controller with 2.4.0 SMP kernel
   (system disk is single IDE disk /dev/hda)
2. Type 'modprobe 3w-xxxx'
3. Machine 'crashes' (behaves unpredictably)
	

Actual Results:  modprobe core dumps and the machine starts to behave in
strange, unpredictable ways. Often the keyboard returns strange characters
and frequently accessing non-cached files on the disk causes the shell to
hang.

Expected Results:  3ware module should be inserted and /dev/sda should
become the RAID disk device.

When running the command on the console I get:

NMI watchdog detected a lockup on CPU1,registers:
....

Unfortunately the machine is so screwed up I can't capture the output nor
is it stored in the log files. I believe it is essentially a stack trace of
the modprobe command though. If you REALLY want to see it I can sit down
and copy it out and then type it back in. It is not always CPU1, though:
about 50% of the time it is CPU0.

Comment 1 Roger Moore 2001-02-16 21:29:39 UTC
Done some further testing. I get the same behaviour with a custom compiled 2.4.1
kernel. The UP module works fine but the SMP module fails. Changing the compile
options to include the 3w-xxxx driver as part of the kernel means that the whole
thing works correctly with the SMP kernel. So it seems that the problem is only
to do with the SMP kernel module.


Comment 2 Glen Foster 2001-02-17 23:31:49 UTC
We (Red Hat) should really try to resolve this before next release.

Comment 3 Pete Zaitcev 2001-02-20 06:01:32 UTC
Is there a way to run the machine with a serial console?
My current theory is that something sleeps while
called from tw_findcards(). A stack trace would prove
or disprove that.

-- Pete


Comment 4 Roger Moore 2001-02-20 16:07:49 UTC
Good idea - it saves me from typing! I've hooked up a serial console and the
output is included below. I've now also recompiled the 0.99.11 kernel RPM which
came with the beta to include the 3ware driver in the kernel itself and I get
the error when it boots. The 2.4.1 kernel I tried before also occasionally gets
the error when it boots as well - I was just lucky the first couple of times!
The output included below is from booting a kernel with it compiled in although
I can also give you a trace from the module as shipped with the beta if that is
better for you.

NMI Watchdog detected LOCKUP on CPU1, registers:
CPU:    1
EIP:    0010:[<c022e7c2>]
EFLAGS: 00000087
eax: eeb5d2bf   ebx: 000e3260   ecx: eeb51d7f   edx: 00000001
esi: 0000dfa4   edi: 00040000   ebp: dffe7ec4   esp: dffe7eb8
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 1, stackpage=dffe7000)
Stack: 57005010 c01b409a 000e3260 3a9240f2 000e7ebd 3a9240f1 000e7304 0000dfa8
       dfe56000 dfe56000 dfe56000 c01b25f1 dfe56000 00040000 0000000f 00000080
       00000082 0000dfa4 00000000 00000000 00000000 0000dfac 0000dfa4 000e7301
Call Trace: [<c01b409a>] [<c01b25f1>] [<c01b2e24>] [<c0105000>] [<c01a9b78>] [<c01566af>] [<c0156b51>]
       [<c0105000>] [<c0105000>] [<c0107108>] [<c0105000>] [<c0107626>] [<c01070e0>]
 
Code: 29 c8 39 d8 72 f8 5b c3 8d b6 00 00 00 00 8b 44 24 04 eb 0a
console shuts up ...     
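
One detail that can be read straight off the register dump above is the EFLAGS value. The sketch below decodes it using the standard x86 flag bit positions; the helper name `decode_eflags` is made up for illustration. The result suggests, but does not prove, that the CPU had interrupts disabled when the NMI watchdog fired.

```python
# Standard x86 EFLAGS bit positions (subset; bit 1 is reserved and always set).
EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF",
}

def decode_eflags(value):
    """Return the names of the flags set in an EFLAGS value (hypothetical helper)."""
    return [name for bit, name in sorted(EFLAGS_BITS.items()) if value & (1 << bit)]

boot_trace = 0x00000087  # EFLAGS from the trace above
flags = decode_eflags(boot_trace)
print(flags)
print("interrupts enabled:", "IF" in flags)
```

IF (bit 9, mask 0x200) is clear in this value. A CPU spinning with interrupts off never services the timer tick, which is the condition the NMI watchdog reports as a lockup.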


Comment 5 Michael K. Johnson 2001-02-20 17:21:04 UTC
Could you run that through ksymoops?  Since you have to boot a different
kernel in order to get through the boot sequence you will have to tell
ksymoops where to get its symbols.

Comment 6 Roger Moore 2001-02-20 17:29:57 UTC
Just got your new comment when submitting this but thought you might like it
anyway - will attempt to look at ksymoops...

Thinking about it a little more you probably really want the stack trace from
the official Fisher kernel so here it is. The command I ran to get it was:

	> modprobe 3w-xxxx

Here is the output:
NMI Watchdog detected LOCKUP on CPU1, registers:
CPU:    1
EIP:    0010:[<c021f222>]
EFLAGS: 00000083
eax: b73951a4   ebx: 000e3260   ecx: b72e72a4   edx: 0000000d
esi: 0000dfa4   edi: 00040000   ebp: dc46fe2c   esp: dc46fe20
ds: 0018   es: 0018   ss: 0018
Process modprobe (pid: 1300, stackpage=dc46f000)
Stack: 57005010 e0a36bfa 000e3260 3a92a8cd 0003260b 3a92a8cc 000335a2 0000dfa8
       dc402000 dc402000 dc402000 e0a35151 dc402000 00040000 0000000f 00000080
       00000082 0000dfa4 00000000 00000000 00000000 0000dfac 0000dfa4 0003359f
Call Trace: [<e0a36bfa>] [<e0a35151>] [<e0a35984>] [<e0a3a9c0>] [<e099ccd8>] [<e0a3a9c0>] [<e0a35000>]
       [<e0a380d7>] [<e0a3a9c0>] [<e0a3a9c0>] [<c0119d45>] [<e0a3af40>] [<e0a2b000>] [<e0a35060>] [<c01091c7>]
 
Code: 29 c8 39 d8 72 f8 5b c3 8d b6 00 00 00 00 8b 44 24 04 eb 0a
console shuts up ...  
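
Both traces in this report end with the same `Code:` bytes. Read by hand, and assuming the bytes start at EIP as in 2.4-era oops output, `29 c8` is `sub %ecx,%eax`, `39 d8` is `cmp %ebx,%eax`, and `72 f8` is a `jb` eight bytes backwards. That looks like a busy-wait loop comparing elapsed time-stamp-counter ticks against a limit in ebx. This is a hand reading and is not confirmed anywhere in the thread; ksymoops output would be the authoritative decode. The sketch below only checks whether the register values are consistent with that reading. `tsc_delta` is a hypothetical helper name.

```python
def tsc_delta(now, start):
    """Elapsed ticks modulo 2**32, as a 32-bit `sub` would compute it."""
    return (now - start) & 0xFFFFFFFF

# (eax, ecx, ebx) copied from the two register dumps in this report.
traces = {
    "built-in (swapper)": (0xEEB5D2BF, 0xEEB51D7F, 0x000E3260),
    "module (modprobe)": (0xB73951A4, 0xB72E72A4, 0x000E3260),
}

# Displacement of the `jb` (bytes 72 f8), interpreted as a signed 8-bit value.
jb_rel8 = 0xF8 - 0x100
# The jb occupies Code offsets 4..5, so the following instruction is at offset 6.
jb_target = 6 + jb_rel8

print("jb target relative to EIP:", jb_target)
for name, (eax, ecx, ebx) in traces.items():
    elapsed = tsc_delta(eax, ecx)
    print(f"{name}: elapsed={elapsed} limit={ebx} still_looping={elapsed < ebx}")
```

The branch target lands two bytes before EIP, which is the length of an `rdtsc` instruction, and in both dumps the elapsed count is still below the limit. Under the stated assumptions, that fits a CPU caught in the middle of a timed delay, not a sleeping task.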


Comment 7 Michael K. Johnson 2001-02-20 18:15:59 UTC
Unfortunately, even with a kernel that we shipped, the oops message
without ksymoops doesn't help because modules get loaded at different
addresses, so without ksymoops output, we're out of luck here.
Using your static kernel is fine for this -- in fact, in some ways
better because there is no chance of addresses changing and it's easier
for ksymoops to decode correctly.
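
To illustrate why a fixed symbol table matters here, this is a minimal sketch of the kind of lookup involved: find the nearest symbol at or below each address and print it as symbol+offset. It is not ksymoops itself, and the symbol table is entirely invented for illustration; real names and addresses would come from the System.map of the exact kernel that produced the oops.

```python
import bisect

def resolve(addr, symtab):
    """Map an address to 'symbol+0xoffset' using a sorted (address, name) list."""
    starts = [start for start, _ in symtab]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return hex(addr)  # below the first known symbol: leave it unresolved
    start, name = symtab[i]
    return f"{name}+{addr - start:#x}"

# HYPOTHETICAL symbol table: these names and start addresses are made up.
symtab = sorted([
    (0xC0100000, "hypothetical_text_start"),
    (0xC01B2000, "hypothetical_func_a"),
    (0xC01B4000, "hypothetical_func_b"),
])

# First address from the Call Trace of the built-in-driver oops.
print(resolve(0xC01B409A, symtab))
```

With a module, the start addresses in the table change on every load, so the same lookup against a stale table would give wrong names. A statically linked driver keeps the table valid across boots.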

Comment 8 Roger Moore 2001-02-20 20:10:45 UTC
Created attachment 10561 [details]
ksymoops output from stack trace

Comment 9 Roger Moore 2001-02-20 20:14:01 UTC
I've attached the file containing the output of ksymoops from my adapted version
of your stock kernel (All that was changed was the i686 config file in the
source RPM to make SCSI disk and 3ware drivers compile into the kernel). You
were right about the module - the only way to make ksymoops work was with the
monolithic kernel.


Comment 10 Alan Cox 2001-02-20 23:37:59 UTC
This is a known 2.4.x problem. I've not had any feedback from 3ware about
sorting this one out. Trying to fix this bug (and the fact that when you fix
this one it blows up for other reasons) without hardware docs is very hard. 

I have no anticipated 2.4.x fix date for this driver.  You might want to ask the
vendor...


Comment 11 Alan Cox 2001-02-20 23:39:15 UTC
> ------- Additional comments from glen 2001-02-17 18:31:49 -------
>
> We (Red Hat) should really try to resolve this before next release.

Unless you can get sane feedback from the vendor, we should probably drop this
driver for the next release. Annoying since I have one 8(.


Comment 12 Pete Zaitcev 2001-02-21 23:29:32 UTC
As a workaround, I have asked Roger to switch off new style
error processing in 3w-xxxx.h. This lets the driver get through
tw_findcards(), and then the system hangs for 2 minutes with
the following message:
3w-xxxx: tw_interrupt(): Received a request id (0) (opcode=0x13) that wasn't posted
Afterwards it appears to work. I guess it will work until
an error is reported to the SCSI layer, then it may blow up.

Comment 13 Alan Cox 2001-02-22 00:06:46 UTC
OK, that's enough to chase down what I think is the real bogon...

Patch attached


Comment 14 Alan Cox 2001-02-22 00:07:50 UTC
Created attachment 10721 [details]
Workaround for 3ware crash on boot

Comment 15 Roger Moore 2001-02-22 20:10:06 UTC
Thanks for the patch. I tried it (with a minor modification to declare 'ret' and
in first) and it produced the same behaviour as Pete's. First the error message
appears:

3w-xxxx: tw_interrupt(): Received a request id (0) (opcode=0x13) that wasn't posted.

then the machine hangs for 2 minutes and finally the module loads and all is
well again.

I'll attach the modified patch below.


Comment 16 Roger Moore 2001-02-22 20:13:44 UTC
Created attachment 10780 [details]
Modified 3w-xxxx patch to include declaration of 'ret'

Comment 17 Michael K. Johnson 2001-02-26 23:07:45 UTC
Since Ben has hardware to test and has resolved this bug on his
hardware, I am re-assigning this bug...

Comment 18 Michael K. Johnson 2001-03-08 15:37:47 UTC
3ware sent us a patch, and it is in the kernel we have in rawhide.
Could you test with 2.4.2-0.1.19 or later from rawhide, from the
ftp://ftp.redhat.com/pub/rawhide/i386/RedHat/RPMS/
directory?

Thanks!

Comment 19 Roger Moore 2001-03-16 00:16:44 UTC
Sorry for the slow response but I've been kept busy by other things. I tested
the 2.4.2-0.1.25 kernel from rawhide today and it works flawlessly with a RAID0
partition. The module inserts correctly and there is no 2 minute hang. I'm
reconfiguring with RAID5 (takes several hours) and I'll let you know how that
test goes tomorrow morning.

Thanks for all the help - it seems like you've got it fixed! Any news as to when
Red Hat 7.1 is coming out?


Comment 20 Roger Moore 2001-03-20 23:08:38 UTC
Ok - I've tested it with the RAID5 configured disks (which caused problems with
the old fix) and it works fine. Nothing I've managed to do has caused anything
to hang! As far as I can tell you've completely fixed the problem!

Thanks again!


Comment 21 Ben LaHaise 2001-03-20 23:13:31 UTC
Thanks for the update!