Bug 28029
Summary: | 3w-xxxx module fails to load in SMP kernel | |
---|---|---|---
Product: | [Retired] Red Hat Linux | Reporter: | Roger Moore <rwmoore>
Component: | kernel | Assignee: | Ben LaHaise <bcrl>
Status: | CLOSED RAWHIDE | QA Contact: | Brock Organ <borgan>
Severity: | high | Priority: | high
Version: | 7.1 | CC: | dr, zaitcev
Hardware: | i386 | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2001-03-20 23:08:42 UTC
Description
Roger Moore
2001-02-16 19:24:13 UTC
Done some further testing. I get the same behaviour with a custom compiled 2.4.1 kernel. The UP module works fine but the SMP module fails. Changing the compile options to include the 3w-xxxx driver as part of the kernel means that the whole thing works correctly with the SMP kernel. So it seems that the problem is only to do with the SMP kernel module.

We (Red Hat) should really try to resolve this before next release.

Is there a way to run the machine with a serial console? My current theory is that something sleeps while called from tw_findcards(). Stack trace would prove or disprove that.
-- Pete

Good idea - it saves me from typing! I've hooked up a serial console and the output is included below. I've now also recompiled the 0.99.11 kernel RPM which came with the beta to include the 3ware driver in the kernel itself and I get the error when it boots. The 2.4.1 kernel I tried before also occasionally gets the error when it boots as well - I was just lucky the first couple of times! The output included below is from booting a kernel with it compiled in although I can also give you a trace from the module as shipped with the beta if that is better for you.

    NMI Watchdog detected LOCKUP on CPU1, registers:
    CPU: 1
    EIP: 0010:[<c022e7c2>]
    EFLAGS: 00000087
    eax: eeb5d2bf ebx: 000e3260 ecx: eeb51d7f edx: 00000001
    esi: 0000dfa4 edi: 00040000 ebp: dffe7ec4 esp: dffe7eb8
    ds: 0018 es: 0018 ss: 0018
    Process swapper (pid: 1, stackpage=dffe7000)
    Stack: 57005010 c01b409a 000e3260 3a9240f2 000e7ebd 3a9240f1 000e7304 0000dfa8
           dfe56000 dfe56000 dfe56000 c01b25f1 dfe56000 00040000 0000000f 00000080
           00000082 0000dfa4 00000000 00000000 00000000 0000dfac 0000dfa4 000e7301
    Call Trace: [<c01b409a>] [<c01b25f1>] [<c01b2e24>] [<c0105000>] [<c01a9b78>] [<c01566af>] [<c0156b51>]
           [<c0105000>] [<c0105000>] [<c0107108>] [<c0105000>] [<c0107626>] [<c01070e0>]
    Code: 29 c8 39 d8 72 f8 5b c3 8d b6 00 00 00 00 8b 44 24 04 eb 0a

console shuts up ...

Could you run that through ksymoops?
Since you have to boot a different kernel in order to get through the boot sequence you will have to tell ksymoops where to get its symbols.

Just got your new comment when submitting this but thought you might like it anyway - will attempt to look at ksymoops...

Thinking about it a little more you probably really want the stack trace from the official Fisher kernel so here it is. The command I ran to get it was:

    > modprobe 3w-xxxx

Here is the output:

    NMI Watchdog detected LOCKUP on CPU1, registers:
    CPU: 1
    EIP: 0010:[<c021f222>]
    EFLAGS: 00000083
    eax: b73951a4 ebx: 000e3260 ecx: b72e72a4 edx: 0000000d
    esi: 0000dfa4 edi: 00040000 ebp: dc46fe2c esp: dc46fe20
    ds: 0018 es: 0018 ss: 0018
    Process modprobe (pid: 1300, stackpage=dc46f000)
    Stack: 57005010 e0a36bfa 000e3260 3a92a8cd 0003260b 3a92a8cc 000335a2 0000dfa8
           dc402000 dc402000 dc402000 e0a35151 dc402000 00040000 0000000f 00000080
           00000082 0000dfa4 00000000 00000000 00000000 0000dfac 0000dfa4 0003359f
    Call Trace: [<e0a36bfa>] [<e0a35151>] [<e0a35984>] [<e0a3a9c0>] [<e099ccd8>] [<e0a3a9c0>] [<e0a35000>] [<e0a380d7>]
           [<e0a3a9c0>] [<e0a3a9c0>] [<c0119d45>] [<e0a3af40>] [<e0a2b000>] [<e0a35060>] [<c01091c7>]
    Code: 29 c8 39 d8 72 f8 5b c3 8d b6 00 00 00 00 8b 44 24 04 eb 0a

console shuts up ...

Unfortunately, even with a kernel that we shipped, the oops message without ksymoops doesn't help because modules get loaded at different addresses, so without ksymoops output, we're out of luck here. Using your static kernel is fine for this -- in fact, in some ways better because there is no chance of addresses changing and it's easier for ksymoops to decode correctly.

Created attachment 10561: ksymoops output from stack trace
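The point above about module load addresses can be illustrated with a small sketch. This is not ksymoops and does not reproduce its behaviour; it only shows the general idea of mapping each `[<address>]` in a call trace to the nearest preceding symbol in a System.map-style table. The symbol names and addresses in `HYPOTHETICAL_SYSTEM_MAP` are invented for illustration, and the helper names (`load_symbols`, `extract_addresses`, `resolve`) are assumptions, not part of any real tool. The trace addresses themselves are taken from the two oops reports in this thread.

```python
import bisect
import re

# Hypothetical System.map excerpt in "address type name" form. The names and
# addresses are invented for illustration; they are NOT the real symbols of
# the kernel discussed in this report.
HYPOTHETICAL_SYSTEM_MAP = """\
c0105000 T example_init
c01b2000 t example_probe
c01b4000 t example_poll
c01b5000 T example_text_end
"""


def load_symbols(map_text):
    """Parse 'address type name' lines into (address, name) pairs sorted by address."""
    symbols = []
    for line in map_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    return sorted(symbols)


def extract_addresses(oops_text):
    """Return every [<hex>] address found in the text, in order of appearance."""
    return [int(hexaddr, 16) for hexaddr in re.findall(r"\[<([0-9a-fA-F]+)>\]", oops_text)]


def resolve(address, symbols):
    """Map an address to 'name+0xoffset' using the nearest symbol at or below it.

    The last table entry is treated as an end marker, so anything at or beyond
    it (or below the first entry) is reported as unresolved (None).
    """
    starts = [start for start, _ in symbols]
    if not starts or address < starts[0] or address >= starts[-1]:
        return None
    index = bisect.bisect_right(starts, address) - 1
    start, name = symbols[index]
    return f"{name}+0x{address - start:x}"


if __name__ == "__main__":
    symbols = load_symbols(HYPOTHETICAL_SYSTEM_MAP)
    # Two addresses from the built-in kernel trace, one from the module trace.
    trace = "Call Trace: [<c01b409a>] [<c01b25f1>] [<e0a36bfa>]"
    for address in extract_addresses(trace):
        print(f"{address:08x} -> {resolve(address, symbols) or 'unresolved'}")
```

With a table that covers only the static kernel image, the addresses from the built-in trace fall inside it, while an address from the module trace does not. That is consistent with the comment that a trace taken with the driver compiled in is easier to decode.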
I've attached the file containing the output of ksymoops from my adapted version of your stock kernel (all that was changed was the i686 config file in the source RPM, to make the SCSI disk and 3ware drivers compile into the kernel). You were right about the module - the only way to make ksymoops work was with the monolithic kernel.

This is a known 2.4.x problem. I've not had any feedback from 3ware about sorting this one out. Trying to fix this bug (and the fact that when you fix this one it blows up for other reasons) without hardware docs is very hard. I have no anticipated 2.4.x fix date for this driver. You might want to ask the vendor...

> ------- Additional comments from glen 2001-02-17 18:31:49 -------
>
> We (Red Hat) should really try to resolve this before next release.

Unless you can get sane feedback from the vendor, we should probably drop this driver for the next release. Annoying since I have one 8(.

As a workaround, I have asked Roger to switch off new style error processing in 3w-xxxx.h. This lets the driver get through tw_findcards, and then the system hangs for 2 minutes with the following message:

    3w-xxxx: tw_interrupt(): Received a request id (0) (opcode=0x13) that wasn't posted

Afterwards it appears to work. I guess it will work until an error is reported into the SCSI layer, then it may blow up.

Ok, that's enough to chase down what I think is the real bogon... Patch attached

Created attachment 10721: Workaround for 3ware crash on boot
Thanks for the patch. I tried it (with a minor modification to declare 'ret' and in first) and it produced the same behaviour as Pete's. First the error message appears:

    3w-xxxx: tw_interrupt(): Received a request id (0) (opcode=0x13) that wasn't posted.

then the machine hangs for 2 minutes and finally the module loads and all is well again. I'll attach the modified patch below.

Created attachment 10780: Modified 3w-xxxx patch to include declaration of 'ret'
Since Ben has hardware to test and has resolved this bug on his hardware, I am re-assigning this bug...

3ware sent us a patch, and it is in the kernel we have in rawhide. Could you test with 2.4.2-0.1.19 or later from rawhide, from the ftp://ftp.redhat.com/pub/rawhide/i386/RedHat/RPMS/ directory? Thanks!

Sorry for the slow response but I've been kept busy by other things. I tested the 2.4.2-0.1.25 kernel from rawhide today and it works flawlessly with a RAID0 partition. The module inserts correctly and there is no 2 minute hang. I'm reconfiguring with RAID5 (takes several hours) and I'll let you know how that test goes tomorrow morning. Thanks for all the help - it seems like you've got it fixed! Any news as to when RedHat 7.1 is coming out?

Ok - I've tested it with the RAID5 configured disks (which caused problems with the old fix) and it works fine. Nothing I've managed to do has caused anything to hang! As far as I can tell you've completely fixed the problem! Thanks again!

Thanks for the update!