Bug 45207

Summary:	Megaraid module will not load after unloading on seawolf IA64 gold
Product:	[Retired] Red Hat Linux	Reporter:	Matt Domsch <matt_domsch>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED RAWHIDE	QA Contact:	Brock Organ <borgan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.3	CC:	afom_m, dale_kaisner, john_hull, matt_domsch, michael_e_brown, peterj
Target Milestone:	---
Target Release:	---
Hardware:	ia64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2001-11-15 21:44:59 UTC	Type:	---

Description Matt Domsch 2001-06-20 20:34:13 UTC

Description of Problem:
On a Dell PowerEdge 7150 with A01 BIOS and 16GB RAM running seawolf ia64
gold: after an rmmod of the megaraid module, an attempt to insmod megaraid
hangs during load at:
"SCSI2: Found a megaraid controller at 0xe800000, IRQ: 57
SCSI2: Enabling 64 bit support"

The machine becomes unresponsive at this point and must be rebooted.



How Reproducible:
Always

Steps to Reproduce:
1.  rmmod megaraid
2.  modprobe megaraid

Actual Results:
Machine hangs

Expected Results:
Driver loads

Additional Information:
We haven't seen this issue on IA-32.  AMI has been informed of the issue.

Comment 1 Matt Domsch 2001-07-11 16:49:50 UTC

AMI suspects a hardware problem with their card in this case, per the megaraid 
1.17 driver release notes.  They're continuing their investigation.

Comment 2 Glen Foster 2001-07-13 22:07:34 UTC

This defect is considered MUST-FIX for Fairfax gold-release.

Comment 3 Matt Domsch 2001-07-23 21:54:19 UTC

I'm wondering if it's at all related to the sometimes-seen qla12160 controller 
lock-up on IA-64 systems.  If you repeatedly reboot the system and start Linux, 
the qla12160 driver sometimes fails to properly initialize the qla12160, and 
the system hangs.  I've asked AMI to include this thought in their 
investigation.  Maybe we can kill two birds with one stone.

Comment 4 Clay Cooper 2001-08-03 20:58:09 UTC

Reproduced on pe7150 running 2.4.7-0.3 kernel, machine bios X15, using a Perc3DC
(taos) raid controller.

Comment 5 Tesfamariam Michael 2001-08-20 23:37:18 UTC

Reproduced on Fairfax RC1 using PERC3/QC.

Comment 6 Bill Nottingham 2001-08-23 02:46:47 UTC

*** Bug 52312 has been marked as a duplicate of this bug. ***

Comment 7 Tesfamariam Michael 2001-08-23 19:51:12 UTC

This issue still exists with Fairfax RC1 on systems with 2GB and 16GB of RAM.

Comment 8 Bill Nottingham 2001-08-23 21:19:47 UTC

They have reported that this succeeds on 2.4.7 + ia64 patch + latest megaraid,
but fails on our kernel.

Comment 9 Bill Nottingham 2001-09-20 20:01:28 UTC

This should be fixed in 2.4.7-6.1 or later. Please confirm.

Comment 10 Tesfamariam Michael 2001-10-04 23:51:11 UTC

This issue still exists with the recent RC1 (kernel 2.4.9-0.12 and 2.4.9-0.18).

Comment 11 Matt Domsch 2001-11-05 16:13:11 UTC

Per Clay Cooper:
Still broken with 2.4.9-9.1.  I loaded....unloaded......then loaded and then 
the system hung.

Comment 12 Matt Domsch 2001-11-07 23:24:45 UTC

I reproduced this with 2.4.9-13dell2smp, which is -13 + megaraid 1.18 + kdb 
patch, and made some progress.

Here's a backtrace.  It appears to be stuck waiting on mbox->status, I think.

 0xa000000000327c20 [megaraid]megaIssueCmd+0x580)
        args (0xe000000fee7600c0, 0xe000000feead0050, 0x0, 0xe000000fee760650,
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570)
        args (0xe000000fee7600c0, 0xff, 0xe000000fee777c98, 0xe000000fee777c99,
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570)
        args (0x2000000, 0xa000000000335054, 0x0, 0xa000000000335f10, 0xa000000
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570)
        args (0x39, 0x0, 0x3, 0x0, 0x0)
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570
        args (0x713, 0xa000000000334a28, 0x101e, 0x1960, 0xa000000)
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570
        args (0x1ddd5a, 0xa000000000334ae0, 0x0, 0xe0000001022515a0, 0x0)
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570
        args (0xc9e, 0x1, 0xa000000000334a28, 0xa000000000330f90, 0x206)
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570)
        args (0xe000000fee777e68, 0xa000000000324034,0xa000000000324010, 0x8,
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570)
        args (0x20000000002bb0b0, 0x20000000002bb0b8, 0x0, 0x2000000000136090,
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570)
        args (0x0, 0x400000000000a7d0, 0xc000000000000c1e, 0x6000000000023510,
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060
0xa000000000327c10 [megaraid]megaIssueCmd+0x570)
        args (0x4000000000013610, 0xc000000000000a98, 0x6000000000023510, 0x600
        megaraid .text 0xa0000000003240c0 0xa0000000003276a0 0xa000000000328060

So, we're looping for a while, then (maybe) getting past it?  

[0]kdb> id 0xa000000000327c10
0xa000000000327c10 megaIssueCmd+0x570 [MFI]       mov r1=r37
0xa000000000327c16 megaIssueCmd+0x576             nop.f 0x0
0xa000000000327c1c megaIssueCmd+0x57c             mov r15=255

0xa000000000327c20 megaIssueCmd+0x580 [MMI]       ld1.acq r14=[r36];;
0xa000000000327c26 megaIssueCmd+0x586             cmp4.eq p6,p7=r15,r14
0xa000000000327c2c megaIssueCmd+0x58c             nop.i 0x0

0xa000000000327c30 megaIssueCmd+0x590 [MFB]    nop.m 0x0
0xa000000000327c36 megaIssueCmd+0x596             nop.f 0x0
0xa000000000327c3c megaIssueCmd+0x59c       (p06) br.cond.dpnt.few

This corresponds to this code in megaraid.s:
	.loc 1 2355 0
	br.call.sptk.many b0 = WRINDOOR#
	;;
	mov r1 = r37
	.loc 1 2357 0
	addl r15 = 255, r0
.L3046:
	ld1.acq r14 = [r36]
	;;
	cmp4.eq p6, p7 = r15, r14
	(p6) br.cond.dpnt .L3046
	.loc 1 2358 0
	mov r15 = r39
	addl r16 = 255, r0
	;;

which is line 2357 in megaraid.c v1.18:
			mbox->numstatus = 0xFF;
			mbox->status = 0xFF;
			WRINDOOR (megaCfg, phys_mbox | 0x1);

			while (mbox->numstatus == 0xFF) ;
			while (mbox->status == 0xFF) ;
			while (mbox->mraid_poll != 0x77) ;

So, for some reason, either numstatus or status isn't getting updated (the card 
doesn't respond).  Likely it's not getting reset properly when the module is 
unloaded on IA-64.
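
To narrow down which of the three mailbox bytes the card fails to update, the
same waits can be given a spin limit.  The following is only a userspace
sketch, not driver code: struct mbox_model, poll_mailbox, the enum and
max_spins are hypothetical names, and the real mailbox in megaraid.c has more
fields than the three polled above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the three mailbox bytes polled in the excerpt. */
struct mbox_model {
    volatile uint8_t numstatus;
    volatile uint8_t status;
    volatile uint8_t mraid_poll;
};

/* Which wait (if any) ran out of spins. */
enum stalled_field {
    STALL_NONE = 0,
    STALL_NUMSTATUS,
    STALL_STATUS,
    STALL_POLL
};

/*
 * Same three waits, in the same order as the driver excerpt, but each one
 * gives up after max_spins reads so that a silent card is reported instead
 * of hanging the caller.
 */
enum stalled_field poll_mailbox(struct mbox_model *mbox, unsigned long max_spins)
{
    unsigned long n;

    assert(mbox != NULL);

    for (n = 0; mbox->numstatus == 0xFF; n++)
        if (n >= max_spins)
            return STALL_NUMSTATUS;

    for (n = 0; mbox->status == 0xFF; n++)
        if (n >= max_spins)
            return STALL_STATUS;

    for (n = 0; mbox->mraid_poll != 0x77; n++)
        if (n >= max_spins)
            return STALL_POLL;

    return STALL_NONE;
}
```

A mailbox that still holds the 0xFF values written before WRINDOOR makes
poll_mailbox return STALL_NUMSTATUS, which is the "card doesn't respond" case
described above.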

Comment 13 Michael K. Johnson 2001-11-08 16:07:56 UTC

>while (mbox->numstatus == 0xFF) ;
>while (mbox->status == 0xFF) ;
>while (mbox->mraid_poll != 0x77) ;


I'm surprised that Intel hasn't had a conniption about these lines
already; Intel says that busy waits like that will damage their
fragile PIV chips...

Comment 14 Michael K. Johnson 2001-11-13 17:28:41 UTC

We have a potential fix: mark those structure members volatile and put
cpu_relax() in the wait loops to make Intel happy.  It will be in
2.4.9-13.4; please test once you get it.
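
The actual patch is not quoted in this report, so the following is only a
userspace sketch of the idea as described: volatile-qualified members so each
loop iteration re-reads memory, plus a relax call inside each wait.
relax_model() is a hypothetical stand-in for the kernel's cpu_relax() (here
just a GCC/Clang compiler barrier), and struct mbox_fix_model, device_step_fn,
wait_mailbox_model and device_one_field_per_step are made-up names used to
emulate the card.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_relax(): a compiler barrier only. */
static inline void relax_model(void)
{
    __asm__ __volatile__("" ::: "memory");
}

/* The polled members are volatile so the compiler cannot cache them. */
struct mbox_fix_model {
    volatile uint8_t numstatus;
    volatile uint8_t status;
    volatile uint8_t mraid_poll;
};

/* Hypothetical hook that emulates the card writing to the mailbox. */
typedef void (*device_step_fn)(struct mbox_fix_model *);

/* The three waits from the excerpt, each relaxing between reads.
 * Returns the total number of spins, for illustration only. */
unsigned long wait_mailbox_model(struct mbox_fix_model *mbox, device_step_fn step)
{
    unsigned long spins = 0;

    assert(mbox != NULL && step != NULL);

    while (mbox->numstatus == 0xFF) {
        step(mbox);
        relax_model();
        spins++;
    }
    while (mbox->status == 0xFF) {
        step(mbox);
        relax_model();
        spins++;
    }
    while (mbox->mraid_poll != 0x77) {
        step(mbox);
        relax_model();
        spins++;
    }
    return spins;
}

/* Example device: fills in one mailbox field per call. */
void device_one_field_per_step(struct mbox_fix_model *mbox)
{
    if (mbox->numstatus == 0xFF)
        mbox->numstatus = 1;
    else if (mbox->status == 0xFF)
        mbox->status = 0;
    else
        mbox->mraid_poll = 0x77;
}
```

In the driver itself the card, not a callback, updates the mailbox; the
callback exists only so the sketch terminates when run in userspace.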

Comment 15 Michael K. Johnson 2001-11-13 21:33:48 UTC

2.4.9-13.4 is now available at
ftp://ftp.beta.redhat.com/pub/testing/kernel/

Please give it a whirl!

Comment 16 Clay Cooper 2001-11-15 21:44:53 UTC

works with 2.4.9-13.4smp