Bug 248042

Summary:	unloading fw_ohci causes kernel panic (firewire_ohci in 2.6.22)
Product:	[Fedora] Fedora	Reporter:	Colin <bugzilla>
Component:	kernel	Assignee:	Jay Fenlason <fenlason>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7	CC:	chris.brown, jfeeney, krh, stefan-r-rhbz, zaitcev
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-01-04 00:29:02 UTC	Type:	---

Description Colin 2007-07-12 19:49:36 UTC

Description of problem:

Unloading fw_ohci while fw_sbp2 is loaded causes the F7 kernel to panic and
output the stack trace given below.


Version-Release number of selected component (if applicable):

kernel-2.6.21-1.3228.fc7
(also with kernel-2.6.21-1.3194.fc7, the initial FC7 kernel)

How reproducible:

Always.

Steps to Reproduce:
1. modprobe fw_core; modprobe fw_ohci; modprobe fw_sbp2
2. modprobe -r fw_ohci
3. kernel panic
  
Actual results:

kernel panic (given below in additional info)

Expected results:

The workaround is to unload fw_sbp2 first (or
presumably any other modules that use fw_ohci)
before unloading fw_ohci. However if the module
is in use the expected results should be that it
says the module is in use and doesn't unload it.


Additional info:

fw_sbp2: management write failed, rcode 0xffffffed
fw_ohci: Removed fw-ohci device.
fw_sbp2: removed sbp2 unit fw1.0
general protection fault: 0000 [1] SMP
last sysfs file: /block/sdb/size
CPU 1
Modules linked in: hfsplus ipv6 ipt_owner ipt_LOG xt_limit ipt_REJECT xt_tcpudp 
nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables
x_tables cpufreq_ondemand fw_sbp2 serio_raw forcedeth i2c_core k8_edac
ata_generic sr_mod cdrom sg dm_zero dm_mirror dm_mod usb_storage pata_amd
sata_nv libata sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.21-1.3228.fc7 #1
RIP: 0010:[<ffffffff8028a529>]  [<ffffffff8028a529>] run_timer_softirq+0x159/0x1d1
RSP: 0018:ffff81007ff07f00 EFLAGS: 00010282
RAX: ffff810037cbffd8 RBX: 3020302030203836 RCX: 3120373231343320
RDX: ffff81007ff07f00 RSI: 0000000030203020 RDI: 3020302030203836
RBP: 0000000000000100 R08: ffff81007e8b0070 R09:
R10: ffff810081d0aa80 R11: ffffffff8022dfb7 R12: ffff810037c98000
R13: 3120373231343320 R14: 0000000000000000 R15: 0000000000000000    
FS:  00002aaaaaab6a40(0000) GS:ffff810037c85940(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaab8860a0 CR3: 000000007bf5b000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810037cbe000, task ffff810037c9c100)
Stack:  ffff81007ff07f00 ffff81007ff07f00 000000000000000a 0000000000000001
 ffffffff805a3110 000000000000000a 0000000000000001 ffffffff80210dba
 ffff81007ff07f38 0000000000000046 ffff81007ff07f78 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff80210dba>] __do_softirq+0x55/0xc3
 [<ffffffff802582ac>] call_softirq+0x1c/0x28
 [<ffffffff8026534e>] do_softirq+0x2c/0x85
 [<ffffffff8026ee3b>] smp_apic_timer_interrupt+0x48/0x5b
 [<ffffffff80263d5d>] default_idle+0x0/0x3d
 [<ffffffff80257d56>] apic_timer_interrupt+0x66/0x70
 <EOI>  [<ffffffff8022dfb7>] unix_poll+0x0/0x96
 [<ffffffff80263d86>] default_idle+0x29/0x3d
 [<ffffffff8024423b>] cpu_idle+0x8c/0xaf


Code: 41 ff d5 65 48 8b 04 25 10 00 00 00 3b a8 44 e0 ff ff 74 1d
RIP  [<ffffffff8028a529>] run_timer_softirq+0x159/0x1d1
 RSP <ffff81007ff07f00>
Kernel panic - not syncing: Aiee, killing interrupt handler!


Linux nike 2.6.21-1.3228.fc7 #1 SMP Tue Jun 12 14:56:37 EDT 2007 x86_64 x86_64
x86_64 GNU/Linux

Comment 1 Colin 2007-07-12 19:52:22 UTC

I forgot to note that this might be related to bug #246256 which talks about
fw_ohci causing a kernel panic on resuming from a suspended state. No stack
traces there to indicate the problem though, but they seem to talk about a
kernel panic.

Comment 2 Christopher Brown 2007-09-18 15:12:18 UTC

Hello Colin,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel? The bug you mention was
resolved in a recent kernel update so this could also be the case for your issue.

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Cheers
Chris

Comment 3 Colin 2007-09-18 19:53:03 UTC

Hi Chris,

This continues to be a problem with the latest Fedora 7 kernel RPM
(2.6.22.5-76.fc7.x86_64.rpm). In 2.6.22 the modules in question are now called

 firewire_sbp2 firewire_ohci firewire_core

Unloading them in this order causes no problems. Unloading firewire_core while
the other two remain causes the expected "module is in use" error and no crash.
But if firewire_ohci is unloaded while firewire_sbp2 is loaded then it crashes.
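
The safe ordering described above can be written down as a small dry-run helper. This is only a sketch in Python: the function names are hypothetical, the order is hard-coded from the three module names listed in this comment, the sample input is made up, and the code only builds the `modprobe -r` command lines rather than running them.

```python
# Order taken from the comment above: sbp2 first, then ohci, then core.
SAFE_UNLOAD_ORDER = ["firewire_sbp2", "firewire_ohci", "firewire_core"]


def loaded_modules(proc_modules_text):
    """Return module names from /proc/modules-style text (name is field 1)."""
    return {line.split()[0]
            for line in proc_modules_text.splitlines()
            if line.strip()}


def unload_commands(loaded, order=SAFE_UNLOAD_ORDER):
    """Build 'modprobe -r' command lines for loaded modules, in safe order."""
    return [["modprobe", "-r", name] for name in order if name in loaded]


# Made-up sample input, for illustration only (sizes and addresses are fake).
sample = (
    "firewire_ohci 20000 0 - Live 0x0\n"
    "firewire_sbp2 16000 0 - Live 0x0\n"
    "firewire_core 40000 2 firewire_sbp2,firewire_ohci, Live 0x0\n"
)
for cmd in unload_commands(loaded_modules(sample)):
    print(" ".join(cmd))
```

Hard-coding the order is deliberate: the module use counts alone would not reveal that firewire_sbp2 has to go before firewire_ohci, which is exactly the situation reported here.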

One observation which may/may not help is that it does not crash if you unload
firewire_ohci straight after boot. In this case it simply unloaded. I therefore
tried mounting and accessing the disk and then unmounting it and trying to
unload the module. This caused a crash. My guess is that accessing the disk
causes something to be loaded which isn't otherwise?

After some head-banging with Hyperterminal, I realised I could use PuTTY to get
my kernel panic without it being mangled - for anyone reading this I'd
thoroughly recommend this instead of the mangled nonsense HT outputs - and so
here it is.

The first part of this suggests that firewire_ohci is trying to remove
firewire_sbp2 as part of its unloading and this is what is messing up. How this
works on the command line but not from a LKM calling the same routine I don't
know, but this seems to be the problem area:

nike# rmmod firewire_ohci
firewire_sbp2: management write failed, rcode 0xffffffed
sd 8:0:0:0: [sdg] Synchronizing SCSI cache
firewire_sbp2: removed sbp2 unit fw1.0
firewire_ohci: Removed fw-ohci device.
nike# general protection fault: 0000 [1] SMP
last sysfs file: /block/sdg/sdg1/dev
CPU 0
Modules linked in: firewire_sbp2 firewire_core ipv6 nf_conntrack_ftp ipt_owner
ipt_LOG xt_limit ipt_REJECT xt_tcpudp nf_conntrack_ipv4 xt_state nf_conntrack
nfnetlink iptable_filter ip_tables x_tables cpufreq_ondemand dock crc_itu_t
rtc_cmos k8temp hwmon ac97_bus forcedeth snd_timer snd soundcore snd_page_alloc
i2c_nforce2 i2c_core sr_mod cdrom joydev sg dm_snapshot dm_zero dm_mirror dm_mod
usb_storage pata_amd sata_nv libata sd_mod scsi_mod ext3 jbd mbcache ehci_hcd
ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.22.5-76.fc7 #1
RIP: 0010:[<ffffffff8103b7e0>]  [<ffffffff8103b7e0>] run_timer_softirq+0x159/0x1d1
RSP: 0018:ffffffff81476f00  EFLAGS: 00010282
RAX: ffffffff81419fd8 RBX: 313220350000302e RCX: 3177666632785c31
RDX: ffffffff81476f00 RSI: 0000000030203020 RDI: 313220350000302e
RBP: 0000000000000100 R08: ffff810074afb070 R09: 000000000000000a
R10: ffffffff81365640 R11: ffff810077fce910 R12: ffffffff814b9c00
R13: 3177666632785c31 R14: ffffffff81447300 R15: 0000000000000000
FS:  00002aaaab67b710(0000) GS:ffffffff813ae000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaab8860a0 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff81418000, task ffffffff81365640)
Stack:  ffffffff81476f00 ffffffff81476f00 ffff810002355880 0000000000000001
 ffffffff813b4110 000000000000000a 0000000000000000 ffffffff81038f33
 ffffffff81476f38 0000000000000046 ffffffff81476f78 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff81038f33>] __do_softirq+0x55/0xc3
 [<ffffffff8100acec>] call_softirq+0x1c/0x28
 [<ffffffff8100be11>] do_softirq+0x2c/0x85
 [<ffffffff81019c0f>] smp_apic_timer_interrupt+0x48/0x5d
 [<ffffffff81008d8c>] default_idle+0x0/0x3d
 [<ffffffff8100a796>] apic_timer_interrupt+0x66/0x70
 <EOI>  [<ffffffff81008db5>] default_idle+0x29/0x3d
 [<ffffffff81008e55>] cpu_idle+0x8c/0xaf
 [<ffffffff81423809>] start_kernel+0x2ca/0x2d6
 [<ffffffff81423140>] _sinittext+0x140/0x144


Code: 41 ff d5 65 48 8b 04 25 10 00 00 00 3b a8 44 e0 ff ff 74 1d
RIP  [<ffffffff8103b7e0>] run_timer_softirq+0x159/0x1d1
 RSP <ffffffff81476f00>
Kernel panic - not syncing: Aiee, killing interrupt handler!


-----------------------------------------------------------

Regards,
Colin.

Comment 4 Christopher Brown 2007-09-18 21:35:32 UTC

Okay, thanks for the additional information Colin, that's some good debugging.
I'm re-assigning this to the firewire subsystem maintainer and they may be able
to shed further light on the problem.

Cheers
Chris

Comment 5 Stefan Richter 2007-09-18 21:50:52 UTC

I forgot: Upstream bug is fixed in kernel 2.6.23-rc4.

Comment 6 Stefan Richter 2007-09-18 22:19:57 UTC

I wrote in comment #5:
> Upstream bug is fixed in kernel 2.6.23-rc4.

Well, the part of the upstream bug which was described here has been fixed.

Colin wrote:
> if the module is in use the expected results should be that it
> says the module is in use and doesn't unload it.

fw-sbp2 does not use fw-ohci.  It only indirectly requires its presence and
functioning in order to stay connected with SBP-2 devices.

The old sbp2 driver contains a hack which increases the use count of a card
driver module as soon as it logs in to a device behind the respective card (and
decreases the use count if it logs out or is otherwise disconnected).  I added
that hack because the old IEEE1394 driver stack has two drivers, video1394 and
dv1394, which use symbols of ohci1394 and hence increase and decrease ohci1394's
use count when loaded and unloaded.  So, if somebody had an SBP-2 disk mounted
and unloaded dv1394, ohci1394 was unloaded without that hack and the connection
to the disk was lost.
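
The effect of that use-count hack can be illustrated with a toy model. This is a Python sketch, not kernel code: the `Module` class and the `unload_dv1394` function are hypothetical, and the final "unload if unused" step is an assumption standing in for whatever removes the now-unused card driver.

```python
class Module:
    """Toy stand-in for a kernel module with a use count."""

    def __init__(self, name):
        self.name = name
        self.use_count = 0
        self.loaded = True

    def get(self):
        self.use_count += 1

    def put(self):
        if self.use_count == 0:
            raise ValueError("use count underflow")
        self.use_count -= 1

    def unload_if_unused(self):
        """Unload only when nothing holds a reference; return success."""
        if self.use_count > 0:
            return False
        self.loaded = False
        return True


def unload_dv1394(sbp2_pins_card_driver):
    """Return True if ohci1394 is still loaded after dv1394 goes away."""
    ohci1394 = Module("ohci1394")
    ohci1394.get()                 # dv1394 uses ohci1394 symbols
    if sbp2_pins_card_driver:
        ohci1394.get()             # the hack: sbp2 logged in to a device
    ohci1394.put()                 # dv1394 is unloaded
    ohci1394.unload_if_unused()    # assumed cleanup of unused dependencies
    return ohci1394.loaded


print("with hack, ohci1394 stays loaded:", unload_dv1394(True))
print("without hack, ohci1394 stays loaded:", unload_dv1394(False))
```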

We don't need this hack for the new driver stack because there is no driver, and
never will be, which uses symbols of fw-ohci.  Of course people can shoot
themselves in the foot by unloading fw-ohci while they still have a filesystem
on an SBP-2 disk mounted.  (This would panic before 2.6.23-rc4; since 2.6.23-rc4
it will "only" cause connection loss and thus possible filesystem corruption.)  But
while there may be reasons to unload video1394 or dv1394 while sbp2 is active,
there are hardly reasons to unload fw-ohci while fw-sbp2 is active.

It would be best, though, if drivers/scsi/scsi.c::scsi_device_get() and
scsi_device_put() were extended to call into hooks provided by SCSI lowlevel
drivers.  Then fw-sbp2 could get and put the card driver module when the
scsi_device of an SBP-2 device behind the card is being _get() and _put(), e.g.
if a filesystem on it is mounted and unmounted.
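
A rough model of that proposal, again as a Python sketch rather than kernel code. The hooks shown here are hypothetical (the point of the paragraph above is that no such hooks exist), and the class and function names are invented for illustration.

```python
class CardDriver:
    """Toy card driver module that may only be unloaded when unpinned."""

    def __init__(self):
        self.use_count = 0

    def can_unload(self):
        return self.use_count == 0


class ScsiDevice:
    """Toy scsi_device whose get/put call optional low-level driver hooks."""

    def __init__(self, on_get=None, on_put=None):
        self.on_get = on_get
        self.on_put = on_put

    def get(self):          # e.g. a filesystem on the device is mounted
        if self.on_get:
            self.on_get()

    def put(self):          # e.g. the filesystem is unmounted
        if self.on_put:
            self.on_put()


def make_sbp2_device(card_driver):
    """Create a device whose hooks pin and unpin the given card driver."""
    def pin():
        card_driver.use_count += 1

    def unpin():
        card_driver.use_count -= 1

    return ScsiDevice(on_get=pin, on_put=unpin)


card = CardDriver()
disk = make_sbp2_device(card)
disk.get()
print("unload allowed while mounted:", card.can_unload())
disk.put()
print("unload allowed after unmount:", card.can_unload())
```

The card driver is pinned only while something actually holds the SCSI device, so it could still be unloaded freely when the disk is merely attached.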

Comment 7 Stefan Richter 2007-09-18 22:30:43 UTC

Colin wrote in comment #3:
> The first part of this suggests that firewire_ohci is trying to remove
> firewire_sbp2 as part of its unloading
[...]
> nike# rmmod firewire_ohci
> firewire_sbp2: management write failed, rcode 0xffffffed
> sd 8:0:0:0: [sdg] Synchronizing SCSI cache
> firewire_sbp2: removed sbp2 unit fw1.0
> firewire_ohci: Removed fw-ohci device.

No, firewire-ohci knows nothing of firewire-sbp2.  When firewire-ohci is
unloaded, it first tells firewire-core to shut down all cards which
firewire-ohci services, and firewire-core therefore shuts down all devices on
that card.  It does a quick shutdown though which doesn't give scsi-highlevel
and firewire-sbp2 any chance anymore to perform shutdown procedures (synchronize
cache, log out).

The panic after that happened because firewire-core forgot to remove a
card-related timer before letting firewire-ohci proceed to remove the card's
data structure.
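
The ordering problem can be shown with a small userspace model. This is a Python sketch under stated assumptions, not the firewire-core code: `Card`, `TimerList` and `remove_card` are hypothetical names, and the `RuntimeError` merely stands in for the general protection fault seen in run_timer_softirq. In kernel code the corresponding step would be something like del_timer_sync() before the structure is freed; whether the upstream fix uses exactly that call is not confirmed here.

```python
class Card:
    """Toy card structure with a timer callback."""

    def __init__(self, name):
        self.name = name
        self.freed = False

    def on_timer(self):
        if self.freed:
            # Stands in for the crash when a timer touches freed memory.
            raise RuntimeError("timer fired on freed card " + self.name)
        return self.name + ": timer ran"


class TimerList:
    """Toy pending-timer list, run once per simulated tick."""

    def __init__(self):
        self.pending = []

    def add(self, callback):
        self.pending.append(callback)

    def delete(self, callback):
        self.pending = [cb for cb in self.pending if cb != callback]

    def run(self):
        callbacks, self.pending = self.pending, []
        return [cb() for cb in callbacks]


def remove_card(card, timers, delete_timer_first):
    """Tear down a card; the timer must be deleted before the card is freed."""
    if delete_timer_first:
        timers.delete(card.on_timer)
    card.freed = True


for fixed in (False, True):
    timers = TimerList()
    card = Card("fw1")
    timers.add(card.on_timer)
    remove_card(card, timers, delete_timer_first=fixed)
    try:
        timers.run()
        print("timer deleted first:", fixed, "-> no crash")
    except RuntimeError as exc:
        print("timer deleted first:", fixed, "->", exc)
```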

Comment 8 Stefan Richter 2008-01-03 23:14:12 UTC

I suppose the fix (upstream commit 8a2d9ed3210464d22fccb9834970629c1c36fa36
"firewire: fix unloading of fw-ohci while devices are attached") made it into
one or another Fedora kernel by now.  Could a kernel package maintainer or the
reporter have a look?

Comment 9 Christopher Brown 2008-01-04 00:29:02 UTC

Hi Stefan,

Yes, it's in the current 2.6.23-based kernel. I'm pretty sure it won't be backported
as F-7 is also running 2.6.23 and previous Fedora releases are EOL'd. As we
haven't heard anything from the original reporter for three months, I'm closing
this INSUFFICIENT_DATA. Please re-open if required...

Cheers
Chris

Comment 10 Stefan Richter 2008-01-04 08:17:19 UTC

> I'm pretty sure it won't be backported as F-7 is also running
> 2.6.23 and previous Fedora releases are EOL'd.

Versions before Fedora 7 are not affected.