Description of problem: System locks up completely about 30 seconds after connecting an external hard drive with an IEEE 1394 "Firewire" interface. No messages occur in system log relating to the device before this happens. Mouse curser lost, no keyboard response, hardware reset required to restart. Version-Release number of selected component (if applicable): As supplied with Fedora 7 How reproducible: Reliable Steps to Reproduce: 1. Plug in IEEE 1394 external hard drive 2. 3. Actual results: System goes away Expected results: automount of device Additional info:
Should have noted that this is an nVidia "nForce2" system chip set with the MCP-T southbridge chip.
seems like the kernel is freezing, after the modules are loaded..
I tried it at runlevel 3 to see if I could get an error message. The message: fw_core: phy config: card 0, new root=ffc0, gap_count=5 occurs hundreds of times, sometimes with "new root=ffc1" and then the terminal freezes without any error message occurring.
Do you use the built-in FireWire controller of the nForce2 chipset, or is it an extra chip? I.e., what does "lspci" print? Do you know if the harddisk works with another PC or/and OS? Vice versa, do you know if this PC's FireWire controller works with another OS, e.g. Windows or a different Linux? If you don't mind the effort, you could test this controller--harddisk combination e.g. with a Knoppix live CD or any other Linux live CD with FireWire support. The kernel in Fedora 7 has a new driver called fw-ohci for FireWire controllers. Before that, the ohci1394 was used and that one had a special workaround for the nForce2 FireWire controller. This workaround does ugly things similar to or same as in bug 144201. However, I also got one report that ohci1394 did not work anymore with an nForce2 controller in a rather recent Linux, despite that workaround. (Ohci1394 didn't freeze the kernel though.) In my opinion, the short-term goal for fw-ohci should be that it does not freeze on nForce2. But whether we can ever get nForce2 fully working remains to be seen; I doubt it.
It is using the nForce2 Firewire adapter from the MCP-T southbridge chip. I don't have the correct cable to connect this disk to my laptop (FC6). I will get one and check it out to make sure that the disk is working. It would be nice to have Firewire but I would certainly prefer the effort go into getting the USB 2.0 interface working! I tested this only because none of my USB storage devices work on F7 (bug 242359).
The disk works fine on the nForce2 hardware using the "System Rescue CD" distribution.
So not all hope is lost for nForce2. Just out of curiosity, is it based on kernel 2.4 or on 2.6?
The version of "System Rescue CD" which I have uses the 2.6.16 kernel. There is a new release of the live CD out which uses 2.6.20 but I have not tried it.
Thanks, knowing that 2.6.16 works is a good datapoint. Now, to port the ohci1394/2.6.16's nForce2 workaround over to fw-ohci/2.6.21-f7 still won't be trivial without direct access to the affected hardware: 1) fw-ohci's bus reset handler is quite different, 2) the old workaround is an evil hack with busy loop in the interrupt handler and we don't want to do such abominable hacks anymore. However, the 1st goal has to be to prevent kernel lockup.
This problem is eliminated if you install the last FC6 kernel (kernel-2.6.20-1.2962.fc6.i686.rpm). This eliminates the problems with USB 2.0 storage devices as well. These all seem to be issues specific to the 2.6.21 kernel.
OK, so ohci1394/2.6.20/fc6 also still works with nForce2, contrary to the one report I mentioned in comment #4.
Related upstream bug: "nforce2 firewire no longer supported in new stack" http://bugzilla.kernel.org/show_bug.cgi?id=8828 No crash there though.
Hello, I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the fedora kernel. http://fedoraproject.org/wiki/KernelBugTriage I am CC'ing myself to this bug and will try and assist you in resolving it if I can. There hasn't been much activity on this bug for a while. Could you tell me if you are still having problems with the latest kernel? I have re-assigned to the firewire subsystem maintainer who may wish to review it however if the problem no longer exists then please close this bug or I'll do so in a few days if there is no additional information lodged. Cheers Chris
At least the upstream bug #8828 (nForce2 unsupported, doesn't work but doesn't necessarily crash) is still open.
(In reply to comment #14) > At least the upstream bug #8828 (nForce2 unsupported, doesn't work but doesn't > necessarily crash) is still open. Yeah, I suppose actually *looking* at the upstream bug would help. Okay thanks for the update, I guess Jay Fenlason may want to add something. I've added the upstream bug as a reference plus there's this debian bug report which started the whole thing: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=435062 It suggests switching on the old stack by blacklist - is this something that can be considered? Cheers Chris
Upstream bug is still open and: http://wiki.linux1394.org/ToDo indicates this is still unresolved - any updates - Stefan?
No work on this has been started. Same status: Controller doesn't work; I don't know whether it will crash (didn't crash for some nForce2 owners).
Assignee or submitter, could you rename the bug to something like "FireWire drivers hang with nForce2 controller" or something like that?
This behavior sounds identical to what I get with a jmicron pci-e firewire controller. I've got an old nforce2 board and cpu laying around somewhere, I'll have to double-check if that has a firewire port too...
Nope, no onboard firewire on my old nforce2 board.
*** Bug 434866 has been marked as a duplicate of this bug. ***
There are two upstream bugs involved: - There is apparently a workaround needed to get the nForce2 controller properly initialized. http://bugzilla.kernel.org/show_bug.cgi?id=8828 - The topology code in firewire-core requires that each self ID complete event is the immediate successor of the previous one. Otherwise there is a danger of list corruption and subsequent kernel panic. http://bugzilla.kernel.org/show_bug.cgi?id=10128
Historical note from http://bugzilla.kernel.org/show_bug.cgi?id=8828: nForce2 workaround as it went into Linux 2.4 (linux1394 SVN r857: busReset loop check for nForce2 chipsets) http://bugzilla.kernel.org/attachment.cgi?id=15375&action=view same workaround, ported to Linux 2.5 (linux1394 SVN r858: busReset loop check, forward ported from 2.4 branch) http://bugzilla.kernel.org/attachment.cgi?id=15376&action=view It is not applicable as-is to firewire-ohci. ohci1394's interrupt handler serves pairs of bus reset events and self ID complete events, while firewire-ohci's interrupt handler only serves self ID complete events. Also, ohci1394's workaround is an awfully ugly hack, as mentioned in a previous comment.
The lockup-after-plugging-something-in problems with my jmicron controller are solved by this patch: http://lkml.org/lkml/2008/3/20/266 From the original reporter's description, it does sound like the same problem -- things are fine until you plug something in, then you hit a bus reset storm (comment #3) and an eventual freeze. I've added this patch to rawhide, kernel 2.6.25-0.141.rc6.git5.fc9 carries it. Will add it to a Fedora 8 kernel soonish to make it easier for people who don't wanna go rawhide to test.
Of course, the jmicron controller still misbehaves in other ways -- after plugging in a disk, the entire system seems to go out to lunch. Can't do anything on either vga or serial console on the box, all I can do is hit ctrl-alt-delete to start a shutdown, but that hangs trying to unload the firewire drivers.
Actually, it doesn't always hang trying to reboot, only sometimes.
Mike, I don't suppose you could try out the latest rawhide kernel on this board, could you? We have a few patches in there now that have rendered this JMicron card actually useful. Still trying to track down an nForce 2 board w/onboard FireWire here...
I've got an nForce 2 board showing up tomorrow, will try to debug this myself.
My new (used) MSI K7N2G-ILSR arrived today. Should have it actually booting something soonish... FWIW, it appears to be an Agere FW803 phy with nForce 2 southbridge MCP-T-driven link layer. IIRC, the Asus boards have a Realtek phy. Hopefully, the phy doesn't matter...
I've got the board up and running, just need to heist some cabling from another box so I've got a 6-pin port to work with instead of the current situation of just motherboard header pins. So far, so good, but when something was plugged in is when the JMicron card blew up, so this much isn't that interesting yet... $ lspci -nnv -s 00:0d.0 00:0d.0 FireWire (IEEE 1394) [0c00]: nVidia Corporation nForce2 FireWire (IEEE 1394) Controller [10de:006e] (rev a3) (prog-if 10 [OHCI]) Subsystem: Micro-Star International Co., Ltd. Unknown device [1462:570d] Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 17 Memory at e5084000 (32-bit, non-prefetchable) [size=2K] Memory at e5085000 (32-bit, non-prefetchable) [size=64] Capabilities: [44] Power Management version 2 $ dmesg |grep firewire firewire_ohci: Added fw-ohci device 0000:00:0d.0, OHCI version 1.10 firewire_core: created device fw0: GUID 000010dc0027ba67, S400
Irrelevant, but a bit odd that the OUI reported there (000010) is for Sytek, Inc., rather than Micro-Star International (which is 000C76).
OUIs registered for Micro-Star are 000C76, 0010DC, 0013D3, 0019DB, 001D92, 002185. Obviously they used 0010DC for this board but accidently shifted it by a byte. Makes you wonder what else they got wrong in the EEPROM.
Whoops, I stopped looking after the first MSI OUI I found. Yeah, that's an ominous sign that they got that wrong... And sadly (or perhaps fortunately, depending on how you look at it), even with the latest firewire code, the box locks up a little while after plugging in a hard disk (without logging a thing in the debug=0 case -- retrying with the debugging cranked up shortly).
Ew. Turning on all debug options reveals an endless stream of busReset interrupts even with nothing plugged into the controller (no crash, system just gets a bit sluggish). So I shut that back off (debug=7), and I get a bit of selfID stuff logged before the system completely locks up again. Fun.
Removing JMicron from the description, they were differently impacted and now working, while nForce2 controllers continue to misbehave right now...
Perhaps something like the following helps: After "reg_write(ohci, OHCI1394_IntEventClear, OHCI1394_busReset)" in bus_reset_tasklet, check if OHCI1394_IntEventSet still has the busReset bit set. If so, reschedule bus_reset_tasklet and return. Repeat until the bit was finally cleared _or_ a loop counter reached a certain limit. In the latter case, force a bus reset. All of this should depend on a quirk flag set for nForce2 (and perhaps Uninorth v1 like in ohci1394), i.e. should not be performed for any other controller. Because we have no control over how fast the tasklet is re-executed, an udelay(10) like in ohci1394 should be inserted before the early exit of bus_reset_tasklet as a starting point. We (you) should then find out - at what occasions this check and loop is needed, and at what occasions the bus reset has to be forced (all bus resets? only the very first bus reset? ...), - whether there is an optimum loop counter limit (avoid that forced bus reset if possible, but don't stay unnecessarily long in the loop), - whether the udelay can be tuned or, ideally, omitted. If a delay is necessary, then a timer callback instead of the plain tasklet would be more sophisticated and nicer to the rest of the system. However, we shouldn't overengineer the workaround for this increasingly rare hardware, hence an udelay in there ( = busy loop in tasklet context --- still better than in ohci1394 where it happens in hardware IRQ context) might be acceptable. It would be nice if you could instrument all this not only for the default configuration of the Fedora kernel, but also for different CONFIG_*HZ* and CONFIG_PREEMPT* settings. At least one thing can be assumed --- the nForce2 controller will obviously only be used with single processor, single core CPUs. Also, this should be checked with small buses (no other node, one external node) and large buses (for example four external nodes, and topology changes from e.g. 4 to 5 as well as 1 to 5, and 5 to 1 nodes). That said, decide for yourself how much work this outdated onboard controller is worth.
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
On F9, with either the stock kernel or the current one (2.6.25.3-18.fc9.i686), as soon as I plug in a camcorder, I get this panic (from 2.6.25.3-18.fc9.i686): kernel BUG at drivers/firewire/fw-transaction.c:576! invalid opcode: 0000 [#1] SMP Modules linked in: radeon drm bridge bnep rfcomm l2cap bluetooth fuse sunrpc ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 vfat fat dm_mirror dm_multipath dm_mod ppdev ns558 gameport snd_mpu401 snd_mpu401_uart snd_rawmidi parport_pc parport snd_intel8x0 snd_ac97_codec serio_raw floppy pcspkr ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event 3c59x mii snd_seq snd_seq_device snd_pcm_oss firewire_ohci snd_mixer_oss firewire_core crc_itu_t snd_pcm snd_timer snd soundcore snd_page_alloc forcedeth sg button usb_storage i2c_nforce2 i2c_core sr_mod cdrom ata_generic pata_acpi pata_amd sata_sil libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted (2.6.25.3-18.fc9.i686 #1) EIP: 0060:[<f89725d4>] EFLAGS: 00010206 CPU: 0 EIP is at fw_core_handle_request+0x53/0x214 [firewire_core] EAX: 0000000e EBX: 00000011 ECX: c0799f4c EDX: c0799f44 ESI: f6f70094 EDI: 00000003 EBP: c0799f34 ESP: c0799ef8 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c0799000 task=c07133a0 task.ti=c0748000) Stack: 00000000 00000000 c0799f44 f797f000 c041e569 c2011800 00000000 5200dc00 0000003d 00000000 c2011800 00000011 00000011 f6f70094 00000003 c0799f8c f8967c2a f797fd2c f797f000 00000000 00000008 000000e0 006c0000 ff93ffff Call Trace: [<c041e569>] ? enqueue_entity+0x1a4/0x1c4 [<f8967c2a>] ? handle_ar_packet+0xf2/0x10b [firewire_ohci] [<f89680ed>] ? ar_context_tasklet+0xf3/0x101 [firewire_ohci] [<c042fdb7>] ? run_timer_softirq+0x141/0x1a7 [<c042ffa4>] ? process_timeout+0x0/0xa [<c042ffa4>] ? process_timeout+0x0/0xa [<c042b971>] ? tasklet_action+0x5e/0xb3 [<c042c089>] ? __do_softirq+0x79/0xe7 [<c0407ddb>] ? do_softirq+0x74/0xb5 [<c045d4f7>] ? handle_fasteoi_irq+0x0/0xaf [<c042be91>] ? irq_exit+0x38/0x6b [<c0407ec8>] ? do_IRQ+0xac/0xc4 [<c04065e7>] ? common_interrupt+0x23/0x28 [<c0418c8b>] ? native_safe_halt+0x5/0x7 [<c0404bf0>] ? default_idle+0x46/0x7e [<c0404baa>] ? default_idle+0x0/0x7e [<c0404b8a>] ? cpu_idle+0xa1/0xc1 [<c061cc59>] ? rest_init+0x49/0x4b ======================= Code: 77 29 ff 24 85 b4 4d 97 f8 8b 75 cc bb 04 00 00 00 83 c6 14 eb 20 8b 45 cc 8b 70 1c 0f b7 58 16 eb 14 8b 55 cc 0f b7 5a 16 eb 09 <0f> 0b eb fe bb 04 00 00 00 31 f6 8d 43 54 ba 20 00 00 00 e8 9f EIP: [<f89725d4>] fw_core_handle_request+0x53/0x214 [firewire_core] SS:ESP 0068:c0799ef8 Kernel panic - not syncing: Fatal exception in interrupt Though you may feel my motherboard is outdated, it is still perfectly usable for many things, including capturing video for the sake of science (for which older machines are often used by those who scrabble for ever-scarce grant funding). Please fix this if at all possible! I'd be happy to test patches and/or provide additional info, if that helps. Thank you.
Jarod: Did you also get crashes in fw_core_handle_request when you last dealt with nForce2? Chris: Could you do an # echo 7 > /sys/module/firewire_ohci/parameters/debug and crash your box again, and post the kernel messages which immediately precede the "kernel BUG at..."?
Chris: Also send output from "lspci -nnv" about the FireWire controller please.
Stefan: I set /sys/module/firewire_ohci/parameters/debug as requested but there were no messages before the "kernel BUG" line. (I'm using a serial console for this purpose, but I don't usually - they should show up there, right?) lspci -nnv, the only entry mentioning firewire or 1394: 00:0d.0 FireWire (IEEE 1394) [0c00]: nVidia Corporation nForce2 FireWire (IEEE 1394) Controller [10de:006e] (rev a3) (prog-if 10 [OHCI]) Subsystem: ASUSTeK Computer Inc. Unknown device [1043:809a] Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 21 Memory at e6084000 (32-bit, non-prefetchable) [size=2K] Memory at e6085000 (32-bit, non-prefetchable) [size=64] Capabilities: [44] Power Management version 2 Kernel driver in use: firewire_ohci Kernel modules: firewire-ohci
Okay, I tried setting /sys/module/firewire_ohci/parameters/debug to 255 and loglevel to 8, and this time I got the following before the ---[cut here]--- line: AR evt_bus_reset, generation 2 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 2, local node ID ffc0 selfID 0: 807f8860, phy 0 [-p.] S400 gc=63 +0W Lc selfID 0: 817f08c0, phy 1 [c..] S100 gc=63 +0W Lc firewire_ohci: IRQ 00000010 AR_req AR ack_complete, link internal firewire_ohci: IRQ 00010010 selfID AR_req Please don't hesitate to let me know if there's anything else I can do, including trying a different kernel, applying a kernel patch, etc.
This too had "EIP is at fw_core_handle_request" in the trace?
Yes, and much of the panic message looked the same, but the call trace is different, so here's the whole thing: kernel BUG at drivers/firewire/fw-transaction.c:576! invalid opcode: 0000 [#1] SMP Modules linked in: radeon drm bridge bnep rfcomm l2cap bluetooth fuse sunrpc ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 vfat fat dm_mirror dm_multipath dm_mod ppdev ns558 snd_mpu401 gameport snd_mpu401_uart snd_rawmidi parport_pc parport snd_intel8x0 floppy snd_ac97_codec ac97_bus snd_seq_dummy pcspkr serio_raw snd_seq_oss snd_seq_midi_event 3c59x snd_seq mii snd_seq_device snd_pcm_oss firewire_ohci snd_mixer_oss firewire_core crc_itu_t snd_pcm snd_timer snd soundcore snd_page_alloc sg forcedeth usb_storage button i2c_nforce2 i2c_core sr_mod cdrom ata_generic pata_acpi pata_amd sata_sil libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted (2.6.25.3-18.fc9.i686 #1) EIP: 0060:[<f89715d4>] EFLAGS: 00010206 CPU: 0 EIP is at fw_core_handle_request+0x53/0x214 [firewire_core] EAX: 0000000e EBX: 00000011 ECX: 00000046 EDX: c0799f44 ESI: f6e41034 EDI: 00000003 EBP: c0799f34 ESP: c0799ef8 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c0799000 task=c07133a0 task.ti=c0748000) Stack: c0799f30 f89669a3 c0799f44 f7b03000 f8968192 f8968269 00000000 524f1307 c0799f00 c0560525 c0799f34 00000011 00000011 f6e41034 00000003 c0799f8c f8966c2a f7b03d2c f7b03000 00000000 00000002 000000e0 006c0000 ff93ffff Call Trace: [<f89669a3>] ? log_ar_at_event+0x11e/0x1af [firewire_ohci] [<c0560525>] ? put_device+0xf/0x11 [<f8966c2a>] ? handle_ar_packet+0xf2/0x10b [firewire_ohci] [<f89670ed>] ? ar_context_tasklet+0xf3/0x101 [firewire_ohci] [<c042b971>] ? tasklet_action+0x5e/0xb3 [<c042c089>] ? __do_softirq+0x79/0xe7 [<c0407ddb>] ? do_softirq+0x74/0xb5 [<c045d4f7>] ? handle_fasteoi_irq+0x0/0xaf [<c042be91>] ? irq_exit+0x38/0x6b [<c0407ec8>] ? do_IRQ+0xac/0xc4 [<c04065e7>] ? common_interrupt+0x23/0x28 [<c0418c8b>] ? native_safe_halt+0x5/0x7 [<c0404bf0>] ? default_idle+0x46/0x7e [<c0404baa>] ? default_idle+0x0/0x7e [<c0404b8a>] ? cpu_idle+0xa1/0xc1 [<c061cc59>] ? rest_init+0x49/0x4b ======================= Code: 77 29 ff 24 85 b4 3d 97 f8 8b 75 cc bb 04 00 00 00 83 c6 14 eb 20 8b 45 cc 8b 70 1c 0f b7 58 16 eb 14 8b 55 cc 0f b7 5a 16 eb 09 <0f> 0b eb fe bb 04 00 00 00 31 f6 8d 43 54 ba 20 00 00 00 e8 9f EIP: [<f89715d4>] fw_core_handle_request+0x53/0x214 [firewire_core] SS:ESP 0068:c0799ef8 Kernel panic - not syncing: Fatal exception in interrupt
Created attachment 307285 [details] firewire: don't panic on invalid AR request buffer Please get the sources of the Fedora kernel (or get vanilla 2.6.26-rc4 if you don't know how to deal with Fedora kernel sources) and apply the attached patch. It will perhaps have line offsets when applying, but that's harmless. This patch removes a rather contra productive BUG() --- the call which generated the trace and then crashed the machine. Instead it logs the header of the invalid packet which caused the trouble and then discards the packet. The underlying problem (which we still need to get to the bottom of) will probably still be unrecoverable after this simple change, and may even still cause the box to panic elsewhere when it continues processing the buffer of incoming requests.
Created attachment 307286 [details] firewire: fw-ohci: use of uninitialized data in AR handler Apply this patch too to be a little bit more cautious when the DMA buffer contains garbage.
PS: And then let's see if we learn something from subsequent firewire-ohci debug logging.
Okay, with your 2 patches, dmesg -n8, and echo 255 > /sys/module/firewire_ohci/parameters/debug (I deleted some usb crud which I assume is unrelated): firewire_ohci: IRQ 00200000 cycle64Seconds firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 2 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 2, local node ID ffc0 selfID 0: 807f8860, phy 0 [-p.] S400 gc=63 +0W Lc selfID 0: 817f08c0, phy 1 [c..] S100 gc=63 +0W Lc firewire_ohci: IRQ 00000010 AR_req AR ack_complete, link internal firewire_ohci: IRQ 00010010 selfID AR_req firewire_core: ERROR - corrupt request received - 000000e0 006c0000 ff93ffff AR evt_bus_reset, generation 3 firewire_ohci: 2 selfIDs, generation 3, local node ID ffc1 selfID 0: 806c0882, phy 0 [p..] S100 gc=44 +0W Lci selfID 0: 816c8870, phy 1 [-c.] S400 gc=44 +0W Lc firewire_core: phy config: card 0, new root=ffc1, gap_count=5 firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 4 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 4, local node ID ffc1 selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc selfID 0: 817f8872, phy 1 [-c.] S400 gc=63 +0W Lci firewire_core: phy config: card 0, new root=ffc1, gap_count=5 firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 5 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 5, local node ID ffc1 selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc selfID 0: 817f8872, phy 1 [-c.] S400 gc=63 +0W Lci firewire_core: phy config: card 0, new root=ffc1, gap_count=5 firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 6 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 6, local node ID ffc1 selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc selfID 0: 817f8872, phy 1 [-c.] S400 gc=63 +0W Lci firewire_core: phy config: card 0, new root=ffc1, gap_count=5 firewire_ohci: IRQ 00000010 AR_req firewire_ohci: IRQ 00010000 selfID AR evt_bus_reset, generation 7 firewire_ohci: 2 selfIDs, generation 7, local node ID ffc1 selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc selfID 0: 817f8872, phy 1 [-c.] S400 gc=63 +0W Lci firewire_core: phy config: card 0, new root=ffc1, gap_count=5 firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 8 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 8, local node ID ffc1 selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc selfID 0: 817f8872, phy 1 [-c.] S400 gc=63 +0W Lci firewire_core: phy config: card 0, new root=ffc1, gap_count=5 firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 9 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 9, local node ID ffc1 selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc selfID 0: 817f8872, phy 1 [-c.] S400 gc=63 +0W Lci firewire_core: phy config: card 0, new root=ffc1, gap_count=5 firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 10 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 10, local node ID ffc0 selfID 0: 807f8862, phy 0 [-p.] S400 gc=63 +0W Lci selfID 0: 817f08c0, phy 1 [c..] S100 gc=63 +0W Lc firewire_ohci: IRQ 00000010 AR_req AR ack_complete, link internal firewire_ohci: IRQ 00010010 selfID AR_req firewire_core: ERROR - corrupt request received - 000000e0 006c0000 ff93ffff AR evt_bus_reset, generation 11 firewire_ohci: 2 selfIDs, generation 11, local node ID ffc0 selfID 0: 806c8860, phy 0 [-p.] S400 gc=44 +0W Lc selfID 0: 816c08c2, phy 1 [c..] S100 gc=44 +0W Lci firewire_core: giving up on config rom for node id ffc1 firewire_ohci: IRQ 00200000 cycle64Seconds and it stays up! (and that last line repeats periodically - could be every 64 seconds; I didn't check.) Thank you. Feeling lucky, I then tried dvgrab. It failed: rom1394_1 warning: read failed: 0x0000fffff0000414 error reading config rom directory for node 1 Error: no camera exists Now, on another F9 machine, I have also been trying dvgrab with the same camcorder. It fails there as well, unless I run it as root AND under strace (!!!). Here it fails even when run as root under strace. (And yes, I'll be filing a bug about the dvgrab issue as soon as I've gathered more information.) But at least it didn't crash. Next step?
Could you post this log again, this time taken from the respective /var/log/... file? (Check with "grep -r selfID /var/log/" which file that would be. I don't have Fedora, hence don't know.) I ask because of syslogd's time stamps. BTW, the cycle64Seconds IRQ event is indeed supposed to happen every 64 seconds. At least one thing which the nForce2 gets right. Now talking to myself: Hmm, "firewire_core: giving up on config rom for node id ffc1" happens without any previous AT event = asynchronous transmission. The gap counts are developing in a strange way. firewire-core tries hard to get it down to 5, but most of the time it stays at the standard value "gc=63", and twice gc=44 is logged, which is a strange value. Maybe the PHY packet from firewire-core was corrupted by the controller before being sent out. The junk AR data 000000e0 006c0000 ff93ffff look like as if the 2nd and 3rd quadlets are a PHY config packet (the 3rd quadlet being the inverse of the 2nd quadlet). As a PHY config packet, 006c0000 would mean "new root ID = 0" (makes sense), "root holdoff = 0" (junk; if it came from firewire-core it should be 1), "set gap count = 1" (makes sense), "gap count = 44" (junk). Anyway, the AR context is not supposed to receive PHY packets... but firewire-ohci forgets to tell this to the controller. Another small patch comes in a minute.
Created attachment 307296 [details] firewire: fw-ohci: clear LinkControl register when enabling the chip Should form the AR events yet one tiny bit more to our liking.
On Fedora, it's /var/log/messages, but they don't all show up there. I probably could have tweaked syslog.conf, but instead I just turned on printk timestamps (echo 1 > /sys/module/printk/parameters/time) - higher resolution anyway. Here's the output with your third patch included; let me know if you want to see it without that patch: [11190.105780] firewire_ohci: IRQ 00200000 cycle64Seconds [11194.455438] firewire_ohci: IRQ 00000010 AR_req [11194.455438] AR evt_bus_reset, generation 6 [11194.455443] firewire_ohci: IRQ 00010000 selfID [11194.456438] firewire_ohci: 2 selfIDs, generation 6, local node ID ffc1 [11194.456438] selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc [11194.456438] selfID 0: 817f8870, phy 1 [-c.] S400 gc=63 +0W Lc [11194.556923] firewire_core: phy config: card 0, new root=ffc1, gap_count=5 [11194.560753] firewire_ohci: IRQ 00000010 AR_req [11194.560753] AR evt_bus_reset, generation 7 [11194.560753] firewire_ohci: IRQ 00010000 selfID [11194.560753] firewire_ohci: 2 selfIDs, generation 7, local node ID ffc1 [11194.560753] selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc [11194.560753] selfID 0: 817f8872, phy 1 [-c.] S400 gc=63 +0W Lci [11194.593665] firewire_core: phy config: card 0, new root=ffc1, gap_count=5 [11194.597393] firewire_ohci: IRQ 00000010 AR_req [11194.597393] AR evt_bus_reset, generation 8 [11194.597393] firewire_ohci: IRQ 00010000 selfID [11194.597393] firewire_ohci: 2 selfIDs, generation 8, local node ID ffc0 [11194.597393] selfID 0: 807f8862, phy 0 [-p.] S400 gc=63 +0W Lci [11194.597393] selfID 0: 817f08c0, phy 1 [c..] S100 gc=63 +0W Lc [11195.194021] firewire_ohci: IRQ 00000010 AR_req [11195.194021] AR evt_bus_reset, generation 9 [11195.194021] firewire_ohci: IRQ 00010000 selfID [11195.194021] firewire_ohci: 2 selfIDs, generation 9, local node ID ffc0 [11195.194024] selfID 0: 806c8860, phy 0 [-p.] S400 gc=44 +0W Lc [11195.195021] selfID 0: 816c08c2, phy 1 [c..] S100 gc=44 +0W Lci [11224.991770] firewire_core: giving up on config rom for node id ffc1 [11254.026490] firewire_ohci: IRQ 00200000 cycle64Seconds
BTW, in case it's of any value, I get these when I turn the camera off: [11496.044735] firewire_ohci: IRQ 00000010 AR_req [11496.044735] AR evt_bus_reset, generation 10 [11496.044735] firewire_ohci: IRQ 00010000 selfID [11496.044735] firewire_ohci: 1 selfIDs, generation 10, local node ID ffc0 [11496.044738] selfID 0: 807f8852, phy 0 [--.] S400 gc=63 +0W Lci
The dvgrab issue I mentioned in comment #48 is now bug #449252, and intriguingly, enabling the debug output that you asked for in this bug (/sys/module/firewire_ohci/parameters/debug) also makes it work. !!!!!
Chris, good job tracking this down so far. To summarize the current status: - crash fixed by switching off (or handling) reception of PHY packets into the AR-req DMA context - some things work: bus reset detection, forcing a bus reset, selfID receive DMA, cycle timer rollover - AT DMA context inoperable: no AT events for fw-device's read requests and fw-card's PHY config packet transmissions - hence inability to read the config ROM of the camera, i.e. to "detect" the camera on a higher level than self identification - also caused by AT DMA issue: inability to get the gap count down (not a vital function though) - the camera sends strange PHY config packets, or the controller believes it does IMO the next thing to do is to patch fw-ohci for AT DMA context diagnostics. I will try to follow up on this during the week, but have a lot of things going on at the moment.
(In reply to comment #39) > Jarod: > > Did you also get crashes in fw_core_handle_request when you last dealt with nForce2? Finally got my nf2 box back up and going today. Nope, no crashes there, at least not logged. I just get repeated phy configs until the box locks up w/o any additional info. The lockup appears to be when generation wraps around from 255 back to 0. This is with the three patches posted Saturday, but an otherwise semi-outdated firewire stack, will try again with latest linux1394-2.6.git + latest posted patches soonish...
Nb: not sure I ever tried a dv camera w/my nf2, may have only been sbp2 devices. With this same outdated stack + the three patches, plugging in a dv camera doesn't lead to a crash, though the camera is still left inoperable: firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 4 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 1 selfIDs, generation 4, local node ID ffc0 selfID 0: 807f8c56, phy 0 [---] S400 gc=63 -3W Lci firewire_core: skipped bus generations, destroying all nodes firewire_core: created device fw0: GUID 000010dc0027ba67, S400 firewire_ohci: IRQ 00000010 AR_req AR evt_bus_reset, generation 5 firewire_ohci: IRQ 00010000 selfID firewire_ohci: 2 selfIDs, generation 5, local node ID ffc0 selfID 0: 807f8c66, phy 0 [-p-] S400 gc=63 -3W Lci selfID 0: 817f08d4, phy 1 [c--] S100 gc=63 +0W Lc firewire_ohci: IRQ 00200000 cycle64Seconds firewire_core: giving up on config rom for node id ffc1
Jarod, this looks about the same as on Chris' board: No AT events happen. The gc=44 in Chris' reports are probably a bug of the camera.
Created attachment 309401 [details] console output of boot and then plugging in Sony camcorder I was able to borrow another camera; this is a Sony DCR-TRV27 (vs. the Panasonic PV-DV402D that I have been using). I should have both of them for the next week or so. Alas, when I plug in the Sony, even with the patches you've provided, the system hangs. I've attached the output (with dmesg -n8 && echo 255 > /sys/module/firewire_ohci/parameters/debug). It hangs so hard I can't even do an emergency sync with the magic sysrq (or via the serial console).
Created attachment 309402 [details] console output of boot and then plugging in Panasonic camcorder Here is the output I get when I plug in the Panasonic camcorder, again after dmesg -n8 && echo 255 > /sys/module/firewire_ohci/parameters/debug; as before, it does NOT hang the machine (with your patches applied). Please let me know if I can provide any further info to help with this. I've borrowed a 1394 card that I could stick in this machine to make comparisons, if that would help.
Re comment 59: Like reported by Jarod in comment 55, it hangs when the generation counter is 0. In addition we now know that it hangs either in fw_core_handle_bus_reset() or after fw_core_handle_bus_reset(). The hang after generation roll-over is a general bug which I just confirmed on a FW323 controller: http://bugzilla.kernel.org/show_bug.cgi?id=10922 This bug is only more visible on nForce2 because the roll-over happens earlier due to nForce2's bus reset loop quirk.
Jay fixed one of the three related upstream bugs: http://bugzilla.kernel.org/show_bug.cgi?id=10922#c10
Patch for the rollover bug now in kernel-2.6.27.4-66.fc10 and later. Need to spin my nForce2 box back up one of these days and see what I can see now...
Seems we're not crashing on generation roll-over any more, just infinitely resetting -- survived quite a storm of resets before I pulled the drive back off the bus. Time to revive that attempt to port the hack from the old driver for this thing... Unfortunately, I think I've lost the prior work I did when one of the disks died in this machine.
I wonder if my "manually set the gap count in the phy control register" patch makes a difference. If you installed firewire_ohci with debug=31, check the gap counts on the self-id packets. If the local node has the wrong gap count, try my patch.
Re comment 64: Worth investigating, but unlikely to resolve it straight away. The old drivers never touched the gap count. The workaround in the old drivers was about getting the intEvent.busReset bit cleared. Among else, this bit directly influences AT DMA (see comment 54). Re comment 63: ...and I should really be working on a quirk fix for this ALi M52xx here in my box --- works with ohci1394, but firewire-ohci is unable to clear intEvent.busReset after the initial bus reset. I suspect that this is very similar to the nForce2 quirk.
(In reply to comment #36) > Perhaps something like the following helps: > > After "reg_write(ohci, OHCI1394_IntEventClear, OHCI1394_busReset)" in > bus_reset_tasklet, check if OHCI1394_IntEventSet still has the busReset bit set. > If so, reschedule bus_reset_tasklet and return. Repeat until the bit was finally > cleared _or_ a loop counter reached a certain limit. In the latter case, force a > bus reset. So I finally spent some time poking at my nf2 box tonight, re-wrote a port of the old stack's patch. Unfortunately, I *always* loop at least 10,000 times, and when I try to force a bus reset, the box panics. Using a 'fw_core_initiate_bus_reset() call to force the reset, and its most definitely this call that is panicking things. I'm probably missing something somewhere though...
Re comment 66: fw_core_initiate_bus_reset() uses ohci_update_phy_reg() which sleeps. A tasklet must not call sleeping functions of course. So, to force a bus reset from inside the bus reset handler, you need to add a reduced non-sleeping version of ohci_update_phy_reg(). Or I, to get ALi M52xx going. [Thinking loudly: Maybe bus_reset_tasklet should rather be a workqueue job. Then we could do a lot more things in it, like rescheduling it in a fashion which is less directly depending on the CONFIG_HZ setting, or do GFP_KERNEL allocations in fw-topology.c and fw-device.c, ... But I suppose the mainline kernel will have threaded IRQ handlers even before we get to seriously experiment with a workqueue conversion of the bus reset tasklet.]
PS, always remember: ohci1394 does not work for all NForce2 owners either. http://osdir.com/ml/kernel.firewire.user/2006-10/msg00092.html http://osdir.com/ml/kernel.firewire.user/2006-10/msg00137.html
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Still an issue, slide bug forward. Might get around to a fix some day, might not... Probably only 3 people in the universe that actually care anyway...
This message is a reminder that Fedora 11 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 11. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '11'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 11's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 11 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Bug still exists. Moving forward again. Hoping to look into it again eventually...
I've created a new bug for Fedora 12 which might shed some more light on this. http://bugzilla.redhat.com/show_bug.cgi?id=592000
(In reply to comment #73) > I've created a new bug for Fedora 12 which might shed some more light on this. > > http://bugzilla.redhat.com/show_bug.cgi?id=592000 Unrelated, unless you happen to be using an nForce2 onboard firewire controller. If you *are* using such a beast, it would seem to suggest the core issue w/nf2 controllers might have been inadvertently fixed at some point.
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle. Changing version to '14'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Jarod, is this fixed or ever going to be? If it's not and you want to keep it open, I suggest moving it to rawhide. Otherwise, maybe we should close it out.
We haven't done anything specific about it. But since there were a lot of driver modifications since comment 72, a test with a current kernel wouldn't hurt.
Every few weeks, I glance at the nForce2 system still hiding under my desk, and think that I should poke at this. Lets keep it open, and maybe someday, I actually *will* poke at it, for the good of the tiny handful of people who still have functioning nForce2 onboard firewire controllers and care.
Set FutureFeature so this never leaves rawhide and doesn't fester in some release when Jarod still hasn't fixed it. ;)
I think its well past time to call it a day on this one. Way too many other higher priority things, and I don't think there's anyone left in the world that gives a damn about this bug (myself included). :)