Cloned from bug #235542, version changed to 'rawhide'

+++ This bug was initially created as a clone of Bug #235542 +++

Description of problem:
kernel-xen-2.6.19-1.2911.6.5.fc6.i686 and newer kernels OOPS immediately upon plugging in any sbp2/firewire attached hard drives.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.19-1.2911.6.5.fc6.i686 and all newer kernels released officially by the Fedora project as of this posting.

How reproducible:
Happens every single time reliably.

Steps to Reproduce:
1. If my external firewire chassis is plugged in during boot of xen kernel, crash will occur during boot. (during sbp2 init)
2. Kernel boots fine, dom0 is also booted fine if my external firewire chassis is NOT plugged in during boot. All applications run normally. However, the second I plug in my external firewire chassis, xen panics and hangs.
3.

Actual results:
kernel BUG at lib/../arch/i386/kernel/swiotlb.c:394!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: sbp2 bridge netloop netbk blktap blkbk ipv6 dm_mirror dm_mod raid1 raid0 video sb s i2c_ec button battery asus_acpi ac parport_pc lp parport sg iTCO_wdt ahci ide_cd ohci1394 i2c_i801 ieee1394 i2c_core cdrom serial_core pl2303 pcspkr sky2 usbserial floppy ata_piix libata sd_mod scsi

Expected results:
Smooth sailin'

Additional info:
Please know that NON-xen kernels work very happily on this system, and firewire is no trouble whatsoever.

Gigabyte GA-965P-DS3 motherboard (rev 1.0, bios F10)
E6400 Core-2-Duo CPU
4GB DDR2-800
ICH8 Northbridge
PCI-Express 1394a FireWire card.

Please see attached file for complete OOPS (sbp2 initialization).
This is what a normal init looks like on a non-xen kernel:

ieee1394: Initialized config rom entry `ip1394'
ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 17 (level, low) -> IRQ 17
ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f5004000-f50047ff] Max Packet=[2048] IR/IT contexts=[4/8]
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ieee1394: Node added: ID:BUS[0-00:1023] GUID[0012100200000523]
ieee1394: Node added: ID:BUS[0-01:1023] GUID[0012100200000522]
ieee1394: Node added: ID:BUS[0-02:1023] GUID[0012100200000521]
ieee1394: Node added: ID:BUS[0-03:1023] GUID[0012100200000520]
ieee1394: Node added: ID:BUS[0-04:1023] GUID[001210020000051f]
ieee1394: Node added: ID:BUS[0-05:1023] GUID[001210020000051e]
ieee1394: Node changed: 0-00:1023 -> 0-06:1023
ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1)
ieee1394: sbp2: Try serialize_io=0 for better performance
scsi4 : SBP-2 IEEE-1394

And then it goes on to detect all of my drives happily.

-- Additional comment from doc on 2007-04-06 17:40 EST --
Created an attachment (id=151899)
OOPS during plugging in a firewire drive

-- Additional comment from stefan-r-rhbz.de on 2007-04-08 16:14 EST --
This was perhaps fixed in kernel.org's 2.6.21-rcX by patch "[IA64] make swiotlb use bus_to_virt/virt_to_bus"
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=93fbff63e62b87fe450814db41f859d60b048fb8
and in kernel.org's 2.6.19.6, 2.6.20.2, and 2.6.16.44 by patch "Missing critical phys_to_virt in lib/swiotlb.c"
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commit;h=e16b67f9a0ac6d9f89f680b7f3b439abfb1dac5e
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.19.y.git;a=commit;h=bcaaa45c3feb2fcc36a247011970d5026c286154
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.16.y.git;a=commit;h=d4705d6dc74016619a1a6565dd54c7c5269c25d0

-- Additional comment from doc on 2007-04-08 19:49 EST --
kernel-xen-2.6.20-1.2943.fc6 from testing repo
still crashes in the same fashion, and I believe it to include the patches noted in comment #2 for 2.6.20.2.

-- Additional comment from stefan-r-rhbz.de on 2007-04-08 20:27 EST --
Could you check the kernel source whether it really has the patch? If it does, could you build a kernel with the following patch applied?
"ieee1394: sbp2: enforce 32bit DMA mapping"
http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=f8ab7cc6e5457670145e31af6571eb3a584dfddb
If you need help with that, give me the URL of the RPM with kernel-xen-... sources. (But not the SRPM please.)

-- Additional comment from doc on 2007-04-09 00:28 EST --
I applied the following patch to the 2943 kernel as suggested by comment #4, however the problem still exists and the crash still happens in the same way:

*** a/drivers/ieee1394/sbp2.c 2007-02-04 13:44:54.000000000 -0500
--- b/drivers/ieee1394/sbp2.c 2007-04-08 21:10:01.000000000 -0400
***************
*** 765,770 ****
--- 765,775 ----
          SBP2_ERR("failed to register lower 4GB address range");
          goto failed_alloc;
      }
+ #else
+     if (dma_set_mask(hi->host->device.parent, DMA_32BIT_MASK)) {
+         SBP2_ERR("failed to set 4GB DMA mask");
+         goto failed_alloc;
+     }
  #endif
  }

-- Additional comment from stefan-r-rhbz.de on 2007-04-09 05:21 EST --
A naive newbie question: Is it the host OS or a guest OS that explodes?

-- Additional comment from doc on 2007-04-09 11:31 EST --
(In reply to comment #6)
> A naive newbie question: Is it the host OS or a guest OS that explodes?

Both crash. As best as I can tell, dom0 crashes first, followed shortly by xen itself.

-- Additional comment from doc on 2007-04-10 12:17 EST --
Created an attachment (id=152155)
Xen Crash 2943 kernel sbp2

-- Additional comment from stefan-r-rhbz.de on 2007-04-10 12:25 EST --
Forgive my ignorance, but in whose context is the sbp2scsi_queuecommand / sync_single run? The host's or the guest's?
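For readers unfamiliar with DMA masks: the dma_set_mask() patch tried above limits the device to bus addresses below 4GB. The following is only a userspace sketch of that address check, not kernel code. The names `model_needs_bounce` and `MODEL_DMA_32BIT_MASK` are hypothetical, and the sketch assumes a mask made of contiguous low bits. Under Xen the real decision also depends on machine addresses, which this model ignores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's DMA_32BIT_MASK: the low 32 bits set. */
#define MODEL_DMA_32BIT_MASK UINT64_C(0x00000000ffffffff)

/*
 * Return 1 if any byte of [bus_addr, bus_addr + size) lies outside the
 * range a device can address under dma_mask, 0 otherwise.
 * Assumption: dma_mask consists of contiguous low bits, so checking the
 * last byte of the buffer is sufficient.
 */
int model_needs_bounce(uint64_t bus_addr, size_t size, uint64_t dma_mask)
{
    uint64_t last;

    if (size == 0)
        return 0;                 /* nothing to transfer */
    last = bus_addr + (uint64_t)(size - 1);
    if (last < bus_addr)
        return 1;                 /* address range wrapped around */
    return (last & ~dma_mask) != 0;
}
```

A buffer that fails this kind of check is what a bounce-buffer layer such as swiotlb has to copy through low memory, which is how the sync_single path in the OOPS comes into play.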
-- Additional comment from doc on 2007-04-10 14:49 EST --
(In reply to comment #9)
> Forgive my ignorance, but in whose context is the sbp2scsi_queuecommand /
> sync_single run? The host's or the guest's?

I am a newbie to xen.. but I'm guessing the context would be dom0.. the guest.

-- Additional comment from doc on 2007-04-14 16:08 EST --
Created an attachment (id=152624)
fedora-xen-2944 crash during boot

I just tested the newly released Fedora-xen-2944 kernel, and unfortunately this bug still exists. FEDORA + XEN + SBP2 drives = OOPS. :(

-- Additional comment from doc on 2007-05-02 18:51 EST --
Created an attachment (id=153998)
fedora-xen-2948 crash during boot

I just tried the newly released 2.6.20-1.2948 xen kernel.. unfortunately, it still crashes during sbp2 init. help :(

-- Additional comment from stefan-r-rhbz.de on 2007-05-02 19:37 EST --
Created an attachment (id=154003)
ieee1394: sbp2: move some memory allocations into non-atomic context and use GFP_DMA32

Re comment #3: Could you attach lib/swiotlb.c here? Make sure it is the one used in the kernel you are running.

Also, you could try the attached patch. I took it from the last upstream patch submission round (post 2.6.21), so you might get conflicts when applying it to 2.6.20-something... The original patch as it went into mainline only did a GFP_ATOMIC -> GFP_KERNEL switch; in the attached version I also added GFP_DMA32 to the affected allocation to steer clear of swiotlb bounce buffers.

-- Additional comment from chrisw on 2007-05-11 12:41 EST --
This is a problem with Xen's swiotlb, which doesn't handle sync_single with DMA_BIDIRECTIONAL. There are three solutions. Preferable is to use TODEVICE or FROMDEVICE, but if that buffer gets written by both sides that's not an option. Next is to simply disable this module in the Xen build. Last is to fix Xen's swiotlb to handle DMA_BIDIRECTIONAL in this case.
The last option is the best; however, it is unclear what is needed to make this fix (so the 2nd option is most likely the one to use).

-- Additional comment from stefan-r-rhbz.de on 2007-05-11 14:42 EST --
All bidirectional DMA mappings in drivers/ieee1394/sbp2.c were unnecessary. They were recently converted to the more specific DMA directions.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2446a79f4f0a5e88e5d8316dac407d66ac10f70d
I will look up the 2.6.20-1.2948 sources and post a refreshed version of that commit circa tomorrow.

-- Additional comment from stefan-r-rhbz.de on 2007-05-13 09:34 EST --
Created an attachment (id=154609)
[PATCH 2.6.20] ieee1394: sbp2: optimize DMA direction of s/g tables

I can't find a suitable archive of 2.6.20-1.2948 sources. So here you have the patch against vanilla 2.6.20.

-- Additional comment from doc on 2007-05-13 11:53 EST --
I am happy to report that the patch provided in comment #16 has fixed my firewire issues. :) XEN is running happily now. Thanks!

-- Additional comment from chrisw on 2007-05-17 14:06 EST --
I h

-- Additional comment from chrisw on 2007-05-17 14:16 EST --
I have also seen this against the new drivers/firewire in rawhide, however it's from a different path. The problem is in drivers/firewire/fw-ohci.c with the Async Receive Contexts setup in ar_context_add_page. While the buffer may be bidirectional, I don't see the point of the bidirectional sync for device (which is what's causing the problem for Xen in rawhide) in that spot. The buffer is written by the CPU and needs to be sync'd to the device there, AFAICT. Shouldn't that just be DMA_TO_DEVICE?
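To illustrate the failure mode described in the comments above, here is a small userspace model of a bounce-buffer sync routine that only knows the two one-way directions. This is a sketch, not the Xen source: all `model_` names are hypothetical, the enum values are chosen to mirror the kernel's DMA direction constants, and where the real code hits a BUG the model simply returns -1.

```c
#include <stddef.h>
#include <string.h>

/* Values chosen to mirror the kernel's enum dma_data_direction. */
enum model_dma_dir {
    MODEL_DMA_BIDIRECTIONAL = 0,
    MODEL_DMA_TO_DEVICE     = 1,
    MODEL_DMA_FROM_DEVICE   = 2
};

/* Hypothetical pairing of a driver buffer with its low-memory bounce copy. */
struct model_bounce_map {
    unsigned char *cpu_buf;     /* what the driver reads and writes */
    unsigned char *bounce_buf;  /* what the device reads and writes */
    size_t size;
};

/*
 * One-way sync: copy in the single direction implied by dir.
 * Returns 0 on success, -1 for any direction it cannot handle
 * (the model's stand-in for a kernel BUG).
 */
int model_sync_single(struct model_bounce_map *m, enum model_dma_dir dir)
{
    switch (dir) {
    case MODEL_DMA_TO_DEVICE:
        memcpy(m->bounce_buf, m->cpu_buf, m->size);
        return 0;
    case MODEL_DMA_FROM_DEVICE:
        memcpy(m->cpu_buf, m->bounce_buf, m->size);
        return 0;
    default:
        /* A bidirectional mapping gives no hint which way to copy. */
        return -1;
    }
}
```

With mappings converted to one specific direction, as the sbp2 patch does, a routine of this shape is never asked to handle the bidirectional case.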
change QA contact
There is a bug in ar_context_add_page(), at least according to what LDD3 says about DMA mapping: The CPU shall not access memory after it has been DMA-mapped. But I guess this bug is harmless because there is another dma_sync_single_for_device() after the belated accesses by the CPU. Regarding a possible optimization for DMA_TO_DEVICE: The struct ar_buffer which is mapped there contains /1/ a "command descriptor" to program the controller, and /2/ a buffer into which the controller will write incoming asynchronous packets. ("AR DMA" = asynchronous receive DMA context.) Part /2/, i.e. ab->data, could be mapped as DMA_FROM_DEVICE. Part /1/ is partly to device, partly from device: The first 24 bytes ab->descriptor.req_count... .branch_address could be mapped as DMA_TO_DEVICE. ab->descriptor.res_count is to be initialized by the driver but will later be updated by the controller, so I don't see a way around DMA_BIDIRECTIONAL here, except for allocation of consistent memory. ab->descriptor.transfer_status is only written by the controller. The ohci1394 driver of the old ieee1394 stack simply uses consistent memory: pci_alloc_consistent() for the buffers and pci_pool_alloc() for the descriptors.
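The field-by-field analysis above can be sketched as a small userspace model. The field widths and the omission of any other members are assumptions made for illustration, not a copy of the fw-ohci definitions, and every `model_` name is hypothetical. Region boundaries are computed with offsetof from this assumed layout instead of being hard-coded.

```c
#include <stddef.h>
#include <stdint.h>

/* Who writes a given byte of the mapped buffer. */
enum model_writer {
    MODEL_CPU_ONLY,        /* candidate for DMA_TO_DEVICE */
    MODEL_DEVICE_ONLY,     /* candidate for DMA_FROM_DEVICE */
    MODEL_CPU_AND_DEVICE   /* needs DMA_BIDIRECTIONAL or consistent memory */
};

/* Assumed layout of the command descriptor (part /1/). */
struct model_descriptor {
    uint16_t req_count;        /* written by the driver */
    uint16_t control;          /* written by the driver */
    uint32_t data_address;     /* written by the driver */
    uint32_t branch_address;   /* written by the driver */
    uint16_t res_count;        /* set by the driver, updated by the controller */
    uint16_t transfer_status;  /* written by the controller only */
};

/* Simplified buffer: descriptor first, then the receive area (part /2/). */
struct model_ar_buffer {
    struct model_descriptor descriptor;
    uint32_t data[16];         /* size is arbitrary in this sketch */
};

/*
 * Classify a byte offset within struct model_ar_buffer. The descriptor
 * sits at offset 0, so descriptor offsets are also buffer offsets.
 */
enum model_writer model_writer_at(size_t offset)
{
    if (offset < offsetof(struct model_descriptor, res_count))
        return MODEL_CPU_ONLY;
    if (offset < offsetof(struct model_descriptor, transfer_status))
        return MODEL_CPU_AND_DEVICE;
    return MODEL_DEVICE_ONLY;  /* transfer_status and the data area */
}
```

The single MODEL_CPU_AND_DEVICE field in the middle of the structure is the reason one specific direction cannot cover the whole mapping.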
The sbp2 problem seems to be fixed. However, bug #302471 has a report of crashes when loading fw-ohci (the problem that Chris mentioned in his last comment above). The firewire modules were disabled on kernel-xen until the problems are solved. I will keep this bug open as a reminder that fw-ohci needs to be fixed before enabling it on kernel-xen.
> as a reminder that fw-ohci needs to be fixed
> before enabling it on kernel-xen.

You mean: ...that XEN needs to be fixed before anything which uses DMA_BIDIRECTIONAL can be enabled on it, or fw-ohci's AR DMA needs to be converted to consistent memory allocations before it can be enabled on XEN.
Created attachment 208501
Experimental fix to make Xen swiotlb accept DMA_BIDIRECTIONAL DMA mappings

This is an experimental patch to make the Xen swiotlb implementation accept DMA_BIDIRECTIONAL mappings.
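The following is not taken from the attachment. It is a userspace sketch of one way a bounce-buffer layer can accept bidirectional mappings: the caller states whether the sync is for the CPU or for the device, and that target decides the copy direction. All `model_` names are hypothetical, and whether the experimental patch works this way is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Values chosen to mirror the kernel's enum dma_data_direction. */
enum model_dma_dir {
    MODEL_DMA_BIDIRECTIONAL = 0,
    MODEL_DMA_TO_DEVICE     = 1,
    MODEL_DMA_FROM_DEVICE   = 2
};

/* Who is about to access the buffer after the sync. */
enum model_sync_target {
    MODEL_SYNC_FOR_CPU,
    MODEL_SYNC_FOR_DEVICE
};

struct model_bounce_pair {
    unsigned char *cpu_buf;     /* driver-visible buffer */
    unsigned char *bounce_buf;  /* device-visible bounce copy */
    size_t size;
};

/*
 * Sync with an explicit target. A bidirectional mapping copies in
 * whichever direction the target requires; a one-way mapping synced
 * toward the side that never reads it is a no-op.
 * Returns 0 on success, -1 for an unknown direction or target.
 */
int model_sync_single_target(struct model_bounce_pair *p,
                             enum model_dma_dir dir,
                             enum model_sync_target target)
{
    switch (target) {
    case MODEL_SYNC_FOR_CPU:
        if (dir == MODEL_DMA_FROM_DEVICE || dir == MODEL_DMA_BIDIRECTIONAL) {
            memcpy(p->cpu_buf, p->bounce_buf, p->size);
            return 0;
        }
        return dir == MODEL_DMA_TO_DEVICE ? 0 : -1;
    case MODEL_SYNC_FOR_DEVICE:
        if (dir == MODEL_DMA_TO_DEVICE || dir == MODEL_DMA_BIDIRECTIONAL) {
            memcpy(p->bounce_buf, p->cpu_buf, p->size);
            return 0;
        }
        return dir == MODEL_DMA_FROM_DEVICE ? 0 : -1;
    }
    return -1;
}
```

The design point is that the direction of a mapping alone does not say which copy is current, while the sync target does.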
I am building a test kernel RPM with the experimental patch from attachment #208501. Testing by people with firewire ohci hardware would be welcome. The test package is available here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=176609

Additionally, I need people with sbp2 hardware to check whether the fw-sbp2 module really works under Xen, so I know if we can enable it on Rawhide/F-8 and F-7. A test package with fw-sbp2 enabled (but fw-ohci disabled) is being built at:
http://koji.fedoraproject.org/koji/taskinfo?taskID=176648
Re comment #5: Eduardo, thanks for taking care of the root cause of the issue. Re comment #6: AFAIR fw-sbp2 won't do anything interesting as long as it hasn't access to an SBP-2 device. And for that it needs fw-ohci driving one or more controllers.
(In reply to comment #7)
>
> Re comment #6: AFAIR fw-sbp2 won't do anything interesting as long as it hasn't
> access to an SBP-2 device. And for that it needs fw-ohci driving one or more
> controllers.

You are right. I was supposing sbp2 was simply a different type of controller. Now I have read the description of the config options. :)

As it doesn't make sense to enable sbp2 without ohci support, I will keep all firewire modules disabled until the fix is included in the Fedora packages.

Testing of fw-ohci and fw-sbp2 using the packages built on http://koji.fedoraproject.org/koji/taskinfo?taskID=176609 is still needed, however.
Based on the date this bug was created, it appears to have been reported against rawhide during the development of a Fedora release that is no longer maintained. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained. If this bug remains in NEEDINFO thirty (30) days from now, we will automatically close it. If you can reproduce this bug in a maintained Fedora version (7, 8, or rawhide), please change this bug to the respective version and change the status to ASSIGNED. (If you're unable to change the bug's version or status, add a comment to the bug and someone will change it for you.) Thanks for your help, and we apologize again that we haven't handled these issues to this point. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.
This bug has been in NEEDINFO for more than 30 days since feedback was first requested. As a result we are closing it. If you can reproduce this bug in the future against a maintained Fedora version please feel free to reopen it against that version. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp