235542 – kernel-xen 2.6.19 and newer crash if sbp2 firewire is used

Summary: kernel-xen 2.6.19 and newer crash if sbp2 firewire is used

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel-xen
Sub Component:
Version:	6
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Eduardo Habkost
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-04-06 21:40 UTC by George Shearer
Modified:	2007-11-30 22:12 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-06-06 13:12:18 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
OOPS during plugging in a firewire drive (4.47 KB, text/plain) 2007-04-06 21:40 UTC, George Shearer
Xen Crash 2943 kernel sbp2 (30.61 KB, text/plain) 2007-04-10 16:17 UTC, George Shearer
fedora-xen-2944 crash during boot (33.00 KB, text/plain) 2007-04-14 20:08 UTC, George Shearer
fedora-xen-2948 crash during boot (28.85 KB, text/plain) 2007-05-02 22:51 UTC, George Shearer
ieee1394: sbp2: move some memory allocations into non-atomic context and use GFP_DMA32 (2.03 KB, patch) 2007-05-02 23:37 UTC, Stefan Richter
[PATCH 2.6.20] ieee1394: sbp2: optimize DMA direction of s/g tables (2.78 KB, patch) 2007-05-13 13:34 UTC, Stefan Richter

Description George Shearer 2007-04-06 21:40:09 UTC

Description of problem:

kernel-xen-2.6.19-1.2911.6.5.fc6.i686 and newer kernels OOPS immediately upon
plugging in any sbp2/firewire attached hard drives.

Version-Release number of selected component (if applicable):

kernel-xen-2.6.19-1.2911.6.5.fc6.i686 and all newer kernels released officially
by the fedora project as of this posting.

How reproducible:

Happens every single time reliably.

Steps to Reproduce:
1. If my external firewire chassis is plugged in while the xen kernel boots,
the crash occurs during boot (during sbp2 init).
2. If the chassis is NOT plugged in during boot, the kernel and dom0 boot fine
and all applications run normally. However, the second I plug in the chassis,
xen panics and hangs.
  
Actual results:

kernel BUG at lib/../arch/i386/kernel/swiotlb.c:394!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: sbp2 bridge netloop netbk blktap blkbk ipv6 dm_mirror
dm_mod raid1 raid0 video sbs i2c_ec button battery asus_acpi ac parport_pc
lp parport sg iTCO_wdt ahci ide_cd ohci1394 i2c_i801 ieee1394 i2c_core cdrom
serial_core pl2303 pcspkr sky2 usbserial floppy ata_piix libata sd_mod scsi

Expected results:

Smooth sailin'

Additional info:

Please know that NON-xen kernels work very happily on this system, and firewire
is no trouble whatsoever.

Gigabyte GA-965P-DS3 motherboard (rev 1.0, bios F10)
E6400 Core-2-Duo CPU
4GB DDR2-800
ICH8 Southbridge
PCI-Express 1394a FireWire card.

Please see attached file for complete OOPS (sbp2 initialization).

This is what a normal init looks like on a non-xen kernel:

ieee1394: Initialized config rom entry `ip1394'
ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 17 (level, low) -> IRQ 17
ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17]  MMIO=[f5004000-f50047ff]  Max Packet=[2048]  IR/IT contexts=[4/8]
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ieee1394: Node added: ID:BUS[0-00:1023]  GUID[0012100200000523]
ieee1394: Node added: ID:BUS[0-01:1023]  GUID[0012100200000522]
ieee1394: Node added: ID:BUS[0-02:1023]  GUID[0012100200000521]
ieee1394: Node added: ID:BUS[0-03:1023]  GUID[0012100200000520]
ieee1394: Node added: ID:BUS[0-04:1023]  GUID[001210020000051f]
ieee1394: Node added: ID:BUS[0-05:1023]  GUID[001210020000051e]
ieee1394: Node changed: 0-00:1023 -> 0-06:1023
ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1)
ieee1394: sbp2: Try serialize_io=0 for better performance
scsi4 : SBP-2 IEEE-1394

And then it goes on to detect all of my drives happily.

Comment 1 George Shearer 2007-04-06 21:40:09 UTC

Created attachment 151899 [details]
OOPS during plugging in a firewire drive

Comment 2 Stefan Richter 2007-04-08 20:14:28 UTC

This was perhaps fixed in kernel.org's 2.6.21-rcX by patch
"[IA64] make swiotlb use bus_to_virt/virt_to_bus"
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=93fbff63e62b87fe450814db41f859d60b048fb8

and in kernel.org's 2.6.19.6, 2.6.20.2, and 2.6.16.44 by patch
"Missing critical phys_to_virt in lib/swiotlb.c"
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commit;h=e16b67f9a0ac6d9f89f680b7f3b439abfb1dac5e
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.19.y.git;a=commit;h=bcaaa45c3feb2fcc36a247011970d5026c286154
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.16.y.git;a=commit;h=d4705d6dc74016619a1a6565dd54c7c5269c25d0

Comment 3 George Shearer 2007-04-08 23:49:40 UTC

kernel-xen-2.6.20-1.2943.fc6 from testing repo still crashes in the same
fashion, and I believe it to include the patches noted in comment #2 for 2.6.20.2.

Comment 4 Stefan Richter 2007-04-09 00:27:56 UTC

Could you check the kernel source whether it really has the patch?

If it does, could you build a kernel with the following patch applied?
"ieee1394: sbp2: enforce 32bit DMA mapping"
http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=f8ab7cc6e5457670145e31af6571eb3a584dfddb

If you need help with that, give me the URL of the RPM with kernel-xen-...
sources.  (But not the SRPM please.)

Comment 5 George Shearer 2007-04-09 04:28:41 UTC

I applied the following patch to the 2943 kernel as suggested by comment #4,
however the problem still exists and the crash still happens in the same way:


*** a/drivers/ieee1394/sbp2.c   2007-02-04 13:44:54.000000000 -0500
--- b/drivers/ieee1394/sbp2.c   2007-04-08 21:10:01.000000000 -0400
***************
*** 765,770 ****
--- 765,775 ----
                        SBP2_ERR("failed to register lower 4GB address range");
                        goto failed_alloc;
                }
+ #else
+               if (dma_set_mask(hi->host->device.parent, DMA_32BIT_MASK)) {
+                       SBP2_ERR("failed to set 4GB DMA mask");
+                       goto failed_alloc;
+               }
  #endif
        }

Comment 6 Stefan Richter 2007-04-09 09:21:19 UTC

A naive newbie question:  Is it the host OS or a guest OS that explodes?

Comment 7 George Shearer 2007-04-09 15:31:34 UTC

(In reply to comment #6)
> A naive newbie question:  Is it the host OS or a guest OS that explodes?

Both crash. As best as I can tell, dom0 crashes first, followed shortly by xen
itself.

Comment 8 George Shearer 2007-04-10 16:17:56 UTC

Created attachment 152155 [details]
Xen Crash 2943 kernel sbp2

Comment 9 Stefan Richter 2007-04-10 16:25:33 UTC

Forgive my ignorance, but in whose context is the sbp2scsi_queuecommand /
sync_single run?  The host's or the guest's?

Comment 10 George Shearer 2007-04-10 18:49:55 UTC

(In reply to comment #9)
> Forgive my ignorance, but in whose context is the sbp2scsi_queuecommand /
> sync_single run?  The host's or the guest's?

I am a newbie to xen.. but I'm guessing the context would be dom0.. the guest.

Comment 11 George Shearer 2007-04-14 20:08:05 UTC

Created attachment 152624 [details]
fedora-xen-2944 crash during boot


I just tested the newly released Fedora-xen-2944 kernel, and unfortunately this
bug still exists. FEDORA + XEN + SBP2 drives = OOPS. :(

Comment 12 George Shearer 2007-05-02 22:51:35 UTC

Created attachment 153998 [details]
fedora-xen-2948 crash during boot

I just tried the newly released 2.6.20-1.2948 xen kernel. Unfortunately, it
still crashes during sbp2 init. Help :(

Comment 13 Stefan Richter 2007-05-02 23:37:28 UTC

Created attachment 154003 [details]
ieee1394: sbp2: move some memory allocations into non-atomic context and use GFP_DMA32

Re comment #3: Could you attach lib/swiotlb.c here?  Make sure it is the one
used in the kernel you are running.

Also, you could try the attached patch.  I took it from the last upstream patch
submission round (post 2.6.21), so you might get conflicts when applying it to
2.6.20-something...  The original patch as it went into mainline only did a
GFP_ATOMIC -> GFP_KERNEL switch; in the attached version I also added GFP_DMA32
to the affected allocation to steer clear of swiotlb bounce buffers.

Comment 14 Chris Wright 2007-05-11 16:41:14 UTC

This is a problem with Xen's swiotlb, which doesn't handle sync_single
with DMA_BIDIRECTIONAL.  There are three solutions.  Preferable is to use
DMA_TO_DEVICE or DMA_FROM_DEVICE, but if that buffer gets written by both
sides that's not an option.  Next is to simply disable this module in the Xen
build.  Last is to fix Xen's swiotlb to handle DMA_BIDIRECTIONAL in this
case.  The last option is the best, but it is unclear what is needed
to make this fix (so the 2nd option is most likely the one to use).
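
To illustrate the failure mode described in comment 14, here is a small
userspace model of a bounce-buffer sync that only accepts a single transfer
direction. It is a sketch, not kernel code: every identifier is hypothetical,
and the assumption that the sync path rejects anything but one direction is
taken from the comment above rather than from the Xen source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only. The names echo the kernel's DMA direction
 * constants, but every identifier here is hypothetical. */
enum model_dma_dir {
    MODEL_DMA_BIDIRECTIONAL,
    MODEL_DMA_TO_DEVICE,
    MODEL_DMA_FROM_DEVICE
};

struct model_bounce {
    unsigned char *driver_buf;  /* buffer the driver mapped */
    unsigned char *bounce_buf;  /* copy the device actually accesses */
    size_t len;
};

/* Returns 0 after copying in the one direction implied by dir.
 * Returns -1 where the modelled code path would hit BUG(). */
int model_sync_single(struct model_bounce *m, enum model_dma_dir dir)
{
    assert(m != NULL && m->driver_buf != NULL && m->bounce_buf != NULL);

    switch (dir) {
    case MODEL_DMA_TO_DEVICE:
        /* CPU wrote the data: push it out to the bounce copy. */
        memcpy(m->bounce_buf, m->driver_buf, m->len);
        return 0;
    case MODEL_DMA_FROM_DEVICE:
        /* Device wrote the data: pull it back into the driver's buffer. */
        memcpy(m->driver_buf, m->bounce_buf, m->len);
        return 0;
    default:
        /* Bidirectional: the model cannot tell which copy is current. */
        return -1;
    }
}
```

In this model a driver that maps its buffers with a specific direction never
reaches the failing branch, which corresponds to the first of the three
solutions listed in comment 14.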

Comment 15 Stefan Richter 2007-05-11 18:42:56 UTC

All bidirectional DMA mappings in drivers/ieee1394/sbp2.c were unnecessary. 
They were recently converted to the more specific DMA directions.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2446a79f4f0a5e88e5d8316dac407d66ac10f70d

I will look up the 2.6.20-1.2948 sources and post a refreshed version of that
commit circa tomorrow.
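
As a rough illustration of what "more specific DMA directions" means, the
helper below picks the narrowest direction from who writes a buffer. It is a
hypothetical userspace sketch; the enum and function names are invented for
this example and do not appear in sbp2.c or in the linked commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's DMA direction constants. */
enum sketch_dma_dir {
    SKETCH_DMA_BIDIRECTIONAL,
    SKETCH_DMA_TO_DEVICE,
    SKETCH_DMA_FROM_DEVICE,
    SKETCH_DMA_NONE
};

/* Choose the narrowest direction that still covers every writer. */
enum sketch_dma_dir sketch_pick_dir(bool cpu_writes, bool device_writes)
{
    if (cpu_writes && device_writes)
        return SKETCH_DMA_BIDIRECTIONAL;  /* both sides modify the buffer */
    if (cpu_writes)
        return SKETCH_DMA_TO_DEVICE;      /* device only reads it */
    if (device_writes)
        return SKETCH_DMA_FROM_DEVICE;    /* CPU only reads it */
    return SKETCH_DMA_NONE;               /* no data transfer at all */
}
```

For example, a table that the CPU fills in and the device only reads would map
to the to-device case, so a bidirectional mapping is only needed when both
sides really write the same buffer.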

Comment 16 Stefan Richter 2007-05-13 13:34:57 UTC

Created attachment 154609 [details]
[PATCH 2.6.20] ieee1394: sbp2: optimize DMA direction of s/g tables

I can't find a suitable archive of 2.6.20-1.2948 sources.
So here you have the patch against vanilla 2.6.20.

Comment 17 George Shearer 2007-05-13 15:53:03 UTC

I am happy to report that the patch provided in comment #16 has fixed my
firewire issues. :) XEN is running happily now. Thanks!

Comment 18 Chris Wright 2007-05-17 18:06:19 UTC

I h

Comment 19 Chris Wright 2007-05-17 18:16:41 UTC

I have also seen this against the new drivers/firewire in rawhide, however it
comes from a different path.  The problem is in drivers/firewire/fw-ohci.c with
the Async Receive Contexts setup in ar_context_add_page.  While the buffer may
be bidirectional, I don't see the point of the bidirectional sync for device
(which is what's causing the problem for Xen in rawhide) in that spot.  The
buffer is written by the CPU and needs to be sync'd to the device there, AFAICT.
Shouldn't that just be DMA_TO_DEVICE?

Comment 20 Eduardo Habkost 2007-05-21 22:33:58 UTC

Removing dependency, that was added automatically by bugzilla cloning. The FC6 
bug is not dependent on the FC7 bug.

Comment 21 Eduardo Habkost 2007-06-05 18:22:33 UTC

Patch from comment #16 included on CVS.
