Bug 240471 - fw-ohci module crashes under kernel-xen
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel-xen
Version: rawhide
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Chris Wright
QA Contact: Virtualization Bugs
URL:
Whiteboard: bzcl34nup
Depends On:
Blocks:
Reported: 2007-05-17 18:29 UTC by Eduardo Habkost
Modified: 2009-12-14 20:37 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2008-05-07 01:45:49 UTC
Type: ---
Embargoed:


Attachments
Experimental fix to make Xen swiotlb accept DMA_BIDIRECTIONAL DMA mappings (2.35 KB, patch)
2007-09-27 15:01 UTC, Eduardo Habkost

Description Eduardo Habkost 2007-05-17 18:29:25 UTC
Cloned from bug #235542, version changed to 'rawhide'

+++ This bug was initially created as a clone of Bug #235542 +++

Description of problem:

kernel-xen-2.6.19-1.2911.6.5.fc6.i686 and newer kernels OOPS immediately upon
plugging in any sbp2/firewire attached hard drives.

Version-Release number of selected component (if applicable):

kernel-xen-2.6.19-1.2911.6.5.fc6.i686 and all newer kernels released officially
by the Fedora project as of this posting.

How reproducible:

Happens every single time reliably.

Steps to Reproduce:
1. If my external firewire chassis is plugged in during boot of xen kernel,
crash will occur during boot. (during sbp2 init)
2. Kernel boots fine, dom0 is also booted fine if my external firewire chassis
is NOT plugged in during boot. All applications run normally. However, the
second I plug in my external firewire chassis, xen panics and hangs.
  
Actual results:

kernel BUG at lib/../arch/i386/kernel/swiotlb.c:394!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: sbp2 bridge netloop netbk blktap blkbk ipv6 dm_mirror
dm_mod raid1 raid0 video sbs i2c_ec button battery asus_acpi ac parport_pc
lp parport sg iTCO_wdt ahci ide_cd ohci1394 i2c_i801 ieee1394 i2c_core cdrom
serial_core pl2303 pcspkr sky2 usbserial floppy ata_piix libata sd_mod scsi

Expected results:

Smooth sailin'

Additional info:

Please know that NON-xen kernels work very happily on this system, and
firewire is no trouble whatsoever.

Gigabyte GA-965P-DS3 motherboard (rev 1.0, bios F10)
E6400 Core-2-Duo CPU
4GB DDR2-800
ICH8 Southbridge
PCI-Express 1394a FireWire card.

Please see attached file for complete OOPS (sbp2 initialization).

This is what a normal init looks like on a non-xen kernel:

ieee1394: Initialized config rom entry `ip1394'
ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 17 (level, low) -> IRQ 17
ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17]  MMIO=[f5004000-f50047ff]  Max Packet=[2048]  IR/IT contexts=[4/8]
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ieee1394: Node added: ID:BUS[0-00:1023]  GUID[0012100200000523]
ieee1394: Node added: ID:BUS[0-01:1023]  GUID[0012100200000522]
ieee1394: Node added: ID:BUS[0-02:1023]  GUID[0012100200000521]
ieee1394: Node added: ID:BUS[0-03:1023]  GUID[0012100200000520]
ieee1394: Node added: ID:BUS[0-04:1023]  GUID[001210020000051f]
ieee1394: Node added: ID:BUS[0-05:1023]  GUID[001210020000051e]
ieee1394: Node changed: 0-00:1023 -> 0-06:1023
ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1)
ieee1394: sbp2: Try serialize_io=0 for better performance
scsi4 : SBP-2 IEEE-1394

And then it goes on to detect all of my drives happily.

-- Additional comment from doc on 2007-04-06 17:40 EST --
Created an attachment (id=151899)
OOPS during plugging in a firewire drive


-- Additional comment from stefan-r-rhbz.de on 2007-04-08 16:14 EST --
This was perhaps fixed in kernel.org's 2.6.21-rcX by patch
"[IA64] make swiotlb use bus_to_virt/virt_to_bus"
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=93fbff63e62b87fe450814db41f859d60b048fb8

and in kernel.org's 2.6.19.6, 2.6.20.2, and 2.6.16.44 by patch
"Missing critical phys_to_virt in lib/swiotlb.c"
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commit;h=e16b67f9a0ac6d9f89f680b7f3b439abfb1dac5e
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.19.y.git;a=commit;h=bcaaa45c3feb2fcc36a247011970d5026c286154
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.16.y.git;a=commit;h=d4705d6dc74016619a1a6565dd54c7c5269c25d0


-- Additional comment from doc on 2007-04-08 19:49 EST --
kernel-xen-2.6.20-1.2943.fc6 from testing repo still crashes in the same
fashion, and I believe it includes the patches noted in comment #2 for 2.6.20.2.

-- Additional comment from stefan-r-rhbz.de on 2007-04-08 20:27 EST --
Could you check the kernel source whether it really has the patch?

If it does, could you build a kernel with the following patch applied?
"ieee1394: sbp2: enforce 32bit DMA mapping"
http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=f8ab7cc6e5457670145e31af6571eb3a584dfddb

If you need help with that, give me the URL of the RPM with kernel-xen-...
sources.  (But not the SRPM please.)

-- Additional comment from doc on 2007-04-09 00:28 EST --
I applied the following patch to the 2943 kernel as suggested by comment #4,
however the problem still exists and the crash still happens in the same way:


*** a/drivers/ieee1394/sbp2.c   2007-02-04 13:44:54.000000000 -0500
--- b/drivers/ieee1394/sbp2.c   2007-04-08 21:10:01.000000000 -0400
***************
*** 765,770 ****
--- 765,775 ----
                        SBP2_ERR("failed to register lower 4GB address range");
                        goto failed_alloc;
                }
+ #else
+               if (dma_set_mask(hi->host->device.parent, DMA_32BIT_MASK)) {
+                       SBP2_ERR("failed to set 4GB DMA mask");
+                       goto failed_alloc;
+               }
  #endif
        }
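
For readers following along, here is a minimal userspace sketch of what a 32-bit DMA mask implies for a buffer. It is not kernel code: `sketch_dma_range_fits` and `SKETCH_DMA_32BIT_MASK` are hypothetical names made up for illustration, and the only thing taken from the patch above is the idea that the device is limited to the lower 4GB.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a 32-bit DMA mask (lower 4GB reachable). */
#define SKETCH_DMA_32BIT_MASK 0x00000000ffffffffULL

/*
 * Hypothetical helper: returns true if every byte of the range
 * [addr, addr + size) lies at or below the highest address the mask allows.
 */
static bool sketch_dma_range_fits(uint64_t addr, size_t size, uint64_t mask)
{
	/* An empty range is treated as invalid in this sketch. */
	if (size == 0)
		return false;

	/* Reject ranges that would wrap around the 64-bit address space. */
	if (addr > UINT64_MAX - (size - 1))
		return false;

	/* Compare the last byte of the range against the mask. */
	return addr + (size - 1) <= mask;
}
```

On a machine with 4GB of RAM, as in this report, some buffers can sit above the 4GB boundary; a range that fails a check like this is the kind that would need a bounce buffer.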




-- Additional comment from stefan-r-rhbz.de on 2007-04-09 05:21 EST --
A naive newbie question:  Is it the host OS or a guest OS that explodes?

-- Additional comment from doc on 2007-04-09 11:31 EST --
(In reply to comment #6)
> A naive newbie question:  Is it the host OS or a guest OS that explodes?

Both crash. As best as I can tell, dom0 crashes first, followed shortly by xen
itself.

-- Additional comment from doc on 2007-04-10 12:17 EST --
Created an attachment (id=152155)
Xen Crash 2943 kernel sbp2


-- Additional comment from stefan-r-rhbz.de on 2007-04-10 12:25 EST --
Forgive my ignorance, but in whose context is the sbp2scsi_queuecommand /
sync_single run?  The host's or the guest's?

-- Additional comment from doc on 2007-04-10 14:49 EST --
(In reply to comment #9)
> Forgive my ignorance, but in whose context is the sbp2scsi_queuecommand /
> sync_single run?  The host's or the guest's?

I am a newbie to xen.. but I'm guessing the context would be dom0.. the guest.


-- Additional comment from doc on 2007-04-14 16:08 EST --
Created an attachment (id=152624)
fedora-xen-2944 crash during boot


I just tested the newly released Fedora-xen-2944 kernel, and unfortunately this
bug still exists. FEDORA + XEN + SBP2 drives = OOPS. :(

-- Additional comment from doc on 2007-05-02 18:51 EST --
Created an attachment (id=153998)
fedora-xen-2948 crash during boot

I just tried the newly released 2.6.20-1.2948 xen kernel. Unfortunately, it
still crashes during sbp2 init. Help :(

-- Additional comment from stefan-r-rhbz.de on 2007-05-02 19:37 EST --
Created an attachment (id=154003)
ieee1394: sbp2: move some memory allocations into non-atomic context and use
GFP_DMA32

Re comment #3:  Could you attach lib/swiotlb.c here?  Make sure it is the one
used in the kernel you are running.

Also, you could try the attached patch.  I took it from the last upstream patch
submission round (post 2.6.21), so you might get conflicts when applying it to
2.6.20-something...  The original patch as it went into mainline only did a
GFP_ATOMIC -> GFP_KERNEL switch; in the attached version I also added GFP_DMA32
to the affected allocation to steer clear of swiotlb bounce buffers.

-- Additional comment from chrisw on 2007-05-11 12:41 EST --
This is a problem with Xen's swiotlb, which doesn't handle sync_single
with DMA_BIDIRECTIONAL.  There are three solutions.  Preferable is to use
TODEVICE or FROMDEVICE, but if that buffer gets written by both sides
that's not an option.  Next is to simply disable this module in the Xen
build.  Last is to fix Xen's swiotlb to handle DMA_BIDIRECTIONAL in this
case.  The last option is the best; however, it is unclear what is needed
to make this fix (so the 2nd option is most likely the one to use).
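
To illustrate the third option, here is a rough userspace model of how a bounce-buffer sync could treat a bidirectional mapping: copy towards the device when syncing for the device, and back towards the CPU when syncing for the CPU. This is only a sketch under assumptions. All names (`sketch_bounce_sync`, the `SKETCH_*` enums) are hypothetical, plain `memcpy` stands in for the real copy path, and it does not reproduce Xen's swiotlb code or any patch attached to this bug.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical direction and sync-target values, for illustration only. */
enum sketch_dma_dir {
	SKETCH_BIDIRECTIONAL,
	SKETCH_TO_DEVICE,
	SKETCH_FROM_DEVICE
};

enum sketch_sync_target {
	SKETCH_SYNC_FOR_CPU,
	SKETCH_SYNC_FOR_DEVICE
};

/*
 * Model of a bounce-buffer sync. "orig" is the driver's buffer, "bounce" is
 * the low-memory copy the device actually sees.
 * Returns 0 on success, -1 for an unknown sync target.
 */
static int sketch_bounce_sync(char *orig, char *bounce, size_t size,
			      enum sketch_dma_dir dir,
			      enum sketch_sync_target target)
{
	switch (target) {
	case SKETCH_SYNC_FOR_CPU:
		/* The device may have written: pull its data back to the CPU. */
		if (dir == SKETCH_FROM_DEVICE || dir == SKETCH_BIDIRECTIONAL)
			memcpy(orig, bounce, size);
		return 0;
	case SKETCH_SYNC_FOR_DEVICE:
		/* The CPU may have written: push its data out to the device. */
		if (dir == SKETCH_TO_DEVICE || dir == SKETCH_BIDIRECTIONAL)
			memcpy(bounce, orig, size);
		return 0;
	}
	return -1;
}
```

In this model a bidirectional mapping is never rejected; it simply copies in whichever direction the sync target calls for, while one-way mappings skip the copy that does not apply to them.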

-- Additional comment from stefan-r-rhbz.de on 2007-05-11 14:42 EST --
All bidirectional DMA mappings in drivers/ieee1394/sbp2.c were unnecessary. 
They were recently converted to the more specific DMA directions.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2446a79f4f0a5e88e5d8316dac407d66ac10f70d

I will look up the 2.6.20-1.2948 sources and post a refreshed version of that
commit circa tomorrow.

-- Additional comment from stefan-r-rhbz.de on 2007-05-13 09:34 EST --
Created an attachment (id=154609)
[PATCH 2.6.20] ieee1394: sbp2: optimize DMA direction of s/g tables

I can't find a suitable archive of 2.6.20-1.2948 sources.
So here you have the patch against vanilla 2.6.20.

-- Additional comment from doc on 2007-05-13 11:53 EST --
I am happy to report that the patch provided in comment #16 has fixed my
firewire issues. :) XEN is running happily now. Thanks!


-- Additional comment from chrisw on 2007-05-17 14:16 EST --
I have also seen this against the new drivers/firewire in rawhide, however it's
from a different path.  The problem is in drivers/firewire/fw-ohci.c with the
Async Receive Contexts setup in ar_context_add_page.  While the buffer may
be bidirectional, I don't see the point of the bidirectional sync for device
(which is what's causing the problem for Xen in rawhide) in that spot.  The
buffer is written by the CPU and needs to be sync'd to the device there, AFAICT.
Shouldn't that just be DMA_TO_DEVICE?

Comment 1 Red Hat Bugzilla 2007-07-25 01:41:19 UTC
change QA contact

Comment 2 Stefan Richter 2007-07-25 09:25:09 UTC
There is a bug in ar_context_add_page(), at least according to what LDD3 says
about DMA mapping:  The CPU shall not access memory after it has been
DMA-mapped.  But I guess this bug is harmless because there is another
dma_sync_single_for_device() after the belated accesses by the CPU.

Regarding a possible optimization for DMA_TO_DEVICE:  The struct ar_buffer which
is mapped there contains /1/ a "command descriptor" to program the controller,
and /2/ a buffer into which the controller will write incoming asynchronous
packets.  ("AR DMA" = asynchronous receive DMA context.)

Part /2/, i.e. ab->data, could be mapped as DMA_FROM_DEVICE.  Part /1/ is partly
to device, partly from device:  The first 24 bytes ab->descriptor.req_count...
.branch_address could be mapped as DMA_TO_DEVICE.  ab->descriptor.res_count is
to be initialized by the driver but will later be updated by the controller, so
I don't see a way around DMA_BIDIRECTIONAL here, except for allocation of
consistent memory.  ab->descriptor.transfer_status is only written by the
controller.
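
The per-field breakdown above can be sketched in plain C as follows. This is an illustration only: the field names come from the comment above, but the field widths (16/16/32/32/16/16 bits) are an assumption, so the byte offsets in this sketch may not match the byte count quoted above. `sketch_descriptor`, `sketch_field_dir` and `sketch_descriptor_dir` are hypothetical names, not identifiers from fw-ohci.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the descriptor at the head of struct ar_buffer. */
struct sketch_descriptor {
	uint16_t req_count;       /* written by the driver */
	uint16_t control;         /* written by the driver */
	uint32_t data_address;    /* written by the driver */
	uint32_t branch_address;  /* written by the driver */
	uint16_t res_count;       /* set by the driver, updated by the controller */
	uint16_t transfer_status; /* written only by the controller */
};

/* Hypothetical direction labels for this sketch. */
enum sketch_field_dir {
	FIELD_TO_DEVICE,
	FIELD_BIDIRECTIONAL,
	FIELD_FROM_DEVICE
};

/* Classify a byte offset inside the descriptor by who writes that field. */
static enum sketch_field_dir sketch_descriptor_dir(size_t offset)
{
	/* req_count through branch_address: driver-written only. */
	if (offset < offsetof(struct sketch_descriptor, res_count))
		return FIELD_TO_DEVICE;

	/* res_count: written by both sides. */
	if (offset < offsetof(struct sketch_descriptor, transfer_status))
		return FIELD_BIDIRECTIONAL;

	/* transfer_status: controller-written only. */
	return FIELD_FROM_DEVICE;
}
```

The sketch makes the difficulty visible: the one field that both sides write sits between fields that only one side writes, so a single streaming mapping over the whole descriptor ends up bidirectional unless consistent memory is used instead.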

The ohci1394 driver of the old ieee1394 stack simply uses consistent memory: 
pci_alloc_consistent() for the buffers and pci_pool_alloc() for the descriptors.

Comment 3 Eduardo Habkost 2007-09-27 00:54:00 UTC
The sbp2 problems seem to be fixed. However, bug #302471 has a report of
crashes when loading fw-ohci (the problem that Chris mentioned in his last
comment above).

The firewire modules were disabled on kernel-xen until the problems are 
solved. I will keep this bug open as a reminder that fw-ohci needs to be fixed 
before enabling it on kernel-xen.

Comment 4 Stefan Richter 2007-09-27 06:51:32 UTC
> as a reminder that fw-ohci needs to be fixed
> before enabling it on kernel-xen.

You mean:  ...that XEN needs to be fixed before anything which uses
DMA_BIDIRECTIONAL can be enabled on it, or fw-ohci's AR DMA needs to be
converted to consistent memory allocations before it can be enabled on XEN.

Comment 5 Eduardo Habkost 2007-09-27 15:01:15 UTC
Created attachment 208501
Experimental fix to make Xen swiotlb accept DMA_BIDIRECTIONAL DMA mappings

This is an experimental patch to make Xen swiotlb implementation accept
DMA_BIDIRECTIONAL mappings.

Comment 6 Eduardo Habkost 2007-09-27 15:06:34 UTC
I am building a test kernel RPM with the experimental patch in attachment
#208501. Testing by people with firewire ohci hardware would be welcome. The
test package is available here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=176609


Additionally, I need people with sbp2 hardware to check if the fw-sbp2 module 
is really working under Xen, so I know if we can enable it on Rawhide/F-8 and 
F-7. A test package with fw-sbp2 enabled (but fw-ohci disabled) is being built 
at: http://koji.fedoraproject.org/koji/taskinfo?taskID=176648

Comment 7 Stefan Richter 2007-09-27 15:49:58 UTC
Re comment #5:  Eduardo, thanks for taking care of the root cause of the issue.

Re comment #6:  AFAIR fw-sbp2 won't do anything interesting as long as it hasn't
access to an SBP-2 device.  And for that it needs fw-ohci driving one or more
controllers.

Comment 8 Eduardo Habkost 2007-09-27 16:05:45 UTC
(In reply to comment #7)
> 
> Re comment #6:  AFAIR fw-sbp2 won't do anything interesting as long as it
> hasn't access to an SBP-2 device.  And for that it needs fw-ohci driving one
> or more controllers.

You are right. I was supposing sbp2 was simply a different type of controller. 
Now I have read the description of the config options.   :)

As it doesn't make sense to enable sbp2 without ohci support, I will keep all
firewire modules disabled until the fix is included in the Fedora
packages.

Testing of fw-ohci and fw-sbp2 using the packages built on 
http://koji.fedoraproject.org/koji/taskinfo?taskID=176609 is still needed, 
however.

Comment 9 Bug Zapper 2008-04-04 00:45:51 UTC
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 10 Bug Zapper 2008-05-07 01:45:47 UTC
This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

