Bug 453680 - Error in the uhci code causes usb not to work with iommu=calgary boot option
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: x86_64
OS: All
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Pete Zaitcev
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 474047
Reported: 2008-07-01 21:00 UTC by IBM Bug Proxy
Modified: 2009-06-20 04:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:26:41 UTC
Target Upstream Version:
Embargoed:


Attachments
Creates Fallback DMA ops for devices that are not handled by Calgary (4.95 KB, text/plain)
2008-07-01 21:00 UTC, IBM Bug Proxy
Full dmesg output (44.58 KB, text/plain)
2008-07-01 21:00 UTC, IBM Bug Proxy
Rediffed, #ifdef added, off-by-one covered. (5.35 KB, patch)
2008-08-27 23:53 UTC, Pete Zaitcev
Fixed mapping_error, sync_for_foo, alloc_consistent (9.47 KB, patch)
2008-09-12 03:19 UTC, Pete Zaitcev
Boot messages from x3950M2 (48.62 KB, text/plain)
2008-09-16 17:11 UTC, IBM Bug Proxy
Backtrace from panic (iommu=calgary) on .122 kernel (44.71 KB, application/octet-stream)
2008-11-24 19:31 UTC, IBM Bug Proxy


Links
System | ID | Private | Priority | Status | Summary | Last Updated
IBM Linux Technology Center | 43359 | 0 | None | None | None | Never
Red Hat Product Errata | RHSA-2009:0225 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update | 2009-01-20 16:06:24 UTC

Description IBM Bug Proxy 2008-07-01 21:00:31 UTC
=Comment: #0=================================================
ALEXIS H. BRUEMMER <ahbruemm.com> - 2008-03-21 19:50 EDT
Problem description:
When iommu=calgary is used as a boot option an error occurs in the uhci_hcd code
that keeps the usb port from working

Provide output from "uname -a", if possible:
Linux elm3c157.beaverton.ibm.com 2.6.18-84.el5 #1 SMP Fri Feb 29 16:26:52 EST
2008 x86_64 x86_64 x86_64 GNU/Linux

Hardware Environment
    Machine type (p650, x235, SF2, etc.): x3950M2
    Cpu type (Power4, Power5, IA-64, etc.):x86_64
    Describe any special hardware you think might be relevant to this problem: CalIOC2


Is this reproducible?
    If so, how long does it (did it) take to reproduce it?
    Describe the steps:
add "iommu=calgary" as a boot option for RHEL5.2 beta
boot
watch as the USB port fails to come up.

Is the system (not just the application) hung? No

Did the system produce an OOPS message on the console? yes
    If so, copy it here:
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.0: irq 185, io base 0x00003000
uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
uhci_hcd 0000:00:1d.0: host controller halted, very bad!
uhci_hcd 0000:00:1d.0: HC died; cleaning up
=Comment: #1=================================================
ALEXIS H. BRUEMMER <ahbruemm.com> - 2008-03-21 19:51 EDT

Full dmesg output

=Comment: #5=================================================
ALEXIS H. BRUEMMER <ahbruemm.com> - 2008-07-01 16:45 EDT

Creates Fallback DMA ops for devices that are not handled by Calgary

Verified and tested on an x3950M2 with multiple pci express controllers
attached.  This patch does not break kABI.

Comment 1 IBM Bug Proxy 2008-07-01 21:00:34 UTC
Created attachment 310722 [details]
Creates Fallback DMA ops for devices that are not handled by Calgary

Comment 2 IBM Bug Proxy 2008-07-01 21:00:36 UTC
Created attachment 310723 [details]
Full dmesg output

Comment 3 IBM Bug Proxy 2008-07-15 20:00:49 UTC
------- Comment From abareval.com 2008-07-15 15:56 EDT-------
Hello Red Hat,
Any news on this particular bug?

Comment 4 Pete Zaitcev 2008-08-13 22:20:03 UTC
Apparently the upstream diff is: 1956a96de488feb05e95c08c9d5e80f63a4be2b1
It's quite a bit different from the proposed patch (attached in 310722).

Comment 5 IBM Bug Proxy 2008-08-13 23:21:31 UTC
(In reply to comment #11)
> ------- Comment From zaitcev 2008-08-13 18:20:03 EDT-------
> Apparently the upstream diff is: 1956a96de488feb05e95c08c9d5e80f63a4be2b1
> It's quite a bit different from the proposed patch (attached in 310722).
Yes they are different because the patch that went into mainline would have
broken kabi. These patches however are functionally the same.

Comment 6 Pete Zaitcev 2008-08-27 23:53:23 UTC
Created attachment 315160 [details]
Rediffed, #ifdef added, off-by-one covered.

Comment 7 Pete Zaitcev 2008-09-12 03:19:56 UTC
Created attachment 316513 [details]
Fixed mapping_error, sync_for_foo, alloc_consistent

Comment 8 Pete Zaitcev 2008-09-12 03:34:15 UTC
Our engineering review has determined that the patch from IBM contains
several errors. I wrote a patch (comment #7, id=316513) to address these
problems. Specifically:

0. Fallback is not set if end_pfn == MAX_DMA32_PFN exactly.
   My previous patch addressed that already.

1. The mapping_error method was not implemented, which is ok for
   Calgary, but not ok for swiotlb. Therefore, Calgary has to implement
   mapping_error. Worse, two different addresses can be "bad". For that
   reason, an additional reservation is made in calgary_reserve_regions.

2. calgary_unmap_sg did not do necessary fallback.

3. Fallback pointers weren't checked for NULL, yet in some cases
   they could be NULL (in alloc_coherent, if fallback is nommu,
   for example).

4. It was necessary to create fall-throughs for syncing, in case
   swiotlb is the fallback (it's when memcpy happens, so we cannot
   skip that).

5. In alloc_consistent, we have to allocate the pages if there's no
   fallback (e.g. nommu). To this end, original code before the IBM
   patch can be used, so I restored that.

I would like my patch to be passed over to IBM for additional review
and for testing (I don't have ready access to a Calgary-based system).

This is quite urgent if we want to make 5.3, although I'm not
sure that we can. Since the patch failed the review, it cannot go into
the release, and the deadline for submissions has passed for 5.3.

Comment 12 IBM Bug Proxy 2008-09-12 18:33:49 UTC
Can we get this kernel tested on 3950M2 hardware at IBM?
http://people.redhat.com/zaitcev/ftp/453680/

thanks!

Comment 14 RHEL Program Management 2008-09-12 21:32:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 IBM Bug Proxy 2008-09-16 00:30:49 UTC
I tested the last patch in the list above, "Fixed mapping_error, sync_for_foo,
alloc_consistent", applied against the 2.6.18-92 source, with iommu=calgary.  In
that combination the USB devices are working.

I also applied it against 2.6.18-111 but was unable to test it because
the -111 kernel panics with iommu=calgary with an error in the build_tce_table
function.  There is a separate, recently submitted bug for that problem.

Comment 16 Pete Zaitcev 2008-09-16 00:35:32 UTC
What about the binary kernels that I built, were they not booting on
Calgary? Have I built a wrong architecture or something?

Comment 18 IBM Bug Proxy 2008-09-16 17:10:57 UTC
Sorry, I wasn't given very specific information on what needed to be done for
the calgary problems and I was just cc'ed on this bug yesterday.

I booted the kernel that Ed posted a link to, 2.6.18-106.el5.bz453680.6, with the
kernel param iommu=calgary.  It boots without the build_tce_table bug seen in
other kernels and the USB devices work-- keyboard, mouse and read/write from USB
floppy.

I am seeing this error reported over and over on the console but it may be a
local hardware problem...

usb 4-1: device not accepting address 120, error -71
usb 4-1: device not accepting address 121, error -71
usb 4-1: device descriptor read/64, error -71
usb 4-1: device descriptor read/64, error -71
usb 4-1: device descriptor read/64, error -71
usb 4-1: device descriptor read/64, error -71

I'll attach a console log of the boot messages.

Comment 19 IBM Bug Proxy 2008-09-16 17:11:02 UTC
Created attachment 316864 [details]
Boot messages from x3950M2

Comment 20 Pete Zaitcev 2008-09-16 19:43:33 UTC
The -71 is usually a hardware problem (missing token, bad CRC etc.),
BUT we may precipitate it in software.

Do other Linux kernels produce the same error -71 on the unit in
question (the bus # may be different from #4)? If not, I broke
something.

Also, if keyboard and floppy work, what device is not seen by the OS?
Maybe some kind of front panel display?

Finally, remote bug mirroring between IBM and Red Hat strips the
identity of the updater. Would someone who has access and tests the
patches contact me at zaitcev, please?

Comment 21 IBM Bug Proxy 2008-09-17 23:00:52 UTC
I was premature in declaring that the 2.6.18-106.el5.bz453680.6 kernel works on
x3950.  It does resolve the problem if I boot a single node.  If I boot the same
kernel on a 4-node configuration it panics during the boot in the calgary
initialization.

Ed Pollard suggested that it may be related to the amount of memory.  So I tried
setting the memory for both single node and multinode boots to mem=32768mb.  The
single node boots successfully but the same params on the 4 node panics.

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at arch/x86_64/kernel/pci-calgary.c:1211
invalid opcode: 0000 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-106.el5.bz453680.6 #1
RIP: 0010:[<ffffffff803f9711>]  [<ffffffff803f9711>] calgary_iommu_init+0x262/0x79c
RSP: 0000:ffff81011c175e90  EFLAGS: 00010282
RAX: 0000000000000f18 RBX: ffff8107fd9edd70 RCX: 00000000ffffffff
RDX: ffff8107fe3e80a1 RSI: ffffffff80145bfe RDI: ffff8107fe3e817c
RBP: ffff8107fd830800 R08: ffff8107fd830800 R09: ffff8107fe3e8000
R10: 00000000ffffffff R11: ffffffff8015df0f R12: ffffc2001088a160
R13: ffff8107fd9edd40 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff803af000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81011c174000, task ffff81011c16f7a0)
Stack:  0000000000000000 ffffffff80419388 0000000000000000 0000000000000000
0000000000000000 ffffffff803f3bd2 0000000000000000 ffffffff803eba56
0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[<ffffffff803f3bd2>] pci_iommu_init+0x9/0x17
[<ffffffff803eba56>] init+0x1f9/0x2f7
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8016ddf7>] acpi_ds_init_one_object+0x0/0x80
[<ffffffff803eb85d>] init+0x0/0x2f7
[<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 0f 0b 68 aa 5d 29 80 c2 bb 04 0f b6 c2 48 89 ef 48 6b c0 18
RIP  [<ffffffff803f9711>] calgary_iommu_init+0x262/0x79c
RSP <ffff81011c175e90>
<0>Kernel panic - not syncing: Fatal exception

Comment 22 Pete Zaitcev 2008-09-18 02:06:54 UTC
The bus number is 0xa1 (it's in %dl, and also multiplied by 0x18 in %rax
== 0xf18). We build with MAX_PHB_BUS_NUM == 0x80, therefore BUG_ON trips.
I don't think it's something either the patch for bug 453680 introduced,
but all the same someone with access to a Calgary-based system ought
to investigate the source of the 0xa1.

BTW, upstream has a bug (and we inherit it). We use dev->bus->number
in calgary_init to test if translation is enabled, and only then
we hit BUG_ON. It is an access beyond the end of bus_info[].

Comment 23 IBM Bug Proxy 2008-09-19 05:21:00 UTC
(In reply to comment #26)
> ------- Comment From zaitcev 2008-09-17 22:06:54 EDT-------
> The bus number is 0xa1 (it's in %dl, and also multiplied by 0x18 in %rax
> == 0xf18). We build with MAX_PHB_BUS_NUM == 0x80, therefore BUG_ON trips.
> I don't think it's something either the patch for bug 453680 introduced,
> but all the same someone with access to a Calgary-based system ought
> to investigate the source of the 0xa1.
>

This is a bogus BUG_ON() in the driver and needs to be addressed. The PCIe
busses are intentionally sparsely allocated on a multi-node x3950 M2. Hence, it
is possible (expected) that the total number of busses will be greater than 128.

Comment 27 Don Zickus 2008-11-04 16:50:20 UTC
in kernel-2.6.18-122.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 29 Chris Ward 2008-11-18 18:13:25 UTC
~~ Snapshot 3 is now available ~~ 

Snapshot 3 is now available for Partner Testing, which should contain a fix that resolves this bug. ISOs are available as usual at ftp://partners.redhat.com. Your testing feedback is vital! Please let us know if you encounter any NEW issues (file a new bug) or if you have VERIFIED the fix is present and functioning as expected (add PartnerVerified Keyword).

Ping your Partner Manager with any additional questions. Thanks!

Comment 30 IBM Bug Proxy 2008-11-19 00:31:15 UTC
Tested with x86_64 on 2.6.18-122.el5. Behavior hasn't changed from previous
tests. System boots and works without "iommu=calgary" but panics on boot in
calgary_iommu_init() with iommu=calgary.

Comment 31 Chris Ward 2008-11-19 10:43:35 UTC
IBM, if I read comment #15 correctly, a separate bug has been filed for the issue encountered in comment #30 (the panic). Or is that issue part of this original bug request? Could you point me to that bug number?

Moving to ASSIGNED since it appears this bug FAILS_QA.

Comment 32 Pete Zaitcev 2008-11-20 19:57:22 UTC
Chris, it's impossible to tell what panic in comment #30 refers to.
It may be the sparse thing, or may be something else. I need to see
the console output before I can tell.

Comment 33 Chris Ward 2008-11-21 09:17:44 UTC
IBM, could you please provide more information about the issue you are encountering, as mentioned in comment #30? Thanks!

Comment 34 IBM Bug Proxy 2008-11-24 06:30:59 UTC
Doug, I assume the panic you refer to above is the issue we already know about, but without a full backtrace it's impossible to tell. Could you please capture the complete backtrace so we know what problem you're running into?

Comment 35 IBM Bug Proxy 2008-11-24 19:31:38 UTC
Created attachment 324530 [details]
Backtrace from panic (iommu=calgary) on .122 kernel

Comment 36 IBM Bug Proxy 2008-11-24 21:11:51 UTC
(In reply to comment #36)
> Created an attachment (id=41225) [details]
> Backtrace from panic (iommu=calgary) on .122 kernel
>

Thanks Doug.

Pete, this is not the sparse bus allocation assertion. This is an issue where the pci_dev sysdata pointer, used to store the iommu_table, is already in use (non-NULL) when the calgary init() routine attempts to use it. I believe there's a bug opened for this particular issue, but it may be only internal. I'll do some more investigation.

Comment 39 Pete Zaitcev 2008-11-28 04:54:22 UTC
Thanks to Matt Brodeur's efforts, I secured access to a Calgary-based box,
ibm-x3950m2-02.rhts, and reproduced the panic in tce.c due to nonzero sysdata.

I found that the nonzero sysdata is set by the PCI domain support code,
from patch linux-2.6-x86-pci-domain-support.patch.

One way to work around this might be to attach the domain number
to the pci_bus instead of pci_dev, under __GENKSYMS__. Then avoid using
sysdata for domains, return the NUMA node number to where it was, and that
will have Calgary magically working again.

The alternative, I guess, is to add #ifdef CONFIG_PCI_DOMAINS into pci-calgary.c
and a pointer member into struct pci_sysdata (hopefully nobody made it
into kABI).

N.B. The sysdata business has absolutely nothing to do with this bug:
the fallbacks. The IBM patch that I fixed up (comment #7) did not cause
this regression -- PCI domains did. I propose that we clone this bug to
track the sysdata/domains issue.

Comment 42 Chris Ward 2008-12-08 11:53:09 UTC
~~ Snapshot 5 is now available @ partners.redhat.com ~~ 

Partners, RHEL 5.3 Snapshot 5 is now available for testing. Please send us your testing feedback on this important bug fix / feature request AS SOON AS POSSIBLE. If you are unable to test, indicate this in a comment or escalate to your Partner Manager. If we do not receive your test feedback, this bug will be AT RISK of being dropped from the release.

If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla
Keywords field, along with a description of the test results. 

If you encounter a new bug, CLONE this bug and request from your Partner
manager to review. We are no longer accepting new bugs into the release, bar
critical regressions.

Comment 43 Chris Ward 2008-12-16 16:29:21 UTC
~~~ Attention Partners ~~~ The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible.  If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.

Comment 44 Chris Ward 2009-01-05 12:33:26 UTC
Could someone please confirm the current status of this fix?

Comment 47 IBM Bug Proxy 2009-01-08 20:01:17 UTC
From Red Hat BZ 474047:

+++++
I just installed and successfully booted the x86_64 RHEL5.3 RC1
2.6.18-128.el5 kernel with the iommu=calgary boot option on a
2-node IBM x3950 M2.

Thanks, Gary
+++++

We should have a single node configuration tested as well very soon.

Comment 48 IBM Bug Proxy 2009-01-08 22:51:23 UTC
No usb errors/issues were discovered on the system when booting single node. This bug can be closed as of Red Hat 5 Update 3 RC1.

Comment 49 errata-xmlrpc 2009-01-20 20:26:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

