Bug 628534

Summary: system reboots when AMD IOMMU is enabled
Product: Red Hat Enterprise Linux 5 Reporter: Stefan Assmann <sassmann>
Component: kernel Assignee: Kiran Thirumalai <kthiruma>
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5 CC: andreas.herrmann3, bgollahe, bnagendr, jfeeney, nagananda.chumbalkar, peterm, qcai, ravikiran.thirumalai, tcamuso
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
If AMD IOMMU is enabled in the BIOS on ProLiant DL165 G7 systems, the system will reboot automatically when the IOMMU attempts to initialize. To work around this issue, either disable IOMMU, or update the BIOS to version <filename>2010.09.06</filename> or later.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-17 19:04:55 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 656090    

Description Stefan Assmann 2010-08-30 10:26:53 UTC
Description of problem:
system: hp-dl165g7-01.rhts.eng.bos.redhat.com

After enabling AMD IOMMU in the BIOS, the system always resets during IOMMU initialization.

hpet0: at MMIO 0xfed00000 (virtual 0xffffffffff5fe000), IRQs 2, 8, 0, 0
hpet0: 4 32-bit timers, 14318180 Hz
ACPI: DMAR not present
GSI 16 sharing vector 0xA9 and IRQ 16
ACPI: PCI Interrupt 0000:00:00.2[A] -> GSI 55 (level, low) -> IRQ 169
AMD IOMMU: enabling GFX workaround for PCI device 02:00.0
AMD IOMMU: Enabling IOMMU at 00:00.2cap 0x40

After that line, the system resets. The only way to get past this point is to disable the AMD IOMMU again.

Comment 3 Andreas Herrmann 2010-10-22 14:49:11 UTC
AFAIK we did not see similar problems during our tests.

Does the reset still occur with a newer RHEL 5.x kernel,
say kernel-2.6.18-225.el5?

Comment 4 Stefan Assmann 2010-11-10 07:57:36 UTC
Yes, it still occurs with kernel-2.6.18-230.el5.

System is
ProLiant DL165 G7
HP System BIOS - O37  (07/30/2010)

Comment 6 RHEL Program Management 2010-11-10 08:19:30 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Stefan Assmann 2010-11-10 11:03:34 UTC
This is possibly a BIOS bug.
Testing with upstream 2.6.35 revealed the following:
AMD-Vi: Can not reserve memory region fec20000 for mmio
AMD-Vi: This is a BIOS bug. Please contact your hardware vendor
Trying to free nonexistent resource <00000000fec20000-00000000fec23fff>

However, upstream handles this failure gracefully, while RHEL5 gets stuck in an endless reboot cycle.
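
For anyone triaging similar machines, here is a minimal Python sketch that scans a captured kernel log for the upstream "Can not reserve memory region" message quoted above and reports the affected MMIO base address. The regular expression is modeled only on the lines in this comment; the function name and the sample list are hypothetical, and other kernel versions may word the message differently.

```python
import re

# Pattern modeled on the upstream 2.6.35 message quoted in this comment.
# Assumption: the address is printed as bare lowercase hex, as seen here.
RESERVE_FAIL_RE = re.compile(
    r"AMD-Vi: Can not reserve memory region ([0-9a-f]+) for mmio"
)


def find_mmio_reserve_failures(log_lines):
    """Return the MMIO base addresses (as ints) that could not be reserved."""
    addresses = []
    for line in log_lines:
        match = RESERVE_FAIL_RE.search(line)
        if match:
            addresses.append(int(match.group(1), 16))
    return addresses


# Sample taken from the log excerpt in this comment.
SAMPLE_LOG = [
    "AMD-Vi: Can not reserve memory region fec20000 for mmio",
    "AMD-Vi: This is a BIOS bug. Please contact your hardware vendor",
    "Trying to free nonexistent resource <00000000fec20000-00000000fec23fff>",
]

if __name__ == "__main__":
    for address in find_mmio_reserve_failures(SAMPLE_LOG):
        print("IOMMU MMIO region could not be reserved at 0x%x" % address)
```

Note that this only works on a kernel that survives the failure and logs the message; RHEL5 resets before anything comparable is printed.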

Comment 8 Stefan Assmann 2010-11-10 12:36:06 UTC
I think I found the upstream commit that fixed it.
e82752d8b5a7e0a5e4d607fd8713549e2a4e2741 
x86/amd-iommu: Fix crash when request_mem_region fails

Unfortunately this will require further changes to the AMD IOMMU code. I'll try to cook something up.

Comment 9 Stefan Assmann 2010-11-17 11:16:29 UTC
Box boots fine with the latest BIOS revision
HP System BIOS - O37  (09/06/2010)

Only some IO_PAGE_FAULT events are logged with a -194 kernel.
AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:13.2 domain=0x0000 address=0x00000000000e43c0 flags=0x0050]
AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000000e5080 flags=0x0070]
AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000000e5040 flags=0x0050]
AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000ffffffc0 flags=0x0050]
AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000ffffffc0 flags=0x0050]
[...]
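
To see which devices account for the faults, the event lines above can be tallied per device. The following is a rough Python sketch; the regular expression is an assumption derived only from the five lines quoted in this comment, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the event lines quoted in this comment.
# Assumption: other kernels may format these events differently.
EVENT_RE = re.compile(
    r"AMD IOMMU: Event logged \[(?P<type>\w+) "
    r"device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_events(lines):
    """Turn matching log lines into dicts; numeric fields become ints."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            event = match.groupdict()
            for key in ("domain", "address", "flags"):
                event[key] = int(event[key], 16)
            events.append(event)
    return events


def faults_per_device(events):
    """Count IO_PAGE_FAULT events per PCI device (bus:dev.fn)."""
    return Counter(e["device"] for e in events if e["type"] == "IO_PAGE_FAULT")


# The five event lines from this comment.
SAMPLE = [
    "AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:13.2 domain=0x0000 address=0x00000000000e43c0 flags=0x0050]",
    "AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000000e5080 flags=0x0070]",
    "AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000000e5040 flags=0x0050]",
    "AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000ffffffc0 flags=0x0050]",
    "AMD IOMMU: Event logged [IO_PAGE_FAULT device=00:12.0 domain=0x0000 address=0x00000000ffffffc0 flags=0x0050]",
]

if __name__ == "__main__":
    for device, count in sorted(faults_per_device(parse_events(SAMPLE)).items()):
        print(device, count)
```

On the quoted lines this attributes four faults to 00:12.0 and one to 00:13.2.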

Comment 11 Linda Wang 2010-12-07 13:20:04 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
RHEL5 on ProLiant DL165 G7
systems the IOMMU needs to be disabled or the BIOS updated to version 
2010.09.06 or later.

Comment 16 Ryan Lerch 2011-01-05 02:06:40 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,2 @@
-RHEL5 on ProLiant DL165 G7
+If AMD IOMMU is enabled in BIOS on ProLiant DL165 G7
-systems the IOMMU needs to be disabled or the BIOS updated to version 
+systems, the system will reboot automatically when IOMMU attempts to initialize. To work around this issue, either disable IOMMU, or update the BIOS to version <filename>2010.09.06</filename> or later.
-2010.09.06 or later.

Comment 17 Andreas Herrmann 2011-01-20 16:59:58 UTC
(For sake of completeness.)
In reply to comment #9 regarding the IO page faults, here is a comment
from Joerg Roedel:

  "This is no real issue. The io-page-faults come from devices which are
  used by the BIOS and are not handed over to the OS yet (typically USB
  controllers). From the time the IOMMU is initialized up to the point
  Linux loads the USB drivers such io-page-faults can happen.
  The BIOS can prevent that by defining unity-mapped ranges or
  exclusion-ranges. But the BIOSes I have seen don't do this."

Comment 18 Tony Camuso 2011-01-21 16:03:26 UTC
If I read this correctly, the problem is fixed with a later version of the BIOS. In that case, all that's needed is a CA from HP and a RH release note advising users to either update the BIOS or disable IOMMU in the BIOS.
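
A release note like this could be backed by a simple check of the BIOS release date against the first revision reported to boot in comment 9 (09/06/2010). Below is a minimal Python sketch under that assumption; the function names are hypothetical, and it assumes the date arrives as an MM/DD/YYYY string, for example as printed by `dmidecode -s bios-release-date`.

```python
from datetime import date, datetime

# First BIOS date reported in this bug to boot with IOMMU enabled (comment 9).
FIXED_BIOS_DATE = date(2010, 9, 6)


def parse_bios_date(text):
    """Parse an MM/DD/YYYY BIOS release date such as '07/30/2010'."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def iommu_safe(bios_date_text):
    """True if the BIOS is at least as new as the revision that booted."""
    return parse_bios_date(bios_date_text) >= FIXED_BIOS_DATE


if __name__ == "__main__":
    # The two BIOS dates mentioned in comments 4 and 9.
    for bios_date in ("07/30/2010", "09/06/2010"):
        if iommu_safe(bios_date):
            advice = "IOMMU may be left enabled"
        else:
            advice = "update the BIOS or disable IOMMU"
        print(bios_date, "->", advice)
```

The check applies only to the ProLiant DL165 G7 discussed here; BIOS dates from other platforms are not comparable.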

Comment 20 John Feeney 2011-06-22 18:45:22 UTC
I believe the note for this was added to the 5.6 Technical Notes. Do we have a CA from HP so we can close this now?

Comment 21 RHEL Program Management 2011-08-17 19:04:55 UTC
Product Management has reviewed and declined this request.  You may appeal this
decision by reopening this request.