Bug 456638 - [Kdump] not work on HP-XW8600
Summary: [Kdump] not work on HP-XW8600
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Prarit Bhargava
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 454974 454996 454998 458322
Depends On:
Blocks:
 
Reported: 2008-07-25 07:55 UTC by Qian Cai
Modified: 2018-10-20 00:39 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:18:40 UTC
Target Upstream Version:
Embargoed:


Attachments
dmidecode output (25.00 KB, text/plain)
2008-07-25 07:55 UTC, Qian Cai
full log of endless ACPI errors from the second Kernel (7.30 KB, text/plain)
2008-07-25 07:58 UTC, Qian Cai
full log of the second Kernel hung (8.72 KB, text/plain)
2008-07-25 07:59 UTC, Qian Cai
RHEL5 WAR for this issue (1.64 KB, patch)
2008-10-02 18:36 UTC, Prarit Bhargava
ACPI errors from "dmesg" on hp-xw6800-02 (30.17 KB, text/plain)
2008-11-12 08:23 UTC, Qian Cai
output of "dmidecode" on hp-xw6800-02 (20.55 KB, text/plain)
2008-11-12 08:47 UTC, Qian Cai


Links
System: Red Hat Product Errata
ID: RHSA-2009:0225
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update
Last Updated: 2009-01-20 16:06:24 UTC

Description Qian Cai 2008-07-25 07:55:54 UTC
Description of problem:
It looks like Kdump still does not work on this model, even with the latest BIOS.

BIOS Information
	Vendor: Hewlett-Packard
	Version: 786F5 v01.18
	Release Date: 05/05/2008

After triggering a crash, the second kernel printed endless ACPI error messages:


ACPI: Core revision 20060707
CPU0: Intel(R) Xeon(R) CPU            5140  @ 2.33GHz stepping 0b
Total of 1 processors activated (4676.69 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Brought up 1 CPUs
checking if image is initramfs... it is
Freeing initrd memory: 2833k freed
NET: Registered protocol family 16
No dock devices found.
ACPI: bus type pci registered
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
PCI: PCI BIOS revision 2.20 entry at 0xef47d, last bus=160
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI Error (evgpe-0711): No handler or method for GPE[ 0], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[ 1], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[ 2], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[ 5], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[ 6], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[ 7], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[ A], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[ F], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[10], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[11], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[12], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[13], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[14], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[15], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[16], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[17], disabling event [20060707]
ACPI Error (evgpe-0711): No handler or method for GPE[18], disabling event [20060707]

If I passed "noacpi", "acpi=off" to the Kdump kernel, the second kernel hung after these messages:

Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled


Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5
kexec-tools-1.102pre-21.el5

How reproducible:
always

Steps to Reproduce:
1. configure Kdump with 128M@16M
2. SysRq-C
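
For readers unfamiliar with the "128M@16M" notation in step 1: assuming it is the value given to the crashkernel= boot parameter (the report does not spell this out), it follows the kernel's size@offset form, meaning 128 MiB reserved at a physical offset of 16 MiB. The following is only a minimal sketch of how such a value decomposes; the function names are hypothetical and this is not code from kexec-tools or the kernel.

```python
import re

# Multipliers for the K/M/G suffixes (binary units).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def _to_bytes(text):
    """Convert a string such as '128M' into a byte count."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text.strip(), re.IGNORECASE)
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2).upper()]


def parse_crashkernel(value):
    """Split a 'size@offset' value into (size_bytes, offset_bytes).

    The offset is None when no '@offset' part is present.
    Hypothetical helper for illustration only.
    """
    size_text, _, offset_text = value.partition("@")
    size = _to_bytes(size_text)
    offset = _to_bytes(offset_text) if offset_text else None
    return size, offset


size, offset = parse_crashkernel("128M@16M")
print(f"reserve {size} bytes at physical offset {offset}")
```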

Comment 1 Qian Cai 2008-07-25 07:55:54 UTC
Created attachment 312629 [details]
dmidecode output

Comment 2 Qian Cai 2008-07-25 07:58:54 UTC
Created attachment 312630 [details]
full log of endless ACPI errors from the second Kernel

Comment 3 Qian Cai 2008-07-25 07:59:50 UTC
Created attachment 312631 [details]
full log of the second Kernel hung

Comment 4 Vivek Goyal 2008-09-30 20:50:56 UTC
This might be a BIOS issue. Some details of the investigation so far:

- These systems have a GPE block, and the BIOS reports two addresses for it. One seems to be a 32-bit address reachable from the RSDT, and the other seems to be a 64-bit address reachable from the XSDT.

The 32-bit version reports 0000F820 and the 64-bit version reports 000000000001F028. After adding some printk() calls, I found that we default to using the 64-bit version in both the first kernel and the second kernel. On a 32-bit machine this address is beyond the I/O port range of 0xffff.

I believe that could be why the events are not effectively disabled when we try to disable them in the second kernel: the port address is beyond reach. And for some reason all the GPEs have fired in the second kernel.

So the questions for HP are:

- Why are two different addresses reported? Is it a bug, or does it signify something?
- The second address is beyond the I/O port range, so we can't disable events in the second kernel even when we receive a flood of them. Is there a way to avoid this issue?

I think forcing the use of the RSDT will make the kernel use address F820, and that might fix the issue.

Thanks
Vivek
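
The address-range argument in this comment can be illustrated with a minimal sketch. It is not the kernel's ACPI code and not the WAR patch attached to this bug; the function names are hypothetical, and the fall-back policy (prefer the 64-bit value only when it is a reachable I/O port) is an assumption based on the suggestion above that using F820 might fix the issue.

```python
# x86 I/O port space is 16 bits wide, so valid ports are 0x0000..0xFFFF.
IO_PORT_MAX = 0xFFFF


def is_valid_io_port(address):
    """Return True if the address fits in the 16-bit I/O port space."""
    return 0 <= address <= IO_PORT_MAX


def pick_gpe0_address(legacy_addr, extended_addr):
    """Choose which reported GPE0 block address to use.

    Assumed policy for illustration: use the 64-bit (extended) value
    when it is non-zero and reachable, otherwise fall back to the
    32-bit (legacy) value.
    """
    if extended_addr and is_valid_io_port(extended_addr):
        return extended_addr
    return legacy_addr


# Values quoted in this comment.
legacy = 0xF820
extended = 0x1F028

print(f"64-bit address reachable: {is_valid_io_port(extended)}")
print(f"address chosen: {pick_gpe0_address(legacy, extended):#x}")
```

With the two values quoted above, the 64-bit address fails the range check, so the sketch falls back to the 32-bit one.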

Comment 5 Jeff Burrell 2008-10-01 15:02:21 UTC
Vivek,

I'll get some answers from the BIOS engineers responsible for the xw8600 and get back to you.  In the meantime, the output from dmidecode indicates you're using a fairly old version of the system BIOS (1.18).  The latest version (1.29) is available in the HP workstations IssueTracker #118848.  On the very small chance that something changed with regard to the GPE block, you should probably retry after flashing the system to 1.29.

Hopefully I'll have some response shortly to share with you.

Jeff

Comment 6 Prarit Bhargava 2008-10-02 18:30:37 UTC
*** Bug 458322 has been marked as a duplicate of this bug. ***

Comment 7 Prarit Bhargava 2008-10-02 18:33:16 UTC
The problem appears to be that the address of the GPE0 register bank is reported incorrectly by the HP xw BIOS.

From private email:

I can *prove* that there are two GPE0 addresses being reported.

I downloaded the acpica-unix package from Intel, and the pmtools package from Fedora.

I installed them both and did the following:

acpidump (to see if acpidump was working)

acpidump -b -t FACP -o FADT.aml (dumps the first FACP table)

iasl -d FADT.aml

This creates a file, FADT.dsl, which is a human-readable dump of the FADT table.

From this file we see:

[050h 080 4] GPE0 Block Address : 0000F828

The address for GPE0 registers is 0xf828 ... and ...

[0DCh 220 12] GPE0 Block : <Generic Address Structure>
[0DCh 220 1] Space ID : 01 (SystemIO)
[0DDh 221 1] Bit Width : 20
[0DEh 222 1] Bit Offset : 00
[0DFh 223 1] Access Width : 00
[0E0h 224 8] Address : 000000000001F030

This other structure reports that the address is 0x1f030.

Obviously ;) , 0x1f030 != 0xf828.
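
The manual comparison above could be scripted roughly as follows. This is only a sketch: it assumes the decompiled FADT.dsl uses exactly the "[offset] Field : value" layout quoted in this comment (other iasl versions may format fields differently), and the function name is hypothetical. The sample text is copied from the excerpt above rather than read from a real file.

```python
import re

# Lines quoted in this comment from the decompiled FADT.
SAMPLE = """
[050h 080 4] GPE0 Block Address : 0000F828
[0DCh 220 12] GPE0 Block : <Generic Address Structure>
[0DCh 220 1] Space ID : 01 (SystemIO)
[0DDh 221 1] Bit Width : 20
[0DEh 222 1] Bit Offset : 00
[0DFh 223 1] Access Width : 00
[0E0h 224 8] Address : 000000000001F030
"""

# "[offset info] Field name : value"
FIELD = re.compile(r"^\[[^\]]*\]\s*(?P<name>[^:]+?)\s*:\s*(?P<value>.*)$")


def parse_gpe0_addresses(dsl_text):
    """Return (legacy_address, extended_address) for the GPE0 block.

    Either element is None if the corresponding field is not found.
    """
    legacy = None
    extended = None
    in_gpe0_gas = False  # True while inside the GPE0 Generic Address Structure
    for line in dsl_text.splitlines():
        m = FIELD.match(line.strip())
        if m is None:
            continue
        name = m.group("name")
        value = m.group("value").strip()
        if name == "GPE0 Block Address":
            legacy = int(value, 16)
        elif name == "GPE0 Block":
            in_gpe0_gas = True
        elif name == "Address" and in_gpe0_gas:
            extended = int(value, 16)
            in_gpe0_gas = False
    return legacy, extended


legacy, extended = parse_gpe0_addresses(SAMPLE)
print(f"legacy={legacy:#x} extended={extended:#x} match={legacy == extended}")
```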

Comment 9 Jeff Burrell 2008-10-02 19:04:21 UTC
Our BIOS guys took a look yesterday and agree that what Prarit/Vivek have found is definitely a bug.  Interestingly, this bug is in a common part of the BIOS that HP workstations share with the business desktops, and it has been in the tree for years, apparently without much consequence (until now :-), of course).

Since we're heavily into our Nehalem/Tylersburg development, it will take me a little while to get a test BIOS for either the xw6600 or xw8600.  I could, however, give you a test BIOS for the xw6800 (Tylersburg) prototypes you have if you want to test the fix.  Let me know if you want to try that.

In the meantime, I'll see how quickly I can get something for the xw8600.

Comment 10 David Aquilina 2008-10-02 19:08:47 UTC
For any engineers who wish to take Jeff up on his offer of a test BIOS for the xw6800, please contact me for access to one of the prototypes.

Comment 11 Prarit Bhargava 2008-10-02 19:35:55 UTC
Hi Jeff,

Could we get a test BIOS for the xw6800?

I'll try to get you a list of problematic xw platforms,

P.

Comment 12 Jeff Burrell 2008-10-02 20:40:23 UTC
Prarit,

I thought I might be able to give you what I tested with here, which eliminates the GPE errors we've been seeing.  Unfortunately that test BIOS is the current top-of-tree that has a bunch of other things that would destabilize your prototypes.  The BIOS guys are planning a formal release late this week after which we can add this change back in for you to test with.  I am expecting I should have a version I can give you sometime next week.  Sorry for the delay...

Jeff

Comment 13 Prarit Bhargava 2008-10-15 13:55:51 UTC
*** Bug 454998 has been marked as a duplicate of this bug. ***

Comment 14 Prarit Bhargava 2008-10-15 14:25:53 UTC
*** Bug 454996 has been marked as a duplicate of this bug. ***

Comment 15 Prarit Bhargava 2008-10-15 14:33:45 UTC
*** Bug 454974 has been marked as a duplicate of this bug. ***

Comment 16 Charlie Wyse 2008-11-06 01:53:20 UTC
Is there a BIOS patch for xw9300 or xw9400s?  I am seeing a kdump failure on those two types of hardware.  It can be fixed by using the noapic option, but I would be willing to test a new BIOS revision if possible.

Comment 17 Jeff Burrell 2008-11-06 15:16:31 UTC
No, there is currently no BIOS update available that fixes the bug identified by Prarit/Vivek for any of the shipping HP workstations.  The only BIOS patched so far is for the prototype systems xw4800/xw6800/xw8800.  The exact same problem exists in all HP workstations (and HP business desktops as well) because they share the root BIOS tree in which this defect exists.  The change will propagate, but slowly, as all of those shipping platforms are nearing the end of their life.

Let me see if I can manage to get someone to make a special test version of the xw9400 BIOS for testing...

Jeff

Comment 18 Prarit Bhargava 2008-11-06 19:42:33 UTC
Everyone,

Jeff Burke (jburke) recently reported seeing this behavior on an xw8600 in RHTS.

I'm therefore proposing a workaround (WAR) for this issue.  I will attach the WAR patch shortly and will submit it to RHKL ASAP.

P.

Comment 22 Qian Cai 2008-11-12 08:03:48 UTC
It looks like the -123.el5 kernel includes the fix. However, I still see those ACPI errors on the IA-32 machine hp-xw6800-02.rhts.bos.redhat.com with both the bare-metal and Kdump kernels. Please see the attachments.

Comment 23 Qian Cai 2008-11-12 08:23:40 UTC
Created attachment 323304 [details]
ACPI errors from "dmesg" on hp-xw6800-02

Comment 24 Qian Cai 2008-11-12 08:47:53 UTC
Created attachment 323307 [details]
output of "dmidecode" on hp-xw6800-02

Comment 26 Don Zickus 2008-11-12 16:37:08 UTC
Included in kernel-2.6.18-123.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 29 Qian Cai 2008-11-13 02:29:57 UTC
New bug created for hp-xw6800-02.rhts.bos.redhat.com.
Bug 471341 -  [5.3] ACPI Error (evgpe-0711) on HP xw6800

Comment 30 Qian Cai 2008-11-13 10:49:00 UTC
Test results so far for today, 13 Nov. 2008:

2.6.18-123 IA-32 Kernel:

[machine]      [bare-metal Kdump]   [Xen Domain 0 Kernel]
hp-xw6800-02   Fail                 ??
hp-xw8800-01   ??                   Fail[1]

[1] http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5127343

Comment 34 errata-xmlrpc 2009-01-20 20:18:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 35 Jeff Burrell 2009-01-28 00:16:00 UTC
Can someone post the WAR patch that was added to RHEL 5.3?  I have a customer who may need to apply something similar to FC8 prior to the release of a BIOS fix.  This patch should be a reasonably good place to begin.  Thanks!

Jeff

Comment 36 Prarit Bhargava 2009-01-28 11:51:02 UTC
Comment on attachment 319276 [details]
RHEL5 WAR for this issue

Jeff, here is the patch.  We later made a change to this to catch *all* "HP xw" systems because of the large number of models that were failing.

