Bug 746755

Summary: *148000* 'GHES: Failed to read error status' kernel errors per day
Product: [Fedora] Fedora
Reporter: mis
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WORKSFORME
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 15
CC: gansalmon, itamar, jonathan, kernel-maint, linux-bugs, madhu.chinakonda, matt_domsch, redhat, sysoper
Target Milestone: ---
Target Release: ---
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-06-07 15:00:01 UTC
Attachments:
Good boot '/var/log/messages' file (flags: none)
Bad boot '/var/log/messages' file (flags: none)

Description mis 2011-10-17 17:25:50 UTC
Created attachment 528604 [details]
Good boot '/var/log/messages' file

Description of problem:

Kernel 2.6.40.6-0.fc15.i686.PAE generates a massive number of kernel messages of the form '[Firmware Warn]: GHES: Failed to read error status block address for hardware error source', whereas kernel 2.6.38.6-26.rc1.fc15.i686.PAE does not.

Version-Release number of selected component (if applicable):

kernel-PAE-2.6.40.6-0.fc15.i686

How reproducible:

Always

Steps to Reproduce:
1. Install a minimal FC15 system. (Note: kernel-PAE-2.6.38.6-26.rc1.fc15.i686 is installed)
2. Boot into system.
3. Use yum to upgrade system. (Kernel is upgraded to kernel-PAE-2.6.40.6-0.fc15.i686)
4. Reboot into new kernel.
5. # tail -f /var/log/messages
  
Actual results:

See massive numbers of '[Firmware Warn]: GHES: Failed to read error status block address for hardware error source' errors being logged.

Expected results:

No errors, as is the case when system is rebooted under kernel-PAE-2.6.38.6-26.rc1.fc15.i686.

Additional info:

After inspecting the difference between 'messages' files from the two kernel boots (kernel-PAE-2.6.38.6-26.rc1.fc15.i686 is 'good.txt', whereas kernel-PAE-2.6.40.6-0.fc15.i686 is 'bad.txt') I see some trace output in the newer kernel boot. Please see attached files.

Oct 17 12:38:53 www2 kernel: [    0.008587] ------------[ cut here ]------------
Oct 17 12:38:53 www2 kernel: [    0.008593] WARNING: at arch/x86/kernel/apic/apic.c:1237 setup_local_APIC+0xee/0x317()
Oct 17 12:38:53 www2 kernel: [    0.008595] Hardware name: PowerEdge R310
Oct 17 12:38:53 www2 kernel: [    0.008596] Modules linked in:
Oct 17 12:38:53 www2 kernel: [    0.008598] Pid: 1, comm: swapper Not tainted 2.6.40.6-0.fc15.i686.PAE #1
Oct 17 12:38:53 www2 kernel: [    0.008600] Call Trace:
Oct 17 12:38:53 www2 kernel: [    0.008604]  [<c07f548c>] ? printk+0x2d/0x2f
Oct 17 12:38:53 www2 kernel: [    0.008608]  [<c04436c5>] warn_slowpath_common+0x7c/0x91
Oct 17 12:38:53 www2 kernel: [    0.008610]  [<c07f05ad>] ? setup_local_APIC+0xee/0x317
Oct 17 12:38:53 www2 kernel: [    0.008612]  [<c07f05ad>] ? setup_local_APIC+0xee/0x317
Oct 17 12:38:53 www2 kernel: [    0.008614]  [<c04436fc>] warn_slowpath_null+0x22/0x24
Oct 17 12:38:53 www2 kernel: [    0.008617]  [<c07f05ad>] setup_local_APIC+0xee/0x317
Oct 17 12:38:53 www2 kernel: [    0.008619]  [<c07f548c>] ? printk+0x2d/0x2f
Oct 17 12:38:53 www2 kernel: [    0.008622]  [<c042244f>] ? bigsmp_setup_apic_routing+0x20/0x22
Oct 17 12:38:53 www2 kernel: [    0.008627]  [<c0aac5ee>] native_smp_prepare_cpus+0x22f/0x2d2
Oct 17 12:38:53 www2 kernel: [    0.008630]  [<c0a9e7eb>] kernel_init+0x5d/0x136
Oct 17 12:38:53 www2 kernel: [    0.008633]  [<c0a9e78e>] ? start_kernel+0x353/0x353
Oct 17 12:38:53 www2 kernel: [    0.008636]  [<c080303e>] kernel_thread_helper+0x6/0x10
Oct 17 12:38:53 www2 kernel: [    0.008641] ---[ end trace a7919e7f17c0a725 ]---


and...


Oct 17 12:38:53 www2 kernel: [    0.132101] ------------[ cut here ]------------
Oct 17 12:38:53 www2 kernel: [    0.132109] WARNING: at arch/x86/kernel/apic/apic.c:1237 setup_local_APIC+0xee/0x317()
Oct 17 12:38:53 www2 kernel: [    0.132110] Hardware name: PowerEdge R310
Oct 17 12:38:53 www2 kernel: [    0.132111] Modules linked in:
Oct 17 12:38:53 www2 kernel: [    0.132115] Pid: 0, comm: kworker/0:0 Tainted: G        W   2.6.40.6-0.fc15.i686.PAE #1
Oct 17 12:38:53 www2 kernel: [    0.132116] Call Trace:
Oct 17 12:38:53 www2 kernel: [    0.132121]  [<c07f548c>] ? printk+0x2d/0x2f
Oct 17 12:38:53 www2 kernel: [    0.132125]  [<c04436c5>] warn_slowpath_common+0x7c/0x91
Oct 17 12:38:53 www2 kernel: [    0.132128]  [<c07f05ad>] ? setup_local_APIC+0xee/0x317
Oct 17 12:38:53 www2 kernel: [    0.132130]  [<c07f05ad>] ? setup_local_APIC+0xee/0x317
Oct 17 12:38:53 www2 kernel: [    0.132131]  [<c04436fc>] warn_slowpath_null+0x22/0x24
Oct 17 12:38:53 www2 kernel: [    0.132133]  [<c07f05ad>] setup_local_APIC+0xee/0x317
Oct 17 12:38:53 www2 kernel: [    0.132135]  [<c07eb0ed>] ? fpu_init+0x77/0x95
Oct 17 12:38:53 www2 kernel: [    0.132137]  [<c07eccc3>] ? cpu_init+0x146/0x14e
Oct 17 12:38:53 www2 kernel: [    0.132139]  [<c07ef70a>] start_secondary+0x105/0x259
Oct 17 12:38:53 www2 kernel: [    0.132141] ---[ end trace a7919e7f17c0a726 ]---

Lastly, I also see some ACPI errors in the 'bad.txt' file that I don't see in the 'good.txt' file.

Oct 17 12:38:53 www2 kernel: [   36.395336] ACPI Error: No handler for Region [IPMI] (f242d540) [IPMI] (20110413/evregion-373)
Oct 17 12:38:53 www2 kernel: [   36.395341] ACPI Error: Region IPMI (ID=7) has no handler (20110413/exfldio-292)
Oct 17 12:38:53 www2 kernel: [   36.395345] ACPI Error: Method parse/execution failed [\_SB_.PMI0._GHL] (Node f2440768), AE_NOT_EXIST (20110413/psparse-536)
Oct 17 12:38:53 www2 kernel: [   36.395353] ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMC] (Node f2440708), AE_NOT_EXIST (20110413/psparse-536)
Oct 17 12:38:53 www2 kernel: [   36.395363] ACPI Exception: AE_NOT_EXIST, Evaluating _PMC (20110413/power_meter-773)
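
The comparison of the two 'messages' files described above can be approximated with a short script. This is a minimal sketch in Python, assuming every kernel line follows the syslog layout shown in the excerpts (date, host, "kernel:", then an uptime stamp in brackets). The names `normalize` and `only_in_bad` are illustrative and not part of any existing tool.

```python
import re

# Matches "Mon DD HH:MM:SS host kernel: [ uptime] " at the start of a line.
# Assumption: this is the layout used in the log excerpts above.
PREFIX = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+kernel:\s+\[\s*\d+\.\d+\]\s?"
)


def normalize(line):
    """Return the kernel message text without timestamps, or None
    if the line does not look like a kernel log line."""
    m = PREFIX.match(line)
    return line[m.end():].rstrip() if m else None


def only_in_bad(good_lines, bad_lines):
    """Return the distinct kernel messages that appear in bad_lines
    but not in good_lines, in order of first appearance."""
    good = {normalize(line) for line in good_lines}
    seen = set()
    result = []
    for line in bad_lines:
        msg = normalize(line)
        if msg is None or msg in good or msg in seen:
            continue
        seen.add(msg)
        result.append(msg)
    return result


# Hypothetical usage with the two attached files saved locally:
#   with open("good.txt") as g, open("bad.txt") as b:
#       for msg in only_in_bad(g, b):
#           print(msg)
```

Messages that embed addresses or other per-boot values will still show up as differences, so the output is a starting point for inspection rather than a precise diff.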

Comment 1 mis 2011-10-17 17:26:25 UTC
Created attachment 528605 [details]
Bad boot '/var/log/messages' file

Comment 2 Arul 2011-10-25 13:27:37 UTC
Have you found any solution to this problem?

I have the same issue on my Dell PowerEdge after updating to Ubuntu Oneiric (3.0-based PAE kernel).

https://bugs.launchpad.net/ubuntu/+bug/881164

Comment 3 mis 2011-10-25 14:26:39 UTC
Nothing...

My issue is with Dell PowerEdge R310 Servers.

Still waiting for someone to acknowledge the bug report.

Comment 4 mis 2011-11-11 21:03:13 UTC
Any update on this????

Comment 5 Dave Jones 2011-11-14 18:37:12 UTC
Reported upstream, and to Dell.

Comment 6 Dave Jones 2011-11-17 16:43:38 UTC
Does the 2.6.41 update improve the situation at all?
Two patches went in (9fb0bfe and b3b46d7) that might help.

Comment 7 mis 2011-11-21 16:23:40 UTC
Unfortunately no.

Comment 8 Paul Ogden 2011-11-23 22:53:59 UTC
Similar problems running 2.6.41.1-1.fc15.i686.PAE on a Supermicro 5015B-MT. Built this system just this week. Now /var/log/messages is filling by the minute with these messages:

Nov 23 14:52:03 newparis kernel: [  999.468008] [Firmware Warn]: GHES: Failed to read error status block address for hardware error source: 9.
Nov 23 14:52:03 newparis kernel: [  999.476010] [Firmware Warn]: GHES: Failed to read error status block address for hardware error source: 10.

Hoping there's a solution (or an acceptable workaround) soon.

More details (logs, dmidecode output, etc.) can be provided if that helps.
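
To see which hardware error sources produce the messages and how many each one logs, the lines can be tallied per source. This is a minimal sketch in Python, assuming the message text matches the excerpt above exactly. The function name `count_ghes` is illustrative.

```python
import re
from collections import Counter

# Assumption: message wording as it appears in the log excerpt above.
GHES_MSG = re.compile(
    r"GHES: Failed to read error status block address "
    r"for hardware error source: (\d+)"
)


def count_ghes(lines):
    """Count GHES 'Failed to read error status' messages,
    keyed by hardware error source number."""
    counts = Counter()
    for line in lines:
        m = GHES_MSG.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts


# Hypothetical usage against the system log:
#   with open("/var/log/messages") as f:
#       for source, n in sorted(count_ghes(f).items()):
#           print(source, n)
```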

Comment 9 mis 2011-11-24 14:03:22 UTC
We have rolled back to kernel-PAE-2.6.38.6-26.rc1.fc15, as this kernel doesn't generate these log entries.

Comment 10 Chuck Ebbert 2011-11-29 09:13:30 UTC
(In reply to comment #9)
> We have rolled back to kernel-PAE-2.6.38.6-26.rc1.fc15, as this kernel doesn't
> generate these log entries.

You should be able to work around this problem by adding "ghes.disable=1" to the kernel boot options.
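
After rebooting, it is worth confirming that the option actually reached the running kernel. This is a minimal sketch in Python, assuming a Linux system where the boot options are exposed in /proc/cmdline. The function name `has_boot_option` is illustrative.

```python
def has_boot_option(cmdline, option="ghes.disable=1"):
    """Return True if the exact option appears as a
    whitespace-separated token in the kernel command line."""
    return option in cmdline.split()


# Hypothetical usage on the affected machine:
#   with open("/proc/cmdline") as f:
#       print(has_boot_option(f.read()))
```

Matching whole tokens rather than substrings avoids a false positive on a value such as ghes.disable=10.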

Comment 11 Paul Ogden 2011-11-29 20:16:55 UTC
Both rolling back to 2.6.38.6 and running 2.6.41.1 with "ghes.disable=1" suffice as workarounds. It would be nice to see a real solution. I also understand this may require firmware update(s) from the vendor (Supermicro).

Thanks.

Comment 12 mis 2011-11-30 14:23:44 UTC
Using "ghes.disable=1" has proven to be a successful workaround for this issue.

Thanks.