710705 – Kdump fails due to ACPI Error

Summary: Kdump fails due to ACPI Error

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	low
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	first=2.6.38.7 fixed=f17 hwmon acpi
Depends On:
Blocks:

Reported:	2011-06-04 06:56 UTC by Takahisa Tanaka
Modified:	2012-06-18 22:01 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-06-18 22:01:21 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
serial console log (23.33 KB, text/plain) 2011-06-04 06:59 UTC, Takahisa Tanaka	no flags	Details
serial console log (68.29 KB, text/plain) 2011-06-25 08:30 UTC, Takahisa Tanaka	no flags	Details

Description Takahisa Tanaka 2011-06-04 06:56:47 UTC

Description of problem:
When I tried to kdump on an ASUSTeK M4A89GTD-PRO/USB3, the system hung.
I captured a serial console log. Many ACPI errors occurred immediately
after the asus_atk0110 module was loaded, and then the second kernel hung
because it ran out of memory. For the complete serial console log, please
see the attached file.

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger
... snip ...
Loading asus_atk0110.ko module
[ 1.275734] ACPI Error: No handler or method for GPE1F, disabling event
... repeated ACPI error messages ...
[ 15.347343] ACPI Error: No handler or method for GPE13, disabling event
[ 16.171539] Out of memory: Kill process 220 (insmod) score 1 or sacrifice child
[ 16.179240] Killed process 220 (insmod) total-vm:1804kB, anon-rss:488kB, file-rss:148kB
[ 17.492928] ACPI Error: No handler or method for GPE14, disabling event
[ 17.901195] ACPI Exception: AE_NO_MEMORY, Unable to queue handler for GPE B - event
... repeated ACPI error messages ...
... finally hang

Workaround:
This problem can be avoided by commenting out the 'insmod /lib/asus_atk0110.ko'
line in the init script and re-creating the kdump initrd. Unfortunately,
adding a 'blacklist asus_atk0110' line to /etc/kdump.conf has no effect.

Version-Release number of selected component (if applicable):
kernel-2.6.38.7-30.fc15.x86_64
kexec-tools-2.0.0-43.fc15.x86_64

How reproducible:
always

Steps to Reproduce:
1. Boot to Fedora15 on ASUSTeK M4A89GTD-PRO/USB3
2. Setup kdump
3. Sysrq-c

Actual results:
The second kernel runs the OOM killer, then hangs

Expected results:
Kdump succeeds in dumping the vmcore

Additional info:

Comment 1 Takahisa Tanaka 2011-06-04 06:59:40 UTC

Created attachment 502956
serial console log

Comment 2 Takahisa Tanaka 2011-06-12 09:28:00 UTC

I tried kdump with the "crashkernel=256M" kernel command line option,
but the result was the same as with "crashkernel=128M".

Comment 3 masanari iida 2011-06-15 17:29:46 UTC

I am using an ASUS P5Q with Fedora 15 (x86_64).
I found that my system also uses asus_atk0110.ko.
In my case, the vmcore file was generated as expected.

kernel-2.6.38.7-30.fc15.x86_64
kexec-tools-2.0.0-41.fc15.x86_64

# lsmod |grep asus 
asus_atk0110           12407  0 

# ls -l /var/crash/2011-06-15-17\:04/vmcore 
-r--------. 1 root root 8336112784 Jun 16 02:06 /var/crash/2011-06-15-17:04/vmcore

kernel parameter
crashkernel=256M@128M

# cat /proc/iomem 
00000000-0000ffff : reserved
00010000-0009cbff : System RAM
0009cc00-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : pnp 00:0f
000d0000-000dffff : PCI Bus 0000:00
000e4000-000fffff : reserved
00100000-cff6ffff : System RAM
  01000000-0147deed : Kernel code
  0147deee-01b412ff : Kernel data
  01c34000-01daffff : Kernel bss
  08000000-17ffffff : Crash kernel
cff70000-cff7dfff : ACPI Tables
cff7e000-cffcffff : ACPI Non-volatile Storage
cffd0000-cfffffff : reserved
d0000000-dfffffff : PCI Bus 0000:00
  d0000000-dfffffff : PCI Bus 0000:01
    d0000000-dfffffff : 0000:01:00.0
e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
  e0000000-efffffff : pnp 00:0e
f0000000-ffffffff : PCI Bus 0000:00
  f0000000-f03fffff : PCI Bus 0000:04
  f0400000-f05fffff : PCI Bus 0000:03
  f0600000-f07fffff : PCI Bus 0000:02
  f8f00000-f8ffffff : PCI Bus 0000:04
  f9ff8000-f9ffbfff : 0000:00:1b.0
    f9ff8000-f9ffbfff : ICH HD audio
  f9ffe800-f9ffefff : 0000:00:1f.2
    f9ffe800-f9ffefff : ahci
  f9fff400-f9fff4ff : 0000:00:1f.3
  f9fff800-f9fffbff : 0000:00:1d.7
    f9fff800-f9fffbff : ehci_hcd
  f9fffc00-f9ffffff : 0000:00:1a.7
    f9fffc00-f9ffffff : ehci_hcd
  fa000000-fe8fffff : PCI Bus 0000:01
    fa000000-fbffffff : 0000:01:00.0
    fd000000-fdffffff : 0000:01:00.0
    fe880000-fe8fffff : 0000:01:00.0
  fe900000-fe9fffff : PCI Bus 0000:02
    fe9c0000-fe9fffff : 0000:02:00.0
      fe9c0000-fe9fffff : ATL1E
  fea00000-feafffff : PCI Bus 0000:03
    feaffc00-feafffff : 0000:03:00.0
  feb00000-febfffff : PCI Bus 0000:05
    febe0000-febeffff : 0000:05:00.0
    febfe000-febfefff : 0000:05:03.0
      febfe000-febfefff : firewire_ohci
    febffc00-febffcff : 0000:05:00.0
      febffc00-febffcff : r8169
  fec00000-fec003ff : IOAPIC 0
  fed00000-fed003ff : HPET 0
  fed08000-fed08fff : pnp 00:07
  fed14000-fed19fff : pnp 00:01
  fed1c000-fed1ffff : pnp 00:07
  fed20000-fed3ffff : pnp 00:07
  fed50000-fed8ffff : pnp 00:07
  fee00000-fee00fff : Local APIC
    fee00000-fee00fff : reserved
      fee00000-fee00fff : pnp 00:0c
  ffc00000-ffefffff : pnp 00:0a
  fff00000-ffffffff : reserved
100000000-22fffffff : System RAM

Comment 4 Takahisa Tanaka 2011-06-19 06:57:10 UTC

(In reply to comment #3)

Thank you for your information.
I have updated my motherboard to the latest BIOS, but kdump still fails.
So, I am investigating the cause of the ACPI error.

Comment 5 Takahisa Tanaka 2011-06-25 08:30:45 UTC

Created attachment 509882
serial console log

Comment 6 Takahisa Tanaka 2011-06-25 08:33:04 UTC

When I tested on kernel-2.6.38.8-32, the symptoms changed. asus_atk0110 is not necessarily the only cause; it seems that several modules trigger this problem. I attached a serial console log.

> This problem can be avoided by comment out the 'insmod /lib/asus_atk0110.ko'
> line in init script, and re-create the kdump initrd.

So, commenting out 'insmod /lib/asus_atk0110.ko' is no longer a workaround.
The only workaround is to add 'acpi=off' to the KDUMP_COMMANDLINE_APPEND line in /etc/sysconfig/kdump. Therefore, I changed the title.

Comment 7 Dave Jones 2012-04-11 16:55:04 UTC

Does the 2.6.43 kernel help at all ?

Note that kdump was very neglected in Fedora until recently. F17 should be much better.

Comment 8 Takahisa Tanaka 2012-04-15 09:19:52 UTC

Hi Dave,
Thank you for your reply.

My test results are as follows. This problem seems to have been introduced in kernel 2.6.38.

  Fedora13 with 2.6.33.3-85.fc13.x86_64: kdump success

  Fedora14 with 2.6.35.6-45.fc14.x86_64: kdump success
  Fedora14 with 2.6.35.14-106.fc14.x86_64: kdump success
  Fedora14 with 2.6.36.4(linux-2.6.36.4.tar.gz): kdump success
  Fedora14 with 2.6.37.6.x86_64(linux-2.6.37.6.tar.gz): kdump success
  Fedora14 with 2.6.38.x86_64(linux-2.6.38.tar.gz): kdump failed(ACPI error)
  Fedora14 with 2.6.38.8.x86_64(linux-2.6.38.8.tar.gz): kdump failed(ACPI error)
  Fedora15 with 2.6.38.6-26.rc1.fc15.x86_64: kdump failed(ACPI error)
  Fedora15 with 2.6.38.7-30.fc15.x86_64: kdump failed(ACPI error)
  Fedora15 with 2.6.38.8-32.fc15.x86_64: kdump failed(ACPI error)
  Fedora15 with 2.6.42.12-1.fc15.x86_64: kdump failed(ACPI error)
  Fedora15 with 2.6.43.1-5.fc15.x86_64: kdump failed(ACPI error)

  Fedora16 with 3.1.0-7.fc16.x86_64: kdump failed(ACPI error)
  Fedora16 with 3.3.1-3.fc16.x86_64: kdump failed(ACPI error)

I investigated the differences between 2.6.37.6 and 2.6.38. As a result, I suspect that the sp5100_tco driver causes the ACPI error. sp5100_tco is a hardware watchdog timer driver for the AMD/ATI SP5100 chipset family, it is loaded on the M4A89GTD-PRO/USB3, and it was introduced in kernel 2.6.38.

The blacklist option (blacklist sp5100_tco) in kdump.conf did not solve the problem, so I rebuilt the initrd so that it does not load the sp5100_tco driver. This prevents the ACPI error.

  # zcat /boot/initrd-kdump.img | cpio -id 
  # vi init
  ...
  ###echo "Loading sp5100_tco.ko module"   <--- comment out
  ###insmod /lib/sp5100_tco.ko             <--- comment out
  ...
  # find . | cpio -co | gzip -c > /boot/initrd-kdump.img
  # service kdump restart
  # sync
  # echo 1 > /proc/sys/kernel/sysrq
  # echo c > /proc/sysrq-trigger

The results of the tests using the above initrd are as follows. The ACPI error doesn't occur, and kdump succeeds.

  Fedora14 with 2.6.38.x86_64(linux-2.6.38.tar.gz): kdump success
  Fedora15 with 2.6.43.1-5.fc15.x86_64: kdump success
  Fedora16 with 3.3.1-3.fc16.x86_64: kdump success

Therefore, I think that the sp5100_tco driver causes the ACPI error, but I'm not sure why it does.


Regards,
Takahisa

Comment 9 Takahisa Tanaka 2012-06-10 02:51:34 UTC

I have found out why the sp5100_tco driver causes this problem, and confirmed that the problem is solved in Fedora 17 (kernel-3.4.0-1.fc17.x86_64 and kexec-tools-2.0.3-38.fc17.x86_64).


The M4A89GTD-PRO/USB3 has an SB850 chipset, but the sp5100_tco driver supports only the SP5100 and SB700 series chipsets. The sp5100_tco driver doesn't support the SB8x0 chipsets(*), because the offset addresses of the watchdog registers differ between the SP5100/SB700 and SB800 chipsets.
*In other words, the sp5100_tco driver doesn't know that the watchdog register offsets changed starting with the SB800 series chipsets.

The offset addresses for the SP5100 and SB700 chipsets are as follows, quoted from the AMD SB700/710/750 Register Reference Guide (page 164) and the AMD SP5100 Register Reference Guide (page 166).

  WatchDogTimerControl 69h
  WatchDogTimerBase0   6Ch
  WatchDogTimerBase1   6Dh
  WatchDogTimerBase2   6Eh
  WatchDogTimerBase3   6Fh

The offset addresses(*) for the SB800 chipsets are as follows, quoted from the AMD SB800-Series Southbridges Register Reference Guide (page 147).

  WatchDogTimerEn      48h
  WatchDogTimerConfig  4Ch
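
The difference between the two register layouts can be written down as a small C sketch. The enum chipset_family, the macro names, and the function watchdog_enable_offset() are hypothetical names chosen for illustration and are not taken from the sp5100_tco driver; only the offsets 0x69 and 0x48 come from the AMD guides quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration; these names are not taken from the driver. */
enum chipset_family {
    FAMILY_SP5100_SB700, /* SP5100 and SB700/710/750 */
    FAMILY_SB800         /* SB800 series, e.g. the SB850 */
};

/* PM register offsets as quoted from the AMD register reference guides. */
#define SB700_WATCHDOG_TIMER_CONTROL 0x69
#define SB800_WATCHDOG_TIMER_EN      0x48

/* Return the PM register offset that controls the watchdog on each family. */
static uint8_t watchdog_enable_offset(enum chipset_family family)
{
    assert(family == FAMILY_SP5100_SB700 || family == FAMILY_SB800);
    return family == FAMILY_SB800 ? SB800_WATCHDOG_TIMER_EN
                                  : SB700_WATCHDOG_TIMER_CONTROL;
}
```

A driver that only knows the first layout always uses 0x69, whichever chipset is actually present.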

When the sp5100_tco driver enables the watchdog timer, it writes the enable and timer resolution bits to WatchDogTimerControl (0x69). On SB800 series chipsets, this offset belongs to AcpiGpe0Blk(*), the register that holds the base address of the GPE0 block. As a result, the base address in AcpiGpe0Blk becomes wrong.
*This register is 16 bits wide (0x68 and 0x69).

  AcpiPmTmrBlk         64h
  P_CNTBlk             66h
  AcpiGpe0Blk          68h  <--- sp5100_tco driver writes a wrong value here!!!
  AcpiSmiCmdBlk        6Ah
  AcpiPm2CntBlk        6Eh

For your reference, the code which writes a wrong value to AcpiGpe0Blk is the following. 

  http://lxr.linux.no/#linux+v3.4/drivers/watchdog/sp5100_tco.c
  /* Enable Watchdog timer and set the resolution to 1 sec. */
  outb(SP5100_PM_WATCHDOG_CONTROL, SP5100_IO_PM_INDEX_REG);
  val = inb(SP5100_IO_PM_DATA_REG);
  val |= SP5100_PM_WATCHDOG_SECOND_RES;
  val &= ~SP5100_PM_WATCHDOG_DISABLE;
  outb(val, SP5100_IO_PM_DATA_REG);  <--- here
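
The effect of this read-modify-write on an SB800 chipset can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the PM register file is modeled as a plain byte array instead of the real index/data I/O ports, the bit values (1 << 0 for disable, 3 << 1 for one-second resolution) are assumed to match the driver's header, the 16-bit register is assumed to be stored low byte first, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values assumed to match the driver's header; treat them as illustrative. */
#define WATCHDOG_DISABLE    (1u << 0)
#define WATCHDOG_SECOND_RES (3u << 1)

#define PM_REG_WATCHDOG_CONTROL 0x69 /* SP5100/SB700 offset used by the driver */
#define PM_REG_ACPI_GPE0_BLK    0x68 /* SB800: 16-bit register at 0x68 and 0x69 */

/* Simulated PM register file, standing in for the index/data port pair. */
struct pm_regs {
    uint8_t reg[256];
};

/* Store a 16-bit value low byte first (assumed layout of a 16-bit PM register). */
static void pm_write16(struct pm_regs *pm, uint8_t index, uint16_t value)
{
    assert(index < 255);
    pm->reg[index] = (uint8_t)(value & 0xff);
    pm->reg[index + 1] = (uint8_t)(value >> 8);
}

/* Read back a 16-bit value stored low byte first. */
static uint16_t pm_read16(const struct pm_regs *pm, uint8_t index)
{
    assert(index < 255);
    return (uint16_t)(pm->reg[index] | (pm->reg[index + 1] << 8));
}

/* Same read-modify-write sequence as the quoted driver code. */
static void enable_watchdog_sp5100_style(struct pm_regs *pm)
{
    uint8_t val = pm->reg[PM_REG_WATCHDOG_CONTROL];
    val |= WATCHDOG_SECOND_RES;
    val &= (uint8_t)~WATCHDOG_DISABLE;
    pm->reg[PM_REG_WATCHDOG_CONTROL] = val;
}
```

With a made-up GPE0 base of 0x0820 stored via pm_write16(), calling enable_watchdog_sp5100_style() changes the value returned by pm_read16() to 0x0e20: the low byte is untouched, while the high byte at 0x69 picks up the resolution bits.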


I believe that the kernel's ACPICA (ACPI Component Architecture) gets confused because the base address in AcpiGpe0Blk is wrong, and the following ACPI error messages occur.

  [    5.576051] ACPI Error: No handler or method for GPE00, disabling event (20110112/evgpe-753)
  [    5.577007] ACPI Error: No handler or method for GPE01, disabling event (20110112/evgpe-753)
  ... repeat endlessly...


In Fedora 17, mkdumprd was changed to use dracut, and dracut doesn't include modules that are unrelated to booting in the initramfs. Therefore, the sp5100_tco driver isn't loaded while kdump runs, and this problem no longer occurs.


Thanks, 
Takahisa

Comment 10 Dave Jones 2012-06-18 22:01:21 UTC

excellent, thanks for testing!
