Description of problem:
When triggering kdump on an ASUSTeK M4A89GTD-PRO/USB3, the second kernel hangs. I captured a serial console log. Many ACPI errors occur immediately after the asus_atk0110 module is loaded, and the second kernel then hangs because it runs out of memory. For the complete serial console log, please see the attached file.

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger
... snip ...
Loading asus_atk0110.ko module
[ 1.275734] ACPI Error: No handler or method for GPE1F, disabling event
... repeated ACPI error messages ...
[ 15.347343] ACPI Error: No handler or method for GPE13, disabling event
[ 16.171539] Out of memory: Kill process 220 (insmod) score 1 or sacrifice child
[ 16.179240] Killed process 220 (insmod) total-vm:1804kB, anon-rss:488kB, file-rss:148kB
[ 17.492928] ACPI Error: No handler or method for GPE14, disabling event
[ 17.901195] ACPI Exception: AE_NO_MEMORY, Unable to queue handler for GPE B - event
... repeated ACPI error messages ...
... finally hang

Workaround:
This problem can be avoided by commenting out the 'insmod /lib/asus_atk0110.ko' line in the init script and re-creating the kdump initrd. Unfortunately, adding a 'blacklist asus_atk0110' line to /etc/kdump.conf has no effect.

Version-Release number of selected component (if applicable):
kernel-2.6.38.7-30.fc15.x86_64
kexec-tools-2.0.0-43.fc15.x86_64

How reproducible:
always

Steps to Reproduce:
1. Boot Fedora 15 on ASUSTeK M4A89GTD-PRO/USB3
2. Set up kdump
3. Sysrq-c

Actual results:
The second kernel runs the OOM killer, then hangs.

Expected results:
kdump successfully dumps the vmcore.

Additional info:
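For anyone triaging a similar serial console log, the following is a minimal sketch (not part of kdump or kexec-tools) that counts which GPE numbers are reported. The function name `count_gpe_errors` is made up for illustration, and the pattern only covers the exact message format quoted in this report.

```python
import re
from collections import Counter

# Matches the message format seen in the serial console log of this report.
GPE_ERROR = re.compile(
    r"ACPI Error: No handler or method for GPE([0-9A-F]{2}), disabling event"
)


def count_gpe_errors(log_text):
    """Return a Counter mapping GPE number (int) to how often it was reported."""
    return Counter(int(m.group(1), 16) for m in GPE_ERROR.finditer(log_text))


# Lines copied from the log excerpt in the description.
sample = """\
[ 1.275734] ACPI Error: No handler or method for GPE1F, disabling event
[ 15.347343] ACPI Error: No handler or method for GPE13, disabling event
[ 16.171539] Out of memory: Kill process 220 (insmod) score 1 or sacrifice child
[ 17.492928] ACPI Error: No handler or method for GPE14, disabling event
"""

counts = count_gpe_errors(sample)
print(sorted(hex(n) for n in counts))
```

A wide spread of distinct GPE numbers, as in this log, hints that the GPE block itself is being misread rather than that a single event is misbehaving.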
Created attachment 502956: serial console log
I also tried kdump with the "crashkernel=256M" kernel command line option, but the result was the same as with "crashkernel=128M".
I am using an ASUS P5Q with Fedora 15 (x86_64). My system also uses asus_atk0110.ko. In my case, the vmcore file was generated as expected.

kernel-2.6.38.7-30.fc15.x86_64
kexec-tools-2.0.0-41.fc15.x86_64

# lsmod |grep asus
asus_atk0110 12407 0

# ls -l /var/crash/2011-06-15-17\:04/vmcore
-r--------. 1 root root 8336112784 Jun 16 02:06 /var/crash/2011-06-15-17:04/vmcore

kernel parameter
crashkernel=256M@128M

# cat /proc/iomem
00000000-0000ffff : reserved
00010000-0009cbff : System RAM
0009cc00-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : pnp 00:0f
000d0000-000dffff : PCI Bus 0000:00
000e4000-000fffff : reserved
00100000-cff6ffff : System RAM
01000000-0147deed : Kernel code
0147deee-01b412ff : Kernel data
01c34000-01daffff : Kernel bss
08000000-17ffffff : Crash kernel
cff70000-cff7dfff : ACPI Tables
cff7e000-cffcffff : ACPI Non-volatile Storage
cffd0000-cfffffff : reserved
d0000000-dfffffff : PCI Bus 0000:00
d0000000-dfffffff : PCI Bus 0000:01
d0000000-dfffffff : 0000:01:00.0
e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
e0000000-efffffff : pnp 00:0e
f0000000-ffffffff : PCI Bus 0000:00
f0000000-f03fffff : PCI Bus 0000:04
f0400000-f05fffff : PCI Bus 0000:03
f0600000-f07fffff : PCI Bus 0000:02
f8f00000-f8ffffff : PCI Bus 0000:04
f9ff8000-f9ffbfff : 0000:00:1b.0
f9ff8000-f9ffbfff : ICH HD audio
f9ffe800-f9ffefff : 0000:00:1f.2
f9ffe800-f9ffefff : ahci
f9fff400-f9fff4ff : 0000:00:1f.3
f9fff800-f9fffbff : 0000:00:1d.7
f9fff800-f9fffbff : ehci_hcd
f9fffc00-f9ffffff : 0000:00:1a.7
f9fffc00-f9ffffff : ehci_hcd
fa000000-fe8fffff : PCI Bus 0000:01
fa000000-fbffffff : 0000:01:00.0
fd000000-fdffffff : 0000:01:00.0
fe880000-fe8fffff : 0000:01:00.0
fe900000-fe9fffff : PCI Bus 0000:02
fe9c0000-fe9fffff : 0000:02:00.0
fe9c0000-fe9fffff : ATL1E
fea00000-feafffff : PCI Bus 0000:03
feaffc00-feafffff : 0000:03:00.0
feb00000-febfffff : PCI Bus 0000:05
febe0000-febeffff : 0000:05:00.0
febfe000-febfefff : 0000:05:03.0
febfe000-febfefff : firewire_ohci
febffc00-febffcff : 0000:05:00.0
febffc00-febffcff : r8169
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
fed08000-fed08fff : pnp 00:07
fed14000-fed19fff : pnp 00:01
fed1c000-fed1ffff : pnp 00:07
fed20000-fed3ffff : pnp 00:07
fed50000-fed8ffff : pnp 00:07
fee00000-fee00fff : Local APIC
fee00000-fee00fff : reserved
fee00000-fee00fff : pnp 00:0c
ffc00000-ffefffff : pnp 00:0a
fff00000-ffffffff : reserved
100000000-22fffffff : System RAM
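As a quick cross-check of the reservation, the "Crash kernel" entry in the /proc/iomem output above can be converted to a start offset and size. This is only a sketch; `region_size` is a hypothetical helper, not an existing tool.

```python
MIB = 1024 * 1024


def region_size(iomem_line):
    """Parse one /proc/iomem line and return (start, size_in_bytes, name)."""
    addr_range, name = iomem_line.split(" : ", 1)
    start_hex, end_hex = addr_range.strip().split("-")
    start, end = int(start_hex, 16), int(end_hex, 16)
    # The end address is inclusive, hence the +1.
    return start, end - start + 1, name.strip()


start, size, name = region_size("08000000-17ffffff : Crash kernel")
print(name, "starts at", start // MIB, "MiB, size", size // MIB, "MiB")
```

The region starts at 128 MiB and is 256 MiB long, which agrees with the crashkernel=256M@128M parameter.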
(In reply to comment #3)
Thank you for your information. I have updated my motherboard to the latest BIOS, but kdump still fails. I am now investigating the cause of the ACPI errors.
Created attachment 509882: serial console log
When I tested kernel-2.6.38.8-32, the symptoms changed. asus_atk0110 is not the only module that triggers this problem; other modules seem to trigger it as well. I have attached the serial console log.

> This problem can be avoided by comment out the 'insmod /lib/asus_atk0110.ko'
> line in init script, and re-create the kdump initrd.

So commenting out 'insmod /lib/asus_atk0110.ko' is no longer a workaround. The only workaround is to add 'acpi=off' to the KDUMP_COMMANDLINE_APPEND line in /etc/sysconfig/kdump. Therefore, I changed the title.
Does the 2.6.43 kernel help at all ? Note that kdump was very neglected in Fedora until recently. F17 should be much better.
Hi Dave,

Thank you for your reply. My test results are as follows. This problem seems to have been introduced in kernel 2.6.38.

Fedora13 with 2.6.33.3-85.fc13.x86_64: kdump success
Fedora14 with 2.6.35.6-45.fc14.x86_64: kdump success
Fedora14 with 2.6.35.14-106.fc14.x86_64: kdump success
Fedora14 with 2.6.36.4(linux-2.6.36.4.tar.gz): kdump success
Fedora14 with 2.6.37.6.x86_64(linux-2.6.37.6.tar.gz): kdump success
Fedora14 with 2.6.38.x86_64(linux-2.6.38.tar.gz): kdump failed(ACPI error)
Fedora14 with 2.6.38.8.x86_64(linux-2.6.38.8.tar.gz): kdump failed(ACPI error)
Fedora15 with 2.6.38.6-26.rc1.fc15.x86_64: kdump failed(ACPI error)
Fedora15 with 2.6.38.7-30.fc15.x86_64: kdump failed(ACPI error)
Fedora15 with 2.6.38.8-32.fc15.x86_64: kdump failed(ACPI error)
Fedora15 with 2.6.42.12-1.fc15.x86_64: kdump failed(ACPI error)
Fedora15 with 2.6.43.1-5.fc15.x86_64: kdump failed(ACPI error)
Fedora16 with 3.1.0-7.fc16.x86_64: kdump failed(ACPI error)
Fedora16 with 3.3.1-3.fc16.x86_64: kdump failed(ACPI error)

I investigated the differences between 2.6.37.6 and 2.6.38. As a result, I suspect that the sp5100_tco driver causes the ACPI errors. sp5100_tco is a hardware watchdog timer driver for the AMD/ATI SP5100 chipset that is used on the M4A89GTD-PRO/USB3, and it was introduced in kernel 2.6.38.

The blacklist option (blacklist sp5100_tco) in kdump.conf did not solve the problem, so I rebuilt the initrd so that it does not load the sp5100_tco driver. This prevents the ACPI errors.

# zcat /boot/initrd-kdump.img | cpio -id
# vi init
...
###echo "Loading sp5100_tco.ko module" <--- comment out
###insmod /lib/sp5100_tco.ko <--- comment out
...
# find . | cpio -co | gzip -c > /boot/initrd-kdump.img
# service kdump restart
# sync
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger

The results of the test using the above initrd are as follows. The ACPI errors do not occur, and kdump succeeds.

Fedora14 with 2.6.38.x86_64(linux-2.6.38.tar.gz): kdump success
Fedora15 with 2.6.43.1-5.fc15.x86_64: kdump success
Fedora16 with 3.3.1-3.fc16.x86_64: kdump success

Therefore, I think that the sp5100_tco driver causes the ACPI errors, but I am not sure why it does.

Regards,
Takahisa
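The manual "vi init" step could be scripted roughly as follows. This is a sketch under the assumption that the init script contains one 'echo "Loading <module>.ko module"' line and one 'insmod /lib/<module>.ko' line per module, as in the excerpts in this report; `disable_module` is a hypothetical helper and not part of mkdumprd.

```python
def disable_module(init_text, module):
    """Comment out the insmod line for one module and its 'Loading' echo line."""
    ko_name = module + ".ko"
    out = []
    for line in init_text.splitlines():
        stripped = line.strip()
        loads = stripped.startswith("insmod ") and stripped.endswith("/" + ko_name)
        announces = stripped.startswith("echo ") and ko_name in stripped
        # Use the same '###' prefix as the manual edit shown in this comment.
        out.append("###" + line if (loads or announces) else line)
    return "\n".join(out) + "\n"


# Example init fragment built from the lines quoted in this report.
init_fragment = (
    'echo "Loading asus_atk0110.ko module"\n'
    "insmod /lib/asus_atk0110.ko\n"
    'echo "Loading sp5100_tco.ko module"\n'
    "insmod /lib/sp5100_tco.ko\n"
)

patched = disable_module(init_fragment, "sp5100_tco")
print(patched, end="")
```

The unpack and repack steps (zcat, cpio, gzip) stay the same as in the commands above; only the edit of the init file is replaced.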
I have found out why the sp5100_tco driver causes this problem, and I confirmed that the problem is solved in Fedora 17 (kernel-3.4.0-1.fc17.x86_64 and kexec-tools-2.0.3-38.fc17.x86_64).

The M4A89GTD-PRO/USB3 has an SB850 chipset, but the sp5100_tco driver supports only the SP5100 and SB700 series chipsets. It does not support SB8x0 chipsets(*), because the offset addresses of the watchdog registers differ between the SP5100/SB700 and SB800 chipsets.

*In other words, the sp5100_tco driver does not know that the watchdog register offsets changed starting with the SB800 series.

The offset addresses for the SP5100 and SB700 chipsets are as follows, quoted from the AMD SB700/710/750 Register Reference Guide (page 164) and the AMD SP5100 Register Reference Guide (page 166).

WatchDogTimerControl  69h
WatchDogTimerBase0    6Ch
WatchDogTimerBase1    6Dh
WatchDogTimerBase2    6Eh
WatchDogTimerBase3    6Fh

The offset addresses(*) for the SB800 chipsets are as follows, quoted from the AMD SB800-Series Southbridges Register Reference Guide (page 147).

WatchDogTimerEn       48h
WatchDogTimerConfig   4Ch

When the sp5100_tco driver enables the watchdog timer, it writes the enable and timer resolution bits to WatchDogTimerControl (0x69). On SB800 series chipsets, this offset belongs to AcpiGpe0Blk(*), the register that holds the base address of the ACPI GPE0 block. As a result, the base address in AcpiGpe0Blk becomes wrong.

*This register is 16 bits wide (0x68 and 0x69).

AcpiPmTmrBlk    64h
P_CNTBlk        66h
AcpiGpe0Blk     68h <--- sp5100_tco driver writes a wrong value here!!!
AcpiSmiCmdBlk   6Ah
AcpiPm2CntBlk   6Eh

For your reference, the code that writes the wrong value to AcpiGpe0Blk is the following.

http://lxr.linux.no/#linux+v3.4/drivers/watchdog/sp5100_tco.c

    /* Enable Watchdog timer and set the resolution to 1 sec. */
    outb(SP5100_PM_WATCHDOG_CONTROL, SP5100_IO_PM_INDEX_REG);
    val = inb(SP5100_IO_PM_DATA_REG);
    val |= SP5100_PM_WATCHDOG_SECOND_RES;
    val &= ~SP5100_PM_WATCHDOG_DISABLE;
    outb(val, SP5100_IO_PM_DATA_REG); <--- here

I believe that the kernel's ACPICA (ACPI Component Architecture) gets confused because the base address in AcpiGpe0Blk is wrong, and that this produces the following ACPI error messages.

[ 5.576051] ACPI Error: No handler or method for GPE00, disabling event (20110112/evgpe-753)
[ 5.577007] ACPI Error: No handler or method for GPE01, disabling event (20110112/evgpe-753)
... repeat endlessly...

In Fedora 17, mkdumprd was changed to use dracut, and dracut does not include modules that are unrelated to booting in the initramfs. Therefore, the sp5100_tco driver is not loaded in the kdump kernel, and this problem no longer occurs.

Thanks,
Takahisa
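To illustrate the register aliasing described above, here is a toy model of the index/data access, not real port I/O. The offsets come from the register guides quoted in this comment. The bit masks, the little-endian byte order, and the initial GPE0 base value 0x0820 are assumptions made for illustration only, and the class and function names are hypothetical.

```python
# Offsets taken from the register tables quoted in this comment.
PM_WATCHDOG_CONTROL = 0x69  # WatchDogTimerControl on SP5100/SB700
ACPI_GPE0_BLK = 0x68        # AcpiGpe0Blk on SB800 (16 bit: 0x68 and 0x69)

# Assumed bit masks, for illustration only; see sp5100_tco.h for the real ones.
WATCHDOG_DISABLE = 1 << 0
WATCHDOG_SECOND_RES = 3 << 1


class PmRegisters:
    """Toy model of a PM register file reached through an index/data pair."""

    def __init__(self):
        self.regs = bytearray(256)
        self.index = 0

    def write_index(self, index):
        self.index = index

    def read_data(self):
        return self.regs[self.index]

    def write_data(self, value):
        self.regs[self.index] = value & 0xFF

    def read16(self, offset):
        # Assumes the low byte sits at the lower offset (little-endian).
        return self.regs[offset] | (self.regs[offset + 1] << 8)

    def write16(self, offset, value):
        self.regs[offset] = value & 0xFF
        self.regs[offset + 1] = (value >> 8) & 0xFF


def enable_watchdog(pm):
    """Same read-modify-write sequence as the driver excerpt above."""
    pm.write_index(PM_WATCHDOG_CONTROL)
    val = pm.read_data()
    val |= WATCHDOG_SECOND_RES
    val &= ~WATCHDOG_DISABLE
    pm.write_data(val)


pm = PmRegisters()
pm.write16(ACPI_GPE0_BLK, 0x0820)  # hypothetical GPE0 base programmed by the BIOS
before = pm.read16(ACPI_GPE0_BLK)
enable_watchdog(pm)
after = pm.read16(ACPI_GPE0_BLK)
print(f"AcpiGpe0Blk before: {before:#06x}, after: {after:#06x}")
```

In this model only the high byte of the 16-bit value changes, so the GPE0 block base moves to a different I/O address while the low byte stays intact. That would be consistent with ACPICA then reading status bits from the wrong ports.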
excellent, thanks for testing!