Description of problem:
It has been observed several times on certain IBM x86_64 machines that the capture kernel hangs at:

SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC

With the RHEL5.2-Server-20080225.2 tree, the following RHTS machines are affected as far as I have tested:

ibm-e326m.rhts.boston.redhat.com
ibm-morrison.lab.boston.redhat.com
ibm-pizzaro.rhts.boston.redhat.com

There is a workaround: add "noapic" to the capture kernel command line. I have also tried an older kernel (2.6.18-53.el5) and an older kexec-tools (1.101-194.4.el5), without success.

Note that the problem is only triggered by certain crash scenarios, for example a BUG injected by LKDTM (Linux Kernel Dump Test Module) in do_irq(). A simple "echo c >/proc/sysrq-trigger" works perfectly fine.

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080225.2
kernel-2.6.18-83.el5
kexec-tools-1.102pre-10.el5

How reproducible:
always (3 times in a row)

Steps to Reproduce:
1. Reserve one of the affected machines, configure kdump, and boot the kernel with crashkernel=128M@16M.
2. wget http://porkchop.devel.redhat.com/qa/rhts/lookaside/ltp-kdump-20080228.tar.gz; cd kdump/lib/lkdtm; export USE_SYMBOL_NAME=1; make
3. insmod lkdtm.ko cpoint_name=INT_HARDWARE_ENTRY cpoint_type=BUG cpoint_count=05

Actual results:
Capture kernel hangs.

Expected results:
Capture kernel comes up successfully.

Additional information:
Boot logs for both the hanging capture kernel and the working one (crash triggered via sysrq-c) have been attached. Comparing the two files shows some interesting differences:

--- ibm-e326m-hangs.log 2008-02-28 13:32:58.000000000 +0800
+++ ibm-e326m-works.log 2008-02-28 13:44:13.000000000 +0800
...
-CPU 0: aperture @ 410000000 size 32 MB
+CPU 0: aperture @ 412000000 size 32 MB
 Aperture too small (32 MB)
 No AGP bridge found
 Memory: 118844k/147440k available (2456k kernel code, 12212k reserved, 1242k data, 196k init)
-Calibrating delay using timer specific routine.. 3995.00 BogoMIPS (lpj=1997500)
+Calibrating delay using timer specific routine.. 3994.95 BogoMIPS (lpj=1997477)
 Security Framework v1.0.0 initialized
 SELinux: Initializing.
 selinux_register_security: Registering secondary module capability
@@ -128,10 +68,321 @@
 Mount-cache hash table entries: 256
 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
 CPU: L2 Cache: 1024K (64 bytes/line)
-CPU 0/0 -> Node 0
+CPU 0/1 -> Node 0
 CPU: Physical Processor ID: 0
-CPU: Processor Core ID: 0
+CPU: Processor Core ID: 1
 SMP alternatives: switching to UP code
 ACPI: Core revision 20060707
-..MP-BIOS bug: 8254 timer not connected to IO-APIC
...

I have also tried building a new kernel with the "new early apic init patch" from BZ336371, but it failed to progress any further on ibm-e326m:

Booting 'Red Hat Enterprise Linux Server (2.6.18-83.el5.earlyapic)'

root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel /vmlinuz-2.6.18-83.el5.earlyapic ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200
   [Linux-bzImage, setup=0x1e00, size=0x1c411c]
initrd /initrd-2.6.18-83.el5.earlyapic.img
   [Linux-initrd @ 0x37cd3000, 0x31ce14 bytes]
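For the "noapic" workaround mentioned in the description, the report does not say where the option was added. A minimal sketch, assuming the capture kernel's extra arguments come from a quoted KDUMP_COMMANDLINE_APPEND="..." line in /etc/sysconfig/kdump (an assumption about the RHEL5 kdump configuration, not something stated in this report), could rewrite that line as follows. The function name add_capture_kernel_arg is hypothetical.

```python
import re

# Assumption: extra capture-kernel arguments are kept in a line of the form
#   KDUMP_COMMANDLINE_APPEND="..."
# The bug report itself does not say where "noapic" was added.
APPEND_LINE = re.compile(r'^(KDUMP_COMMANDLINE_APPEND=")([^"]*)(")\s*$')


def add_capture_kernel_arg(config_text, arg="noapic"):
    """Return config_text with arg present in KDUMP_COMMANDLINE_APPEND."""
    lines = config_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        match = APPEND_LINE.match(line)
        if not match:
            continue
        found = True
        args = match.group(2).split()
        if arg not in args:
            args.append(arg)  # keep existing arguments, add the new one last
        lines[i] = match.group(1) + " ".join(args) + match.group(3)
    if not found:
        # No quoted append line yet: add one rather than guess at other formats.
        lines.append('KDUMP_COMMANDLINE_APPEND="%s"' % arg)
    return "\n".join(lines) + "\n"
```

The function only transforms text, so it can be tried on a copy of the file first. After the real file is changed, the kdump service would presumably need a restart so that the capture kernel is reloaded with the new command line.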
Created attachment 296157 [details] boot log in ibm-e326m for capture kernel hangs
Created attachment 296160 [details] boot log in ibm-e326m for capture kernel works
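The comparison of the two attached logs can be reproduced with any diff tool. As a minimal sketch, Python's difflib produces the same unified format; the default file names are taken from the diff header in the description, and the helper name diff_boot_logs is hypothetical.

```python
import difflib


def diff_boot_logs(hang_lines, work_lines,
                   hang_name="ibm-e326m-hangs.log",
                   work_name="ibm-e326m-works.log"):
    """Return a unified diff (as a list of lines) between two boot logs."""
    # Strip trailing whitespace, including carriage returns left over from
    # serial console captures, so they do not show up as differences.
    a = [line.rstrip() for line in hang_lines]
    b = [line.rstrip() for line in work_lines]
    return list(difflib.unified_diff(a, b, fromfile=hang_name,
                                     tofile=work_name, lineterm=""))
```

Given two open log files, something like print("\n".join(diff_boot_logs(f_hang.readlines(), f_work.readlines()))) would print the diff.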
Same problem on ibm-wildhorse-01, but it looks like it failed with a different crash scenario. http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2100206
Cai, do you know if this is a regression from 5.1? Thanks, Jeff
Not a regression against 5.1. It works with neither the RHEL5U1 kernel (2.6.18-53.el5) nor the RHEL5U1 kexec-tools (1.101-194.4.el5).
I can still see this with the -89.el5 kernel:

...
Total of 1 processors activated (3996.41 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ... failed.
timer doesn't work through the IO-APIC - disabling NMI Watchdog!
...trying to set up timer as Virtual Wire IRQ... failed.
...trying to set up timer as ExtINT IRQ...

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2681825
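When checking many RHTS logs for this hang, the timer fallback sequence quoted above can be picked out mechanically. The following is a rough sketch that matches only the message strings appearing in this log excerpt; the function names timer_attempts and unfinished_attempt are hypothetical, and other kernel versions may word these messages differently.

```python
import re

# Message text copied from the log excerpt in this comment.
BIOS_BUG = "MP-BIOS bug: 8254 timer not connected to IO-APIC"
ATTEMPT = re.compile(
    r"\.\.\.trying to set up timer (?:\(IRQ0\) )?(?:through|as) "
    r"(.+?)\s*\.\.\.\s*(.*)$"
)


def timer_attempts(log_text):
    """Return (bios_bug_seen, [(method, outcome), ...]) for a boot log.

    outcome is the text logged after the attempt, or "" if the log
    ends the line without reporting a result.
    """
    bios_bug_seen = False
    attempts = []
    for line in log_text.splitlines():
        if BIOS_BUG in line:
            bios_bug_seen = True
        match = ATTEMPT.search(line)
        if match:
            attempts.append((match.group(1), match.group(2).strip()))
    return bios_bug_seen, attempts


def unfinished_attempt(attempts):
    """Return the last attempted method if it has no logged outcome, else None."""
    if attempts and attempts[-1][1] == "":
        return attempts[-1][0]
    return None
```

For the excerpt above, the last attempt (ExtINT IRQ) has no logged outcome, which is consistent with the capture kernel hanging at that point.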
So it looks like i386 is also affected.
It seems there are several issues with kdump on a few IBM systems. Once I can get to them reliably again, I am going to start by making sure that the firmware is up to date on all of them. A little searching around the net showed this scenario being seen a couple of years ago on several different types of systems, and it seems to have been generally accepted as a BIOS problem, but I won't know until I can get to the systems. I am also a bit concerned that this is only being seen in the kdump kernel and not in the boot kernel, if it is indeed BIOS related.
Hi, any update so far? Would it be possible to update the BIOS on these machines, so that I can retest with RHEL5.3?

ibm-e326m.rhts.boston.redhat.com
ibm-morrison.lab.boston.redhat.com
ibm-pizzaro.rhts.boston.redhat.com
I am leaving the office for a few days and plan to do this when I get back. Sorry for the delay. Morrison should have had its BIOS updated, so you might give that one a try first.
ibm-morrison.rhts.bos.redhat.com is currently unavailable in RHTS.
Hi Ed, I'll retest once you have had time to update the BIOS on those machines. Thanks!
I'll close this out, as using jprobe() to trigger artificial crashes is probably not a good way to test kdump. I'll create a new kernel module to test those scenarios and open new BZs for any issues found.