Bug 435239 - [5.2][kdump] MP-BIOS bug: 8254 timer not connected to IO-APIC
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel   
Version: 5.2
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Ed Pollard
QA Contact: Martin Jenner
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2008-02-28 05:55 UTC by Qian Cai
Modified: 2013-08-06 00:03 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-10-22 10:13:40 UTC


Attachments
boot log in ibm-e326m for capture kernel hangs (7.16 KB, text/plain)
2008-02-28 05:55 UTC, Qian Cai
boot log in ibm-e326m for capture kernel works (17.29 KB, text/plain)
2008-02-28 05:56 UTC, Qian Cai

Description Qian Cai 2008-02-28 05:55:10 UTC
Description of problem:
It has been observed several times on certain IBM x86_64 machines that the
capture kernel hangs at:

SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC
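
When many console logs have to be triaged, the signature above can be checked mechanically. The following is a minimal Python sketch and not part of the original report. The function names and the rule "the log ends at, or within a few lines of, the MP-BIOS message" are assumptions made for illustration.

```python
MARKER = "MP-BIOS bug: 8254 timer not connected to IO-APIC"


def timer_bug_tail(console_log):
    """Return the non-empty lines from the MP-BIOS message onward.

    Returns None when the message never appears in the log.
    """
    lines = [ln.rstrip() for ln in console_log.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        if MARKER in ln:
            return lines[i:]
    return None


def stopped_at_timer_bug(console_log, max_tail=1):
    """Guess whether the boot stalled at the MP-BIOS message.

    Assumption: a log that ends within max_tail lines of the message
    belongs to a kernel that stopped making progress there.
    """
    tail = timer_bug_tail(console_log)
    return tail is not None and len(tail) <= max_tail


# The three lines quoted in the description, used as sample input.
sample = """SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC
"""
print(stopped_at_timer_bug(sample))  # True
```

A larger max_tail tolerates a few follow-up messages printed after the marker before the console goes quiet.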

With the RHEL5.2-Server-20080225.2 tree, the following RHTS machines are
affected, as far as I have tested:

ibm-e326m.rhts.boston.redhat.com
ibm-morrison.lab.boston.redhat.com
ibm-pizzaro.rhts.boston.redhat.com

There is a workaround, though: add "noapic" to the capture kernel command line.
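
A minimal Python sketch of applying that workaround to a command-line string without adding the option twice. The helper name and the sample command line are hypothetical. Where the capture kernel command line is actually configured depends on the kdump setup, and this sketch does not assume any particular file.

```python
def with_boot_option(cmdline, option="noapic"):
    """Return cmdline with option appended unless it is already present.

    The comparison is done on whole whitespace-separated words, so an
    existing "noapic" is detected but "noapictimer" is not mistaken for it.
    """
    words = cmdline.split()
    if option not in words:
        words.append(option)
    return " ".join(words)


# Hypothetical existing capture kernel arguments, for illustration only.
print(with_boot_option("irqpoll maxcpus=1"))  # irqpoll maxcpus=1 noapic
```

Calling the helper again on its own result leaves the string unchanged, so it is safe to run repeatedly from a setup script.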

I have tried different versions of both the kernel (2.6.18-53.el5) and
kexec-tools (1.101-194.4.el5) without success.

Note that the problem is only triggered by certain crash scenarios, for example
a BUG injected by LKDTM (Linux Kernel Dump Test Module) in do_irq(). A simple
"echo c >/proc/sysrq-trigger" works perfectly fine.

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080225.2
kernel-2.6.18-83.el5
kexec-tools-1.102pre-10.el5

How reproducible:
always (3 times in a row)

Steps to Reproduce:
1. Reserve one of the affected machines, configure kdump, and boot the kernel
with crashkernel=128M@16M.
2. wget http://porkchop.devel.redhat.com/qa/rhts/lookaside/ltp-kdump-20080228.tar.gz;
cd kdump/lib/lkdtm; export USE_SYMBOL_NAME=1; make
3. insmod lkdtm.ko cpoint_name=INT_HARDWARE_ENTRY cpoint_type=BUG cpoint_count=05
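
For reference, the crashkernel=128M@16M value in step 1 has the form SIZE@OFFSET. A minimal Python sketch that converts it to bytes follows. It only handles this simple form with an uppercase K, M or G suffix on both numbers, and the function name is an illustration, not something taken from kexec-tools.

```python
import re

# Binary multipliers for the size suffixes handled by this sketch.
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_crashkernel(arg):
    """Parse 'crashkernel=SIZE@OFFSET' into (size_bytes, offset_bytes)."""
    m = re.fullmatch(r"crashkernel=(\d+)([KMG])@(\d+)([KMG])", arg)
    if m is None:
        raise ValueError("unsupported crashkernel value: %r" % arg)
    size = int(m.group(1)) * _UNITS[m.group(2)]
    offset = int(m.group(3)) * _UNITS[m.group(4)]
    return size, offset


size, offset = parse_crashkernel("crashkernel=128M@16M")
print(size, offset)  # 134217728 16777216
```

With these values the reserved region runs from 16 MiB up to 144 MiB of physical memory.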
  
Actual results:
Capture kernel hangs.

Expected results:
Capture kernel boots successfully.

Additional information:
Boot logs for both the hanging and the working (via sysrq-c) capture kernel have
been attached. Comparing the two files shows some interesting differences:

--- ibm-e326m-hangs.log 2008-02-28 13:32:58.000000000 +0800
+++ ibm-e326m-works.log 2008-02-28 13:44:13.000000000 +0800
...
-CPU 0: aperture @ 410000000 size 32 MB
+CPU 0: aperture @ 412000000 size 32 MB
 Aperture too small (32 MB)
 No AGP bridge found
 Memory: 118844k/147440k available (2456k kernel code, 12212k reserved, 1242k data, 196k init)
-Calibrating delay using timer specific routine.. 3995.00 BogoMIPS (lpj=1997500)
+Calibrating delay using timer specific routine.. 3994.95 BogoMIPS (lpj=1997477)
 Security Framework v1.0.0 initialized
 SELinux:  Initializing.
 selinux_register_security:  Registering secondary module capability
@@ -128,10 +68,321 @@
 Mount-cache hash table entries: 256
 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
 CPU: L2 Cache: 1024K (64 bytes/line)
-CPU 0/0 -> Node 0
+CPU 0/1 -> Node 0
 CPU: Physical Processor ID: 0
-CPU: Processor Core ID: 0
+CPU: Processor Core ID: 1
 SMP alternatives: switching to UP code
 ACPI: Core revision 20060707
-..MP-BIOS bug: 8254 timer not connected to IO-APIC
...
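
A comparison in the same unified format can be produced with Python's standard difflib module. This is a minimal sketch. The function name and the default file labels are illustrative, and the sample input is limited to a few lines that appear in the excerpt above.

```python
import difflib


def diff_boot_logs(hang_log, work_log,
                   hang_name="hangs.log", work_name="works.log"):
    """Return the unified diff of two boot logs as a list of lines."""
    return list(difflib.unified_diff(
        hang_log.splitlines(),
        work_log.splitlines(),
        fromfile=hang_name,
        tofile=work_name,
        lineterm="",  # inputs carry no newlines, so emit none either
    ))


hang = """CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC"""

work = """CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
SMP alternatives: switching to UP code
ACPI: Core revision 20060707"""

for line in diff_boot_logs(hang, work):
    print(line)
```

Lines starting with "-" exist only in the hanging log and lines starting with "+" only in the working one, as in the excerpt above.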

I have also tried building a new kernel with the "new early apic init patch"
from BZ336371, but on ibm-e326m it failed to progress any further than this:

  Booting 'Red Hat Enterprise Linux Server (2.6.18-83.el5.earlyapic)'

root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel /vmlinuz-2.6.18-83.el5.earlyapic ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200
   [Linux-bzImage, setup=0x1e00, size=0x1c411c]
initrd /initrd-2.6.18-83.el5.earlyapic.img
   [Linux-initrd @ 0x37cd3000, 0x31ce14 bytes]

Comment 1 Qian Cai 2008-02-28 05:55:10 UTC
Created attachment 296157
boot log in ibm-e326m for capture kernel hangs

Comment 2 Qian Cai 2008-02-28 05:56:36 UTC
Created attachment 296160
boot log in ibm-e326m for capture kernel works

Comment 3 Qian Cai 2008-03-04 08:18:36 UTC
Same problem on ibm-wildhorse-01, but it looks like it failed with a different
crash scenario.
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2100206

Comment 4 Jeff Burke 2008-03-31 18:22:06 UTC
Cai,
   Do you know if this is a regression for 5.1?

Thanks,
Jeff

Comment 5 Qian Cai 2008-03-31 22:27:50 UTC
Not a regression against 5.1. It works with neither the RHEL5U1 kernel
(2.6.18-53.el5) nor kexec-tools (1.101-194.4.el5).

Comment 6 Qian Cai 2008-04-15 14:44:33 UTC
I can still see this with the -89.el5 kernel:

...
Total of 1 processors activated (3996.41 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...  failed.
timer doesn't work through the IO-APIC - disabling NMI Watchdog!
...trying to set up timer as Virtual Wire IRQ... failed.
...trying to set up timer as ExtINT IRQ...

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2681825
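
The "..TIMER:" line in the log above carries the values that are useful to compare between machines. A minimal Python sketch for pulling them out of a console log line follows. The function name is illustrative, and reading -1 as "no such pin or APIC found" is an interpretation rather than something stated in this report.

```python
import re


def parse_timer_line(line):
    """Parse a '..TIMER:' kernel message into a dict of integers.

    Returns None for lines that are not a TIMER message. Values may be
    decimal, negative, or hexadecimal with a 0x prefix.
    """
    if "..TIMER:" not in line:
        return None
    fields = re.findall(r"(\w+)=(-?(?:0x[0-9a-fA-F]+|\d+))", line)
    # int(value, 0) accepts both "0x31" and "-1".
    return {key: int(value, 0) for key, value in fields}


info = parse_timer_line("..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1")
print(info)
# {'vector': 49, 'apic1': 0, 'pin1': 2, 'apic2': -1, 'pin2': -1}
```

Collecting these dictionaries from the affected and unaffected machines would make it easy to see whether the routing values differ.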

Comment 7 Qian Cai 2008-04-15 14:47:51 UTC
So it looks like i386 is also affected.

Comment 8 Ed Pollard 2008-05-08 14:31:04 UTC
It seems there are several issues with kdump on a few IBM systems. Once I can
get to them reliably again, I am going to start making sure that the firmware is
up to date on all of them. A little searching around the net showed this
scenario being seen a couple of years ago on several different types of systems,
and it seems to have been generally accepted as a BIOS problem, but I won't know
until I can get to the systems.

I am also a bit concerned that, if this is indeed BIOS-related, it is only being
seen in the kdump kernel and not in the boot kernel.

Comment 9 Qian Cai 2008-07-15 09:50:49 UTC
Hi, any update so far? I am wondering if it is possible to update the BIOS on
these machines?

ibm-e326m.rhts.boston.redhat.com
ibm-morrison.lab.boston.redhat.com
ibm-pizzaro.rhts.boston.redhat.com

Then I could retest on RHEL5.3.

Comment 10 Ed Pollard 2008-07-15 14:08:49 UTC
I am leaving the office for a few days and plan to do this when I get back.
Sorry for the delay. Morrison should have had its BIOS updated, so you might
give that one a try first.

Comment 11 Qian Cai 2008-07-16 03:38:01 UTC
ibm-morrison.rhts.bos.redhat.com is currently unavailable in RHTS.

Comment 12 Qian Cai 2008-07-16 10:11:20 UTC
Hi Ed, I'll retest once you have had time to update the BIOS on those machines. Thanks!

Comment 13 Qian Cai 2008-10-22 10:13:40 UTC
I'll close this out, as using jprobe() to trigger artificial crashes is probably not a good way to test kdump. I'll create a new kernel module to test those scenarios and open new BZs for any issues found.

