Bug 789228 - kdump kernel : Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter
Keywords:
Status: CLOSED DUPLICATE of bug 505527
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-02-10 06:30 UTC by Zhouping Liu
Modified: 2014-01-13 00:01 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-10 08:29:21 UTC
Target Upstream Version:
Embargoed:


Description Zhouping Liu 2012-02-10 06:30:26 UTC
Description of problem:

While testing bug 742079, I found that kdump hangs in the guest with some combinations
(see comment 65 in bug 742079):
1. SSH + x86_64 + AMD CPU
2. NFS + i386 + AMD CPU
3. NFS + i386/x86_64 + Intel CPU
The hang occurred most frequently in guests with an AMD CPU.
I mainly tested the issue on these two machines:
hp-dl585g7-01.rhts.eng.nay.redhat.com &
ibm-x3550m3-03.rhts.eng.nay.redhat.com

Version-Release number of selected component (if applicable):
host: kernel-2.6.18-308.el5
guest: kernel-2.6.18-308.el5
       kexec-tools-1.102pre-154.el5

How reproducible:
not 100%

Steps to Reproduce:
1. Set up the kdump service with local and network dump targets.
2. # echo c > /proc/sysrq-trigger
3.
  
Actual results:
the guest hangs

Expected results:
kdump works and saves the vmcore file.

Comment 1 Cong Wang 2012-02-10 06:36:11 UTC
The full console log of the second kernel is needed.

Comment 2 Zhouping Liu 2012-02-10 12:45:42 UTC
Hi Amerigo,

For instance, I reproduced it with
SSH + x86_64 + hp-dl585g7-01.rhts.eng.nay.redhat.com,
and I found that it does not reproduce 100% of the time.

[root@localhost ~]# cat /etc/kdump.conf
net root.eng.nay.redhat.com
path /var/crash
core_collector makedumpfile -E -d 31
link_delay 60

[root@localhost ~]# cat /proc/cmdline 
ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200 console=tty0 rhgb quiet crashkernel=128M@16M
[root@localhost ~]# echo c > /proc/sysrq-trigger

Below is the console log:
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel 2.6.18-308.el5 on an x86_64

localhost.localdomain login: mtrr: type mismatch for c2000000,400000 old: uncachable new: write-combining
SysRq : Trigger a crashdump
Kexec: Warning: crash image not loaded
Kernel panic - not syncing: SysRq-triggered panic!

Comment 3 Cong Wang 2012-02-10 12:51:04 UTC
(In reply to comment #2)
> Kexec: Warning: crash image not loaded

Are you sure you have loaded kdump kernel (by running 'service kdump start')?

Comment 4 Zhouping Liu 2012-02-13 02:21:52 UTC
(In reply to comment #3)
> 
> Are you sure you have loaded kdump kernel (by running 'service kdump start')?
Yes. Here is a more detailed log; I hope it helps you debug.
...
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel 2.6.18-308.el5 on an x86_64

localhost.localdomain login: mtrr: type mismatch for c2000000,400000 old: uncachable new: write-combining

Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel 2.6.18-308.el5 on an x86_64

localhost.localdomain login: root
Password: 
Last login: Sun Feb 12 21:16:47 from 192.168.122.1
[root@localhost ~]# cat /etc/kdump.conf
net ibm-x3550m3-03.rhts.eng.nay.redhat.com:/mnt/testarea/nfs
core_collector makedumpfile -c -d 3
[root@localhost ~]# mount -t nfs ibm-x3550m3-03.rhts.eng.nay.redhat.com:/mnt/testarea/nfs /mnt/
[root@localhost ~]# cd /mnt/
[root@localhost mnt]# ls
aa  var
[root@localhost mnt]# cd
[root@localhost ~]# umount /mnt/
[root@localhost ~]# service kdump restart
Stopping kdump:                                            [  OK  ]
Starting kdump:                                            [  OK  ]
[root@localhost ~]# echo c > /proc/sysrq-trigger 
SysRq : Trigger a crashdump
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter

Comment 5 Cong Wang 2012-02-13 03:00:59 UTC
The error you show in comment #4 is different from the one in comment #2. But anyway, this is a kernel problem.
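To make the difference between the two reports concrete, here is a minimal sketch that sorts a saved console log by the failure message it contains. The function name and the labels it prints are hypothetical; only the two matched strings are taken from the logs in comment #2 and comment #4.

```shell
# classify_kdump_log: hypothetical helper, reads a console log on stdin and
# prints a label for the first known failure signature it finds.
classify_kdump_log() {
    log=$(cat)
    case $log in
        # Signature from comment #2: no crash kernel was loaded at panic time.
        *'Kexec: Warning: crash image not loaded'*)
            echo crash-image-not-loaded ;;
        # Signature from comment #4: the second kernel panics during timer setup.
        *"IO-APIC + timer doesn't work"*)
            echo ioapic-timer-panic ;;
        *)
            echo unknown ;;
    esac
}

# Example with the panic line from comment #4.
printf '%s\n' "Kernel panic - not syncing: IO-APIC + timer doesn't work!" | classify_kdump_log
```

A label of crash-image-not-loaded points at the kdump service setup in the first kernel, while ioapic-timer-panic means the second kernel did boot and failed later.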

Comment 6 Guangze Bai 2012-09-05 11:13:12 UTC
Reproduced this bug 100% in my testing:

host: RHEL5.9-Server-20120822.1_x86_64 + 337.el5 kernel
guest: RHEL5.9-Server-20120822.1_x86_64 + 337.el5 kernel
guest configuration: 2GB RAM + 2CPU + disk(virtio) + NIC(e1000)

In the guest, configure local/network kdump and trigger a crash via SysRq like this:

# echo c > /proc/sysrq-trigger 

SysRq : Trigger a crashdump
Linux version 2.6.18-337.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)) #1 SMP Mon Aug 20 07:55:09 EDT 2012
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 irqpoll maxcpus=1 reset_devices loglevel=7  hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
 BIOS-e820: 000000007fff0000 - 0000000080000000 (ACPI data)
 BIOS-e820: 00000000c0000000 - 00000000c1000000 (reserved)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000010000 - 000000000009f000 (usable)
 user: 000000000009f000 - 00000000000a0000 (reserved)
 user: 0000000001000000 - 0000000001601000 (usable)
 user: 0000000001690000 - 0000000008ffc000 (usable)
 user: 000000007fff0000 - 0000000080000000 (ACPI data)
 user: 00000000c0000000 - 00000000c1000000 (reserved)
 user: 00000000fffbc000 - 0000000100000000 (reserved)
DMI 2.4 present.
kvm-clock: cpu 0, msr 7eff:804a9401, boot clock
No NUMA configuration found
Faking a node at 0000000000000000-0000000008ffc000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xb008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] disabled)
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] disabled)
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] disabled)
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0f] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 000000000009f000 - 00000000000a0000
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001601000 - 0000000001690000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:76ff4000)
SMP: Allowing 16 CPUs, 14 hotplug CPUs
kvm-clock: cpu 0, msr 0:15db401, primary cpu clock
Built 1 zonelists.  Total pages: 32251
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 irqpoll maxcpus=1 reset_devices loglevel=7  hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdc=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Using TSC for driving interrupts
irq 105, desc: ffffffff80451c80, depth: 1, count: 0, unhandled: 0
->handle_irq():  ffffffff800be74a, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff8032abc0, no_irq_chip+0x0/0x80
->action(): (null)
  IRQ_DISABLED set
unexpected IRQ trap at vector 69
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
ACPI: DMAR not present
Memory: 115336k/147440k available (2623k kernel code, 15720k reserved, 1676k data, 224k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 6133.56 BogoMIPS (lpj=3066782)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
, L1 D cache: 32K
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter
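As a cross-check on the log above, the usable memmap= entries on the kdump kernel command line can be converted to the hex ranges printed under "user-defined physical RAM map". This is only a sketch: the helper name is hypothetical, and it handles just the memmap=<size>K@<start>K form, not the $ (reserved) or # (ACPI data) forms.

```shell
# memmap_range: hypothetical helper, converts memmap=<size>K@<start>K into
# the "start - end" hex range format used in the kernel's RAM map output.
memmap_range() {
    spec=${1#memmap=}        # e.g. 572K@64K
    size_k=${spec%%K@*}      # size in KiB, e.g. 572
    start_k=${spec#*@}       # start with unit, e.g. 64K
    start_k=${start_k%K}     # start in KiB, e.g. 64
    printf '%016x - %016x\n' "$(( start_k * 1024 ))" "$(( (start_k + size_k) * 1024 ))"
}

# The three usable regions from the command line in the log above.
for arg in memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K; do
    memmap_range "$arg"
done
```

The three computed ranges should line up with the three "(usable)" lines of the user-defined map, the last one ending at 0x8ffc000 (147440K), which is also the elfcorehdr= address.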

Comment 9 Amos Kong 2012-09-06 02:53:25 UTC
Hi Guangze,

Did you test with the kernel parameter "divider=10"?
Or with this check disabled by adding 'no_timer_check' to the kernel cmdline?

rhel5/Documentation/kernel-parameters.txt

        divider=        [IA-32,X86-64]
                        divide kernel HZ rate by given value.
                        Format: <num>, where <num> is between 1 and 25

rhel5/Documentation/x86_64/boot-options.txt
   no_timer_check Don't check the IO-APIC timer. This can work around
                problems with incorrect timer initialization on some boards.
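One possible way to pass such a parameter to the kdump kernel only is to add it to the extra arguments that the kdump service appends to the second kernel's command line. The sketch below assumes that these arguments live in KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump and that the current value is "irqpoll maxcpus=1 reset_devices"; both are assumptions and should be checked on the guest. The helper name is hypothetical.

```shell
# append_param: hypothetical helper, appends a parameter to a kernel
# command-line string unless that exact word is already present.
append_param() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "$1 $2" ;;
    esac
}

# Assumed current value; verify against /etc/sysconfig/kdump before editing.
current="irqpoll maxcpus=1 reset_devices"
updated=$(append_param "$current" no_timer_check)
echo "KDUMP_COMMANDLINE_APPEND=\"$updated\""
```

After the file is edited by hand, 'service kdump restart' reloads the crash kernel so that the new command line takes effect.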

Comment 10 Guangze Bai 2012-09-06 03:25:13 UTC
Amos,

With divider=10 on the kdump kernel command line, the console log is the same as in comment #6.

Below is the attempt with no_timer_check on the kdump kernel command line:

SysRq : Trigger a crashdump
Linux version 2.6.18-337.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)) #1 SMP Mon Aug 20 07:55:09 EDT 2012
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 no_timer_check irqpoll maxcpus=1 reset_devices loglevel=7  hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
 BIOS-e820: 000000007fff0000 - 0000000080000000 (ACPI data)
 BIOS-e820: 00000000c0000000 - 00000000c1000000 (reserved)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000010000 - 000000000009f000 (usable)
 user: 000000000009f000 - 00000000000a0000 (reserved)
 user: 0000000001000000 - 0000000001601000 (usable)
 user: 0000000001690000 - 0000000008ffc000 (usable)
 user: 000000007fff0000 - 0000000080000000 (ACPI data)
 user: 00000000c0000000 - 00000000c1000000 (reserved)
 user: 00000000fffbc000 - 0000000100000000 (reserved)
DMI 2.4 present.
kvm-clock: cpu 0, msr 7eff:804a9401, boot clock
No NUMA configuration found
Faking a node at 0000000000000000-0000000008ffc000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xb008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] disabled)
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] disabled)
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] disabled)
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0f] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 000000000009f000 - 00000000000a0000
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001601000 - 0000000001690000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:76ff4000)
SMP: Allowing 16 CPUs, 14 hotplug CPUs
kvm-clock: cpu 0, msr 0:15db401, primary cpu clock
Built 1 zonelists.  Total pages: 32251
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 no_timer_check irqpoll maxcpus=1 reset_devices loglevel=7  hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdc=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Using TSC for driving interrupts
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
ACPI: DMAR not present
Memory: 115336k/147440k available (2623k kernel code, 15720k reserved, 1676k data, 224k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 6133.56 BogoMIPS (lpj=3066782)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
, L1 D cache: 32K
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
Using local APIC timer interrupts.
WARNING calibrate_APIC_clock: the APIC timer calibration may be wrong.
Detected 62.505 MHz APIC timer.
Brought up 1 CPUs
time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.
time.c: Detected 3066.782 MHz processor.
checking if image is initramfs... it is
Freeing initrd memory: 5089k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI quirk: region b000-b03f claimed by PIIX4 ACPI
PCI quirk: region b100-b10f claimed by PIIX4 SMB
ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 10 11) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 6 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
ACPI: DMAR not present
PCI-GART: No AMD northbridge found.
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 1, 8192 bytes)
TCP established hash table entries: 4096 (order: 4, 65536 bytes)
TCP bind hash table entries: 2048 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 4096 bind 2048)
TCP reno registered
audit: initializing netlink socket (disabled)
type=2000 audit(1346900734.000:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Initializing Cryptographic API
alg: No test for crc32c (crc32c-generic)
ksign: Installing public key data
Loading keyring
- Added public key BF701D582B157074
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
Limiting direct PCI/PCI transfers.
Activating ISA DMA hang workarounds.
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
irq 11: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ>  [<ffffffff800befc5>] __report_bad_irq+0x30/0x7d
 [<ffffffff800bf203>] note_interrupt+0x1f1/0x232
 [<ffffffff800be6d3>] __do_IRQ+0xe4/0x15b
 [<ffffffff8006d469>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d625>] ret_from_intr+0x0/0xa
 [<ffffffff801ce219>] klist_children_get+0x0/0x9
 [<ffffffff80012571>] __do_softirq+0x51/0x133
 [<ffffffff8005e30c>] call_softirq+0x1c/0x28
 [<ffffffff8006d5de>] do_softirq+0x2c/0x7d
 [<ffffffff8005dc9e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff801ce219>] klist_children_get+0x0/0x9
 [<ffffffff800bfd2f>] probe_irq_on+0x6e/0x151
 [<ffffffff801cb1e8>] serial8250_config_port+0x7c7/0x9c3
 [<ffffffff801c8c13>] uart_add_one_port+0xf8/0x278
 [<ffffffff801ce954>] device_add+0x34e/0x372
 [<ffffffff8048ee3f>] serial8250_init+0xdb/0x125
 [<ffffffff8046fa5e>] init+0x1f9/0x2f7
 [<ffffffff8005dfc1>] child_rip+0xa/0x11
 [<ffffffff8018a076>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff8046f865>] init+0x0/0x2f7
 [<ffffffff8005dfb7>] child_rip+0x0/0x11

handlers:
Disabling IRQ #11

Amos, the kdump kernel hangs at this point.

Comment 11 Dave Young 2012-09-06 08:22:08 UTC
Guangze helped verify that a rhel6 guest also fails on a rhel5 host, while a rhel6 guest works well on a rhel6 host. So this is probably a rhel5 host kvm bug.

Comment 14 Dave Young 2012-09-10 08:29:21 UTC

*** This bug has been marked as a duplicate of bug 505527 ***

