Description of problem:
While testing bug 742079, I found that kdump hangs in the guest with some combinations (see comment 65 in bug 742079):
1. SSH + x86_64 + AMD CPU
2. NFS + i386 + AMD CPU
3. NFS + i386/x86_64 + Intel CPU
The kdump hang occurred frequently in guests with an AMD CPU. I mainly tested the issue using these two machines: hp-dl585g7-01.rhts.eng.nay.redhat.com & ibm-x3550m3-03.rhts.eng.nay.redhat.com

Version-Release number of selected component (if applicable):
host: kernel-2.6.18-308.el5
guest: kernel-2.6.18-308.el5
kexec-tools-1.102pre-154.el5

How reproducible:
not 100%

Steps to Reproduce:
1. Set up the kdump service using a local and a network target.
2. # echo c > /proc/sysrq-trigger

Actual results:
The guest hangs.

Expected results:
kdump works and saves the vmcore file.
The full console log of the second kernel is needed.
Hi Amerigo, for instance, I reproduced it with SSH + x86_64 on hp-dl585g7-01.rhts.eng.nay.redhat.com, but it does not reproduce 100% of the time.
[root@localhost ~]# cat /etc/kdump.conf
net root.eng.nay.redhat.com
path /var/crash
core_collector makedumpfile -E -d 31
link_delay 60
[root@localhost ~]# cat /proc/cmdline
ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200 console=tty0 rhgb quiet crashkernel=128M@16M
[root@localhost ~]# echo c > /proc/sysrq-trigger
Below is the console log:
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel 2.6.18-308.el5 on an x86_64
localhost.localdomain login: mtrr: type mismatch for c2000000,400000 old: uncachable new: write-combining
SysRq : Trigger a crashdump
Kexec: Warning: crash image not loaded
Kernel panic - not syncing: SysRq-triggered panic!
(In reply to comment #2)
> Kexec: Warning: crash image not loaded
Are you sure you have loaded kdump kernel (by running 'service kdump start')?
(In reply to comment #3)
> Are you sure you have loaded kdump kernel (by running 'service kdump start')?
Yes. Here is a more detailed log; I hope it helps you debug.
...
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel 2.6.18-308.el5 on an x86_64
localhost.localdomain login: mtrr: type mismatch for c2000000,400000 old: uncachable new: write-combining
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel 2.6.18-308.el5 on an x86_64
localhost.localdomain login: root
Password:
Last login: Sun Feb 12 21:16:47 from 192.168.122.1
[root@localhost ~]# cat /etc/kdump.conf
net ibm-x3550m3-03.rhts.eng.nay.redhat.com:/mnt/testarea/nfs
core_collector makedumpfile -c -d 3
[root@localhost ~]# mount -t nfs ibm-x3550m3-03.rhts.eng.nay.redhat.com:/mnt/testarea/nfs /mnt/
[root@localhost ~]# cd /mnt/
[root@localhost mnt]# ls
aa var
[root@localhost mnt]# cd
[root@localhost ~]# umount /mnt/
[root@localhost ~]# service kdump restart
Stopping kdump: [ OK ]
Starting kdump: [ OK ]
[root@localhost ~]# echo c > /proc/sysrq-trigger
SysRq : Trigger a crashdump
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter
The error you show in comment #4 is different from the one in comment #2. But anyway, this is a kernel problem.
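For triage, the two console logs can be told apart by their final messages. The following is a minimal Python sketch, not part of the original thread; the helper name failure_signature and the labels are hypothetical, and the marker strings are copied from the logs quoted in comments #2 and #4.

```python
# Marker strings copied from the console logs in comments #2 and #4.
# The labels on the left are made up for this illustration.
SIGNATURES = [
    ("crash image not loaded", "Kexec: Warning: crash image not loaded"),
    ("IO-APIC timer panic", "IO-APIC + timer doesn't work!"),
]


def failure_signature(console_log):
    """Return the label of the first known marker found in the log."""
    for label, marker in SIGNATURES:
        if marker in console_log:
            return label
    return "unknown"


log_comment_2 = (
    "SysRq : Trigger a crashdump\n"
    "Kexec: Warning: crash image not loaded\n"
    "Kernel panic - not syncing: SysRq-triggered panic!\n"
)
print(failure_signature(log_comment_2))
```

The first log never reaches the second kernel, while the second one panics inside it, which is why they need to be looked at separately.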
Reproduced this bug 100% of the time in my testing:
host: RHEL5.9-Server-20120822.1_x86_64 + 337.el5 kernel
guest: RHEL5.9-Server-20120822.1_x86_64 + 337.el5 kernel
guest configuration: 2GB RAM + 2CPU + disk(virtio) + NIC(e1000)
In the guest, configure local/network kdump and trigger a crash via SysRq like this:
# echo c > /proc/sysrq-trigger
SysRq : Trigger a crashdump
Linux version 2.6.18-337.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)) #1 SMP Mon Aug 20 07:55:09 EDT 2012
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 irqpoll maxcpus=1 reset_devices loglevel=7 hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
 BIOS-e820: 000000007fff0000 - 0000000080000000 (ACPI data)
 BIOS-e820: 00000000c0000000 - 00000000c1000000 (reserved)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000010000 - 000000000009f000 (usable)
 user: 000000000009f000 - 00000000000a0000 (reserved)
 user: 0000000001000000 - 0000000001601000 (usable)
 user: 0000000001690000 - 0000000008ffc000 (usable)
 user: 000000007fff0000 - 0000000080000000 (ACPI data)
 user: 00000000c0000000 - 00000000c1000000 (reserved)
 user: 00000000fffbc000 - 0000000100000000 (reserved)
DMI 2.4 present.
kvm-clock: cpu 0, msr 7eff:804a9401, boot clock
No NUMA configuration found
Faking a node at 0000000000000000-0000000008ffc000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xb008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] disabled)
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] disabled)
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] disabled)
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0f] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 000000000009f000 - 00000000000a0000
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001601000 - 0000000001690000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:76ff4000)
SMP: Allowing 16 CPUs, 14 hotplug CPUs
kvm-clock: cpu 0, msr 0:15db401, primary cpu clock
Built 1 zonelists.
Total pages: 32251
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 irqpoll maxcpus=1 reset_devices loglevel=7 hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdc=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Using TSC for driving interrupts
irq 105, desc: ffffffff80451c80, depth: 1, count: 0, unhandled: 0
->handle_irq(): ffffffff800be74a, handle_bad_irq+0x0/0x1f6
->chip(): ffffffff8032abc0, no_irq_chip+0x0/0x80
->action(): (null)
IRQ_DISABLED set
unexpected IRQ trap at vector 69
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
ACPI: DMAR not present
Memory: 115336k/147440k available (2623k kernel code, 15720k reserved, 1676k data, 224k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 6133.56 BogoMIPS (lpj=3066782)
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256 , L1 D cache: 32K
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter
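As a reading aid for the log above: the memmap=<size>@<start> entries on the kdump kernel command line describe the usable regions that then appear in the user-defined physical RAM map. The Python sketch below is not part of the original report; the helper name usable_ranges is hypothetical, and it assumes only the K suffix that this particular command line uses.

```python
import re


def usable_ranges(cmdline):
    """Return (start, end) byte ranges for each memmap=<size>K@<start>K entry.

    Only the '@' (usable RAM) form is matched; the '$' and '#' forms and
    memmap=exactmap are ignored by this sketch.
    """
    ranges = []
    for size_k, start_k in re.findall(r"memmap=(\d+)K@(\d+)K", cmdline):
        start = int(start_k) * 1024
        end = start + int(size_k) * 1024
        ranges.append((start, end))
    return ranges


# Memory-related part of the command line from the log above.
CMDLINE = (
    "memmap=exactmap memmap=572K@64K memmap=6148K@16384K "
    "memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K "
    "memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K"
)

for start, end in usable_ranges(CMDLINE):
    print(f"{start:016x} - {end:016x} (usable)")
```

The three ranges it prints match the three usable entries in the user-defined map (0x10000 to 0x9f000, 0x1000000 to 0x1601000, 0x1690000 to 0x8ffc000), and the last range ends at 147440K, the value given for elfcorehdr=.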
Hi Guangze,
Did you test with the kernel parameter "divider=10" added? Or with this check disabled by adding 'no_timer_check' to the kernel cmdline?
rhel5/Documentation/kernel-parameters.txt
divider= [IA-32,X86-64] divide kernel HZ rate by given value.
Format: <num>, where <num> is between 1 and 25
rhel5/Documentation/x86_64/boot-options.txt
no_timer_check Don't check the IO-APIC timer. This can work around problems with incorrect timer initialization on some boards.
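To make the two suggestions concrete, here is a minimal Python sketch, not from the original thread, that appends one boot parameter to a command-line string and enforces the range documented for divider=. The function name with_boot_param is hypothetical, and how the resulting string is passed to the kdump kernel is outside the sketch.

```python
def with_boot_param(cmdline, param):
    """Return cmdline with param appended, unless it is already present."""
    name, _, value = param.partition("=")
    if name == "divider":
        # kernel-parameters.txt: <num> is between 1 and 25
        if not value.isdigit() or not 1 <= int(value) <= 25:
            raise ValueError("divider must be a number between 1 and 25")
    words = cmdline.split()
    if param in words:
        return cmdline
    return " ".join(words + [param])


# Options taken from the kdump kernel command line shown in this bug.
base = "irqpoll maxcpus=1 reset_devices"
print(with_boot_param(base, "divider=10"))
print(with_boot_param(base, "no_timer_check"))
```

Each call tests one workaround at a time, so the console logs of the two runs stay comparable.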
Amos,
With divider=10 in the kdump kernel, the console log is the same as in c#6.
Below is a try with no_timer_check in the kdump kernel:
SysRq : Trigger a crashdump
Linux version 2.6.18-337.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)) #1 SMP Mon Aug 20 07:55:09 EDT 2012
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 no_timer_check irqpoll maxcpus=1 reset_devices loglevel=7 hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
 BIOS-e820: 000000007fff0000 - 0000000080000000 (ACPI data)
 BIOS-e820: 00000000c0000000 - 00000000c1000000 (reserved)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000010000 - 000000000009f000 (usable)
 user: 000000000009f000 - 00000000000a0000 (reserved)
 user: 0000000001000000 - 0000000001601000 (usable)
 user: 0000000001690000 - 0000000008ffc000 (usable)
 user: 000000007fff0000 - 0000000080000000 (ACPI data)
 user: 00000000c0000000 - 00000000c1000000 (reserved)
 user: 00000000fffbc000 - 0000000100000000 (reserved)
DMI 2.4 present.
kvm-clock: cpu 0, msr 7eff:804a9401, boot clock
No NUMA configuration found
Faking a node at 0000000000000000-0000000008ffc000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xb008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:6 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] disabled)
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] disabled)
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] disabled)
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0f] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 000000000009f000 - 00000000000a0000
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001601000 - 0000000001690000
Allocating PCI resources starting at 10000000 (gap: 8ffc000:76ff4000)
SMP: Allowing 16 CPUs, 14 hotplug CPUs
kvm-clock: cpu 0, msr 0:15db401, primary cpu clock
Built 1 zonelists.
Total pages: 32251
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 loglevel=7 no_timer_check irqpoll maxcpus=1 reset_devices loglevel=7 hdc=cdrom memmap=exactmap memmap=572K@64K memmap=6148K@16384K memmap=124336K@23104K elfcorehdr=147440K memmap=4K$636K memmap=64K#2097088K memmap=16384K$3145728K memmap=272K$4194032K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdc=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
Using TSC for driving interrupts
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
ACPI: DMAR not present
Memory: 115336k/147440k available (2623k kernel code, 15720k reserved, 1676k data, 224k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 6133.56 BogoMIPS (lpj=3066782)
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256 , L1 D cache: 32K
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
Using local APIC timer interrupts.
WARNING calibrate_APIC_clock: the APIC timer calibration may be wrong.
Detected 62.505 MHz APIC timer.
Brought up 1 CPUs
time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.
time.c: Detected 3066.782 MHz processor.
checking if image is initramfs... it is
Freeing initrd memory: 5089k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI quirk: region b000-b03f claimed by PIIX4 ACPI
PCI quirk: region b100-b10f claimed by PIIX4 SMB
ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 10 11) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 6 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
NetLabel: Initializing
NetLabel: domain hash size = 128
NetLabel: protocols = UNLABELED CIPSOv4
NetLabel: unlabeled traffic allowed by default
ACPI: DMAR not present
PCI-GART: No AMD northbridge found.
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 1, 8192 bytes)
TCP established hash table entries: 4096 (order: 4, 65536 bytes)
TCP bind hash table entries: 2048 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 4096 bind 2048)
TCP reno registered
audit: initializing netlink socket (disabled)
type=2000 audit(1346900734.000:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Initializing Cryptographic API
alg: No test for crc32c (crc32c-generic)
ksign: Installing public key data
Loading keyring
- Added public key BF701D582B157074
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
Limiting direct PCI/PCI transfers.
Activating ISA DMA hang workarounds.
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
irq 11: nobody cared (try booting with the "irqpoll" option)
Call Trace:
 <IRQ> [<ffffffff800befc5>] __report_bad_irq+0x30/0x7d
 [<ffffffff800bf203>] note_interrupt+0x1f1/0x232
 [<ffffffff800be6d3>] __do_IRQ+0xe4/0x15b
 [<ffffffff8006d469>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d625>] ret_from_intr+0x0/0xa
 [<ffffffff801ce219>] klist_children_get+0x0/0x9
 [<ffffffff80012571>] __do_softirq+0x51/0x133
 [<ffffffff8005e30c>] call_softirq+0x1c/0x28
 [<ffffffff8006d5de>] do_softirq+0x2c/0x7d
 [<ffffffff8005dc9e>] apic_timer_interrupt+0x66/0x6c
 <EOI> [<ffffffff801ce219>] klist_children_get+0x0/0x9
 [<ffffffff800bfd2f>] probe_irq_on+0x6e/0x151
 [<ffffffff801cb1e8>] serial8250_config_port+0x7c7/0x9c3
 [<ffffffff801c8c13>] uart_add_one_port+0xf8/0x278
 [<ffffffff801ce954>] device_add+0x34e/0x372
 [<ffffffff8048ee3f>] serial8250_init+0xdb/0x125
 [<ffffffff8046fa5e>] init+0x1f9/0x2f7
 [<ffffffff8005dfc1>] child_rip+0xa/0x11
 [<ffffffff8018a076>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff8046f865>] init+0x0/0x2f7
 [<ffffffff8005dfb7>] child_rip+0x0/0x11
handlers:
Disabling IRQ #11
Amos, the kdump kernel hangs at this point.
Guangze helped verify that a RHEL6 guest also fails on a RHEL5 host, while a RHEL6 guest works well on a RHEL6 host. So this is probably a RHEL5 host KVM bug.
*** This bug has been marked as a duplicate of bug 505527 ***