Bug 505527 - [RHEL5.4 KVM]: Kdump on Intel fails because of misrouted timer IRQs
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
: ---
Assignee: Gleb Natapov
QA Contact: Han Pingtian
URL:
Whiteboard: see also bug 418501
: 784232 789228 (view as bug list)
Depends On:
Blocks: 507548 527955 Rhel5KvmTier2 591850 745153
Reported: 2009-06-12 09:03 UTC by Qian Cai
Modified: 2018-10-27 13:59 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 506863 596223 (view as bug list)
Environment:
Last Closed: 2010-07-11 13:56:58 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Kdump serial log (12.82 KB, text/plain)
2012-02-06 17:28 UTC, IBM Bug Proxy
sosreport of x3650a (3.94 MB, application/x-bzip)
2012-02-06 17:28 UTC, IBM Bug Proxy

Description Qian Cai 2009-06-12 09:03:38 UTC
Description of problem:
There seems to be no dump mechanism for KVM in RHEL5.4:

# virsh list
 Id Name                 State
----------------------------------
  3 guest-83-193.rhts.bos.redhat.com running
  4 guest-83-98.rhts.bos.redhat.com running

# virsh dump guest-83-98.rhts.bos.redhat.com vmcore
error: Failed to core dump domain guest-83-98.rhts.bos.redhat.com to vmcore
error: this function is not supported by the hypervisor: virDomainCoreDump

I have tried to use kdump on both 32-bit and 64-bit guests hosted on both Intel and AMD systems, without success.

32-bit guest -- it looks like the critical-disk probing code in the kdump initramfs waits forever for the vda device to come up.

# echo c >/proc/sysrq-trigger 
SysRq : Trigger a crashdump
Linux version 2.6.18-153.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Jun 10 17:51:46 EDT 2009
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
 BIOS-e820: 000000003fff0000 - 0000000040000000 (ACPI data)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 0000000008f5b000 (usable)
0MB HIGHMEM available.
143MB LOWMEM available.
found SMP MP-table at 000fbd10
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
Using x86 segment limits to approximate NX protection
DMI 2.4 present.
Using APIC driver default
ACPI: PM-Timer IO Port: 0xb008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:2 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] disabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] disabled)
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] disabled)
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] disabled)
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0f] disabled)
ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 1, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 10000000 (gap: 08f5b000:f70a5000)
kvm_get_tsc_khz: cpu 0, msr 0:151b001
TSC: Frequency read from the hypervisor
Detected 2992.496 MHz processor.
Built 1 zonelists.  Total pages: 36699
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hdc=cdrom memmap=exactmap memmap=640K@0K memmap=130412K@16384K elfcorehdr=146796K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdc=cdrom
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c1360000 soft=c1340000
PID hash table entries: 1024 (order: 10, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 122212k/146796k available (2153k kernel code, 8680k reserved, 896k data, 232k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay loop (skipped), value calculated using timer frequency.. 5984.99 BogoMIPS (lpj=2992496)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
Freeing SMP alternatives: 14k freed
ACPI: Core revision 20060707
CPU0: Intel QEMU Virtual CPU version 0.9.1 stepping 03
Total of 1 processors activated (5984.99 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=0 apic2=-1 pin2=-1
Brought up 1 CPUs
checking if image is initramfs... it is
Freeing initrd memory: 3130k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfb510, last bus=0
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI quirk: region b000-b03f claimed by PIIX4 ACPI
PCI quirk: region b100-b10f claimed by PIIX4 SMB
ACPI: PCI Interrupt Link [LNKA] (IRQs 5 10 11) *0, disabled.
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 10 11) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 5 *10 11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 6 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
NET: Registered protocol family 2
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
TCP established hash table entries: 8192 (order: 4, 65536 bytes)
TCP bind hash table entries: 4096 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 8192 bind 4096)
TCP reno registered
apm: BIOS not found.
audit: initializing netlink socket (disabled)
type=2000 audit(1244795036.923:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
Initializing Cryptographic API
alg: No test for crc32c (crc32c-generic)
ksign: Installing public key data
Loading keyring
- Added public key 546D3B93D1DBB4A4
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
Limiting direct PCI/PCI transfers.
Activating ISA DMA hang workarounds.
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:05: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
brd: module loaded
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
PIIX3: IDE controller at PCI slot 0000:00:01.1
PIIX3: chipset revision 0
PIIX3: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xc000-0xc007, BIOS settings: hda:pio, hdb:pio
    ide1: BM-DMA at 0xc008-0xc00f, BIOS settings: hdc:DMA, hdd:pio
hdc: QEMU DVD-ROM, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Using IPI No-Shortcut mode
ACPI: (supports<6>Time: tsc clocksource has been installed.
 S3 S4 S5)
Initalizing network drop monitor service
Freeing unused kernel memory: 232k freed
Write protecting the kernel read-only data: 4294955412k
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading scsi_mod.ko module
SCSI subsystem initialized
Loading sd_mod.ko module
Loading libata.ko module
Loading ata_piix.ko module
Loading virtio.ko module
Loading virtio_blk.ko module
Loading jbd.ko module
Loading ext3.ko module
Loading dm-mod.ko module
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel
Loading dm-log.ko module
Loading dm-mirror.ko module
Loading dm-zero.ko module
Loading dm-snapshot.ko module
Waiting for required block device discovery
Waiting for vda...input: ImExPS/2 Generic Explorer Mouse as /class/input/input0

64-bit guest -- the kdump kernel either panics (Intel) or resets immediately (AMD):

# echo c >/proc/sysrq-trigger 
SysRq : Trigger a crashdump
Linux version 2.6.18-153.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Jun 10 17:53:33 EDT 2009
Command line: ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hdc=cdrom memmap=exactmap memmap=640K@0K memmap=5264K@16384K memmap=125152K@22288K elfcorehdr=147440K memmap=64K#1048512K memmap=272K$4194032K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
 BIOS-e820: 000000003fff0000 - 0000000040000000 (ACPI data)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 0000000001524000 (usable)
 user: 00000000015c4000 - 0000000008ffc000 (usable)
 user: 000000003fff0000 - 0000000040000000 (ACPI data)
 user: 00000000fffbc000 - 0000000100000000 (reserved)
DMI 2.4 present.
No NUMA configuration found
Faking a node at 0000000000000000-0000000008ffc000
Bootmem setup node 0 0000000000000000-0000000008ffc000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0xb008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:2 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:2 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] disabled)
ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] disabled)
ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] disabled)
ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0c] disabled)
ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0d] disabled)
ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0e] disabled)
ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0f] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 00000000000a0000 - 0000000001000000
Nosave address range: 0000000001524000 - 00000000015c4000
Allocating PCI resources starting at 50000000 (gap: 40000000:bffbc000)
SMP: Allowing 16 CPUs, 14 hotplug CPUs
Built 1 zonelists.  Total pages: 32252
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hdc=cdrom memmap=exactmap memmap=640K@0K memmap=5264K@16384K memmap=125152K@22288K elfcorehdr=147440K memmap=64K#1048512K memmap=272K$4194032K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
ide_setup: hdc=cdrom
Initializing CPU#0
PID hash table entries: 512 (order: 9, 4096 bytes)
kvm_get_tsc_khz: cpu 0, msr 0:19dc001
Console: colour VGA+ 80x25
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Checking aperture...
ACPI: DMAR not present
Memory: 117872k/147440k available (2549k kernel code, 13184k reserved, 1287k data, 208k init)
Calibrating delay loop (skipped), value calculated using timer frequency.. 5984.99 BogoMIPS (lpj=2992496)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter

Version-Release number of selected component (if applicable):
RHEL5.4-Server-20090608.0
kernel-2.6.18-152.el5
kvm-83-72.el5
kexec-tools-1.102pre-72.el5

How reproducible:
always

Comment 1 Neil Horman 2009-06-12 12:16:38 UTC
Cai, I'm not sure what to do about this. I suppose we could undertake the effort to get kdump working in a KVM environment, but I have a feeling that's going to be a pretty major undertaking to get it working correctly on all platforms.  I agree that we need some dump capturing facility, but it seems like the best course of action would be to get the native facility working, in this case the core dump facility that appears to be partially available (but non-functional) via virsh.  What do you think?

Comment 2 Qian Cai 2009-06-12 16:53:03 UTC
(In reply to comment #1)
> Cai, not sure what to do about this, I suppose we could undertake the effort to
> get kdump working in a kvm environment, but I have a feeling thats going to be
> a pretty major undertaking to get it working correctly on all platforms.  

Probably, but it looks pretty good at first glance; at least we are booting into
the kdump kernel in most cases. :)

> I agree that we need some dump capturing facility, but it seems like the best
> course of action would be to get the native facility working, in this case the
> core facility that appears to be partially available (but non-functional) via
> virsh.  What do you think  

I don't know how to answer, but Stephen Tweedie's comments on kdump for Xen guests
make perfect sense to me:

https://bugzilla.redhat.com/show_bug.cgi?id=253333#c10

"...but that doesn't mean we shouldn't be trying to make HVM work as much
like baremetal as possible.  If kdump from a guest doesn't work, then
that's really a host bug.  And users may well want to have all their
systems use the same sort of dumping, such as kdump over ssh or netdump
to a centralised host, whether they are baremetal hosts or virtual
guests."

Anyway, I'll add more people I am aware of here to get their opinions.

Comment 3 Dave Anderson 2009-06-12 17:56:26 UTC
I was under the impression that kdump was supposed to "just work" 
with KVM guests.  However, a few months ago (Feb/Mar 2009), Chris Smith
of HP was working on a prototype to emulate the xen "xm dump" capability,
that would sit underneath "virsh dump".  Chris's dumpfile output was a
simple netdump/kdump-like ELF core file -- without all the crap that the
xendump format(s) use -- and he had it successfully running with the crash
utility.

The last I knew was that Chris had posted it to the qemu-devel mailing
list, and I have an old email snippet from Chris where Avi Kivity (now
w/Red Hat via Qumranet) was making comments/suggestions about his patch.

But I have no idea where it stands at this point...

As with Xen, it seems stupid (to me) to waste guest memory to support
kdump on guests, but apparently it does work in RHEL5 as an alternative to
"xm dump" at least on some architectures (?).  And my same sentiment applies
with KVM, and as a KVM user, Chris felt the same way.

Comment 4 Neil Horman 2009-06-12 18:11:54 UTC
virsh dump doesn't work yet (according to Cai's first comment).

That's why I really brought this question up, though.  I don't know much about KVM, and while I can see how kdump might 'just work' with virt technology, rarely if ever does kdump 'just work' with anything :).  Given the problem description, I can see that getting kdump to work with KVM might be a fairly massive undertaking, especially considering the varied failures on different arches.  The introduction of a dump facility through virsh suggests loosely to me that perhaps someone discovered a hindrance with kdump that made the kdump approach to dump collection prohibitive or impossible.  I really don't know.  But it raises the question in my mind: should we be putting our efforts here toward getting kdump working, or toward getting the KVM dump mechanism working?

Comment 5 Dave Anderson 2009-06-12 18:28:04 UTC
> virsh dump doesn't work yet

I don't know about the "yet" part -- it appears it's simply not supported at all:

> # virsh dump guest-83-98.rhts.bos.redhat.com vmcore
> error: Failed to core dump domain guest-83-98.rhts.bos.redhat.com to vmcore
> error: this function is not supported by the hypervisor: virDomainCoreDump

I know less about KVM than you, but the fact that there is no KVM host-based
dumping facility is what led Chris Smith down the path of implementing it
himself.  I don't recall him having a problem with kdump, though. 

FWIW, I just fired off an email to him to see what's up.

Comment 6 Neil Horman 2009-06-12 19:04:05 UTC
>I don't know about the "yet" part -- it appears it's simply not supported at
>all:

True, but virsh at least recognizes the command, and the kvm kernel bits know that the entry point for that code is simply missing.  So I'm hoping that filling in those blanks isn't too hard.

>I know less about KVM than you
Don't be so sure :)


>FWIW, I just fired off an email to him to see what's up.  
That's probably the best idea.  Let's figure out what the upstream intent and direction is here.  That will help us figure out what we ought to do.

Comment 7 Chris Lalancette 2009-06-15 09:18:33 UTC
So, I'll add a couple of things here.

1)  We really should get the "virsh dump" ability working with KVM.  This is one easy way to get debugging information.  I'm wondering if a whole new dumpfile type thing is overkill, though.  We already have a way to migrate guest CPU and memory state to a file (it's how we implement virsh save).  I wonder if we could just implement virsh dump the same way, and then crash could be taught to look at that kind of file.  What I don't know is if there is enough information in those types of files for crash to function reasonably, although I can easily provide a save file to Dave for poking around.  That being said..

2)  I believe we *do* want kdump to work properly inside KVM guests.  There are a couple of reasons for this:

 a)  There's currently no easy way to "automatically" trigger a crashdump from a KVM guest.  That is, if the guest crashes without kdump configured, it is (usually) just sitting there spinning a CPU.  That's fine, but KVM doesn't really have a way to distinguish this situation from an OS that is just working very very hard.  Because of this, crash dumping would be a manual operation, which is not ideal.  Note that Xen fully virtualized guests have this same problem, since kdump doesn't work reliably there.
 b)  Customers who are already using kdump in their infrastructure would probably prefer to have a consistent way to get coredumps, including their virtualization.  So getting kdump working under KVM would maintain feature parity with bare-metal.

The good news about getting kdump working in a KVM guest is that the "hardware" is limited, and we have the ability to change both the guest kernel (for bug fixes), as well as the virtual hardware.  Someone just needs to go through it and see where the problems are.

Chris Lalancette

Comment 8 Neil Horman 2009-06-15 11:01:08 UTC
Ok, Chris, well let's try to tackle the hardest problem first.  The 64 bit guest seems to have a problem setting up the system timer.  Usually that means that the 8254 timer can't deliver interrupts to the OS, which in turn suggests that the guest can't configure either its ioapic or its lapic properly.  How does KVM emulate those pieces of hardware?  Anything you can point me to?

Comment 9 Chris Lalancette 2009-06-15 11:20:50 UTC
(In reply to comment #8)
> Ok, Chris, well let's try to tackle the hardest problem first.  The 64 bit guest
> seems to have a problem setting up the system timer.  Usually that means that
> the 8254 timer can't deliver interrupts to the OS, which in turn suggests that
> the guest can't configure either its ioapic or its lapic properly.  How does
> KVM emulate those pieces of hardware?  Anything you can point me to?  

There are actually two ways that KVM can emulate the timers and *APICs.  One is through the Qemu device model, where all timers are emulated in the userspace process.  The other way is to emulate these timers in-kernel, which is the default, although which method is in-use is going to depend on the KVM command-line invocation.  Cai, can you supply us with the logs from /var/log/libvirt/qemu/<guest>?  That will tell us which is in use.

For the moment, I'm going to assume all of this is emulated in-kernel, since that is the default.  That being the case, the 8254 is emulated in the kvm bits under kernel/i8254.c.  The ioapic emulation is under kernel/ioapic.c, and the lapic stuff is under kernel/lapic.c.

Chris Lalancette

Comment 11 Dave Anderson 2009-06-15 12:30:35 UTC
> We already have a way to migrate guest CPU and memory state to a file
> (it's how we implement virsh save).  I wonder if we could just implement
> virsh dump the same way, and then crash could be taught to look at that
> kind of file.  What I don't know is if there is enough information in
> those types of files for crash to function reasonably...

Chris -- do you have a pointer to something that describes the saved file format?

Comment 12 Chris Lalancette 2009-06-15 13:19:23 UTC
(replying to original report, and Neil...)

Even more interesting, I just tried this on my own hardware, and it seemed to work.  That is, my host machine is an AMD RevF machine running 2.6.18-153.el5.x86_64 and kvm-83-74, and my guest is also running 2.6.18-153.el5.x86_64.  With this combination, I was able to successfully get a core with kdump inside the guest.  So that means that 1) the bug was in KVM and got fixed between -72 and -74, 2) the bug is machine specific (although it's all virtual hardware, so that would be a bit weird), or 3) Cai's test used a different set of configuration options.  We'll have to get more information about the original test to try to reproduce.

Cai, we need the original logs, plus we need to know which machine(s) the original tests were conducted on.

Neil, you might want to hold off looking at this until we can get a reliable reproducer (or confirm it is fixed in kvm).

Thanks,
Chris Lalancette

Comment 13 Chris Lalancette 2009-06-15 13:30:52 UTC
(In reply to comment #11)
> > We already have a way to migrate guest CPU and memory state to a file
> > (it's how we implement virsh save).  I wonder if we could just implement
> > virsh dump the same way, and then crash could be taught to look at that
> > kind of file.  What I don't know is if there is enough information in
> > those types of files for crash to function reasonably...
> 
> Chris -- do you have a pointer to something that describes the saved file
> format?  

Dave,

Unfortunately, I don't really know of any documentation about it.  From what I remember, it's more or less just a binary file that contains a list of devices and their state at the time of the migration/save.  Since CPU and memory are considered "devices", they dump their data to the file just like anything else.  The only canonical place I know for this state is in the qemu source itself.

I know you had done something to read Xen save files in the past; this might be similar, although I don't really know.

If this is not enough information for crash to be able to decipher what's going on, then I'd be interested in picking up the previous work and trying to get that working.  Do you have pointers to the mailing list entries?

Thanks,
Chris Lalancette

Comment 14 Qian Cai 2009-06-15 13:50:46 UTC
The original tests were conducted on one Intel and one AMD RHTS host.

dell-pe2900-01.rhts.bos.redhat.com (Intel(R) Xeon(R) CPU 5160 @ 3.00GHz)
amd-ma78gm-01.rhts.bos.redhat.com (AMD Phenom(tm) 9750 Quad-Core Processor)

More system information:
http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=17
http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=13

One 32-bit and one 64-bit guest were installed on each host using similar commands:
(32-bit)
virt-install --name guest-81-61.rhts.bos.redhat.com --mac 00:16:3E:51:3D:56 --cdrom /var/lib/libvirt/images/guest-81-61.rhts.bos.redhat.com.iso --ram=1024 --vcpus=1 --file-size=20 --hvm --extra-args ks=http://lab.rhts.bos.redhat.com/kickstarts/hosts/guest-81-61.rhts.bos.redhat.com/ks.cfg --prompt --accelerate --os-variant=virtio26  --noreboot

(64-bit)
virt-install --name guest-83-63.rhts.bos.redhat.com --mac 00:16:3E:53:3F:E1 --cdrom /var/lib/libvirt/images/guest-83-63.rhts.bos.redhat.com.iso --ram=1024 --vcpus=2 --file-size=20 --hvm --extra-args ks=http://lab.rhts.bos.redhat.com/kickstarts/hosts/guest-83-63.rhts.bos.redhat.com/ks.cfg --prompt --accelerate --os-variant=virtio26  --noreboot

/var/log/libvirt/qemu/<guest> logs can be found at RHTS pages,
Intel host: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8508117
AMD host: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8508672

Let me know if you need anything else.

Comment 15 Dave Anderson 2009-06-15 14:07:11 UTC
> Unfortunately, I don't really know of any documentation about it.  From what I
> remember, it's more or less just a binary file that contains a list of devices
> and their state at the time of the migration/save.  Since CPU and memory are
> considered "devices", they dump their data to the file just like anything
> else.  The only canonical place I know for this state is in the qemu source
> itself.

In the qemu-list thread (link below), a Paul Brooks suggested that the
post-processing of the existing snapshot/savevm mechanism into a "whatever
format you want" might be a better idea than Chris's new proposal, to which
Avi Kivity responded:

  "savevm falls into the poorly documented category, I'm afraid. But it does
   have the advantage of carrying device state, not just cpu and memory state,
   which might be useful in extreme situations."

> I know you had done something to read Xen save files in the past; this
> might be similar, although I don't really know.

I did  -- and I guess it still works -- but I have to believe that
whatever KVM does, the output file looks nothing like the xen scheme,
and really won't help much at all.

> Do you have pointers to the mailing list entries?

Anyway, I haven't heard back from Chris Smith, but here's the original post:

  http://lists.gnu.org/archive/html/qemu-devel/2009-03/msg01159.html

Comment 16 Neil Horman 2009-06-15 18:25:51 UTC
Thanks Chris, that helps.  It explains right off the bat why we're not booting a kdump kernel all the way.  We're panicking on boot in the timer init code because of the behavior described in this comment from __inject_pit_timer_intr:
 /*
         * Provides NMI watchdog support via Virtual Wire mode.
         * The route is: PIT -> PIC -> LVT0 in NMI mode.
         *
         * Note: Our Virtual Wire implementation is simplified, only
         * propagating PIT interrupts to all VCPUs when they have set
         * LVT0 to NMI delivery. Other PIC interrupts are just sent to
         * VCPU0, and only if its LVT0 is in EXTINT mode.
         */

We only send PIT interrupts to VCPUs which have LVT0 set to EXTINT mode, which we do not after a kdump crash iirc.  This is a bit like the mcp55 bug I just handled, except there it was a secret southbridge register that steered the PIT interrupts to the BSP.  My guess is that we're crashing on VCPU X here, with X != 0, and so we're not receiving timer interrupts because that VCPU doesn't have its lapic LVT0 pin set to NMI delivery.

Chris, what are your thoughts on fixing this?  Can we simply extend the PIT interrupt delivery to all CPUs that have LVT0 set to EXTINT mode (which I think is how bare metal handles this)?

Comment 17 Eduardo Habkost 2009-06-15 20:55:48 UTC
I tested kdump on KVM guests a few months ago (upstream code, not RHEL), and I had issues when the kdump kernel was booting on any CPU except vcpu0.

At that time, there were lots of "if (vcpu==0)" checks in the in-kernel timer code. I don't know if it looks better today. I hope so, but I am not too optimistic.

Comment 20 Chris Lalancette 2009-06-17 09:26:16 UTC
(In reply to comment #16)
> Thanks chris, That helps.  It actually right off the bat explains why we're not
> booting a kdump kernel all the way.  We're panic-ing on boot in the timer init
> code because of this comment from __inject_pit_timer_intr:
>  /*
>          * Provides NMI watchdog support via Virtual Wire mode.
>          * The route is: PIT -> PIC -> LVT0 in NMI mode.
>          *
>          * Note: Our Virtual Wire implementation is simplified, only
>          * propagating PIT interrupts to all VCPUs when they have set
>          * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>          * VCPU0, and only if its LVT0 is in EXTINT mode.
>          */
> 
> We only send PIT interrupts to VCPUs which have LVT0 set to EXTINT mode, which
> we do not after a kdump crash iirc.  This is a bit like the mcp55 bug I just
> handled, except there it was a secret southbridge register that steered the PIT
> interrupts to the BSP.  My guess is that we're crashing on VCPU X here, with
> X != 0, and so we're not receiving timer interrupts because that VCPU doesn't
> have its lapic LVT0 pin set to NMI delivery.
> 
> Chris, what are your thoughts on fixing this?  Can we simply extend the PIT
> interrupt delivery to all CPUs that have LVT0 set to EXTINT mode (which I think
> is how bare metal handles this)?  

Yeah, that might be reasonable, although I'm not quite sure why it was done this way in the first place.  Avi, do you have any thoughts about this?

Chris Lalancette

Comment 22 Neil Horman 2009-06-17 18:11:27 UTC
So, when we call machine_crash_shutdown while the system is running normally (in preparation for a jump to the kdump kernel image), we call disable_io_APIC, which additionally writes to the local APIC config for the crashing CPU, placing LVT0 in EXTINT mode.  Since the 8254 timer only delivers interrupts to CPUs whose LVT0 pin is set to NMI mode, we miss the timer interrupts.  Is there any reason that we can't augment __inject_pit_timer_intr such that, if no VCPUs are in NMI mode, we simply call kvm_apic_local_deliver, to emulate a system which has only one CPU (as kdump will behave in such a way)?  That, I think, should allow the booting VCPU (even if it is not vcpu0) to receive interrupts from the PIC, as its LVT0 pin is in EXTINT mode.
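
The fallback proposed above could be modeled roughly as follows. This is a standalone toy model, not KVM code: `enum lvt0_mode`, `struct toy_vcpu`, and `toy_inject_pit()` are hypothetical names, and the real `__inject_pit_timer_intr` and `kvm_apic_local_deliver` have different signatures. It also ignores the separate PIC path to VCPU0. It only illustrates the decision being suggested, as I read it: deliver to every VCPU with LVT0 in NMI mode, and if there is none, fall back to VCPUs whose LVT0 is in EXTINT mode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified LVT0 delivery modes (not the KVM definitions). */
enum lvt0_mode { LVT0_MASKED, LVT0_NMI, LVT0_EXTINT };

struct toy_vcpu {
    enum lvt0_mode lvt0;  /* how this VCPU's LVT0 entry is programmed */
    unsigned pit_ticks;   /* timer interrupts this VCPU has received */
};

/* Deliver one PIT tick; returns the number of VCPUs that received it. */
size_t toy_inject_pit(struct toy_vcpu *vcpus, size_t n)
{
    size_t delivered = 0;

    assert(vcpus != NULL || n == 0);

    /* Behaviour described in the thread: only NMI-mode VCPUs get the tick. */
    for (size_t i = 0; i < n; i++) {
        if (vcpus[i].lvt0 == LVT0_NMI) {
            vcpus[i].pit_ticks++;
            delivered++;
        }
    }
    if (delivered > 0)
        return delivered;

    /*
     * Proposed fallback (assumption): no VCPU is in NMI mode, so deliver to
     * any VCPU whose LVT0 is in EXTINT mode, even if it is not VCPU0.
     */
    for (size_t i = 0; i < n; i++) {
        if (vcpus[i].lvt0 == LVT0_EXTINT) {
            vcpus[i].pit_ticks++;
            delivered++;
        }
    }
    return delivered;
}
```

In the kdump case, a two-VCPU guest that crashed on VCPU1 would have VCPU1 in EXTINT mode and no VCPU in NMI mode, so the fallback path is the one that delivers the tick.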

Comment 24 Munehiro IKEDA 2009-06-18 19:54:16 UTC
(In reply to comment #0)
> I have tried to use kdump on both 32-bit and 64-bit guests hosted on both Intel
> and AMD systems, but there is no dice.
> 
> 32-bit guest -- looks like the probing critical disks code in kdump initramfs
> is waiting forever for vda device coming up.

In my trial, this was caused by the lack of kernel modules in the initrd for the kdump kernel.
If we use virtio disks (/dev/vd*) in the guest, we need the 4 modules below to detect them.

  - virtio.ko
  - virtio_blk.ko
  - virtio_ring.ko
  - virtio_pci.ko

However, only the first 2 modules are included in the initrd generated by the current mkdumprd, because depmod reports the dependency only between virtio and virtio_blk, not between virtio_blk and virtio_ring or virtio_pci.
(I don't know why, but I guess they are dependent in theory but not from the viewpoint of the implementation.)

If I generate the initrd in a guest as follows, kdump succeeds.
  # mkdumprd -d  --with=virtio_pci --with=virtio_ring initrd-`uname -r`kdump.img `uname -r`

So, I think there are 2 options to solve this problem.

(1) Modify mkdumprd to include necessary modules if virtio devices are connected.
(2) Make virtio_blk.ko depend on virtio_pci.ko explicitly.
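
As a rough illustration of the check that option (1) implies, the sketch below compares a list of modules already in an initrd against the four virtio modules named above. It is a standalone model only: `virtio_required` and `missing_virtio_modules()` are hypothetical names, and nothing here reflects how mkdumprd is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The four modules listed above as needed to detect /dev/vd* disks. */
const char *const virtio_required[] = {
    "virtio", "virtio_blk", "virtio_ring", "virtio_pci"
};
#define VIRTIO_REQUIRED_COUNT \
    (sizeof(virtio_required) / sizeof(virtio_required[0]))

/*
 * Hypothetical helper: copy into `missing` (capacity VIRTIO_REQUIRED_COUNT)
 * every required module that is absent from `included`, and return how many
 * were missing.
 */
size_t missing_virtio_modules(const char *const *included, size_t n_included,
                              const char *missing[VIRTIO_REQUIRED_COUNT])
{
    size_t n_missing = 0;

    assert(included != NULL || n_included == 0);
    assert(missing != NULL);

    for (size_t r = 0; r < VIRTIO_REQUIRED_COUNT; r++) {
        int found = 0;

        for (size_t i = 0; i < n_included; i++) {
            if (strcmp(included[i], virtio_required[r]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found)
            missing[n_missing++] = virtio_required[r];
    }
    return n_missing;
}
```

For an initrd that contains only virtio and virtio_blk, the helper reports virtio_ring and virtio_pci as missing, which are the two modules passed via --with= in the command above.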

My test conditions were as follows.

Arch: x86_64 (both of host/guest)
Package versions:
  kernel-2.6.18-152.el5
  kexec-tools-1.102pre-70.el5
  kvm-83-72.el5
  kmod-kvm-83-72

Comment 25 Neil Horman 2009-06-18 20:18:23 UTC
Munehiro, I would request that you open a separate bugzilla for that problem.  We're tracking at least two issues here, and we've been primarily focusing on the apic issue in kvm with this bug.  If you could open a separate bz  for the virtio drivers, I'd appreciate it.  Thanks!

Comment 28 Chris Lalancette 2009-06-23 09:52:16 UTC
OK.  There are actually at least 4 separate bugs in the core dumping capabilities of KVM.  I don't want to get them all jumbled together, so I've opened separate bugs for each of the problems, and also opened a tracker bug as:

https://bugzilla.redhat.com/show_bug.cgi?id=507548

I'll use *this* bug to track the misrouted IRQ problem.

Chris Lalancette

Comment 30 Paolo Bonzini 2009-08-17 14:05:57 UTC
The misrouted IRQ problem also afflicts Xen HVM.

Comment 31 Chris Lalancette 2009-10-08 12:07:04 UTC
Just as a quick update here, I've been working my way through various issues that this exposed.  There are a few problems I've found while going over the KVM codebase:

1)  There are some odd places where something like the i8254 (PIT timer) chip directly causes an interrupt to be placed on the BSP's queue of interrupts.  This doesn't seem conceptually correct, though; the i8254 should raise an interrupt either with the i8259 (PIC), or with the IOAPIC, which should then, in turn, raise an interrupt on the appropriate VCPU(s).

2)  Leaving 1) aside, after reading the Intel MPS specification, there are 4 modes you can use for SMP interrupt routing (even though the paper only mentions 3, there is a 4th implied one):

a)  PIT -> PIC -> BSP int - this matches the old DOS-style, and is only suitable for single processor systems.  This is one mode KVM implements today.

b)  PIT -> PIC -> BSP LAPIC - this mostly matches the old DOS-style, but with the BSP LAPIC placed into "Virtual Wire" mode.  However, there is a subtlety here that kdump takes advantage of, but which is *not* implemented in KVM.  The subtlety is that although this is only guaranteed to be routed to the BSP LAPIC, in practice most motherboards have a single electrical wire connected to *all* of the LAPICs.  Therefore, if any of the LAPICs are in "Virtual Wire" mode, they can also receive the timer interrupts even though they are not the BSP.  KVM does not implement this, which is the main reason why kdump doesn't work.

c)  PIT -> IOAPIC -> BSP int - in this mode, the IOAPIC is placed into "Virtual Wire" mode (confusingly, the same terminology as the LAPIC), and acts more or less just like a standard PIC.  All interrupts are routed to the BSP interrupt pin.

d)  PIT -> IOAPIC -> LAPIC - this is the mode used for SMP.  When you have multiple processors operating, the interrupts go to the IOAPIC, which then arbitrates among them and delivers the interrupts.  (I'm still not entirely clear how it decides the priority of interrupts and/or which LAPIC to deliver to, but I'll have to do more reading).

So given the above, I'm trying to come up with patches that cause the i8254 to route interrupts the right way (i.e. through the PIC or IOAPIC), and that also implement the subtlety of solution b).
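
To make the subtlety in mode b) concrete, here is a minimal standalone sketch, not taken from the KVM sources. The function name `mode_b_receives()` and its parameters are hypothetical, and CPU 0 is assumed to be the BSP. With `shared_wire` set, it models the typical motherboard behaviour described above; with it clear, it models delivery to the BSP LAPIC only, which is the behaviour attributed to KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Does CPU `cpu` see the PIT interrupt in mode b) (PIT -> PIC -> LAPIC)?
 *
 * virtual_wire: this CPU's LAPIC is in "Virtual Wire" mode.
 * shared_wire:  the PIC output is wired to every LAPIC (typical hardware,
 *               per the description above) rather than to the BSP only.
 *
 * Assumption: CPU 0 is the BSP.
 */
bool mode_b_receives(size_t cpu, bool virtual_wire, bool shared_wire)
{
    if (!virtual_wire)
        return false;          /* LAPIC does not pass the interrupt on */

    return shared_wire || cpu == 0;
}
```

A kdump kernel booting on CPU 1 with its LAPIC in Virtual Wire mode gets timer interrupts in the shared-wire case but not in the BSP-only case, which matches the failure discussed in this bug.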

Chris Lalancette

Comment 32 Dor Laor 2009-11-29 14:51:39 UTC
Chris, any update? will we make it to rhel5.5?

Comment 33 Chris Lalancette 2009-11-30 16:25:14 UTC
(In reply to comment #32)
> Chris, any update? will we make it to rhel5.5?  

Sorry, just getting back to it :(.  It seems unlikely, given the scope of the patches now, but I'll do my best.

Chris Lalancette

Comment 35 Simon Grinberg 2010-03-16 12:33:15 UTC
Reoccurred to a 5.5beta customer, added the Issue Tracker number

Version:
[RHEL5.5 GA SnapShot2]

Comment 36 Issue Tracker 2010-03-16 12:38:51 UTC
Event posted on 03-16-2010 08:38am EDT by sgrinber

Masaki-San,

Thank you for raising our attention to this. 
I've connected this ticket to the existing BZ. 

Additionally, we'll need to recheck this one just prior to beta, and if it
isn't fixed by 5.5 GA we'll need to have a kbase and release note in
place.

Best regards,
Simon. 


This event sent from IssueTracker by sgrinber 
 issue 585773

Comment 37 Simon Grinberg 2010-03-16 12:39:48 UTC
Reoccurred to a 5.5beta customer, added the Issue Tracker number

Version:
[RHEL5.5 GA SnapShot2]

Comment 38 Issue Tracker 2010-03-16 12:44:22 UTC
Event posted on 03-04-2010 09:47am EST by mfuruta

Dear SEG,

I would like to escalate this, because this bug occurred during the beta
period and NEC reported that there is a BZ regarding this issue.

> SEG Escalation Template
>
> All Issues: Problem Description
> ---------------------------------------------------
> 1. Time and date of problem:

This occurred in a test environment at the customer's site.

> 2. System architecture(s):

KVM Host:
RHEL5.5 beta (2.6.18-186.el5) on x86_64, x86
RHEL5.5 Snapshot2(2.6.18-189.el5) on x86_64, x86

KVM Guest:
RHEL5.5 beta (2.6.18-186.el5) on x86_64, x86

> 3. Provide a clear and concise problem description as it is understood
at the time of escalation. Please be as specific as possible in your
description. Do not use the generic term "hang", as that can mean many
things.
>    Observed behavior:

Kdump fails because the second kernel fails to start on the KVM guest, and
the following messages are output.

-----
SysRq : Trigger a crashdump
Memory for crash kernel(0x0 to 0x0) notwithin permissible range
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
'noapic' kernel parameter
------

This problem occurs on KVM guests with multiple processors and does not
occur with a single processor.

This problem can be worked around by adding 'noapic' kernel parameter to
KDUMP_COMMANDLINE of /etc/sysconfig/kdump.

>    Desired behavior:

 kdump can execute normally without workaround.

> 4. Specific action requested of SEG:
>
> 5. Is a defect (bug) in the product suspected? yes/no
>    Bugzilla number (if one already exists):

This problem has already been registered to bugzilla.
   BZ#505527: [RHEL5.4 KVM]: Kdump on Intel fails because of misrouted
timer IRQs

> 6. Does a proposed patch exist? yes/no
>    If yes, attach patch, making sure it is in unified diff format (diff
-pruN)
> 7. What is the impact to the customer when they experience this problem?
This is especially important for severity one and two issues:
>    Example: "This system houses our accounts payable database. When the
system crashes we are unable to process payroll, and other payable
functions. This is especially critical as we approach end of our
quarter."

 kdump does not work on KVM guests with multiple processors using the
default configuration.

> All Issues: Supporting Information
> ------------------------------------------------------
> 1. Other actions already taken in working the problem (tech-list
posting, google searches, fulltext search, consultation with another
engineer, etc.):
>    Relevant data found (if any):
> 2. Attach sosreport.

 See attached file for sosreport.
 Host  : sosreport-ClassicB-91994-5d671d.tar.bz2
 Guest : sosreport-ClassicB-VM1-9313-d42ca8.tar.bz2

> 3. Attach other supporting data (if any).
>
> 4. Provide issue reproduction information, including location and access
of reproducer machine, if available.
>    Location and access information for reproducer machine:
>    Steps to reproduce the problem:
>       or
>    Ticket update containing steps to reproduce the problem:

Steps to Reproduce:
1. Configure kdump and start the service.
2. Run the command "echo c > /proc/sysrq-trigger"

Actual results:
 kdump fails before the second kernel starts.

-----
SysRq : Trigger a crashdump
Memory for crash kernel(0x0 to 0x0) notwithin permissible range
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
'noapic' kernel parameter
------

> 5. Known hot-fix packages on the system:
> 6. Customer applied changes from the last 30 days:

Thank you in advance.

Best Regards,
Masaki Furuta



Issue escalated to Support Engineering Group by: mfuruta.
Category set to: Kernel
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by sgrinber 
 issue 585773

Comment 44 Ronen Hod 2012-02-06 17:14:27 UTC
*** Bug 784232 has been marked as a duplicate of this bug. ***

Comment 45 IBM Bug Proxy 2012-02-06 17:28:05 UTC
Created attachment 559698 [details]
Kdump serial log

Comment 46 IBM Bug Proxy 2012-02-06 17:28:17 UTC
Created attachment 559699 [details]
sosreport of x3650a

Comment 47 Dave Young 2012-09-10 08:29:21 UTC
*** Bug 789228 has been marked as a duplicate of this bug. ***

