Bug 815785

Summary: kdump fails with lapic error in xen hvm guest
Product: Red Hat Enterprise Linux 6
Reporter: Qixiang Wan <qwan>
Component: kernel
Assignee: Don Zickus <dzickus>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 6.3
CC: drjones, kzhang, leiwang, moli, qguan, yuzhou
Target Milestone: rc
Keywords: Regression, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: xen
Fixed In Version: kernel-2.6.32-269.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-20 13:59:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 653816
Attachments:
  second kernel call trace with LAPIC error (Flags: none)
  second kernel call trace and continue, then reboot after "lost interrupt" error (Flags: none)

Description Qixiang Wan 2012-04-24 13:56:58 UTC
Description of problem:
When using kdump in a RHEL6.3 xen HVM guest, the second (kdump) kernel prints a call trace and hangs with the following error:

------------[ cut here ]------------
WARNING: at arch/x86/kernel/apic/apic.c:1304 setup_local_APIC+0x189/0x290() (Not tainted)
Hardware name: HVM domU
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.32-263.el6.x86_64 #1
Call Trace:
 [<ffffffff8106b6b7>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b70a>] ? warn_slowpath_null+0x1a/0x20
 [<ffffffff814f6d49>] ? setup_local_APIC+0x189/0x290
 [<ffffffff81c30402>] ? native_smp_prepare_cpus+0x2bd/0x389
 [<ffffffff81c21740>] ? kernel_init+0x112/0x2fe
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81c2162e>] ? kernel_init+0x0/0x2fe
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
---[ end trace a7919e7f17c0a725 ]---
Spurious LAPIC timer interrupt on cpu 0
do_IRQ: 0.73 No irq handler for vector (irq -1)

It is easier to reproduce when the crash is triggered while data is being copied from the guest with scp. Otherwise, the second kernel may continue booting after the above call trace and then hang with a "lost interrupt" error.

It's very likely caused by commit 0a267f9:
[x86] kdump: No need to disable ioapic in crash path (Don Zickus) [783322]

Version-Release number of selected component (if applicable):
kernel-2.6.32-263

How reproducible:
100%

Steps to Reproduce:
1. Install a xen HVM guest
2. Enable kdump with the steps in https://access.redhat.com/knowledge/solutions/92943, pasting the necessary steps for RHEL6.3 here:

[1] Add the kernel command line parameter xen_emul_unplug=never to the kernel's command line and boot. This boots using the emulated devices (and appropriate drivers) and without paravirt drivers.
[2] Start the kdump service: service kdump start. This will generate a dumprd with the drivers necessary for the emulated devices.
[3] Edit /etc/modprobe.d/blacklist.conf by adding the three lines shown below to blacklist the drivers used for the emulated devices. This will ensure that even if the host presents the emulated devices to the guest, the guest will use the paravirt drivers instead.

blacklist ata_piix
blacklist 8139too
blacklist 8139cp

[4] Remove the xen_emul_unplug=never kernel command line parameter added in step 1, add the kernel command line parameter xen_emul_unplug=unnecessary, and reboot.
[5] Ensure that the kdump service has started: service kdump status
[6] Run echo c >/proc/sysrq-trigger to force a crash that should invoke kdump
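
Step [3] above can be scripted. The following is only a minimal sketch: the function name add_blacklist_entries is hypothetical and not part of any Red Hat tooling, and it assumes a POSIX shell with grep and tail available. It appends the three blacklist lines to a file passed as an argument and skips any line that is already present, so it is safe to run more than once.

```shell
#!/bin/sh
# Hypothetical helper (not from the KB article): append the three
# blacklist entries from step [3] to the file given as $1.
add_blacklist_entries() {
    conf="$1"
    touch "$conf" || return 1
    # If the file has content but no trailing newline, add one so the
    # first appended entry does not fuse with the last existing line.
    if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi
    for drv in ata_piix 8139too 8139cp; do
        line="blacklist $drv"
        # -q: quiet, -x: match the whole line, -F: fixed string
        grep -qxF "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
    done
}
```

On the guest it would be called as root with the real path, for example: add_blacklist_entries /etc/modprobe.d/blacklist.conf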

Actual results:
The second kernel prints a call trace and hangs.

Expected results:
kdump should work

Additional info:

Comment 1 Qixiang Wan 2012-04-24 14:03:32 UTC
Created attachment 579867 [details]
second kernel call trace with LAPIC error

This error is easier to reproduce if the crash is triggered while data is being copied from the guest with scp.

Comment 2 Qixiang Wan 2012-04-24 14:05:58 UTC
Created attachment 579868 [details]
second kernel call trace and continue, then reboot after "lost interrupt" error

The guest has a chance to continue booting after the call trace (if no data is being copied from the guest with scp when the crash is triggered), but it will reboot later after a "lost interrupt" error.

Comment 3 Andrew Jones 2012-04-24 18:05:45 UTC
I've started a brew build here

https://brewweb.devel.redhat.com/taskinfo?taskID=4334583

that has 0a267f9 reverted for testing.

Comment 4 Qixiang Wan 2012-04-25 03:08:26 UTC
(In reply to comment #3)
> I've started a brew build here
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=4334583
> 
> that has 0a267f9 reverted for testing.

Tested this build, kdump works well without any call trace.

Comment 5 Andrew Jones 2012-04-25 07:54:53 UTC
Thanks for the testing, qwan!

I'll start chatting with dzickus about this.

Comment 6 RHEL Program Management 2012-04-25 08:10:07 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 7 Andrew Jones 2012-04-25 12:40:42 UTC
This brew build has the patch (hack) below to try and keep 0a267f9

https://brewweb.devel.redhat.com/taskinfo?taskID=4336690

I'm not sure if we want to do this, but I guess we can test it to see if it
even works for starters.



diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c1b0780..1ec6287 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -30,6 +30,8 @@
 #include <asm/virtext.h>
 #include <asm/iommu.h>

+#include <xen/xen.h>
+

 int in_crash_kexec;

@@ -103,6 +105,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
        cpu_emergency_svm_disable();

        lapic_shutdown();
+#if defined(CONFIG_X86_IO_APIC)
+       if (xen_hvm_domain())
+               disable_IO_APIC();
+#endif
        if (mcp55_rewrite) {
                u32 cfg;
                printk(KERN_CRIT "REWRITING MCP55 CFG REG\n");

Comment 8 Qixiang Wan 2012-04-25 12:58:00 UTC
(In reply to comment #7)
> This brew build has the patch (hack) below to try and keep 0a267f9
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=4336690
> 
> I'm not sure if we want to do this, but I guess we can test it to see if it
> even works for starters.

It works in the same environment.

Comment 9 Andrew Jones 2012-05-02 14:24:21 UTC
Don posted a 'revert 0a267f9' patch under this BZ, so kicking it to POST. He'll revisit the issue for 6.4.

Comment 10 Jarod Wilson 2012-05-02 16:21:46 UTC
Patch(es) available on kernel-2.6.32-269.el6

Comment 13 Qixiang Wan 2012-05-03 05:57:21 UTC
Verified with kernel-2.6.32-269.el6. With this build, kdump service can start
successfully and works well in xen HVM guest.

The latest build contains the following fixes:
Bug 810222 - Revert "[virt] xen: mask MTRR feature from guest BZ#750758" (fix
in -262)
Bug 811815 - [FJ6.2 Bug]: kdump service fails with the message "Kdump is
unsupported on this kernel" (fix in -266)
Bug 815785 - kdump fails with lapic error in xen hvm guest (fix in -269).

With all three of the above fixes integrated, kdump in a RHEL6.3 xen hvm guest
works well now, so I am verifying these 3 bugs together.

Test steps:

[1] Add the kernel command line parameter xen_emul_unplug=never to the kernel's
command line and boot.
[2] Start the kdump service.
[3] Blacklist the drivers used for the xen emulated devices by adding the following
three lines to /etc/modprobe.d/blacklist.conf:

blacklist ata_piix
blacklist 8139too
blacklist 8139cp

[4] Remove the xen_emul_unplug=never kernel command line parameter added in
step 1, add the kernel command line parameter xen_emul_unplug=unnecessary, then
reboot.
[5] Ensure that the kdump service has started.
[6] Run echo c >/proc/sysrq-trigger to force a crash that should invoke kdump
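
Before triggering the crash in step [6], it can help to confirm which xen_emul_unplug value the guest actually booted with, since steps [1] and [4] switch it between never and unnecessary. The sketch below is an illustration only: the function name xen_emul_unplug_value is hypothetical, and it takes the simplifying assumption that the last occurrence on the command line is the one of interest.

```shell
#!/bin/sh
# Hypothetical helper: print the value of xen_emul_unplug= found in the
# kernel command line string passed as $1. Runs in a subshell (note the
# parentheses) so its variables and "set -f" do not leak to the caller.
xen_emul_unplug_value() (
    value=""
    set -f    # disable globbing; word splitting of $1 below is intentional
    for arg in $1; do
        case "$arg" in
            # Assumption: report the last occurrence if there are several.
            xen_emul_unplug=*) value="${arg#xen_emul_unplug=}" ;;
        esac
    done
    # Prints nothing and returns non-zero when the parameter is absent.
    [ -n "$value" ] && printf '%s\n' "$value"
)
```

On a running guest it would be used as: xen_emul_unplug_value "$(cat /proc/cmdline)"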

Comment 15 errata-xmlrpc 2012-06-20 13:59:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html