Bug 839966

Summary: Trigger RHEL7 crash in guest domU, host don't generate core file
Product: Red Hat Enterprise Linux 7
Reporter: Wei Shi <wshi>
Component: kernel
Assignee: Vitaly Kuznetsov <vkuznets>
kernel sub component: Xen
QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: drjones, leiwang, lersek, lkong, qwan, shwang
Version: 7.0
Keywords: EC2
Target Milestone: rc
Hardware: x86_64
OS: Linux
Whiteboard: xen
Fixed In Version: kernel-3.10.0-137.el7
Doc Type: Bug Fix
Last Closed: 2015-03-05 11:28:50 UTC
Type: Bug
Bug Blocks: 741684

Description Wei Shi 2012-07-13 10:15:44 UTC
Description of problem:
After triggering a crash in a RHEL7 guest, the crash information is dumped to the guest's console, but the host does not seem to catch the crash: no core dump file is generated in the /var/lib/xen/dump directory.

I also ran this case on the same host, with the same xend-config.sxp, against a RHEL6.3 (2.6.32-278.el6.x86_64) HVM guest. RHEL6.3 works fine: a core file is generated on the host.

Version-Release number of selected component (if applicable):
Host: RHEL5.8 2.6.18-318.el5xen x86_64
Guest: RHEL7.0 3.3.0-0.20.el7 HVM x86_64

How reproducible:
100%

Steps to Reproduce:
1. Check config items
xend-config.sxp
(enable-dump yes)

xen-hvm-guest-el7.cfg
on_crash = "restart"

2. Launch rhel7 guest

3. Trigger guest crash (guest)
[root@rhel7 ~]# echo c > /proc/sysrq-trigger
[   43.625628] SysRq : Trigger a crash
[   43.626007] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   43.626007] IP: [<ffffffff813e8256>] sysrq_handle_crash+0x16/0x20
[   43.626007] PGD 3998f067 PUD 394b4067 PMD 0
[   43.626007] Oops: 0002 [#1] SMP
[   43.626007] CPU 0
[   43.626007] Modules linked in: 8139too xen_netfront 8139cp pcspkr i2c_piix4 mii i2c_core ata_generic pata_acpi xen_blkfront ata_piix libata floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[   43.626007]
[   43.626007] Pid: 639, comm: bash Not tainted 3.3.0-0.20.el7.x86_64 #1 Red Hat HVM domU
[   43.626007] RIP: 0010:[<ffffffff813e8256>]  [<ffffffff813e8256>] sysrq_handle_crash+0x16/0x20
[   43.626007] RSP: 0018:ffff880039e85e28  EFLAGS: 00010096
[   43.626007] RAX: 0000000000000010 RBX: ffffffff819dcfa0 RCX: 0000000000000001
[   43.626007] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000063
[   43.626007] RBP: ffff880039e85e28 R08: 0000000000000000 R09: 0000000000000000
[   43.626007] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000063
[   43.626007] R13: 0000000000000282 R14: 0000000000000000 R15: 0000000000000007
[   43.626007] FS:  00007f127540f740(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[   43.626007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.626007] CR2: 0000000000000000 CR3: 000000003998d000 CR4: 00000000000006f0
[   43.626007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   43.626007] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   43.626007] Process bash (pid: 639, threadinfo ffff880039e84000, task ffff88003befcd20)
[   43.626007] Stack:
[   43.626007]  ffff880039e85e68 ffffffff813e89b7 ffff880039e85e68 0000000000000002
[   43.626007]  ffff88003950a840 ffffffff813e8a20 ffff880038998080 ffff880039e85f50
[   43.626007]  ffff880039e85e98 ffffffff813e8a6a ffff880039e85e98 00007f1275414000
[   43.626007] Call Trace:
[   43.626007]  [<ffffffff813e89b7>] __handle_sysrq+0x127/0x190
[   43.626007]  [<ffffffff813e8a20>] ? __handle_sysrq+0x190/0x190
[   43.626007]  [<ffffffff813e8a6a>] write_sysrq_trigger+0x4a/0x50
[   43.626007]  [<ffffffff812279f0>] proc_reg_write+0x80/0xc0
[   43.626007]  [<ffffffff811bd35f>] vfs_write+0xaf/0x190
[   43.626007]  [<ffffffff811bd69d>] sys_write+0x4d/0x90
[   43.626007]  [<ffffffff8166ba29>] system_call_fastpath+0x16/0x1b
[   43.626007] Code: d0 88 81 63 2a 95 82 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 c7 05 1d 64 56 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 55 48 89 e5 53 48 83 ec 08 66 66
[   43.626007] RIP  [<ffffffff813e8256>] sysrq_handle_crash+0x16/0x20
[   43.626007]  RSP <ffff880039e85e28>
[   43.626007] CR2: 0000000000000000
[   43.626007] ---[ end trace 9cd54253aac3e4d4 ]---
[   43.626007] Kernel panic - not syncing: Fatal exception

4. No core dump file is generated (host)
[root@dhcp-8-204 ~]# xm li
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     4858     8 r-----   2657.5
hvm-guest-el7                             57     1032     1 r-----    144.2
[root@dhcp-8-204 ~]# ls /var/lib/xen/dump/
[root@dhcp-8-204 ~]# 

Actual results:
No dump file is generated, and the guest domain is still in the running state.

Expected results:
The host generates a core dump file, and the guest action matches the on_crash value in the cfg file.
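
The two configuration checks from step 1 can be scripted before running the test. The following is a minimal sketch, not part of the original report: the helper names dump_enabled and on_crash_action are hypothetical, and the file paths are passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of xend or xm.

# Succeeds if the given xend config file contains an uncommented
# "(enable-dump yes)" line.
dump_enabled() {
    grep -Eq '^[[:space:]]*\(enable-dump[[:space:]]+yes\)' "$1"
}

# Prints the quoted on_crash value from a guest config file,
# e.g. "restart". If the key appears more than once, the last one wins.
on_crash_action() {
    sed -n 's/^[[:space:]]*on_crash[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | tail -n 1
}
```

For the setup in this report, one would expect `dump_enabled /etc/xen/xend-config.sxp` to succeed and `on_crash_action xen-hvm-guest-el7.cfg` to print `restart`.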

Additional info:

Comment 2 Laszlo Ersek 2012-07-13 11:30:02 UTC
I'm sorry but precisely as reported this seems NOTABUG.

(In reply to comment #0)

> xen-hvm-guest-el7.cfg
> on_crash = "restart"

> 4. no dump core file generated(host)
> [root@dhcp-8-204 ~]# xm li
> Name                                      ID Mem(MiB) VCPUs State   Time(s)
> Domain-0                                   0     4858     8 r-----   2657.5
> hvm-guest-el7                             57     1032     1 r-----    144.2
> [root@dhcp-8-204 ~]# ls /var/lib/xen/dump/
> [root@dhcp-8-204 ~]# 
> 
> Actual results:
> no dump file generated, and guest domain is still in running status

The domid is quite high (57) which does not exclude at all that the domain was simply restarted (= new domain booted with the same guest config).

> Expected results:
> host generate a dump core file, guest action is match on_crash's value in
> cfg file

These two are contradictory in this exact case (see on_crash="restart" above); the second requirement is fulfilled (xend action matches on_crash setting).
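
Whether the domain was restarted can be settled by recording the domain ID before the crash and comparing it afterwards. A minimal sketch, assuming the `xm list` column layout shown above (name in the first column, ID in the second); the helper name domid_of is hypothetical:

```shell
#!/bin/sh
# Sketch only: reads `xm list` output on stdin and prints the ID of the
# domain whose name is given as $1. Prints nothing if the name is absent.
domid_of() {
    awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Example use on the host (not executed here):
#   before=$(xm list | domid_of hvm-guest-el7)
#   ... trigger the crash in the guest, wait a few seconds ...
#   after=$(xm list | domid_of hvm-guest-el7)
#   [ "$before" != "$after" ] && echo "domain was restarted"
```

An unchanged ID after the panic means xend never acted on the crash, while a new ID means the on_crash action was carried out.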

Comment 3 Laszlo Ersek 2012-07-13 11:39:19 UTC
Hmmm, I may be wrong. enable-dump seems orthogonal.

Comment 4 Wei Shi 2012-07-16 01:45:11 UTC
(In reply to comment #2)
> I'm sorry but precisely as reported this seems NOTABUG.
> 
> (In reply to comment #0)
> 
> > xen-hvm-guest-el7.cfg
> > on_crash = "restart"
> 
> > 4. no dump core file generated(host)
> > [root@dhcp-8-204 ~]# xm li
> > Name                                      ID Mem(MiB) VCPUs State   Time(s)
> > Domain-0                                   0     4858     8 r-----   2657.5
> > hvm-guest-el7                             57     1032     1 r-----    144.2
> > [root@dhcp-8-204 ~]# ls /var/lib/xen/dump/
> > [root@dhcp-8-204 ~]# 
> > 
> > Actual results:
> > no dump file generated, and guest domain is still in running status
> 
> The domid is quite high (57) which does not exclude at all that the domain
> was simply restarted (= new domain booted with the same guest config).
> 
> > Expected results:
> > host generate a dump core file, guest action is match on_crash's value in
> > cfg file
> 
> These two are contradictory in this exact case (see on_crash="restart"
> above); the second requirement is fulfilled (xend action matches on_crash
> setting).

Sorry, I forgot to mention that no reboot is happening; domid 57 is the original crashed domU, and no new domU is launched.
That's why I said it seems dom0 never catches the crash signal from domU.

Comment 6 Andrew Jones 2014-05-02 14:52:10 UTC
Assigning to Vitaly. I recommend trying this on Fedora 20 Xen. If it doesn't reproduce, then we can close as wont-fix. If it does reproduce and it looks like a host problem, we should open a bug against Fedora; if it's a guest problem, we should fix it.

Comment 7 Lingfei Kong 2014-05-05 01:24:48 UTC
I can reproduce it on Fedora 20 Xen (xen-4.3.2-2.fc20). RHEL6.5 and RHEL5.11 guests generate a core file when a crash is triggered in the guest, but a RHEL7.0 guest does not. So it is probably a guest problem.

Comment 8 Vitaly Kuznetsov 2014-05-09 16:08:08 UTC
This issue is present in upstream 3.11.10 but was fixed in 3.12. Here is the commit:
commit 669b0ae961e87c824233475e987b2d39996d4849
Author: Vaughan Cao <vaughan.cao>
Date:   Fri Aug 16 16:10:56 2013 +0800

    xen/pvhvm: Initialize xen panic handler for PVHVM guests
    
    kernel use callback linked in panic_notifier_list to notice others when panic
    happens.
    NORET_TYPE void panic(const char * fmt, ...){
        ...
        atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
    }
    When Xen becomes aware of this, it will call xen_reboot(SHUTDOWN_crash) to
    send out an event with reason code - SHUTDOWN_crash.
    
    xen_panic_handler_init() is defined to register on panic_notifier_list but
    we only call it in xen_arch_setup which only be called by PV, this patch is
    necessary for PVHVM.
    
    Without this patch, setting 'on_crash=coredump-restart' in PVHVM guest config
    file won't lead a vmcore to be generate when the guest panics. It can be
    reproduced with 'echo c > /proc/sysrq-trigger'.
    
    Signed-off-by: Vaughan Cao <vaughan.cao>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk>
    Acked-by: Joe Jin <joe.jin>

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index b5a22fa..15939e8 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1713,6 +1713,8 @@ static void __init xen_hvm_guest_init(void)
 
        xen_hvm_init_shared_info();
 
+       xen_panic_handler_init();
+
        if (xen_feature(XENFEAT_hvm_callback_vector))
                xen_have_vector_callback = 1;
        xen_hvm_smp_init();

Comment 11 Jarod Wilson 2014-07-18 14:24:09 UTC
Patch(es) available on kernel-3.10.0-137.el7
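
A guest can be checked against this build with a version comparison. This is only a rough sketch: it assumes GNU `sort -V` is available, the helper name kernel_at_least is hypothetical, and a version string is a heuristic that says nothing about kernels carrying the patch under a different numbering.

```shell
#!/bin/sh
# Sketch only: succeeds if version $1 sorts at or after version $2
# under GNU `sort -V` (version sort).
kernel_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use inside the guest (not executed here):
#   kernel_at_least "$(uname -r)" 3.10.0-137.el7 && echo "fix should be present"
```

With the versions from this report, the 3.3.0-0.20.el7 guest kernel sorts before 3.10.0-137.el7, and the 3.10.0-221.el7 kernel sorts after it.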

Comment 14 Lingfei Kong 2015-01-13 07:01:32 UTC
Verified with kernel-3.10.0-221.el7.

Steps to verify:
1. Enable core-dumps in /etc/xen/xend-config.sxp
# grep enable-dump /etc/xen/xend-config.sxp
(enable-dump yes)
 
2. Create rhel7 hvm guest with on_crash = "restart"
# grep on_crash hvm-7.1-64-1.cfg 
on_crash = "restart"

# xm create hvm-7.1-64-1.cfg

# xm list 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0    13954    16 r-----   3891.7
hvm-7.1-64-1                              52     1032     4 r-----     21.4

3. Trigger guest crash
# echo c > /proc/sysrq-trigger 
[  149.299511] SysRq : Trigger a crash
[  149.300030] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  149.300030] IP: [<ffffffff81398026>] sysrq_handle_crash+0x16/0x20
[  149.300030] PGD 3b9d4067 PUD 3aa11067 PMD 0 
[  149.300030] Oops: 0002 [#1] SMP 
[  149.300030] Modules linked in: ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr serio_raw i2c_piix4 i2c_core xfs libcrc32c sd_mod crc_t10dif crct10dif_common ata_generic pata_acpi ata_piix libata xen_blkfront xen_netfront floppy dm_mirror dm_region_hash dm_log dm_mod
[  149.454922] CPU: 1 PID: 2463 Comm: bash Not tainted 3.10.0-221.el7.x86_64 #1
[  149.454922] Hardware name: Red Hat HVM domU, BIOS 3.1.2-402.el5 05/07/2013
[  149.454922] task: ffff88003ce3e660 ti: ffff88003b234000 task.ti: ffff88003b234000
[  149.454922] RIP: 0010:[<ffffffff81398026>]  [<ffffffff81398026>] sysrq_handle_crash+0x16/0x20
[  149.454922] RSP: 0018:ffff88003b237e80  EFLAGS: 00010046
[  149.454922] RAX: 000000000000000f RBX: ffffffff819c5660 RCX: 0000000000000000
[  149.454922] RDX: 0000000000000000 RSI: ffff88003fc8d488 RDI: 0000000000000063
[  149.454922] RBP: ffff88003b237e80 R08: 0000000000000092 R09: 00000000000001ff
[  149.454922] R10: 00000000000001fe R11: 0000000000000003 R12: 0000000000000063
[  149.454922] R13: 0000000000000246 R14: 0000000000000004 R15: 0000000000000000
[  149.454922] FS:  00007f87fafe0740(0000) GS:ffff88003fc80000(0000) knlGS:0000000000000000
[  149.454922] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  149.454922] CR2: 0000000000000000 CR3: 000000003cdb1000 CR4: 00000000000006e0
[  149.454922] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  149.454922] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  149.454922] Stack:
[  149.454922]  ffff88003b237eb8 ffffffff813987d2 0000000000000002 00007f87fafe9000
[  149.454922]  ffff88003b237f48 0000000000000002 0000000000000000 ffff88003b237ed0
[  149.454922]  ffffffff81398caf ffff88003683e480 ffff88003b237ef0 ffffffff8122de6d
[  149.454922] Call Trace:
[  149.454922]  [<ffffffff813987d2>] __handle_sysrq+0xa2/0x170
[  149.454922]  [<ffffffff81398caf>] write_sysrq_trigger+0x2f/0x40
[  149.454922]  [<ffffffff8122de6d>] proc_reg_write+0x3d/0x80
[  149.454922]  [<ffffffff811c66dd>] vfs_write+0xbd/0x1e0
[  149.454922]  [<ffffffff811c7128>] SyS_write+0x58/0xb0
[  149.454922]  [<ffffffff816152a9>] system_call_fastpath+0x16/0x1b
[  149.454922] Code: eb 9b 45 01 f4 45 39 65 34 75 e5 4c 89 ef e8 e2 f7 ff ff eb db 66 66 66 66 90 55 c7 05 b0 0b 5a 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 2e 
[  149.454922] RIP  [<ffffffff81398026>] sysrq_handle_crash+0x16/0x20
[  149.454922]  RSP <ffff88003b237e80>
[  149.454922] CR2: 0000000000000000
[  149.454922] ---[ end trace 6f476705252cca2c ]---
[  149.454922] Kernel panic - not syncing: Fatal exception

4. After some seconds the guest restarts with a new domain ID, and a core file is generated in /var/lib/xen/dump/
# xm list 
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0    13954    16 r-----   3914.7
hvm-7.1-64-1                              53     1032     1 r-----      3.2

# ls /var/lib/xen/dump/
2015-0113-2257.13-hvm-7.1-64-1.52.core

5. The core file is usable; crash can read it:
crash> bt
PID: 2463   TASK: ffff88003ce3e660  CPU: 1   COMMAND: "bash"
 #0 [ffff88003b237ae8] xen_panic_event at ffffffff81003533
 #1 [ffff88003b237af8] notifier_call_chain at ffffffff81610c6c
 #2 [ffff88003b237b30] atomic_notifier_call_chain at ffffffff81610cca
 #3 [ffff88003b237b40] panic at ffffffff815fece8
 #4 [ffff88003b237bc0] oops_end at ffffffff8160da9b
 #5 [ffff88003b237be8] no_context at ffffffff815fe501
 #6 [ffff88003b237c38] __bad_area_nosemaphore at ffffffff815fe597
 #7 [ffff88003b237c80] bad_area at ffffffff815fe915
 #8 [ffff88003b237ca8] __do_page_fault at ffffffff816109f5
 #9 [ffff88003b237da8] do_page_fault at ffffffff81610aca
#10 [ffff88003b237dd0] page_fault at ffffffff8160cd08
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff81398026  RSP: ffff88003b237e80  RFLAGS: 00010046
    RAX: 000000000000000f  RBX: ffffffff819c5660  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff88003fc8d488  RDI: 0000000000000063
    RBP: ffff88003b237e80   R8: 0000000000000092   R9: 00000000000001ff
    R10: 00000000000001fe  R11: 0000000000000003  R12: 0000000000000063
    R13: 0000000000000246  R14: 0000000000000004  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#11 [ffff88003b237e88] __handle_sysrq at ffffffff813987d2
#12 [ffff88003b237ec0] write_sysrq_trigger at ffffffff81398caf
#13 [ffff88003b237ed8] proc_reg_write at ffffffff8122de6d
#14 [ffff88003b237ef8] vfs_write at ffffffff811c66dd
#15 [ffff88003b237f38] sys_write at ffffffff811c7128
#16 [ffff88003b237f80] system_call_fastpath at ffffffff816152a9
    RIP: 00007f87fa6c29e0  RSP: 00007fff6efc3208  RFLAGS: 00010202
    RAX: 0000000000000001  RBX: ffffffff816152a9  RCX: 0000000000000063
    RDX: 0000000000000002  RSI: 00007f87fafe9000  RDI: 0000000000000001



So the bug is fixed.
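
The check in step 4 can be scripted by looking for a core file that belongs to the crashed domain. A minimal sketch: the helper name find_dump is hypothetical, and the naming pattern <timestamp>-<name>.<domid>.core is an assumption inferred only from the single file name shown above.

```shell
#!/bin/sh
# Sketch only: prints core files in dump directory $1 that belong to
# domain name $2 with domain ID $3. The file name layout
# <timestamp>-<name>.<domid>.core is inferred from one sample file.
find_dump() {
    for f in "$1"/*-"$2"."$3".core; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Example use on the host (not executed here):
#   find_dump /var/lib/xen/dump hvm-7.1-64-1 52
```

Passing the domain ID recorded before the crash (52 here) ties the dump to the crashed instance rather than to the restarted domain.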

Comment 16 errata-xmlrpc 2015-03-05 11:28:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0290.html