Bug 805372

Summary: Host kernel panic happened when launching VM
Product: Red Hat Enterprise Linux 5
Reporter: Chao Yang <chayang>
Component: kvm
Assignee: Karen Noel <knoel>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 5.8
CC: chayang, gleb, juzhang, michen, mkenneth, qzhang, shuang, sluo, tburke, virt-maint
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-09 08:42:18 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 807971

Description Chao Yang 2012-03-21 03:55:56 UTC
Description of problem:
The host kernel panicked twice while launching a 5.8 guest. I collected 3 vmcore files, one for each of the 3 times the panic happened.

Version-Release number of selected component (if applicable):
kernel-2.6.18-308.el5.x86_64.rpm
kvm-83-249.el5_8
How reproducible:
not always

Steps to Reproduce:
1. 
2.
3.
  
Actual results:


Expected results:


Additional info:
I will attach the vmcore files later.

Comment 1 Chao Yang 2012-03-21 05:19:03 UTC
(In reply to comment #0)
> Description of problem:
> Host kernel panic happened twice during launching 5.8 guest. I collected 3
> vmcore files when panic happened for 3 times.
> 
> Version-Release number of selected component (if applicable):
> kernel-2.6.18-308.el5.x86_64.rpm
> kvm-83-249.el5_8
Sorry, this should be kvm-83-249.el5.x86_64.rpm

Comment 2 Chao Yang 2012-03-21 05:48:40 UTC
This issue appears to be hardware specific, because I can't reproduce it on another host with the same command line and scenario.

Reproducible host info:
# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           W3520  @ 2.67GHz
stepping	: 5
cpu MHz		: 1596.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips	: 5333.49
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management: [8]

# dmidecode -s system-product-name
HP Z400 Workstation
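
For comparing this host with the one that does not reproduce the problem, the flags line above can be checked mechanically. The following is only a minimal Python sketch, assuming the flags line is pasted in as a string; the helper names `parse_flags` and `has_flag` are hypothetical and not part of any tool mentioned in this report.

```python
# Hypothetical helpers for inspecting a /proc/cpuinfo "flags" line.
# The line below is copied from the cpuinfo output in this comment.
CPUINFO_FLAGS_LINE = (
    "flags\t\t: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca "
    "cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx "
    "rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 "
    "ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm"
)


def parse_flags(line):
    """Return the set of flag names that follow the first ':' in the line."""
    _, _, value = line.partition(":")
    return set(value.split())


def has_flag(line, flag):
    """True if the exact flag name is present (no substring matching)."""
    return flag in parse_flags(line)


if __name__ == "__main__":
    # vmx is the Intel hardware virtualization flag used by kvm_intel.
    print("vmx present:", has_flag(CPUINFO_FLAGS_LINE, "vmx"))
```

Running the same check against the flags line of the second host would show whether the two machines differ in any advertised CPU feature.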

Comment 4 Avi Kivity 2012-03-22 10:52:52 UTC
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<0000000000000000>]
PGD 1be43b067 PUD 1be43c067 PMD 0 
Oops: 0010 [1] SMP 
last sysfs file: /class/net/lo/ifindex
CPU 1 
Modules linked in: tun nls_utf8 loop autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ksm(U) kvm_intel(U) kvm(U) snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer sr_mod cdrom snd_page_alloc tg3 snd_hwdep i7core_edac sg snd edac_mc pcspkr serio_raw shpchp soundcore tpm_tis tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4395, comm: qemu-kvm Tainted: G     ---- 2.6.18-308.el5 #1
RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
RSP: 0018:ffff8101be293d90  EFLAGS: 00010246
RAX: ffff8101bea3ec68 RBX: ffff8101ed34b3c0 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffff8101bea3c000 RDI: ffff8101ed34b3c0
RBP: ffff8101bea3c000 R08: ffff8101be293d78 R09: 0000000000000000
R10: ffff81021fce4008 R11: ffffffff8841c65b R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000041790000
FS:  000000004178f940(0063) GS:ffff81021fc097c0(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000001be43a000 CR4: 00000000000026a0
Process qemu-kvm (pid: 4395, threadinfo ffff8101be292000, task ffff8102187c9080)
list_add corruption. next->prev should be ffff81010774e4d0, but was 0000000000000000
Stack:  ffffffff883fb6bc ffff81020f648460 ffff8101bea3c000 ffff8101da330000
 0000000000000001 0000000000000010 ffffffff883fa192 ffff8101bdcd0040
 ffff81021e2edac0 ffff8101bdcd0040 ffffffff883eaad4 fffffffe7ffbfeff
Call Trace:
 [<ffffffff883fb6bc>] :kvm:kvm_set_irq+0x65/0xa3
 [<ffffffff883fa192>] :kvm:kvm_inject_pit_timer_irqs+0x8c/0xd7
 [<ffffffff883eaad4>] :kvm:kvm_arch_vcpu_ioctl_run+0x473/0x61e
 [<ffffffff883e5f95>] :kvm:kvm_vcpu_ioctl+0xf2/0x448
 [<ffffffff8008ee72>] default_wake_function+0x0/0xe
 [<ffffffff80041ea0>] do_ioctl+0x21/0x6b
 [<ffffffff8002ff2d>] vfs_ioctl+0x457/0x4b9
 [<ffffffff8004c26c>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
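
To compare this trace with the other two vmcores, it may help to split each frame into module, symbol, and offset. The following is a rough Python sketch, assuming the RHEL 5 trace format shown above; the regular expression and the function name `parse_call_trace` are illustrative assumptions, not an existing tool.

```python
import re

# Call trace copied from the oops above.
CALL_TRACE = """\
 [<ffffffff883fb6bc>] :kvm:kvm_set_irq+0x65/0xa3
 [<ffffffff883fa192>] :kvm:kvm_inject_pit_timer_irqs+0x8c/0xd7
 [<ffffffff883eaad4>] :kvm:kvm_arch_vcpu_ioctl_run+0x473/0x61e
 [<ffffffff883e5f95>] :kvm:kvm_vcpu_ioctl+0xf2/0x448
 [<ffffffff8008ee72>] default_wake_function+0x0/0xe
 [<ffffffff80041ea0>] do_ioctl+0x21/0x6b
 [<ffffffff8002ff2d>] vfs_ioctl+0x457/0x4b9
 [<ffffffff8004c26c>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
"""

# Assumed frame layout: [<address>] optional ":module:" then symbol+0xoff/0xsize
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_call_trace(text):
    """Return one dict per frame; module is None for core kernel symbols."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "addr": int(m.group("addr"), 16),
            "module": m.group("module"),
            "symbol": m.group("symbol"),
            "offset": int(m.group("offset"), 16),
            "size": int(m.group("size"), 16),
        })
    return frames


if __name__ == "__main__":
    for f in parse_call_trace(CALL_TRACE):
        # addr - offset is the start address of the function.
        start = f["addr"] - f["offset"]
        print(f["module"] or "-", f["symbol"], hex(start))
```

Subtracting the offset from the return address gives the start of each function, which can then be compared between vmcores taken with the same kvm module loaded.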

Comment 5 Avi Kivity 2012-03-22 10:54:57 UTC
Examining the 'struct kvm' pointer in %rbp (copied from %rdi):


crash> x/30x 0xffff8101bea3c000
0xffff8101bea3c000:     0x00000000      0x00000001      0xbea3c008      0xffff8101
0xffff8101bea3c010:     0xbea3c008      0xffff8101      0x00000001      0x00000001
0xffff8101bea3c020:     0x00000001      0x00000001      0xbea3c028      0xffff8101
0xffff8101bea3c030:     0xbea3c028      0xffff8101      0x1cbf9480      0xffff8102
0xffff8101bea3c040:     0x00000023      0x00000000      0x00000000      0x00000000
0xffff8101bea3c050:     0x000000a0      0x00000000      0x00000000      0x00000000
0xffff8101bea3c060:     0x100de000      0xffffc200      0x00000000      0x00000000
0xffff8101bea3c070:     0x100e0000      0xffffc200

Looks corrupted.

Looks like the oops handler encountered its own corruption while showing the oops:

list_add corruption. next->prev should be ffff81010774e4d0, but was 0000000000000000

In short, the machine is totally trashed.
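
The dump above is printed as 32-bit words, so pointer-sized fields are split across two columns. The following is a small Python sketch for reading it as 64-bit values, assuming little-endian x86_64 layout, contiguous words, and an 8-byte-aligned start address; the function name `to_qwords` is hypothetical. It makes no claim about which 'struct kvm' field lives at which offset.

```python
# Dump copied from the crash session above (x/30x 0xffff8101bea3c000).
DUMP = """\
0xffff8101bea3c000:     0x00000000      0x00000001      0xbea3c008      0xffff8101
0xffff8101bea3c010:     0xbea3c008      0xffff8101      0x00000001      0x00000001
0xffff8101bea3c020:     0x00000001      0x00000001      0xbea3c028      0xffff8101
0xffff8101bea3c030:     0xbea3c028      0xffff8101      0x1cbf9480      0xffff8102
0xffff8101bea3c040:     0x00000023      0x00000000      0x00000000      0x00000000
0xffff8101bea3c050:     0x000000a0      0x00000000      0x00000000      0x00000000
0xffff8101bea3c060:     0x100de000      0xffffc200      0x00000000      0x00000000
0xffff8101bea3c070:     0x100e0000      0xffffc200
"""


def to_qwords(dump):
    """Return {address: 64-bit value}, pairing 32-bit words as (low, high)."""
    words = []  # list of (address, 32-bit value) in address order
    for line in dump.splitlines():
        if not line.strip():
            continue
        addr_text, _, rest = line.partition(":")
        base = int(addr_text, 16)
        for i, tok in enumerate(rest.split()):
            words.append((base + 4 * i, int(tok, 16)))
    qwords = {}
    # Little-endian: the word at the lower address is the low half.
    for (lo_addr, lo), (_, hi) in zip(words[0::2], words[1::2]):
        qwords[lo_addr] = (hi << 32) | lo
    return qwords


if __name__ == "__main__":
    for addr, value in sorted(to_qwords(DUMP).items()):
        note = "  <- equals its own address" if addr == value else ""
        print(f"{addr:#018x}: {value:#018x}{note}")
```

Read this way, the values can be lined up against the 'struct kvm' layout of the running kvm module in crash, which is what decides whether a given field holds a plausible value.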

Comment 6 Avi Kivity 2012-03-25 16:14:46 UTC
Please try booting with the ftrace_dump_on_oops kernel parameter, a serial console or netconsole, and running qemu under trace-cmd:

  trace-cmd record -p function -e kvm -b 100000 /usr/libexec/qemu-kvm ...

and send the console output that results.

Comment 7 Chao Yang 2012-03-27 09:37:27 UTC
(In reply to comment #6)
> Please try booting with the ftrace_dump_on_oops kernel parameter, a serial
> console or netconsole, and running qemu under trace-cmd:
> 
>   trace-cmd -p function -e kvm -b 100000 /usr/libexec/qemu-kvm ...
> 
> and send the console output that results.
Hi Avi,
The latest trace-cmd I could find in brew is trace-cmd-2.0-2.el5rt, and I installed it on my RHEL 5.8 host. However, I don't see the tracing directory under /sys/kernel/debug after mounting debugfs with: mount -t debugfs nodev /sys/kernel/debug
Any suggestions?

Comment 8 Avi Kivity 2012-03-27 14:23:42 UTC
Sorry, tracing isn't supported under RHEL 5.  I thought I posted a comment about it but it was for another bug.

Are you running with ksm enabled?  Please try disabling it.

Comment 9 Chao Yang 2012-04-01 11:39:02 UTC
(In reply to comment #8)
> Sorry, tracing isn't supported under RHEL 5.  I thought I posted a comment
> about it but it was for another bug.
> 
> Are you running with ksm enabled?  Please try disabling it.

Sorry for the late response. As far as I remember, I had not turned on ksm on my RHEL 5 host when the crash happened.
So far, I have kept the VM running for several days with no crash, so it no longer seems to be easily reproducible for me.

Comment 10 RHEL Program Management 2012-04-02 10:49:11 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 11 Avi Kivity 2012-04-09 08:42:18 UTC
Not reproducible, only happens on one machine; not opened by a customer.  Closing.

If it reproduces, please reopen.