Bug 903454

Summary: kvm guest crash after long stop/cont cycle
Product: Red Hat Enterprise Linux 6 Reporter: Shaolong Hu <shu>
Component: qemu-kvmAssignee: Marcelo Tosatti <mtosatti>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.4CC: acathrow, amit.shah, bsarathy, juzhang, michen, mkenneth, mrezanin, qzhang, shu, sluo, stefanha, virt-maint
Target Milestone: rcKeywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-0.12.1.2-2.380.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-21 06:32:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Shaolong Hu 2013-01-24 02:46:39 UTC
Description of problem:
------------------------
In bug 739151, kvm suppress cpu softlockup message, but with x-data-plan=on, it does not work, this bug is to track the issue.


Version-Release number of selected component (if applicable):
---------------------------------------------------------------
Test with kernel 355 in guest, 351 in host, and qemu-kvm 351


How reproducible:
------------------
2/2


Steps to Reproduce:
--------------------------
1.add stress in guest:
for(( I=0; I<`cat /proc/cpuinfo  | grep processor | wc -l`;I++)) ; do echo $I; taskset -c $I /bin/bash -c 'for ((;;));  do X=1; done &'  ; done

2.stop guest for a whole night
3.cont guest, still hit the cpu softlockup, and guest finally hangs in minutes

cmd has nothing special, expect for i use x-data-plane:

/usr/libexec/qemu-kvm -enable-kvm -M rhel6.4.0 -smp 4 -m 4G -name rhel6.4-64 -uuid 3f2ea5cd-3d29-48ff-aab2-23df1b6ae213 -drive file=RHEL-Server-6.4-64-virtio.raw,cache=none,if=none,rerror=stop,werror=stop,id=drive-virtio-disk0,format=raw,aio=native -device virtio-blk-pci,drive=drive-virtio-disk0,id=device-virtio-disk0,x-data-plane=on,scsi=off -netdev tap,script=/etc/qemu-ifup,id=netdev0 -device virtio-net-pci,netdev=netdev0,id=device-net0,mac=aa:54:00:11:22:33 -boot order=cd -monitor stdio -usb -device usb-tablet,id=input0 -chardev socket,id=s1,path=/tmp/s1,server,nowait -device isa-serial,chardev=s1 -vnc :10 -monitor tcp::1234,server,nowait -rtc base=localtime,clock=host,driftfix=slew


Actual results:
----------------
call trace, guest hangs in several minutes.

There are several call trace, i put them in the sequence they occur:

BUG: soft lockup - CPU#0 stuck for 76635s! [bash:2158]
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput p
CPU 0
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput p

Pid: 2158, comm: bash Not tainted 2.6.32-355.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff810817f6>]  [<ffffffff810817f6>] run_timer_softirq+0x176/0x340
RSP: 0000:ffff880028203e50  EFLAGS: 00000246
RAX: ffff880028203e90 RBX: ffff880028203ed0 RCX: 0000000000000ca0
RDX: ffff880028203e90 RSI: dead000000200200 RDI: 0000000000000000
RBP: ffffffff8100bb93 R08: ffffffff81eae200 R09: 0000000000000000
R10: 00000000b2dc04f6 R11: ffff880028203cc0 R12: ffff880028203dd0
R13: ffff880119b110a8 R14: ffffffff81ead540 R15: ffffffff815168eb
FS:  00007fe0946a2700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000368b21c2b0 CR3: 0000000118d49000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 2158, threadinfo ffff88011a4ae000, task ffff880118eb0040)
Stack:
 ffff880028203e60 ffffffff81eaf160 ffffffff81eaed60 ffffffff81eae960
<d> ffff88011a4affd8 ffff88011a4affd8 000001002820e0e0 ffffffff81eae560
<d> ffff880028203e90 ffff880028203e90 0000000000008a7f 0000000000000141
Call Trace:
 <IRQ>
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff8109b75b>] ? hrtimer_interrupt+0x14b/0x260
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff815168f0>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
 <EOI>
Code: 48 be 00 02 20 00 00 00 ad de 48 89 42 08 48 89 10 49 c7 04 24 00 00 00 00 49 89 74 24 08 c7 03 00 00 00 00 fb 66 0f 1f 44 00 00 <48> 8b 55 a8 8b 0d 88 aa b4 00 8b 92 44 e0 ff ff 85 c9 89 55 b4




BUG: soft lockup - CPU#0 stuck for 67s! [bash:2158]
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
CPU 0 
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 2158, comm: bash Not tainted 2.6.32-355.el6.x86_64 #1 Red Hat KVM
RIP: 0033:[<000000000042c4ca>]  [<000000000042c4ca>] 0x42c4ca
RSP: 002b:00007fff3ecb9ff0  EFLAGS: 00000293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000368af8ee88
RDX: 000000000078ad70 RSI: 0000000000000000 RDI: 000000368af8ee80
RBP: ffffffff8100bb8e R08: fefefefefefefeff R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000000 R12: 00007fff3ecba124
R13: 0000000000000000 R14: 0000000000000000 R15: 000000000078a4d0
FS:  00007fe0946a2700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2703ff0000 CR3: 0000000118d49000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 2158, threadinfo ffff88011a4ae000, task ffff880118eb0040)

Call Trace:

Message from syslogd@dhcp-8-128 at Jan 24 05:33:35 ...
 kernel:BUG: soft lockup - CPU#0 stuck for 67s! [bash:2158]
INFO: task dbus-daemon-lau:2376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbus-daemon-l D 0000000000000000     0  2376      1 0x00000080
 ffff88011a06dce8 0000000000000086 0000000000000000 ffff88011af26aa0
 ffff88011af26ad8 0000000000000000 ffff88011a06dca8 ffffffff81064a00
 ffff88011af27058 ffff88011a06dfd8 000000000000fb88 ffff88011af27058
Call Trace:
 [<ffffffff81064a00>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff8150d36f>] ? thread_return+0x16d/0x76e
 [<ffffffff8150e0c5>] schedule_timeout+0x215/0x2e0
 [<ffffffff810669eb>] ? enqueue_rt_entity+0x6b/0x80
 [<ffffffff8150dd43>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150de5d>] wait_for_completion+0x1d/0x20
 [<ffffffff8106513c>] sched_exec+0xdc/0xe0
 [<ffffffff81189ed0>] do_execve+0xe0/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
INFO: task nm-dispatcher.a:2382 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nm-dispatcher D 0000000000000002     0  2382   2381 0x00000080
 ffff8801131a3ce8 0000000000000086 ffff8801131a3c98 ffffffff8100988e
 ffff88011b3fc078 0000000000000000 00000000011a3ca8 ffff880028314280
 ffff88011c835058 ffff8801131a3fd8 000000000000fb88 ffff88011c835058
Call Trace:
 [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
 [<ffffffff8150d250>] ? thread_return+0x4e/0x76e
 [<ffffffff8150e0c5>] schedule_timeout+0x215/0x2e0
 [<ffffffff810669eb>] ? enqueue_rt_entity+0x6b/0x80
 [<ffffffff8150dd43>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150de5d>] wait_for_completion+0x1d/0x20
 [<ffffffff8106513c>] sched_exec+0xdc/0xe0
 [<ffffffff81189ed0>] do_execve+0xe0/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
INFO: task crond:2386 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crond         D 0000000000000002     0  2386   2385 0x00000080
 ffff8801124c7ce8 0000000000000086 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 ffff88011a145098 ffff8801124c7fd8 000000000000fb88 ffff88011a145098
Call Trace:
 [<ffffffffa00b3bf0>] ? ext4_file_open+0x0/0x130 [ext4]
 [<ffffffff8150e0c5>] schedule_timeout+0x215/0x2e0
 [<ffffffff8117e364>] ? nameidata_to_filp+0x54/0x70
 [<ffffffff81277059>] ? cpumask_next_and+0x29/0x50
 [<ffffffff8150dd43>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150de5d>] wait_for_completion+0x1d/0x20
 [<ffffffff8106513c>] sched_exec+0xdc/0xe0
 [<ffffffff81189ed0>] do_execve+0xe0/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0



 kernel:BUG: soft lockup - CPU#0 stuck for 68s! [bash:2158]
BUG: soft lockup - CPU#2 stuck for 67s! [bash:2170]
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
CPU 2 
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 2170, comm: bash Not tainted 2.6.32-355.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff8118f03d>]  [<ffffffff8118f03d>] path_put+0x1d/0x40
RSP: 0018:ffff88011a48df28  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88011a48df38 RCX: ffff88011a48c000
RDX: 0000000000000080 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: ffff000000000000
R10: 0000000000000008 R11: 0000000000000246 R12: 00000000006dd7e8
R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000246
FS:  00007f7fcd515700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f27048c30d8 CR3: 000000011ac95000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 2170, threadinfo ffff88011a48c000, task ffff88011b3fc040)
Stack:
 ffff88011a48df78 ffff88011b1ae400 ffff88011a48df78 ffffffff810dc4a4
<d> 0000000000000000 00000000006dd7a0 0000000001bcc4d0 0000000000000000
<d> 0000000000000000 00007fff716f6f44 00007fff716f6f44 ffffffff8100b1a5
Call Trace:
 [<ffffffff810dc4a4>] ? __audit_syscall_exit+0x1a4/0x290
 [<ffffffff8100b1a5>] ? sysret_audit+0x17/0x21
Code: 00 8b 45 d8 e9 30 ff ff ff 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb 48 8b 7f 08 e8 e6 bc 00 00 48 8b 3b <48> 85 ff 74 0f c7 87 d4 00 00 00 00 00 00 00 e8 cf 28 01 00 48 
Call Trace:
 [<ffffffff8118f03a>] ? path_put+0x1a/0x40
 [<ffffffff810dc4a4>] ? __audit_syscall_exit+0x1a4/0x290
 [<ffffffff8100b1a5>] ? sysret_audit+0x17/0x21

Comment 1 Stefan Hajnoczi 2013-01-24 14:10:29 UTC
I'll try to reproduce this with 4 VMs tonight.  2 with x-data-plane=on and 2 with x-data-plane=off.

Since there is no I/O going on and virtio-blk-data-plane does not block vcpus from executing code, it's hard to see how x-data-plane=on could cause softlockups inside the guest.

Comment 2 Stefan Hajnoczi 2013-01-24 16:38:43 UTC
I reproduced the soft lockup without a hard disk!  This is not related to virtio-blk-data-plane.

Live ISO: rhel-workstation-x86_64-live-6.4_20130123.0-1.iso
Guest kernel: 2.6.32-356.el6.x86_64
Host kernel: F18 3.7 (includes kvmclock patches for stopped bit)
qemu-kvm: qemu-kvm-0.12.1.2-2.352.el6
Guest CPU count: 3

kernel:BUG: soft lockup - CPU#0 stuck for 1520s! [bash:2529]

It seems that Bug 739151 may not be resolved completely.

Comment 3 juzhang 2013-01-25 02:45:55 UTC
According to https://bugzilla.redhat.com/show_bug.cgi?id=739151#c42. 
--- snip log from comment42---
Let's use the other bug (bug 883746) reported for problems in this feature.  If the problem looks different, please open a new bug.

Re-assign this bug to amit. 

Hi, Amit

Any mistake, please fix me.

Best Regards & Thanks,
Junyi

Comment 9 Shaolong Hu 2013-07-29 02:50:52 UTC
Test on qemu-kvm-0.12.1.2-2.380.el6 with steps in comment 0, after resumise the guest, no call trace, only one message:

Clocksource tsc unstable (delta = 2091710798 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.

Verified.

Comment 10 errata-xmlrpc 2013-11-21 06:32:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1553.html