903454 – kvm guest crash after long stop/cont cycle

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 903454 - kvm guest crash after long stop/cont cycle

Summary: kvm guest crash after long stop/cont cycle

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	6.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Marcelo Tosatti
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-01-24 02:46 UTC by Shaolong Hu
Modified:	2013-11-21 06:32 UTC (History)
CC List:	12 users (show)
Fixed In Version:	qemu-kvm-0.12.1.2-2.380.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-11-21 06:32:04 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2013:1553	0	normal	SHIPPED_LIVE	Important: qemu-kvm security, bug fix, and enhancement update	2013-11-20 21:40:29 UTC

Description Shaolong Hu 2013-01-24 02:46:39 UTC

Description of problem:
------------------------
In bug 739151, kvm suppress cpu softlockup message, but with x-data-plan=on, it does not work, this bug is to track the issue.


Version-Release number of selected component (if applicable):
---------------------------------------------------------------
Test with kernel 355 in guest, 351 in host, and qemu-kvm 351


How reproducible:
------------------
2/2


Steps to Reproduce:
--------------------------
1.add stress in guest:
for(( I=0; I<`cat /proc/cpuinfo  | grep processor | wc -l`;I++)) ; do echo $I; taskset -c $I /bin/bash -c 'for ((;;));  do X=1; done &'  ; done

2.stop guest for a whole night
3.cont guest, still hit the cpu softlockup, and guest finally hangs in minutes

cmd has nothing special, expect for i use x-data-plane:

/usr/libexec/qemu-kvm -enable-kvm -M rhel6.4.0 -smp 4 -m 4G -name rhel6.4-64 -uuid 3f2ea5cd-3d29-48ff-aab2-23df1b6ae213 -drive file=RHEL-Server-6.4-64-virtio.raw,cache=none,if=none,rerror=stop,werror=stop,id=drive-virtio-disk0,format=raw,aio=native -device virtio-blk-pci,drive=drive-virtio-disk0,id=device-virtio-disk0,x-data-plane=on,scsi=off -netdev tap,script=/etc/qemu-ifup,id=netdev0 -device virtio-net-pci,netdev=netdev0,id=device-net0,mac=aa:54:00:11:22:33 -boot order=cd -monitor stdio -usb -device usb-tablet,id=input0 -chardev socket,id=s1,path=/tmp/s1,server,nowait -device isa-serial,chardev=s1 -vnc :10 -monitor tcp::1234,server,nowait -rtc base=localtime,clock=host,driftfix=slew


Actual results:
----------------
call trace, guest hangs in several minutes.

There are several call trace, i put them in the sequence they occur:

BUG: soft lockup - CPU#0 stuck for 76635s! [bash:2158]
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput p
CPU 0
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput p

Pid: 2158, comm: bash Not tainted 2.6.32-355.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff810817f6>]  [<ffffffff810817f6>] run_timer_softirq+0x176/0x340
RSP: 0000:ffff880028203e50  EFLAGS: 00000246
RAX: ffff880028203e90 RBX: ffff880028203ed0 RCX: 0000000000000ca0
RDX: ffff880028203e90 RSI: dead000000200200 RDI: 0000000000000000
RBP: ffffffff8100bb93 R08: ffffffff81eae200 R09: 0000000000000000
R10: 00000000b2dc04f6 R11: ffff880028203cc0 R12: ffff880028203dd0
R13: ffff880119b110a8 R14: ffffffff81ead540 R15: ffffffff815168eb
FS:  00007fe0946a2700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000368b21c2b0 CR3: 0000000118d49000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 2158, threadinfo ffff88011a4ae000, task ffff880118eb0040)
Stack:
 ffff880028203e60 ffffffff81eaf160 ffffffff81eaed60 ffffffff81eae960
<d> ffff88011a4affd8 ffff88011a4affd8 000001002820e0e0 ffffffff81eae560
<d> ffff880028203e90 ffff880028203e90 0000000000008a7f 0000000000000141
Call Trace:
 <IRQ>
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff8109b75b>] ? hrtimer_interrupt+0x14b/0x260
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff815168f0>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
 <EOI>
Code: 48 be 00 02 20 00 00 00 ad de 48 89 42 08 48 89 10 49 c7 04 24 00 00 00 00 49 89 74 24 08 c7 03 00 00 00 00 fb 66 0f 1f 44 00 00 <48> 8b 55 a8 8b 0d 88 aa b4 00 8b 92 44 e0 ff ff 85 c9 89 55 b4




BUG: soft lockup - CPU#0 stuck for 67s! [bash:2158]
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
CPU 0 
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 2158, comm: bash Not tainted 2.6.32-355.el6.x86_64 #1 Red Hat KVM
RIP: 0033:[<000000000042c4ca>]  [<000000000042c4ca>] 0x42c4ca
RSP: 002b:00007fff3ecb9ff0  EFLAGS: 00000293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000368af8ee88
RDX: 000000000078ad70 RSI: 0000000000000000 RDI: 000000368af8ee80
RBP: ffffffff8100bb8e R08: fefefefefefefeff R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000000 R12: 00007fff3ecba124
R13: 0000000000000000 R14: 0000000000000000 R15: 000000000078a4d0
FS:  00007fe0946a2700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2703ff0000 CR3: 0000000118d49000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 2158, threadinfo ffff88011a4ae000, task ffff880118eb0040)

Call Trace:

Message from syslogd@dhcp-8-128 at Jan 24 05:33:35 ...
 kernel:BUG: soft lockup - CPU#0 stuck for 67s! [bash:2158]
INFO: task dbus-daemon-lau:2376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbus-daemon-l D 0000000000000000     0  2376      1 0x00000080
 ffff88011a06dce8 0000000000000086 0000000000000000 ffff88011af26aa0
 ffff88011af26ad8 0000000000000000 ffff88011a06dca8 ffffffff81064a00
 ffff88011af27058 ffff88011a06dfd8 000000000000fb88 ffff88011af27058
Call Trace:
 [<ffffffff81064a00>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff8150d36f>] ? thread_return+0x16d/0x76e
 [<ffffffff8150e0c5>] schedule_timeout+0x215/0x2e0
 [<ffffffff810669eb>] ? enqueue_rt_entity+0x6b/0x80
 [<ffffffff8150dd43>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150de5d>] wait_for_completion+0x1d/0x20
 [<ffffffff8106513c>] sched_exec+0xdc/0xe0
 [<ffffffff81189ed0>] do_execve+0xe0/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
INFO: task nm-dispatcher.a:2382 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nm-dispatcher D 0000000000000002     0  2382   2381 0x00000080
 ffff8801131a3ce8 0000000000000086 ffff8801131a3c98 ffffffff8100988e
 ffff88011b3fc078 0000000000000000 00000000011a3ca8 ffff880028314280
 ffff88011c835058 ffff8801131a3fd8 000000000000fb88 ffff88011c835058
Call Trace:
 [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
 [<ffffffff8150d250>] ? thread_return+0x4e/0x76e
 [<ffffffff8150e0c5>] schedule_timeout+0x215/0x2e0
 [<ffffffff810669eb>] ? enqueue_rt_entity+0x6b/0x80
 [<ffffffff8150dd43>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150de5d>] wait_for_completion+0x1d/0x20
 [<ffffffff8106513c>] sched_exec+0xdc/0xe0
 [<ffffffff81189ed0>] do_execve+0xe0/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
INFO: task crond:2386 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crond         D 0000000000000002     0  2386   2385 0x00000080
 ffff8801124c7ce8 0000000000000086 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 ffff88011a145098 ffff8801124c7fd8 000000000000fb88 ffff88011a145098
Call Trace:
 [<ffffffffa00b3bf0>] ? ext4_file_open+0x0/0x130 [ext4]
 [<ffffffff8150e0c5>] schedule_timeout+0x215/0x2e0
 [<ffffffff8117e364>] ? nameidata_to_filp+0x54/0x70
 [<ffffffff81277059>] ? cpumask_next_and+0x29/0x50
 [<ffffffff8150dd43>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff8150de5d>] wait_for_completion+0x1d/0x20
 [<ffffffff8106513c>] sched_exec+0xdc/0xe0
 [<ffffffff81189ed0>] do_execve+0xe0/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0



 kernel:BUG: soft lockup - CPU#0 stuck for 68s! [bash:2158]
BUG: soft lockup - CPU#2 stuck for 67s! [bash:2170]
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
CPU 2 
Modules linked in: fuse sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput ppdev parport_pc parport sg microcode virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 2170, comm: bash Not tainted 2.6.32-355.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff8118f03d>]  [<ffffffff8118f03d>] path_put+0x1d/0x40
RSP: 0018:ffff88011a48df28  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88011a48df38 RCX: ffff88011a48c000
RDX: 0000000000000080 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: ffff000000000000
R10: 0000000000000008 R11: 0000000000000246 R12: 00000000006dd7e8
R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000246
FS:  00007f7fcd515700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f27048c30d8 CR3: 000000011ac95000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 2170, threadinfo ffff88011a48c000, task ffff88011b3fc040)
Stack:
 ffff88011a48df78 ffff88011b1ae400 ffff88011a48df78 ffffffff810dc4a4
<d> 0000000000000000 00000000006dd7a0 0000000001bcc4d0 0000000000000000
<d> 0000000000000000 00007fff716f6f44 00007fff716f6f44 ffffffff8100b1a5
Call Trace:
 [<ffffffff810dc4a4>] ? __audit_syscall_exit+0x1a4/0x290
 [<ffffffff8100b1a5>] ? sysret_audit+0x17/0x21
Code: 00 8b 45 d8 e9 30 ff ff ff 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb 48 8b 7f 08 e8 e6 bc 00 00 48 8b 3b <48> 85 ff 74 0f c7 87 d4 00 00 00 00 00 00 00 e8 cf 28 01 00 48 
Call Trace:
 [<ffffffff8118f03a>] ? path_put+0x1a/0x40
 [<ffffffff810dc4a4>] ? __audit_syscall_exit+0x1a4/0x290
 [<ffffffff8100b1a5>] ? sysret_audit+0x17/0x21

Comment 1 Stefan Hajnoczi 2013-01-24 14:10:29 UTC

I'll try to reproduce this with 4 VMs tonight.  2 with x-data-plane=on and 2 with x-data-plane=off.

Since there is no I/O going on and virtio-blk-data-plane does not block vcpus from executing code, it's hard to see how x-data-plane=on could cause softlockups inside the guest.

Comment 2 Stefan Hajnoczi 2013-01-24 16:38:43 UTC

I reproduced the soft lockup without a hard disk!  This is not related to virtio-blk-data-plane.

Live ISO: rhel-workstation-x86_64-live-6.4_20130123.0-1.iso
Guest kernel: 2.6.32-356.el6.x86_64
Host kernel: F18 3.7 (includes kvmclock patches for stopped bit)
qemu-kvm: qemu-kvm-0.12.1.2-2.352.el6
Guest CPU count: 3

kernel:BUG: soft lockup - CPU#0 stuck for 1520s! [bash:2529]

It seems that Bug 739151 may not be resolved completely.

Comment 3 juzhang 2013-01-25 02:45:55 UTC

According to https://bugzilla.redhat.com/show_bug.cgi?id=739151#c42. 
--- snip log from comment42---
Let's use the other bug (bug 883746) reported for problems in this feature.  If the problem looks different, please open a new bug.

Re-assign this bug to amit. 

Hi, Amit

Any mistake, please fix me.

Best Regards & Thanks,
Junyi

Comment 9 Shaolong Hu 2013-07-29 02:50:52 UTC

Test on qemu-kvm-0.12.1.2-2.380.el6 with steps in comment 0, after resumise the guest, no call trace, only one message:

Clocksource tsc unstable (delta = 2091710798 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.

Verified.

Comment 10 errata-xmlrpc 2013-11-21 06:32:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1553.html

Note You need to log in before you can comment on or make changes to this bug.