Bug 1057754

Summary: BUG: soft lockup - CPU#0 stuck for 67s! [qemu-kvm:20512]
Product: Red Hat Enterprise Linux 6
Reporter: Dan Yocum <dyocum>
Component: qemu-kvm
Assignee: Andrew Jones <drjones>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.5
CC: acathrow, bsarathy, drjones, juzhang, kchamart, klepikho, masao-takahashi, michen, mkenneth, qzhang, virt-maint
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-04-15 14:53:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Dan Yocum 2014-01-24 18:18:12 UTC
Description of problem:
BUG: soft lockup - CPU#0 stuck for 67s! [qemu-kvm:20512]
Modules linked in: ebt_redirect ebt_arp ebt_ip xt_CHECKSUM xt_conntrack
nfs lockd iptable_mangle fscache iptable_nat auth_rpcgss nf_nat nfs_acl
sunrpc ebtable_nat ebtables deflate zlib_deflate ctr camellia cast5
rmd160 crypto_null ccm serpent blowfish twofish_x86_64 twofish_common
ecb xcbc cbc sha256_generic sha512_generic des_generic aesni_intel
cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet
xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport
xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel
ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key autofs4
cpufreq_ondemand powernow_k8 freq_table mperf bridge bonding 8021q garp
stp llc xt_policy ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi
iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio
ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi vhost_net macvtap macvlan tun
kvm_amd kvm microcode fam15h_power amd64_edac_mod edac_core edac_mce_amd
k10temp i2c_piix4 i2c_core sg e1000 ext4 jbd2 mbcache sd_mod crc_t10dif
ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 0
Modules linked in: ebt_redirect ebt_arp ebt_ip xt_CHECKSUM xt_conntrack
nfs lockd iptable_mangle fscache iptable_nat auth_rpcgss nf_nat nfs_acl
sunrpc ebtable_nat ebtables deflate zlib_deflate ctr camellia cast5
rmd160 crypto_null ccm serpent blowfish twofish_x86_64 twofish_common
ecb xcbc cbc sha256_generic sha512_generic des_generic aesni_intel
cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet
xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport
xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel
ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key autofs4
cpufreq_ondemand powernow_k8 freq_table mperf bridge bonding 8021q garp
stp llc xt_policy ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi
iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio
ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi vhost_net macvtap macvlan tun
kvm_amd kvm microcode fam15h_power amd64_edac_mod edac_core edac_mce_amd
k10temp i2c_piix4 i2c_core sg e1000 ext4 jbd2 mbcache sd_mod crc_t10dif
ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 20512, comm: qemu-kvm Not tainted
2.6.32-358.118.1.openstack.el6.x86_64 #1 SeaMicro
SM15000-64-CC-AA-1Ox1/AMD Server CRB
RIP: 0010:[<ffffffffa0169ec0>]  [<ffffffffa0169ec0>]
kvm_load_guest_fpu+0x100/0x130 [kvm]
RSP: 0018:ffff880f814f1c70  EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff880f814f1c78 RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff88102676f940
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880c93df0b98
R13: 0000000000000000 R14: 0000000000000001 R15: ffff881027fa2080
FS:  00007ff3b579c700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000fb0c41000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-kvm (pid: 20512, threadinfo ffff880f814f0000, task
ffff881027fa2080)
Stack:
 ffff880c93df0b38 ffff880f814f1db8 ffffffffa0178ef5 ffff880f814f1ca8
<d> ffffffff8105231d ffff880f814f1ca8 0000000000000286 ffff880c93df23fc
<d> ffff880c93df23f8 ffff881027fa2080 ffff880c93df2450 ffff880c93df23e0
Call Trace:
 [<ffffffffa0178ef5>] ? kvm_arch_vcpu_ioctl_run+0x325/0x1150 [kvm]
 [<ffffffff8105231d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffffa0161ff4>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
 [<ffffffff8108512d>] ? __sigqueue_free+0x3d/0x50
 [<ffffffffa0164636>] ? kvm_dev_ioctl+0xa6/0x4b0 [kvm]
 [<ffffffff81088872>] ? __dequeue_signal+0x102/0x200
 [<ffffffff81195742>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff81195c0a>] ? do_vfs_ioctl+0x3aa/0x580
 [<ffffffff81195e61>] ? sys_ioctl+0x81/0xa0
 [<ffffffff810dcad5>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Code: 00 00 b0 01 84 c0 75 13 48 8b 93 50 0a 00 00 31 c0 48 89 d7 48 0f
ae 0f 5b c9 c3 b8 ff ff ff ff 48 8b bb 50 0a 00 00 31 c9 89 c2 <48> 0f
ae 2f 5b c9 c3 66 0f 1f 84 00 00 00 00 00 48 8b 81 30 07
Call Trace:
 [<ffffffffa0178ef5>] ? kvm_arch_vcpu_ioctl_run+0x325/0x1150 [kvm]
 [<ffffffff8105231d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffffa0161ff4>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
 [<ffffffff8108512d>] ? __sigqueue_free+0x3d/0x50
 [<ffffffffa0164636>] ? kvm_dev_ioctl+0xa6/0x4b0 [kvm]
 [<ffffffff81088872>] ? __dequeue_signal+0x102/0x200
 [<ffffffff81195742>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff81195c0a>] ? do_vfs_ioctl+0x3aa/0x580
 [<ffffffff81195e61>] ? sys_ioctl+0x81/0xa0
 [<ffffffff810dcad5>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b


Version-Release number of selected component (if applicable):


qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64

AMD Opteron(tm) Processor 4365 EE              

Linux public-comp062.os1.phx2.redhat.com 2.6.32-358.118.1.openstack.el6.x86_64 #1 SMP Wed Aug 14 13:18:08 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

and:

Linux comp039.os1.phx2.redhat.com 2.6.32-358.18.1.el6.x86_64 #1 SMP Fri Aug 2 17:04:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux



How reproducible:

random, but not infrequent

Steps to Reproduce:
1.  Launch a number of VMs on the host and leave them running; the lockup shows up at random (see the sketch below for one way to script this)
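For reference, one way to script step 1 on a test host with pre-defined libvirt guests; the guest names and count below are placeholders, not the actual instances from this deployment:

#!/usr/bin/env python
# Illustrative reproduction helper (placeholder guest names/count): start a
# batch of already-defined libvirt domains via virsh and leave them running.
import subprocess

GUESTS = ["lockup-test-%02d" % i for i in range(1, 9)]   # hypothetical names

for name in GUESTS:
    # "virsh start" boots a defined-but-inactive domain; it returns non-zero
    # if the domain is already running or does not exist.
    rc = subprocess.call(["virsh", "start", name])
    print("%-16s %s" % (name, "started" if rc == 0 else "skipped (rc=%d)" % rc))

Then leave the guests running under normal load and watch the host's dmesg for the soft lockup message.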

Comment 2 Dan Yocum 2014-02-07 16:14:42 UTC
maybe relevant to this bug - these came up when I rebooted the host node and VMs:

kvm: 11517: cpu0 unhandled rdmsr: 0xc0011021
kvm: 11517: cpu0 unhandled rdmsr: 0xc0010112
kvm: 11517: cpu0 unhandled rdmsr: 0xc0010001
kvm: 11517: cpu1 unhandled rdmsr: 0xc0011021
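These are reads of AMD-specific MSRs (the 0xc001xxxx range) that KVM does not emulate. For comparison, a throwaway sketch for dumping what the bare-metal host returns for the same registers; it assumes root and the msr module loaded (modprobe msr), and is illustrative rather than anything from the sosreport:

#!/usr/bin/env python
# Illustrative only: read the MSRs reported above directly on the host via
# /dev/cpu/<n>/msr (requires root and the msr kernel module).
import os, struct

MSRS = [0xc0011021, 0xc0010112, 0xc0010001]   # values from the log above

def rdmsr(cpu, msr):
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        os.lseek(fd, msr, os.SEEK_SET)   # the MSR number is the file offset
        return struct.unpack("<Q", os.read(fd, 8))[0]
    finally:
        os.close(fd)

for msr in MSRS:
    try:
        print("MSR 0x%08x = 0x%016x" % (msr, rdmsr(0, msr)))
    except OSError as e:
        print("MSR 0x%08x: read failed (%s)" % (msr, e))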

Comment 3 Kashyap Chamarthy 2014-02-08 13:33:02 UTC
Similar soft-lockup I've seen as part of upstream QEMU/KVM testing with Intel Haswell machines:
  
  https://bugzilla.redhat.com/show_bug.cgi?id=1058209 
    [3.14.0-0.rc0.git9.1.fc21] Booting into a guest on Intel 
    Haswell (bare-metal) throws soft lockups [qemu-system-x86:911]

Its associated upstream Kernel bug:

  https://bugzilla.kernel.org/show_bug.cgi?id=69491

Comment 4 Andrew Jones 2014-04-15 14:53:07 UTC
Just dug this BZ up out of my backlog. I checked the customer portal and see the case is closed, but grabbed the sos report anyway. However, it must not be the same one, because this sos_commands/general/dmesg is for a 2.6.32-220.el6.x86_64 kernel, not 2.6.32-358.118.1.openstack.el6.x86_64, as above. Anyway, I'd prefer they move to 6.5.z and see if it reproduces there before putting too much effort into the issue.

OTOH, the dmesg in this sos report does show soft lockups for qemu-kvm, but I also see evidence that tracing was enabled at the time. So, in any case, I don't believe we have good enough information for this [now closed] customer case in order to proceed. I'm going to close as INSU for now; of course it can be reopened if necessary.
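(For reference, the check above amounts to pulling the "Linux version" banner and the soft-lockup lines out of the sosreport's dmesg; a throwaway sketch, with the path taken from this comment rather than verified against the actual archive layout:)

#!/usr/bin/env python
# Throwaway triage helper: report which kernel a dmesg capture came from and
# how many soft-lockup messages it contains.
import re, sys

path = sys.argv[1] if len(sys.argv) > 1 else "sos_commands/general/dmesg"

versions = set()
lockups = 0
with open(path) as f:
    for line in f:
        m = re.search(r"Linux version (\S+)", line)
        if m:
            versions.add(m.group(1))
        if "soft lockup" in line:
            lockups += 1

print("kernel version(s): %s" % (", ".join(sorted(versions)) or "none found"))
print("soft lockup lines: %d" % lockups)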

Comment 5 Dan Yocum 2014-04-15 16:54:44 UTC
(In reply to Andrew Jones from comment #4)
> Just dug this BZ up out of my backlog. I checked the customer portal and see
> the case is closed, but grabbed the sos report anyway. However, it must not
> be the same one, because this sos_commands/general/dmesg is for a
> 2.6.32-220.el6.x86_64 kernel, not 2.6.32-358.118.1.openstack.el6.x86_64, as
> above. Anyway, I'd prefer they move to 6.5.z and see if it reproduces there
> before putting too much effort into the issue. OTOH, the dmesg in this sos
> report does show soft lockups for qemu-kvm, but I also see evidence that
> tracing was enabled at the time. So, in any case, I don't believe we have
> good enough information for this [now closed] customer case in order to
> proceed. I'm going to close as INSU for now, of course it can be reopened if
> necessary.

Sounds reasonable - we're now at RHEL 6.5, the kernel on the compute nodes is 2.6.32-431.5.1.el6.x86_64, and I don't think we've seen the soft lockup since.