Bug 694353

Summary: Multiple soft lockups in a short period of time
Product: Red Hat Enterprise Linux 5 Reporter: Rodney McKee <rodney.mckee>
Component: kvmAssignee: Karen Noel <knoel>
Status: CLOSED WONTFIX QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent Docs Contact:
Priority: medium    
Version: 5.2CC: anton, areis, cdahlin, cww, jduncan, jfeeney, knoel, michen, mkenneth, rdassen, rhod, riel, shuang, tburke, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-05-27 15:55:49 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 743405, 807974    
Attachments:
Description Flags
io-idle-race script none

Description Rodney McKee 2011-04-07 05:25:29 UTC
Description of problem:
System appears to just stop/pause

Version-Release number of selected component (if applicable):
2.6.18-92.1.6.el5 #1 SMP Fri Jun 20 02:36:06 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Not yet able to trigger

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Multiple lockups in a short period of time:

Linux 2.6.18-92.1.6.el5 #1 SMP Fri Jun 20 02:36:06 EDT 2008 x86_64 x86_64
x86_64 GNU/Linux


Mar 23 12:13:57 web3 kernel: BUG: soft lockup - CPU#1 stuck for 11s!
[pmcd:2172]
Mar 23 12:14:09 web3 kernel: CPU 1:
Mar 23 12:14:09 web3 kernel: Modules linked in: ipv6 xfrm_nalgo crypto_api
bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport floppy 8139too 8139cp ide_cd
i2c_piix4 mii pcspkr serio_raw cdrom i2c_core ata_piix libata sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:14:09 web3 kernel: Pid: 2172, comm: pmcd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:14:30 web3 kernel: RIP: 0010:[<ffffffff80011ec7>] 
[<ffffffff80011ec7>] __do_softirq+0x53/0xd6
Mar 23 12:14:30 web3 kernel: RSP: 0018:ffff81000276bf60  EFLAGS: 00000206
Mar 23 12:14:30 web3 kernel: RAX: 0000000000000022 RBX: ffffffff803b6f80 RCX:
0000000000000500
Mar 23 12:14:30 web3 kernel: RDX: ffff81006be95fd8 RSI: 0000000000000080 RDI:
ffff81006c0887e0
Mar 23 12:14:30 web3 kernel: RBP: ffff81000276bee0 R08: 0000000000000001 R09:
000000000000003f
Mar 23 12:14:30 web3 kernel: R10: ffff81007ff0c008 R11: 0000000000000000 R12:
ffffffff8005dc8e
Mar 23 12:14:30 web3 kernel: R13: 0000000000000022 R14: ffffffff80077090 R15:
ffff81000276bee0
Mar 23 12:14:30 web3 kernel: FS:  00002ac4a3c64da0(0000)
GS:ffff810002741840(0000) knlGS:0000000000000000
Mar 23 12:14:30 web3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 12:15:31 web3 kernel: CR2: 00002aaaaabca1e0 CR3: 000000006bdab000 CR4:
00000000000006e0
Mar 23 12:16:47 web3 kernel: 
Mar 23 12:16:52 web3 kernel: Call Trace:
Mar 23 12:17:07 web3 kernel:  <IRQ>  [<ffffffff8005e2fc>]
call_softirq+0x1c/0x28
Mar 23 12:17:23 web3 kernel:  [<ffffffff8006c6e4>] do_softirq+0x2c/0x85
Mar 23 12:17:26 web3 kernel:  [<ffffffff8005dc8e>]
apic_timer_interrupt+0x66/0x6c
Mar 23 12:17:27 web3 kernel:  <EOI>  [<ffffffff800a5184>] s_show+0x167/0x293
Mar 23 12:17:27 web3 kernel:  [<ffffffff800a509e>] s_show+0x81/0x293
Mar 23 12:17:29 web3 kernel:  [<ffffffff8003f209>] seq_read+0x1b8/0x28c
Mar 23 12:17:31 web3 kernel:  [<ffffffff8000b337>] vfs_read+0xcb/0x171
Mar 23 12:17:31 web3 kernel:  [<ffffffff80011715>] sys_read+0x45/0x6e
Mar 23 12:17:31 web3 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 23 12:17:38 web3 kernel: 
Mar 23 12:17:38 web3 kernel: BUG: soft lockup - CPU#1 stuck for 50s!
[pmcd:2172]
Mar 23 12:17:39 web3 kernel: CPU 1:
Mar 23 12:17:39 web3 kernel: Modules linked in: ipv6 xfrm_nalgo crypto_api
bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport floppy 8139too 8139cp ide_cd
i2c_piix4 mii pcspkr serio_raw cdrom i2c_core ata_piix libata sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:17:42 web3 kernel: Pid: 2172, comm: pmcd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:17:43 web3 kernel: RIP: 0010:[<ffffffff800a5184>] 
[<ffffffff800a5184>] s_show+0x167/0x293
Mar 23 12:17:45 web3 kernel: RSP: 0018:ffff81006be95e58  EFLAGS: 00000206
Mar 23 12:17:45 web3 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff81007ffeeee0
Mar 23 12:17:47 web3 kernel: RDX: 0000000000000001 RSI: 0000000000000028 RDI:
ffff81007ffeef00
Mar 23 12:17:47 web3 kernel: RBP: 000000008005d67c R08: 0000000000000006 R09:
0000000000000020
Mar 23 12:17:48 web3 kernel: R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000001
Mar 23 12:17:48 web3 kernel: R13: 00002ac4a42c3000 R14: 0000000000000020 R15:
0000000000000000
Mar 23 12:17:48 web3 kernel: FS:  00002ac4a3c64da0(0000)
GS:ffff810002741840(0000) knlGS:0000000000000000
Mar 23 12:17:49 web3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 12:17:50 web3 kernel: CR2: 00002aaaaabca1e0 CR3: 000000006bdab000 CR4:
00000000000006e0
Mar 23 12:17:50 web3 kernel: 
Mar 23 12:17:51 web3 kernel: Call Trace:
Mar 23 12:17:51 web3 kernel:  [<ffffffff800a509e>] s_show+0x81/0x293
Mar 23 12:17:52 web3 kernel:  [<ffffffff8003f209>] seq_read+0x1b8/0x28c
Mar 23 12:17:55 web3 kernel:  [<ffffffff8000b337>] vfs_read+0xcb/0x171
Mar 23 12:17:57 web3 kernel:  [<ffffffff80011715>] sys_read+0x45/0x6e
Mar 23 12:18:00 web3 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 23 12:18:01 web3 kernel: 
Mar 23 12:18:03 web3 kernel: BUG: soft lockup - CPU#1 stuck for 17s!
[pmcd:2172]
Mar 23 12:18:08 web3 kernel: CPU 1:
Mar 23 12:18:12 web3 kernel: Modules linked in: ipv6 xfrm_nalgo crypto_api
bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport floppy 8139too 8139cp ide_cd
i2c_piix4 mii pcspkr serio_raw cdrom i2c_core ata_piix libata sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:18:14 web3 kernel: Pid: 2172, comm: pmcd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:18:18 web3 kernel: RIP: 0010:[<ffffffff800a5184>] 
[<ffffffff800a5184>] s_show+0x167/0x293
Mar 23 12:18:20 web3 kernel: RSP: 0018:ffff81006be95e58  EFLAGS: 00000206
Mar 23 12:18:21 web3 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff81007fccb6e0
Mar 23 12:18:24 web3 kernel: RDX: 0000000000000001 RSI: 0000000000000005 RDI:
ffff81007fccb700
Mar 23 12:18:24 web3 kernel: RBP: 0000000000000000 R08: 0000000000000006 R09:
0000000000000020
Mar 23 12:18:26 web3 kernel: R10: 0000000000000000 R11: 0000000000000000 R12:
ffff81007dc33740
Mar 23 12:18:26 web3 kernel: R13: 00002ac4a42c3000 R14: 0000000000000020 R15:
0000000000000000
Mar 23 12:18:27 web3 kernel: FS:  00002ac4a3c64da0(0000)
GS:ffff810002741840(0000) knlGS:0000000000000000
Mar 23 12:18:28 web3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 23 12:18:28 web3 kernel: CR2: 00002ac4a42c5000 CR3: 000000006bdab000 CR4:
00000000000006e0
Mar 23 12:18:28 web3 kernel: 
Mar 23 12:18:28 web3 kernel: Call Trace:
Mar 23 12:18:29 web3 kernel:  [<ffffffff800a509e>] s_show+0x81/0x293
Mar 23 12:18:29 web3 kernel:  [<ffffffff8003f209>] seq_read+0x1b8/0x28c
Mar 23 12:18:30 web3 kernel:  [<ffffffff8000b337>] vfs_read+0xcb/0x171
Mar 23 12:18:30 web3 kernel:  [<ffffffff80011715>] sys_read+0x45/0x6e
Mar 23 12:18:30 web3 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 23 12:18:31 web3 kernel: 
Mar 23 12:18:31 web3 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[pmcd:2172]
Mar 23 12:18:32 web3 kernel: CPU 1:
Mar 23 12:18:34 web3 kernel: Modules linked in: ipv6 xfrm_nalgo crypto_api
bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport floppy 8139too 8139cp ide_cd
i2c_piix4 mii pcspkr serio_raw cdrom i2c_core ata_piix libata sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:18:35 web3 kernel: Pid: 2172, comm: pmcd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:18:37 web3 kernel: RIP: 0010:[<ffffffff800a5184>] 
[<ffffffff800a5184>] s_show+0x167/0x293
Mar 23 12:18:41 web3 kernel: RSP: 0018:ffff81006be95e58  EFLAGS: 00000202
Mar 23 12:18:42 web3 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff81007ffd07e0
Mar 23 12:18:42 web3 kernel: RDX: 0000000000000001 RSI: 0000000000000007 RDI:
ffff81007ffd0800
Mar 23 12:18:43 web3 kernel: RBP: 000000008005d67c R08: 0000000000000006 R09:
0000000000000020
Mar 23 12:18:44 web3 kernel: R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000001
Mar 23 12:18:44 web3 kernel: R13: 00002ac4a42c3000 R14: 0000000000000020 R15:
0000000000000000
Mar 23 12:18:44 web3 kernel: FS:  00002ac4a3c64da0(0000)
GS:ffff810002741840(0000) knlGS:0000000000000000
Mar 23 12:18:45 web3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 12:18:47 web3 kernel: CR2: 00002ac4a42c5000 CR3: 000000006bdab000 CR4:
00000000000006e0
Mar 23 12:18:47 web3 kernel: 
Mar 23 12:18:48 web3 kernel: Call Trace:
Mar 23 12:18:52 web3 kernel:  [<ffffffff800a509e>] s_show+0x81/0x293
Mar 23 12:18:54 web3 kernel:  [<ffffffff8003f209>] seq_read+0x1b8/0x28c
Mar 23 12:18:54 web3 kernel:  [<ffffffff8000b337>] vfs_read+0xcb/0x171
Mar 23 12:18:55 web3 kernel:  [<ffffffff80011715>] sys_read+0x45/0x6e
Mar 23 12:18:56 web3 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 23 12:18:57 web3 kernel: 
Mar 23 12:18:59 web3 kernel: BUG: soft lockup - CPU#1 stuck for 41s!
[pmcd:2172]
Mar 23 12:18:59 web3 kernel: CPU 1:
Mar 23 12:19:00 web3 kernel: Modules linked in: ipv6 xfrm_nalgo crypto_api
bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport floppy 8139too 8139cp ide_cd
i2c_piix4 mii pcspkr serio_raw cdrom i2c_core ata_piix libata sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:19:00 web3 kernel: Pid: 2172, comm: pmcd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:19:01 web3 kernel: RIP: 0010:[<ffffffff80011ec7>] 
[<ffffffff80011ec7>] __do_softirq+0x53/0xd6
Mar 23 12:19:03 web3 kernel: RSP: 0018:ffff81000276bf60  EFLAGS: 00000206
Mar 23 12:19:04 web3 kernel: RAX: 0000000000000022 RBX: ffffffff803b6f80 RCX:
0000000000000300
Mar 23 12:19:06 web3 kernel: RDX: ffff81006be95fd8 RSI: 0000000000000080 RDI:
ffff81006c0887e0
Mar 23 12:19:10 web3 kernel: RBP: ffff81000276bee0 R08: 0000000000000001 R09:
000000000000003f
Mar 23 12:19:12 web3 kernel: R10: ffff81007ff0c008 R11: 0000000000000000 R12:
ffffffff8005dc8e
Mar 23 12:19:14 web3 kernel: R13: 0000000000000022 R14: ffffffff80077090 R15:
ffff81000276bee0
Mar 23 12:19:15 web3 kernel: FS:  00002ac4a3c64da0(0000)
GS:ffff810002741840(0000) knlGS:0000000000000000
Mar 23 12:19:17 web3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 12:19:19 web3 kernel: CR2: 00002ac4a42c5000 CR3: 000000006bdab000 CR4:
00000000000006e0
Mar 23 12:19:19 web3 kernel: 
Mar 23 12:19:20 web3 kernel: Call Trace:
Mar 23 12:19:20 web3 kernel:  <IRQ>  [<ffffffff8005e2fc>]
call_softirq+0x1c/0x28
Mar 23 12:19:20 web3 kernel:  [<ffffffff8006c6e4>] do_softirq+0x2c/0x85
Mar 23 12:19:20 web3 kernel:  [<ffffffff8005dc8e>]
apic_timer_interrupt+0x66/0x6c
Mar 23 12:19:20 web3 kernel:  <EOI>  [<ffffffff800a5184>] s_show+0x167/0x293
Mar 23 12:19:21 web3 kernel:  [<ffffffff800a509e>] s_show+0x81/0x293
Mar 23 12:19:21 web3 kernel:  [<ffffffff8003f209>] seq_read+0x1b8/0x28c
Mar 23 12:19:22 web3 kernel:  [<ffffffff8000b337>] vfs_read+0xcb/0x171
Mar 23 12:19:22 web3 kernel:  [<ffffffff80011715>] sys_read+0x45/0x6e
Mar 23 12:19:23 web3 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 23 12:19:24 web3 kernel: 
Mar 23 12:19:27 web3 kernel: BUG: soft lockup - CPU#0 stuck for 37s!
[bond0:1393]
Mar 23 12:19:28 web3 kernel: CPU 0:
Mar 23 12:19:29 web3 kernel: Modules linked in: ipv6 xfrm_nalgo crypto_api
bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport floppy 8139too 8139cp ide_cd
i2c_piix4 mii pcspkr serio_raw cdrom i2c_core ata_piix libata sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:19:29 web3 kernel: Pid: 1393, comm: bond0 Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:19:29 web3 kernel: RIP: 0010:[<ffffffff80064af8>] 
[<ffffffff80064af8>] _spin_unlock_irqrestore+0x8/0x9
Mar 23 12:19:30 web3 kernel: RSP: 0018:ffff81007b52fd18  EFLAGS: 00000282
Mar 23 12:19:30 web3 kernel: RAX: 0000000000000000 RBX: ffff81007fbd8000 RCX:
000005e1ffffff10
Mar 23 12:19:32 web3 kernel: RDX: 0000000000000100 RSI: 0000000000000282 RDI:
ffff81007fbd8510
Mar 23 12:19:33 web3 kernel: RBP: 0000000000000000 R08: ffff81007b52e000 R09:
ffff81007ffdb000
Mar 23 12:19:34 web3 kernel: R10: 0000000000000000 R11: ffffffff8011ce11 R12:
0000000000000000
Mar 23 12:19:35 web3 kernel: R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
Mar 23 12:19:35 web3 kernel: FS:  0000000000000000(0000)
GS:ffffffff8039f000(0000) knlGS:0000000000000000
Mar 23 12:19:36 web3 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar 23 12:19:37 web3 kernel: CR2: 00002ade80f45fe0 CR3: 0000000020857000 CR4:
00000000000006e0
Mar 23 12:19:39 web3 kernel: 
Mar 23 12:19:39 web3 kernel: Call Trace:
Mar 23 12:19:39 web3 kernel:  [<ffffffff88153b51>]
:8139cp:cp_get_settings+0x42/0x4e
Mar 23 12:19:40 web3 kernel:  [<ffffffff8826504a>]
:bonding:bond_update_speed_duplex+0x3a/0x109
Mar 23 12:19:41 web3 kernel:  [<ffffffff80062f4b>] thread_return+0x0/0xdf
Mar 23 12:19:43 web3 kernel:  [<ffffffff882668f1>]
:bonding:__bond_mii_monitor+0x399/0x444
Mar 23 12:19:44 web3 kernel:  [<ffffffff882679c2>]
:bonding:bond_mii_monitor+0x0/0x8c
Mar 23 12:19:45 web3 kernel:  [<ffffffff882679ef>]
:bonding:bond_mii_monitor+0x2d/0x8c
Mar 23 12:19:46 web3 kernel:  [<ffffffff8004cea9>] run_workqueue+0x94/0xe4
Mar 23 12:19:47 web3 kernel:  [<ffffffff800497be>] worker_thread+0x0/0x122
Mar 23 12:19:48 web3 kernel:  [<ffffffff8009dd45>]
keventd_create_kthread+0x0/0xc4
Mar 23 12:19:50 web3 kernel:  [<ffffffff800498ae>] worker_thread+0xf0/0x122
Mar 23 12:19:50 web3 kernel:  [<ffffffff8008ad76>]
default_wake_function+0x0/0xe
Mar 23 12:19:51 web3 kernel:  [<ffffffff8009dd45>]
keventd_create_kthread+0x0/0xc4
Mar 23 12:19:52 web3 kernel:  [<ffffffff8009dd45>]
keventd_create_kthread+0x0/0xc4
Mar 23 12:19:53 web3 kernel:  [<ffffffff8003253d>] kthread+0xfe/0x132
Mar 23 12:19:55 web3 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 23 12:19:56 web3 kernel:  [<ffffffff8009dd45>]
keventd_create_kthread+0x0/0xc4
Mar 23 12:19:57 web3 kernel:  [<ffffffff8003243f>] kthread+0x0/0x132
Mar 23 12:19:59 web3 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Mar 23 12:20:00 web3 kernel:

Comment 1 Rodney McKee 2011-04-07 05:27:44 UTC
Note this is a KVM guest system running on host with:
Linux 2.6.18-194.17.4.el5 #1 SMP Wed Oct 20 13:03:08 EDT 2010 x86_64 x86_64
x86_64 GNU/Linux

a second server sunning on the host also experienced similar soft lockups in
the same period:

Mar 23 12:14:38 app23 kernel: BUG: soft lockup - CPU#0 stuck for 10s!
[puppetd:2118]
Mar 23 12:14:39 app23 kernel: CPU 0:
Mar 23 12:14:39 app23 kernel: Modules linked in: nfs lockd fscache nfs_acl
sunrpc ipv6 xfrm_nalgo crypto_api bonding dm_mirror dm_multipath dm_mod video
sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp
parport floppy 
ide_cd 8139too i2c_piix4 8139cp i2c_core mii serio_raw pcspkr cdrom ata_piix
libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:14:39 app23 kernel: Pid: 2118, comm: puppetd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:14:39 app23 kernel: RIP: 0010:[<ffffffff8003d278>] 
[<ffffffff8003d278>] rt_run_flush+0x61/0xb8
Mar 23 12:14:39 app23 kernel: RSP: 0000:ffffffff80418ef0  EFLAGS: 00010282
Mar 23 12:14:39 app23 kernel: RAX: ffff81007f5a1ff8 RBX: 0000000000000000 RCX:
ffff8100023783e0
Mar 23 12:14:39 app23 kernel: RDX: 000000000000000b RSI: ffffffff8022d030 RDI:
ffff81007ff14ffc
Mar 23 12:14:39 app23 kernel: RBP: ffffffff80418e70 R08: 0000000000001000 R09:
0000000000000000
Mar 23 12:14:39 app23 kernel: R10: ffffffff80418cd0 R11: 0000000000000154 R12:
ffffffff8005dc8e
Mar 23 12:14:39 app23 kernel: R13: 0000000000000ffc R14: ffffffff80077090 R15:
ffffffff80418e70
Mar 23 12:14:51 app23 kernel: FS:  00002b120e8b0f50(0000)
GS:ffffffff8039f000(0000) knlGS:0000000000000000
Mar 23 12:14:57 app23 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 23 12:15:05 app23 kernel: CR2: 00002b8e83900000 CR3: 0000000072422000 CR4:
00000000000006e0
Mar 23 12:15:09 app23 kernel: 
Mar 23 12:15:15 app23 kernel: Call Trace:
Mar 23 12:15:21 app23 kernel:  <IRQ>  [<ffffffff8022ea08>]
rt_secret_rebuild+0x0/0x26
Mar 23 12:15:21 app23 kernel:  [<ffffffff8022ea17>] rt_secret_rebuild+0xf/0x26
Mar 23 12:15:21 app23 kernel:  [<ffffffff800953c2>]
run_timer_softirq+0x133/0x1af
Mar 23 12:15:22 app23 kernel:  [<ffffffff80011ed2>] __do_softirq+0x5e/0xd6
Mar 23 12:15:24 app23 kernel:  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
Mar 23 12:15:25 app23 kernel:  [<ffffffff8006c6e4>] do_softirq+0x2c/0x85
Mar 23 12:15:26 app23 kernel:  [<ffffffff8005dc8e>]
apic_timer_interrupt+0x66/0x6c
Mar 23 12:15:28 app23 kernel:  <EOI> 
Mar 23 12:15:29 app23 kernel: BUG: soft lockup - CPU#0 stuck for 28s!
[puppetd:2118]
Mar 23 12:15:29 app23 kernel: CPU 0:
Mar 23 12:15:29 app23 kernel: Modules linked in: nfs lockd fscache nfs_acl
sunrpc ipv6 xfrm_nalgo crypto_api bonding dm_mirror dm_multipath dm_mod video
sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp
parport floppy 
ide_cd 8139too i2c_piix4 8139cp i2c_core mii serio_raw pcspkr cdrom ata_piix
libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:15:32 app23 kernel: Pid: 2118, comm: puppetd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:15:35 app23 kernel: RIP: 0010:[<ffffffff80010970>] 
[<ffffffff80010970>] handle_IRQ_event+0x1b/0x58
Mar 23 12:15:41 app23 kernel: RSP: 0000:ffffffff80418dd8  EFLAGS: 00000246
Mar 23 12:15:41 app23 kernel: RAX: 0000000000010000 RBX: 0000000000000e00 RCX:
000000000000000e
Mar 23 12:15:42 app23 kernel: RDX: ffff81007f94b840 RSI: ffffffff80418e48 RDI:
000000000000000e
Mar 23 12:15:44 app23 kernel: RBP: ffffffff80418d50 R08: 0000000000001000 R09:
0000000000000000
Mar 23 12:15:45 app23 kernel: R10: ffffffff80418cd0 R11: 0000000000000154 R12:
ffffffff8005dc8e
Mar 23 12:15:46 app23 kernel: R13: ffff81007f94b840 R14: ffffffff80077090 R15:
ffffffff80418d50
Mar 23 12:15:46 app23 kernel: FS:  00002b120e8b0f50(0000)
GS:ffffffff8039f000(0000) knlGS:0000000000000000
Mar 23 12:15:47 app23 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 23 12:15:49 app23 kernel: CR2: 00002b8e83900000 CR3: 0000000072422000 CR4:
00000000000006e0
Mar 23 12:15:52 app23 kernel: 
Mar 23 12:15:52 app23 kernel: Call Trace:
Mar 23 12:15:53 app23 kernel:  <IRQ>  [<ffffffff800b7359>] __do_IRQ+0xa4/0x103
Mar 23 12:15:53 app23 kernel:  [<ffffffff8006c567>] do_IRQ+0xe7/0xf5
Mar 23 12:15:54 app23 kernel:  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Mar 23 12:15:55 app23 kernel:  [<ffffffff8022d030>] dst_rcu_free+0x0/0x3a
Mar 23 12:15:56 app23 kernel:  [<ffffffff8003d278>] rt_run_flush+0x61/0xb8
Mar 23 12:15:57 app23 kernel:  [<ffffffff8022ea08>] rt_secret_rebuild+0x0/0x26
Mar 23 12:15:57 app23 kernel:  [<ffffffff8022ea17>] rt_secret_rebuild+0xf/0x26
Mar 23 12:15:57 app23 kernel:  [<ffffffff800953c2>]
run_timer_softirq+0x133/0x1af
Mar 23 12:15:58 app23 kernel:  [<ffffffff80011ed2>] __do_softirq+0x5e/0xd6
Mar 23 12:15:59 app23 kernel:  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
Mar 23 12:15:59 app23 kernel:  [<ffffffff8006c6e4>] do_softirq+0x2c/0x85
Mar 23 12:16:00 app23 kernel:  [<ffffffff8005dc8e>]
apic_timer_interrupt+0x66/0x6c
Mar 23 12:16:02 app23 kernel:  <EOI> 
Mar 23 12:16:11 app23 kernel: BUG: soft lockup - CPU#1 stuck for 28s!
[klogd:23683]
Mar 23 12:16:15 app23 kernel: CPU 1:
Mar 23 12:16:18 app23 kernel: Modules linked in: nfs lockd fscache nfs_acl
sunrpc ipv6 xfrm_nalgo crypto_api bonding dm_mirror dm_multipath dm_mod video
sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp
parport floppy 
ide_cd 8139too i2c_piix4 8139cp i2c_core mii serio_raw pcspkr cdrom ata_piix
libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 23 12:16:19 app23 kernel: Pid: 23683, comm: klogd Not tainted
2.6.18-92.1.6.el5 #1
Mar 23 12:16:20 app23 kernel: RIP: 0010:[<ffffffff80062460>] 
[<ffffffff80062460>] __sched_text_start+0x0/0xaeb
Mar 23 12:16:21 app23 kernel: RSP: 0000:ffff81004f25bf50  EFLAGS: 00000202
Mar 23 12:16:22 app23 kernel: RAX: ffff81004f25bf58 RBX: 00002aea430a5520 RCX:
0000000000000000
Mar 23 12:16:25 app23 kernel: RDX: 00000000000000a8 RSI: ffff81000276e000 RDI:
ffff81004f25c000
Mar 23 12:16:26 app23 kernel: RBP: 0000000000000004 R08: 000000000000002b R09:
00002aea430c1000
Mar 23 12:16:26 app23 kernel: R10: ffffffffffffffff R11: 00002aea430a5920 R12:
ffff81004f25a000
Mar 23 12:16:27 app23 kernel: R13: ffff81004d90e860 R14: 0000000000000000 R15:
ffffffff800625df
Mar 23 12:16:28 app23 kernel: FS:  00002aea436166e0(0000)
GS:ffff810002741840(0000) knlGS:0000000000000000
Mar 23 12:16:30 app23 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 12:16:32 app23 kernel: CR2: 000000357f4c8fb0 CR3: 0000000039b0b000 CR4:
00000000000006e0
Mar 23 12:16:33 app23 kernel: 
Mar 23 12:16:33 app23 kernel: Call Trace:
Mar 23 12:16:34 app23 kernel:  [<ffffffff80064e15>] paranoid_schedule1+0x6/0x9
Mar 23 12:16:35 app23 kernel:

Comment 7 RHEL Program Management 2012-04-11 21:56:57 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 8 Ronen Hod 2012-04-12 11:36:41 UTC
Avi,

Is this related to https://bugzilla.redhat.com/show_bug.cgi?id=782631?

Comment 9 Avi Kivity 2012-04-15 13:24:40 UTC
Could be.  Was pinning used in this scenario?

Comment 10 Suqin Huang 2012-04-17 10:25:31 UTC
Hi avi,
Can not reproduce this issue with kvm-83-251.el5

1. bootup two guests, guestA and guestB (both with 2G mem and 2 vcpus, virtio)
2. pin guestA to physical cpu 2, pin guestB to physical cpu 1,3
3. run attached script in guestA
4. keep rebooting in guestB

don't reproduce till now (around 4 hours), keep on testing

guest: 2.6.18-308.el5 x86_64
host: 
2.6.18-308.el5
kvm-83-251.el5

cmd:
/usr/libexec/qemu-kvm -name 308_249 -monitor stdio \
-serial unix:/tmp/serial-shuang,server,nowait \
-drive file=/home/bak.qcow2,index=0,if=virtio,media=disk,cache=none,boot=on,format=qcow2 \
-net nic,vlan=0,model=virtio,macaddr=90:01:8a:76:45:a0 -net tap,vlan=0,script=/home/autotest-devel/client/tests/kvm/scripts/qemu-ifup-switch -net nic,vlan=1,model=virtio,macaddr=90:01:8a:76:45:71 -net tap,vlan=1,script=/home/autotest-devel/client/tests/kvm/scripts/qemu-ifup-switch -m 2048 -smp 2,cores=2,threads=1,sockets=1 -soundhw ac97 -vnc :1 -rtc-td-hack -M rhel5.4.0 -boot c -no-kvm-pit-reinjection -usbdevice tablet -balloon virtio

Comment 11 Suqin Huang 2012-04-17 10:26:44 UTC
Created attachment 577980 [details]
io-idle-race script

Comment 12 Suqin Huang 2012-04-19 09:39:35 UTC
hi avi,
get the following call trace when pin two guest vcpu to one physical cpu, 
and run dd, guest not hang

INFO: task syslogd:2264 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D ffff810001004420     0  2264      1          2267  2250 (NOTLB)
 ffff81012efb9d98 0000000000000082 ffff81012efb9d18 0000000000000246
 0000000000000000 0000000000000009 ffff81013a2be7e0 ffffffff80319b60
 0000013804c8a315 0000000000005098 ffff81013a2be9c8 00000000880317be
Call Trace:
 [<ffffffff88036e55>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a3444>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803179a>] :jbd:journal_stop+0x1d3/0x203
 [<ffffffff8002f997>] __writeback_single_inode+0x1dd/0x31c
 [<ffffffff800e3fac>] do_readv_writev+0x26e/0x291
 [<ffffffff800f73d2>] sync_inode+0x24/0x33
 [<ffffffff8804c37e>] :ext3:ext3_sync_file+0xce/0xf8
 [<ffffffff8004fe8d>] do_fsync+0x52/0xa4
 [<ffffffff800e4839>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task irqbalance:2347 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
irqbalance    D ffff810001004420     0  2347      1          2380  2267 (NOTLB)
 ffff81012e88fc58 0000000000000082 ffff810000019c10 ffff81012e88fc38
 0000000000000000 0000000000000007 ffff81013ad3c0c0 ffffffff80319b60
 00000141452226fa 0000000000019fc1 ffff81013ad3c2a8 0000000000000000
Call Trace:
 [<ffffffff80063c63>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063cad>] .text.lock.mutex+0xf/0x14
 [<ffffffff8000d055>] do_lookup+0xd1/0x227
 [<ffffffff8000a29d>] __link_path_walk+0x9f4/0xf39
 [<ffffffff8000eb31>] link_path_walk+0x45/0xb8
 [<ffffffff8000ce04>] do_path_lookup+0x294/0x310
 [<ffffffff80130a4b>] selinux_file_alloc_security+0x2a/0x54
 [<ffffffff8002384b>] __path_lookup_intent_open+0x56/0x97
 [<ffffffff8001b120>] open_namei+0x73/0x6c0
 [<ffffffff8004e22e>] lock_hrtimer_base+0x26/0x4c
 [<ffffffff80027607>] do_filp_open+0x1c/0x38
 [<ffffffff80063cf9>] do_nanosleep+0x47/0x70
 [<ffffffff80019fd3>] do_sys_open+0x44/0xbe
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task pcscd:2598 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pcscd         D ffff810001004420     0  2598      1          2602  2586 (NOTLB)
 ffff81012e653dd8 0000000000000086 ffffffff00000257 ffffffff00000000
 0000000300000001 0000000000000007 ffff81013aa5e820 ffffffff80319b60
 0000014326023349 0000000000014d02 ffff81013aa5ea08 000000003aa5e820
Call Trace:
 [<ffffffff8013057f>] inode_has_perm+0x56/0x63
 [<ffffffff80064a1b>] __down+0xc3/0xd8
 [<ffffffff8008ee72>] default_wake_function+0x0/0xe
 [<ffffffff800646d9>] __down_failed+0x35/0x3a
 [<ffffffff80064cdd>] .text.lock.kernel_lock+0xf/0x1f
 [<ffffffff80041ebc>] do_ioctl+0x3d/0x6b
 [<ffffffff8002ff2d>] vfs_ioctl+0x457/0x4b9
 [<ffffffff8004c26c>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 13 Avi Kivity 2012-04-22 16:14:16 UTC
(In reply to comment #12)
> hi avi,
> get the following call trace when pin two guest vcpu to one physical cpu, 
> and run dd, guest not hang
> 

This is with kvm-83-251.el5?

Comment 14 Suqin Huang 2012-04-23 02:29:27 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > hi avi,
> > get the following call trace when pin two guest vcpu to one physical cpu, 
> > and run dd, guest not hang
> > 
> 
> This is with kvm-83-251.el5?

yes

Comment 15 Suqin Huang 2012-04-26 02:11:12 UTC
Hi avi,
any suggestion for this bug, didn't get call trace as comment0, but get call as comment12

Comment 16 Avi Kivity 2012-05-24 12:26:32 UTC
If you pin vcpus, this is somewhat expected.  If the host is overloaded, this could happen as well.  Maybe we should use steal time to detect that the guest is not running all the time.  Copying Rik for an opinion.

Comment 17 Rik van Riel 2012-05-24 13:47:59 UTC
120 seconds should not be CPU scheduler induced latency.

Comment 20 Avi Kivity 2012-05-24 15:31:27 UTC
(In reply to comment #17)
> 120 seconds should not be CPU scheduler induced latency.

Suppose we're overcommitted, and smp_calling_function_single() within a spinlock.  Without PLE, the IPI latency is magnified to the time slice.

I don't think this is the case here (and steal time won't help if it were) though.

Comment 21 Rik van Riel 2012-05-24 15:34:39 UTC
The time slice is on the order of 30 milliseconds max.

That means a 120 second timeout is at least 3600 full time slices. That is why I doubt that the CPU scheduler is causing this 120 second latency.