| Summary: | Multiple soft lockups in a short period of time | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Rodney McKee <rodney.mckee> | ||||
| Component: | kvm | Assignee: | Karen Noel <knoel> | ||||
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 5.2 | CC: | anton, areis, cdahlin, cww, jduncan, jfeeney, knoel, michen, mkenneth, rdassen, rhod, riel, shuang, tburke, virt-maint | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2012-05-27 15:55:49 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | |||||||
| Bug Blocks: | 743405, 807974 | ||||||
| Attachments: |
|
||||||
|
Description
Rodney McKee
2011-04-07 05:25:29 UTC
Note this is a KVM guest system running on host with: Linux 2.6.18-194.17.4.el5 #1 SMP Wed Oct 20 13:03:08 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux a second server sunning on the host also experienced similar soft lockups in the same period: Mar 23 12:14:38 app23 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [puppetd:2118] Mar 23 12:14:39 app23 kernel: CPU 0: Mar 23 12:14:39 app23 kernel: Modules linked in: nfs lockd fscache nfs_acl sunrpc ipv6 xfrm_nalgo crypto_api bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ide_cd 8139too i2c_piix4 8139cp i2c_core mii serio_raw pcspkr cdrom ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Mar 23 12:14:39 app23 kernel: Pid: 2118, comm: puppetd Not tainted 2.6.18-92.1.6.el5 #1 Mar 23 12:14:39 app23 kernel: RIP: 0010:[<ffffffff8003d278>] [<ffffffff8003d278>] rt_run_flush+0x61/0xb8 Mar 23 12:14:39 app23 kernel: RSP: 0000:ffffffff80418ef0 EFLAGS: 00010282 Mar 23 12:14:39 app23 kernel: RAX: ffff81007f5a1ff8 RBX: 0000000000000000 RCX: ffff8100023783e0 Mar 23 12:14:39 app23 kernel: RDX: 000000000000000b RSI: ffffffff8022d030 RDI: ffff81007ff14ffc Mar 23 12:14:39 app23 kernel: RBP: ffffffff80418e70 R08: 0000000000001000 R09: 0000000000000000 Mar 23 12:14:39 app23 kernel: R10: ffffffff80418cd0 R11: 0000000000000154 R12: ffffffff8005dc8e Mar 23 12:14:39 app23 kernel: R13: 0000000000000ffc R14: ffffffff80077090 R15: ffffffff80418e70 Mar 23 12:14:51 app23 kernel: FS: 00002b120e8b0f50(0000) GS:ffffffff8039f000(0000) knlGS:0000000000000000 Mar 23 12:14:57 app23 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 23 12:15:05 app23 kernel: CR2: 00002b8e83900000 CR3: 0000000072422000 CR4: 00000000000006e0 Mar 23 12:15:09 app23 kernel: Mar 23 12:15:15 app23 kernel: Call Trace: Mar 23 12:15:21 app23 kernel: <IRQ> [<ffffffff8022ea08>] rt_secret_rebuild+0x0/0x26 Mar 23 12:15:21 app23 kernel: [<ffffffff8022ea17>] rt_secret_rebuild+0xf/0x26 Mar 23 12:15:21 app23 kernel: [<ffffffff800953c2>] run_timer_softirq+0x133/0x1af Mar 23 12:15:22 app23 kernel: [<ffffffff80011ed2>] __do_softirq+0x5e/0xd6 Mar 23 12:15:24 app23 kernel: [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 Mar 23 12:15:25 app23 kernel: [<ffffffff8006c6e4>] do_softirq+0x2c/0x85 Mar 23 12:15:26 app23 kernel: [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c Mar 23 12:15:28 app23 kernel: <EOI> Mar 23 12:15:29 app23 kernel: BUG: soft lockup - CPU#0 stuck for 28s! [puppetd:2118] Mar 23 12:15:29 app23 kernel: CPU 0: Mar 23 12:15:29 app23 kernel: Modules linked in: nfs lockd fscache nfs_acl sunrpc ipv6 xfrm_nalgo crypto_api bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ide_cd 8139too i2c_piix4 8139cp i2c_core mii serio_raw pcspkr cdrom ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Mar 23 12:15:32 app23 kernel: Pid: 2118, comm: puppetd Not tainted 2.6.18-92.1.6.el5 #1 Mar 23 12:15:35 app23 kernel: RIP: 0010:[<ffffffff80010970>] [<ffffffff80010970>] handle_IRQ_event+0x1b/0x58 Mar 23 12:15:41 app23 kernel: RSP: 0000:ffffffff80418dd8 EFLAGS: 00000246 Mar 23 12:15:41 app23 kernel: RAX: 0000000000010000 RBX: 0000000000000e00 RCX: 000000000000000e Mar 23 12:15:42 app23 kernel: RDX: ffff81007f94b840 RSI: ffffffff80418e48 RDI: 000000000000000e Mar 23 12:15:44 app23 kernel: RBP: ffffffff80418d50 R08: 0000000000001000 R09: 0000000000000000 Mar 23 12:15:45 app23 kernel: R10: ffffffff80418cd0 R11: 0000000000000154 R12: ffffffff8005dc8e Mar 23 12:15:46 app23 kernel: R13: ffff81007f94b840 R14: ffffffff80077090 R15: ffffffff80418d50 Mar 23 12:15:46 app23 kernel: FS: 00002b120e8b0f50(0000) GS:ffffffff8039f000(0000) knlGS:0000000000000000 Mar 23 12:15:47 app23 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 23 12:15:49 app23 kernel: CR2: 00002b8e83900000 CR3: 0000000072422000 CR4: 00000000000006e0 Mar 23 12:15:52 app23 kernel: Mar 23 12:15:52 app23 kernel: Call Trace: Mar 23 12:15:53 app23 kernel: <IRQ> [<ffffffff800b7359>] __do_IRQ+0xa4/0x103 Mar 23 12:15:53 app23 kernel: [<ffffffff8006c567>] do_IRQ+0xe7/0xf5 Mar 23 12:15:54 app23 kernel: [<ffffffff8005d615>] ret_from_intr+0x0/0xa Mar 23 12:15:55 app23 kernel: [<ffffffff8022d030>] dst_rcu_free+0x0/0x3a Mar 23 12:15:56 app23 kernel: [<ffffffff8003d278>] rt_run_flush+0x61/0xb8 Mar 23 12:15:57 app23 kernel: [<ffffffff8022ea08>] rt_secret_rebuild+0x0/0x26 Mar 23 12:15:57 app23 kernel: [<ffffffff8022ea17>] rt_secret_rebuild+0xf/0x26 Mar 23 12:15:57 app23 kernel: [<ffffffff800953c2>] run_timer_softirq+0x133/0x1af Mar 23 12:15:58 app23 kernel: [<ffffffff80011ed2>] __do_softirq+0x5e/0xd6 Mar 23 12:15:59 app23 kernel: [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 Mar 23 12:15:59 app23 kernel: [<ffffffff8006c6e4>] do_softirq+0x2c/0x85 Mar 23 12:16:00 app23 kernel: [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c Mar 23 12:16:02 app23 kernel: <EOI> Mar 23 12:16:11 app23 kernel: BUG: soft lockup - CPU#1 stuck for 28s! [klogd:23683] Mar 23 12:16:15 app23 kernel: CPU 1: Mar 23 12:16:18 app23 kernel: Modules linked in: nfs lockd fscache nfs_acl sunrpc ipv6 xfrm_nalgo crypto_api bonding dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ide_cd 8139too i2c_piix4 8139cp i2c_core mii serio_raw pcspkr cdrom ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Mar 23 12:16:19 app23 kernel: Pid: 23683, comm: klogd Not tainted 2.6.18-92.1.6.el5 #1 Mar 23 12:16:20 app23 kernel: RIP: 0010:[<ffffffff80062460>] [<ffffffff80062460>] __sched_text_start+0x0/0xaeb Mar 23 12:16:21 app23 kernel: RSP: 0000:ffff81004f25bf50 EFLAGS: 00000202 Mar 23 12:16:22 app23 kernel: RAX: ffff81004f25bf58 RBX: 00002aea430a5520 RCX: 0000000000000000 Mar 23 12:16:25 app23 kernel: RDX: 00000000000000a8 RSI: ffff81000276e000 RDI: ffff81004f25c000 Mar 23 12:16:26 app23 kernel: RBP: 0000000000000004 R08: 000000000000002b R09: 00002aea430c1000 Mar 23 12:16:26 app23 kernel: R10: ffffffffffffffff R11: 00002aea430a5920 R12: ffff81004f25a000 Mar 23 12:16:27 app23 kernel: R13: ffff81004d90e860 R14: 0000000000000000 R15: ffffffff800625df Mar 23 12:16:28 app23 kernel: FS: 00002aea436166e0(0000) GS:ffff810002741840(0000) knlGS:0000000000000000 Mar 23 12:16:30 app23 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Mar 23 12:16:32 app23 kernel: CR2: 000000357f4c8fb0 CR3: 0000000039b0b000 CR4: 00000000000006e0 Mar 23 12:16:33 app23 kernel: Mar 23 12:16:33 app23 kernel: Call Trace: Mar 23 12:16:34 app23 kernel: [<ffffffff80064e15>] paranoid_schedule1+0x6/0x9 Mar 23 12:16:35 app23 kernel: This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release. Avi, Is this related to https://bugzilla.redhat.com/show_bug.cgi?id=782631? Could be. Was pinning used in this scenario? Hi avi, Can not reproduce this issue with kvm-83-251.el5 1. bootup two guests, guestA and guestB (both with 2G mem and 2 vcpus, virtio) 2. pin guestA to physical cpu 2, pin guestB to physical cpu 1,3 3. run attached script in guestA 4. keep rebooting in guestB don't reproduce till now (around 4 hours), keep on testing guest: 2.6.18-308.el5 x86_64 host: 2.6.18-308.el5 kvm-83-251.el5 cmd: /usr/libexec/qemu-kvm -name 308_249 -monitor stdio \ -serial unix:/tmp/serial-shuang,server,nowait \ -drive file=/home/bak.qcow2,index=0,if=virtio,media=disk,cache=none,boot=on,format=qcow2 \ -net nic,vlan=0,model=virtio,macaddr=90:01:8a:76:45:a0 -net tap,vlan=0,script=/home/autotest-devel/client/tests/kvm/scripts/qemu-ifup-switch -net nic,vlan=1,model=virtio,macaddr=90:01:8a:76:45:71 -net tap,vlan=1,script=/home/autotest-devel/client/tests/kvm/scripts/qemu-ifup-switch -m 2048 -smp 2,cores=2,threads=1,sockets=1 -soundhw ac97 -vnc :1 -rtc-td-hack -M rhel5.4.0 -boot c -no-kvm-pit-reinjection -usbdevice tablet -balloon virtio Created attachment 577980 [details]
io-idle-race script
hi avi, get the following call trace when pin two guest vcpu to one physical cpu, and run dd, guest not hang INFO: task syslogd:2264 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syslogd D ffff810001004420 0 2264 1 2267 2250 (NOTLB) ffff81012efb9d98 0000000000000082 ffff81012efb9d18 0000000000000246 0000000000000000 0000000000000009 ffff81013a2be7e0 ffffffff80319b60 0000013804c8a315 0000000000005098 ffff81013a2be9c8 00000000880317be Call Trace: [<ffffffff88036e55>] :jbd:log_wait_commit+0xa3/0xf5 [<ffffffff800a3444>] autoremove_wake_function+0x0/0x2e [<ffffffff8803179a>] :jbd:journal_stop+0x1d3/0x203 [<ffffffff8002f997>] __writeback_single_inode+0x1dd/0x31c [<ffffffff800e3fac>] do_readv_writev+0x26e/0x291 [<ffffffff800f73d2>] sync_inode+0x24/0x33 [<ffffffff8804c37e>] :ext3:ext3_sync_file+0xce/0xf8 [<ffffffff8004fe8d>] do_fsync+0x52/0xa4 [<ffffffff800e4839>] __do_fsync+0x23/0x36 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task irqbalance:2347 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. irqbalance D ffff810001004420 0 2347 1 2380 2267 (NOTLB) ffff81012e88fc58 0000000000000082 ffff810000019c10 ffff81012e88fc38 0000000000000000 0000000000000007 ffff81013ad3c0c0 ffffffff80319b60 00000141452226fa 0000000000019fc1 ffff81013ad3c2a8 0000000000000000 Call Trace: [<ffffffff80063c63>] __mutex_lock_slowpath+0x60/0x9b [<ffffffff80063cad>] .text.lock.mutex+0xf/0x14 [<ffffffff8000d055>] do_lookup+0xd1/0x227 [<ffffffff8000a29d>] __link_path_walk+0x9f4/0xf39 [<ffffffff8000eb31>] link_path_walk+0x45/0xb8 [<ffffffff8000ce04>] do_path_lookup+0x294/0x310 [<ffffffff80130a4b>] selinux_file_alloc_security+0x2a/0x54 [<ffffffff8002384b>] __path_lookup_intent_open+0x56/0x97 [<ffffffff8001b120>] open_namei+0x73/0x6c0 [<ffffffff8004e22e>] lock_hrtimer_base+0x26/0x4c [<ffffffff80027607>] do_filp_open+0x1c/0x38 [<ffffffff80063cf9>] do_nanosleep+0x47/0x70 [<ffffffff80019fd3>] do_sys_open+0x44/0xbe [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task pcscd:2598 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. pcscd D ffff810001004420 0 2598 1 2602 2586 (NOTLB) ffff81012e653dd8 0000000000000086 ffffffff00000257 ffffffff00000000 0000000300000001 0000000000000007 ffff81013aa5e820 ffffffff80319b60 0000014326023349 0000000000014d02 ffff81013aa5ea08 000000003aa5e820 Call Trace: [<ffffffff8013057f>] inode_has_perm+0x56/0x63 [<ffffffff80064a1b>] __down+0xc3/0xd8 [<ffffffff8008ee72>] default_wake_function+0x0/0xe [<ffffffff800646d9>] __down_failed+0x35/0x3a [<ffffffff80064cdd>] .text.lock.kernel_lock+0xf/0x1f [<ffffffff80041ebc>] do_ioctl+0x3d/0x6b [<ffffffff8002ff2d>] vfs_ioctl+0x457/0x4b9 [<ffffffff8004c26c>] sys_ioctl+0x59/0x78 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 (In reply to comment #12) > hi avi, > get the following call trace when pin two guest vcpu to one physical cpu, > and run dd, guest not hang > This is with kvm-83-251.el5? (In reply to comment #13) > (In reply to comment #12) > > hi avi, > > get the following call trace when pin two guest vcpu to one physical cpu, > > and run dd, guest not hang > > > > This is with kvm-83-251.el5? yes If you pin vcpus, this is somewhat expected. If the host is overloaded, this could happen as well. Maybe we should use steal time to detect that the guest is not running all the time. Copying Rik for an opinion. 120 seconds should not be CPU scheduler induced latency. (In reply to comment #17) > 120 seconds should not be CPU scheduler induced latency. Suppose we're overcommitted, and smp_calling_function_single() within a spinlock. Without PLE, the IPI latency is magnified to the time slice. I don't think this is the case here (and steal time won't help if it were) though. The time slice is on the order of 30 milliseconds max. That means a 120 second timeout is at least 3600 full time slices. That is why I doubt that the CPU scheduler is causing this 120 second latency. |