rawhide host machine, with a number of kvm guests (using libvirt). When running the kernel-2.6.24.3-50.fc8.x86_64 kernel, everything works great. When running kernel-2.6.25-0.136.rc6.git5.fc9.x86_64, guests usually boot ok, but then when put under load (mock builds or the like), they start spewing oopses and get hung processes and then eventually stop responding to anything. An example oops:
BUG: soft lockup - CPU#2 stuck for 61s! [sh:2889]
CPU 2:
Modules linked in: rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop parport_pc parport floppy pcspkr e1000 button sg sr_mod cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 2889, comm: sh Not tainted 2.6.25-0.121.rc5.git4.fc9 #1
RIP: 0010:[<ffffffff8113bb95>] [<ffffffff8113bb95>] copy_page_c+0x5/0x10
RSP: 0018:ffff81001c9e1c20 EFLAGS: 00010286
RAX: ffff810000000000 RBX: ffff81001c9e1c88 RCX: 0000000000000200
RDX: aaaaaaaaaaaaaaab RSI: ffff81001b567000 RDI: ffff81001b31b000
RBP: ffff810000000000 R08: 000000001c8ae000 R09: ffff810000000000
R10: 0000000000015468 R11: 0000000000000001 R12: 00003ffffffff000
R13: ffff8100000096c8 R14: ffff81001c9e0000 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff81007fb5e578(0063) knlGS:00000000f7fb16c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: ffff81001b31b000 CR3: 0000000026198000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff8108be96>] ? do_wp_page+0x35a/0x54b
 [<ffffffff8108d7ac>] ? handle_mm_fault+0x685/0x703
 [<ffffffff812a4414>] ? do_page_fault+0x3f2/0x8b9
 [<ffffffff812a44cc>] ? do_page_fault+0x4aa/0x8b9
 [<ffffffff81053aab>] ? debug_check_no_locks_freed+0x120/0x12f
 [<ffffffff81053967>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff8103202d>] ? __mmdrop+0x92/0x9b
 [<ffffffff810a347a>] ? check_object+0x159/0x209
 [<ffffffff810a4de3>] ? __slab_free+0x28b/0x2d1
 [<ffffffff812a238d>] ? error_exit+0x0/0xa9
 [<ffffffff8113c3d0>] ? __put_user_4+0x20/0x30
 [<ffffffff810316d9>] ? schedule_tail+0x57/0x5b
 [<ffffffff8100bf0c>] ? ret_from_fork+0xc/0x25
 [<ffffffff812a1656>] ? trace_hardirqs_on_thunk+0x35/0x3a
Note, I went back and tried to figure out which rawhide kernel this behavior started in. (The above oops is not from the latest kernel, but I can get one from there if you like). I went back to kernel-2.6.25-0.40.rc1.git2.fc9.x86_64 and saw the behavior there as well. ;( Happy to try other kernels, debug booting, provide info, whatever.
2.6.25-rc7 has a bunch of kvm fixes.
ok. Tried with kernel-2.6.25-0.150.rc6.git7.fc9.x86_64 today. My 4 vcpu guest boots fine, but then under load (mockbuilding some packages), it just stops responding. I can virt-viewer into the console and move the mouse pointer in gdm, but nothing I do there has any other effect. Then, I did a fresh install of fedora 9 Beta with vcpus=1. This guest works fine. No lockups under load and no weird oopses or the like. So, it sounds to me like somehow the kvm smp code is having some issue. I am trying another test with a centos5 guest with 2 cpus to rule out some issue with the rawhide kernel as guest. I will also try 2.6.25-rc7 per comment #1. ;)
Interesting:
centos5 guest with vcpus=2: Fine.
rawhide guest with vcpus=2: weird oopses/instability and a spew of:
BUG: soft lockup - CPU#0 stuck for 61s! [configure:2742]
CPU 0:
Modules linked in: rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop parport_pc parport floppy pcspkr e1000 i2c_piix4 i2c_core button sg sr_mod cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 2742, comm: configure Not tainted 2.6.25-0.150.rc6.git7.fc9 #1
RIP: 0010:[<ffffffff8113e685>] [<ffffffff8113e685>] copy_page_c+0x5/0x10
RSP: 0000:ffff8100495abcf0 EFLAGS: 00010286
RAX: ffff810000000000 RBX: ffff8100495abd58 RCX: 0000000000000200
RDX: aaaaaaaaaaaaaaab RSI: ffff81005415e000 RDI: ffff8100700a3000
RBP: ffff810000000000 R08: 0000000059d4c000 R09: ffff810000000000
R10: 0000000000036ead R11: 0000000000000001 R12: 00003ffffffff000
R13: ffff81000000ac00 R14: ffff8100495aa000 R15: 0000000000000001
FS: 00002ae3d85d2f70(0000) GS:ffffffff8141a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8100700a3000 CR3: 000000004e938000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff8108e67a>] ? do_wp_page+0x35a/0x54b
 [<ffffffff8108ff90>] ? handle_mm_fault+0x685/0x703
 [<ffffffff812a6934>] ? do_page_fault+0x3f2/0x8b9
 [<ffffffff812a69ec>] ? do_page_fault+0x4aa/0x8b9
 [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8105370f>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8100c68f>] ? restore_args+0x0/0x30
 [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8105370f>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff812a48ad>] ? error_exit+0x0/0xa9
Will try rc7 as soon as it's done building in koji. ;)
On 2.6.25-0.161.rc7.fc9.x86_64 on the host: guest with vcpus=2: locks up and stops responding under load. Same guest with vcpus=1: works fine. So, something with >1 vcpu and a recent guest kernel (newer than centos5 at least, which wouldn't be hard) seems to cause the issues.
Well, just built the new kvm-65 here and updated to the current rawhide kernel: kernel-2.6.25-0.200.rc8.git3.fc9.x86_64. The problem seems solved... I did several compile loops with a 4 vcpu guest and it seems nice and stable. ;) I will keep pounding on it, but it seems it might be solved with this combo. We should probably look at upgrading kvm...
Oddly, looking again, I do see more oopses... but the machine doesn't lock up or get stuck processes anymore.
BUG: soft lockup - CPU#2 stuck for 61s! [df:3723]
CPU 2:
Modules linked in: bridge bnep rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop ppdev parport_pc parport button floppy i2c_piix4 e1000 i2c_core pcspkr sr_mod sg cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 3723, comm: df Not tainted 2.6.25-0.200.rc8.git3.fc9.x86_64 #1
RIP: 0010:[<ffffffff8108e06f>] [<ffffffff8108e06f>] __do_fault+0x31a/0x3f5
RSP: 0000:ffff810065d79cc8 EFLAGS: 00010246
RAX: aaaaaaaaaaaaaaab RBX: ffff810065d79d58 RCX: 0000000000000000
RDX: 000000007f6f6025 RSI: ffff8100010de600 RDI: ffff810021538048
RBP: ffffffff81586930 R08: ffffffff819b4088 R09: 0000000000000000
R10: ffffffff8108e013 R11: 0000000000000000 R12: ffff810065d48d80
R13: 0000003b2104b1ae R14: 000000017c01e258 R15: 0000000000000001
FS: 00007f06d30556f0(0000) GS:ffff81007fb3e578(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003b2104b1ae CR3: 00000000580a8000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff8108e013>] ? __do_fault+0x2be/0x3f5
 [<ffffffff8108fb57>] ? handle_mm_fault+0x340/0x703
 [<ffffffff812a86b6>] ? do_page_fault+0x3f2/0x8b9
 [<ffffffff812a876e>] ? do_page_fault+0x4aa/0x8b9
 [<ffffffff81051b6d>] ? lock_release_holdtime+0x1e/0x108
 [<ffffffff810120aa>] ? native_sched_clock+0x50/0x6d
 [<ffffffff8103e53b>] ? sys_rt_sigprocmask+0xab/0xd7
 [<ffffffff81051b6d>] ? lock_release_holdtime+0x1e/0x108
 [<ffffffff812a5eb1>] ? _spin_unlock_irq+0x2b/0x30
 [<ffffffff812a58ff>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8105362f>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff812a663d>] ? error_exit+0x0/0xa9
With the latest rawhide kernel (2.6.25-0.234.rc9.git1.fc9.x86_64) on the host, things seem quite a bit more stable. I was able to do a number of small mockbuilds with no problems... then I fired off an openoffice.org mockbuild and got:
BUG: soft lockup - CPU#2 stuck for 61s! [swapper:0]
CPU 2:
Modules linked in: bridge bnep rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop ppdev parport_pc floppy parport pcspkr e1000 i2c_piix4 i2c_core sg button sr_mod cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.25-0.218.rc8.git7.fc9.x86_64 #1
RIP: 0010:[<ffffffff8100b166>] [<ffffffff8100b166>] default_idle+0x39/0x5f
RSP: 0018:ffff81007fbc1e88 EFLAGS: 00000282
RAX: 00000ae256faaaef RBX: ffff81007fbc1e98 RCX: 0000000000002ebf
RDX: 00000ae256faaaef RSI: 000000000e31f4ef RDI: ffff81007fbc1e68
RBP: ffff81007fbc1e08 R08: 0000000000000000 R09: 000000000100b1f7
R10: 0000000000000001 R11: ffff81007fbb13a8 R12: 0000000000000000
R13: 00000000000f2202 R14: 0000000000000002 R15: ffff810001021f80
FS: 0000000000000000(0000) GS:ffff81007fb43280(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f24bf48b000 CR3: 0000000040f78000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff8100b161>] ? default_idle+0x34/0x5f
 [<ffffffff8100b12d>] ? default_idle+0x0/0x5f
 [<ffffffff8100b0e5>] ? cpu_idle+0xa0/0xe8
 [<ffffffff8128aab1>] ? start_secondary+0x3fc/0x40b
This looks like a different oops? Also, the guest is still responding ok, and has no stuck processes... Will keep testing it for a few days here.
I am still seeing the oopses from comment #7 (identical in all cases except which cpu), but the guest has been up for a number of days now without any stuck processes. I'm going to say this is solved now...
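For anyone who wants to check whether the remaining soft lockup reports really differ only in the cpu, here is a rough Python sketch that tallies them from a saved copy of the guest's kernel log. It is only an illustration: the function name tally_soft_lockups and the sample text are made up for this example, and the regex assumes the "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]" line format seen in the oopses above.

```python
import re
from collections import Counter

# Assumed line format, taken from the oopses in this report, e.g.:
#   BUG: soft lockup - CPU#2 stuck for 61s! [sh:2889]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def tally_soft_lockups(log_text):
    """Count soft lockup reports per CPU number and per command name."""
    per_cpu = Counter()
    per_comm = Counter()
    for match in SOFT_LOCKUP.finditer(log_text):
        per_cpu[int(match.group("cpu"))] += 1
        per_comm[match.group("comm")] += 1
    return per_cpu, per_comm


if __name__ == "__main__":
    # Hypothetical sample; in practice read the text from a saved log file.
    sample = "\n".join([
        "BUG: soft lockup - CPU#2 stuck for 61s! [sh:2889]",
        "BUG: soft lockup - CPU#0 stuck for 61s! [configure:2742]",
        "BUG: soft lockup - CPU#2 stuck for 61s! [df:3723]",
        "BUG: soft lockup - CPU#2 stuck for 61s! [swapper:0]",
    ])
    per_cpu, per_comm = tally_soft_lockups(sample)
    for cpu, count in sorted(per_cpu.items()):
        print(f"CPU#{cpu}: {count} report(s)")
    for comm, count in sorted(per_comm.items()):
        print(f"{comm}: {count} report(s)")
```

If the reports are spread across several cpus but always name the same command, that would fit the "identical except which cpu" observation; this only counts header lines and does not compare the call traces themselves.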