Bug 438617
| Field | Value |
|---|---|
| Summary | rawhide host kernel causes unstable kvm guests |
| Product | [Fedora] Fedora |
| Component | kernel |
| Status | CLOSED RAWHIDE |
| Severity | low |
| Priority | low |
| Version | rawhide |
| Hardware | All |
| OS | Linux |
| Reporter | Kevin Fenzi <kevin> |
| Assignee | Kernel Maintainer List <kernel-maint> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | avi, farrellj, katzj |
| Doc Type | Bug Fix |
| Last Closed | 2008-04-25 19:11:44 UTC |
Description
Kevin Fenzi
2008-03-23 00:08:13 UTC
2.6.25-rc7 has a bunch of kvm fixes.

OK. Tried with kernel-2.6.25-0.150.rc6.git7.fc9.x86_64 today. My 4 vcpu guest boots fine, but then under load (mockbuilding some packages), it just stops responding. I can virt-viewer into the console and move the mouse pointer in gdm, but nothing I do there has any other effect.

Then, I did a fresh install of Fedora 9 Beta with vcpus=1. This guest works fine. No lockups under load and no weird oopses or the like.

So, it sounds to me like somehow the kvm smp code is having some issue. I am trying another test with a centos5 guest with 2 cpus to rule out some issue with the rawhide kernel as guest. I will also try 2.6.25-rc7 per comment #1. ;)

Interesting:

centos5 guest with vcpus=2: Fine.
rawhide guest with vcpus=2: weird oopses/instability and a spew of:

    BUG: soft lockup - CPU#0 stuck for 61s! [configure:2742]
    CPU 0:
    Modules linked in: rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop parport_pc parport floppy pcspkr e1000 i2c_piix4 i2c_core button sg sr_mod cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
    Pid: 2742, comm: configure Not tainted 2.6.25-0.150.rc6.git7.fc9 #1
    RIP: 0010:[<ffffffff8113e685>] [<ffffffff8113e685>] copy_page_c+0x5/0x10
    RSP: 0000:ffff8100495abcf0 EFLAGS: 00010286
    RAX: ffff810000000000 RBX: ffff8100495abd58 RCX: 0000000000000200
    RDX: aaaaaaaaaaaaaaab RSI: ffff81005415e000 RDI: ffff8100700a3000
    RBP: ffff810000000000 R08: 0000000059d4c000 R09: ffff810000000000
    R10: 0000000000036ead R11: 0000000000000001 R12: 00003ffffffff000
    R13: ffff81000000ac00 R14: ffff8100495aa000 R15: 0000000000000001
    FS: 00002ae3d85d2f70(0000) GS:ffffffff8141a000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff8100700a3000 CR3: 000000004e938000 CR4: 00000000000006a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    [<ffffffff8108e67a>] ? do_wp_page+0x35a/0x54b
    [<ffffffff8108ff90>] ? handle_mm_fault+0x685/0x703
    [<ffffffff812a6934>] ? do_page_fault+0x3f2/0x8b9
    [<ffffffff812a69ec>] ? do_page_fault+0x4aa/0x8b9
    [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
    [<ffffffff8105370f>] ? trace_hardirqs_on+0xf1/0x115
    [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
    [<ffffffff8100c68f>] ? restore_args+0x0/0x30
    [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
    [<ffffffff8105370f>] ? trace_hardirqs_on+0xf1/0x115
    [<ffffffff812a48ad>] ? error_exit+0x0/0xa9

Will try rc7 as soon as it's done building in koji. ;)

On 2.6.25-0.161.rc7.fc9.x86_64 on the host:

guest with vcpus=2: locks up and stops responding under load.
same guest with vcpus=1: works fine.

So, something with >1 vcpu and a recent guest kernel (newer than centos5 at least, which wouldn't be hard) seems to cause the issues.

Well, just built the new kvm-65 here and updated to the current rawhide kernel: kernel-2.6.25-0.200.rc8.git3.fc9.x86_64

The problem seems solved... did several compile loops with a 4 vcpu guest and it seems nice and stable. ;) I will keep pounding on it, but it seems it might be solved with this combo. We should probably look at upgrading kvm...

Oddly, looking again, I do see more oopses... but the machine doesn't lock up or get stuck processes anymore.

    BUG: soft lockup - CPU#2 stuck for 61s! [df:3723]
    CPU 2:
    Modules linked in: bridge bnep rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop ppdev parport_pc parport button floppy i2c_piix4 e1000 i2c_core pcspkr sr_mod sg cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
    Pid: 3723, comm: df Not tainted 2.6.25-0.200.rc8.git3.fc9.x86_64 #1
    RIP: 0010:[<ffffffff8108e06f>] [<ffffffff8108e06f>] __do_fault+0x31a/0x3f5
    RSP: 0000:ffff810065d79cc8 EFLAGS: 00010246
    RAX: aaaaaaaaaaaaaaab RBX: ffff810065d79d58 RCX: 0000000000000000
    RDX: 000000007f6f6025 RSI: ffff8100010de600 RDI: ffff810021538048
    RBP: ffffffff81586930 R08: ffffffff819b4088 R09: 0000000000000000
    R10: ffffffff8108e013 R11: 0000000000000000 R12: ffff810065d48d80
    R13: 0000003b2104b1ae R14: 000000017c01e258 R15: 0000000000000001
    FS: 00007f06d30556f0(0000) GS:ffff81007fb3e578(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000003b2104b1ae CR3: 00000000580a8000 CR4: 00000000000006a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    [<ffffffff8108e013>] ? __do_fault+0x2be/0x3f5
    [<ffffffff8108fb57>] ? handle_mm_fault+0x340/0x703
    [<ffffffff812a86b6>] ? do_page_fault+0x3f2/0x8b9
    [<ffffffff812a876e>] ? do_page_fault+0x4aa/0x8b9
    [<ffffffff81051b6d>] ? lock_release_holdtime+0x1e/0x108
    [<ffffffff810120aa>] ? native_sched_clock+0x50/0x6d
    [<ffffffff8103e53b>] ? sys_rt_sigprocmask+0xab/0xd7
    [<ffffffff81051b6d>] ? lock_release_holdtime+0x1e/0x108
    [<ffffffff812a5eb1>] ? _spin_unlock_irq+0x2b/0x30
    [<ffffffff812a58ff>] ? trace_hardirqs_on_thunk+0x35/0x3a
    [<ffffffff8105362f>] ? trace_hardirqs_on+0xf1/0x115
    [<ffffffff812a663d>] ? error_exit+0x0/0xa9

With the latest rawhide kernel (2.6.25-0.234.rc9.git1.fc9.x86_64) on the host, things seem quite a bit more stable. I was able to do a number of small mockbuilds with no problems... then I fired off an openoffice.org mockbuild and got:

    BUG: soft lockup - CPU#2 stuck for 61s! [swapper:0]
    CPU 2:
    Modules linked in: bridge bnep rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop ppdev parport_pc floppy parport pcspkr e1000 i2c_piix4 i2c_core sg button sr_mod cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
    Pid: 0, comm: swapper Not tainted 2.6.25-0.218.rc8.git7.fc9.x86_64 #1
    RIP: 0010:[<ffffffff8100b166>] [<ffffffff8100b166>] default_idle+0x39/0x5f
    RSP: 0018:ffff81007fbc1e88 EFLAGS: 00000282
    RAX: 00000ae256faaaef RBX: ffff81007fbc1e98 RCX: 0000000000002ebf
    RDX: 00000ae256faaaef RSI: 000000000e31f4ef RDI: ffff81007fbc1e68
    RBP: ffff81007fbc1e08 R08: 0000000000000000 R09: 000000000100b1f7
    R10: 0000000000000001 R11: ffff81007fbb13a8 R12: 0000000000000000
    R13: 00000000000f2202 R14: 0000000000000002 R15: ffff810001021f80
    FS: 0000000000000000(0000) GS:ffff81007fb43280(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 00007f24bf48b000 CR3: 0000000040f78000 CR4: 00000000000006a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    [<ffffffff8100b161>] ? default_idle+0x34/0x5f
    [<ffffffff8100b12d>] ? default_idle+0x0/0x5f
    [<ffffffff8100b0e5>] ? cpu_idle+0xa0/0xe8
    [<ffffffff8128aab1>] ? start_secondary+0x3fc/0x40b

This looks like a different oops? Also, the guest is still responding OK, and has no stuck processes... Will keep testing it for a few days here.

I am still seeing the oopses from comment #7 (identical in all cases except which cpu), but the guest has been up for a number of days now without any stuck processes.

I'm going to say this is solved now...
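
To compare soft lockup reports like the ones above across kernels (which CPU, which task, which function the RIP points at), something like the following could help. It is only a minimal sketch: it assumes the guest's kernel log has been saved with one message per line, and the function name `summarize_soft_lockups` and both regular expressions are illustrative, written against the message format seen in this bug rather than taken from any existing tool.

```python
import re

# Matches the watchdog line format seen in this report, e.g.
# "BUG: soft lockup - CPU#2 stuck for 61s! [swapper:0]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Matches the symbol at the end of a RIP line, e.g. "copy_page_c+0x5/0x10".
# Assumption: the symbol follows the last bracketed address on the line.
RIP_SYMBOL = re.compile(
    r"RIP: .*\]\s+(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_soft_lockups(log_text):
    """Return one dict per soft lockup report found in log_text.

    Each dict has the CPU number, the stuck duration in seconds, the task
    name and pid, and the RIP symbol of that report (None if no RIP line
    follows before the next report).
    """
    reports = []
    for line in log_text.splitlines():
        lockup = SOFT_LOCKUP.search(line)
        if lockup:
            reports.append({
                "cpu": int(lockup.group("cpu")),
                "secs": int(lockup.group("secs")),
                "comm": lockup.group("comm"),
                "pid": int(lockup.group("pid")),
                "rip": None,
            })
            continue
        rip = RIP_SYMBOL.search(line)
        # Attach only the first RIP line seen after a lockup header.
        if rip and reports and reports[-1]["rip"] is None:
            reports[-1]["rip"] = rip.group("sym")
    return reports
```

Calling `summarize_soft_lockups(open("guest-dmesg.txt").read())` (the file name is hypothetical) would return a list that can be grouped by `rip` to check whether the reports really differ only in the CPU number.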