Bug 438617

Summary:	rawhide host kernel causes unstable kvm guests
Product:	[Fedora] Fedora	Reporter:	Kevin Fenzi <kevin>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED RAWHIDE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	low	Docs Contact:
Priority:	low
Version:	rawhide	CC:	avi, farrellj, katzj
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2008-04-25 19:11:44 UTC	Type:	---

Description Kevin Fenzi 2008-03-23 00:08:13 UTC

The host is a rawhide machine running a number of kvm guests (using libvirt).

When running the kernel-2.6.24.3-50.fc8.x86_64 kernel, everything works great. 

When running kernel-2.6.25-0.136.rc6.git5.fc9.x86_64, guests usually boot ok,
but then when put under load (mock builds or the like), they start spewing
oopses and get hung processes and then eventually stop responding to anything. 

An example oops: 
BUG: soft lockup - CPU#2 stuck for 61s! [sh:2889]
CPU 2:
Modules linked in: rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4
iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state
nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop parport_pc parport
floppy pcspkr e1000 button sg sr_mod cdrom dm_snapshot dm_zero dm_mirror dm_mod
ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd
ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 2889, comm: sh Not tainted 2.6.25-0.121.rc5.git4.fc9 #1
RIP: 0010:[<ffffffff8113bb95>]  [<ffffffff8113bb95>] copy_page_c+0x5/0x10
RSP: 0018:ffff81001c9e1c20  EFLAGS: 00010286
RAX: ffff810000000000 RBX: ffff81001c9e1c88 RCX: 0000000000000200
RDX: aaaaaaaaaaaaaaab RSI: ffff81001b567000 RDI: ffff81001b31b000
RBP: ffff810000000000 R08: 000000001c8ae000 R09: ffff810000000000
R10: 0000000000015468 R11: 0000000000000001 R12: 00003ffffffff000
R13: ffff8100000096c8 R14: ffff81001c9e0000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff81007fb5e578(0063) knlGS:00000000f7fb16c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: ffff81001b31b000 CR3: 0000000026198000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8108be96>] ? do_wp_page+0x35a/0x54b
 [<ffffffff8108d7ac>] ? handle_mm_fault+0x685/0x703
 [<ffffffff812a4414>] ? do_page_fault+0x3f2/0x8b9
 [<ffffffff812a44cc>] ? do_page_fault+0x4aa/0x8b9
 [<ffffffff81053aab>] ? debug_check_no_locks_freed+0x120/0x12f
 [<ffffffff81053967>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff8103202d>] ? __mmdrop+0x92/0x9b
 [<ffffffff810a347a>] ? check_object+0x159/0x209
 [<ffffffff810a4de3>] ? __slab_free+0x28b/0x2d1
 [<ffffffff812a238d>] ? error_exit+0x0/0xa9
 [<ffffffff8113c3d0>] ? __put_user_4+0x20/0x30
 [<ffffffff810316d9>] ? schedule_tail+0x57/0x5b
 [<ffffffff8100bf0c>] ? ret_from_fork+0xc/0x25
 [<ffffffff812a1656>] ? trace_hardirqs_on_thunk+0x35/0x3a

Note, I went back and tried to figure out which rawhide kernel this behavior
started in. (The above oops is not from the latest kernel, but I can get one
from there if you like). 

I went back to kernel-2.6.25-0.40.rc1.git2.fc9.x86_64 and saw the behavior there
as well. ;( 

Happy to try other kernels, debug booting, provide info, whatever.
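
For anyone triaging guest logs like the one above, here is a minimal Python sketch that pulls the soft-lockup header lines out of a saved console or dmesg capture. The function name and the sample text are illustrative assumptions, and the regular expression only covers the message format seen in this report.

```python
import re

# Matches lines such as: BUG: soft lockup - CPU#2 stuck for 61s! [sh:2889]
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft-lockup header found in log_text."""
    return [
        {
            "cpu": int(m.group("cpu")),
            "seconds": int(m.group("secs")),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
        }
        for m in SOFT_LOCKUP_RE.finditer(log_text)
    ]


# Sample line copied from the oops in this report.
sample = "BUG: soft lockup - CPU#2 stuck for 61s! [sh:2889]\nCPU 2:\n"
print(find_soft_lockups(sample))
```

Counting the results per CPU or per command is a quick way to see whether the lockups cluster on one vcpu or one workload.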

Comment 1 Chuck Ebbert 2008-03-26 21:17:53 UTC

2.6.25-rc7 has a bunch of kvm fixes.

Comment 2 Kevin Fenzi 2008-03-26 21:28:11 UTC

ok. Tried with kernel-2.6.25-0.150.rc6.git7.fc9.x86_64 today. 

My 4 vcpu guest boots fine, but then under load (mockbuilding some packages), it
just stops responding. I can virt-viewer into the console and move the mouse
pointer in gdm, but nothing I do there has any other effect. 

Then, I did a fresh install of fedora 9 Beta with vcpus=1. 
This guest works fine. No lockups under load and no weird oopses or the like. 

So, it sounds to me like somehow the kvm smp code is having some issue. 
I am trying another test with a centos5 guest with 2 cpus to rule out some issue
with the rawhide kernel as guest. 

I will also try 2.6.25-rc7 per comment #1. ;)

Comment 3 Kevin Fenzi 2008-03-26 22:06:08 UTC

Interesting: 

centos5 guest with vcpus=2: Fine. 
rawhide guest with vcpus=2: weird oopses/instability and a spew of: 

BUG: soft lockup - CPU#0 stuck for 61s! [configure:2742]
CPU 0:
Modules linked in: rfcomm l2cap bluetooth ipt_REJECT nf_conntrack_ipv4
iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state
nf_conntrack ip6table_filter ip6_tables x_tables ipv6 loop parport_pc parport
floppy pcspkr e1000 i2c_piix4 i2c_core button sg sr_mod cdrom dm_snapshot
dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod
ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 2742, comm: configure Not tainted 2.6.25-0.150.rc6.git7.fc9 #1
RIP: 0010:[<ffffffff8113e685>]  [<ffffffff8113e685>] copy_page_c+0x5/0x10
RSP: 0000:ffff8100495abcf0  EFLAGS: 00010286
RAX: ffff810000000000 RBX: ffff8100495abd58 RCX: 0000000000000200
RDX: aaaaaaaaaaaaaaab RSI: ffff81005415e000 RDI: ffff8100700a3000
RBP: ffff810000000000 R08: 0000000059d4c000 R09: ffff810000000000
R10: 0000000000036ead R11: 0000000000000001 R12: 00003ffffffff000
R13: ffff81000000ac00 R14: ffff8100495aa000 R15: 0000000000000001
FS:  00002ae3d85d2f70(0000) GS:ffffffff8141a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8100700a3000 CR3: 000000004e938000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8108e67a>] ? do_wp_page+0x35a/0x54b
 [<ffffffff8108ff90>] ? handle_mm_fault+0x685/0x703
 [<ffffffff812a6934>] ? do_page_fault+0x3f2/0x8b9
 [<ffffffff812a69ec>] ? do_page_fault+0x4aa/0x8b9
 [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8105370f>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8100c68f>] ? restore_args+0x0/0x30
 [<ffffffff812a3b6f>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8105370f>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff812a48ad>] ? error_exit+0x0/0xa9

Will try rc7 as soon as it's done building in koji. ;)

Comment 4 Kevin Fenzi 2008-03-26 23:54:54 UTC

On 2.6.25-0.161.rc7.fc9.x86_64 on the host: 

guest with vcpus=2: locks up and stops responding under load
same guest with vcpus=1, works fine. 

So, the combination of more than one vcpu and a recent guest kernel (newer than
centos5's at least, which wouldn't be hard) seems to cause the issues.
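
Switching a guest between vcpus=1 and vcpus=2 comes down to changing the `<vcpu>` element in its libvirt domain definition. Below is a minimal Python sketch of that edit on an XML string. The helper name and the trimmed sample definition are hypothetical; a real definition would typically come from `virsh dumpxml` and contains many more elements.

```python
import xml.etree.ElementTree as ET


def set_vcpus(domain_xml, count):
    """Return domain_xml with its <vcpu> element set to count."""
    if count < 1:
        raise ValueError("vcpu count must be at least 1")
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    if vcpu is None:
        # No <vcpu> element yet: add one directly under <domain>.
        vcpu = ET.SubElement(root, "vcpu")
    vcpu.text = str(count)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed domain definition for illustration only.
sample_domain = (
    "<domain type='kvm'><name>rawhide-test</name><vcpu>2</vcpu></domain>"
)
print(set_vcpus(sample_domain, 1))
```

The edited XML would then need to be loaded back into libvirt (for example with `virsh define`) and the guest restarted before the new vcpu count takes effect.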

Comment 5 Kevin Fenzi 2008-04-07 03:34:49 UTC

Well, just built the new kvm-65 here and updated to the current rawhide kernel: 
kernel-2.6.25-0.200.rc8.git3.fc9.x86_64

The problem seems solved... did several compile loops with a 4 vcpu guest and it
seems nice and stable. ;) 

I will keep pounding on it, but it seems it might be solved with this combo. 
We should probably look at upgrading kvm...

Comment 6 Kevin Fenzi 2008-04-07 03:37:08 UTC

Oddly, looking again, I do see more oopses... but the machine doesn't lock up or
get stuck processes anymore. 

BUG: soft lockup - CPU#2 stuck for 61s! [df:3723]
CPU 2:
Modules linked in: bridge bnep rfcomm l2cap bluetooth ipt_REJECT
nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp
nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6
loop ppdev parport_pc parport button floppy i2c_piix4 e1000 i2c_core pcspkr
sr_mod sg cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi
ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
[last unloaded: freq_table]
Pid: 3723, comm: df Not tainted 2.6.25-0.200.rc8.git3.fc9.x86_64 #1
RIP: 0010:[<ffffffff8108e06f>]  [<ffffffff8108e06f>] __do_fault+0x31a/0x3f5
RSP: 0000:ffff810065d79cc8  EFLAGS: 00010246
RAX: aaaaaaaaaaaaaaab RBX: ffff810065d79d58 RCX: 0000000000000000
RDX: 000000007f6f6025 RSI: ffff8100010de600 RDI: ffff810021538048
RBP: ffffffff81586930 R08: ffffffff819b4088 R09: 0000000000000000
R10: ffffffff8108e013 R11: 0000000000000000 R12: ffff810065d48d80
R13: 0000003b2104b1ae R14: 000000017c01e258 R15: 0000000000000001
FS:  00007f06d30556f0(0000) GS:ffff81007fb3e578(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003b2104b1ae CR3: 00000000580a8000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8108e013>] ? __do_fault+0x2be/0x3f5
 [<ffffffff8108fb57>] ? handle_mm_fault+0x340/0x703
 [<ffffffff812a86b6>] ? do_page_fault+0x3f2/0x8b9
 [<ffffffff812a876e>] ? do_page_fault+0x4aa/0x8b9
 [<ffffffff81051b6d>] ? lock_release_holdtime+0x1e/0x108
 [<ffffffff810120aa>] ? native_sched_clock+0x50/0x6d
 [<ffffffff8103e53b>] ? sys_rt_sigprocmask+0xab/0xd7
 [<ffffffff81051b6d>] ? lock_release_holdtime+0x1e/0x108
 [<ffffffff812a5eb1>] ? _spin_unlock_irq+0x2b/0x30
 [<ffffffff812a58ff>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8105362f>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff812a663d>] ? error_exit+0x0/0xa9

Comment 7 Kevin Fenzi 2008-04-16 15:51:47 UTC

With the latest rawhide kernel ( 2.6.25-0.234.rc9.git1.fc9.x86_64 ) on the host,
things seem quite a bit more stable. I was able to do a number of small
mockbuilds with no problems... then I fired off an openoffice.org mockbuild and got: 

BUG: soft lockup - CPU#2 stuck for 61s! [swapper:0]
CPU 2:
Modules linked in: bridge bnep rfcomm l2cap bluetooth ipt_REJECT
nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp
nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6
loop ppdev parport_pc floppy parport pcspkr e1000 i2c_piix4 i2c_core sg button
sr_mod cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic
libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last
unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.25-0.218.rc8.git7.fc9.x86_64 #1
RIP: 0010:[<ffffffff8100b166>]  [<ffffffff8100b166>] default_idle+0x39/0x5f
RSP: 0018:ffff81007fbc1e88  EFLAGS: 00000282
RAX: 00000ae256faaaef RBX: ffff81007fbc1e98 RCX: 0000000000002ebf
RDX: 00000ae256faaaef RSI: 000000000e31f4ef RDI: ffff81007fbc1e68
RBP: ffff81007fbc1e08 R08: 0000000000000000 R09: 000000000100b1f7
R10: 0000000000000001 R11: ffff81007fbb13a8 R12: 0000000000000000
R13: 00000000000f2202 R14: 0000000000000002 R15: ffff810001021f80
FS:  0000000000000000(0000) GS:ffff81007fb43280(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f24bf48b000 CR3: 0000000040f78000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8100b161>] ? default_idle+0x34/0x5f
 [<ffffffff8100b12d>] ? default_idle+0x0/0x5f
 [<ffffffff8100b0e5>] ? cpu_idle+0xa0/0xe8
 [<ffffffff8128aab1>] ? start_secondary+0x3fc/0x40b

This looks like a different oops?
Also, the guest is still responding ok, and has no stuck processes... 
Will keep testing it for a few days here.
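
One rough way to check whether two reports are "the same oops" is to compare the symbol at RIP. In this bug the earlier reports stop in copy_page_c, the one in comment 6 in __do_fault, and the one above in default_idle. Here is a minimal Python sketch that tallies RIP symbols from a saved log. The function name is an assumption, and the pattern only covers the RIP line format shown in these traces.

```python
import re
from collections import Counter

# Matches the symbol in lines such as:
# RIP: 0010:[<ffffffff8113bb95>]  [<ffffffff8113bb95>] copy_page_c+0x5/0x10
RIP_RE = re.compile(
    r"^RIP: .*\] (?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+",
    re.MULTILINE,
)


def count_rip_symbols(log_text):
    """Count how often each function appears at RIP in log_text."""
    return Counter(m.group("symbol") for m in RIP_RE.finditer(log_text))


# RIP lines copied from the traces in this report.
sample_log = (
    "RIP: 0010:[<ffffffff8113bb95>]  [<ffffffff8113bb95>] copy_page_c+0x5/0x10\n"
    "RIP: 0010:[<ffffffff8113e685>]  [<ffffffff8113e685>] copy_page_c+0x5/0x10\n"
    "RIP: 0010:[<ffffffff8100b166>]  [<ffffffff8100b166>] default_idle+0x39/0x5f\n"
)
for symbol, hits in count_rip_symbols(sample_log).most_common():
    print(symbol, hits)
```

A matching RIP symbol does not prove two lockups share a cause, but a different one (as here, an idle CPU versus a page copy) is a reasonable hint that the failure mode has changed.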

Comment 8 Kevin Fenzi 2008-04-25 19:11:44 UTC

I am still seeing the oopses from comment #7 (identical in all cases except
which cpu), but the guest has been up for a number of days now without any stuck
processes. 

I'm going to say this is solved now...