Created attachment 367545 [details]
console-dump of crash

Description of problem:
Kernel freezes with KVM VMs running [usually during startup].

Version-Release number of selected component (if applicable):
kernel-2.6.30.9-90.fc11.x86_64
qemu-0.10.6-9.fc11.x86_64
virt-manager-0.7.0-7.fc11.x86_64

How reproducible:
Happens pretty frequently.

Steps to Reproduce:
1. Install F11 with latest patches
2. Install a bunch of VMs [kvm]: windows, ubuntu, freebsd, opensolaris
3. Set the VMs to start at reboot

Actual results:
Kernel panic? with blinking numlock/caps lock at various times.
- sometimes when the VMs are booting
- sometimes when a VM is restarted

Expected results:
No crash.

Additional info:
- Ran memtest86 overnight and no errors here.
- I suspect it happens with multiple VMs running, esp. with OpenSolaris in the mix.
- Some of these VMs are carried over from F10, and during this migration the OpenSolaris VM never worked. Recently [perhaps a couple of months back] I reinstalled the OpenSolaris VM. Until then, the windows VM was the primary active VM. Since then I have been attempting to run all the VMs [windows, freebsd, ubuntu, opensolaris] simultaneously, and have seen constant crashes.
- So the crashes have been constant during the past few kernel updates, and qemu and virt-manager updates.

A picture of one of the kernel dumps is attached.
Created attachment 367547 [details] lspci; cat /proc/meminfo; cat /proc/cpuinfo; dmesg
We really need to see the beginning of that oops report.
I'm not sure how to get the complete stack trace. All I can do is take pics of the console output.

I have the following update since my previous report:

I've been running a single [Windows] VM since then, and it was stable. So the issue is with multiple VMs; usually it's triggered during the boot sequence of one of them.

I've upgraded to F12 now, and the crashes persist. I have the oops from 2 different crashes.

Kernel: 2.6.31.6-145.fc12.x86_64

- First one: when all 4 VMs are booted together at startup [of host F12].
- Second one: with 3 VMs booted together at startup [without OpenSolaris]. Here the initial boot went fine, but on rebooting one of the VMs [ubuntu] a panic was triggered. For this panic I have the stack trace from the beginning.

I also get the following output on a ssh terminal connection [to the F12 host] from a different machine.

>>>>>>>>>>>>>>>>
[root@maverick ~]#
Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel:general protection fault: 0000 [#1] SMP

Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel:last sysfs file: /sys/kernel/mm/ksm/run

Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel:Stack:

Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel:Call Trace:

Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel: <IRQ>

Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel: <EOI>

Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel:Code: 38 0f b6 d2 48 01 d0 74 30 48 8b 58 28 eb 13 48 89 df e8 03 f9 ff ff 48 89 df e8 cf f8 ff ff 4c 89 e3 48 85 db 74 12 48 8d 7b 78 <4c> 8b 23 e8 fa ee cb ff 85 c0 74 e8 eb d6 5b 41 5c c9 c3 55 48

Message from syslogd@maverick at Dec 2 14:20:01 ...
 kernel:Kernel panic - not syncing: Fatal exception in interrupt

asterix:/home/balay>
Created attachment 375591 [details] crash with 4 VMs started together
Created attachment 375592 [details] crash with 3 VMs started together, and then one of the VMs was rebooted
Ok - I've disabled ipv6 on this machine [because the stack trace has references to it] - and now the VMs are a lot more stable. I've tried a few things, and the F12 host hasn't crashed yet. I'll see if this stays stable [with all 4 VMs running concurrently].

BTW, I should have mentioned: I use bridge networking for the VMs [and it is also listed in the stack trace]. So perhaps the combination of bridge networking with ipv6 is the trigger for the crash.
Am seeing something fairly similar, reported in https://bugzilla.redhat.com/show_bug.cgi?id=545851
It seems I can confirm the ipv6 part of the anecdote. This is on a friend's AMD box, which doesn't have a VT-d knob in the BIOS. I can't test it myself, but the issue is quite reproducible there.
Thanks to a lead from the Fedora Forums I found this bug report that mirrors recent problems I have seen on both Fedora 11 and Fedora 12 64-bit KVM systems that are using bridge networks.

If Autostart is enabled for at least one VM, the systems are hard locking at reboot. However, if ipv6 is disabled, the host boots normally and the VMs autostart up as normal.

I have disabled ipv6 by editing /etc/modprobe.d/blacklist and adding the line:

install ipv6 /bin/true

If I remove all autostart options and re-enable ipv6, the KVM host starts fine, and the VMs can be manually started without any problems.

Hence, it appears there is a conflict (possibly just for systems using bridge networks) when ipv6 is enabled and VMs are configured to autostart.
Just an update: [after disabling ipv6]

The machine now has been stable for the past 2 weeks [even with some reboots of the guest OSes].

[root@maverick ~]# uname -srv
Linux 2.6.31.6-145.fc12.x86_64 #1 SMP Sat Nov 21 15:57:45 EST 2009
[root@maverick ~]# uptime
 12:09:12 up 13 days, 19:12, 1 user, load average: 0.31, 0.29, 0.21
I have had a similar experience, although I could get the kernel to crash even when KVM VM machines were not running (and libvirtd was disabled at startup). Originally thought it was issue with Spanning Tree, but disabling had no effect.

The issue for me involved Windows 7 clients with TCP/IPv6 active on their NIC profile. As soon as the NIC initialised (or reset for that matter), the Fedora 12 server would freeze. Disabling TCP/IPv6 driver in Win7 clients resolved issue. WinXP clients no problem, as TCP/IPv6 not installed.

For safety, turned off any explicit TCP/IPv6 settings on Fedora too, although even with IPV6INIT=no, Fedora still assigned auto IP6 address.

Network config:
HP ProCurve managed switch, with two ports configured in LACP (802.3ad) mode, no VLAN.

Fedora server config:
eth0+eth1 -> bond0 -> br0

/etc/sysconfig/network-scripts/ifcfg-br0:
...
IPV6INIT=no
IPV6_AUTOCONF=no
DHCPV6=no
...

ncftool> dumpxml br0
<?xml version="1.0"?>
<interface type="bridge" name="br0">
  <start mode="onboot"/>
  <protocol family="ipv4">
    <ip address="10.16.182.254" prefix="24"/>
    <route gateway="10.16.182.1"/>
  </protocol>
  <bridge stp="on">
    <interface type="bond" name="bond0">
      <bond mode="802.3ad">
        <miimon freq="100" updelay="100" carrier="ioctl"/>
        <interface type="ethernet" name="eth0">
          <mac address="00:23:7D:FB:FE:35"/>
        </interface>
        <interface type="ethernet" name="eth1">
          <mac address="00:23:7D:A8:EE:CC"/>
        </interface>
      </bond>
    </interface>
  </bridge>
</interface>

ncftool> dumpxml --live br0
<?xml version="1.0"?>
<interface name="br0" type="bridge">
  <bridge>
    <interface name="bond0" type="bond">
      <bond>
        <interface name="eth0" type="ethernet">
          <mac address="00:23:7d:fb:fe:35"/>
        </interface>
        <interface name="eth1" type="ethernet">
          <mac address="00:23:7d:fb:fe:35"/>
        </interface>
      </bond>
    </interface>
    <interface name="vnet0" type="ethernet">
      <mac address="e2:30:33:b3:84:78"/>
    </interface>
  </bridge>
  <protocol family="ipv4">
    <ip address="10.16.182.254" prefix="24"/>
  </protocol>
  <protocol family="ipv6">
    <ip address="fe80::223:7dff:fefb:fe35" prefix="64"/>
  </protocol>
</interface>
I disable IPV6 on the F12 box by doing the following:

- edit /etc/sysconfig/network and add the line
    NETWORKING_IPV6=no
- create a file /etc/modprobe.d/disable-ipv6.conf with the line
    install ipv6 /bin/true

(In reply to comment #11)
> For safety, turned off any explicit TCP/IPv6 settings on Fedora too, although
> even with IPV6INIT=no, Fedora still assigned auto IP6 address.
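
Since the quoted comment notes that an IPv6 address can still show up after the ifcfg settings are changed, it may help to confirm after a reboot that the IPv6 stack really is absent. The following is a minimal sketch, not part of the original report. It assumes that /proc/net/if_inet6 is only present while the IPv6 stack is loaded; the function names `path_exists`, `ipv6_stack_loaded` and `report_ipv6` are hypothetical helpers made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if `path` exists, 0 otherwise. */
int path_exists(const char *path)
{
    assert(path != NULL);
    return access(path, F_OK) == 0;
}

/*
 * Assumption: /proc/net/if_inet6 is created by the IPv6 stack, so its
 * absence suggests the "install ipv6 /bin/true" workaround took effect.
 */
int ipv6_stack_loaded(void)
{
    return path_exists("/proc/net/if_inet6");
}

/* Prints a one-line summary of the check. */
void report_ipv6(void)
{
    if (ipv6_stack_loaded())
        printf("IPv6 stack appears to be loaded\n");
    else
        printf("IPv6 stack does not appear to be loaded\n");
}
```

Calling `report_ipv6()` from a small `main()` on the host after a reboot gives a quick indication of whether the workaround is in effect.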
I spent the whole weekend learning the netfilter code and SLUB debugging to find this problem: http://lkml.org/lkml/2010/2/2/272 There should be a patch soon.
*** Bug 545851 has been marked as a duplicate of this bug. ***
Confirmed that the following hack prevents the issue (real fix is being worked on):

void nf_conntrack_destroy(struct nf_conntrack *nfct)
{
        void (*destroy)(struct nf_conntrack *);

        if ((struct nf_conn *)nfct == &nf_conntrack_untracked) {
                printk("JCM: nf_conntrack_destroy: trying to destroy nf_conntrack_untracked! CONTINUING...\n");
                //panic("JCM: nf_conntrack_destroy: trying to destroy nf_conntrack_untracked!\n");
                return; /* refuse to free nf_conntrack_untracked */
        }

        rcu_read_lock();
        destroy = rcu_dereference(nf_ct_destroy);
        BUG_ON(destroy == NULL);
        destroy(nfct);
        rcu_read_unlock();
}
EXPORT_SYMBOL(nf_conntrack_destroy);

The issue is that with multiple namespaces, we wind up decreasing the use count on the untracked static ct to zero and trying to free it, which is bad. Patrick should have a fix tomorrow using per-namespace untracked ct's.

Jon.
This hack is harmless, but in an ideal world we wouldn't try freeing the untracked ct in the first place.
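
To illustrate the failure mode described in the previous comment, here is a small user-space sketch. It is not kernel code and not the actual fix: `struct ct`, `ct_alloc`, `ct_get`, `ct_put`, `ct_destroy` and `refused_frees` are hypothetical stand-ins invented for this example. It only models the idea that a statically allocated, shared entry must never reach the free path when its use count drops to zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a conntrack entry: only a use count. */
struct ct {
    int use;
};

/* One statically allocated "untracked" entry, shared by all users. */
struct ct untracked = { .use = 1 };

/* Counts how often the guard refused to free the static entry. */
int refused_frees;

/* Allocates a normal, heap-backed entry holding one reference. */
struct ct *ct_alloc(void)
{
    struct ct *ct = malloc(sizeof(*ct));
    if (ct != NULL)
        ct->use = 1;
    return ct;
}

void ct_get(struct ct *ct)
{
    assert(ct != NULL);
    ct->use++;
}

/* Frees the entry, unless it is the static one. Returns true if freed. */
bool ct_destroy(struct ct *ct)
{
    if (ct == &untracked) {
        refused_frees++;   /* mirrors the "refuse to free" hack */
        return false;
    }
    free(ct);
    return true;
}

/* Drops one reference; destroys the entry when the count hits zero. */
bool ct_put(struct ct *ct)
{
    assert(ct != NULL);
    if (--ct->use == 0)
        return ct_destroy(ct);
    return false;
}
```

In this model an unbalanced `ct_put(&untracked)` drives the count to zero and reaches `ct_destroy()`. Without the pointer check, `free()` would be called on a static object, which is the user-space analogue of the crash. The guard only hides the imbalance, which is why giving each namespace its own untracked entry is the cleaner direction mentioned in the thread.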
*** Bug 521362 has been marked as a duplicate of this bug. ***
I see Kyle is already making a test kernel with this.
Yeah, builds are in progress on all the targets I think.
*** Bug 520108 has been marked as a duplicate of this bug. ***
This has been fixed and confirmed.
Is this bug fixed only in rawhide?
No, it's been committed to F-11 and F-12 too.
kernel-2.6.31.12-174.2.17.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/kernel-2.6.31.12-174.2.17.fc12
kernel-2.6.31.12-174.2.19.fc12 has been pushed to the Fedora 12 stable repository. If problems still persist, please make note of it in this bug report.
*** Bug 681917 has been marked as a duplicate of this bug. ***