Bug 533087 - kernel crash with kvm virtual machines
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 12
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Jon Masters
QA Contact: Fedora Extras Quality Assurance
Duplicates: 520108 521362 545851 681917
Reported: 2009-11-04 22:50 UTC by Satish Balay
Modified: 2011-11-04 14:53 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Last Closed: 2010-02-08 14:02:51 UTC


Attachments
console-dump of crash (653.01 KB, image/jpeg)
2009-11-04 22:50 UTC, Satish Balay
lspci; cat /proc/meminfo; cat /proc/cpuinfo; dmesg (41.72 KB, application/octet-stream)
2009-11-04 22:56 UTC, Satish Balay
crash with 4 VMs started together (3.10 MB, image/jpeg)
2009-12-02 20:50 UTC, Satish Balay
crash with 3 VMs started together, and then one of the VMs was rebooted (3.05 MB, image/jpeg)
2009-12-02 20:51 UTC, Satish Balay

Description Satish Balay 2009-11-04 22:50:08 UTC
Created attachment 367545
console-dump of crash

Description of problem:

The host kernel freezes while KVM VMs are running [usually during VM startup]

Version-Release number of selected component (if applicable):

kernel-2.6.30.9-90.fc11.x86_64
qemu-0.10.6-9.fc11.x86_64
virt-manager-0.7.0-7.fc11.x86_64

How reproducible:

Happens pretty frequently

Steps to Reproduce:
1. Install F11 with latest patches
2. Install a bunch of VMs [kvm] windows, ubuntu, freebsd, opensolaris
3. set the VMs to start at reboot
  
Actual results:

Kernel panic (?) with blinking Num Lock/Caps Lock LEDs, at various times:

- sometimes when the VMs are booting
- sometimes when a VM is restarted

Expected results:

no crash

Additional info:

- Ran memtest86 overnight and no errors here.
- I suspect it happens with multiple VMs running - esp with OpenSolaris in the mix.
- Some of these VMs were carried over from F10, and after that migration the OpenSolaris VM never worked. I reinstalled the OpenSolaris VM recently [perhaps a couple of months back]. Until then, the Windows VM was the primary active VM. Since then I have been attempting to run all the VMs [Windows, FreeBSD, Ubuntu, OpenSolaris] simultaneously, and have seen constant crashes.
- So the crashes have been constant across the past few kernel, qemu, and virt-manager updates.

a picture of one of the kernel dumps is attached

Comment 1 Satish Balay 2009-11-04 22:56:01 UTC
Created attachment 367547
 lspci; cat /proc/meminfo; cat /proc/cpuinfo; dmesg

Comment 2 Chuck Ebbert 2009-11-17 20:22:22 UTC
We really need to see the beginning of that oops report.

Comment 3 Satish Balay 2009-12-02 20:48:59 UTC
I'm not sure how to get the complete stack trace. All I can do is take pics of the console output.

I have the following update since my previous report:

I've been running a single [Windows] VM since then, and it was stable. So the issue is with multiple VMs; usually it's triggered during the boot sequence of one of them.

I've upgraded to F12 now - and the crashes persist. I have the oops from 2 different crashes.

Kernel: 2.6.31.6-145.fc12.x86_64

- First one when all 4 VMs are booted together at startup [of host F12]

- Second one - with 3 VMs booted  together at startup [without OpenSolaris]. Here the initial boot went fine. But on rebooting one of the VMs [ubuntu] a panic was triggered.

For this panic I have the stack trace from the beginning. I also get the following output on an ssh terminal connection [to the F12 host] from a different machine.

>>>>>>>>>>>>>>>>
[root@maverick ~]# 
Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel:general protection fault: 0000 [#1] SMP 

Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel:last sysfs file: /sys/kernel/mm/ksm/run

Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel:Stack:

Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel:Call Trace:

Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel: <IRQ> 

Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel: <EOI> 

Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel:Code: 38 0f b6 d2 48 01 d0 74 30 48 8b 58 28 eb 13 48 89 df e8 03 f9 ff ff 48 89 df e8 cf f8 ff ff 4c 89 e3 48 85 db 74 12 48 8d 7b 78 <4c> 8b 23 e8 fa ee cb ff 85 c0 74 e8 eb d6 5b 41 5c c9 c3 55 48 

Message from syslogd@maverick at Dec  2 14:20:01 ...
 kernel:Kernel panic - not syncing: Fatal exception in interrupt
asterix:/home/balay>

Comment 4 Satish Balay 2009-12-02 20:50:15 UTC
Created attachment 375591
crash with 4 VMs started together

Comment 5 Satish Balay 2009-12-02 20:51:34 UTC
Created attachment 375592
crash with 3 VMs started together, and then one of the VMs was rebooted

Comment 6 Satish Balay 2009-12-02 23:25:56 UTC
Ok - I've disabled ipv6 on this machine [because the stack trace has references to it] - and now the VMs are a lot more stable. I've tried a few things - the F12 host hasn't crashed yet.

I'll see if this stays stable [with all the 4VMs running concurrently].

BTW, I should have mentioned: I use bridge networking for the VMs [and it is also listed in the stack trace]. So perhaps the combination of bridge networking with ipv6 is the trigger for the crash.

Comment 7 Adam Huffman 2009-12-09 14:02:22 UTC
Am seeing something fairly similar, reported in https://bugzilla.redhat.com/show_bug.cgi?id=545851

Comment 8 Yanko Kaneti 2009-12-12 03:24:19 UTC
It seems I can confirm the ipv6 part of the anecdote. This is on a friend's AMD box, which doesn't have a VT-d knob in the BIOS. I can't test myself but the issue is quite reproducible there.

Comment 9 David Cartwright 2009-12-16 09:59:38 UTC
Thanks to a lead from the Fedora Forums I found this bug report that mirrors recent problems I have seen on both Fedora 11 and Fedora 12 64-bit KVM systems that are using bridge networks.

If Autostart is enabled for at least one VM, the systems are hard locking at reboot.

However, if ipv6 is disabled, the host boots normally and the VMs autostart as normal.

I have disabled ipv6 by editing /etc/modprobe.d/blacklist and adding the line:

install ipv6 /bin/true

If I remove all autostart options and re-enable ipv6, the KVM host starts fine, and the VMs can be manually started without any problems.

Hence, it appears there is a conflict (possibly just for systems using bridge networks) when ipv6 is enabled and VMs are configured to autostart.

Comment 10 Satish Balay 2009-12-16 18:10:47 UTC
Just an update: [after disabling ipv6] The machine now has been stable for the past 2 weeks [even with some reboots of the guest OSes]

[root@maverick ~]# uname -srv
Linux 2.6.31.6-145.fc12.x86_64 #1 SMP Sat Nov 21 15:57:45 EST 2009
[root@maverick ~]# uptime
 12:09:12 up 13 days, 19:12,  1 user,  load average: 0.31, 0.29, 0.21

Comment 11 Scott Marshall 2010-01-30 07:18:01 UTC
I have had a similar experience, although I could get the kernel to crash even when KVM virtual machines were not running (and libvirtd was disabled at startup).

Originally I thought it was an issue with Spanning Tree, but disabling it had no effect.

The issue for me involved Windows 7 clients with TCP/IPv6 active on their NIC profile.
As soon as the NIC initialised (or reset, for that matter), the Fedora 12 server would freeze.

Disabling the TCP/IPv6 driver in the Win7 clients resolved the issue.
WinXP clients caused no problem, as TCP/IPv6 is not installed on them.

For safety, turned off any explicit TCP/IPv6 settings on Fedora too, although even with IPV6INIT=no, Fedora still assigned auto IP6 address.

Network config:
HP ProCurve managed switch, with two ports configured in LACP (802.3ad) mode, no VLAN.

Fedora server config:
eth0+eth1 -> bond0 -> br0

/etc/sysconfig/network-scripts/ifcfg-br0:
...
IPV6INIT=no
IPV6_AUTOCONF=no
DHCPV6=no
...

ncftool> dumpxml br0
<?xml version="1.0"?>
<interface type="bridge" name="br0">
  <start mode="onboot"/>
  <protocol family="ipv4">
    <ip address="10.16.182.254" prefix="24"/>
    <route gateway="10.16.182.1"/>
  </protocol>
  <bridge stp="on">
    <interface type="bond" name="bond0">
      <bond mode="802.3ad">
        <miimon freq="100" updelay="100" carrier="ioctl"/>
        <interface type="ethernet" name="eth0">
          <mac address="00:23:7D:FB:FE:35"/>
        </interface>
        <interface type="ethernet" name="eth1">
          <mac address="00:23:7D:A8:EE:CC"/>
        </interface>
      </bond>
    </interface>
  </bridge>
</interface>

ncftool> dumpxml --live br0
<?xml version="1.0"?>
<interface name="br0" type="bridge">
  <bridge>
    <interface name="bond0" type="bond">
      <bond>
        <interface name="eth0" type="ethernet">
          <mac address="00:23:7d:fb:fe:35"/>
        </interface>
        <interface name="eth1" type="ethernet">
          <mac address="00:23:7d:fb:fe:35"/>
        </interface>
      </bond>
    </interface>
    <interface name="vnet0" type="ethernet">
      <mac address="e2:30:33:b3:84:78"/>
    </interface>
  </bridge>
  <protocol family="ipv4">
    <ip address="10.16.182.254" prefix="24"/>
  </protocol>
  <protocol family="ipv6">
    <ip address="fe80::223:7dff:fefb:fe35" prefix="64"/>
  </protocol>
</interface>

Comment 12 Satish Balay 2010-01-30 13:48:09 UTC
I disable IPV6 on the F12 box by doing the following:

- edit /etc/sysconfig/network and add the line
NETWORKING_IPV6=no

- create a file /etc/modprobe.d/disable-ipv6.conf with the line
install ipv6 /bin/true

(In reply to comment #11)

> For safety, turned off any explicit TCP/IPv6 settings on Fedora too, although
> even with IPV6INIT=no, Fedora still assigned auto IP6 address.
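
Since the quoted comment notes that an IPv6 address can still show up after the config changes, it may help to check what the host actually reports. The following is only a rough sketch, not something posted in this bug: count_ipv6_addresses() is a hypothetical helper, and it assumes that /proc/net/if_inet6 lists one configured IPv6 address per line and that the file is typically absent when the ipv6 module was never loaded.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from this bug report): count the entries in a
 * file laid out like /proc/net/if_inet6, i.e. one IPv6 address per line.
 *
 * Returns -1 if the file cannot be opened. For /proc/net/if_inet6 that is
 * assumed to mean the ipv6 module is not loaded at all.
 * Returns the number of non-empty lines otherwise.
 */
int count_ipv6_addresses(const char *path)
{
    char line[256];
    int count = 0;
    FILE *fp;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    /* Each entry is far shorter than the buffer, so one fgets() is one entry. */
    while (fgets(line, sizeof line, fp) != NULL) {
        if (line[0] != '\n')
            count++;
    }

    fclose(fp);
    return count;
}
```

Calling count_ipv6_addresses("/proc/net/if_inet6") on the host would then give -1 when the module is blocked as described above, and a positive count when something (for example an automatically assigned link-local address) is still configured.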

Comment 13 Jon Masters 2010-02-02 17:29:23 UTC
I spent the whole weekend learning the netfilter code and SLUB debugging to find this problem: http://lkml.org/lkml/2010/2/2/272

There should be a patch soon.

Comment 14 Jon Masters 2010-02-02 17:31:44 UTC
*** Bug 545851 has been marked as a duplicate of this bug. ***

Comment 15 Jon Masters 2010-02-02 18:46:01 UTC
Confirmed that the following hack prevents the issue (real fix is being worked on):

void nf_conntrack_destroy(struct nf_conntrack *nfct)
{
        void (*destroy)(struct nf_conntrack *);

        if ((struct nf_conn *)nfct == &nf_conntrack_untracked) {
                printk("JCM: nf_conntrack_destroy: trying to destroy nf_conntrack_untracked! CONTINUING...\n");
                //panic("JCM: nf_conntrack_destroy: trying to destroy nf_conntrack_untracked!\n");
                return; /* refuse to free nf_conntrack_untracked */
        }

        rcu_read_lock();
        destroy = rcu_dereference(nf_ct_destroy);
        BUG_ON(destroy == NULL);
        destroy(nfct);
        rcu_read_unlock();
}
EXPORT_SYMBOL(nf_conntrack_destroy);

The issue is that with multiple namespaces, we wind up decreasing the use count on the untracked static ct to zero and trying to free it, which is bad. Patrick should have a fix tomorrow using per-namespace untracked ct's.

Jon.
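
For readers who want to see the failure mode in isolation, here is a small user-space model of the description above. It is a sketch under loud assumptions: struct ct_model, untracked_model, and the ct_model_* functions are all made up for illustration and are not the netfilter API, and the model does not reproduce why the count becomes unbalanced with multiple namespaces. It only shows a shared, statically allocated object whose use count drops to zero, and a destroy path that refuses to tear it down, in the spirit of the hack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a reference-counted conntrack entry. */
struct ct_model {
    int use;        /* reference count */
    bool destroyed; /* set once the destroy path has really run */
};

/* Shared "untracked" entry: statically allocated, so it must never be freed. */
struct ct_model untracked_model = { .use = 1, .destroyed = false };

/*
 * Destroy path with the same kind of guard as the hack: refuse to tear
 * down the static entry. Returns true only if the object was destroyed.
 */
bool ct_model_destroy(struct ct_model *ct)
{
    if (ct == &untracked_model) {
        fprintf(stderr, "model: refusing to destroy the static untracked entry\n");
        return false;
    }
    ct->destroyed = true;
    return true;
}

/* Take a reference. */
void ct_model_get(struct ct_model *ct)
{
    ct->use++;
}

/*
 * Drop a reference and run the destroy path when the count reaches zero.
 * Returns true only if the object was actually destroyed.
 */
bool ct_model_put(struct ct_model *ct)
{
    assert(ct->use > 0);
    if (--ct->use == 0)
        return ct_model_destroy(ct);
    return false;
}
```

In this model, one ct_model_put(&untracked_model) without a matching ct_model_get() is enough to push the count to zero. Without the pointer comparison in ct_model_destroy() the static object would go down the same teardown path as an ordinary entry, which is the situation the hack avoids.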

Comment 16 Jon Masters 2010-02-02 18:47:18 UTC
This hack is harmless, but in an ideal world we wouldn't try freeing the untracked ct in the first place.

Comment 17 Jon Masters 2010-02-03 08:45:35 UTC
*** Bug 521362 has been marked as a duplicate of this bug. ***

Comment 18 Jon Masters 2010-02-03 20:07:54 UTC
I see Kyle is already making a test kernel with this.

Comment 19 Kyle McMartin 2010-02-03 20:43:15 UTC
Yeah, builds are in progress on all the targets I think.

Comment 20 Jon Masters 2010-02-08 14:01:21 UTC
*** Bug 520108 has been marked as a duplicate of this bug. ***

Comment 21 Jon Masters 2010-02-08 14:02:21 UTC
This has been fixed and confirmed.

Comment 22 Mihai Harpau 2010-02-08 14:13:09 UTC
Is this bug fixed only in rawhide?

Comment 23 Kyle McMartin 2010-02-08 15:16:15 UTC
No, it's been committed to F-11 and F-12 too.

Comment 24 Fedora Update System 2010-02-09 22:15:29 UTC
kernel-2.6.31.12-174.2.17.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/kernel-2.6.31.12-174.2.17.fc12

Comment 25 Fedora Update System 2010-02-16 13:18:59 UTC
kernel-2.6.31.12-174.2.19.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 26 Rik van Riel 2011-11-04 14:53:32 UTC
*** Bug 681917 has been marked as a duplicate of this bug. ***

