Description of problem:
After updating to the latest versions (with "yum update"), virtualisation has become unstable. Before the update the system was very stable (I don't recall it ever crashing).

The system has 2 virtual machines running:

# xm list
Name          ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0       0      2268      1  r-----     54.9
Hammerkit      4       256      1  ------     15.3
sanakirja      3       512      1  ------    659.6

After the system has been running for a couple of hours, VM "sanakirja" goes into a zombie state and VM "Hammerkit" loses network connectivity (it can still be reached with "xm console"). Networking in Dom0 is unaffected. The zombie VM cannot be destroyed, nor is it possible to restart it. Normal operation can be restored for a while by rebooting Dom0.

Version-Release number of selected component (if applicable):
Dom0:
  xen-3.0.2-3.FC5
  kernel-xen0-2.6.17-1.2174_FC5
DomU (both VMs):
  kernel-xenU-2.6.17-1.2174_FC5

How reproducible:
Crashes in less than 24 hours after rebooting the system.
Created attachment 135120 [details] Output from xen-bugtool
I'd like to "vote" for this bug as well. I've seen this one too with the 2.6.18-rc3 based Xen-enabled kernel, on x86_64 (single CPU). One domU crashed for no obvious reason, the network to the other two went dead, and xend failed to restart the domain (restarting too fast). When I logged in, "xm list" told me "Domain 0 not connected". After a restart of xend I could see the two other domUs, and attaching a console worked. Killing them left the vif* devices lying around, and starting new domains was impossible due to "hotplug problems". Before the incident, hotplug worked fine. Only a reboot got the machine out of that state.

I've seen this on two distinct hosts; one host even showed the phenomenon three times in a day, after two weeks without problems. I looked through all the logs, but found nothing special there; they look identical to the xenbug.tar.gz posted here. On the xen-users mailing list, Adrian Chadd <adrian.au> also saw this bug.

I find random crashes, and a xend that is unusable afterwards, rather worrying. So I'd propose raising the severity to high.
Just adding that I'm seeing identical behaviour with the 2.6.17-1.2187_FC5 kernel. Hardware is an Athlon XP 2200 with an RTL-8169 Gig NIC.
I'm curious if people have left their 'xm console' on for the relevant XenU's and seen if there was a kernel panic before things went to a Zombie? It might be that this is a duplicate of https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199944 . Please leave your consoles on and see if the Zombie is caused by a bug in Xennet. See if last lines say something like, "EIP: [<e10f9206>] network_tx_buf_gc+0xc4/0x1b7 [xennet] SS:ESP 0069:c0651edc <0>Kernel panic - not syncing: Fatal exception in interrupt"
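For anyone who saves the console output to a file rather than watching it live, here is a minimal sketch of how such a capture could be checked for the signature quoted above. It assumes the console text has already been saved somewhere; the function names and the choice of the two search strings are illustrative, taken only from the panic lines in this comment, and are not part of any Xen tool.

```python
import re

# Signature strings copied from the panic quoted in this comment.
# Assumption: a matching crash prints both of them to the domU console.
PANIC_PATTERNS = [
    re.compile(r"network_tx_buf_gc"),
    re.compile(r"Kernel panic - not syncing"),
]


def find_panic_lines(console_text):
    """Return the lines of captured console output that match any pattern."""
    hits = []
    for line in console_text.splitlines():
        if any(p.search(line) for p in PANIC_PATTERNS):
            hits.append(line.strip())
    return hits


def looks_like_xennet_panic(console_text):
    """True only if every signature string appears somewhere in the capture."""
    return all(p.search(console_text) for p in PANIC_PATTERNS)
```

Read the saved capture into a string and pass it to looks_like_xennet_panic(); if it returns True, find_panic_lines() gives the lines worth pasting into the bug.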
Yes, that seems to be the bug. The description matches exactly (I haven't had a console open though). The problems always started on an Apache machine, while the mail server machines worked flawlessly. Since the update to a -rc6 based kernel from the development tree, I didn't see the crash any more. I have seen the Zombie thing though after aggressively destroying all DomU's at once (which needed a reboot). I think someone should also look into why it's possible for a DomU to take the whole networking down and for Xen to get into an inconsistent state.
Created attachment 136944 [details] Console output from other domain when one went to zombie
Created attachment 137006 [details] Console output from crashing domain
Created attachment 137145 [details] Console output from crashing domain, 2nd time
Created attachment 137153 [details] Console output from crashing domain, 3rd time
Are there any plans to fix this, for example by reverting to an older version, or is the FC5 & Xen combination doomed? Currently it's not even usable for testing, as one domU will crash the whole system, so a 24-hour uptime is a miracle.
kernel-2.6.18-1.2189.fc5 is available in testing but the changelog reports that some new xen userspace tools are needed which will be available shortly.
*** This bug has been marked as a duplicate of 199944 ***