Bug 204468 - xen VM goes zombie
Summary: xen VM goes zombie
Keywords:
Status: CLOSED DUPLICATE of bug 199944
Alias: None
Product: Fedora
Classification: Fedora
Component: xen
Version: 5
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Herbert Xu
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-08-29 12:32 UTC by Jussi Siponen
Modified: 2007-11-30 22:11 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-09-27 10:35:12 UTC
Type: ---
Embargoed:


Attachments
Output from xen-bugtool (57.33 KB, application/octet-stream)
2006-08-29 12:32 UTC, Jussi Siponen
Console output from other domain when one went to zombie (3.25 KB, text/plain)
2006-09-22 14:39 UTC, Ville Lindfors
Console output from crashing domain (3.13 KB, text/plain)
2006-09-23 19:35 UTC, Ville Lindfors
Console output from crashing domain, 2nd time (3.06 KB, text/plain)
2006-09-26 14:49 UTC, Ville Lindfors
Console output from crashing domain, 3rd time (3.06 KB, text/plain)
2006-09-26 17:13 UTC, Ville Lindfors

Description Jussi Siponen 2006-08-29 12:32:13 UTC
Description of problem:

After updating to the latest versions (with "yum update"), virtualisation has
become unstable. Before the update the system was very stable (I don't recall
it ever crashing).

The system has 2 virtual machines running:

# xm list
Name                              ID Mem(MiB) VCPUs State  Time(s)
Domain-0                           0     2268     1 r-----    54.9
Hammerkit                          4      256     1 ------    15.3
sanakirja                          3      512     1 ------   659.6

When the system has been running for a couple of hours, VM "sanakirja" goes into
zombie state and VM "Hammerkit" loses network connectivity (it can still be
reached with "xm console"). Networking in Dom0 is unaffected.

The zombie VM cannot be destroyed, nor is it possible to restart it. Normal
operation can be restored for a while by rebooting Dom0.
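
The symptom above can be watched for from Dom0. The following is a minimal sketch, not part of the original report: it parses text in the "xm list" column layout shown above and flags domains that look stuck. It assumes that a stuck domain shows the c (crashed) or d (dying) state flag, or carries a "Zombie-" name prefix; the report does not show how the zombie appeared in the listing, so both conditions are assumptions. The function names are hypothetical.

```python
def parse_xm_list(text):
    """Parse the table printed by `xm list` into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a six-column data row.
        if len(fields) != 6 or fields[0] == "Name":
            continue
        name, dom_id, mem, vcpus, state, time_s = fields
        rows.append({
            "name": name,
            "id": int(dom_id),
            "mem_mib": int(mem),
            "vcpus": int(vcpus),
            "state": state,
            "time_s": float(time_s),
        })
    return rows


def suspect_domains(rows):
    """Return names of domains that look crashed, dying, or zombie.

    Assumption: 'c'/'d' state flags or a 'Zombie-' name prefix mark a
    stuck domain; adjust to whatever the affected host actually prints.
    """
    return [
        r["name"]
        for r in rows
        if r["name"].startswith("Zombie-") or set(r["state"]) & {"c", "d"}
    ]


if __name__ == "__main__":
    # Listing copied from the report; all domains are healthy here.
    sample = """\
Name                              ID Mem(MiB) VCPUs State  Time(s)
Domain-0                           0     2268     1 r-----    54.9
Hammerkit                          4      256     1 ------    15.3
sanakirja                          3      512     1 ------   659.6
"""
    print(suspect_domains(parse_xm_list(sample)))
```

Running a check like this periodically and logging the first time a domain is flagged would narrow down when the failure happens within the 24-hour window.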


Version-Release number of selected component (if applicable):

Dom0:
xen-3.0.2-3.FC5
kernel-xen0-2.6.17-1.2174_FC5

DomU: (both VMs)
kernel-xenU-2.6.17-1.2174_FC5


How reproducible:

Crashes in less than 24 hours after rebooting the system.

Comment 1 Jussi Siponen 2006-08-29 12:32:13 UTC
Created attachment 135120
Output from xen-bugtool

Comment 2 Christophe Saout 2006-09-19 21:37:28 UTC
I'd like to "vote" for this bug as well.

I've seen this one too with the 2.6.18-rc3 based Xen-enabled kernel, on x86_64
(single CPU). One domU crashed for no obvious reason, the network to the other
two went dead, and xend failed to restart the domain (restarting too fast). When
I logged in, "xm list" told me "Domain 0 not connected". Restarting xend let me
see the two other domUs, and attaching a console worked. Killing them left the
vif* devices lying around, and starting new domains was impossible due to
"hotplug problems", although hotplug had worked fine before the incident. Only
a reboot got the machine out of that state.

I've seen this on two distinct hosts; one host even showed the phenomenon three
times a day, after two weeks without problems. I looked through all the logs but
found nothing special; they look identical to the xenbug.tar.gz posted here.

On the xen-users mailing list, Adrian Chadd <adrian.au> also saw
this bug. I find random crashes and an afterwards unusable xend rather worrying.

So, I'd propose to raise the severity to high.


Comment 3 Kwan Lowe 2006-09-21 01:15:05 UTC
Just adding that I'm seeing identical behaviour with the 2.6.17-1.2187_FC5
kernel. Hardware is an Athlon XP 2200 with an RTL-8169 Gig NIC.   

Comment 4 Russell McOrmond 2006-09-21 15:21:50 UTC
I'm curious whether people have left 'xm console' attached to the relevant
XenUs and seen a kernel panic before the domain went zombie.

This might be a duplicate of bug 199944
(https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199944). Please leave
your consoles attached and see whether the zombie is caused by a bug in
xennet.

See if last lines say something like, "EIP: [<e10f9206>]
network_tx_buf_gc+0xc4/0x1b7 [xennet] SS:ESP 0069:c0651edc
 <0>Kernel panic - not syncing: Fatal exception in interrupt"
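
For anyone who has saved the console output to a file, the check can be automated. This is a minimal sketch, assuming the console text is available as a plain string; it looks for the two markers quoted above (a "Kernel panic - not syncing" line and a network_tx_buf_gc frame in the xennet module). The function name is hypothetical, and the addresses and offsets are deliberately not matched exactly, since they are likely to differ between kernel builds.

```python
import re

# Marker printed when the guest kernel panics.
PANIC_RE = re.compile(r"Kernel panic - not syncing")

# Faulting frame inside the xennet module; offsets are matched loosely
# because they may vary between kernel builds (an assumption).
XENNET_RE = re.compile(
    r"network_tx_buf_gc\+0x[0-9a-f]+/0x[0-9a-f]+ \[xennet\]"
)


def matches_xennet_panic(console_text):
    """Return True if the text contains both the panic and the xennet frame."""
    return bool(PANIC_RE.search(console_text)) and bool(
        XENNET_RE.search(console_text)
    )


if __name__ == "__main__":
    # Lines quoted in the comment above.
    sample = (
        "EIP: [<e10f9206>]\n"
        "network_tx_buf_gc+0xc4/0x1b7 [xennet] SS:ESP 0069:c0651edc\n"
        " <0>Kernel panic - not syncing: Fatal exception in interrupt\n"
    )
    print(matches_xennet_panic(sample))
```

A match only suggests the same crash signature as bug 199944; it does not prove the two bugs share a root cause.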



Comment 5 Christophe Saout 2006-09-21 16:26:22 UTC
Yes, that seems to be the bug. The description matches exactly (I haven't had a
console open though). The problems always started on an Apache machine, while
the mail server machines worked flawlessly. Since the update to a -rc6 based
kernel from the development tree, I haven't seen the crash any more. I have seen
the zombie state, though, after aggressively destroying all DomUs at once (which
needed a reboot). I think someone should also look into why it's possible for a
DomU to take the whole networking down and for Xen to get into an inconsistent
state.

Comment 6 Ville Lindfors 2006-09-22 14:39:13 UTC
Created attachment 136944
Console output from other domain when one went to zombie

Comment 7 Ville Lindfors 2006-09-23 19:35:15 UTC
Created attachment 137006
Console output from crashing domain

Comment 8 Ville Lindfors 2006-09-26 14:49:06 UTC
Created attachment 137145
Console output from crashing domain, 2nd time

Comment 9 Ville Lindfors 2006-09-26 17:13:08 UTC
Created attachment 137153
Console output from crashing domain, 3rd time

Comment 10 Ville Lindfors 2006-09-26 17:41:24 UTC
Are there any plans to fix this, for example by reverting to an older version,
or is the FC5 & Xen combination doomed? Currently it's not even usable for
testing, as one domU will crash the whole system, so a 24-hour uptime is a miracle.

Comment 11 Kwan Lowe 2006-09-26 17:52:32 UTC
kernel-2.6.18-1.2189.fc5 is available in testing but the changelog reports that
some new xen userspace tools are needed which will be available shortly.

Comment 12 Herbert Xu 2006-09-27 10:35:12 UTC

*** This bug has been marked as a duplicate of 199944 ***

