Bug 199944

Summary: XenU has Kernel panic (xennet?)
Product: [Fedora] Fedora
Reporter: Russell McOrmond <russell>
Component: xen
Assignee: Herbert Xu <herbert.xu>
Status: CLOSED NEXTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 5
CC: bench, bstein, bugs-redhat, christophe, cpaul, jacob, jan.roehrich, jussi.siponen, katzj, managed, ozone, quentin, quintela, spam, ssnodgra, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: athlon   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2007-04-18 11:32:21 UTC
Attachments:
[NET] back: Initialise first fragment properly (flags: none)

Description Russell McOrmond 2006-07-24 15:14:30 UTC
Description of problem:

I have a dual-core Athlon box running Xen0 and one XenU.

The machine was installed on July 11 and ran fine until the mornings of July
23 and 24, when the XenU was found to be in a Zombie state.  While I was posting
a message to xen-users.com the XenU crashed again, this time while
I was capturing the output of 'xm console', so I was able to see the kernel
message.

Version-Release number of selected component (if applicable):

on XenU: 2.6.17-1.2145_FC5xenU and kernel-xenU.i686 2.6.17-1.2157_FC5 
on Xen0: xen.i386 3.0.2-3.FC5 , kernel-xen0.i686 2.6.17-1.2157_FC5


How reproducible:

It happened yesterday morning (actually the evening before, but I only noticed
in the morning), and twice this morning.

This is a machine live on the net running mailing lists and websites (
http://calcutta.flora.ca ).  The timing makes me wonder if there is some sort of
DoS which can cause the problem (weird packet, etc).

Additional info:
Here is what I captured from the 'xm console':


BUG: unable to handle kernel NULL pointer dereference at virtual address 0000009a
 printing eip:
e10fd1ad
*pde = ma 08f98067 pa 17077067
*pte = ma 00000000 pa fffff000
Oops: 0002 [#1]
SMP
Modules linked in: ipv6 xennet ipt_REJECT xt_tcpudp iptable_filter
ipt_MASQUERADE iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables
dm_mirror dm_mod
CPU:    0
EIP:    0061:[<e10fd1ad>]    Not tainted VLI
EFLAGS: 00010046   (2.6.17-1.2157_FC5xenU #1)
EIP is at network_tx_buf_gc+0xc4/0x1b7 [xennet]
eax: 00000011   ebx: 0000000c   ecx: d9fc8cfc   edx: 00000000
esi: 00000001   edi: d9fc8400   ebp: 0000000a   esp: c0651d90
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, threadinfo=c0650000 task=c05f1800)
Stack: <0>d9fc8cfc 00000000 00000000 00000004 d9fc8000 0000f002 0000f003 0000effc
       00000000 d9fc8488 d9fc8400 d9fc8000 e10fe150 dba603e0 00000000 00000000
       00000108 c043a57d 00000108 d9fc8000 c0651e3c c0651e3c 00000108 c0643800
Call Trace:
 <e10fe150> netif_int+0x24/0x66 [xennet]  <c043a57d> handle_IRQ_event+0x42/0x85
 <c043a64d> __do_IRQ+0x8d/0xdc  <c040665a> do_IRQ+0x1a/0x25
 <c0519efd> evtchn_do_upcall+0x66/0x9f  <c0404d79> hypervisor_callback+0x3d/0x48
 <e10fd9ca> network_alloc_rx_buffers+0x2c3/0x30b [xennet]  <e10fe9ac>
netif_poll+0x639/0x784 [xennet]
 <c055a3c5> net_rx_action+0xcd/0x1fe  <c041d5bb> __do_softirq+0x70/0xef
 <c041d67a> do_softirq+0x40/0x67  <c040665f> do_IRQ+0x1f/0x25
 <c0519efd> evtchn_do_upcall+0x66/0x9f  <c0404d79> hypervisor_callback+0x3d/0x48
 <c0407a6a> safe_halt+0x84/0xa7  <c0402bde> xen_idle+0x46/0x4e
 <c0402cfd> cpu_idle+0x94/0xad  <c0655772> start_kernel+0x346/0x34c
Code: b4 9f 00 09 00 00 50 e8 9d d5 41 df c7 84 9f 00 09 00 00 00 00 00 00 8b
87 f4 00 00 00 89 84 9f f4 00 00 00 89 9f f4 00 00 00 90 <ff> 8d 90 00 00 00 0f
94 c0 83 c4 10 84 c0 74 62 bb 00 e0 ff ff
EIP: [<e10fd1ad>] network_tx_buf_gc+0xc4/0x1b7 [xennet] SS:ESP 0069:c0651d90
 <0>Kernel panic - not syncing: Fatal exception in interrupt
[root@westbengal ~]#
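
Editorial sketch, not part of the original report: the oops above can be read mechanically. The Python below is a minimal illustration that pulls out the fault address, the EIP symbol, the registers and the bytes at EIP, then decodes the one instruction form that appears here (opcode FF /1 with a 32-bit displacement, i.e. a DEC on memory). The function names parse_oops and decode_dec_disp32 are made up for this example. Under that decoding the faulting access is a decrement at offset 0x90 from ebp, and ebp holds 0x0a, which would explain a fault at 0x9a; that reading is an inference from the trace, not something stated in the thread.

```python
import re

# Lines copied from the oops above, trimmed to what the sketch reads.
OOPS = """\
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000009a
EIP is at network_tx_buf_gc+0xc4/0x1b7 [xennet]
eax: 00000011   ebx: 0000000c   ecx: d9fc8cfc   edx: 00000000
esi: 00000001   edi: d9fc8400   ebp: 0000000a   esp: c0651d90
Code: b4 9f 00 09 00 00 50 e8 9d d5 41 df c7 84 9f 00 09 00 00 00 00 00 00 8b
87 f4 00 00 00 89 84 9f f4 00 00 00 89 9f f4 00 00 00 90 <ff> 8d 90 00 00 00 0f
94 c0 83 c4 10 84 c0 74 62 bb 00 e0 ff ff
"""


def parse_oops(text):
    """Return (fault address, EIP symbol, register dict, bytes starting at EIP)."""
    flat = " ".join(text.split())  # undo console line wrapping
    addr = int(re.search(r"virtual address ([0-9a-f]{8})", flat).group(1), 16)
    sym = re.search(r"EIP is at (\S+)", flat).group(1)
    regs = {name: int(val, 16)
            for name, val in re.findall(r"\b(e[a-z]{2}): ([0-9a-f]{8})", flat)}
    # The byte wrapped in <..> is where EIP points; keep it and what follows.
    m = re.search(r"<([0-9a-f]{2})>((?: [0-9a-f]{2})*)", flat)
    insn = bytes.fromhex(m.group(1) + m.group(2).replace(" ", ""))
    return addr, sym, regs, insn


def decode_dec_disp32(insn):
    """Decode only 'FF /1, mod=10': DEC on memory at disp32(base register)."""
    opcode, modrm = insn[0], insn[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0xFF or reg != 1 or mod != 2 or rm == 4:  # rm == 4 means SIB byte
        raise ValueError("not a 'dec disp32(reg)' instruction")
    base = ["eax", "ecx", "edx", "ebx", None, "ebp", "esi", "edi"][rm]
    disp = int.from_bytes(insn[2:6], "little", signed=True)
    return base, disp


addr, sym, regs, insn = parse_oops(OOPS)
base, disp = decode_dec_disp32(insn)
print(sym, base, hex(disp), hex(regs[base] + disp), hex(addr))
```

If the base register plus the displacement equals the reported fault address, the register most likely held a small integer where a pointer was expected. The sketch handles only this single instruction encoding and is no substitute for a real disassembler.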

Comment 1 Russell McOrmond 2006-07-27 13:48:30 UTC
I don't know if it is related, but I've experienced problems with programs that
were listening on a port no longer accepting connections.  Restarting the
program allows it to listen again.  The program is still running, but the
listen() seems to be disconnected.   It's not any specific program (happens with
sendmail, Apache, OpenLDAP, Cyrus-IMAPD).

I have taken one of my two Xen machines and backed out to 2.6.17-1.2145_FC5xen0
(and xenU in the user domains) to see if this will help.   Since the problem is
so intermittent it is hard to diagnose.


Comment 2 adrian chadd 2006-07-29 00:53:20 UTC
I'm seeing the same problem. Although I haven't yet managed to obtain a kernel trace, I'm suffering the
same symptoms.

The specifics:

* running 2.6.17-1.2157_FC5xenU in the DomUs, all running Debian
* running 2.6.17-1.2157_FC5xen0 in Dom0, under FC5
* one domain has iptables set up; the others weren't using it
* this domain is the one which crashes and ends up in a Zombie state - xend.log says this:

[2006-07-29 00:41:53 xend.XendDomainInfo] ERROR (XendDomainInfo:1577) VM (VMNAME) restarting 
too fast (14.881988 seconds since the last restart).  Refusing to restart to avoid loops.

I've downgraded the xen0 kernels to a locally-compiled version (2.6.16.3) with no iptables modules. I'll 
see how things go; then upgrade to the FC5xenU kernels without iptables and see how that goes.
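
Editorial sketch, not part of the original comment: the "restarting too fast" entry quoted above is easy to search for when checking whether a domain has hit xend's restart-loop guard. The Python below is a minimal illustration that assumes the log line format is exactly as pasted (rejoined onto one line); the pattern and the function name restart_loops are made up for this example and have not been checked against other xend versions.

```python
import re

# The xend.log entry quoted above, rejoined into a single line.
LOG = ("[2006-07-29 00:41:53 xend.XendDomainInfo] ERROR (XendDomainInfo:1577) "
       "VM (VMNAME) restarting "
       "too fast (14.881988 seconds since the last restart).  "
       "Refusing to restart to avoid loops.\n")

# Assumed layout: [timestamp logger] ERROR (...) VM <name> restarting too fast (<secs> seconds ...
PATTERN = re.compile(
    r"\[(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [^\]]*\] ERROR .*?"
    r"VM (?P<vm>\S+) restarting too fast \((?P<secs>[\d.]+) seconds"
)


def restart_loops(log_text):
    """Return (timestamp, vm name, seconds since last restart) for each match."""
    return [(m["when"], m["vm"], float(m["secs"]))
            for m in PATTERN.finditer(log_text)]


for when, vm, secs in restart_loops(LOG):
    print(f"{when}: {vm} refused restart after {secs:.1f}s")
```

The VM name is returned exactly as it appears in the log, so the redacted placeholder in the quoted line comes back with its parentheses.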


Comment 3 Russell McOrmond 2006-08-07 16:18:06 UTC
I backed out to an earlier version of the XenU kernels, and made sure that the
various TLS/nosegneg issues are dealt with
( http://www.flora.ca/status/322 ):

kernel-xenU-2.6.17-1.2145_FC5

This did not fix the problem, and the intermittent problems where XenUs turn
into Zombies continue.


Comment 4 Russell McOrmond 2006-08-09 12:42:46 UTC
I'm not sure that these are useful, but the output of 'xm console' when the XenU
switches to a Zombie state is as follows.

----

Fedora Core release 5 (Bordeaux)
Kernel 2.6.17-1.2145_FC5xenU on an i686

newdelhi login: BUG: unable to handle kernel NULL pointer dereference at virtual
address 000000ba
 printing eip:
e10d21a0
*pde = ma 10fdd067 pa 1021c067
*pte = ma 00000000 pa fffff000
Oops: 0002 [#1]
SMP
Modules linked in: ipv6 autofs4 xennet ipt_REJECT xt_tcpudp iptable_filter
ip_tables x_tables dm_mirror dm_mod
CPU:    0
EIP:    0061:[<e10d21a0>]    Not tainted VLI
EFLAGS: 00010046   (2.6.17-1.2145_FC5xenU #1)
EIP is at network_tx_buf_gc+0xb7/0x1aa [xennet]
eax: 00000027   ebx: 0000000d   ecx: decc0cfc   edx: 00000000
esi: 00000001   edi: decc0400   ebp: 0000002a   esp: c064dedc
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, threadinfo=c064c000 task=c05ef800)
Stack: <0>decc0cfc 00000000 00000000 00000004 decc0000 00128782 00128783 0012877c
       00000000 decc0488 decc0400 decc0000 e10d30ea df4a0cc0 00000000 00000000
       00000108 c0439ed9 00000108 decc0000 c064df88 c064df88 00000108 c0640800
Call Trace:
 <e10d30ea> netif_int+0x24/0x66 [xennet]  <c0439ed9> handle_IRQ_event+0x42/0x85
 <c0439fa9> __do_IRQ+0x8d/0xdc  <c040662a> do_IRQ+0x1a/0x25
 <c0518f4c> evtchn_do_upcall+0x66/0x9f  <c0404d49> hypervisor_callback+0x3d/0x48
<c0407a2f> safe_halt+0x79/0x9c  <c0402bde> xen_idle+0x46/0x4e
 <c0402cfd> cpu_idle+0x94/0xad  <c0651772> start_kernel+0x346/0x34c
Code: b4 9f 00 09 00 00 50 e8 5a 75 44 df c7 84 9f 00 09 00 00 00 00 00 00 8b 87
f4 00 00 00 89 84 9f f4 00 00 00 89 9f f4 00 00 00 90 <ff> 8d 90 00 00 00 0f 94
c0 83 c4 10 84 c0 74 62 bb 00 e0 ff ff
EIP: [<e10d21a0>] network_tx_buf_gc+0xb7/0x1aa [xennet] SS:ESP 0069:c064dedc
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 <c0418394> panic+0x3c/0x188  <c04057a0> die+0x246/0x27b
 <c040ee98> do_page_fault+0x0/0x70f  <c040f4a7> do_page_fault+0x60f/0x70f
 <c040ee98> do_page_fault+0x0/0x70f  <c0404d07> error_code+0x2b/0x30
 <e10d21a0> network_tx_buf_gc+0xb7/0x1aa [xennet]  <e10d30ea>
netif_int+0x24/0x66 [xennet]
 <c0439ed9> handle_IRQ_event+0x42/0x85  <c0439fa9> __do_IRQ+0x8d/0xdc
 <c040662a> do_IRQ+0x1a/0x25  <c0518f4c> evtchn_do_upcall+0x66/0x9f
 <c0404d49> hypervisor_callback+0x3d/0x48  <c0407a2f> safe_halt+0x79/0x9c
 <c0402bde> xen_idle+0x46/0x4e  <c0402cfd> cpu_idle+0x94/0xad
 <c0651772> start_kernel+0x346/0x34c
 [root@westbengal ~]# xm list
Name                              ID Mem(MiB) VCPUs State  Time(s)
Domain-0                           0      605     2 r-----   829.2
Zombie-newdelhi                    7      512     1 ----cd  2844.6
calcutta                           5      512     1 -b----  1801.6
[root@westbengal ~]#
----

While the 'calcutta' server is listed as fine, it is also dead.  I have to do a
reboot of the entire machine.
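
Editorial sketch, not part of the original comment: the 'xm list' output above can be checked by script for domains that xend has renamed to Zombie-* or whose State column carries the c (crashed) or d (dying) flag. The Python below is a minimal illustration assuming the six-column layout shown and domain names without spaces; parse_xm_list and unhealthy are names made up for this example.

```python
# The `xm list` output captured above.
XM_LIST = """\
Name                              ID Mem(MiB) VCPUs State  Time(s)
Domain-0                           0      605     2 r-----   829.2
Zombie-newdelhi                    7      512     1 ----cd  2844.6
calcutta                           5      512     1 -b----  1801.6
"""


def parse_xm_list(text):
    """Return one dict per domain from whitespace-separated `xm list` output."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 6:
            continue  # ignore prompts or blank lines mixed into the capture
        name, dom_id, mem, vcpus, state, secs = parts
        rows.append({"name": name, "id": int(dom_id), "mem_mib": int(mem),
                     "vcpus": int(vcpus), "state": state, "time_s": float(secs)})
    return rows


def unhealthy(rows):
    """Names of domains marked Zombie-* or flagged crashed (c) / dying (d)."""
    return [r["name"] for r in rows
            if r["name"].startswith("Zombie-") or set(r["state"]) & {"c", "d"}]


print(unhealthy(parse_xm_list(XM_LIST)))
```

As the comment itself notes, 'calcutta' shows a normal blocked state here even though it was dead, so a check like this catches only domains that xend already knows have failed.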

Is there some new kernel (Xen0 and/or XenU) that I should be testing?  It seems
that .2145 and .2157 both react the same in this case.



Comment 5 Herbert Xu 2006-08-15 10:24:25 UTC
*** Bug 201504 has been marked as a duplicate of this bug. ***

Comment 6 Herbert Xu 2006-08-15 10:26:48 UTC
Just an update.  I have yet to reproduce this problem here, which is why it's
taking a bit of time to resolve.  There's also a bug fix patch that may be
relevant; it should make it into an FC5 kernel soon, and I'd like you guys to
test it.

Comment 7 Herbert Xu 2006-08-16 15:57:31 UTC
Created attachment 134318
[NET] back: Initialise first fragment properly

This patch should fix the problem.  It'll enter the kernels soon.

Comment 8 Russell McOrmond 2006-08-21 13:31:01 UTC
I hate to be anxious, but do you know when there might be a kernel we can test?
A XenU kernel panicked late last night and it was down again this morning (same
"EIP: [<e10d21a0>] network_tx_buf_gc+0xb7/0x1aa [xennet]" as last night).


Comment 9 Herbert Xu 2006-08-21 13:54:20 UTC
I've been told that the merge should be done on Monday or Tuesday.

Comment 10 Ben 2006-09-11 17:22:36 UTC
What's the status on this? (And how would I find out, short of asking this way?)
I've been bitten many, many times by a bug that appears to be this one over the
weekend.

Comment 12 Russell McOrmond 2006-09-18 12:28:16 UTC
I updated to the latest kernel package which I assumed would have the fix in it,
but it Zombie'd again this morning.

It may also be useful to note that once this happens, any attempt to shut down
the other XenUs stops at the unloading of iptables.  It eventually times out,
with that XenU also becoming a Zombie.


---

Fedora Core release 5 (Bordeaux)
Kernel 2.6.17-1.2187_FC5xenU on an i686

pune login: BUG: unable to handle kernel NULL pointer dereference at virtual
address 000000d1
 printing eip:
e10f9206
*pde = ma 15531067 pa 13c93067
*pte = ma 00000000 pa fffff000
Oops: 0002 [#1]
SMP
Modules linked in: ipv6 autofs4 xennet ip_conntrack_netbios_ns ip_conntrack
nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables dm_snapshot
dm_zero dm_mirror dm_mod
CPU:    0
EIP:    0061:[<e10f9206>]    Not tainted VLI
EFLAGS: 00210046   (2.6.17-1.2187_FC5xenU #1)
EIP is at network_tx_buf_gc+0xc4/0x1b7 [xennet]
eax: 00000051   ebx: 00000063   ecx: deb38cf8   edx: 00000000
esi: 00000001   edi: deb38400   ebp: 00000041   esp: c0651edc
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, threadinfo=c0650000 task=c05f2800)
Stack: <0>deb38cf8 00000000 00000000 00000004 deb38000 0001eb44 0001eb46 0001eb39
       00000000 deb38488 deb38400 deb38000 e10fa3b7 c0991360 00000000 00000000
       00000107 c043a63d 00000107 deb38000 c0651f88 c0651f88 00000107 c0644780
Call Trace:
 <e10fa3b7> netif_int+0x24/0x66 [xennet]  <c043a63d> handle_IRQ_event+0x42/0x85
 <c043a70d> __do_IRQ+0x8d/0xdc  <c040665a> do_IRQ+0x1a/0x25
 <c051a159> evtchn_do_upcall+0x66/0x9f  <c0404d79> hypervisor_callback+0x3d/0x48
<c0407aad> safe_halt+0x84/0xa7  <c0402bde> xen_idle+0x46/0x4e
 <c0402cfd> cpu_idle+0x94/0xad  <c0655772> start_kernel+0x346/0x34c
Code: b4 9f fc 08 00 00 50 e8 a0 17 42 df c7 84 9f fc 08 00 00 00 00 00 00 8b 87
f4 00 00 00 89 84 9f f4 00 00 00 89 9f f4 00 00 00 90 <ff> 8d 90 00 00 00 0f 94
c0 83 c4 10 84 c0 74 62 bb 00 e0 ff ff
EIP: [<e10f9206>] network_tx_buf_gc+0xc4/0x1b7 [xennet] SS:ESP 0069:c0651edc
 <0>Kernel panic - not syncing: Fatal exception in interrupt



Comment 13 Russell McOrmond 2006-09-19 11:53:42 UTC
The newer kernel seems to be far worse, and I'm backing out of it.

Fedora Core release 5 (Bordeaux)
Kernel 2.6.17-1.2187_FC5xenU on an i686

calcutta login: ------------[ cut here ]------------
kernel BUG at net/core/dev.c:1206!
invalid opcode: 0000 [#1]
SMP
Modules linked in: ipv6 xennet ipt_REJECT xt_tcpudp iptable_filter
ipt_MASQUERADE iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables
dm_mirror dm_mod
CPU:    0
EIP:    0061:[<c055821a>]    Not tainted VLI
EFLAGS: 00210297   (2.6.17-1.2187_FC5xenU #1)
EIP is at skb_gso_segment+0x29/0xc9
eax: 00000000   ebx: ded26ba4   ecx: 00050003   edx: c05f7700
esi: ded26ba4   edi: 00000008   ebp: df220000   esp: dbb83c60
ds: 007b   es: 007b   ss: 0069
Process httpd (pid: 1021, threadinfo=dbb82000 task=c01f3870)
Stack: <0>00000001 ded26ba4 c0899300 c055938b ded26ba4 00050003 00000001 df220000
       ded26ba4 df220180 00000000 c0564e1e ded26ba4 df220000 c010d200 00000000
       df220000 dbb82000 ded26ba4 c055af7e df220000 df2f8754 df2f8774 c08992cc
Call Trace:
 <c055938b> dev_hard_start_xmit+0x174/0x203  <c0564e1e> __qdisc_run+0xe0/0x19a
 <c055af7e> dev_queue_xmit+0x1ce/0x2cc  <c0575232> ip_output+0x1b6/0x1ec
 <c0574ace> ip_queue_xmit+0x374/0x3b3  <c040734f> monotonic_clock+0x30/0x70
 <c0440df0> get_page_from_freelist+0x99/0x463  <c05821f3>
tcp_transmit_skb+0x5d2/0x602
 <c058192f> tcp_snd_test+0x17/0xcc  <c0583d08> tcp_push_one+0xb2/0xd4
 <c057a466> tcp_sendmsg+0x7a1/0x9cc  <c054f6b5> do_sock_write+0xa3/0xac
 <c055171c> sock_writev+0xab/0xc3  <c042a45b> autoremove_wake_function+0x0/0x3a
 <c044111b> get_page_from_freelist+0x3c4/0x463  <c045a1d8>
do_readv_writev+0x148/0x23a
 <c040f5b1> do_page_fault+0x414/0x8c1  <c045a74b> sys_writev+0x3b/0x97
 <c0404ba7> syscall_call+0x7/0xb
Code: eb a6 57 56 53 8b 5c 24 10 8b 83 a0 00 00 00 0f b7 7b 76 83 78 10 00 74 08
0f 0b b5 04 a0 b3 5d c0 8a 43 74 83 e0 0c 3c 04 74 08 <0f> 0b b6 04 a0 b3 5d c0
8b 83 98 00 00 00 8b 53 20 89 43 24 29
EIP: [<c055821a>] skb_gso_segment+0x29/0xc9 SS:ESP 0069:dbb83c60
 <0>Kernel panic - not syncing: Fatal exception in interrupt


Comment 14 Ben 2006-09-19 16:01:56 UTC
Well, I don't want to jinx it for myself, but I have yet to have a problem with
the new kernel crashing my domU.

(It did crash my dom0 once, and I haven't figured that one out yet, but at least
my domU seems stable now.)

Comment 15 Russell McOrmond 2006-09-19 16:25:45 UTC
I wonder if there is a pattern we can figure out to help trace this bug.

I have three servers I'm involved with:

One is a "production" server (although with this reliability we risk losing
customers) with 6 XenUs.   When the machine starts going to Zombies it has
nearly always started with a specific XenU which is a busy LAMP server.

The second is a personal box with two XenUs which are LAMP.

The third is a personal box with two XenUs which have BIND/SendMail/Cyrus-IMAPD
on one and OpenLDAP on the other.

The first and second boxes go into Zombie mode often, but the third box (knock
on wood) has not experienced many problems at all.  I receive a lot of email
traffic, and there are times of the day where the Email server is more busy than
the LAMP servers, and yet it is the LAMP servers that are crashing.


Another friend runs a Xen box in the same colocation facility as the first. 
This box is primarily running DNS and various shell utilities, and from what I'm
told he has not had any of the problems I've observed.

I wonder if there is something in what Apache or MySQL does with the network
that makes servers running these applications more likely to have problems.


Comment 16 Herbert Xu 2006-09-20 00:21:47 UTC
Actually this problem is already fixed in rawhide.  It's just that Xen hasn't
been updated in FC5 for quite a while due to the effort being focused on FC6 and
RHEL5.

Comment 17 Russell McOrmond 2006-09-20 00:26:24 UTC
Do you have suggested binaries to try that will otherwise integrate with a FC5
environment?

Comment 18 Herbert Xu 2006-09-20 00:31:50 UTC
I've just been told that there really is going to be an FC5 Xen update now. 
It's going to be tested tomorrow.

Comment 19 Russell McOrmond 2006-09-20 02:03:08 UTC
This is great news.  I tried doing a "--enablerepo=development" and while I saw
an update to the regular kernel I didn't see a xen0 or xenU kernel update.

I did notice:
xen.i386                                 3.0.2-33               development

If you can point us to the binaries to test when they are available (even before
the mirrors get it) it would be helpful.  For whatever reason we seem to have
just the right conditions to be hitting this specific bug.

Comment 20 Herbert Xu 2006-09-21 01:10:54 UTC
I just checked the FC5 branch and unfortunately the update isn't there yet.

Comment 21 Russell McOrmond 2006-09-21 11:58:08 UTC
Thanks for checking.

Just for kicks I tried the kernel-xen, xen and dependencies from the
'development' yum repository.  When trying to create the XenU's it gave me a
"Error: (22, 'Invalid argument')".

Is this enough of a change that I need to install a different kernel in each of
the XenU's as well, or did that indicate something else is wrong/missing?  The
xend logs didn't seem to help much in figuring out the problem. What kernels
should be in the xenU's now?  Is there a different setup for FC6 (and these
development packages) than what was there for FC5?  



Comment 22 Herbert Xu 2006-09-21 13:37:57 UTC
If your domU kernel does not have PAE then you'll need to replace it with one
that does have PAE.

Comment 23 Chris Langlands 2006-09-26 06:26:18 UTC
I experienced similar problems after I moved my xen host and guest to 2187.

I seem to have stabilised my situation by rolling the host back to
2.6.17-1.2145_FC5xen0 and leaving the guest at kernel-xenU-2.6.17-1.2187_FC5

This is on x86_64 anyway.

Herbert, it would be great to know when we can expect that FC5 update.




Comment 24 Quentin Stafford-Fraser 2006-09-26 19:10:17 UTC
Chris - 

I'm having the same problem, but am a bit new to the Fedora world - can you tell me where I can get 
2145?  Or did you mean 2045 that comes with the base distribution?

Many thanks!

Comment 25 Chris Langlands 2006-09-27 02:53:57 UTC
Quentin, the above trick did not hold for long.  Looks like I might need to
migrate services out of Xen until there's an update.

Would be interested to know if others here have stabilised their FC5 Xen setups,
and how?

Comment 26 Ben 2006-09-27 02:57:37 UTC
No stabilization here. I'm contemplating moving to FC6t3 because this is
supposed to be fixed there.... but I really don't want to go there if I don't
have to.

That said, I can go days without issues.

(Kinda sad "going days without issues" is considered good, though....)

Comment 27 Anchor Systems Managed Hosting 2006-09-27 08:24:54 UTC
We are experiencing the same kernel panic on the 2.6.17-1.2187 kernel on FC5,
but we are not using Xen at all, just a plain machine with the standard kernel
and no virtual machines.

The problem occurs when a significant amount of outbound network IO is
processed, such as running 'dmesg' from an ssh session to the machine.

Interestingly if ip_conntrack and all of its dependent modules are removed and
then added back in, the problem goes away temporarily.

Comment 29 Herbert Xu 2006-09-27 10:20:24 UTC
There is now a 2.6.18 kernel (2189) in FC5 testing that should resolve this bug.

Comment 30 Herbert Xu 2006-09-27 10:35:57 UTC
*** Bug 204468 has been marked as a duplicate of this bug. ***

Comment 31 Quentin Stafford-Fraser 2006-09-27 13:25:27 UTC
Many thanks, Herbert - the 2.6.18 kernel is looking good so far...


Comment 32 Chris Langlands 2006-09-27 13:28:24 UTC
Managed to get the 2189 dom0 kernel installed, it hasn't crashed yet.  

But the 2189 domU kernel crashed within minutes, with very little load.  Since
the crash the xend service will not restart.

Comment 33 Russell McOrmond 2006-09-27 13:41:57 UTC
There seemed to be a mismatch with xen-3.0.2-3.FC5, so I reverted to an older
xen0.   The xenUs didn't come up properly, printing messages that are possibly
related to the mismatched xen0:

4gb seg fixup, process httpd (pid 1079), cs:ip 73:0085d6b8
printk: 10446 messages suppressed.
4gb seg fixup, process sendmail (pid 1017), cs:ip 73:0032143e
printk: 7266 messages suppressed.
4gb seg fixup, process httpd (pid 1128), cs:ip 73:0085d6b8
printk: 37 messages suppressed.

I have reverted back to  2.6.17-1.2174

Comment 34 Jacob Boswell 2006-09-27 22:26:32 UTC
OK, I installed the new 'testing' kernel 2.6.18-1.2189.fc5xen0; however, when
trying to run an xm command I get the following:

[root@xen1 ~]# xm li
Error: Error connecting to xend: No such file or directory.  Is xend running?

Is there a new xen package to go along with the new kernel?

Comment 35 Anchor Systems Managed Hosting 2006-09-28 01:14:01 UTC
That 2189 testing kernel has resolved the problem for me.

Comment 36 Chris Langlands 2006-09-28 02:44:51 UTC
For me, service xend start generates the following in /var/log/xend.log:

[2006-09-28 12:43:17 xend] INFO (SrvDaemon:283) Xend Daemon started
[2006-09-28 12:43:17 xend] INFO (SrvDaemon:287) Xend changeset: unavailable .
[2006-09-28 12:43:17 xend] ERROR (SrvDaemon:297) Exception starting xend ((38,
'Function not implemented'))
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvDaemon.py", line
291, in run
    servers = SrvServer.create()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvServer.py", line
108, in create
    root.putChild('xend', SrvRoot())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvRoot.py", line 40,
in __init__
    self.get(name)
  File "/usr/lib64/python2.4/site-packages/xen/web/SrvDir.py", line 82, in get
    val = val.getobj()
  File "/usr/lib64/python2.4/site-packages/xen/web/SrvDir.py", line 52, in getobj
    self.obj = klassobj()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvDomainDir.py",
line 39, in __init__
    self.xd = XendDomain.instance()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 609, in
instance
    inst.init()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 76, in init
    self._add_domain(
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 139, in
xen_domains
    domlist = xc.domain_getinfo()
Error: (38, 'Function not implemented')
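
Editorial sketch, not part of the original comment: the tuple at the end of the traceback looks like an (errno, message) pair. The Python below is a minimal illustration that parses such a line and looks up the symbolic name through the standard errno module; explain_xend_error is a name made up for this example. On Linux, 38 maps to ENOSYS, which would be consistent with userspace xen tools calling an interface that the running kernel or hypervisor does not provide, but that is an inference rather than something the log states.

```python
import errno
import re


def explain_xend_error(line):
    """Parse "Error: (N, 'text')" and return (N, symbolic errno name, text)."""
    m = re.search(r"\((\d+), '([^']*)'\)", line)
    if not m:
        return None
    code, text = int(m.group(1)), m.group(2)
    # errno numbering is platform specific; this lookup uses the local table.
    return code, errno.errorcode.get(code, "?"), text


print(explain_xend_error("Error: (38, 'Function not implemented')"))
print(explain_xend_error("Error: (22, 'Invalid argument')"))
```

The second call uses the "Invalid argument" error reported earlier in the thread when creating XenUs with the development packages.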


Comment 37 Herbert Xu 2006-09-28 08:52:40 UTC
For those who are having trouble using the new kernel, please make sure you've
upgraded both dom0 and domU kernels.  You also need the new xen package since
the management ABI has changed. If you're still having start-up problems with
the latest packages, please file a new bug on your problem.  Thanks.

Comment 38 Chris Langlands 2006-09-28 09:29:04 UTC
Hope you don't mind me continuing this thread, but I don't know where a new xen
package can be found.  I seem to have the latest available, but still no luck. 
Am I the only one still having problems?

$ uname -r
2.6.18-1.2189.fc5xen0

$ yum --enablerepo=updates-testing list xen
xen.x86_64           3.0.2-3.FC5            installed

# service  xend status
xend is running

# xm list
Error: Error connecting to xend: No such file or directory.  Is xend running?

# service  xend stop
Stopping xend:                                             [  OK  ]
[root@syd3 ~]# service  xend start
Starting xend: [2006-09-28 19:26:27 xend] INFO (SrvDaemon:283) Xend Daemon started
[2006-09-28 19:26:27 xend] INFO (SrvDaemon:287) Xend changeset: unavailable .
[2006-09-28 19:26:27 xend] ERROR (SrvDaemon:297) Exception starting xend ((38,
'Function not implemented'))
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvDaemon.py", line
291, in run
    servers = SrvServer.create()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvServer.py", line
108, in create
    root.putChild('xend', SrvRoot())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvRoot.py", line 40,
in __init__
    self.get(name)
  File "/usr/lib64/python2.4/site-packages/xen/web/SrvDir.py", line 82, in get
    val = val.getobj()
  File "/usr/lib64/python2.4/site-packages/xen/web/SrvDir.py", line 52, in getobj
    self.obj = klassobj()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvDomainDir.py",
line 39, in __init__
    self.xd = XendDomain.instance()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 609, in
instance
    inst.init()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 76, in init
    self._add_domain(
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 139, in
xen_domains
    domlist = xc.domain_getinfo()
Error: (38, 'Function not implemented')


Comment 39 Herbert Xu 2006-09-28 09:43:20 UTC
OK, please file a new bug report since this is a different problem.

Comment 40 Chris Langlands 2006-09-28 10:04:44 UTC
Herbert, I think this is still relevant, and might explain why others have had
success.  There's another kernel in updates-testing, a -xen (not -xen0):

  kernel-xen.x86_64 0:2.6.18-1.2189.fc5

I was about to boot this but noticed it boots a PAE kernel by default:

title Fedora Core (2.6.18-1.2189.fc5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-1.2189.fc5-PAE
        module /vmlinuz-2.6.18-1.2189.fc5xen ro root=/dev/md1
        module /initrd-2.6.18-1.2189.fc5xen.img

There are a number of other kernels provided in the package, that seem to be
non-PAE kernels.  If I choose a non-PAE variant thus:

title Fedora Core (2.6.18-1.2189.fc5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-1.2189.fc5
        module /vmlinuz-2.6.18-1.2189.fc5xen ro root=/dev/md1
        module /initrd-2.6.18-1.2189.fc5xen.img

Is the system likely to boot?   I don't have physical access to the system, nor
do I have an x86_64 machine here to test conclusively.

Or should I just stay away from the -xen (non -xen0) package?

Comment 41 Herbert Xu 2006-09-28 11:12:21 UTC
The PAE xen.gz file is the only one for 64-bit.  In fact
/xen.gz-2.6.18-1.2189.fc5 (without the PAE suffix) does not exist in the x86-64
package.  64-bit always uses PAE.

Comment 42 Russell McOrmond 2006-09-28 11:24:13 UTC
Is it possible that the needed updated 'xen' is the one in the development
repository?  I can't test right now, but someone else might want to.

Installed Packages
xen.i386                                 3.0.2-3.FC5            installed
Available Packages
xen.i386                                 3.0.2-36               development
xen.i386                                 3.0.1-4                core

If that is the problem, file a bug to indicate that there should be a dependency
listed, and the new xen also added to the 'updates-testing' so that it will be
sent to 'updates' at the same time as the new kernels.



Comment 43 Chris Langlands 2006-09-28 11:35:00 UTC
Thanks Herbert.  The kernel-xen package produced the same result as the
kernel-xen0 package.  On boot, 'service xend status' says xend is running, but
'xm list' fails with "Error: Error connecting to xend: No such file or
directory.  Is xend running?"  Guess I'll wait and see if the eventual kernel
update proper works.

Comment 44 Herbert Xu 2006-09-29 04:48:42 UTC
Please file a new bug report on this.  Thanks.

Comment 45 e 2006-09-29 07:29:25 UTC
Herbert, a gotcha that needs to be addressed before this kernel is released:

With the change to 2.6.18, the raid5 module has been renamed to raid456. mkinitrd
is not aware of this, resulting in a broken initrd for systems which have / on
RAID 5. So a dependency on a newer version of mkinitrd will most definitely be
needed.

Comment 46 e 2006-09-29 07:52:14 UTC
I have the same problem as Chris with the update. Details in bug 208529.

Comment 47 Herbert Xu 2006-09-29 11:37:36 UTC
Please bring up new issues like the raid456 module in new bug reports.  Thanks.

Comment 48 Chris Langlands 2006-10-13 01:59:32 UTC
Is there a timeframe for the new kernel (and the thus far unreleased xen
package) to be pushed to updates?  

Comment 49 Herbert Xu 2006-10-13 12:02:03 UTC
*** Bug 209910 has been marked as a duplicate of this bug. ***

Comment 50 Andre Pang 2006-10-14 02:10:18 UTC
We're also seeing this problem with a four-processor (dual-processor dual-core) Intel Xeon.  We had a 
very stable machine running for months with zero problems, but recently it's been crashing once per day.  
Currently running 2.6.17-1.2187_FC5xen0, although I believe the problem also occurred with 
2.6.16-1.2096_FC5xen0 (the domUs kernel being in sync with the dom0 both times).


Comment 51 Andre Pang 2006-10-17 13:07:03 UTC
For those who are awaiting a fix, we just downgraded our Xen host to 2.6.15-1.2054_FC5xen0 (and 
downgraded all the domUs to the corresponding xenU kernel) and haven't had a single crash yet.

Comment 52 Russell McOrmond 2006-10-17 13:21:42 UTC
Andre,

Watch out for a different bug, Bug 203122, when running one of the older Xen
kernels. It seems to have been fixed in the newer kernels, but those have
different issues, which are believed to be fixed in the most recent kernel
(which requires the updated 'xen' package).



Comment 53 Stephen Tweedie 2007-03-16 15:08:38 UTC
Is this still reproducible on the latest stable release+updates?  Thanks.


Comment 54 Ben 2007-03-16 15:41:33 UTC
Not by me...

Comment 55 Russell McOrmond 2007-04-11 21:08:12 UTC
This can be closed, as I have not seen this bug for months now.  I know it is
easier to know a bug is there than know it is not, but the lack of problems over
a long period makes me confident.