Red Hat Bugzilla – Bug 452923
Network stall after xen migration
Last modified: 2008-09-26 02:18:29 EDT
Description of problem:
After migrating a domU from one host to the other, outbound (from domU) network connections stall.
Version-Release number of selected component (if applicable):
Both dom0 and domU are running 5.2 i386.
Steps to Reproduce:
1. Create 5.2 PV domU on 5.2 dom0
2. Migrate (stall or live) to another 5.2 dom0
3a. From domU, ping any external host
3b. Attempt to run yum list updates
Actual results:
One ping is sent and answered, and then ping shows nothing further. If you try a
yum list updates, you get no output and yum hangs.
I first noticed this when the amanda client failed to respond to a backup request. Log files
indicate that the amanda client actually attempted to respond 2+ hours after
the connection was initiated, and failed because the other side had closed the pipe.
Expected results:
Ping, yum, and amanda function as they should and connections don't stall.
Additional info:
If you do a ping -f (flood), many packets go out with 0% loss. Existing
connections (such as an established ssh session) also appear to continue to function
as expected. Outbound ssh connections usually complete as well, whereas
most other new connections stall.
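A minimal sketch for checking the ping symptom from inside the domU after a migration. It assumes an iputils-style ping that prints an "N packets transmitted, M received" summary line. The helper names (parse_ping_summary, classify, run_ping) are made up for illustration and are not part of any Xen or RHEL tooling. Note that if the guest clock itself is stalled, the timeout used here may not fire on time.

```python
import re
import subprocess

# Matches the summary line printed by iputils ping, e.g.
# "5 packets transmitted, 5 received, 0% packet loss, time 4005ms"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_ping_summary(output):
    """Return (transmitted, received) from ping output, or None if no summary line is found."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def classify(output):
    """Return 'ok' if every probe was answered, 'loss' if some were not,
    and 'no-summary' if ping never printed its summary line."""
    counts = parse_ping_summary(output)
    if counts is None:
        return "no-summary"
    sent, received = counts
    return "ok" if sent > 0 and received == sent else "loss"


def run_ping(host, count=5, timeout=30.0):
    """Run ping against host (hypothetical helper) and classify the result.

    Returns 'timeout' if ping does not finish in time, which is the
    behaviour reported in this bug (one reply, then nothing).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return classify(result.stdout)
```

Calling run_ping with an external host name before and after the migration gives two results to compare; a healthy guest should report 'ok' both times.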
*** This bug has been marked as a duplicate of 453526 ***
It is fine to mark the bug as a duplicate, but bug 453526 is restricted. Is there any way
I can get access to that bug to follow what is going on with it?
I've just updated to the latest 5.2 kernel (2.6.18-92.1.13.el5xen), which was released and supposedly fixed this bug (or at least a clone of the duped bug). However, the problem still occurs.
I've also noticed something that I didn't notice before and that might be related. After migrating, the clock stops. Running top shows that processes continue to run, but the time in the upper left never advances. Running date from the command line likewise shows that the date is stalled and doesn't advance.
After some time the date finally catches up and starts advancing again. At that point the network (at least ping) begins to work as expected. This may not be a network problem at all, but may instead be related to the clock or to interrupts.
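One way to record the clock stall described above is to sample the wall clock at a fixed interval and flag readings that did not advance. This is only a sketch: the function names and the 0.5 second threshold are assumptions, and time.sleep may itself behave oddly while the guest timers are stalled.

```python
import time


def sample_wall_clock(count, interval=1.0):
    """Take `count` wall-clock readings, sleeping `interval` seconds between them."""
    readings = []
    for i in range(count):
        if i:
            time.sleep(interval)
        readings.append(time.time())
    return readings


def find_stalls(readings, min_step=0.5):
    """Return the indices whose reading advanced by less than `min_step`
    seconds over the previous reading (i.e. the clock looks stalled)."""
    return [
        i
        for i in range(1, len(readings))
        if readings[i] - readings[i - 1] < min_step
    ]
```

Running find_stalls(sample_wall_clock(60)) in the domU right after a migration should return an empty list on a healthy guest; a non-empty list would match the stalled date output reported here.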
This bug was only fixed in 2.6.18-107.el5. Can you still reproduce this if you run that version in both dom0 and domU? Thanks!
The following bug has been fixed:
This bug was copied from bug #453526, to which this bug was duplicated.
I figured that with that bug fixed, this one should be as well. I've provided additional information about what I've observed with the 92.1.13 kernel installed.
Do you have a place I can download the 107 kernel from? I'll give it a try and see if I can still duplicate the issue.
Ah, I see. The patches you need have been proposed in
Unfortunately I don't know how you can get to the kernels before they're released for errata. Perhaps Bill can chime in on that one?
There are two things going on here. The first is the network loss, which you are still experiencing. This patch was needed, but there were others needed as well; namely, the patch from BZ 458934, which is what we are tracking for 5.3.
Your second problem is another bug that we have fixed in RHEL-5, having to do with not setting up the timers properly after a save/restore or live migrate. That one is BZ 426861.
In any case, this BZ is still a duplicate of 458934, so I'm going to close it as a dup of that one. You should watch that BZ (and 5.3) for the further fixes for this issue.
*** This bug has been marked as a duplicate of bug 458934 ***