Bug 452923

Summary: Network stall after xen migration
Product: Red Hat Enterprise Linux 5
Reporter: Shad L. Lords <slords>
Component: kernel-xen
Assignee: Herbert Xu <herbert.xu>
Status: CLOSED DUPLICATE
QA Contact: Martin Jenner <mjenner>
Severity: medium
Docs Contact:
Priority: low
Version: 5.2
CC: bburns, clalance, xen-maint
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: i386
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-09-26 06:18:29 UTC
Type: ---

Description Shad L. Lords 2008-06-25 21:28:08 UTC
Description of problem:

After migrating a domU from one host to another, outbound network connections
(from the domU) stall.

Version-Release number of selected component (if applicable):

Both dom0 and domU are running 5.2 i386.

How reproducible:

90%

Steps to Reproduce:
1. Create a 5.2 PV domU on a 5.2 dom0
2. Migrate it (non-live or live) to another 5.2 dom0
3a. From the domU, ping any external host
3b. Attempt to run yum list updates
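
A minimal sketch of how the ping check in step 3a could be automated, assuming Python 3 inside the domU and the usual Linux ping reply format ("... bytes from ...: icmp_seq=N ..."). The function names (replied_sequences, missing_sequences, run_ping, report) are hypothetical and not part of any tool mentioned in this report.

```python
import re
import subprocess

# Matches only successful reply lines, e.g.
# "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.5 ms"
REPLY_RE = re.compile(r"bytes from .*?icmp_seq=(\d+)")


def replied_sequences(ping_output):
    """Return the icmp_seq numbers that received a reply, in output order."""
    return [int(m.group(1)) for m in REPLY_RE.finditer(ping_output)]


def missing_sequences(ping_output, count):
    """Return the sequence numbers in 1..count that have no reply line."""
    seen = set(replied_sequences(ping_output))
    return [seq for seq in range(1, count + 1) if seq not in seen]


def run_ping(host, count=10):
    """Run `ping -c COUNT HOST` and return its combined output as text."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def report(host, count=10):
    """Ping HOST and return the sequence numbers that went unanswered."""
    return missing_sequences(run_ping(host, count), count)
```

Calling report("192.0.2.1") (a placeholder address) after the migration returns the unanswered sequence numbers. Note that ping itself may block for a long time while the stall lasts, so the call may not return promptly.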
  
Actual results:

One ping is sent and answered, and then ping shows nothing further.  If you run
yum list updates, you get no output and yum hangs/stalls.

This was first noticed when an amanda backup failed to respond to a backup
request.  Log files indicate that the amanda client attempted to respond more
than two hours after the connection was initiated, and failed because the other
side had closed the pipe.

Expected results:

Ping/yum/amanda function as they should and connections don't stall.

Additional info:

If you do a ping -f (flood), many packets go out with 0% loss.  It also appears
that connections that already existed before the migration (for example, an
open ssh session) continue to function as expected.  Most of the time, new
outbound ssh connections also complete as expected, whereas most other
connections stall.

Comment 2 Herbert Xu 2008-07-24 03:47:47 UTC

*** This bug has been marked as a duplicate of 453526 ***

Comment 3 Shad L. Lords 2008-07-24 04:52:10 UTC
Marking this bug as a duplicate is fine, but bug 453526 is restricted.  Is there
any way I can get access to that bug to follow what is going on with it?

Comment 4 Shad L. Lords 2008-09-26 02:11:58 UTC
I've just updated to the latest released 5.2 kernel (2.6.18-92.1.13.el5xen), which supposedly fixes this bug (or at least a clone of the duped bug).  However, the problem still occurs.

I've also noticed something that I didn't notice before and that might be related.  After migrating, the time stops.  Running top shows that the processes continue to run, but the time in the upper left never advances.  Running date from the command line also shows that the date is stalled and doesn't advance.

After some time the date finally catches up and starts advancing again.  At this point the network (at least ping) begins to work as expected.  This may not be a network problem at all; it may instead be related to a clock or interrupt issue.
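
A rough sketch of how the stalled clock could be observed without watching top or date by hand, assuming Python 3 in the domU. It takes wall-clock readings separated by a busy loop instead of a sleep, on the assumption that a sleep may depend on the same timers that appear to be stuck. The function names, loop size, and threshold are illustrative choices only.

```python
import time


def sample_wall_clock(samples=5, spin=2000000):
    """Take wall-clock readings separated by a fixed amount of CPU work."""
    readings = []
    for _sample in range(samples):
        readings.append(time.time())
        # Burn CPU rather than sleep, so the gap between readings does not
        # rely on timer interrupts being delivered.
        for _tick in range(spin):
            pass
    return readings


def stalled_steps(readings, min_step=0.001):
    """Return each index i where readings[i + 1] - readings[i] < min_step.

    A non-empty result means the clock barely moved (or did not move at
    all) across at least one interval.
    """
    return [
        i
        for i in range(len(readings) - 1)
        if readings[i + 1] - readings[i] < min_step
    ]
```

On a healthy guest, stalled_steps(sample_wall_clock()) would normally be an empty list, provided the busy loop takes longer than min_step on that machine. A list covering every interval would be consistent with the frozen date output described above.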

Comment 5 Herbert Xu 2008-09-26 02:25:07 UTC
This bug was only in 2.6.18-107.el5.  Can you still reproduce this if you run that version in both dom0 and domU? Thanks!
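
Since the thread refers to two builds, 2.6.18-92.1.13.el5xen and 2.6.18-107.el5, here is a small sketch for checking which build a host or guest is running. It assumes Python 3 and release strings of the form VERSION-BUILD.elN with an optional suffix. It is a simplified comparison, not RPM's own version-comparison algorithm, and parse_kernel_release is a hypothetical helper name.

```python
import platform
import re

# VERSION-BUILD.elN[suffix], e.g. "2.6.18-92.1.13.el5xen"
RELEASE_RE = re.compile(r"^(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.el\d+")


def parse_kernel_release(release):
    """Split '2.6.18-92.1.13.el5xen' into ((2, 6, 18), (92, 1, 13))."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    version = tuple(int(part) for part in match.group(1).split("."))
    build = tuple(int(part) for part in match.group(2).split("."))
    return version, build


def running_kernel_release():
    """Return the running kernel's release string (same as `uname -r`)."""
    return platform.release()
```

With this, parse_kernel_release(running_kernel_release()) can be compared against parse_kernel_release("2.6.18-107.el5") using ordinary tuple comparison, on both the dom0 and the domU.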

Comment 6 Shad L. Lords 2008-09-26 03:12:51 UTC
According to:

https://rhn.redhat.com/errata/RHSA-2008-0885.html

The following bug has been fixed:

https://bugzilla.redhat.com/show_bug.cgi?id=458783

That bug was copied from bug #453526, which this bug was marked as a duplicate of.

I figured that with that bug fixed, this one should be as well.  I've provided additional information about what I've observed with the 92.1.13 kernel installed.

Is there a place I can download the 107 kernel from?  I'll give it a try and see if I can still reproduce the issue.

Comment 7 Herbert Xu 2008-09-26 04:11:50 UTC
Ah, I see.  The patches you need have been proposed in

https://bugzilla.redhat.com/show_bug.cgi?id=461457

Unfortunately I don't know how you can get to the kernels before they're released for errata.  Perhaps Bill can chime in on that one?

Comment 8 Chris Lalancette 2008-09-26 06:18:29 UTC
There are two things going on here.  The first is the network loss, which you are still experiencing.  This patch was needed, but there were others needed as well; namely, the patch from BZ 458934, which is what we are tracking for 5.3.

Your second problem is another bug that we have fixed in RHEL-5, having to do with not setting up the timers properly after a save/restore or live migrate.  That one is BZ 426861.

In any case, this BZ is still a duplicate of 458934, so I'm going to close it as a dup of that.  You should watch that BZ (and 5.3) for the further fixes for this issue.

Chris Lalancette

*** This bug has been marked as a duplicate of bug 458934 ***