Bug 503139

Summary: [RHEL5.4 Xen]: Save/restore between 5.3 and 5.4 host broken
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.4
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Keywords: Regression
Target Milestone: rc
Target Release: ---
Reporter: Chris Lalancette <clalance>
Assignee: Chris Lalancette <clalance>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: herbert.xu, llim, rlerch, sghosh, xen-maint
Doc Type: Bug Fix
Clones: 504623
Last Closed: 2009-06-17 07:08:54 UTC

Description Chris Lalancette 2009-05-29 08:17:55 UTC
Description of problem:
I've recently tested doing a save on a RHEL-5.3 Xen dom0 and then a restore on a RHEL-5.4 Xen dom0.  This is a common upgrade scenario for customers.  Unfortunately, it is currently broken: on restore on the 5.4 dom0, the restored guest has lost all network connectivity and shows the following in dmesg:

netfront: device eth0 has copying receive path.
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295
netfront: rx->offset: 0, size: 4294967295

On the dom0, I see messages from the HV like this:

(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.
(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.

I bisected it down, and this started happening with the -145 kernel; save/restore prior to that worked just fine.  This probably started happening because of the change from a flipping interface to a copying interface by default.

Comment 2 Chris Lalancette 2009-05-29 15:59:23 UTC
OK, just to confirm this: reverting the flipping->copying-by-default patch does in fact fix the problem.  Now to do some more debugging.

Chris Lalancette

Comment 3 Chris Lalancette 2009-06-02 15:21:27 UTC
OK, I think I see what is going on here.  Still need to confirm 100%, but...

This is unfortunately a bug in netfront.  While netback and netfront do automatic feature re-negotiation on every boot or save/restore cycle, the problem is that netfront doesn't entirely clean up after itself: it leaves a bunch of "flipping" grant references on the shared ring.  When the guest is resumed, those grant references are still there and are the first things that netback sees.  So netback grabs one of these (now invalid) grant references off the ring and tries to map it with a grant operation.  But that now fails, since these references haven't been granted through the flip interface again; they've now been granted through the copy interface.
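To make that concrete, here is roughly what netback's copying receive path does with a grant reference it pulls off the shared ring.  This is a condensed sketch against the public grant-table interface, not the actual RHEL netback code (which batches these operations):

#include <linux/errno.h>
#include <asm/hypervisor.h>              /* HYPERVISOR_grant_table_op() */
#include <xen/interface/grant_table.h>   /* struct gnttab_copy, GNTCOPY_*, GNTST_* */

/* Copy one received packet into a buffer the guest has granted to us. */
static int copy_packet_to_guest(domid_t domid, grant_ref_t gref,
                                unsigned long src_gmfn, unsigned int len)
{
        struct gnttab_copy op;

        op.source.u.gmfn = src_gmfn;     /* local dom0 page holding the packet */
        op.source.domid  = DOMID_SELF;
        op.source.offset = 0;

        op.dest.u.ref    = gref;         /* grant ref taken off the shared rx ring */
        op.dest.domid    = domid;
        op.dest.offset   = 0;

        op.len   = len;
        op.flags = GNTCOPY_dest_gref;    /* destination is a grant reference */

        if (HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1))
                return -EFAULT;

        /*
         * A reference that doesn't resolve to a valid frame -- whether it's a
         * stale leftover on the ring or one granted with a bogus frame --
         * comes back with a non-okay status, and the hypervisor logs the
         * grant_table.c "destination frame ... invalid" message seen above.
         */
        return (op.status == GNTST_okay) ? 0 : -EINVAL;
}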

What makes it doubly unfortunate is that we can't just fix netfront, since we still have legacy kernels to worry about.  Two workarounds for the dom0 come to mind:

1)  Don't do rx-copy by default.  This would obviously be very easy, but would have less testing, and might not be what Herbert wants to do.
2)  Use xenbus.  That is, in the dom0, as part of the save protocol, we write out whether this guest is using flipping or not.  On a resume, if it already was using flipping, we advertise only flipping to it again, and therefore it will continue to flip and be happy.  Newly booted guests will use the copying interface, and things will be OK.

I'm going to try to implement 2), but if that doesn't work we might have to fall back to 1).
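The backend half of 2) would look something like this.  Sketch only: "saved-rx-flip" is a made-up xenstore key, and the tools-side change that would record it at save time isn't shown:

#include <xen/xenbus.h>

/*
 * Sketch of option 2) on the netback/xenbus side.  If the save data says the
 * guest was flipping before the save, advertise only flipping again;
 * otherwise offer the copying interface as 5.4 does today.
 */
static void netback_advertise_rx_features(struct xenbus_device *dev)
{
        int was_flipping = 0;

        if (xenbus_scanf(XBT_NIL, dev->nodename,
                         "saved-rx-flip", "%d", &was_flipping) != 1)
                was_flipping = 0;  /* key absent: fresh boot, or a pre-5.4 save -- we'd have to guess */

        if (was_flipping) {
                /* Give the guest back exactly what it had before the save. */
                xenbus_printf(XBT_NIL, dev->nodename, "feature-rx-copy", "%d", 0);
        } else {
                xenbus_printf(XBT_NIL, dev->nodename, "feature-rx-copy", "%d", 1);
                xenbus_printf(XBT_NIL, dev->nodename, "feature-rx-flip", "%d", 0);
        }
}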

Chris Lalancette

Comment 4 Chris Lalancette 2009-06-05 13:48:00 UTC
OK, trying to organize my thoughts here:

From the post above, 1) is definitely an option, but it is the fallback option.  If all other options fail, then we can switch back from copying to a flipping interface by default.  I'm reluctant to do that except as a last resort, though, since our partners would be testing a different path during beta than the one we would eventually ship in the final 5.4.  On the other hand, flipping has served us quite well since RHEL 5.0, so it's not like it's a new code path.

That being said, I investigated option 2) above.  While technically possible, it would introduce some fairly hairy code into the tools and into the kernel.  The reason it turns out to be such a problem is how earlier dom0's communicated their intentions to the guests.  In a RHEL-5 or RHEL-4 PV guest, by default, netfront prefers the flipping interface; only if the flipping interface is specifically shut off does it use the copying interface.  Prior to 5.4, netback turned on "feature-rx-copy" but did not set *any* value for "feature-rx-flip", which means that the front-end preferred flipping.  Now, for 5.4, we explicitly shut off feature-rx-flip, so netfront prefers copy.

The problem comes when you try to reconcile that with a save/restore across 5.3/5.4.  When you save on 5.3, there is no information recording which interface the frontend prefers, so on restore on the 5.4 dom0 we can't tell which interface to offer back to the domain.  We could modify things so that, on save, we put the "feature-rx-*" information into the save file; save/restore on 5.4/5.4 would then always see that information and restore things to what they were.  A save/restore from 5.3 to 5.4 wouldn't have that feature information, though.  We could just assume flipping in that case, but that means we need some knowledge in the device configuration about whether we are restoring a guest or starting it for the first time.  Technically possible, but more dangerous than I would like.
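For reference, the frontend preference I'm describing boils down to roughly this.  The sketch is modelled on the upstream 2.6.18 Xen netfront; the exact function and variable names in our kernels differ:

#include <xen/xenbus.h>

/*
 * Copying is chosen only when the backend both offers it and explicitly
 * shuts flipping off.  A missing "feature-rx-flip" key counts as
 * "flipping allowed".
 */
static int netfront_wants_rx_copy(struct xenbus_device *dev)
{
        int copy = 0, flip = 1;

        if (xenbus_scanf(XBT_NIL, dev->otherend,
                         "feature-rx-copy", "%d", &copy) != 1)
                copy = 0;                /* backend doesn't offer copying */

        if (xenbus_scanf(XBT_NIL, dev->otherend,
                         "feature-rx-flip", "%d", &flip) != 1)
                flip = 1;                /* key absent: flipping stays preferred */

        return copy && !flip;
        /*
         * pre-5.4 backend: copy=1, no feature-rx-flip  -> 0 (flipping)
         * 5.4 backend:     copy=1, feature-rx-flip=0   -> 1 (copying)
         */
}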

With that in mind, I went back to try to understand the problem a bit better and to see what else could be done about it.  We start with the initial symptom: after a restore on 5.4, we can no longer transfer network traffic out of the domU.  The first clue comes from the message on the HV, namely:

(XEN) grant_table.c:1145:d0 destination frame ffffffff invalid.

Now, digging through the code, this comes from common/grant_table.c:__gnttab_copy().  The reason we see it is that the frame returned from __acquire_grant_for_copy() had a bogus mfn.  In __acquire_grant_for_copy() we see that the returned frame number comes out of the shared grant area for the domain.  In turn, the offset into the shared grant area comes from the gref, which is passed down by netback.

In the dom0 kernel, netback finds the gref it is going to use by looking at the shared ring with the domU.  Digging through netfront, the problem seems to be in the re-setup of that ring on the frontend side.  During resume, netfront does a network_connect(); at that point, it passes a slew of grant references down to the hypervisor to be used later by netback.  Unfortunately, it looks like there is a bug here: we try to assign grant references *before* actually allocating the skbs that we need, so all of the frame numbers that we pass down to the hypervisor are completely bogus.  Later on we actually do the allocation, but it's too late by that point; the damage has been done.  To be honest, I haven't proved to myself why this ever works at all; it seems like this whole thing should fail on bootup and on save/restore across the flipping and copying paths alike.
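In other words, the resume path effectively does the first of these two refill loops when it needs to be doing the second.  This is only a sketch meant to live in netfront.c's context; push_rx_request(), RX_BUF_SIZE and the exact structure layout are stand-ins rather than the real code:

/* Buggy ordering (what the restore path effectively does): grant first. */
static void refill_rx_ring_buggy(struct netfront_info *info,
                                 domid_t otherend_id, grant_ref_t *gref_head)
{
        int i, ref;

        for (i = 0; i < NET_RX_RING_SIZE; i++) {
                ref = gnttab_claim_grant_reference(gref_head);
                /*
                 * info->rx_skbs[i] has not been (re)allocated yet, so the
                 * frame number granted here is bogus.
                 */
                gnttab_grant_foreign_access_ref(ref, otherend_id,
                                virt_to_mfn(info->rx_skbs[i]->head), 0);
                push_rx_request(info, i, ref);
        }
        /* ...and the skbs are only allocated after the grants went out. */
}

/* Safe ordering: allocate the skb first, then grant its real frame. */
static void refill_rx_ring_safe(struct netfront_info *info,
                                domid_t otherend_id, grant_ref_t *gref_head)
{
        struct sk_buff *skb;
        int i, ref;

        for (i = 0; i < NET_RX_RING_SIZE; i++) {
                skb = __dev_alloc_skb(RX_BUF_SIZE, GFP_KERNEL);
                if (skb == NULL)
                        break;
                info->rx_skbs[i] = skb;
                ref = gnttab_claim_grant_reference(gref_head);
                gnttab_grant_foreign_access_ref(ref, otherend_id,
                                virt_to_mfn(skb->head), 0);
                push_rx_request(info, i, ref);
        }
}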

However, all that being said, it's looking like a netfront bug, and we can't fix that in all the older guests out there.  I think I'm going to propose going back to a flipping interface; at least we know what we are getting with that.

Chris Lalancette

Comment 5 Herbert Xu 2009-06-05 14:10:26 UTC
Yes, I agree we should revert to flipping for now, although we do need to find a solution for this at some point; otherwise we'd be stuck with flipping forever.  Thanks!

Comment 6 Herbert Xu 2009-06-05 14:14:43 UTC
A simple solution would be to add another feature-negotiation flag that differentiates the buggy netfronts from the non-buggy ones (once we fix them), so that only the non-buggy ones switch to copying.
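Something along these lines, say (the key name is entirely made up; the real flag would need to be agreed with upstream):

#include <xen/xenbus.h>

/*
 * A fixed netfront would write a new flag (the name "feature-rx-copy-fixed"
 * is hypothetical) into its xenstore directory, and netback would only put
 * such a frontend on the copying path; everything else stays on flipping.
 */
static int frontend_safe_for_rx_copy(struct xenbus_device *dev)
{
        int fixed = 0;

        if (xenbus_scanf(XBT_NIL, dev->otherend,
                         "feature-rx-copy-fixed", "%d", &fixed) != 1)
                fixed = 0;               /* old or unfixed netfront */

        return fixed;
}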

Comment 7 Chris Lalancette 2009-06-05 15:24:19 UTC
OK, thanks Herbert.  For now, I'll put a patch together that changes us back to a flipping interface.  Then I'll open up another bug to track the flipping->copying problem in netfront.

Chris Lalancette

Comment 8 Chris Lalancette 2009-06-17 07:08:54 UTC
The patch to revert to a flipping interface was committed to 5.4 as part of 479754.  Also, I've opened up bz 504623 to track fixing the netfront side of it.  Therefore, we don't need this BZ open any more.  I'll close it as a dup of 479754.

Chris Lalancette

*** This bug has been marked as a duplicate of bug 479754 ***