Bug 513335 - PV guest with VFB device stops running when xm save fails
Summary: PV guest with VFB device stops running when xm save fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Michal Novotny
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 578452
Depends On: 486157
Blocks: 494811 514499
 
Reported: 2009-07-23 06:02 UTC by Yufang Zhang
Modified: 2014-02-02 22:37 UTC (History)
CC List: 10 users

Fixed In Version: xen-3.0.3-110.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 22:17:52 UTC
Target Upstream Version:
Embargoed:


Attachments
xend.log (8.91 KB, text/plain)
2009-07-23 06:03 UTC, Yufang Zhang
XenD save on device with insufficient space fix (4.95 KB, patch)
2009-08-18 13:50 UTC, Michal Novotny
Make XenD restore domain on failed save (7.33 KB, patch)
2009-08-19 15:16 UTC, Michal Novotny
PV guest restore on failed save (8.02 KB, patch)
2009-08-20 09:56 UTC, Michal Novotny
Fix PVFB devices removal (9.99 KB, patch)
2010-01-21 14:16 UTC, Michal Novotny
Resume PV guest with PVFB devices when save fails (776 bytes, patch)
2010-05-05 12:53 UTC, Michal Novotny


Links
Red Hat Product Errata RHBA-2011:0031 (normal, SHIPPED_LIVE): xen bug fix and enhancement update - Last Updated: 2011-01-12 15:59:24 UTC

Description Yufang Zhang 2009-07-23 06:02:08 UTC
Description of problem:
If a PV guest has a VFB device, it stops running when xm save fails.

Version-Release number of selected component (if applicable):
xen-3.0.3-91.el5

How reproducible:
always

Steps to Reproduce:
(1) Start a paravirtualized guest with 512MB of memory
  (with the line vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] in the config file)
(2) Mount a 100MB disk partition on /mnt
(3) Run
     # xm save <guest> /mnt/<guest>.save
    The save will then fail with:
     Error: /usr/lib/xen/bin/xc_save 22 5 0 0 0 failed
     Usage: xm save <Domain> <CheckpointFile>

     Save a domain state to restore later.


Actual results:
     Run:
       # xm list
     which shows:
       domain1 5 511 1 ---s-- 11.3
The guest remains shut down and cannot run again. This is quite like the case in https://bugzilla.redhat.com/show_bug.cgi?id=486157.
If the PV guest does not have a VFB device, it works just fine when xm save fails.

Expected results:
The guest should remain running, as if no xm save command had ever been issued.


Additional info:
xend.log uploaded.

Comment 1 Yufang Zhang 2009-07-23 06:03:04 UTC
Created attachment 354815 [details]
xend.log

Comment 2 Paolo Bonzini 2009-07-23 11:52:30 UTC
From comment #16 of bug 486157: "32-bit PV guest on x86_64 host still remains shutdown when 'xm save' failed [...] A 32-bit PV on 32-bit host works just fine when xm save failed,as if no xm save command was ever issued."

Can you please confirm that this bug only occurs for 32-on-64?

Comment 3 Jiri Denemark 2009-07-23 12:04:06 UTC
Nope, you mixed up two bugs. This one is a general one... The one you're mentioning in comment #2 is a different bug, which has not been filed yet.

Comment 4 Paolo Bonzini 2009-07-23 12:10:27 UTC
I am confused because the description in this bug matches comment #16 of bug 486157 almost word for word. Anyway, I can indeed reproduce this one for 64-on-64 too.

Comment 5 Yufang Zhang 2009-07-23 14:13:59 UTC
(In reply to comment #4)
> I am confused because the description in this bug matches comment #16 of bug
> 486157 almost word-by-word.  Anyway I can reproduce this one for 64-on-64 too,
> indeed.  

This bug (513335):
It occurs in both the 32-on-64 and 64-on-64 cases, as long as the PV guest has a VFB device.

Comment #16 of bug 486157:
The PV guest does not have a VFB device. A 32-bit PV guest on a 32-bit host works just fine when xm save fails, as if no xm save command was ever issued. But a 32-bit PV guest on an x86_64 host still remains shut down when 'xm save' fails. This is a new issue, different from bug 513335. I will report it as a separate bug in BZ soon.

Comment 6 Michal Novotny 2009-08-18 13:50:42 UTC
Created attachment 357801 [details]
XenD save on device with insufficient space fix

Hi,
this is the patch for saving a domain to a device with insufficient disk space. This caused errors like those described in comment #0 for both PV and HVM domains on x86_64 (my workstation environment). I tried to make it resume the domain, but that was not working at all, so this patch restarts the domain after the save fails. I know this is not ideal in a production environment, but it is better than having the domain shut down/destroyed with no automatic restart. The error occurred for both PV and HVM domains, but the approach is not entirely the same - there are a few differences, mainly the need to remove 'image' from dominfo for PV guests, whereas it has to be set (not removed) for HVM domains. Testing was done by saving both HVM and PV domains to a location with sufficient space (followed by a restore) and to one with insufficient space. In all tests it worked fine.

Michal
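
A minimal sketch of the restart-on-failure idea described above; save_fn and restart_fn are placeholders standing in for xend's real save and restart entry points, not the code in the attached patch:

    # Sketch only: restart the guest when the save fails, rather than
    # leaving it shut down.  save_fn/restart_fn are placeholder callables.
    def save_or_restart(save_fn, restart_fn, domain, checkpoint_file):
        try:
            save_fn(domain, checkpoint_file)
        except Exception:
            # The guest is already torn down at this point, so bring it
            # back by restarting it, then re-raise so xm still reports
            # the failure to the user.
            restart_fn(domain)
            raise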

Comment 9 Michal Novotny 2009-08-19 15:16:19 UTC
Created attachment 357951 [details]
Make XenD restore domain on failed save

Well, the previous patch was not good because it restarted the guests. This is a better version of the patch: it does not restart the domain at all, it only restores/resumes it. It has been tested with a 64-bit domU on a 64-bit dom0 and it worked fine.

Comment 10 Paolo Bonzini 2009-08-19 15:25:26 UTC
> +                    if self.getName().startswith("migrating-"):
> +                        tmp = self.getName().split('-')
> +                        del tmp[0]
> +                        newName = ""
> +                        for i in tmp:
> +                            if len(newName) > 0:
> +                                newName = newName + '-' + i
> +                            else:
> +                                newName = i
> +
> +                        self.setName( newName )

Rather:

+                    if self.getName().startswith("migrating-"):
+                        tmp = self.getName()
+                        self.setName(tmp[10:])
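
For reference, a small standalone check that the slice form is equivalent to the loop above; the 10 is len("migrating-"), and the guest name used here is just a hypothetical example:

    # Both variants strip the "migrating-" prefix while preserving any
    # further dashes in the name.
    PREFIX = "migrating-"

    def strip_prefix_loop(name):
        parts = name.split('-')
        del parts[0]
        return '-'.join(parts)

    def strip_prefix_slice(name):
        return name[len(PREFIX):]   # len("migrating-") == 10

    name = "migrating-rhel5-pv-guest"
    assert strip_prefix_loop(name) == strip_prefix_slice(name) == "rhel5-pv-guest"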

Comment 11 Michal Novotny 2009-08-20 09:56:39 UTC
Created attachment 358060 [details]
PV guest restore on failed save

Well, this is the new version of the patch with some nit picks fixed. Since I found out there is already a BZ about the 'migrating-' prefix remaining in the domain name after a failed save (BZ #494811), after applying this fix you also have to apply the fix for BZ #494811. It has been tested on an x86_64 dom0 with an x86_64 PV guest and it worked fine...

Comment 15 Michal Novotny 2010-01-21 14:16:12 UTC
Created attachment 385926 [details]
Fix PVFB devices removal

I made a new patch for stale PVFB device cleanup and I *think* I have found the root cause, which this patch fixes. The zombieDeviceCleanup() method is also preserved, with a check for stale backends; it is run if any are found.

The root cause is that when devices are released via _releaseDevices(), the backend devices are not released there; they *may* be released when the frontends are released, but that does not work for PVFB devices... So new code has been added to the _releaseDevices() method.

The patch has been tested on an x86_64 dom0 with RHEL-5 PV guests (two guests running concurrently, 32-bit and 64-bit), and after domain destroy/shutdown there was no evidence of stale backend devices, so please review this version...

Thanks, 
Michal
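
A rough sketch of the idea, assuming helper callables for listing and removing the per-domain backend xenstore entries (illustrative names, not the actual XendDomainInfo code):

    # Illustrative only: release frontends as before, then explicitly
    # remove the domain's backend entries, since PVFB backends are not
    # torn down via the frontend path.  list_backends/remove_path stand
    # in for xend's xenstore helpers.
    def release_devices(domid, frontends, destroy_frontend,
                        list_backends, remove_path):
        for dev in frontends:
            destroy_frontend(dev)
        # Additional step added to _releaseDevices(): walk the backend
        # directories (e.g. "backend/vfb/<domid>/0") and remove them.
        for backend_path in list_backends(domid):
            remove_path(backend_path)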

Comment 18 Miroslav Rezanina 2010-02-24 11:04:26 UTC
As the patch was rejected, the exception is canceled and the BZ is moved to 5.6.

Comment 21 Michal Novotny 2010-04-01 11:57:51 UTC
*** Bug 578452 has been marked as a duplicate of this bug. ***

Comment 35 Michal Novotny 2010-04-27 13:13:52 UTC
(In reply to comment #34)
> (In reply to comment #33)
> > First, why are all the comments in this bug private?
> > Second, I'm still not really sure what's needed from me at this point, but I'll
> > give a response a try.
> > 

Well, I've been looking into this one again, and there is no rebinding code for virq_port. It is bound in the domain_init() function, which is called in xenstored_core.c on xenstore daemon startup, which means that domain_init() is basically the initialization for dom0. There is no port rebinding or anything similar. The event channel is not giving the virq_port value the way it did before the resume. The xenstore daemon is not restarted when XenD is, so something strange is being set up here: what XenD does on the domain restart is initialize all domains, including dom0. Since the reinitialization of dom0 does seem to happen, isn't it possible that the event channel remembers some flag (something like virq_port_used) that prevents the current port (basically the data read from the event channel via read() on the /dev/xen/evtchn device) from being set to the value of virq_port?

I'd study it myself, but unfortunately I don't know how the event channel (or anything in the kernel) makes data accessible through a character device, i.e. which function it uses to provide the data that user space reads via read() on /dev/xen/evtchn.

Michal
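
For what it's worth, from user space the evtchn device hands back the ports of pending events as unsigned 32-bit integers on read(). A minimal Python sketch of that read path (binding a port through the device's ioctls is omitted here, so as written this read would simply block):

    # Minimal sketch: read() on /dev/xen/evtchn yields pending
    # event-channel ports, one unsigned 32-bit integer per pending event.
    import struct

    with open("/dev/xen/evtchn", "rb", buffering=0) as evtchn:
        data = evtchn.read(4)                 # one pending port per 4 bytes
        (port,) = struct.unpack("I", data)
        print("pending event-channel port:", port)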

Comment 36 Michal Novotny 2010-05-05 07:44:50 UTC
(In reply to comment #35)
> (In reply to comment #34)
> > (In reply to comment #33)
> > > First, why are all the comments in this bug private?
> > > Second, I'm still not really sure what's needed from me at this point, but I'll
> > > give a response a try.
> > > 
> 
> Well, I've been looking to this one again and there's no rebinding code for
> virq_port. This is being bound in domain_init() function which is being called
> in xenstored_core.c on xenstore daemon startup which means that domain_init()
> is basically initialization for dom0. There is no port rebinding or anything
> similar. Event channel is not giving the virq_port value as it does before the
> resume was done. Xenstore daemon is not being restarted when XenD is so this is
> setting something strange but what XenD does on the domain restart is the
> initialization of all domains, including dom0. Since the reinitialization of
> dom0 seems to be done isn't it possible that event channel remembers some flag
> (meaning something like virq_port_used or anything similar) to prevent the
> current port (which are basically data read from the event channel using read
> on /dev/xen/evtchn device) to be set to the value of virq_port ?
> 
> I'd study it myself but unfortunately I don't know how does the event channel
> (or anything in kernel) provide data to be accessible as from a character
> device, therefore what function does it use to provide data that are readable
> by user-space read() operation on /dev/xen/evtchn device.
> 
> Michal    

This could be worked around by not showing domains that are in the dying state, but I don't think that is the right way to go, which is why I had not considered it until now. The right way is to fix the root cause: make libxc not read this dying domain id, and provide proper cleanup connected to/triggered by the port coming from the event channel, equivalent to the value bound to virq_port on domain0 initialization. The workaround of simply ignoring dying domains in the libxc code seems really ugly to me.

Michal

Comment 37 Michal Novotny 2010-05-05 12:53:19 UTC
Created attachment 411584 [details]
Resume PV guest with PVFB devices when save fails

Hi,
this is the patch to resume a PV guest with PVFB devices attached when xm save fails. It has been tested on an x86_64 dom0 with a 64-bit RHEL-5 PV guest, both with and without PVFB devices attached. Testing with an HVM guest was done as well. It worked fine in all cases except for one issue revealed once this patch is applied: when you try to shut down the PV guest, it stays in the dying state, but the hypervisor still reports this domain directly to libxc. This could be worked around in libxc/xend by ignoring guests that are in the dying state, but that workaround seems very ugly to me since it does not address the root cause. The investigation shows that one value expected from the event channel never arrives, and therefore the domain cannot be shut down. Since this appears to be an event channel issue, which belongs to the hypervisor/kernel-xen component, I have filed a new bug in the kernel-xen area, bug 589123, to track it.

Michal
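
A rough sketch of one plausible shape for the fix described here, with placeholder callables rather than xend's real API (the assumption that the PVFB devices must be re-created before the resume is an assumption of this sketch, not a statement about the attached 776-byte patch):

    # Illustrative only: pvfb_devices, recreate_device and resume are
    # placeholders.  On a failed save, bring the PVFB devices back and
    # then resume the domain so the guest does not stay shut down.
    def resume_after_failed_save(domain, pvfb_devices, recreate_device, resume):
        for devclass, config in pvfb_devices:   # e.g. ("vfb", {...}), ("vkbd", {...})
            recreate_device(domain, devclass, config)
        resume(domain)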

Comment 42 Linqing Lu 2010-08-25 08:29:58 UTC
This bug has been verified with xen-3.0.3-115.el5.
The PV guest runs well after "xm save" fails.

But the guest turns into a zombie when trying to "xm shutdown" it, the same situation as in Bug 589123.

Comment 44 errata-xmlrpc 2011-01-13 22:17:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html

