Description of problem:
If a PV guest has a VFB device, it stops running when xm save fails.

Version-Release number of selected component (if applicable):
xen-3.0.3-91.el5

How reproducible:
always

Steps to Reproduce:
(1) Start a paravirtualized guest with 512 MB of memory, with this line in its config file:
    vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]
(2) Mount a 100 MB disk partition on /mnt.
(3) Run:
    # xm save <guest> /mnt/<guest>.save
The save then fails with:
    Error: /usr/lib/xen/bin/xc_save 22 5 0 0 0 failed
    Usage: xm save <Domain> <CheckpointFile>
    Save a domain state to restore later.

Actual results:
Running:
    # xm list
shows:
    domain1    5    511    1    ---s--    11.3
The guest remains shut down and cannot run again. This is much like the case in https://bugzilla.redhat.com/show_bug.cgi?id=486157. If the PV guest doesn't have a VFB device, it works fine when xm save fails.

Expected results:
The guest should remain running as if no xm save command had ever been issued.

Additional info:
xend.log uploaded.
Created attachment 354815 [details] xend.log
From comment #16 of bug 486157: "32-bit PV guest on x86_64 host still remains shutdown when 'xm save' failed [...] A 32-bit PV on 32-bit host works just fine when xm save failed,as if no xm save command was ever issued." Can you please confirm that this bug only occurs for 32-on-64?
Nope, you've mixed up two bugs. This one is a general one... The one you're mentioning in comment #2 is a different bug, which has not been filed yet.
I am confused because the description in this bug matches comment #16 of bug 486157 almost word-for-word. Anyway, I can indeed reproduce this one on 64-on-64 too.
(In reply to comment #4)
> I am confused because the description in this bug matches comment #16 of bug
> 486157 almost word-for-word. Anyway, I can indeed reproduce this one on
> 64-on-64 too.

This bug (513335): it occurs in both the 32-on-64 and the 64-on-64 case, as long as the PV guest has a VFB device.

Comment #16 of bug 486157: the PV guest doesn't have a VFB device. A 32-bit PV guest on a 32-bit host works just fine when xm save fails, as if no xm save command had ever been issued, but a 32-bit PV guest on an x86_64 host still remains shut down when 'xm save' fails. That is a new issue, different from bug 513335. I will report it as a bug in BZ soon.
Created attachment 357801 [details]
XenD save on device with insufficient space fix

Hi, this is the patch for saving a domain to a device with insufficient disk space. This caused errors like those described in comment #0 for both PV and HVM domains on x86_64 (my workstation environment). I tried to make it resume the domain, but that was not working at all, so this patch restarts the domain after the save fails. I know this is not ideal in a production environment, but it's better than leaving the domain shut down/destroyed with no automatic restart.

The error occurred for both PV and HVM domains, but the approach is not entirely the same: there are a few differences, mainly the need to remove 'image' from dominfo for PV guests, whereas it has to be set (not removed) for HVM domains.

Testing was done by saving both HVM and PV domains to a location with sufficient space (with a subsequent restore) and to one with insufficient space. It worked fine in all tests.

Michal
Created attachment 357951 [details]
Make XenD restore domain on failed save

Well, the previous patch was not good because it restarted the guests. This is a better version: it doesn't restart the domain at all, it only restores/resumes it. It's been tested with a 64-bit domU on a 64-bit dom0 and was working fine.
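The control flow of such a fix can be sketched in plain Python. Note this is purely illustrative: the FakeDom class and save_domain() helper below are stand-ins for xend's dominfo object and its save path, not the actual xend API.

```python
class FakeDom:
    """Illustrative stand-in for xend's dominfo object."""
    def __init__(self, name):
        self.name = name
        self.resumed = False

    def getName(self):
        return self.name

    def setName(self, name):
        self.name = name

    def resume(self):
        # In xend this would unpause the suspended domain.
        self.resumed = True


def save_domain(dom, checkpoint_file, do_save):
    """Run do_save(); on failure, undo the rename and resume the guest."""
    try:
        do_save(dom, checkpoint_file)  # may raise, e.g. on ENOSPC
    except Exception:
        # Instead of leaving the guest shut down, strip the temporary
        # 'migrating-' prefix and resume the still-suspended domain.
        name = dom.getName()
        if name.startswith("migrating-"):
            dom.setName(name[len("migrating-"):])
        dom.resume()
        raise
```

Under this sketch, a save that fails on a full /mnt leaves the guest renamed back and resumed, while the original error still propagates to the caller.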
> + if self.getName().startswith("migrating-"): > + tmp = self.getName().split('-') > + del tmp[0] > + newName = "" > + for i in tmp: > + if len(newName) > 0: > + newName = newName + '-' + i > + else: > + newName = i > + > + self.setName( newName ) Rather: + if self.getName().startswith("migrating-"): + tmp = self.getName() + self.setName(tmp[10:])
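For reference, both variants behave the same on a 'migrating-'-prefixed name. A self-contained comparison (plain strings instead of the dominfo getName()/setName() calls) might look like the sketch below; note that the suggested tmp[10:] relies on len("migrating-") being exactly 10, and using len() of the prefix avoids that magic number.

```python
PREFIX = "migrating-"

def strip_prefix_loop(name):
    # Original patch: split on '-', drop the first piece, rejoin.
    parts = name.split('-')
    return '-'.join(parts[1:])

def strip_prefix_slice(name):
    # Suggested simplification: drop the fixed-length prefix directly.
    if name.startswith(PREFIX):
        return name[len(PREFIX):]
    return name

print(strip_prefix_slice("migrating-rhel5-pv"))  # rhel5-pv
```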
Created attachment 358060 [details]
PV guest restore on failed save

Well, this is the new version of this patch with some nitpicks fixed. I found out there is already a BZ about the 'migrating-' prefix remaining in the domain name after a failed save (BZ #494811), so after applying this fix you have to apply the fix for BZ #494811 as well. It's been tested on an x86_64 dom0 with an x86_64 PV guest and was working fine...
Created attachment 385926 [details]
Fix PVFB devices removal

I made a new patch for stale PVFB device cleanup, and I *think* I may have found the root cause, which this fixes. The zombieDeviceCleanup() method is also preserved: it checks for stale backends and runs if it finds some.

The root cause is that when devices are released via _releaseDevices(), the backend devices are not released there; they *may* be released when the frontends are released, but that does not work for PVFB devices... So new code has been added to the _releaseDevices() method.

The patch has been tested on an x86_64 dom0 with RHEL-5 PV guests (2 guests running concurrently, 32-bit and 64-bit), and after domain destroy/shutdown there was no evidence of stale backend devices, so please review this version...

Thanks,
Michal
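The shape of the fix, explicitly removing backend entries in addition to the frontend teardown, can be sketched roughly as follows. This is an assumption-laden illustration: the store dictionary and the path layout stand in for xenstore, and release_devices() is not the real _releaseDevices() code.

```python
def release_devices(store, domid):
    """Illustrative sketch; 'store' maps xenstore-like paths to entries.

    Frontend removal alone leaves PVFB (vfb/vkbd) backend entries
    behind, so backend paths are removed explicitly as well.
    """
    fe_prefix = "/local/domain/%d/device/" % domid
    be_prefix = "/local/domain/0/backend/"
    # Release frontend device entries (the original behaviour).
    for path in [p for p in store if p.startswith(fe_prefix)]:
        del store[path]
    # Additionally release backend entries for this domain; without
    # this step, PVFB backends are left behind as stale/zombie devices.
    for path in [p for p in store
                 if p.startswith(be_prefix) and ("/%d/" % domid) in p]:
        del store[path]
```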
As the patch was rejected, the exception is canceled and the BZ is moved to 5.6.
*** Bug 578452 has been marked as a duplicate of this bug. ***
(In reply to comment #34)
> (In reply to comment #33)
> > First, why are all the comments in this bug private?
> > Second, I'm still not really sure what's needed from me at this point, but I'll
> > give a response a try.

Well, I've been looking into this one again, and there is no rebinding code for virq_port. It is bound in the domain_init() function, which is called in xenstored_core.c on xenstore daemon startup, which means that domain_init() is basically the initialization for dom0. There is no port rebinding or anything similar. The event channel no longer delivers the virq_port value as it did before the resume. The xenstore daemon is not restarted when XenD is, so something strange is going on here: what XenD does on the domain restart is reinitialize all domains, including dom0. Since the reinitialization of dom0 seems to be done, isn't it possible that the event channel remembers some flag (something like virq_port_used or similar) that prevents the current port (which is basically data read from the event channel via read() on the /dev/xen/evtchn device) from being set to the value of virq_port?

I'd study it myself, but unfortunately I don't know how the event channel (or anything in the kernel) exposes its data through a character device, i.e. what function it uses to provide the data readable by a user-space read() on /dev/xen/evtchn.

Michal
(In reply to comment #35)
> [...]

This could be worked around by not showing domains that are in the dying state, but I guess that is not the right way to go, so I wasn't considering it until now. The right way is to fix the root cause: make libxc not read this dying domain id and provide proper cleanup, connected to/triggered by the port coming from the event channel, equivalent to the value bound to virq_port on domain0 initialization.
The workaround of just ignoring dying domains in the libxc code seems very ugly to me.

Michal
Created attachment 411584 [details]
Resume PV guest with PVFB devices when save fails

Hi, this is the patch to resume a PV guest with PVFB devices attached when xm save fails. It's been tested on an x86_64 dom0 with a 64-bit RHEL-5 PV guest, both with and without PVFB devices attached. Testing with an HVM guest was also done. It was working fine in all cases except for one issue revealed after this patch is applied: when you try to shut down the PV guest, it ends up in the dying state, but the HV is passing information about this domain directly to libxc. This could be worked around in libxc/xend by ignoring guests that are in the dying state, but that workaround seems very ugly to me, since it doesn't address the root cause. The investigation shows that a value expected from the event channel never arrives, and therefore the domain cannot be shut down. Since this appears to be an event channel issue, which belongs to the hypervisor/kernel-xen component, I've filed a new bug in the kernel-xen area, bug 589123, to reference this.

Michal
This bug has been verified in xen-3.0.3-115.el5. The PV guest runs well after "xm save" fails, but the guest turns into a zombie when trying to "xm shutdown" it, the same situation as in bug 589123.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html