Bug 730750 - libvirt error in restoring domain with corrupt managedsave image
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.1
Hardware: x86_64 Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Eric Blake
QA Contact: Virtualization Bugs
Depends On:
Blocks: 638510
Reported: 2011-08-15 12:05 EDT by Grant Williamson
Modified: 2011-12-06 06:26 EST

See Also:
Fixed In Version: libvirt-0.9.4-8.el6
Doc Type: Bug Fix
Doc Text:
Cause
    Libvirt would attempt to load a managed save file in preference to starting a domain from scratch, even if the managed save file was damaged and could not be loaded.
Consequence
    Users were complaining about the inability to start domains, not realizing that the domain had a corrupt managed save image that was being retried in a loop, and without realizing an obscure 'virsh managedsave-remove' could resolve the problem.
Fix
    Libvirt introduced 'virsh start --force-boot', as well as some improved logic in ensuring that a managed save file would not be tried if it was corrupt, to make it less likely that a corrupted managed save file can interfere with guest startup.
Result
    Use of managed save images is less likely to cause confusion due to a corrupted image.
Last Closed: 2011-12-06 06:26:41 EST
Description Grant Williamson 2011-08-15 12:05:36 EDT
Description of problem:
If a managed save image cannot be restored, the user is presented with the following error message:
"Error restoring domain: cannot send monitor command '{"execute":"qmp_capabilities"}': Connection reset by peer"

Version-Release number of selected component (if applicable):
libvirt 0.8.7-18

How reproducible:
- Power on a Windows XP guest using virt-manager.

- Start saving the guest from Virtual Machine Manager (Shutdown, then Save).

- Before the save file is complete, make a copy of it. Then cancel the save process.
  i.e.
  cp /var/lib/libvirt/qemu/save/winxp.raw /root/winxp.raw
  This simulates a corrupt image.

- Shut down the Windows XP guest.

- Copy the incomplete file back
  i.e.
  cp /root/winxp.raw /var/lib/libvirt/qemu/save/winxp.raw 

- Now power on the Windows XP guest. It quits with the error message shown above, and the machine will not boot successfully until the corrupt file is removed.
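
The manual steps above amount to placing a partially written save file in the managed save location. A minimal Python sketch for producing such a truncated copy in a test setup follows. The helper name `make_truncated_copy` and the default 50% cut-off are illustrative assumptions, not part of the original report, and a truncated copy only approximates a save that was interrupted midway.

```python
import os
import shutil


def make_truncated_copy(src, dst, keep_fraction=0.5):
    """Copy src to dst, then cut dst down to keep_fraction of its size.

    Hypothetical helper: it mimics copying a save file before the save
    has finished, as in the manual reproduction steps.
    Returns the size of dst in bytes after truncation.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    shutil.copyfile(src, dst)
    new_size = int(os.path.getsize(dst) * keep_fraction)
    os.truncate(dst, new_size)
    return new_size
```

The source file is left untouched; only the copy is shortened, so the helper can be pointed at any complete save file without damaging it.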
  

Expected results:
libvirt or virt-manager should detect that the save file is corrupt, and then either continue to boot or ask the user whether to remove the file before continuing to boot.

Additional info:
Comment 2 Grant Williamson 2011-08-16 03:14:56 EDT
So I found this thread.
http://www.redhat.com/archives/libvir-list/2011-April/msg00385.html

Red Hat's view on this: if the restore fails and the saved state is then removed, data loss may occur. I agree.

However, desktop KVM users get confused by cryptic error messages. Would it be possible for virt-manager to handle this in some fashion, for example by prompting the user on failure to either remove the saved state or retry the restore?
Comment 3 Osier Yang 2011-08-17 08:56:39 EDT
I'm not sure whether we can add a field such as "complete" to the header of the save image,
so that the save image can be checked when restoring/starting. But this is the only way I can think of so far.
Comment 4 Satya Komaragiri 2011-08-30 04:04:03 EDT
Invalid (or missing) info:
     * Version field: '['6.1']'
     * Platform field (Architecture): 'Unspecified'
Please set valid values for above.
Once values are set,  please change status back to 'NEW'.
Regards,
Comment 5 Eric Blake 2011-08-30 12:05:03 EDT
(In reply to comment #3)
> I'm not sure whether we can add a field such as "complete" to the header of
> the save image, so that the save image can be checked when
> restoring/starting. But this is the only way I can think of so far.

Upstream has tackled this problem on two fronts:

1. Yes, we can, and we should, modify the save image header to mark incomplete images.  For backward compatibility, the best way to do this is by changing the magic number: a file with an unknown or missing value is treated as unrecognized and refused; a special value marks the file as incomplete (managed save then warns about the incomplete image and proceeds to boot normally); and the existing magic number is written only on completion, marking the file as safe to use.
https://www.redhat.com/archives/libvir-list/2011-August/msg00854.html
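
The three-way magic number check described in approach 1 can be sketched in Python as follows. This is only an illustration of the idea, not libvirt's implementation: the magic values `EXAMPLE-COMPLETE` and `EXAMPLE-PARTIAL!` are placeholders (the real values live in the libvirt source and are not given in this report), and `classify_save_header` and `write_save_file` are hypothetical names.

```python
import enum

# Placeholder magic values for illustration only; the real header magic
# strings are defined in the libvirt source, not in this bug report.
MAGIC_COMPLETE = b"EXAMPLE-COMPLETE"
MAGIC_PARTIAL = b"EXAMPLE-PARTIAL!"
MAGIC_LEN = len(MAGIC_COMPLETE)
assert len(MAGIC_PARTIAL) == MAGIC_LEN


class SaveState(enum.Enum):
    COMPLETE = "complete"      # safe to restore from
    INCOMPLETE = "incomplete"  # warn, skip the image, boot normally
    UNKNOWN = "unknown"        # unrecognized file, refuse to use it


def classify_save_header(header: bytes) -> SaveState:
    """Classify a save file by the magic bytes at the start of its header."""
    magic = header[:MAGIC_LEN]
    if magic == MAGIC_COMPLETE:
        return SaveState.COMPLETE
    if magic == MAGIC_PARTIAL:
        return SaveState.INCOMPLETE
    return SaveState.UNKNOWN


def write_save_file(path, payload: bytes) -> None:
    """Write the partial magic first; stamp the final magic only at the end."""
    with open(path, "wb") as f:
        f.write(MAGIC_PARTIAL)
        f.write(payload)
        f.flush()
        # Only after the payload is written is the header marked complete.
        f.seek(0)
        f.write(MAGIC_COMPLETE)
```

If the writer is interrupted before the final step, the file still begins with the partial marker, so a later start can recognize it as incomplete instead of attempting a restore.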

2. Expose the capability of deleting (failed) managed save images more prominently.  Done with this upstream commit:
commit 27c85260532f879be5674a4eed0811c21fd34f94
Author: Eric Blake <eblake@redhat.com>
Date:   Sat Aug 27 17:07:18 2011 -0600

    start: allow discarding managed save
    
    There have been several instances of people having problems with
    a broken managed save file, and not aware that they could use
    'virsh managedsave-remove dom' to fix things.  Making it possible
    to do this as part of starting a domain makes the same functionality
    easier to find, and one less API call.
    
    * include/libvirt/libvirt.h.in (VIR_DOMAIN_START_FORCE_BOOT): New
    flag.
    * src/libvirt.c (virDomainCreateWithFlags): Document it.
    * src/qemu/qemu_driver.c (qemuDomainObjStart): Alter signature.
    (qemuAutostartDomain, qemuDomainStartWithFlags): Update callers.
    * tools/virsh.c (cmdStart): Expose it in virsh.
    * tools/virsh.pod (start): Document it.

as well as this followup to make the virsh capability work even with older servers:
https://www.redhat.com/archives/libvir-list/2011-August/msg01440.html
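
For scripted use, the two ways of discarding a managed save image mentioned in the commit message can be sketched as below. The `virsh start --force-boot` and `virsh managedsave-remove` commands come from this report; the Python wrappers `start_commands` and `run_all` and the `legacy` switch are hypothetical names used only for illustration.

```python
import subprocess


def start_commands(domain, discard_managed_save=False, legacy=False):
    """Return the virsh argv lists needed to start `domain`.

    legacy=True models the two-step route named in the commit message
    (managedsave-remove, then a plain start) for a virsh that lacks
    the --force-boot option.
    """
    if not discard_managed_save:
        return [["virsh", "start", domain]]
    if legacy:
        return [["virsh", "managedsave-remove", domain],
                ["virsh", "start", domain]]
    return [["virsh", "start", "--force-boot", domain]]


def run_all(commands):
    """Run each command in order, raising on the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, `run_all(start_commands("winxp", discard_managed_save=True))` would invoke `virsh start --force-boot winxp`. Building the argument lists separately from running them keeps the command selection easy to inspect without a libvirt host.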

I think both approaches need to be backported into RHEL before we can call this issue complete (which implies that approach 1 still needs to be coded and accepted upstream, and that patch 2/1 of approach 2 still needs ack upstream).
Comment 6 Eric Blake 2011-08-30 17:09:54 EDT
approach 1 also posted upstream:
https://www.redhat.com/archives/libvir-list/2011-August/msg01458.html
https://www.redhat.com/archives/libvir-list/2011-August/msg01459.html

Additionally, at least one of my pending snapshot patches wants to use the refactored qemuOpenFile() method from msg01458, so I'm marking this as a prereq to bug 638510 support for live snapshots via the snapshot_blkdev qemu monitor command.
Comment 9 dyuan 2011-09-07 03:32:45 EDT
Reproduced this bug on libvirt-0.9.4-7.el6: the domain fails to start with the incomplete save file. Verified PASS with libvirt-0.9.4-9.el6: the domain boots normally and the incomplete save file is removed.
Comment 10 dyuan 2011-09-07 03:41:15 EDT
(In reply to comment #9)
> Reproduced this bug on libvirt-0.9.4-7.el6: the domain fails to start with
> the incomplete save file. Verified PASS with libvirt-0.9.4-9.el6: the domain
> boots normally and the incomplete save file is removed.

libvirtd.log also shows the following:
15:30:36.635: 10074: warning : qemuDomainObjStart:4857 : Ignoring incomplete managed state /var/lib/libvirt/qemu/save/rhel6.save
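
A small Python sketch for spotting this warning when scanning libvirtd.log follows. The pattern is derived only from the single log line quoted above, so the exact wording may differ in other libvirt versions; `find_ignored_save_files` is a hypothetical helper name.

```python
import re

# Pattern based solely on the one log line quoted in this comment.
IGNORED_SAVE_RE = re.compile(
    r"warning : qemuDomainObjStart:\d+ : "
    r"Ignoring incomplete managed state (?P<path>\S+)"
)


def find_ignored_save_files(log_lines):
    """Return the save file paths that libvirtd reported as ignored."""
    paths = []
    for line in log_lines:
        match = IGNORED_SAVE_RE.search(line)
        if match:
            paths.append(match.group("path"))
    return paths
```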
Comment 11 Daniel Veillard 2011-09-07 05:24:41 EDT
yep, that's normal :-)

Daniel
Comment 12 John Walicki 2011-09-07 10:07:39 EDT
Many thanks for the patch. 
Will this fix be included in RHEL 6.2?
Comment 13 Daniel Veillard 2011-10-13 10:28:17 EDT
c.f. comment #12, yes definitely,

Daniel
Comment 14 Eric Blake 2011-11-10 18:56:44 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    Libvirt would attempt to load a managed save file in preference to starting a domain from scratch, even if the managed save file was damaged and could not be loaded.
Consequence
    Users were complaining about the inability to start domains, not realizing that the domain had a corrupt managed save image that was being retried in a loop, and without realizing an obscure 'virsh managedsave-remove' could resolve the problem.
Fix
    Libvirt introduced 'virsh start --force-boot', as well as some improved logic in ensuring that a managed save file would not be tried if it was corrupt, to make it less likely that a corrupted managed save file can interfere with guest startup.
Result
    Use of managed save images is less likely to cause confusion due to a corrupted image.
Comment 15 errata-xmlrpc 2011-12-06 06:26:41 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html
