Bug 730750

Summary: libvirt error in restoring domain with corrupt managedsave image
Product: Red Hat Enterprise Linux 6 Reporter: Grant Williamson <grant_williamson>
Component: libvirt Assignee: Eric Blake <eblake>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.1 CC: dallan, dyuan, eblake, malittle, mzhan, rwu, veillard, walicki, yupzhang
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-0.9.4-8.el6 Doc Type: Bug Fix
Doc Text:
Cause
    Libvirt would attempt to load a managed save file in preference to starting a domain from scratch, even if the managed save file was damaged and could not be loaded.
Consequence
    Users were complaining about the inability to start domains, not realizing that the domain had a corrupt managed save image that was being retried in a loop, and without realizing an obscure 'virsh managedsave-remove' could resolve the problem.
Fix
    Libvirt introduced 'virsh start --force-boot', as well as some improved logic in ensuring that a managed save file would not be tried if it was corrupt, to make it less likely that a corrupted managed save file can interfere with guest startup.
Result
    Use of managed save images is less likely to cause confusion due to a corrupted image.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-06 11:26:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 638510    

Description Grant Williamson 2011-08-15 16:05:36 UTC
Description of problem:
If a managed save image cannot be restored, the user is presented with the following error message:
"Error restoring domain: cannot send monitor command '{"execute":"qmp_capabilities"}': Connection reset by peer"

Version-Release number of selected component (if applicable):
libvirt 0.8.7-18

How reproducible:
- Power on a Windows XP guest using virt-manager.

- Start saving the guest from Virtual Machine Manager (Shutdown, then Save).

- Before the save file is complete, make a copy of it, then cancel the save process.
  e.g.
  cp /var/lib/libvirt/qemu/save/winxp.raw /root/winxp.raw
  This simulates a corrupt image.

- Shut down the Windows XP guest.

- Copy the incomplete file back.
  e.g.
  cp /root/winxp.raw /var/lib/libvirt/qemu/save/winxp.raw

- Now power on the Windows XP guest; it quits with the error message shown above. The guest will not boot successfully until this corrupt file is removed.
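
The copy-during-save step depends on timing. As a rough alternative, a damaged image can be approximated by truncating a copy of a finished save file. This is only a sketch: it swaps truncation in for an interrupted save, so the resulting damage may differ from what an aborted save produces, and the names make_truncated_copy and keep_fraction are made up for illustration.

```python
import os


def make_truncated_copy(src, dst, keep_fraction=0.5, chunk_size=1 << 20):
    """Copy only the leading part of src to dst.

    Hypothetical helper: mimics a save file that stopped growing part-way.
    Returns the number of bytes written.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    remaining = int(os.path.getsize(src) * keep_fraction)
    written = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while remaining > 0:
            chunk = fin.read(min(chunk_size, remaining))
            if not chunk:  # source shorter than expected; stop early
                break
            fout.write(chunk)
            written += len(chunk)
            remaining -= len(chunk)
    return written
```

Run against the paths from the steps above, this would be called as make_truncated_copy("/var/lib/libvirt/qemu/save/winxp.raw", "/root/winxp.raw"), with the guest shut off and the original file backed up first.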
  

Expected results:
libvirt or virt-manager should detect that the save file is corrupt and either continue to boot normally or ask the user whether to remove the file before booting.

Additional info:

Comment 2 Grant Williamson 2011-08-16 07:14:56 UTC
So I found this thread.
http://www.redhat.com/archives/libvir-list/2011-April/msg00385.html

Red Hat's view on this: if the restore fails, data loss may occur when/if the saved state is removed. I agree.

However, desktop KVM users get confused by cryptic error messages. Would it be possible for virt-manager to handle this in some fashion, for example by prompting the user on failure to either remove the saved state or retry the restore?

Comment 3 Osier Yang 2011-08-17 12:56:39 UTC
I'm not sure whether we can add a field, such as "complete", to the header of the save image,
so that the image can be checked when restoring/starting. But this is the only way I can think of.

Comment 4 Satya Komaragiri 2011-08-30 08:04:03 UTC
Invalid (or missing) info:
     * Version field: '['6.1']'
     * Platform field (Architecture): 'Unspecified'
Please set valid values for above.
Once values are set,  please change status back to 'NEW'.
Regards,

Comment 5 Eric Blake 2011-08-30 16:05:03 UTC
(In reply to comment #3)
> I'm not sure whether we can add a field, such as "complete", to the header of
> the save image, so that the image can be checked when restoring/starting. But
> this is the only way I can think of.

Upstream has tackled this problem on two fronts:

1. Yes, we can, and we should, modify the save image header to mark incomplete images.  Backward compatibility suggests that the best way to do this is by modifying the magic number: a file with an unknown or missing value is treated as unrecognized and libvirt refuses to use it; a special value marks the file as incomplete (managed save will know to warn about the incomplete managed save image, then proceed to boot normally); and the existing magic number is written only on completion, marking the file as safe to use.
https://www.redhat.com/archives/libvir-list/2011-August/msg00854.html

2. Expose the capability of deleting (failed) managed save images more prominently.  Done with this upstream commit:
commit 27c85260532f879be5674a4eed0811c21fd34f94
Author: Eric Blake <eblake>
Date:   Sat Aug 27 17:07:18 2011 -0600

    start: allow discarding managed save
    
    There have been several instances of people having problems with
    a broken managed save file, and not aware that they could use
    'virsh managedsave-remove dom' to fix things.  Making it possible
    to do this as part of starting a domain makes the same functionality
    easier to find, and one less API call.
    
    * include/libvirt/libvirt.h.in (VIR_DOMAIN_START_FORCE_BOOT): New
    flag.
    * src/libvirt.c (virDomainCreateWithFlags): Document it.
    * src/qemu/qemu_driver.c (qemuDomainObjStart): Alter signature.
    (qemuAutostartDomain, qemuDomainStartWithFlags): Update callers.
    * tools/virsh.c (cmdStart): Expose it in virsh.
    * tools/virsh.pod (start): Document it.

as well as this followup to make the virsh capability work even with older servers:
https://www.redhat.com/archives/libvir-list/2011-August/msg01440.html
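
How the follow-up makes the option work against older servers is not spelled out in this report. One plausible shape, sketched below purely as an assumption, is a client-side fallback: try the start with the force-boot flag and, if the server rejects the flag, remove the managed save image explicitly and start normally. FakeDomain, UnsupportedFlag and start_force_boot are hypothetical stand-ins, not libvirt or virsh APIs.

```python
class UnsupportedFlag(Exception):
    """Hypothetical error raised when a server does not know a flag."""


class FakeDomain:
    """Hypothetical stand-in for a domain handle; not a libvirt class."""

    def __init__(self, has_managed_save, supports_force_boot):
        self.has_managed_save = has_managed_save
        self.supports_force_boot = supports_force_boot

    def create_with_flags(self, force_boot=False):
        if force_boot and not self.supports_force_boot:
            raise UnsupportedFlag("force boot not supported")
        if force_boot:
            # A server that knows the flag discards the image itself.
            self.has_managed_save = False
        if self.has_managed_save:
            self.has_managed_save = False  # image is consumed by the restore
            return "restored"
        return "fresh boot"

    def managed_save_remove(self):
        self.has_managed_save = False


def start_force_boot(dom):
    """Start dom from scratch, emulating force boot on older servers."""
    try:
        return dom.create_with_flags(force_boot=True)
    except UnsupportedFlag:
        # Assumed fallback: same effect as 'virsh managedsave-remove dom'
        # followed by a plain start.
        if dom.has_managed_save:
            dom.managed_save_remove()
        return dom.create_with_flags(force_boot=False)
```

Either way the caller ends up with a fresh boot and no managed save image left behind, which is the behavior the commit message asks for.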

I think both approaches need to be backported into RHEL before we can call this issue complete (which implies that approach 1 still needs to be coded and accepted upstream, and that patch 2/1 of approach 2 still needs ack upstream).
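
The magic-number scheme from approach 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the upstream patch: the two magic values are placeholders (the real header layout and values are not given in this report), and classify, classify_file and write_image are hypothetical names.

```python
import enum

# Placeholder magic values, assumed for illustration only. They have the
# same length so one can overwrite the other in place.
MAGIC_COMPLETE = b"DemoSaveComplete"
MAGIC_PARTIAL = b"DemoSavePartial!"


class ImageState(enum.Enum):
    COMPLETE = "complete"      # safe to restore from
    INCOMPLETE = "incomplete"  # warn, ignore the image, boot normally
    UNKNOWN = "unknown"        # unrecognized file, refuse to use it


def classify(header):
    """Map the leading bytes of a save image to one of three states."""
    magic = header[:len(MAGIC_COMPLETE)]
    if magic == MAGIC_COMPLETE:
        return ImageState.COMPLETE
    if magic == MAGIC_PARTIAL:
        return ImageState.INCOMPLETE
    return ImageState.UNKNOWN


def classify_file(path):
    with open(path, "rb") as f:
        return classify(f.read(len(MAGIC_COMPLETE)))


def write_image(path, payload, fail_midway=False):
    """Write the partial marker first; stamp the final magic only at the end."""
    with open(path, "wb") as f:
        f.write(MAGIC_PARTIAL)
        if fail_midway:
            # Simulate an interrupted save: only part of the payload lands.
            f.write(payload[:len(payload) // 2])
            return
        f.write(payload)
        f.flush()
        f.seek(0)
        f.write(MAGIC_COMPLETE)
```

Because the completion value is written last, a save that dies at any earlier point leaves the partial marker in place, so a start can tell an interrupted image from a usable one without reading the payload.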

Comment 6 Eric Blake 2011-08-30 21:09:54 UTC
approach 1 also posted upstream:
https://www.redhat.com/archives/libvir-list/2011-August/msg01458.html
https://www.redhat.com/archives/libvir-list/2011-August/msg01459.html

Additionally, at least one of my pending snapshot patches wants to use the refactored qemuOpenFile() method from msg01458, so I'm marking this as a prereq to bug 638510 (support for live snapshots via the snapshot_blkdev qemu monitor command).

Comment 9 dyuan 2011-09-07 07:32:45 UTC
Reproduced this bug on libvirt-0.9.4-7.el6: the domain fails to start with the incomplete save file. Verified PASS with libvirt-0.9.4-9.el6: the domain boots normally and the incomplete save file is removed.

Comment 10 dyuan 2011-09-07 07:41:15 UTC
(In reply to comment #9)
> Reproduced this bug on libvirt-0.9.4-7.el6: the domain fails to start with the
> incomplete save file. Verified PASS with libvirt-0.9.4-9.el6: the domain boots
> normally and the incomplete save file is removed.

I also get the following in libvirtd.log:
15:30:36.635: 10074: warning : qemuDomainObjStart:4857 : Ignoring incomplete managed state /var/lib/libvirt/qemu/save/rhel6.save

Comment 11 Daniel Veillard 2011-09-07 09:24:41 UTC
yep, that's normal :-)

Daniel

Comment 12 John Walicki 2011-09-07 14:07:39 UTC
Many thanks for the patch. 
Will this fix be included in RHEL 6.2?

Comment 13 Daniel Veillard 2011-10-13 14:28:17 UTC
Re comment #12: yes, definitely.

Daniel

Comment 14 Eric Blake 2011-11-10 23:56:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    Libvirt would attempt to load a managed save file in preference to starting a domain from scratch, even if the managed save file was damaged and could not be loaded.
Consequence
    Users were complaining about the inability to start domains, not realizing that the domain had a corrupt managed save image that was being retried in a loop, and without realizing an obscure 'virsh managedsave-remove' could resolve the problem.
Fix
    Libvirt introduced 'virsh start --force-boot', as well as some improved logic in ensuring that a managed save file would not be tried if it was corrupt, to make it less likely that a corrupted managed save file can interfere with guest startup.
Result
    Use of managed save images is less likely to cause confusion due to a corrupted image.

Comment 15 errata-xmlrpc 2011-12-06 11:26:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html