Bug 692998

Summary: data loss if restoring libvirt domain encounters transient error
Product: Red Hat Enterprise Linux 6 Reporter: Eric Blake <eblake>
Component: libvirt Assignee: Osier Yang <jyang>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.1 CC: dallan, eblake, mzhan, syeghiay, yoyzhang
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-0.8.7-17.el6 Doc Type: Bug Fix
Doc Text:
libvirt removed the managed state file (created by "virsh managedsave dom") even if it failed to restore and start the domain using that file. This caused data loss. The managed state file is now removed only if the restore operation succeeds.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-19 13:29:36 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eric Blake 2011-04-01 22:30:29 UTC
Description of problem:
https://www.redhat.com/archives/libvir-list/2011-April/msg00071.html
Libvirt blindly unlink()s a saved domain image when completing 'virsh restore file' (or, with managedsave, 'virsh start' from the managed save location).  However, if qemu failed to start (particularly if the failure is transient, such as lack of memory due to other pressures on the system that can be resolved before retrying the restore), this is a form of data loss.
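
The ordering problem described above can be illustrated with a minimal Python sketch. This is not libvirt's code: `start_from_managed_save`, `flaky_restore` and `demo` are hypothetical names, and the restore step is simulated by a callable that fails once and then succeeds. The only point is that the unlink happens after, and only after, a successful restore.

```python
import os
import tempfile


def start_from_managed_save(save_path, restore):
    """Restore a guest from save_path; delete the file only on success.

    `restore` is a hypothetical callable standing in for the step that
    launches qemu from the saved image. It raises on failure.
    """
    restore(save_path)      # may raise, e.g. on a transient lack of memory
    os.unlink(save_path)    # reached only if the restore succeeded


def demo():
    """Simulate one transient failure followed by a successful retry."""
    fd, save_path = tempfile.mkstemp(suffix=".save")
    os.close(fd)
    attempts = []

    def flaky_restore(path):
        # Hypothetical stand-in: fail on the first call, succeed afterwards.
        attempts.append(path)
        if len(attempts) == 1:
            raise MemoryError("simulated transient failure")

    try:
        start_from_managed_save(save_path, flaky_restore)
    except MemoryError:
        pass
    kept_after_failure = os.path.exists(save_path)

    start_from_managed_save(save_path, flaky_restore)
    removed_after_success = not os.path.exists(save_path)
    return kept_after_failure, removed_after_success
```

Calling `demo()` exercises both paths: the first attempt fails and leaves the file in place, the retry succeeds and removes it.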

Version-Release number of selected component (if applicable):
libvirt-0.8.7-15.el6

How reproducible:
not sure how to reliably reproduce qemu not starting, but it seems like it should be possible to come up with some scenarios

Steps to Reproduce:
1. save a running qemu domain (either with 'virsh managedsave dom' or 'virsh save dom file')
2. set up conditions where qemu will fail to start (possibly by reverting to a known-buggy qemu, or by intentionally allocating enough memory in some other program that the qemu memory request will be denied)
3. try to start the saved domain ('virsh start' or 'virsh restore file', accordingly)
4. revert the temporary conditions (restore qemu to a non-buggy version, or release memory back to the system...)
5. try to start the saved domain again
  
Actual results:
step 3 unlinked the save file, even though the restore failed, losing all state from the guest's memory at the time it was saved

Expected results:
step 3 should fail, but leave the save file intact.  step 5 should then succeed.


Additional info:
No upstream patch available yet, but data loss is severe, hence requesting exception for inclusion in RHEL 6.1.

Comment 2 Osier Yang 2011-04-05 06:50:38 UTC
The problem exists only for "virsh managedsave dom; virsh start dom". "virsh save dom dom.save; virsh restore dom.save" works fine, as it doesn't try to unlink the saved state.

patch posted to upstream:
http://www.redhat.com/archives/libvir-list/2011-April/msg00215.html

Comment 6 zhanghaiyan 2011-04-18 10:04:45 UTC
Verified: this bug passes with libvirt-0.8.7-17.el6.x86_64
1. # virsh start rhel61
Do some operation in guest, for example open a document
2. # virsh managedsave rhel61
# ll /var/lib/libvirt/qemu/save/rhel61.save 
-rw-------. 1 root root 497719393 Apr 18 05:58 /var/lib/libvirt/qemu/save/rhel61.save
3. # rpm -e qemu-kvm --nodeps
4. # virsh start rhel61
error: Failed to start domain rhel61
error: Cannot find QEMU binary /usr/libexec/qemu-kvm: No such file or directory
# ll /var/lib/libvirt/qemu/save/rhel61.save 
-rw-------. 1 root root 497719393 Apr 18 05:58 /var/lib/libvirt/qemu/save/rhel61.save
5. # rpm -ivh qemu-kvm-0.12.1.2-2.158.el6.x86_64.rpm 
# service libvirtd restart
6. # virsh start rhel61
The guest is restored from the save file, and the document is still open in the guest

Reproduced this bug with libvirt-0.8.7-16.el6.x86_64
In step 4, the save file is deleted
In step 6, the guest boots up fresh
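
The before/after `ll` comparison in the steps above could be automated along these lines. This is a rough sketch, not part of any libvirt tooling: `save_file_state` and `survived_failed_start` are hypothetical helper names, and comparing size plus modification time is an assumed stand-in for eyeballing the `ll` output.

```python
import os
import tempfile


def save_file_state(path):
    """Return (size, mtime_ns) for path, or None if the file is gone."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_size, st.st_mtime_ns)


def survived_failed_start(before, after):
    """True if the save file existed before and is unchanged afterwards."""
    return before is not None and before == after


if __name__ == "__main__":
    # Self-contained demonstration on a temporary file rather than a real
    # managed save image.
    fd, path = tempfile.mkstemp(suffix=".save")
    os.write(fd, b"guest state")
    os.close(fd)
    before = save_file_state(path)
    print(survived_failed_start(before, save_file_state(path)))
    os.unlink(path)
    print(survived_failed_start(before, save_file_state(path)))
```

In a real check, the state would be recorded for /var/lib/libvirt/qemu/save/rhel61.save before step 4 and compared again after the failed start.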

Comment 7 Osier Yang 2011-05-03 07:06:52 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
libvirt removes the managed state file (created by "virsh managedsave dom") even if it fails to start up the domain from restoring with the managed state file, which causes data loss, with this update, it removes the managed state file only if restoring from the managed state file succeeded.

Comment 10 Laura Bailey 2011-05-04 04:17:24 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-libvirt removes the managed state file (created by "virsh managedsave dom") even if it fails to start up the domain from restoring with the managed state file, which causes data loss, with this update, it removes the managed state file only if restoring from the managed state file succeeded.
+libvirt removed the managed state file (created by "virsh managedsave dom") even if it failed to restore and start the domain using that file. This caused data loss. The managed state file is now removed only if the restore operation succeeds.

Comment 11 errata-xmlrpc 2011-05-19 13:29:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html