Bug 735457

Summary: libvirt should not leave stale snapshot metadata behind after domain disappears
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.2
Status: CLOSED ERRATA
Reporter: Eric Blake <eblake>
Assignee: Eric Blake <eblake>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ajia, dyuan, mzhan, nzhang, rwu, veillard, whuang, xhu
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-0.9.4-13.el6
Doc Type: Bug Fix
Last Closed: 2011-12-06 11:28:10 UTC
Bug Blocks: 638510, 747120

Description Eric Blake 2011-09-02 17:25:14 UTC
Description of problem:
Libvirt was recently fixed to prevent undefining a domain that has managed save metadata (bug 697742). However, this only covers part of the metadata; snapshot metadata has the same problem.

Version-Release number of selected component (if applicable):
libvirt-0.9.4-7.el6

How reproducible:
I didn't actually test the scenario to see how things fail. Based on my work in the code to fix the issue, and on the similarity to managed save, I suspect the problem will show up as bogus log messages, or even wholesale libvirt corruption as it tries to use the stale metadata.

Steps to Reproduce:
1. define a persistent domain with qcow2 disks (doesn't even have to run)
2. 'virsh snapshot-create dom'
3. 'virsh undefine dom'
4. ls /var/lib/libvirt/qemu/snapshot/dom
5. define a new domain with the same name, but different UUID
6. 'virsh snapshot-list dom'
  
Actual results:
Step 3 succeeded, but step 4 showed that it left /var/lib/libvirt/qemu/snapshot/dom/*.xml stranded.
Steps 5 and 6 can then be confused by the stale files. It may take further steps before the confusion causes visible problems, such as an attempt to create a new snapshot with the same name as a stale one. Either way, the situation is certainly risky.
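Step 4 can be made repeatable with a small script that lists leftover snapshot XML files for a domain name that is no longer defined. This is a minimal sketch. The directory layout /var/lib/libvirt/qemu/snapshot/<domain>/*.xml comes from the steps above. The helper name `stale_snapshot_files` and its `defined_domains` argument are hypothetical; the argument is meant to hold the names that 'virsh list --all' still reports.

```python
from pathlib import Path

# Location used in the reproduction steps above; other setups may differ (assumption).
SNAPSHOT_ROOT = Path("/var/lib/libvirt/qemu/snapshot")


def stale_snapshot_files(domain, defined_domains, root=SNAPSHOT_ROOT):
    """Return snapshot XML files left behind for a domain that is no longer defined.

    Hypothetical helper: `defined_domains` is the set of names that
    'virsh list --all' still shows.
    """
    if domain in defined_domains:
        return []  # metadata belonging to a defined domain is not stale
    snap_dir = Path(root) / domain
    if not snap_dir.is_dir():
        return []  # no directory at all: nothing was stranded
    return sorted(p.name for p in snap_dir.glob("*.xml"))
```

After the fix, this should return an empty list for any domain that has been undefined.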

Expected results:
Step 3 should be forbidden until the snapshot metadata has been removed, and virsh should be enhanced to make this easier ('virsh undefine --snapshots-metadata dom'). Once a domain is no longer present in 'virsh list --all', there should be no /var/lib/libvirt/qemu/snapshot/dom directory for that domain.
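The requested rule fits in a small decision function. This is a hypothetical model, not libvirt's implementation. The names `undefine` and `UndefineError` are invented, the keyword `snapshots_metadata` stands in for 'virsh undefine --snapshots-metadata', and the error text copies the message that appears during verification later in this report.

```python
class UndefineError(Exception):
    """Raised when undefine is refused (hypothetical model of the check)."""


def undefine(domain, snapshots_metadata=False):
    """Model of the proposed rule for undefining an inactive persistent domain.

    `domain` is a dict with a "name" and a "snapshots" list of snapshot names.
    Returns the snapshot metadata entries that were removed.
    """
    count = len(domain["snapshots"])
    if count and not snapshots_metadata:
        # Refuse rather than strand /var/lib/libvirt/qemu/snapshot/<name>/*.xml
        raise UndefineError(
            "Requested operation is not valid: "
            f"cannot delete inactive domain with {count} snapshots"
        )
    removed = list(domain["snapshots"])
    domain["snapshots"].clear()  # only metadata is dropped; disk images are untouched
    return removed
```

A domain without snapshots undefines as before. A domain with snapshot metadata needs the explicit flag.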

Additional info:
Upstream patch series fixes this and more:
https://www.redhat.com/archives/libvir-list/2011-September/msg00137.html

Comment 1 Eric Blake 2011-09-02 17:25:58 UTC
Fixing this is a prerequisite for bug 638510 (support for live snapshots via the
snapshot_blkdev qemu monitor command).

Comment 2 Eric Blake 2011-09-02 22:49:18 UTC
Since this patch proposes blocking undefine only when metadata is present, it also becomes important to detect when metadata is present, and to delete metadata without affecting snapshot contents. Additionally, it becomes important to be able to redefine metadata to its state before deletion. That way the snapshot hierarchy can be preserved across a transient domain restart or a migration between machines. I'm lumping all of those fixes into this bug.
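The three capabilities listed above are: detecting metadata, deleting only the metadata, and redefining it later with the same parent links. The sketch below models them on an in-memory store. The classes `SnapshotMeta` and `SnapshotStore` are hypothetical and only illustrate the intended semantics. In the upstream API these operations correspond, as far as I know, to flags such as VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY and VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SnapshotMeta:
    name: str
    parent: Optional[str]  # None marks a root snapshot


class SnapshotStore:
    """Hypothetical model: metadata lives here, snapshot contents live in the disk images."""

    def __init__(self) -> None:
        self.meta: Dict[str, SnapshotMeta] = {}

    def has_metadata(self) -> bool:
        return bool(self.meta)

    def dump(self) -> List[SnapshotMeta]:
        # Save the metadata before deleting it, so it can be redefined later.
        return list(self.meta.values())

    def delete_metadata_only(self) -> None:
        # Forget the metadata; the snapshot data inside qcow2 images is left alone.
        self.meta.clear()

    def redefine(self, saved: List[SnapshotMeta]) -> None:
        # Restore entries verbatim, including parent links, so the hierarchy survives.
        for snap in saved:
            self.meta[snap.name] = snap

    def roots(self) -> List[str]:
        return sorted(s.name for s in self.meta.values() if s.parent is None)
```

The intended sequence is dump, then delete_metadata_only, then (after a restart or on another host) redefine. This leaves the hierarchy exactly as it was.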

Comment 3 Eric Blake 2011-09-03 04:33:08 UTC
Upstream series ending in this commit:

commit e2fb96d92b4b986a2b5732416f7bfd302a848970
Author: Eric Blake <eblake>
Date:   Fri Aug 12 13:23:09 2011 -0600

    snapshot: prevent migration from stranding snapshot data
    
    Migration is another case of stranding metadata.  And since
    snapshot metadata is arbitrarily large, there's no way to
    shoehorn it into the migration cookie of migration v3.
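The commit message above amounts to a precondition on the source host. If the domain has any snapshot metadata, migration is refused rather than silently leaving that metadata behind. This is a minimal sketch under that reading. `check_migration_allowed`, `MigrationError` and the message wording are hypothetical, not copied from libvirt.

```python
class MigrationError(Exception):
    """Raised when migration would strand snapshot metadata (hypothetical)."""


def check_migration_allowed(domain_name: str, snapshot_names: list) -> None:
    # Snapshot metadata is arbitrarily large, so it cannot ride along in the
    # migration cookie; refuse instead of leaving it stranded on the source.
    if snapshot_names:
        raise MigrationError(
            f"cannot migrate domain {domain_name} with {len(snapshot_names)} snapshots"
        )
```

Under this model, a snapshot hierarchy has to be saved and removed before migration and then redefined on the destination.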

Comment 6 Eric Blake 2011-09-08 15:06:29 UTC
Two additional patches make 'virsh snapshot-create dom --no-metadata' print the just-generated snapshot name instead of failing. However, using the first of these two patches directly would be an incompatible API change, so it can't be backported as-is:

https://www.redhat.com/archives/libvir-list/2011-September/msg00390.html

Comment 7 Eric Blake 2011-09-08 15:20:55 UTC
I posted the followup patches for the --no-metadata improvement, although I'm not yet sure whether they belong to this BZ or a new one: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-September/msg00307.html

Comment 10 Eric Blake 2011-09-20 15:51:38 UTC
Hmm, I realized that the -12 build fails to remove snapshot metadata for transient domains as documented, because my patches so far focused only on persistent domains. I'm not sure whether to split that into another BZ or move this one back to ASSIGNED.

Comment 11 Eric Blake 2011-09-21 19:16:01 UTC
Pulling back to ASSIGNED while waiting for three more upstream patches to be approved:
https://www.redhat.com/archives/libvir-list/2011-September/msg00860.html

Comment 13 Huang Wenlong 2011-09-26 07:14:09 UTC
Verified with libvirt-0.9.4-13.el6.x86_64:


1. define a persistent domain named "snap" with qcow2 disks
2. create a snapshot for domain snap
# virsh snapshot-create snap
3. virsh # snapshot-list snap
 Name                 Creation Time             State
------------------------------------------------------------
 1317020538           2011-09-26 15:02:18 +0800 shutoff

4. virsh # undefine snap
error: Failed to undefine domain snap
error: Requested operation is not valid: cannot delete inactive domain with 1 snapshots

5. virsh undefine --snapshots-metadata snap 
Domain snap has been undefined

6. check snapshot metadata (none remains)
# ls /var/lib/libvirt/qemu/snapshot/snap
7. define a new domain with the same name, but a different UUID
8. check snapshots for domain snap:
# virsh snapshot-list snap
 Name                 Creation Time             State
---------------------------------------------------------
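The snapshot names in these listings look like the creation time in seconds since the Unix epoch. libvirt appears to generate such a name when the snapshot XML does not supply one. The sketch below checks that reading against the listings in this report. It assumes the host clock was at UTC+08:00, as the "+0800" offset shows. `snapshot_name_to_time` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def snapshot_name_to_time(name: str, utc_offset_hours: int = 8) -> str:
    """Render a timestamp-style snapshot name the way the listing above shows it."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(int(name), tz).strftime("%Y-%m-%d %H:%M:%S %z")


print(snapshot_name_to_time("1317020538"))  # 2011-09-26 15:02:18 +0800
```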

Comment 14 Eric Blake 2011-10-07 14:42:59 UTC
This patch series introduced a typo in the user-visible error message:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-October/msg00233.html

Comment 15 Alex Jia 2011-10-08 06:39:37 UTC
These issues have been resolved on the RHEL 6 beta (kernel 2.6.32-193.el6.x86_64) with libvirt-0.9.4-16.el6.x86_64, so I am moving the bug to VERIFIED status.

The following are some details:

# qemu-img create -f qcow2 /var/lib/libvirt/images/foo.img 10M
Formatting '/var/lib/libvirt/images/foo.img', fmt=qcow2 size=10485760 encryption=off cluster_size=65536 

# qemu-img info /var/lib/libvirt/images/foo.img 
image: /var/lib/libvirt/images/foo.img
file format: qcow2
virtual size: 10M (10485760 bytes)
disk size: 140K
cluster_size: 65536

# cat > /root/demo.xml <<EOF
<domain type='qemu'>
  <name>demo</name>
  <memory>219200</memory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64'>hvm</type>
    <boot dev='cdrom'/>
  </os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/foo.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <input type='mouse' bus='ps2'/>
    <graphics type='spice' autoport='yes' listen='0.0.0.0'/>
  </devices>
</domain>
EOF
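The reproducer requires qcow2 disks. As far as I understand, snapshots of an inactive domain are stored inside the qcow2 images themselves. A definition such as demo.xml can be checked before running snapshot-create by reading the driver type of each disk. This is a sketch using Python's standard xml.etree.ElementTree. `non_qcow2_disks` is a hypothetical helper, and it only looks at <disk device='disk'> elements.

```python
import xml.etree.ElementTree as ET


def non_qcow2_disks(domain_xml: str) -> list:
    """Return the target devs of disks whose <driver type> is not qcow2."""
    root = ET.fromstring(domain_xml)
    bad = []
    for disk in root.findall("./devices/disk[@device='disk']"):
        driver = disk.find("driver")
        target = disk.find("target")
        fmt = driver.get("type") if driver is not None else None
        if fmt != "qcow2":
            bad.append(target.get("dev") if target is not None else "?")
    return bad
```

For the definition above, `non_qcow2_disks(open('/root/demo.xml').read())` should return an empty list, since its only disk (vda) uses the qcow2 driver.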

# virsh define /root/demo.xml
Domain demo defined from /root/demo.xml

# virsh snapshot-list demo
 Name                 Creation Time             State
------------------------------------------------------------

# cat > /root/snap.xml <<EOF
<domainsnapshot>
  <state>shutoff</state>
</domainsnapshot>
EOF

# virsh snapshot-create demo /root/snap.xml 
Domain snapshot 1318054922 created from '/root/snap.xml'

# virsh snapshot-list demo
 Name                 Creation Time             State
------------------------------------------------------------
 1318054922           2011-10-08 14:22:02 +0800 shutoff

# virsh snapshot-list demo --parent
 Name                 Creation Time             State           Parent
------------------------------------------------------------
 1318054922           2011-10-08 14:22:02 +0800 shutoff

# virsh snapshot-list demo --roots
 Name                 Creation Time             State
------------------------------------------------------------
 1318054922           2011-10-08 14:22:02 +0800 shutoff

# virsh snapshot-list demo --parent --roots
error: --parent and --roots are mutually exclusive

Note: the error message no longer has the typo in the word 'exclusive'.

# ls /var/lib/libvirt/qemu/snapshot/demo/1318054922.xml 
/var/lib/libvirt/qemu/snapshot/demo/1318054922.xml

# virsh undefine demo
error: Failed to undefine domain demo
error: Requested operation is not valid: cannot delete inactive domain with 1 snapshots

Note: this is the expected behaviour.

# ls /var/lib/libvirt/qemu/snapshot/demo/1318054922.xml 
/var/lib/libvirt/qemu/snapshot/demo/1318054922.xml

Note: the snapshot metadata still exists, which is the expected result.

# virsh undefine --snapshots-metadata demo
Domain demo has been undefined

# ls /var/lib/libvirt/qemu/snapshot/demo/1318054922.xml 
ls: cannot access /var/lib/libvirt/qemu/snapshot/demo/1318054922.xml: No such file or directory

Note: the snapshot metadata is gone; everything works as expected now.

Comment 16 errata-xmlrpc 2011-12-06 11:28:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html