Bug 1029069

Summary: Live storage migration snapshot removal fails, probably due to unexpected qemu-img output
Product: [Retired] oVirt Reporter: Sander <bugzilla>
Component: vdsmAssignee: Federico Simoncelli <fsimonce>
Status: CLOSED CURRENTRELEASE QA Contact: Haim <hateya>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.3CC: abaron, acathrow, amureini, bazulay, bugzilla, iheim, mgoldboi, yeylon
Target Milestone: ---   
Target Release: 3.3.2   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-27 15:00:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
20131112 snapshot only vdsm.log
none
20131112 snapshot only libvirtd.log none

Description Sander 2013-11-11 15:32:24 UTC
Description of problem:

When I move a disk (of a running vm) to a different storage pool a live
snapshot ('Auto-generated for Live Storage Migration') is created.

This snapshot is not removed after the storage migration is completed and
cannot be removed from a running VM ( probably because live snapshot
deletion is not implemented yet?).

When I stop the VM and then remove the snapshot this fails and the snapshot
state changes to "BROKEN".

I've done some additional testing and the problem only occurs when a
preallocated disk is used.

One of the errors in vdsm.log is:

14:02:53,772::image::1164::Storage.Image::(merge) Auto shrink after
merge failed
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 1162, in merge
    srcVol.shrinkToOptimalSize()
  File "/usr/share/vdsm/storage/blockVolume.py", line 320, in
shrinkToOptimalSize
    qemuImg.FORMAT.QCOW2)
  File "/usr/lib64/python2.6/site-packages/vdsm/qemuImg.py", line 109, in check
    raise QImgError(rc, out, err, "unable to parse qemu-img check output")
QImgError: ecode=0, stdout=['No errors were found on the image.'],
stderr=[], message=unable to parse qemu-img check output
Thread-20::ERROR::2013-11-08

I've looked at the parsing error and the problem seems to be that the
qemuImg.check() function expects a qemu-img version that outputs a
string like: "Image end offset: 262144".

The only version I could find that does that is the one in the F19 virt-preview
repository. Stock F19 qemu-img doesn't have this output. Nor do the stock EL6
or qemu-img-rhev EL6 versions.

The only function that calls qemuImg.check is blockVolume.shrinkToOptimalSize()
this function needs (and uses) the parsed offset to shrink the LV containing the
qcow2 image.


Version-Release number of selected component (if applicable):
vdsm-4.12.1-4.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create VM with preallocated disk ( in my case on FC storage ).
2. Start VM
3. Do a live migration of the preallocated disk to a different storage domain.
4. Shut down VM
5. Remove the 'Auto-generated for Live Storage Migration' snapshot.

Actual results:
Snapshot removal fails and snapshot state changes to "BROKEN"

Expected results:
Snapshot removal succeeds.

Additional info:

Mailing list thread: http://lists.ovirt.org/pipermail/users/2013-November/017695.html
Ovirt 3.3 on EL6 (centos6).

Comment 1 Sander 2013-11-12 09:18:00 UTC
I've done some additional testing with just the live snapshot creation and removal.

1. Create VM with preallocated storage
2. Start VM
3. Make live snapshot
4. Shutdown VM
5. Try to remove snapshot

The result is the same "Failed to delete snapshot" error and a "BROKEN" snapshot.

I'll attach the full vdsm log (forced logrotate before the test).
In this log I do not see the qemu-img parse error.
I do see the following errors:

d09b8ba0-659b-4d33-906b-2ae84931c29a::ERROR::2013-11-12 10:08:04,661::task::850::TaskManager.Task::(_setError) Task=`d09b8ba0-659b-4d33-906b-2ae84931c29a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1937, in mergeSnapshots
    sdUUID, vmUUID, imgUUID, ancestor, successor, postZero)
  File "/usr/share/vdsm/storage/image.py", line 1162, in merge
    srcVol.shrinkToOptimalSize()
  File "/usr/share/vdsm/storage/blockVolume.py", line 315, in shrinkToOptimalSize
    volParams = self.getVolumeParams()
  File "/usr/share/vdsm/storage/volume.py", line 1008, in getVolumeParams
    volParams['imgUUID'] = self.getImage()
  File "/usr/share/vdsm/storage/blockVolume.py", line 494, in getImage
    return self.getVolumeTag(TAG_PREFIX_IMAGE)
  File "/usr/share/vdsm/storage/blockVolume.py", line 464, in getVolumeTag
    return _getVolumeTag(self.sdUUID, self.volUUID, tagPrefix)
  File "/usr/share/vdsm/storage/blockVolume.py", line 662, in _getVolumeTag
    tags = lvm.getLV(sdUUID, volUUID).tags
  File "/usr/share/vdsm/storage/lvm.py", line 851, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
LogicalVolumeDoesNotExistError: Logical volume does not exist: ('9d4e8a43-4851-42ff-a684-f3d802527cf7/_remove_me_gxRZPitO_3d8b0fa7-03f0-4c54-a79b-15a72922b3f1',)

...

Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 331, in teardownVolumePath
    res = self.irs.teardownImage(drive['domainID'],
  File "/usr/share/vdsm/vm.py", line 1344, in __getitem__
    raise KeyError(key)
KeyError: 'domainID'
d09b8ba0-659b-4d33-906b-2ae84931c29a::WARNING::2013-11-12 10:07:43,261::lvm::594::Storage.LVM::(getLv) lv: 3d8b0fa7-03f0-4c54-a79b-15a72922b3f1_MERGE not found in lvs vg: 9d4e8a43-4851-42ff-a684-f3d802527cf7 response

...

d09b8ba0-659b-4d33-906b-2ae84931c29a::ERROR::2013-11-12 10:08:04,661::task::850::TaskManager.Task::(_setError) Task=`d09b8ba0-659b-4d33-906b-2ae84931c29a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1937, in mergeSnapshots
    sdUUID, vmUUID, imgUUID, ancestor, successor, postZero)
  File "/usr/share/vdsm/storage/image.py", line 1162, in merge
    srcVol.shrinkToOptimalSize()
  File "/usr/share/vdsm/storage/blockVolume.py", line 315, in shrinkToOptimalSize
    volParams = self.getVolumeParams()
  File "/usr/share/vdsm/storage/volume.py", line 1008, in getVolumeParams
    volParams['imgUUID'] = self.getImage()
  File "/usr/share/vdsm/storage/blockVolume.py", line 494, in getImage
    return self.getVolumeTag(TAG_PREFIX_IMAGE)
  File "/usr/share/vdsm/storage/blockVolume.py", line 464, in getVolumeTag
    return _getVolumeTag(self.sdUUID, self.volUUID, tagPrefix)
  File "/usr/share/vdsm/storage/blockVolume.py", line 662, in _getVolumeTag
    tags = lvm.getLV(sdUUID, volUUID).tags
  File "/usr/share/vdsm/storage/lvm.py", line 851, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
LogicalVolumeDoesNotExistError: Logical volume does not exist: ('9d4e8a43-4851-42ff-a684-f3d802527cf7/_remove_me_gxRZPitO_3d8b0fa7-03f0-4c54-a79b-15a72922b3f1',)

Comment 2 Sander 2013-11-12 09:19:58 UTC
Created attachment 822789 [details]
20131112 snapshot only vdsm.log

Comment 3 Sander 2013-11-12 09:20:54 UTC
Created attachment 822790 [details]
20131112 snapshot only libvirtd.log

Comment 4 Sander 2013-11-12 10:15:42 UTC
Might be related to BZ 1015071 and BZ 1027464 ?

Comment 5 Sander 2013-11-12 15:13:44 UTC
The patch at http://gerrit.ovirt.org/#/c/19983/ from BZ 1015071 seems to fix this issue.

Comment 6 Sander 2013-11-27 15:00:00 UTC
Seems like this one is fixed in 3.3.1, after upgrading from 3.3.0 to 3.3.1 snapshot removal works.