Description of problem:
During live merge, content from a volume 'top' is merged into a volume 'base'. If 'base' is thin provisioned, it may need to be extended to accommodate the new data that will be written. Until libvirt provides a monitoring API, we attempt to pre-extend 'base' when first starting the merge. The current code extends 'base' to the currently allocated size of 'top', but this heuristic is incorrect in some cases. If 'base' and 'top' have similar allocated sizes but 'top' contains many blocks that are not allocated in 'base', we will not extend 'base' enough and the merge will fail with an -ENOSPC error.

Version-Release number of selected component (if applicable):
vdsm-4.17.0-98-g13bdaa3

How reproducible:
Always

Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk: open a terminal inside the VM and run the following command
   dd if=/dev/zero of=/dev/vda bs=1M count=2048
4. When the above command has finished, create a VM snapshot
5. Write data to the second part of the disk: open a terminal inside the VM and run the following command
   dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048
6. When the above command has finished, delete the snapshot created in step 4

Actual results:
The merge starts and copies data for a while, but ends before finishing. Engine reports that the snapshot failed to delete.

Expected results:
The snapshot should be deleted successfully.

Additional info:
The problem is that when starting the merge we calculate the extension as follows:

    topSize = drive.apparentsize
    ...
    self.extendDriveVolume(drive, baseVolUUID, topSize)

In this scenario, baseVol is 3G and topVol is 3G. We will extend baseVol to 4G (topSize = 3G plus a 1G chunk). During the merge we need to write 2G worth of data (which in a COW image takes somewhat more space than 2G due to qcow2 metadata). This results in an -ENOSPC error.

I think the only solution is to extend 'base' by the allocated size of 'top' plus one extra chunk, which covers this worst-case scenario. Note that even this amount may not be enough if heavy write activity touching new parts of the disk occurs on 'top' during the actual merge; in that case, simply restarting the merge is a workaround.
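To make the arithmetic concrete, here is a minimal sketch (not actual vdsm code) contrasting the current heuristic with the proposed one; the function names, the MB units, and the standalone structure are hypothetical, only drive.apparentsize and the 1G chunk come from this report:

    # Hypothetical illustration of the two extension heuristics.
    CHUNK_MB = 1024  # assumed 1G extension chunk, as in the scenario above

    def current_target_size(top_alloc_mb):
        # Current heuristic: extend 'base' TO the allocated size of 'top'
        # plus one chunk. With base=3G and top=3G this yields 4G, which
        # cannot hold the extra ~2G copied during the merge.
        return top_alloc_mb + CHUNK_MB

    def proposed_target_size(base_alloc_mb, top_alloc_mb):
        # Proposed heuristic: extend 'base' BY the allocated size of 'top'
        # plus one chunk, i.e. base allocation + top allocation + chunk.
        # With base=3G and top=3G this yields 7G, covering the worst case
        # where none of top's blocks are already allocated in base.
        return base_alloc_mb + top_alloc_mb + CHUNK_MB

    if __name__ == "__main__":
        base, top = 3 * 1024, 3 * 1024  # allocated sizes from the reproducer
        print("current target:  %d MB" % current_target_size(top))          # 4096
        print("proposed target: %d MB" % proposed_target_size(base, top))   # 7168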
This is an edge case of an edge case - pushing out.
Re-targeting to 3.5.3 since this bug has not been marked as blocker for 3.5.2 and we have already released 3.5.2 Release Candidate.
Adam, is this still relevant given the volume pre-extension you introduced?
It's still relevant until we do the actual dynamic resizing (covered by bug 1168327). Adding the dependency now.
We can fix this issue by fixing the pre-extension calculation when performing a live merge on thinly provisioned block storage. See https://gerrit.ovirt.org/#/c/44331/.
On vdsm ovirt-3.6 branch as commit 6070392aba975a0cbdbfa340fd032c7d62a48ee7
Tested with the following code:
-----------------------------------
rhevm-3.6.0.3-0.1.el6.noarch
vdsm-4.17.10.1-0.el7ev.noarch

Verified with the following steps:
----------------------------------
Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk: open a terminal inside the VM and run the following command
   dd if=/dev/zero of=/dev/vda bs=1M count=2048

>>>>> During the dd to the disk, the connection to the QEMU process is lost. This happens every time. We have https://bugzilla.redhat.com/show_bug.cgi?id=1279777 open for the qemu issue. When the connection was not lost at this point, I performed step 4, and the connection to the QEMU process was lost after step 5.

4. When the above command has finished, create a VM snapshot
5. Write data to the second part of the disk: open a terminal inside the VM and run the following command
   dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048

So I cannot verify this bz yet. It depends on the fix for the qemu issue.
I saw the qemu bug was closed. What does that mean for this bug?
(In reply to Yaniv Dary from comment #8)
> I saw the qemu bug was closed. What does that mean for this bug?

I performed the scenario above and it is working now.

Verified with the following code:
---------------------------------------
vdsm-4.17.19-0.el7ev.noarch
rhevm-3.6.3-0.1.el6.noarch

Verified with the following scenario:
--------------------------------------
Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk: open a terminal inside the VM and run the following command
   dd if=/dev/zero of=/dev/vda bs=1M count=2048
4. When the above command has finished, create a VM snapshot
5. Write data to the second part of the disk: open a terminal inside the VM and run the following command
   dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048
6. When the above command has finished, delete the snapshot created in step 4

>>>>> The snapshot is successfully deleted.

Moving to VERIFIED.