Bug 1170712 - Live Merge: Failed to remove snapshot on block storage due to -ENOSPC
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 4.17.8
Assignee: Adam Litke
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On: 1279777
Blocks:
 
Reported: 2014-12-04 16:35 UTC by Adam Litke
Modified: 2016-03-10 13:45 UTC
CC List: 15 users

Fixed In Version: v4.17.4
Clone Of:
Environment:
Last Closed: 2016-02-10 12:49:50 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-3.6.0+
ylavi: planning_ack+
amureini: devel_ack+
rule-engine: testing_ack+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1155583 0 urgent CLOSED [Block storage] Basic Live Merge after Delete Snapshot fails 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1168327 0 medium CLOSED Live Merge: optimize internal volume size 2021-09-09 11:40:59 UTC
oVirt gerrit 44331 0 master MERGED Live Merge: Fix pre-extension calculation for chunked drives Never
oVirt gerrit 45058 0 None None None Never

Internal Links: 1155583 1168327

Description Adam Litke 2014-12-04 16:35:29 UTC
Description of problem:

During live merge, content from a volume 'top' is merged into a volume 'base'.  If 'base' is thin provisioned it may need to be extended to accommodate the new data that will be written.  Until libvirt provides a monitoring API we attempt to pre-extend 'base' when first starting the merge.  The current code extends 'base' to the currently allocated size of 'top', but this heuristic is incorrect in some cases.  If 'base' and 'top' have similar allocated sizes but 'top' contains many blocks that were not allocated in 'base', we will not extend 'base' enough and the merge will fail with an -ENOSPC error.


Version-Release number of selected component (if applicable):
vdsm-4.17.0-98-g13bdaa3


How reproducible: Always


Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048
4. When the above command is finished, create a VM snapshot
5. Write data to the second part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048
6. When the above command is finished, delete the snapshot created in step 4

Actual results:
The merge starts and copies data for a while, but ends before finishing.  Engine reports that the snapshot failed to delete.

Expected results:
The snapshot should be deleted successfully.

Additional info:

The problem is that when starting the merge we calculate the extension as follows:

topSize = drive.apparentsize  # current (apparent) size of 'top'
...
self.extendDriveVolume(drive, baseVolUUID, topSize)  # extend 'base' to topSize plus one chunk

In this scenario, baseVol is 3G and topVol is 3G.  We will extend baseVol to 4G (topSize = 3G plus a 1G chunk).  During the merge we need to write 2G worth of data (which in a COW image will take somewhat more space than 2G due to qcow2 metadata) on top of the 3G already allocated in baseVol, so roughly 5G or more is required.  Since baseVol was only extended to 4G, this results in an -ENOSPC error.

I think the only solution to the problem is to extend 'base' by the allocated size of 'top' plus one extra chunk.  This should cover the worst-case scenario.
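
A rough sketch of the proposed calculation (not the actual vdsm code; the helper name and the 1G chunk size are assumptions taken from the example above):

GiB = 1024 ** 3
CHUNK_SIZE = 1 * GiB  # assumption: 1G extension chunk, as in the example above

def new_base_size(base_alloc, top_alloc):
    # Worst case: every block allocated in 'top' is absent from 'base',
    # so 'base' must hold its current data plus all of 'top', plus one
    # extra chunk of headroom for qcow2 metadata.
    return base_alloc + top_alloc + CHUNK_SIZE

# Scenario from the description: new_base_size(3 * GiB, 3 * GiB) gives 7G,
# comfortably above the ~5G (plus metadata) the merge actually writes,
# whereas the old heuristic stopped at 4G.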

Note that even this new amount might not be enough if heavy write activity touching new parts of the disk occurs on 'top' during the actual merge.  In that case, simply restarting the merge is a workaround.

Comment 1 Allon Mureinik 2014-12-08 14:31:11 UTC
This is an edge case of an edge case - pushing out.

Comment 2 Sandro Bonazzola 2015-03-03 12:56:24 UTC
Re-targeting to 3.5.3 since this bug has not been marked as blocker for 3.5.2 and we have already released 3.5.2 Release Candidate.

Comment 3 Allon Mureinik 2015-04-06 11:35:32 UTC
Adam, is this still relevant with the volume pre-extension you introduced?

Comment 4 Adam Litke 2015-04-06 17:57:35 UTC
It's still relevant until we do the actual dynamic resizing (covered by bug 1168327).  Adding the dependency now.

Comment 5 Adam Litke 2015-08-11 14:35:52 UTC
We can fix this issue by fixing the pre-extension calculation when performing a live merge on thinly provisioned block storage.  See https://gerrit.ovirt.org/#/c/44331/.

Comment 6 Adam Litke 2015-08-31 13:32:49 UTC
On vdsm ovirt-3.6 branch as commit 6070392aba975a0cbdbfa340fd032c7d62a48ee7

Comment 7 Kevin Alon Goldblatt 2015-11-19 09:27:58 UTC
Tested with the following code:
-----------------------------------
rhevm-3.6.0.3-0.1.el6.noarch
vdsm-4.17.10.1-0.el7ev.noarch

Verified with the following steps:
----------------------------------
Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048  >>>>> During the dd to the disk, the connection to the QEMU process is lost. This happens every time. We have bz https://bugzilla.redhat.com/show_bug.cgi?id=1279777 open for the qemu issue.
When the connection was not lost at this point, I performed step 4, and the connection to the qemu process was lost after step 5.

4. When the above command is finished, create a VM snapshot
5. Write data to the second part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048

So I cannot verify this bz yet.  It depends on the fix for the qemu issue.

Comment 8 Yaniv Lavi 2016-01-17 08:42:13 UTC
I saw the qemu bug was closed. What does that mean for this bug?

Comment 9 Kevin Alon Goldblatt 2016-02-02 15:45:53 UTC
(In reply to Yaniv Dary from comment #8)
> I saw the qemu bug was closed. What does that mean for this bug?

I performed the scenario above and it is working now. 

Verified with the following code:
---------------------------------------
vdsm-4.17.19-0.el7ev.noarch
rhevm-3.6.3-0.1.el6.noarch

Verified with the following scenario:
--------------------------------------
Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048
4. When the above command is finished, create a VM snapshot
5. Write data to the second part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048
6. When the above command is finished, delete the snapshot created in step 4 >>>>> The snapshot is successfully deleted.

Moving to VERIFIED

