Bug 1170712 - Live Merge: Failed to remove snapshot on block storage due to -ENOSPC
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 4.17.8
Assignee: Adam Litke
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On: 1279777
Blocks:
 
Reported: 2014-12-04 16:35 UTC by Adam Litke
Modified: 2016-03-10 13:45 UTC
CC List: 15 users

Fixed In Version: v4.17.4
Clone Of:
Environment:
Last Closed: 2016-02-10 12:49:50 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-3.6.0+
ylavi: planning_ack+
amureini: devel_ack+
rule-engine: testing_ack+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1155583 0 urgent CLOSED [Block storage] Basic Live Merge after Delete Snapshot fails 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1168327 0 medium CLOSED Live Merge: optimize internal volume size 2021-09-09 11:40:59 UTC
oVirt gerrit 44331 0 master MERGED Live Merge: Fix pre-extension calculation for chunked drives Never
oVirt gerrit 45058 0 None None None Never

Internal Links: 1155583 1168327

Description Adam Litke 2014-12-04 16:35:29 UTC
Description of problem:

During live merge, content from a volume 'top' is merged into a volume 'base'.  If 'base' is thin provisioned it may need to be extended to accommodate the new data that will be written.  Until libvirt provides a monitoring API we attempt to pre-extend 'base' when first starting the merge.  The current code extends 'base' to the currently allocated size of 'top', but this heuristic is incorrect in some cases.  If 'base' and 'top' have similar allocated sizes but 'top' contains many blocks that were not allocated in 'base', we will not extend 'base' enough and the merge will fail with an -ENOSPC error.


Version-Release number of selected component (if applicable):
vdsm-4.17.0-98-g13bdaa3


How reproducible: Always


Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048
4. When the above command is finished, create a VM snapshot
5. Write data to the second part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048
6. When the above command is finished, delete the snapshot created in step 4

Actual results:
The merge starts and copies data for a while, but ends before finishing.  Engine reports that the snapshot failed to delete.

Expected results:
The snapshot should be deleted successfully.

Additional info:

The problem is that when starting the merge we calculate the extension as follows:

topSize = drive.apparentsize  # current (apparent) size of 'top'
...
self.extendDriveVolume(drive, baseVolUUID, topSize)  # extend 'base' to topSize plus one chunk

In this scenario, baseVol is 3G and topVol is 3G.  We will extend baseVol to 4G (topSize = 3G plus a 1G chunk).  During the merge we need to write 2G worth of data (which in a COW image will take somewhat more space than 2G due to qcow2 metadata) on top of the 3G already allocated in baseVol, so roughly 5G or more is required.  Since baseVol was only extended to 4G, this results in an -ENOSPC error.

I think the only solution to the problem is to extend 'base' by the allocated size of 'top' plus one extra chunk.  This should cover the worst-case scenario.
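
A rough sketch of the proposed calculation (not the actual vdsm code; the helper name and the 1G chunk size are assumptions taken from the example above):

GiB = 1024 ** 3
CHUNK_SIZE = 1 * GiB  # assumption: 1G extension chunk, as in the example above

def new_base_size(base_alloc, top_alloc):
    # Worst case: every block allocated in 'top' is absent from 'base',
    # so 'base' must hold its current data plus all of 'top', plus one
    # extra chunk of headroom for qcow2 metadata.
    return base_alloc + top_alloc + CHUNK_SIZE

# Scenario from the description: new_base_size(3 * GiB, 3 * GiB) gives 7G,
# comfortably above the ~5G (plus metadata) the merge actually writes,
# whereas the old heuristic stopped at 4G.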

Note that even this new amount might not be enough if heavy write activity touching new parts of the disk occurs on 'top' during the actual merge.  In that case, simply restarting the merge is a workaround.

Comment 1 Allon Mureinik 2014-12-08 14:31:11 UTC
This is an edge case of an edge case - pushing out.

Comment 2 Sandro Bonazzola 2015-03-03 12:56:24 UTC
Re-targeting to 3.5.3 since this bug has not been marked as blocker for 3.5.2 and we have already released 3.5.2 Release Candidate.

Comment 3 Allon Mureinik 2015-04-06 11:35:32 UTC
Adam, is this still relevant with the volume pre-extension you introduced?

Comment 4 Adam Litke 2015-04-06 17:57:35 UTC
It's still relevant until we do the actual dynamic resizing (covered by bug 1168327).  Adding the dependency now.

Comment 5 Adam Litke 2015-08-11 14:35:52 UTC
We can fix this issue by fixing the pre-extension calculation when performing a live merge on thinly provisioned block storage.  See https://gerrit.ovirt.org/#/c/44331/.

Comment 6 Adam Litke 2015-08-31 13:32:49 UTC
On vdsm ovirt-3.6 branch as commit 6070392aba975a0cbdbfa340fd032c7d62a48ee7

Comment 7 Kevin Alon Goldblatt 2015-11-19 09:27:58 UTC
Tested with the following code:
-----------------------------------
rhevm-3.6.0.3-0.1.el6.noarch
vdsm-4.17.10.1-0.el7ev.noarch

Verified with the following steps:
----------------------------------
Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048  >>>>> During the dd to the disk, the connection to the QEMU process is lost. This happens every time. We have bz https://bugzilla.redhat.com/show_bug.cgi?id=1279777 open for the qemu issue.
When the connection was not lost at this point, I performed step 4, and the connection to the qemu process was lost after step 5.

4. When the above command is finished, create a VM snapshot
5. Write data to the second part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048

So I cannot verify this bz yet.  It depends on the fix for the qemu issue.

Comment 8 Yaniv Lavi 2016-01-17 08:42:13 UTC
I saw the qemu bug was closed. What does that mean for this bug?

Comment 9 Kevin Alon Goldblatt 2016-02-02 15:45:53 UTC
(In reply to Yaniv Dary from comment #8)
> I saw the qemu bug was closed. What does that mean for this bug?

I performed the scenario above and it is working now. 

Verified with the following code:
---------------------------------------
vdsm-4.17.19-0.el7ev.noarch
rhevm-3.6.3-0.1.el6.noarch

Verified with the following scenario:
--------------------------------------
Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048
4. When the above command is finished, create a VM snapshot
5. Write data to the second part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048
6. When the above command is finished, delete the snapshot created in step 4 >>>>> The snapshot is successfully deleted.

Moving to VERIFIED

