Bug 1198512 - Live-deleting a snapshot of preallocated disk results in a block domain using up all available space
Summary: Live-deleting a snapshot of preallocated disk results in a block domain using up all available space
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Adam Litke
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1199816
 
Reported: 2015-03-04 10:12 UTC by Kevin Alon Goldblatt
Modified: 2016-03-09 19:32 UTC
CC List: 15 users

Fixed In Version: v4.17.0.3
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1199816 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:32:42 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
nsoffer: needinfo-
amureini: needinfo-


Attachments
server, engine and vdsm logs (1.44 MB, application/x-gzip), 2015-03-04 10:15 UTC, Kevin Alon Goldblatt
vdsm log with domainxml (122.29 KB, application/x-xz), 2015-03-04 20:06 UTC, Nir Soffer
vdsm log with domainxml using ovirt-3.5 (117.04 KB, application/x-xz), 2015-03-04 21:01 UTC, Nir Soffer
vdsm log with domainxml using ovirt-3.5 and preallocated disk (627.66 KB, application/x-xz), 2015-03-05 18:02 UTC, Nir Soffer


Links
Red Hat Product Errata RHBA-2016:0362 (SHIPPED_LIVE): vdsm 3.6.0 bug fix and enhancement update, last updated 2016-03-09 23:49:32 UTC
oVirt gerrit 36923

Description Kevin Alon Goldblatt 2015-03-04 10:12:35 UTC
Description of problem:
Deleting a disk snapshot while the VM is live causes lvextend to loop and use up all available space on the storage domain

Version-Release number of selected component (if applicable):
rhevm-3.5.1-0.1.el6ev.noarch
vdsm-4.16.12-2.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
Create a VM with 2 block disks
Create a snapshot
Start the VM
Via the Storage > Disk Snapshots tab delete one of the disk snapshots
When deleting the second disk snapshot, the storage domain is reported as having run out of disk space

Actual results:
The storage domain ran out of disk space due to repeated lvextend requests, ending in a VolumeGroupSizeError exception


Expected results:
The snapshot deletion should complete without runaway lvextend requests or exhausting the storage domain


Additional info:

engine.log.........................................................................
2015-03-03 18:36:21,262 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-36) [1804c42c] Correlation ID: 1804c42c, Job ID: c0a7dab4-9c77-463a-9a11-f27b1335dcb9, Call Stack: null, Custom Event ID: -1, Message: Disk '1177138_Disk1' from Snapshot(s) 'snap1' of VM '1177138' deletion was initiated by admin@internal.
.
.
.
2015-03-03 18:38:53,928 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-56) [38d525f8] Correlation ID: 1804c42c, Job ID: c0a7dab4-9c77-463a-9a11-f27b1335dcb9, Call Stack: null, Custom Event ID: -1, Message: Disk '1177138_Disk1' from Snapshot(s) '<UNKNOWN>' of VM '1177138' deletion has been completed (User: admin@internal).





vdsm.log.........................................................................
6c683fa6-a14b-441a-8065-2f4a976357a1::ERROR::2015-03-03 18:36:42,427::storage_mailbox::172::Storage.SPM.Messages.Extend::(processRequest) processRequest: Exception caught while trying to extend volume: 26b70af6-e0b3-4cac-b9cb-c5fafd1ce611 in domain: ef0105d0-d3cb-403d-8f68-61a2393f9798
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 166, in processRequest
    pool.extendVolume(volume['domainID'], volume['volumeID'], size)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1303, in extendVolume
    sdCache.produce(sdUUID).extendVolume(volumeUUID, size, isShuttingDown)
  File "/usr/share/vdsm/storage/blockSD.py", line 1315, in extendVolume
    lvm.extendLV(self.sdUUID, volumeUUID, size)  # , isShuttingDown)
  File "/usr/share/vdsm/storage/lvm.py", line 1152, in extendLV
    _resizeLV("lvextend", vgName, lvName, size)
  File "/usr/share/vdsm/storage/lvm.py", line 1146, in _resizeLV
    free_size / constants.MEGAB))
VolumeGroupSizeError: Volume Group not big enough: ('ef0105d0-d3cb-403d-8f68-61a2393f9798/26b70af6-e0b3-4cac-b9cb-c5fafd1ce611 4096 > 768 (MiB)',)


2015-03-03 18:34:22,776::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm lvextend --config ' devices
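
For context, the traceback above ends with vdsm's LVM wrapper raising VolumeGroupSizeError because the volume group no longer has enough free space for the requested extension. The following is a minimal, self-contained sketch of that kind of free-space guard around lvextend; the helper names and the exact check are illustrative assumptions, not the actual vdsm code:

import subprocess

MEGAB = 2 ** 20

class VolumeGroupSizeError(Exception):
    pass

def vg_free_mb(vg_name):
    # Ask LVM for the volume group's free space in bytes and convert to MiB.
    out = subprocess.check_output(
        ["lvm", "vgs", "--noheadings", "--nosuffix", "--units", "b",
         "-o", "vg_free", vg_name])
    return int(float(out.decode())) // MEGAB

def extend_lv(vg_name, lv_name, extend_mb):
    # Refuse to call lvextend when the VG cannot satisfy the request,
    # producing a "Volume Group not big enough" style error like the one
    # in the traceback above instead of attempting a doomed extension.
    free_mb = vg_free_mb(vg_name)
    if extend_mb > free_mb:
        raise VolumeGroupSizeError(
            "%s/%s %d > %d (MiB)" % (vg_name, lv_name, extend_mb, free_mb))
    subprocess.check_call(
        ["lvm", "lvextend", "-L", "+%dm" % extend_mb,
         "%s/%s" % (vg_name, lv_name)])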

Comment 1 Kevin Alon Goldblatt 2015-03-04 10:15:30 UTC
Created attachment 997803 [details]
server, engine and vdsm logs

Adding logs

Comment 2 Allon Mureinik 2015-03-04 10:35:20 UTC
Possible duplicate of bug 1196049.
Adam/Nir - please take a look?

Comment 3 Nir Soffer 2015-03-04 19:05:44 UTC
Kevin, please attach the output of
cat /etc/redhat-release
rpm -qa

Comment 4 Nir Soffer 2015-03-04 20:06:19 UTC
Created attachment 998038 [details]
vdsm log with domainxml

Comment 5 Nir Soffer 2015-03-04 20:10:39 UTC
In the vdsm logs Kevin posted, we can see a lot of extendDrivesIfNeeded calls showing wrong allocation/physical readings from libvirt. These readings happen when a block disk is treated as a file disk by libvirt.
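
A quick way to triage a log like the attached one for such readings is to scan the extendDrivesIfNeeded lines and flag samples where libvirt reports the volume as fully allocated. This is a hypothetical helper written for this bug, not part of vdsm, and it assumes the log lines are not wrapped:

import re

# Matches the extendDrivesIfNeeded lines quoted in this bug, e.g.
# "...Requesting extension for volume <uuid> on domain <uuid>
#  (apparent: A, capacity: C, allocated: L, physical: P)"
PATTERN = re.compile(
    r"extendDrivesIfNeeded.*volume (?P<vol>\S+) on domain \S+ "
    r"\(apparent: (?P<apparent>\d+), capacity: (?P<capacity>\d+), "
    r"allocated: (?P<allocated>\d+), physical: (?P<physical>\d+)\)")

def suspicious_extensions(log_path):
    # Yield (volume, allocated) for extension requests where
    # allocated == physical == capacity, which for a thinly provisioned
    # qcow2 volume on block storage suggests libvirt is treating it as a
    # raw/file disk.
    with open(log_path) as log:
        for line in log:
            m = PATTERN.search(line)
            if m is None:
                continue
            alloc, phys, cap = (int(m.group(k))
                                for k in ("allocated", "physical", "capacity"))
            if alloc == phys == cap:
                yield m.group("vol"), alloc

for vol, alloc in suspicious_extensions("vdsm.log"):
    print("suspicious extension request: volume %s, allocated %d bytes" % (vol, alloc))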

I could not reproduce this with master on Fedora 21:

1. Use existing vm with one block disk
2. Add 2 thin provisioned block disks, 1G each
3. Create snapshot including only the 2 new disks
4. Start vm
5. Select "Storage"/"Disk snapshots" tab
6. Delete one disk snapshot
7. Delete the other

No unusual extension requests are seen in the log, and the disk type seems to be correct.

Kevin, can you reproduce this using the steps above?

Comment 6 Nir Soffer 2015-03-04 21:01:20 UTC
Created attachment 998048 [details]
vdsm log with domainxml using ovirt-3.5

Comment 7 Nir Soffer 2015-03-04 21:05:57 UTC
I could not reproduce it with ovirt-3.5 (which is similar to rhev-3.5.1) on Fedora 21, so the issue is either related to RHEL 7.1 or the steps to reproduce are incorrect.

Comment 8 Adam Litke 2015-03-04 21:55:43 UTC
Unreproducible for me on RHEL-7.1 beta with the following steps:

Create vm with 2 block disks
Create a snapshot
Start the vm
Via the Storage > Disk Snapshot tab delete one of the disk snapshots
Via the Storage > Disk Snapshot tab delete the final disk snapshot

Both are removed successfully and no unwarranted extensions were observed.

Software versions:
libvirt-1.2.8-16.el7.x86_64
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64
vdsm-4.16.10-8.gitc937927.el7.x86_64

I'd really appreciate knowing the package environment of the initial reproduction. Also, please make sure you are testing on a fully updated system.

Comment 9 Kevin Alon Goldblatt 2015-03-05 12:22:58 UTC
(In reply to Nir Soffer from comment #3)
> Kevin, please attach the output of
> cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
> rpm -qa
vdsm-4.16.12-2.el7ev.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.1.x86_64
libvirt-client-1.2.8-16.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7.x86_64
libvirt-daemon-1.2.8-16.el7.x86_64
libvirt-python-1.2.8-7.el7.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7.x86_64
libvirt-lock-sanlock-1.2.8-16.el7.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7.x86_64
libvirt-daemon-kvm-1.2.8-16.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7.x86_64

Comment 10 Nir Soffer 2015-03-05 17:16:27 UTC
Evidence from vdsm logs, explaining comment 5:

1. The logs do not show the creation of the disks; next time please post
   logs that show the entire tested flow.

2. As soon as we finish the *first* merge, we start to get bad readings of the current
   volume watermarks:

Thread-523::INFO::2015-03-03 18:34:12,472::logUtils::47::dispatcher::(wrapper) Run and protect: imageSyncVolumeChain, Return response: None
...
Thread-523::INFO::2015-03-03 18:34:12,473::vm::6097::vm.Vm::(run) vmId=`d593af3c-f40a-4047-b543-4e28ff5e5cb2`::Synchronization completed (job 30b7ac94-beab-4bfa-80a8-e2f8c85781b5)
...
Thread-456::INFO::2015-03-03 18:34:12,576::vm::2547::vm.Vm::(extendDrivesIfNeeded) vmId=`d593af3c-f40a-4047-b543-4e28ff5e5cb2`::Requesting extension for volume 5e22f70f-84ba-4723-9ad5-34aeec7cccdf on domain ef0105d0-d3cb-403d-8f68-61a2393f9798 (apparent: 1073741824, capacity: 3221225472, allocated: 3221225472, physical: 3221225472)
...
Thread-456::INFO::2015-03-03 18:34:16,619::vm::2547::vm.Vm::(extendDrivesIfNeeded) vmId=`d593af3c-f40a-4047-b543-4e28ff5e5cb2`::Requesting extension for volume 5e22f70f-84ba-4723-9ad5-34aeec7cccdf on domain ef0105d0-d3cb-403d-8f68-61a2393f9798 (apparent: 4294967296, capacity: 4294967296, allocated: 4294967296, physical: 4294967296)
...
Thread-456::INFO::2015-03-03 18:34:20,637::vm::2547::vm.Vm::(extendDrivesIfNeeded) vmId=`d593af3c-f40a-4047-b543-4e28ff5e5cb2`::Requesting extension for volume 5e22f70f-84ba-4723-9ad5-34aeec7cccdf on domain ef0105d0-d3cb-403d-8f68-61a2393f9798 (apparent: 5368709120, capacity: 5368709120, allocated: 5368709120, physical: 5368709120)

Note that allocated == physical; it looks like the drive is treated as type="file",
but we don't have any info from libvirt to understand what happened.
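
To make the failure mode concrete, here is a toy model of the watermark-driven extension decision (illustrative only; WATERMARK, CHUNK and the exact condition are assumptions, not vdsm's actual constants or code). With correct readings the loop settles once the guest stops writing; with the bogus allocated == physical readings above, the condition is always true and the LV keeps growing until the volume group is exhausted:

GiB = 1024 ** 3
WATERMARK = 512 * 1024 ** 2   # assumed free-space threshold inside the volume
CHUNK = 1 * GiB               # assumed extension step

def needs_extension(allocated, physical):
    # Extend a thinly provisioned block volume when the guest's highest
    # write offset gets within WATERMARK of the current LV size.
    return physical - allocated < WATERMARK

allocated = physical = 3 * GiB
for _ in range(3):
    if needs_extension(allocated, physical):
        physical += CHUNK
        allocated = physical   # what the bad libvirt readings report back
        print("requesting extension to %d GiB" % (physical // GiB))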

Kevin, we need to reproduce this again on your setup. When you reproduce,
please dump the libvirt domain XML before and after removing each snapshot.

To dump domain xml you can do:

virsh dumpxml vm-name > before-removing-snapshot1.xml
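
If the libvirt Python bindings are installed on the host, the relevant part of the XML (the driver type per disk) can also be captured with a short script and compared before and after the merge. This is just a convenience helper sketched for this bug, not something vdsm ships:

import libvirt
import xml.etree.ElementTree as ET

def disk_formats(vm_name, uri="qemu:///system"):
    # Return {target dev: driver type} for every disk of the running VM,
    # e.g. {'vda': 'qcow2', 'vdb': 'raw'}.
    conn = libvirt.open(uri)
    try:
        dom = conn.lookupByName(vm_name)
        root = ET.fromstring(dom.XMLDesc(0))
        formats = {}
        for disk in root.findall("./devices/disk"):
            target = disk.find("target")
            driver = disk.find("driver")
            if target is not None and driver is not None:
                formats[target.get("dev")] = driver.get("type")
        return formats
    finally:
        conn.close()

print(disk_formats("vm-name"))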

Comment 11 Nir Soffer 2015-03-05 17:59:15 UTC
Ok, I managed to reproduce it on ovirt-3.5.

The correct steps to reproduce it are:

1. Create *preallocated* disk on block storage domain
2. Create snapshot including this disk
3. Start vm
4. From "Storage/Disk snapshots" remove the disk snapshot

As soon as we finish the merge we start to get bad watermark readings
leading to unlimited extension of the drive.

Looking at the libvirt domain XML, we can see that the drive format
was converted to "raw":

Before the pivot operation:

    <disk type='block' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
      <source dev='/rhev/data-center/cca1d436-8c16-4887-8c07-2a9ecc1c0830/54ba4f22-ee1a-4851-9710-0b242ebdc289/images/e64fa26b-8c43-4090-8a9b-f7dbe293982b/2d031756-70ed-4c74-9672-121d621038d6'/>
      <backingStore type='block' index='1'>
        <format type='raw'/>
        <source dev='/rhev/data-center/cca1d436-8c16-4887-8c07-2a9ecc1c0830/54ba4f22-ee1a-4851-9710-0b242ebdc289/images/e64fa26b-8c43-4090-8a9b-f7dbe293982b/../e64fa26b-8c43-4090-8a9b-f7dbe293982b/f6c44444-2dfd-4fe9-a989-cd48fddf9031'/>
        <backingStore/>
      </backingStore>
      <mirror type='block' job='active-commit' ready='yes'>
        <format type='raw'/>
        <source dev='/rhev/data-center/cca1d436-8c16-4887-8c07-2a9ecc1c0830/54ba4f22-ee1a-4851-9710-0b242ebdc289/images/e64fa26b-8c43-4090-8a9b-f7dbe293982b/../e64fa26b-8c43-4090-8a9b-f7dbe293982b/f6c44444-2dfd-4fe9-a989-cd48fddf9031'/>
      </mirror>
      <target dev='vdb' bus='virtio'/>
      <serial>e64fa26b-8c43-4090-8a9b-f7dbe293982b</serial>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>

After the pivot operation:

    <disk type='block' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native'/>
      <source dev='/rhev/data-center/cca1d436-8c16-4887-8c07-2a9ecc1c0830/54ba4f22-ee1a-4851-9710-0b242ebdc289/images/e64fa26b-8c43-4090-8a9b-f7dbe293982b/../e64fa26b-8c43-4090-8a9b-f7dbe293982b/f6c44444-2dfd-4fe9-a989-cd48fddf9031'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <serial>e64fa26b-8c43-4090-8a9b-f7dbe293982b</serial>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>

Since the disk format was changed to "raw", libvirt watermarks are incorrect.

Although the disk format was changed to "raw", vdsm continues to check whether the drive
needs extension:

Thread-6767::INFO::2015-03-05 12:26:23,981::vm::2547::vm.Vm::(extendDrivesIfNeeded) vmId=`5f28c331-4460-45fc-bcd1-ff9f7bc67415`::Requesting extension for volume f6c44444-2dfd-4fe9-a989-cd48fddf9031 on domain 54ba4f22-ee1a-4851-9710-0b242ebdc289 (apparent: 1073741824, capacity: 3221225472, allocated: 3221225472, physical: 3221225472)

Vdsm should stop checking such a drive, but it seems that the drive format was
not updated in the VM drive metadata after the pivot operation.

This issue was fixed upstream in:

commit 2369e831b157aab3fadde64c1507410b1484198b
Author: Adam Litke <alitke>
Date:   Tue Jan 13 11:46:28 2015 -0500

    Live Merge: Update drive.format after active layer merge
    
    A merge of the active layer can change the drive format from cow -> raw
    if a snapshot was merged into a raw backing file.  In that case we must
    correct the VM Drive metadata to ensure the drive is handled properly
    after the merge has finished.
    
    Change-Id: Ieb64bbfe798a27896442a173b7dac41cebc92543
    Signed-off-by: Adam Litke <alitke>
    Reviewed-on: http://gerrit.ovirt.org/36923
    Reviewed-by: Francesco Romani <fromani>
    Reviewed-by: Nir Soffer <nsoffer>
    Reviewed-by: Dan Kenigsberg <danken>

This patch was not backported to 3.5.
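
The essence of the fix, as described in the commit message above, is to re-read the drive format from the live domain XML once the active layer merge has pivoted, and to update the VM drive metadata so raw volumes are no longer monitored for extension. Below is a condensed sketch of that idea only; "drive" stands in for vdsm's drive object, with name, format and on_block_storage attributes assumed for illustration, and "dom" is a libvirt domain object. This is not the actual vdsm patch:

import xml.etree.ElementTree as ET

def refresh_drive_format(drive, dom):
    # After an active layer merge the disk may have pivoted from qcow2 to
    # its raw backing volume; re-read the driver type from the live domain
    # XML and store it on the drive so later monitoring decisions use the
    # correct format.
    root = ET.fromstring(dom.XMLDesc(0))
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is None or driver is None:
            continue
        if target.get("dev") == drive.name:
            drive.format = "cow" if driver.get("type") == "qcow2" else "raw"
            return

def needs_monitoring(drive):
    # Only thinly provisioned qcow2 volumes on block storage need the
    # watermark-driven extension; a preallocated raw volume does not.
    return drive.on_block_storage and drive.format == "cow"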

Comment 12 Nir Soffer 2015-03-05 18:02:29 UTC
Created attachment 998498 [details]
vdsm log with domainxml using ovirt-3.5 and preallocated disk

Comment 14 Allon Mureinik 2015-03-21 19:17:50 UTC
Removing the needinfo, as this bug is resolved.

Comment 15 Allon Mureinik 2015-04-16 08:41:23 UTC
Corrected the component to VDSM, which reset the flags.
Yaniv/Aharon, please re-ack.

Comment 17 Kevin Alon Goldblatt 2015-04-28 08:06:00 UTC
Tested using versions:
V3.6
ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64

Test scenario:

Steps to Reproduce:
Create vm with 2 block disks
Create a snapshot
Start the vm
Via the Storage > Disk Snapshots tab delete one of the disk snapshots
Via the Storage > Disk Snapshots tab delete the second disk snapshot
Both disk snapshots are deleted successfully and the storage domain displays the correct space allocation.

Moving to Verified

Comment 20 errata-xmlrpc 2016-03-09 19:32:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

