Bug 1091094

Summary: VM Disk extended after snapshot until block storage domain is out of space
Product: [Retired] oVirt
Component: vdsm
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED CURRENTRELEASE
Reporter: Adam Litke <alitke>
Assignee: Adam Litke <alitke>
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Docs Contact:
CC: alitke, amureini, bazulay, bugs, eblake, fsimonce, gklein, iheim, mgoldboi, nsoffer, ogofen, rbalakri, tnisan, yeylon
Keywords: Reopened
Target Milestone: ---
Target Release: 3.5.0
Whiteboard: storage
Fixed In Version: v4.16.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-17 12:38:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1174791, 1176673, 1198128

Attachments: vdsm log, engine log

Description Adam Litke 2014-04-24 21:08:21 UTC
Description of problem:
I have an iscsi storage domain with a pool of VMs.  The VM disks are 2GB (thin provisioned).  The system behaves normally until I use ovirt-engine to create a VM snapshot.  Once the libvirt snapshot API finishes pivoting to the new volume, vdsm will extend the new leaf volume forever (until it consumes all free space on the block storage domain).
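For context, vdsm extends a thinly provisioned leaf whenever the drive's reported write watermark (allocation) nears the end of the LV, so if that allocation value is wrong the extension rule can fire on every monitoring cycle. A schematic sketch of the idea (the function, chunk size, and threshold below are illustrative placeholders, not vdsm's actual code or values):

GiB = 1024 ** 3
CHUNK = 1 * GiB               # hypothetical extension step
THRESHOLD = 512 * 1024 ** 2   # hypothetical free-space watermark

def needs_extension(allocation, physical):
    # Extend when the guest's write position nears the end of the LV.
    return physical - allocation < THRESHOLD

# Sane report: plenty of headroom, no extension requested.
print(needs_extension(allocation=1 * GiB, physical=3 * GiB))  # False
# Broken report where allocation always equals physical: the check is
# True on every cycle, so the LV keeps growing until the domain is full.
print(needs_extension(allocation=3 * GiB, physical=3 * GiB))  # True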

Version-Release number of selected component (if applicable):
Post 3.4 release (master)
vdsm: e826e76c6c7df3791964b41c7720e39a406a98f6
ovirt-engine: 6dc681ba7b1682b9a3eed56e8b986d6a4a06e3ad

How reproducible: (For me) Always


Steps to Reproduce:
1. Create a pool of VMs from a template on an iscsi sd.  The disk should be a 2G thinly provisioned volume.
2. Start a VM
3. Create a disk-only snapshot of the VM
4. Monitor the storage "Used" and "Allocated" fields from ovirt engine for the data domain being used

Actual results:
"Used" will climb continuously and surpass "Allocated" as long as the VM continues to run.  On the vdsm side, the disk apparent/physical size will continue to increase.


Expected results:
The storage domain should reflect only the allocation needed to accommodate the new snapshot volume.  "Used" should not increase dramatically (and certainly not above "Allocated").

Additional info:
See attached vdsm.log and engine.log

Comment 1 Adam Litke 2014-04-24 21:09:09 UTC
Created attachment 889443 [details]
vdsm log

Comment 2 Adam Litke 2014-04-24 21:09:41 UTC
Created attachment 889444 [details]
engine log

Comment 3 Adam Litke 2014-04-25 21:35:34 UTC
Hmm, I am seeing bad information in the output of virsh domblkinfo as well.  I think the problem lies below vdsm actually.  The symptom here is that allocation jumps up to be equal to physical in some cases.

Federico, might this be a symptom of the lvmetad problem not being solved on my host?  I have set use_lvmetad to 0 and made sure that the daemon is not running.

Comment 4 Federico Simoncelli 2014-04-28 14:54:51 UTC
(In reply to Adam Litke from comment #3)
> Hmm, I am seeing bad information in the output of virsh domblkinfo as well. 
> I think the problem lies below vdsm actually.  The symptom here is that
> allocation jumps up to be equal to physical in some cases.
> 
> Federico, might this be a symptom of the lvmetad problem not being solved on
> my host?  I have set use_lvmetad to 0 and made sure that the daemon is not
> running.

No, it's not related to lvmetad. What is the libvirt version that you're using?

Comment 5 Adam Litke 2014-04-28 17:06:36 UTC
libvirt-1.2.3 (compiled from source and manually installed on a F20 host)

Comment 7 Sandro Bonazzola 2014-05-08 13:52:13 UTC
This is an automated message.

oVirt 3.4.1 has been released.
This issue has been retargeted to 3.4.2 as it has severity high; please retarget if needed.
If this is a blocker please add it to the tracker Bug #1095370

Comment 8 Federico Simoncelli 2014-05-08 20:13:07 UTC
Adam, have you discovered whether this is caused by the libvirt build you made? Can we close this?

Comment 9 Adam Litke 2014-06-09 13:57:29 UTC
Closing since this seems to be limited to the custom libvirt build I was working with.  I'll reopen if I see it in an official libvirt release.

Comment 10 Adam Litke 2014-06-09 20:00:06 UTC
I've reproduced this and determined the root cause.  Taking bug.

Comment 11 Nir Soffer 2014-06-09 20:05:19 UTC
(In reply to Adam Litke from comment #10)
> I've reproduced this and determined the root cause.  Taking bug.

Can you share the root cause with us?

Comment 12 Adam Litke 2014-06-09 20:50:47 UTC
(In reply to Nir Soffer from comment #11)
> (In reply to Adam Litke from comment #10)
> > I've reproduced this and determined the root cause.  Taking bug.
> 
> Can you share the root cause with us?

Of course :) I was just waiting for the fix to appear in gerrit.

As explained in http://gerrit.ovirt.org/#/c/28531/1, when a snapshot XML does not specify that it is of type block, type file is assumed.  This has the side effect of converting the libvirt disk to type file.  When libvirt returns the high write watermark information for a drive, it always returns the physical file size for file disks, not the value reported by qemu.
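For illustration only (this is not the actual vdsm patch; the drive name and LV path are placeholders), the snapshot disk element has to carry an explicit type so libvirt keeps treating the drive as a block device instead of falling back to the file default:

import xml.etree.ElementTree as ET

def snapshot_disk_xml(name, path, disk_type='block'):
    # Explicit type='block' prevents libvirt from assuming type='file'.
    disk = ET.Element('disk', {'name': name,
                               'snapshot': 'external',
                               'type': disk_type})
    source = ET.SubElement(disk, 'source')
    if disk_type == 'block':
        source.set('dev', path)   # block device (LV) path
    else:
        source.set('file', path)  # plain file path
    return ET.tostring(disk).decode()

print(snapshot_disk_xml('vda', '/dev/vg-uuid/new-leaf-lv'))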

Comment 13 Allon Mureinik 2014-06-10 08:52:52 UTC
(In reply to Adam Litke from comment #12)
> (In reply to Nir Soffer from comment #11)
> > (In reply to Adam Litke from comment #10)
> > > I've reproduced this and determined the root cause.  Taking bug.
> > 
> > Can you share the root cause with us?
> 
> Of course :) I was just waiting for the fix to appear in gerrit.
> 
> As explained in http://gerrit.ovirt.org/#/c/28531/1, when a snapshot XML
> does not specify that it is of type block, type file is assumed.  This has
> the side effect of converting the libvirt disk to type file.  When libvirt
> returns the high write watermark information for a drive, it always returns
> the physical file size for file disks, not the value reported by qemu.

Just to add the info from the patch - this behavior "change" was introduced in libvirt 1.2.2.
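For reference, a minimal way to check whether a host runs an affected libvirt (the 1.2.2 threshold is taken from the comment above; the check itself is only illustrative, not logic taken from vdsm):

import libvirt

# libvirt encodes its version as major * 1000000 + minor * 1000 + release.
conn = libvirt.open('qemu:///system')
if conn.getLibVersion() >= 1002002:  # 1.2.2 or newer
    print('snapshot XML should carry an explicit disk type')
else:
    print('older libvirt: the default disk type handling is not affected')
conn.close()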

Comment 14 Allon Mureinik 2014-07-18 17:39:16 UTC
Adam - this was merged in oVirt 3.5.
Is there any sense in backporting it to 3.4.z too?

Comment 15 Adam Litke 2014-07-18 19:02:30 UTC
I guess we probably should.  I'll need to backport http://gerrit.ovirt.org/#/c/28531/1 and http://gerrit.ovirt.org/30228 (which fixes a regression introduced by the first patch).  Let me build up a test environment and get a patch for it submitted.

Comment 16 Adam Litke 2014-07-21 19:12:31 UTC
After further investigation I am reversing myself regarding a 3.4 backport.  I don't see this problem in my 3.4 environment.  I think the problem only manifests with newer versions of libvirt (i.e. the current Fedora 20 version is 1.1.3.5-2 and that works fine).  Given this, I think it only makes sense to target 3.5, where the newer version of libvirt will be used.

Comment 17 Kevin Alon Goldblatt 2014-08-17 14:33:33 UTC
I ran the scenario from above as follows:

Steps to Reproduce:
1. Created a VM with one 2 GB thinly provisioned disk on an iSCSI block device
2. Created a template of the VM
3. Created a Pool of 3 VMs from the template
4. Brought up one of the VMs 
5. Checked the Used and Available space on the Storage Domain
6. Created a snapshot of the VM that was brought up
7. Wrote to the disk of the VM (tried installing an OS larger than 2 GB) >>> The installation failed, which means the allocation is now correctly limited to the virtual size of the leaf.
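As an additional check on the vdsm host, the leaf LV's size can be compared against the 2 GB virtual size with lvs. A sketch only, with placeholder VG/LV names:

import subprocess

# The VG and LV names are hypothetical; point this at the snapshot's leaf volume.
out = subprocess.check_output(
    ['lvs', '--noheadings', '--nosuffix', '--units', 'b',
     '-o', 'lv_size', 'vg-uuid/leaf-volume-uuid'])
lv_size = int(float(out.decode().strip()))
virtual_size = 2 * 1024 ** 3
print('leaf LV: %d bytes, virtual size: %d bytes' % (lv_size, virtual_size))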


Moving to Verified

Comment 18 Sandro Bonazzola 2014-10-17 12:38:23 UTC
oVirt 3.5 has been released and should include the fix for this issue.

Comment 19 Nir Soffer 2014-12-23 08:06:47 UTC
Ori, can you explain why this bug blocks bug 1176673?

Comment 20 Ori Gofen 2014-12-23 09:21:55 UTC
Nir, I think it's the automatic bug tracker; I haven't set this bug as a dependency (even though I can perfectly understand the logic :))