Bug 1321608
| Summary: | When using RHEV with thin provisioned block storage, SAN space is allocated upon creation of VM virtual disks, but never returned to the SAN after deleting virtual disks, so the storage usage never shrinks | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Greg Scott <gscott> |
| Component: | lvm2 | Assignee: | LVM and device-mapper development team <lvm-team> |
| lvm2 sub component: | Thin Provisioning | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | agk, amureini, fsimonce, gscott, heinzm, jbrassow, mkalinin, msnitzer, prajnoha, prockai, thornber, ylavi, zkabelac |
| Version: | 7.2 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-26 12:20:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: Greg Scott, 2016-03-28 15:24:41 UTC
WRT implementation, please consider comment 10 on bug 981626: https://bugzilla.redhat.com/show_bug.cgi?id=981626#c10 - using blkdiscard on the LV instead of "issue_discards = 1" is the preferable (safest) approach.

Yes - your comments in the BZ are quite true. There is a misunderstanding about what 'issue_discards = 1' really means. It has nothing in common with 'passing through' discards to the underlying device - that works no matter what issue_discards is set to. issue_discards only takes effect when some 'real' space in the VG is released (i.e. on lvremove), and that happens when the LV device itself is ALREADY destroyed - the individual released extent chunks are then discarded one by one (and yes, that can be a lengthy operation, preventing other lvm2 commands from taking action). With a thin pool there is no real space returned to the VG, as all the space is effectively still kept inside the data LV.

Now, this BZ effectively wants lvm2 to implement some kind of 'pre-remove command to execute'. So instead of the user executing blkdiscard, lvm2 could automatically call such an operation (on the still-active LV). By the nature of this operation there are races, since nothing prevents such an LV from being opened and written to again during that time window - but that is likely a price we need to live with. So as a workaround, until something 'clever' automates this in lvm2, calling blkdiscard prior to the 'lvremove' call will have the same effect.

Under the line: users should NOT actually use issue_discards by default - it makes 'vgcfgrestore' pointless. (I've already seen quite a few users left with just dumped LV content after they realized their lvreduce/lvremove operation was not what they really wanted.) Normally the space is TRIM-ed/discarded after a new LV is allocated and 'mkfs' is executed.
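To make the distinction above concrete, here is a minimal sketch of the two mechanisms being discussed. The VG and LV names are hypothetical, and this is illustration only, not a recommended default configuration.

```
# lvm.conf option discussed above (devices section). It only affects extents
# released back to the VG by lvremove/lvreduce; it does NOT help thin LVs,
# whose space stays inside the pool's data LV, and it defeats vgcfgrestore
# as a way to undo a mistaken lvremove.
#
#   devices {
#       issue_discards = 1
#   }

# Workaround described above, with hypothetical names: discard the
# still-active LV, then remove it. There is a small race window if
# something reopens and writes to the LV between the two commands.
blkdiscard /dev/vg_san/vm_disk_01
lvremove -y vg_san/vm_disk_01
```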
Thanks Zdenek. I set up this BZ mostly to ask for guidance on the best way to proceed, because the lost storage in this use case adds up to multiple terabytes. If blkdiscard is the best way to proceed and can work right now, I don't have strong feelings that LVM needs major surgery. For the RHEV/oVirt team and other engineering consumers of LVM: will blkdiscard achieve the goal of returning unallocated space back to the SAN, and is it easy to do? And how fast can we implement it?

I think the solution should be that the 'wipe after delete' action is converted to use 'write_same' (via blkdiscard) and fill the area with zeros.

(In reply to Yaniv Kaul from comment #22)
> I think the solution should be that the 'wipe after delete' action should be
> converted to use 'write_same' (via blkdiscard) and fill the area with zeros.

Generally true, but you can see here that we use postZeroes=0, i.e. no wiping, just a bunch of lvremoves.

OK - I'm really getting confused by all the comments and the BZ title. So if this is supposed to remain an 'lvm2' BZ, could anyone please condense what is supposedly the lvm2 problem to fix? From my current understanding it is not a problem with lvm2 thin provisioning and TRIM/discard support, and the lvm2 team has no real idea what RHEV calls thin provisioning in other contexts. We would like to see a basic summary of: the current input (the executed command), the current output (what is missing on the storage side?), and the expected fixes for lvm2.

Zdenek, I wish I could provide the info you're asking for, but from the field point of view I don't know whether we need lvm2 fixes, just RHEV fixes, or fixes to both. I don't know the internals of how this all works; I only know the overall behavior is broken and needs to be fixed yesterday. I also know there have been several BZs around this same problem and none of them have been resolved satisfactorily, so customers keep experiencing the problem I documented in the original problem statement. If RHEV does lvremoves, but the lvremove doesn't tell the SAN about it, that feels like an LVM problem to me. But I only see the whole package, not deep into its components. I was hoping this time for a meeting of the minds between RHEV, LVM2, and any other components that make up the total package so we can finally solve this problem once and for all. I'll leave the needinfo turned on because I'm not able to provide the info you need, but perhaps others from RHEV and other Engineering groups can. - Greg

Still confused... From previous comments it does look like RHEV is using lvm2 thin provisioning on top of SAN-style thin provisioning; if the SAN needs to release unused blocks, that looks like another kind of thin pool running on the SAN machine. It does not seem practical to have both technologies doing the same thing. Is there any practical reason to use lvm2 thin LVs in this case instead of plain linear LVs? (Snapshots?) As a workaround for lvm2 thins, RHEV may simply call 'blkdiscard' before calling lvremove (basically a wrapper shell script, lvremovediscardthin) - lvm2 cannot do anything better anyway, but at the moment lvm2 does not provide hooks for calling apps before lvremove.

Would this whole issue go away by using fully provisioned RHEV virtual disk images on top of SAN thin-provisioned LUNs? If that's an easy workaround and it solves the problem, I'm more than willing to try it. But if I use fully provisioned RHEV storage, does that consume more SAN space than thin-provisioned RHEV storage, even though the SAN LUN is thin provisioned?

Reading back the history of this BZ - mainly comment 7, its referenced BZ, and Federico's analysis - to get to the core, I'd probably prefer a conference call or IRC discussion to move forward more quickly here. It seems there is no lvm2 issue, as there is no thin LV in use, and for normal LVs we do send discards right to the underlying PVs (while holding the VG write lock). So does the PV device (the SAN) actually support discard? (This can easily be checked via the /sys/block/... content.) Given that the discard ioctl is synchronous and made while holding the VG write lock, as Federico correctly pointed out, it's much better to call blkdiscard before taking the lock.

I'll also explain a minor difference, as bug 981626 comment 10 is not completely correct. Issuing blkdiscard on an active LV has a small race problem: if there is still some user of the device, then after the discard operation the space may again be occupied by newly written content from that user. So lvm2 does NOT send discards on an ACTIVE LV - instead it discards areas on the PV, and for that operation lvm2 currently holds the lock so the LV cannot be activated. We may improve this eventually and deploy a polling architecture if that is requested, but it would be a non-trivial RFE.

Getting back to the BZ: where in this chain of operations is an issue with lvm2 seen? We would need the metadata and an exact trace of the individual lvm2 operations (not a mixed trace of python logging) where lvm2 does not discard a released block. We also need to see that the system supports discard on the PV device.

Can you please provide a reply on the above question?
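As a rough illustration of the two points just raised - checking whether the PV actually advertises discard support via sysfs/lsblk, and the 'blkdiscard before lvremove' wrapper idea mentioned earlier in the thread - here is a hedged sketch. The script name, arguments, and overall shape are hypothetical; it is not part of any shipped tool.

```
#!/bin/bash
# Hypothetical wrapper in the spirit of the "lvremovediscardthin" idea above.
# Usage: ./lvremove-discard.sh <vg> <lv>
set -euo pipefail

VG="$1"
LV="$2"

# 1. Find the PV(s) backing the VG and check discard support.
#    A DISC-MAX of 0 (equivalently, discard_max_bytes = 0 under
#    /sys/block/<dev>/queue/) means the device ignores discard requests.
for PV in $(pvs --noheadings -o pv_name --select "vg_name=${VG}"); do
    MAX=$(lsblk -dbno DISC-MAX "${PV}")
    if [ "${MAX}" -eq 0 ]; then
        echo "warning: ${PV} does not support discard" >&2
    fi
done

# 2. Discard the still-active LV so the SAN can reclaim the blocks.
#    As discussed above, there is a small race window if something reopens
#    and writes to the LV before it is removed.
blkdiscard "/dev/${VG}/${LV}"

# 3. Remove the LV as usual.
lvremove -y "${VG}/${LV}"
```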
Zdenek, Yaniv - I don't know how to answer Zdenek's question:
> Where in this chain of operations is an issue with lvm2 seen?
I don't know. I do know that when we delete virtual machine disk images, the space never becomes free from the SAN's point of view, and my customer put in two 18-hour days over the past weekend rolling over to new RHEV storage domains to recover some free SAN space. Hopefully we can all put our heads together and come up with a solution. I'm happy to IRC or talk on the phone or video as needed.
- Greg
I think we now know this is a duplicate of bug 981626 - we need to send DISCARD when deleting the disk (LV), so XtremIO can reclaim space.

*** This bug has been marked as a duplicate of bug 981626 ***