Bug 835622 - RFE: virt-sparsify should be able to sparsify onto a thin-provisioned LV [NEEDINFO]
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libguestfs
Version: 7.3
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Richard W.M. Jones
QA Contact: Virtualization Bugs
Keywords: FutureFeature
Depends On:
Blocks:
Reported: 2012-06-26 11:47 EDT by Dominic Cleal
Modified: 2017-06-01 21:32 EDT

Doc Type: Bug Fix
Type: Bug
ydary: needinfo? (rjones)


Attachments
add block device discard support to qemu (WIP) (2.01 KB, patch)
2012-11-23 12:50 EST, Dominic Cleal

Description Dominic Cleal 2012-06-26 11:47:12 EDT
Description of problem:
Using virt-sparsify from one volume to an LVM2 thinly provisioned volume (dm-thin/dm-thin-pool) results in the LV using 100% of the space of the original volume, with no sparsification.

It appears that qemu-img simply writes zeros to the destination volume from the temporary qcow2 image as it's raw.

Version-Release number of selected component (if applicable):
libguestfs-tools-c-1.18.2-1.fc17.x86_64
qemu-img-1.0-17.fc17.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Get a VM image - say on another LV, 8GB in this example
2. lvcreate -L 8G --type thin --thinpool mypool myvg
3. lvcreate -T -n mythinvol -V 8G myvg/mypool
4. lvs /dev/myvg/mythinvol
5. virt-sparsify /dev/myvg/vmimage /dev/myvg/mythinvol
6. lvs /dev/myvg/mythinvol

Actual results:
# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool         0.00                        

... virt-sparsify ...

# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool        100.00

Expected results:
Before virt-sparsify, the thin volume is unallocated:
# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool         0.00                        

After virt-sparsify, some non-100% value in the Data% column:
# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool        45.00                        

Additional info:
Thin LVM2 pools + volumes were added in Fedora 17 and RHEL 6.3:
http://fedoraproject.org/wiki/Features/ThinProvisioning
Comment 1 Richard W.M. Jones 2012-11-23 11:31:04 EST
The last step of virt-sparsify, the one which actually performs
sparsification, is that we run 'qemu-img convert' with the source
being a temporary disk image and the destination being the final
disk image (a thin LV in this case).

qemu-img convert normally ignores zero blocks on the input and
doesn't write them to the output, which is how sparsification
happens.
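
The zero-skipping behaviour described above can be sketched roughly as
follows.  This is an illustration in Python, not qemu-img's actual code;
the function name sparse_copy and the 64 KiB block size are assumptions
made for the example.  It is only safe when the destination is known to
read back as zeroes wherever nothing is written (for example a freshly
created regular file).

```python
import os

BLOCK_SIZE = 64 * 1024  # assumed block size for this sketch


def sparse_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to dst, skipping all-zero blocks; return bytes actually written."""
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            if block.count(0) == len(block):
                # All zeroes: move past the range without writing, leaving a hole.
                dst.seek(len(block), os.SEEK_CUR)
            else:
                dst.write(block)
                written += len(block)
        # Set the final size explicitly in case the image ends with a hole.
        dst.truncate(src.tell())
    return written
```

On a block device the same loop would leave whatever data was already
on the device in the skipped ranges, which is why the destination type
matters.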

I was easily able to reproduce this problem just using qemu-img
convert and a hand-created thin volume:

truncate -s 256M /tmp/test1.img
lvcreate -L 4G --type thin --thinpool TmpPool /dev/vg_pin
lvcreate -T -n TmpThinVol -V 2G /dev/vg_pin/TmpPool

Initially the thin volume is not allocated:

# lvs|grep Tmp
  TmpPool           vg_pin twi-a-tz   4.00g                  0.00
  TmpThinVol        vg_pin Vwi-a-tz   2.00g TmpPool          0.00

After qemu-img convert of the empty raw file into the thin
volume, it is fully allocated up to 256MB (12.5% of 2G):

# qemu-img convert -f raw /tmp/test1.img -O raw /dev/vg_pin/TmpThinVol
# !lvs
lvs|grep Tmp
  TmpPool           vg_pin twi-a-tz   4.00g                  6.25
  TmpThinVol        vg_pin Vwi-a-tz   2.00g TmpPool         12.50
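
For reference, the two Data% figures above are what you would expect if
all 256 MiB of the input had been written.  A small arithmetic check in
Python (the helper name data_percent is made up for this example):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def data_percent(allocated_bytes, total_bytes):
    """Percentage of a volume or pool that is allocated."""
    return 100.0 * allocated_bytes / total_bytes


print(data_percent(256 * MIB, 2 * GIB))  # thin volume TmpThinVol: 12.5
print(data_percent(256 * MIB, 4 * GIB))  # thin pool TmpPool: 6.25
```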

This shouldn't happen, because qemu-img convert is not supposed
to write zeroes to the output.  I strace'd qemu-img and found
that it was in fact writing blocks of zeroes to the output.

(Compare this to running the following command:
qemu-img convert -f raw /tmp/test1.img -O raw /tmp/test2.img
and you will see that qemu-img does not write anything to the
second file).

This comes down to the implementation of two block device
drivers inside qemu:

bdrv_file (in block/raw-posix.c) is used to handle regular files,
and it deals with holes in files.

bdrv_host_device (in the same file) is used to handle block devices
(which it detects using S_ISBLK).  This does not deal with holes
because (a) regular devices don't have holes (arguably[*]) and
(b) because the device already exists you have to be careful to
write zeroes, overwriting any data that is already there.

  [*] arguably there are new system calls that can do this now
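
The file-versus-device distinction can be illustrated with the same
S_ISBLK test from Python.  This is only a sketch of the idea, not how
qemu selects a driver internally; the function name raw_driver_for and
the returned strings are labels chosen for the example, loosely
mirroring the two drivers named above.

```python
import os
import stat


def raw_driver_for(path):
    """Classify a path the way the analysis above describes."""
    mode = os.stat(path).st_mode
    if stat.S_ISBLK(mode):
        # Block device (e.g. an LV): existing contents are unknown.
        return "host_device"
    if stat.S_ISREG(mode):
        # Regular file: holes read back as zeroes.
        return "file"
    return "other"
```

For example, raw_driver_for("/dev/vg_pin/TmpThinVol") would be expected
to return "host_device", since the LV path resolves to a block device
node.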

What is needed, therefore, is a new block device type which can
specifically handle LVM thin LVs.  It needs to be able to detect
them, and then use whatever means necessary to deal with existing
sparseness in the image (note it's unlikely it would easily be able
to create new sparseness in an existing thin LV which had been
partially used).

(CCing Paolo Bonzini who can correct any egregious errors in
the above analysis ...)
Comment 2 Dominic Cleal 2012-11-23 12:49:19 EST
(In reply to comment #1)
> This comes down to the implementation of two block device
> drivers inside qemu:
> 
> bdrv_file (in block/raw-posix.c) is used to handle regular files,
> and it deals with holes in files.
> 
> bdrv_host_device (in the same file) is used to handle block devices
> (which it detects using S_ISBLK).  This does not deal with holes
> because (a) regular devices don't have holes (arguably[*]) and
> (b) because the device already exists you have to be careful to
> write zeroes, overwriting any data that is already there.
> 
>   [*] arguably there are new system calls that can do this now
> 
> What is needed, therefore, is a new block device type which can
> specifically handle LVM thin LVs.  It needs to be able to detect
> them, and then use whatever means necessary to deal with existing
> sparseness in the image (note it's unlikely it would easily be able
> to create new sparseness in an existing thin LV which had been
> partially used).

I think this can go into the generic raw block device support in qemu, without needing to explicitly support thin LVs.

The block interface already understands discards (presumably for qcow2 etc) so this could be added to the raw-posix implementation using the BLKDISCARD ioctl - which thin LVs and other devices (e.g. SCSI) respond to.

I put together a simple patch recently that does the following:
1. adds discard support if on Linux, based on Etienne Dechamps' patch here[1]
2. performs a full discard on the block device when "creating" it, so a used device is freed up

There are issues with it:
1. it makes a major assumption that the device will return zeros after discard.  /sys/block/<dev>/queue/discard_zeroes_data reports 0 for thin LVs on F17, which I suspect is wrong.  I think Linux also errs on the side of caution by saying 0 for most SCSI devices, unless it's explicitly using a SCSI command that writes zeros.  There's a very interesting and relevant discussion[2] to one of Paolo's patches in this area.
2. perhaps it should be using BLKDISCARDZEROES rather than BLKDISCARD
3. no error checking or testing for kernels/devices that don't support it

That said, the behaviour of discarding on block device creation looks good for a qemu-img convert to a thin LV.  The block discard support is untested.
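
For anyone who wants to experiment outside qemu, the BLKDISCARD call
itself is small.  Below is a rough Python sketch, not the attached
patch.  The helper names discard_arg and discard_range are invented for
the example; the ioctl number is _IO(0x12, 119) from <linux/fs.h>, and
the argument is a pair of 64-bit values (offset, length).  Like the
patch, it does no checking that the device supports discard, and it
destroys data in the given range.

```python
import fcntl
import os
import struct

BLKDISCARD = 0x1277  # _IO(0x12, 119) in <linux/fs.h>


def discard_arg(offset, length):
    """Pack the (offset, length) pair that BLKDISCARD takes, as two uint64."""
    return struct.pack("=QQ", offset, length)


def discard_range(dev_path, offset, length):
    """Discard [offset, offset + length) on a block device.  Destroys data."""
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        # Raises OSError if the device or kernel rejects the request.
        fcntl.ioctl(fd, BLKDISCARD, discard_arg(offset, length))
    finally:
        os.close(fd)
```

Whether the discarded range reads back as zeroes afterwards is exactly
the open question in issue 1 above.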

> (CCing Paolo Bonzini who can correct any egregious errors in
> the above analysis ...)

If Paolo can review it, that'd be great as I'm probably missing many subtleties in the qemu block layer. 

[1] http://patchwork.ozlabs.org/patch/125298/
[2] http://lists.gnu.org/archive/html/qemu-devel/2012-03/msg01260.html
Comment 3 Dominic Cleal 2012-11-23 12:50:32 EST
Created attachment 650590
add block device discard support to qemu (WIP)
Comment 4 Paolo Bonzini 2012-11-26 04:15:08 EST
Some corrections:

re. comment 1: the reason the raw driver behaves differently for files and block devices is that hdev_has_zero_init returns 0 for block devices.  You cannot be sure that a device is all zeroes when it is "created", so "qemu-img convert" must write everything the hard way.  However, the patch in attachment 650590 is wrong to make it return 1 on Linux, because there's no guarantee that BLKDISCARD works at all.
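
The effect of that flag on a raw copy can be shown with a toy model.
This is a sketch only; blocks_to_write is a made-up helper, and the
has_zero_init parameter stands in for the value the destination driver
reports.

```python
def blocks_to_write(blocks, has_zero_init):
    """Return the indices of the blocks a raw copy must write out."""
    to_write = []
    for i, block in enumerate(blocks):
        is_zero = block.count(0) == len(block)
        if is_zero and has_zero_init:
            # Destination is known to read as zeroes here already.
            continue
        to_write.append(i)
    return to_write


image = [b"\0" * 4, b"ab\0\0", b"\0" * 4]
print(blocks_to_write(image, has_zero_init=True))   # [1]
print(blocks_to_write(image, has_zero_init=False))  # [0, 1, 2]
```

With has_zero_init false, as for block devices, every block is written,
which matches the 100% allocation seen on the thin LV.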

re. comment 2: BLKDISCARDZEROES is just a getter for discard_zeroes_data.  discard_zeroes_data is set based on the information provided by the disk firmware.  In the specific case of dm-thinp, it could be set to 1 if the device is not a snapshot, but not in general.
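
The same value can be read from sysfs without an ioctl.  A minimal
Python sketch, assuming the attribute path quoted in comment 2; the
function name and the sysfs_root parameter are made up for the example
(the latter only so the helper can be pointed at a test directory).

```python
import os


def discard_zeroes_data(dev_name, sysfs_root="/sys/block"):
    """Return True/False from queue/discard_zeroes_data, or None if unreadable."""
    path = os.path.join(sysfs_root, dev_name, "queue", "discard_zeroes_data")
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        # Attribute missing or device name unknown.
        return None
```

For a thin LV, dev_name would be the device-mapper node name (something
like "dm-3"), not the /dev/myvg/... path.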
Comment 8 Paolo Bonzini 2017-02-27 12:13:44 EST
I think Rich needs to answer, because virt-sparsify has been rewritten since.  I believe it now uses virtio-scsi and can issue actual discard operations to the LV, so it doesn't use "qemu-img convert" at all.
Comment 9 Paolo Bonzini 2017-02-27 12:35:08 EST
After talking to Pino, it should work now as long as:

1) the underlying disk supports the WRITE SAME SCSI commands.  This is the case if its provisioning_mode is writesame_16 or writesame_10 (see commit 7985090aa020, "sd: disable discard_zeroes_data for UNMAP", 2014-11-12).

2) dm-thinp supports BLKDISCARDZEROES if the underlying disk(s) support it---and unfortunately I think it doesn't.

I suggest creating an RFE for the latter.
Comment 10 Paolo Bonzini 2017-03-29 10:58:38 EDT
> 2) dm-thinp supports BLKDISCARDZEROES if the underlying disk(s) support 
> it---and unfortunately I think it doesn't.

The patch series "RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload" (http://www.spinics.net/lists/linux-scsi/msg106538.html) might be a start.
