Bug 835622 - RFE: virt-sparsify should be able to sparsify onto a thin-provisioned LV
Summary: RFE: virt-sparsify should be able to sparsify onto a thin-provisioned LV
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libguestfs
Version: 7.3
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Richard W.M. Jones
QA Contact: YongkuiGuo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-06-26 15:47 UTC by Dominic Cleal
Modified: 2019-04-25 09:14 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-25 09:14:23 UTC
Target Upstream Version:
Embargoed:


Attachments
add block device discard support to qemu (WIP) (2.01 KB, patch)
2012-11-23 17:50 UTC, Dominic Cleal

Description Dominic Cleal 2012-06-26 15:47:12 UTC
Description of problem:
Using virt-sparsify from one volume to an LVM2 thinly provisioned volume (dm-thin/dm-thin-pool) results in the LV using 100% of the space of the original volume, with no sparsification.

It appears that qemu-img simply writes zeros from the temporary qcow2 image to the destination volume, because the destination is raw.

Version-Release number of selected component (if applicable):
libguestfs-tools-c-1.18.2-1.fc17.x86_64
qemu-img-1.0-17.fc17.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Get a VM image - say on another LV, 8GB in this example
2. lvcreate -L 8G --type thin --thinpool mypool myvg
3. lvcreate -T -n mythinvol -V 8G myvg/mypool
4. lvs /dev/myvg/mythinvol
5. virt-sparsify /dev/myvg/vmimage /dev/myvg/mythinvol
6. lvs /dev/myvg/mythinvol

Actual results:
# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool         0.00                        

... virt-sparsify ...

# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool        100.00

Expected results:
# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool         0.00                        

Some non-100% value in the Data% column:
# lvs /dev/myvg/mythinvol
  LV        VG        Attr     LSize  Pool   Origin Data%  Move Log Copy%  Convert
  mythinvol myvg      Vwi-a-tz  8.00g mypool        45.00                        

Additional info:
Thin LVM2 pools + volumes were added in Fedora 17 and RHEL 6.3:
http://fedoraproject.org/wiki/Features/ThinProvisioning

Comment 1 Richard W.M. Jones 2012-11-23 16:31:04 UTC
The last step of virt-sparsify, the one which actually performs
sparsification, is that we run 'qemu-img convert' with the source
being a temporary disk image and the destination being the final
disk image (a thin LV in this case).

qemu-img convert normally ignores zero blocks on the input and
doesn't write them to the output, which is how sparsification
happens.
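
The zero-skipping behaviour described above can be illustrated with a small Python sketch. This is not qemu-img's code; the function name, the 64 KiB block size and the return value are assumptions made for illustration. It only works because a freshly created regular file reads back as zeroes wherever nothing was written.

```python
import os

BLOCK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def sparse_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to a new regular file, seeking over all-zero blocks.

    Returns the number of bytes actually written.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            if block.count(0) == len(block):
                # All zeroes: skip ahead, leaving a hole in the new file.
                dst.seek(len(block), os.SEEK_CUR)
            else:
                dst.write(block)
                written += len(block)
        # Fix the final size in case the input ended with a skipped block.
        dst.truncate(src.tell())
    return written
```

On an existing block device the same trick is unsafe, because a skipped range keeps whatever data was there before rather than reading back as zeroes.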

I was easily able to reproduce this problem just using qemu-img
convert and a hand-created thin volume:

truncate -s 256M /tmp/test1.img
lvcreate -L 4G --type thin --thinpool TmpPool /dev/vg_pin
lvcreate -T -n TmpThinVol -V 2G /dev/vg_pin/TmpPool

Initially the thin volume is not allocated:

# lvs|grep Tmp
  TmpPool           vg_pin twi-a-tz   4.00g                  0.00
  TmpThinVol        vg_pin Vwi-a-tz   2.00g TmpPool          0.00

After qemu-img convert of the empty raw file into the thin
volume, it is fully allocated up to 256MB (12.5% of 2G):

# qemu-img convert -f raw /tmp/test1.img -O raw /dev/vg_pin/TmpThinVol
# !lvs
lvs|grep Tmp
  TmpPool           vg_pin twi-a-tz   4.00g                  6.25
  TmpThinVol        vg_pin Vwi-a-tz   2.00g TmpPool         12.50
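
The percentages in the lvs output follow directly from the sizes involved. A quick check (the helper name is made up for this sketch, and pool metadata is ignored):

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def allocated_percent(allocated_bytes, total_bytes):
    """Share of a volume or pool that is allocated, in percent."""
    return 100.0 * allocated_bytes / total_bytes


# 256M written into the 2G thin volume
print(allocated_percent(256 * MiB, 2 * GiB))  # 12.5
# the same 256M as a share of the 4G pool
print(allocated_percent(256 * MiB, 4 * GiB))  # 6.25
```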

This shouldn't happen because qemu-img convert is supposed
to not be writing zeroes to the output.  I strace'd qemu-img
and found that in fact it was writing blocks of zeroes to
the output.

(Compare this to running the following command:
qemu-img convert -f raw /tmp/test1.img -O raw /tmp/test2.img
and you will see that qemu-img does not write anything to the
second file).

This comes down to the implementation of two block device
drivers inside qemu:

bdrv_file (in block/raw-posix.c) is used to handle regular files,
and it deals with holes in files.

bdrv_host_device (in the same file) is used to handle block devices
(which it detects using S_ISBLK).  This does not deal with holes
because (a) regular devices don't have holes (arguably[*]) and
(b) because the device already exists you have to be careful to
write zeroes, overwriting any data that is already there.

  [*] arguably there are new system calls that can do this now
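
For illustration only, the S_ISBLK test mentioned above looks like this in Python. qemu does the equivalent in C; the function name here is an assumption and not taken from qemu.

```python
import os
import stat


def is_block_device(path):
    """Return True if path refers to a block device (the S_ISBLK check)."""
    return stat.S_ISBLK(os.stat(path).st_mode)
```

A thin LV such as /dev/vg_pin/TmpThinVol would be expected to return True here, and /tmp/test1.img False, which is why the two destinations take different code paths.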

What is needed, therefore, is a new block device type which can
specifically handle LVM thin LVs.  It needs to be able to detect
them, and then use whatever means necessary to deal with existing
sparseness in the image (note it's unlikely it would easily be able
to create new sparseness in an existing thin LV which had been
partially used).

(CCing Paolo Bonzini who can correct any egregious errors in
the above analysis ...)

Comment 2 Dominic Cleal 2012-11-23 17:49:19 UTC
(In reply to comment #1)
> This comes down to the implementation of two block device
> drivers inside qemu:
> 
> bdrv_file (in block/raw-posix.c) is used to handle regular files,
> and it deals with holes in files.
> 
> bdrv_host_device (in the same file) is used to handle block devices
> (which it detects using S_ISBLK).  This does not deal with holes
> because (a) regular devices don't have holes (arguably[*]) and
> (b) because the device already exists you have to be careful to
> write zeroes, overwriting any data that is already there.
> 
>   [*] arguably there are new system calls that can do this now
> 
> What is needed, therefore, is a new block device type which can
> specifically handle LVM thin LVs.  It needs to be able to detect
> them, and then use whatever means necessary to deal with existing
> sparseness in the image (note it's unlikely it would easily be able
> to create new sparseness in an existing thin LV which had been
> partially used).

I think this can go into the generic raw block device support in qemu, without needing to explicitly support thin LVs.

The block interface already understands discards (presumably for qcow2 etc) so this could be added to the raw-posix implementation using the BLKDISCARD ioctl - which thin LVs and other devices (e.g. SCSI) respond to.

I put together a simple patch recently that does the following:
1. adds discard support if on Linux, based on Etienne Dechamps' patch here[1]
2. performs a full discard on the block device when "creating" it, so a used device is freed up

There are issues with it:
1. it makes a major assumption that the device will return zeros after discard.  /sys/block/<dev>/queue/discard_zeroes_data reports 0 for thin LVs on F17, which I suspect is wrong.  I think Linux also errs on the side of caution by saying 0 for most SCSI devices, unless it's explicitly using a SCSI command that writes zeros.  There's a very interesting and relevant discussion[2] to one of Paolo's patches in this area.
2. perhaps it should be using BLKDISCARDZEROES rather than BLKDISCARD
3. no error checking or testing for kernels/devices that don't support it

That said, the behaviour of discarding on block device creation looks good for a qemu-img convert to a thin LV.  The block discard support is untested.
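
A minimal sketch of what issuing BLKDISCARD involves, written in Python rather than as the qemu patch itself. The function names are hypothetical. The ioctl number is _IO(0x12, 119) from <linux/fs.h>, and the argument is a pair of 64-bit values (start offset and length in bytes). There is no check that the device supports discard or that it reads back zeroes afterwards, which are exactly the open issues listed above.

```python
import fcntl
import os
import struct

BLKDISCARD = 0x1277  # _IO(0x12, 119) from <linux/fs.h>


def discard_range_arg(offset, length):
    """Pack the uint64_t range[2] = {offset, length} argument for BLKDISCARD."""
    return struct.pack("=QQ", offset, length)


def discard(device_path, offset, length):
    """Discard [offset, offset + length) on a block device.

    WARNING: this throws away the data in that range. Needs root and a
    device that supports discard; otherwise the ioctl raises OSError.
    """
    fd = os.open(device_path, os.O_WRONLY)
    try:
        fcntl.ioctl(fd, BLKDISCARD, discard_range_arg(offset, length))
    finally:
        os.close(fd)
```

Something like discard("/dev/myvg/mythinvol", 0, 8 * 1024 ** 3) would correspond to the "full discard on creation" step, assuming the device size is known in advance.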

> (CCing Paolo Bonzini who can correct any egregious errors in
> the above analysis ...)

If Paolo can review it, that'd be great as I'm probably missing many subtleties in the qemu block layer. 

[1]http://patchwork.ozlabs.org/patch/125298/
[2]http://lists.gnu.org/archive/html/qemu-devel/2012-03/msg01260.html

Comment 3 Dominic Cleal 2012-11-23 17:50:32 UTC
Created attachment 650590
add block device discard support to qemu (WIP)

Comment 4 Paolo Bonzini 2012-11-26 09:15:08 UTC
Some corrections:

re. comment 1: the reason raw images behave differently on files and on block devices is that hdev_has_zero_init returns 0.  You cannot be sure that devices are all-zeroes when created, so "qemu-img convert" must write everything the hard way.  However, the patch in attachment 650590 is wrong to make it return 1 on Linux, because there's no guarantee that BLKDISCARD works at all.

re. comment 2: BLKDISCARDZEROES is just a getter for discard_zeroes_data.  Discard_zeroes_data is set based on the information provided by the disk firmware.  In the specific case of dm-thinp, it could be set to one if the device is not a snapshot, but not in general.
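
The flag can also be read from sysfs at the path mentioned in comment 2. A small sketch follows; the function name and the sysfs_root parameter are assumptions added so the lookup can be pointed at a test directory. On newer kernels this attribute may always report 0, so a 0 here does not prove anything about the device.

```python
import os


def discard_zeroes_data(dev_name, sysfs_root="/sys/block"):
    """Return True if /sys/block/<dev>/queue/discard_zeroes_data reads 1."""
    path = os.path.join(sysfs_root, dev_name, "queue", "discard_zeroes_data")
    with open(path) as f:
        return f.read().strip() == "1"
```

For a thin LV, dev_name would be the dm-N name that the /dev/<vg>/<lv> symlink resolves to.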

Comment 8 Paolo Bonzini 2017-02-27 17:13:44 UTC
I think Rich needs to answer, because virt-sparsify has been rewritten since.  I believe that now it uses virtio-scsi and can issue actual discard operations to the LV, it doesn't use "qemu-img convert" at all.

Comment 9 Paolo Bonzini 2017-02-27 17:35:08 UTC
After talking to Pino, it should work now as long as:

1) the underlying disk supports the WRITE SAME SCSI commands.  This is true if the provisioning_mode is writesame_16 or writesame_10 (see commit 7985090aa020, "sd: disable discard_zeroes_data for UNMAP", 2014-11-12).

2) dm-thinp supports BLKDISCARDZEROES if the underlying disk(s) support it---and unfortunately I think it doesn't.

I suggest creating an RFE for the latter.

Comment 10 Paolo Bonzini 2017-03-29 14:58:38 UTC
> 2) dm-thinp supports BLKDISCARDZEROES if the underlying disk(s) support 
> it---and unfortunately I think it doesn't.

The patch series "RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload" (http://www.spinics.net/lists/linux-scsi/msg106538.html) might be a start.

Comment 12 Xianghua Chen 2017-07-13 02:53:31 UTC
Can reproduce this bug with package:
libguestfs-1.36.3-6.el7.x86_64

Steps:
1. Create pv, vg, lv, etc:
# pvcreate /dev/sda5
# vgcreate -s 8M /dev/vg_pin /dev/sda5
# lvcreate -L 4G --type thin --thinpool TmpPool /dev/vg_pin  --virtualsize 8G
  Using default stripesize 64.00 KiB.
  Thin pool volume with chunk size 64.00 KiB can address at most 15.81 TiB of data.
  WARNING: Sum of all thin volume sizes (8.00 GiB) exceeds the size of thin pool vg_pin/TmpPool (4.00 GiB)!
  For thin pool auto extension activation/thin_pool_autoextend_threshold should be below 100.
  Logical volume "lvol1" created.

# lvcreate -T -n TmpThinVol -V 2G /dev/vg_pin/TmpPool
  Using default stripesize 64.00 KiB.
  WARNING: Sum of all thin volume sizes (10.00 GiB) exceeds the size of thin pool vg_pin/TmpPool (4.00 GiB)!
  For thin pool auto extension activation/thin_pool_autoextend_threshold should be below 100.
  Logical volume "TmpThinVol" created.

# lvs|grep Tmp
  TmpPool    vg_pin twi-aotz-- 4.00g                0.00   0.73                            
  TmpThinVol vg_pin Vwi-a-tz-- 2.00g TmpPool        0.00                                   
  lvol1      vg_pin Vwi-a-tz-- 8.00g TmpPool        0.00       

2. Create a test image and convert it.
# truncate -s 256M /tmp/test1.img
# qemu-img convert -f raw /tmp/test1.img -O raw /dev/vg_pin/TmpThinVol

After qemu-img convert of the empty raw file into the thin volume, it is fully allocated up to 256MB (12.5% of 2G), which is wrong:
#  !lvs
 lvs|grep Tmp
  TmpPool    vg_pin twi-aotz-- 4.00g                6.25   2.29                            
  TmpThinVol vg_pin Vwi-a-tz-- 2.00g TmpPool        12.50                                  
  lvol1      vg_pin Vwi-a-tz-- 8.00g TmpPool        0.00
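
To make the before/after comparison scriptable instead of reading it off the lvs table, something along these lines could be used. It is a sketch that assumes lvs is invoked with --noheadings -o lv_name,data_percent, a C-like locale with "." as the decimal separator, and root privileges. Both function names are hypothetical.

```python
import subprocess


def parse_data_percent(lvs_output):
    """Map LV name -> Data% (float), or None for LVs without a Data% value."""
    result = {}
    for line in lvs_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        result[fields[0]] = float(fields[1]) if len(fields) > 1 else None
    return result


def thin_usage(vg_name):
    """Run lvs for one volume group and return its Data% figures."""
    out = subprocess.run(
        ["lvs", "--noheadings", "-o", "lv_name,data_percent", vg_name],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_data_percent(out)

# Example: thin_usage("vg_pin") before and after the qemu-img convert step.
```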

Comment 13 Richard W.M. Jones 2018-07-16 12:14:59 UTC
(In reply to Paolo Bonzini from comment #8)
> I think Rich needs to answer, because virt-sparsify has been rewritten
> since.  I believe that now it uses virtio-scsi and can issue actual discard
> operations to the LV, it doesn't use "qemu-img convert" at all.

I think this is the question I was supposed to answer.  virt-sparsify
has two modes, but the --in-place mode does indeed use virtio-scsi
and should issue discard requests, so I see no reason why it
wouldn't work on a thin-LV (although naturally I have not tested it ...)

Comment 14 Richard W.M. Jones 2018-08-02 09:45:11 UTC
Although this bug has probably been fixed already, it needs
testing.  Moving to 7.7.

Comment 15 Jaroslav Suchanek 2019-04-09 11:30:02 UTC
(In reply to Richard W.M. Jones from comment #14)
> Although this bug has probably been fixed already, it needs
> testing.  Moving to 7.7.

Yongkui Guo, can you please help with testing this before we take any further action? Thanks.

Comment 16 YongkuiGuo 2019-04-10 11:26:01 UTC
I reproduced this issue on RHEL 7.6 following the steps in comment 12. The problem still exists.

Comment 17 Richard W.M. Jones 2019-04-25 09:14:23 UTC
We're not planning to fix this in RHEL, and in my opinion it's likely
to be an LVM bug rather than a virt-sparsify thing.  I'm closing
this as WONTFIX.

