Bug 835622
| Summary: | RFE: virt-sparsify should be able to sparsify onto a thin-provisioned LV | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dominic Cleal <dcleal> | ||||
| Component: | libguestfs | Assignee: | Richard W.M. Jones <rjones> | ||||
| Status: | CLOSED WONTFIX | QA Contact: | YongkuiGuo <yoguo> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 7.3 | CC: | asdavis, jsuchane, kmorey, linl, mbooth, mtessun, pbonzini, ptoscano, rjones, robert, xchen, yoguo | ||||
| Target Milestone: | rc | Keywords: | FutureFeature | ||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2019-04-25 09:14:23 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Dominic Cleal
2012-06-26 15:47:12 UTC
The last step of virt-sparsify, the one which actually performs sparsification, is that we run 'qemu-img convert' with the source being a temporary disk image and the destination being the final disk image (a thin LV in this case). qemu-img convert normally ignores zero blocks on the input and doesn't write them to the output, which is how sparsification happens.

I was easily able to reproduce this problem just using qemu-img convert and a hand-created thin volume:

  truncate -s 256M /tmp/test1.img
  lvcreate -L 4G --type thin --thinpool TmpPool /dev/vg_pin
  lvcreate -T -n TmpThinVol -V 2G /dev/vg_pin/TmpPool

Initially the thin volume is not allocated:

  # lvs|grep Tmp
  TmpPool    vg_pin twi-a-tz 4.00g         0.00
  TmpThinVol vg_pin Vwi-a-tz 2.00g TmpPool 0.00

After qemu-img convert of the empty raw file into the thin volume, it is fully allocated up to 256MB (12.5% of 2G):

  # qemu-img convert -f raw /tmp/test1.img -O raw /dev/vg_pin/TmpThinVol
  # !lvs
  lvs|grep Tmp
  TmpPool    vg_pin twi-a-tz 4.00g         6.25
  TmpThinVol vg_pin Vwi-a-tz 2.00g TmpPool 12.50

This shouldn't happen because qemu-img convert is supposed to not be writing zeroes to the output. I strace'd qemu-img and found that in fact it was writing blocks of zeroes to the output. (Compare this to running the following command:

  qemu-img convert -f raw /tmp/test1.img -O raw /tmp/test2.img

and you will see that qemu-img does not write anything to the second file).

This comes down to the implementation of two block device drivers inside qemu:

bdrv_file (in block/raw-posix.c) is used to handle regular files, and it deals with holes in files.

bdrv_host_device (in the same file) is used to handle block devices (which it detects using S_ISBLK). This does not deal with holes because (a) regular devices don't have holes (arguably[*]) and (b) because the device already exists you have to be careful to write zeroes, overwriting any data that is already there.
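The zero-skipping behaviour described above can be illustrated with a minimal sketch. This is not qemu's implementation: the function names, the 64 KiB block size and the demo file layout are all assumptions made for the example. It copies a file while seeking over all-zero blocks, which is only safe when the destination is a freshly created regular file whose holes read back as zeroes. On an existing block device the skipped ranges would keep whatever data was already there, which is point (b) above.

```python
import os
import stat
import tempfile

BLOCK_SIZE = 64 * 1024  # illustrative block size, not qemu's


def is_block_device(path):
    """Return True if path is a block device (the S_ISBLK check)."""
    return stat.S_ISBLK(os.stat(path).st_mode)


def sparse_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to a new file dst, seeking over all-zero blocks.

    Returns the number of blocks that were actually written.
    """
    written = 0
    zero_block = bytes(block_size)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            if chunk == zero_block[:len(chunk)]:
                dst.seek(len(chunk), os.SEEK_CUR)  # leave a hole
            else:
                dst.write(chunk)
                written += 1
        dst.truncate(src.tell())  # set the final length if the tail was a hole
    return written


def demo():
    """Copy a 1 MiB file that contains a single non-zero block."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "test1.img")
        dst = os.path.join(tmp, "test2.img")
        with open(src, "wb") as f:
            f.truncate(1024 * 1024)   # 1 MiB of holes
            f.seek(3 * BLOCK_SIZE)
            f.write(b"data")          # make the fourth block non-zero
        written = sparse_copy(src, dst)
        with open(src, "rb") as a, open(dst, "rb") as b:
            identical = a.read() == b.read()
        return written, identical, is_block_device(dst)
```

Calling demo() copies sixteen blocks' worth of image but writes only the one block that holds data; the destination still reads back byte-for-byte identical to the source.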
[*] arguably there are new system calls that can do this now

What is needed, therefore, is a new block device type which can specifically handle LVM thin LVs. It needs to be able to detect them, and then use whatever means necessary to deal with existing sparseness in the image (note it's unlikely it would easily be able to create new sparseness in an existing thin LV which had been partially used).

(CCing Paolo Bonzini who can correct any egregious errors in the above analysis ...)

(In reply to comment #1)
> This comes down to the implementation of two block device
> drivers inside qemu:
>
> bdrv_file (in block/raw-posix.c) is used to handle regular files,
> and it deals with holes in files.
>
> bdrv_host_device (in the same file) is used to handle block devices
> (which it detects using S_ISBLK). This does not deal with holes
> because (a) regular devices don't have holes (arguably[*]) and
> (b) because the device already exists you have to be careful to
> write zeroes, overwriting any data that is already there.
>
> [*] arguably there are new system calls that can do this now
>
> What is needed, therefore, is a new block device type which can
> specifically handle LVM thin LVs. It needs to be able to detect
> them, and then use whatever means necessary to deal with existing
> sparseness in the image (note it's unlikely it would easily be able
> to create new sparseness in an existing thin LV which had been
> partially used).

I think this can go into the generic raw block device support in qemu, without needing to explicitly support thin LVs. The block interface already understands discards (presumably for qcow2 etc) so this could be added to the raw-posix implementation using the BLKDISCARD ioctl - which thin LVs and other devices (e.g. SCSI) respond to.

I put together a simple patch recently that does the following:

1. adds discard support if on Linux, based on Etienne Dechamps' patch here[1]
2. performs a full discard on the block device when "creating" it, so a used device is freed up

There are issues with it:

1. it makes a major assumption that the device will return zeros after discard. /sys/block/<dev>/queue/discard_zeroes_data reports 0 for thin LVs on F17, which I suspect is wrong. I think Linux also errs on the side of caution by saying 0 for most SCSI devices, unless it's explicitly using a SCSI command that writes zeros. There's a very interesting and relevant discussion[2] to one of Paolo's patches in this area.
2. perhaps it should be using BLKDISCARDZEROES rather than BLKDISCARD
3. no error checking or testing for kernels/devices that don't support it

That said, the behaviour of discarding on block device creation looks good for a qemu-img convert to a thin LV. The block discard support is untested.

> (CCing Paolo Bonzini who can correct any egregious errors in
> the above analysis ...)

If Paolo can review it, that'd be great as I'm probably missing many subtleties in the qemu block layer.

[1] http://patchwork.ozlabs.org/patch/125298/
[2] http://lists.gnu.org/archive/html/qemu-devel/2012-03/msg01260.html

Created attachment 650590
add block device discard support to qemu (WIP)
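For readers unfamiliar with the BLKDISCARD ioctl mentioned in this comment, the following is a rough sketch of how a userspace program can issue one on Linux. It is not the attached patch and not qemu code. The function name discard_range is hypothetical, and the request number is written out by hand from <linux/fs.h> (_IO(0x12, 119)) because Python's fcntl module does not export it. The sketch refuses anything that is not a block device and lets the kernel's error surface as OSError when the device does not support discard.

```python
import fcntl
import os
import stat
import struct

# From <linux/fs.h>: BLKDISCARD is _IO(0x12, 119). Python's fcntl module
# does not export it, so the value is spelled out here.
BLKDISCARD = 0x1277


def discard_range(path, offset, length):
    """Ask the kernel to discard [offset, offset + length) on a block device.

    Raises ValueError for anything that is not a block device, and OSError
    if the ioctl fails (for example when the device cannot discard).
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        if not stat.S_ISBLK(os.fstat(fd).st_mode):
            raise ValueError(f"{path} is not a block device")
        # The ioctl argument is uint64_t range[2] = {start, length}.
        fcntl.ioctl(fd, BLKDISCARD, struct.pack("=QQ", offset, length))
    finally:
        os.close(fd)
```

A call such as discard_range("/dev/vg_pin/TmpThinVol", 0, 2 * 1024**3) would, run as root, discard the whole 2G test volume and destroy its contents. A successful discard does not by itself promise that the range reads back as zeroes afterwards, which is the assumption flagged in issue 1 above.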
Some corrections:

re. comment 1: the reason why raw devices behave differently for file and block devices is that hdev_has_zero_init returns 0. You cannot be sure that devices are all-zeroes when created, so "qemu-img convert" must write everything the hard way. However, the patch in attachment 650590 is wrong in making it return 1 for Linux, because there's no guarantee that BLKDISCARD works at all.

re. comment 2: BLKDISCARDZEROES is just a getter for discard_zeroes_data. discard_zeroes_data is set based on the information provided by the disk firmware. In the specific case of dm-thinp, it could be set to one if the device is not a snapshot, but not in general.

I think Rich needs to answer, because virt-sparsify has been rewritten since. I believe that it now uses virtio-scsi and can issue actual discard operations to the LV; it doesn't use "qemu-img convert" at all.

After talking to Pino, it should work now as long as:

1) the underlying disk supports the WRITE SAME SCSI commands. This is true if the provisioning_mode is writesame_16 or writesame_10 (see commit 7985090aa020, "sd: disable discard_zeroes_data for UNMAP", 2014-11-12).

2) dm-thinp supports BLKDISCARDZEROES if the underlying disk(s) support it---and unfortunately I think it doesn't.

I suggest creating an RFE for the latter.

> 2) dm-thinp supports BLKDISCARDZEROES if the underlying disk(s) support
> it---and unfortunately I think it doesn't.

The patch series "RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload" (http://www.spinics.net/lists/linux-scsi/msg106538.html) might be a start.

Can reproduce this bug with package:
libguestfs-1.36.3-6.el7.x86_64

Steps:

1. Create pv, vg, lv, etc:

  # pvcreate /dev/sda5
  # vgcreate -s 8M /dev/vg_pin /dev/sda5
  # lvcreate -L 4G --type thin --thinpool TmpPool /dev/vg_pin --virtualsize 8G
    Using default stripesize 64.00 KiB.
    Thin pool volume with chunk size 64.00 KiB can address at most 15.81 TiB of data.
    WARNING: Sum of all thin volume sizes (8.00 GiB) exceeds the size of thin pool vg_pin/TmpPool (4.00 GiB)!
    For thin pool auto extension activation/thin_pool_autoextend_threshold should be below 100.
    Logical volume "lvol1" created.
  # lvcreate -T -n TmpThinVol -V 2G /dev/vg_pin/TmpPool
    Using default stripesize 64.00 KiB.
    WARNING: Sum of all thin volume sizes (10.00 GiB) exceeds the size of thin pool vg_pin/TmpPool (4.00 GiB)!
    For thin pool auto extension activation/thin_pool_autoextend_threshold should be below 100.
    Logical volume "TmpThinVol" created.
  # lvs|grep Tmp
    TmpPool    vg_pin twi-aotz-- 4.00g         0.00   0.73
    TmpThinVol vg_pin Vwi-a-tz-- 2.00g TmpPool 0.00
    lvol1      vg_pin Vwi-a-tz-- 8.00g TmpPool 0.00

2. Create a test image and convert it.

  # truncate -s 256M /tmp/test1.img
  # qemu-img convert -f raw /tmp/test1.img -O raw /dev/vg_pin/TmpThinVol

After qemu-img convert of the empty raw file into the thin volume, it is fully allocated up to 256MB (12.5% of 2G), which is wrong:

  # !lvs
  lvs|grep Tmp
    TmpPool    vg_pin twi-aotz-- 4.00g         6.25   2.29
    TmpThinVol vg_pin Vwi-a-tz-- 2.00g TmpPool 12.50
    lvol1      vg_pin Vwi-a-tz-- 8.00g TmpPool 0.00

(In reply to Paolo Bonzini from comment #8)
> I think Rich needs to answer, because virt-sparsify has been rewritten
> since. I believe that now it uses virtio-scsi and can issue actual discard
> operations to the LV, it doesn't use "qemu-img convert" at all.

I think this is the question I was supposed to answer. virt-sparsify has two modes, but the --in-place mode does indeed use virtio-scsi and should issue discard requests, so I see no reason why it wouldn't work on a thin-LV (although naturally I have not tested it ...)

Although this bug has probably been fixed already, it needs testing. Moving to 7.7.

(In reply to Richard W.M. Jones from comment #14)
> Although this bug has probably been fixed already, it needs
> testing. Moving to 7.7.

Yongkui Guo, can you please help with testing this before we take any further action? Thanks.
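The allocation figures in the comment 12 reproduction can be checked by hand. The short calculation below assumes that the lvs Data% column is allocated space divided by the size of the volume or pool, and that every 64 KiB chunk covering the 256M image ended up allocated; both are assumptions for the sake of the arithmetic, not statements about how LVM computes the column internally.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
CHUNK = 64 * 1024  # thin-pool chunk size reported by lvcreate in the reproduction


def data_percent(allocated_bytes, size_bytes):
    """Allocated space as a percentage, in the style of the lvs Data% column."""
    return 100.0 * allocated_bytes / size_bytes


written = 256 * MIB                             # size of /tmp/test1.img
thin_vol_pct = data_percent(written, 2 * GIB)   # TmpThinVol is 2G
pool_pct = data_percent(written, 4 * GIB)       # TmpPool is 4G
chunks = written // CHUNK                       # chunks needed to hold 256M
print(thin_vol_pct, pool_pct, chunks)
```

This prints 12.5 6.25 4096, matching the 12.50 and 6.25 shown by lvs: the whole 256M image was written out as zeroes rather than skipped.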
I reproduced this issue on RHEL 7.6 following the steps in comment 12. The problem still exists.

We're not planning to fix this in RHEL, and in my opinion it's likely to be an LVM bug rather than a virt-sparsify thing. I'm closing this as WONTFIX.