Bug 1462504
| Field | Value |
|---|---|
| Summary | disk discard flag not respected for NFS storage |
| Product | [oVirt] ovirt-engine |
| Component | General |
| Version | 4.1.2 |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | unspecified |
| Reporter | Markus Stockhausen <mst> |
| Assignee | Idan Shaby <ishaby> |
| CC | amureini, bugs, kwolf, mst |
| Target Milestone | ovirt-4.1.4 |
| Target Release | --- |
| Flags | rule-engine: ovirt-4.1+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Last Closed | 2017-06-29 07:23:07 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| oVirt Team | Storage |
| Cloudforms Team | --- |
**Description** (Markus Stockhausen, 2017-06-18 08:04:32 UTC)
I can confirm that (re)setting the discard flag changes the qemu command line.

Disk with discard:

```
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=5,drive=drive-scsi0-0-0-5,id=scsi0-0-0-5
-drive file=.../6e9edc6b-d133-4114-a275-be0d550afda8,format=raw,if=none,id=drive-scsi0-0-0-6,serial=b2635099-8044-4a44-897b-6e6d1ce53d36,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads
```

Disk without discard:

```
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=6,drive=drive-scsi0-0-0-6,id=scsi0-0-0-6
-drive file=.../20f1546b-3312-46e0-9eea-f07046ec44e8,format=raw,if=none,id=drive-scsi0-0-0-4,serial=5532dab8-5bed-4278-a988-4fbb799a8012,cache=none,werror=stop,rerror=stop,aio=threads
```

Running the same plain mkfs.xfs in other VMs (e.g. CentOS 7) does not show the effect. It seems that only SLES12 SP2 runs mkfs.xfs with discard (or some similar option) by default.

We are using NFS 4.0 mounts in oVirt, at least judging by the following output:

```
100.64.251.1:/var/data/nas1/OVirtIB on /rhev/data-center/mnt/100.64.251.1:_var_data_nas1_OVirtIB type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,port=0,timeo=600,retrans=6,sec=sys,clientaddr=100.64.252.11,local_lock=none,addr=100.64.251.1)
```

Even if discard is NOT enabled for a disk in oVirt, and qemu is therefore NOT started with `-drive file=...discard=unmap...`, the disk inside the VM still advertises a discard granularity:

```
# cat /sys/block/sde/queue/discard_granularity
4096
```

---

(In reply to Markus Stockhausen from comment #1)
> I can confirm that (re)setting the discard flag will change the qemu command
> line:
>
> Disk with discard:
>
> -device
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=5,drive=drive-scsi0-0-0-5,
> id=scsi0-0-0-5
> -drive
> file=.../6e9edc6b-d133-4114-a275-be0d550afda8,format=raw,if=none,id=drive-

So it's indeed RAW.

---

(In reply to Markus Stockhausen from comment #0)
> Description of problem:
>
> We are running oVirt 4.1.2 in an XFS/NFS setup. Some of our VMs are SLES12
> SP2. When formatting thin provisioned disks inside these VMs with mkfs.xfs,
> the disk is fully allocated afterwards.
>
> Version-Release number of selected component (if applicable):
>
> Engine: oVirt 4.1.2
> Nodes: CentOS 7
> VM: SLES12 SP2
>
> How reproducible: 100%
>
> Scenario A: thin provisioned disk WITHOUT discard:
>
> 1. Create an NFS thin provisioned disk. Do not enable discard.
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    8 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 2. Start the VM and format the disk with mkfs.xfs -K.
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 3. Format the disk as usual with mkfs.xfs (without -K).
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20481 6e9edc6b-d133-4114-a275-be0d550afda8
>
> Scenario B: thin provisioned disk WITH discard:
>
> 1. Create an NFS thin provisioned disk. Enable discard. (For simplicity I
>    just moved the disk around our NFS storages and it was compacted.)
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    8 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 2. Start the VM and format the disk with mkfs.xfs -K.
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 3. Format the disk as usual with mkfs.xfs (without -K).
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20481 6e9edc6b-d133-4114-a275-be0d550afda8
>
> Actual results:
>
> Disk is fully allocated on storage.

Can you compare the output of `du -ch` and `ls -lh` on the files? Also, run `qemu-img info` on them, please (I suspect they are raw-sparse).

---

```
# du -ch 6e9edc6b-d133-4114-a275-be0d550afda8
8.0M    6e9edc6b-d133-4114-a275-be0d550afda8
8.0M    total

# ls -lh 6e9edc6b-d133-4114-a275-be0d550afda8
-rw-rw----. 1 36 kvm 20G Jun 18 11:36 6e9edc6b-d133-4114-a275-be0d550afda8

# qemu-img info 6e9edc6b-d133-4114-a275-be0d550afda8
image: 6e9edc6b-d133-4114-a275-be0d550afda8
file format: raw
virtual size: 20G (21474836480 bytes)
disk size: 8.0M
```

---

(In reply to Markus Stockhausen from comment #7)
> [du -ch, ls -lh and qemu-img info output above]

Looks OK to me. As suspected, it is raw but sparsely allocated. Can you verify this is also the case in all scenarios, with and without discard support?

---

It is the same in both cases. Remember that I always use the same image: I change the disk flag in oVirt, start the VM and then move the image to another NFS storage. Afterwards it is small again.

Just created a completely new empty disk WITHOUT the discard flag. An `strace -tt mkfs.xfs /dev/sde1` inside the VM gave the following output:

```
...
12:01:08.166192 ioctl(4, BLKSSZGET, 512) = 0
12:01:08.166215 chdir("/root") = 0
12:01:08.166293 close(3)
12:01:08.166338 ioctl(4, BLKDISCARD, {0, 0}) = 0
12:01:41.963004 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
...
```

As you can see, the BLKDISCARD ioctl takes about 30 seconds (in this case for a 20 GB disk). It does not look like qemu skips the discard commands. Tracing qemu at that time gives tons of these:

```
34195 12:11:09.063477 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 12231626752 <unfinished ...>
34192 12:11:09.089665 <... pwrite resumed> ) = 16777216
34193 12:11:09.090354 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 16593698816 <unfinished ...>
34194 12:11:09.104689 <... pwrite resumed> ) = 16777216
34192 12:11:09.105191 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 14496550912 <unfinished ...>
34195 12:11:09.143977 <... pwrite resumed> ) = 16777216
34194 12:11:09.144939 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 12248403968 <unfinished ...>
34193 12:11:09.145499 <... pwrite resumed> ) = 16777216
34191 12:11:09.145783 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 16610476032 <unfinished ...>
34192 12:11:09.172102 <... pwrite resumed> ) = 16777216
34193 12:11:09.172661 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 14513328128 <unfinished ...>
34194 12:11:09.173719 <... pwrite resumed> ) = 16777216
34192 12:11:09.174230 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 12265181184 <unfinished ...>
34191 12:11:09.179010 <... pwrite resumed> ) = 16777216
34194 12:11:09.179477 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 16627253248 <unfinished ...>
34193 12:11:09.237510 <... pwrite resumed> ) = 16777216
34191 12:11:09.238988 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 14530105344 <unfinished ...>
34192 12:11:09.276079 <... pwrite resumed> ) = 16777216
```

Why should it write zeros if discard is disabled for the disk?

---

> As you can see, the BLKDISCARD ioctl takes about 30 seconds (in this case
> for a 20 GB disk). It does not look like qemu skips the discard commands.
You need to differentiate between two things here:
1. Whether the UNMAP command is called or not (by the guest OS).
2. Whether qemu passes it to the underlying storage.
What you just saw is the UNMAP command called from the guest, which is ok, since you did not use the -K flag of mkfs.xfs.
What you can't see here is whether qemu passed it on to the underlying storage.
You can check it by examining the thinly provisioned underlying storage lun - its free space should not grow right after the UNMAP command is sent from the guest.
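For file-backed storage, the suggestion above has a simple analogue: if qemu really passes a discard through to an NFS-backed raw image, the image file's allocated block count drops while its apparent size stays the same. A minimal local sketch (the file name `disk.img` is arbitrary, and an ordinary local file stands in for the NFS-backed image):

```shell
# Fill an image file with 8 MiB of data: every block is now allocated.
dd if=/dev/urandom of=disk.img bs=1M count=8 status=none
echo "before discard: $(du -k disk.img | cut -f1) KiB allocated"

# Punch a hole over the whole range: this deallocates the blocks but keeps
# the apparent file size, which is what a passed-through discard does to a
# file-backed image.
fallocate --punch-hole --offset 0 --length 8M disk.img
echo "after discard:  $(du -k disk.img | cut -f1) KiB allocated"
echo "apparent size:  $(stat -c %s disk.img) bytes"

rm -f disk.img
```

If the allocated count does not drop while the guest is discarding, the discard is being swallowed (or rewritten as something else) somewhere along the chain.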
See my comment 12. From my current point of view, qemu wraps it into write zeros.

---

(In reply to Markus Stockhausen from comment #4)
> Even if discard is NOT enabled for a disk in oVirt, and qemu is therefore
> NOT started with -drive file=...discard=unmap..., the disk inside the VM
> still advertises a discard granularity:
>
> # cat /sys/block/sde/queue/discard_granularity
> 4096

Correct, but discard can also be ignored silently at any layer of the chain, and the higher layers are aware of that. I think with some sg_inq querying we can see whether discard is supported or not:

```
sudo sg_inq -p 0xb0 /dev/sde
```

could give you the information on unmap (or perhaps `sudo sg_vpd -p 0xb2 /dev/sde`, with the LBPU flag?).

---

> Why should it write zeros if discard is disabled for the disk?
1. Do these "pwrite"s also occur when using the -K flag?
I suspect that they are executed as part of the same process that calls discard and can be skipped by using -K.
2. Since you are using NFS V4.0 and discard is supported only from NFS V4.2, why don't you use the -K flag?
From what I understand, there is no reason to use "Enable Discard" nor to intentionally generate UNMAP calls that will not be used anyway, since it only causes performance degradation.
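Since discard only reaches the server on NFS 4.2 and later, it is worth checking the negotiated `vers=` of the mount explicitly. A small sketch that parses it out of a mount-options line (here a shortened copy of the line quoted earlier in this report; on a live host you would read the relevant line from `/proc/mounts` instead):

```shell
# Extract the negotiated NFS version from a mount-options string.
mount_line='100.64.251.1:/var/data/nas1/OVirtIB on /rhev/data-center/mnt/... type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576)'

vers=$(printf '%s\n' "$mount_line" | grep -o 'vers=[0-9.]*' | cut -d= -f2)
echo "negotiated NFS version: $vers"

# Hole punching only exists in NFS 4.2 and later.
case "$vers" in
    4.2) echo "discard can reach the server" ;;
    *)   echo "discard cannot reach the server" ;;
esac
```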
(In reply to Idan Shaby from comment #15)
> From what I understand, there is no reason to use "Enable Discard" nor to
> intentionally generate UNMAP calls that will not be used anyway, since it
> only causes performance degradation.

My problem is that I do NOT have "Enable Discard" active - the checkbox is empty. So I would expect qemu to drop the discard command coming from the VM and not to write gigabytes of zeros. The root of it all is the installation of an OS inside the VM: if you use the usual partitioning dialogue, you normally have no way of passing the -K flag.

A cross check with an NFS 4.2 mount shows the expected behaviour for the discard command.

---

I did several checks with NFS 4.0, 4.1 and 4.2. The discard flag has no effect at all. Checked or unchecked, the behaviour is always as follows:

- NFS 4.0 + NFS 4.1: discard inside the VM always results in writing zeros to the NFS image.
- NFS 4.2: discard inside the VM always results in a discard on the NFS image.

---

> NFS 4.0 + NFS 4.1
> Discard inside VM will always result in writing zeros to NFS image
Can you explain what exactly you mean by "Discard"? Which command are we talking about?
Sorry for mixing this up - in detail:

I'm talking about the qemu discard option:

- `discard=unmap` -> qemu should hand the unmap command over to the lower layers. This is the same as the oVirt per-disk option "Enable Discard" -> start qemu with `discard=unmap` for the disk.

Inside the VM we are talking about the unmap command:

- `unmap()` -> issued inside the VM by mkfs.xfs.

From my analysis, the oVirt (and thus qemu) option does not have any effect when using NFS. With `unmap()` inside the VM, qemu either writes zeroes (NFS 4.0 & 4.1) or unmaps data (NFS 4.2). From the comments above I understand that qemu should ignore any `unmap()` command from inside the VM if we are using NFS 4.0 or 4.1, because those protocol versions do not support unmap. To my surprise it is not dropped; instead qemu writes zeroes to empty the data on the NFS storage. Sounds like "does not work as expected".

---

(In reply to Markus Stockhausen from comment #18)
> I did several checks with NFS 4.0, 4.1 and 4.2. The discard flag has no
> effect at all. Checked or unchecked, the behaviour is always as follows:
>
> NFS 4.0 + NFS 4.1
> Discard inside VM will always result in writing zeros to NFS image
>
> NFS 4.2
> Discard inside VM will always result in discard to NFS image

Kevin - does it make sense that QEMU would translate a discard into writing zeros if the underlying FS does not support discard? (I remember the other way around - a zero write can become a discard if the underlying storage supports it.)

---

(In reply to Yaniv Kaul from comment #21)
> Kevin - does it make sense that QEMU would translate a discard into
> writing zeros if the underlying FS does not support discard?

No, it doesn't. If QEMU can't discard, it ignores the request, because discard is only a hint.

> (I remember the other way around - a zero write can become a discard if the
> underlying storage supports it)

What does the guest OS actually request from QEMU? If it sends a WRITE SAME SCSI command with the unmap flag set, then that's not a discard request but a write-zeros request, which QEMU is allowed (but not required) to fulfill by doing a discard instead. If it can't use discard, it must write explicit zeros. (This is the case that you remember.) If the guest sends an UNMAP command, however, then QEMU can ignore the request if discarding isn't possible.

---

See comment 10. The guest runs mkfs.xfs. strace shows the following output, with a 30-second delay during the BLKDISCARD ioctl:

```
...
12:01:08.166192 ioctl(4, BLKSSZGET, 512) = 0
12:01:08.166215 chdir("/root") = 0
12:01:08.166293 close(3)
12:01:08.166338 ioctl(4, BLKDISCARD, {0, 0}) = 0
12:01:41.963004 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
...
```

---

So essentially, if the underlying FS (NFS earlier than v4.2) does not support DISCARD, it passes this information on to the guest, which then uses zero writes instead of DISCARD? Makes some sense to me... It'd be interesting to see what the guest sees in both cases, using:

```
sudo sg_inq -p 0xb0 /dev/sde
```

which could give you the information on unmap (or perhaps `sudo sg_vpd -p 0xb2 /dev/sde`, with the LBPU flag?).

---

(In reply to Yaniv Kaul from comment #24)
> So essentially, if the underlying FS (NFS earlier than v4.2) does not
> support DISCARD, it passes this information on to the guest, which then uses
> zero writes instead of DISCARD? Makes some sense to me...

But the outcome of discarding a block of data and writing zeroes to it is not the same, so how can that be? And what do you mean by saying that the FS passes the information to the guest? Can you clarify?

I still need to investigate this issue, but from a quick glance at mkfs.xfs I do see that it may write zeros (128M) to "the beginning and end of the device to remove traces of other filesystems, raid superblocks, etc" (xfsprogs-dev/mkfs/xfs_mkfs.c, WHACK_SIZE). Maybe if we can discard we do, and if we can't we zero those areas manually? Worth checking.
Can you please attach the full command that you execute inside the vm along with its trace and qemu's trace on NFS4.1 and 4.2? Had to use a new machine. VMs /dev/sde is now /dev/sdb 1) NFS 4.2 + Ovirt disk discard-flag unchecked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 Write same (16) with unmap bit supported (LBWS): 1 Write same (10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 2) NFS 4.2 + Ovirt disk discard-flag checked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 Write same (16) with unmap bit supported (LBWS): 1 Write same 
(10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 3) NFS 4.1 + Ovirt disk discard-flag unchecked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 Write same (16) with unmap bit supported (LBWS): 1 Write same (10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 4) NFS 4.1 + Ovirt disk discard-flag checked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 
Write same (16) with unmap bit supported (LBWS): 1 Write same (10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 Created attachment 1291730 [details]
vm + hypervisor logs
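The `sg_vpd` captures above can also be checked mechanically. This sketch extracts the LBPU flag from captured output (the sample lines are copied from this report; on a live guest you would pipe `sg_vpd -p 0xb2 /dev/sdb` in directly):

```shell
# sg_vpd -p 0xb2 reports whether the device advertises UNMAP support.
# Here we parse captured output instead of querying a real device.
vpd_output='Logical block provisioning VPD page (SBC):
  Unmap command supported (LBPU): 1
  Write same (16) with unmap bit supported (LBWS): 1'

lbpu=$(printf '%s\n' "$vpd_output" | sed -n 's/.*(LBPU): *\([0-9]\).*/\1/p')
echo "LBPU=$lbpu"

# LBPU is 1 in all four scenarios above: the guest is always told UNMAP is
# supported, regardless of the oVirt discard flag or the NFS version.
```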
From the logs we see that mkfs inside the VM executes the BLKDISCARD ioctl, and this translates into:

- qemu on NFS 4.2 executing fallocate calls;
- qemu on NFS 4.1 executing zero writes.

---

(In reply to Markus Stockhausen from comment #29)
> mkfs inside VM executes ioctl BLKDISCARD
>
> this translates into:
>
> qemu on nfs 4.2 executing fallocate calls

OK, this makes sense to me - I forgot this is RAW and not qcow2. I assume it's FALLOC_FL_PUNCH_HOLE (https://github.com/qemu/qemu/blob/0748b3526e8cb78b9cd64208426bfc3d54a72b04/block/file-posix.c#L1396 perhaps?). The file is still sparse, right?

> qemu on nfs 4.1 executing write zero calls

I think this is what fallocate does on NFS before 4.2 (http://thread.gmane.org/gmane.linux.nfs/59563), which should translate to essentially the same size, but without the real sparsification (writing all zeros may or may not allocate disk space for real, depending on the storage backend).

---

That leaves two questions open:

1. Does qemu present raw disk images stored on NFS 4.1 / 4.2 to the client in the right fashion?
2. As the discard option does not make sense for images on NFS, should it be disabled or automatically set to the right value (matching the NFS version)?

---

(In reply to Markus Stockhausen from comment #31)
> 1. Does qemu present raw disk images stored on NFS 4.1 / 4.2 to the client
> in the right fashion?

oVirt always uses raw-sparse disks on file-based storage.

> 2. As the discard option does not make sense for images on NFS, should it
> be disabled or automatically set to the right value (matching the NFS
> version)?

I'm not sure. It still leaves sparse files on NFS, and I'd imagine that on 4.2 it can pass the discard to the underlying storage, so it looks like it's better to leave it as is. Note that some underlying storage knows how to dedup/compress zero-filled blocks intelligently, so it does make sense to bother and write zeros.

Looks like it's not a bug. Please reopen if you have further questions.
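The zero-write path seen in the qemu traces can be reproduced on a local filesystem. This is a sketch with an arbitrary file name, and plain local storage standing in for the NFS image: a raw-sparse image starts with no blocks allocated, and the zero-write fallback used on NFS 4.1 and older allocates the full virtual size, which is exactly the `du -m` jump from 8 to 20481 MB reported above.

```shell
# A raw-sparse image: full apparent size, essentially zero blocks allocated.
truncate -s 16M img.raw
echo "fresh image: $(du -k img.raw | cut -f1) KiB allocated"

# The NFS <= 4.1 fallback seen in the traces: "discarding" by writing
# zeros. On an ordinary filesystem, zero-filled blocks stay allocated,
# so the image balloons to its full virtual size.
dd if=/dev/zero of=img.raw bs=1M count=16 conv=notrunc status=none
echo "after zero writes: $(du -k img.raw | cut -f1) KiB allocated"
echo "apparent size: $(stat -c %s img.raw) bytes"

rm -f img.raw
```

(As noted in the closing comment, some storage backends dedup or compress zero blocks, so on such backends the allocation may not actually grow.)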