Bug 1462504
| Field | Value |
|---|---|
| Summary | disk discard flag not respected for NFS storage |
| Product | [oVirt] ovirt-engine |
| Component | General |
| Version | 4.1.2 |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | unspecified |
| Reporter | Markus Stockhausen <mst> |
| Assignee | Idan Shaby <ishaby> |
| CC | amureini, bugs, kwolf, mst |
| Target Milestone | ovirt-4.1.4 |
| Target Release | --- |
| Flags | rule-engine: ovirt-4.1+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Last Closed | 2017-06-29 07:23:07 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| oVirt Team | Storage |
| Cloudforms Team | --- |
**Description** (Markus Stockhausen, 2017-06-18 08:04:32 UTC)
I can confirm that (re)setting the discard flag changes the qemu command line.

Disk with discard:

```
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=5,drive=drive-scsi0-0-0-5,id=scsi0-0-0-5
-drive file=.../6e9edc6b-d133-4114-a275-be0d550afda8,format=raw,if=none,id=drive-scsi0-0-0-6,serial=b2635099-8044-4a44-897b-6e6d1ce53d36,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads
```

Disk without discard:

```
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=6,drive=drive-scsi0-0-0-6,id=scsi0-0-0-6
-drive file=.../20f1546b-3312-46e0-9eea-f07046ec44e8,format=raw,if=none,id=drive-scsi0-0-0-4,serial=5532dab8-5bed-4278-a988-4fbb799a8012,cache=none,werror=stop,rerror=stop,aio=threads
```

Running the same plain mkfs.xfs in other VMs (e.g. CentOS 7) does not show the effect. It seems that only SLES12 SP2 runs mkfs.xfs with discard (or some similar option) by default.

We are using NFS 4.0 mounts in oVirt, at least judging by the following output:

```
100.64.251.1:/var/data/nas1/OVirtIB on /rhev/data-center/mnt/100.64.251.1:_var_data_nas1_OVirtIB type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,port=0,timeo=600,retrans=6,sec=sys,clientaddr=100.64.252.11,local_lock=none,addr=100.64.251.1)
```

Even if discard is NOT enabled for a disk in oVirt, and qemu is therefore NOT started with `-drive file=...discard=unmap...`, the disk inside the VM still advertises a discard granularity:

```
# cat /sys/block/sde/queue/discard_granularity
4096
```

---

(In reply to Markus Stockhausen from comment #1)
> I can confirm that (re)setting the discard flag will change the qemu command
> line:
>
> Disk with discard:
>
> -device
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=5,drive=drive-scsi0-0-0-5,
> id=scsi0-0-0-5
> -drive
> file=.../6e9edc6b-d133-4114-a275-be0d550afda8,format=raw,if=none,id=drive-

So it's indeed RAW.

---

(In reply to Markus Stockhausen from comment #0)
> Description of problem:
>
> We are running oVirt 4.1.2 in an XFS/NFS setup. Some of our VMs are SLES12
> SP2. When formatting thin provisioned disks inside these VMs with mkfs.xfs,
> the disk is fully allocated afterwards.
>
> Version-Release number of selected component (if applicable):
>
> Engine: oVirt 4.1.2
> Nodes: CentOS 7
> VM: SLES12 SP2
>
> How reproducible: 100%
>
> Scenario A: thin provisioned disk WITHOUT discard:
>
> 1. Create an NFS thin provisioned disk. Do not enable discard.
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    8 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 2. Start the VM and format the disk with mkfs.xfs -K.
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 3. Format the disk as usual with mkfs.xfs (without -K).
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20481 6e9edc6b-d133-4114-a275-be0d550afda8
>
> Scenario B: thin provisioned disk WITH discard:
>
> 1. Create an NFS thin provisioned disk. Enable discard. (For simplicity I
>    just moved the disk around our NFS storages and it was compacted.)
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    8 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 2. Start the VM and format the disk with mkfs.xfs -K.
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20 6e9edc6b-d133-4114-a275-be0d550afda8
>
> 3. Format the disk as usual with mkfs.xfs (without -K).
>    # du -m 6e9edc6b-d133-4114-a275-be0d550afda8
>    20481 6e9edc6b-d133-4114-a275-be0d550afda8
>
> Actual results:
>
> Disk is fully allocated on storage.

Can you compare the output of `du -ch` and `ls -lh` on the files? Also, run `qemu-img info` on them, please (I suspect they are raw-sparse).

---

```
# du -ch 6e9edc6b-d133-4114-a275-be0d550afda8
8.0M    6e9edc6b-d133-4114-a275-be0d550afda8
8.0M    total

# ls -lh 6e9edc6b-d133-4114-a275-be0d550afda8
-rw-rw----. 1 36 kvm 20G Jun 18 11:36 6e9edc6b-d133-4114-a275-be0d550afda8

# qemu-img info 6e9edc6b-d133-4114-a275-be0d550afda8
image: 6e9edc6b-d133-4114-a275-be0d550afda8
file format: raw
virtual size: 20G (21474836480 bytes)
disk size: 8.0M
```

---

(In reply to Markus Stockhausen from comment #7)
> [du -ch, ls -lh and qemu-img info output above]

Looks OK to me. As suspected, it is raw but sparsely allocated. Can you verify this is also the case in all scenarios, with and without discard support?

---

It is the same in both cases. Remember that I always use the same image: I change the disk flag in oVirt, start the VM and then move the image to another NFS storage. Afterwards it is small again.

Just created a completely new empty disk WITHOUT the discard flag. An `strace -tt mkfs.xfs /dev/sde1` inside the VM gave the following output:

```
...
12:01:08.166192 ioctl(4, BLKSSZGET, 512) = 0
12:01:08.166215 chdir("/root") = 0
12:01:08.166293 close(3)
12:01:08.166338 ioctl(4, BLKDISCARD, {0, 0}) = 0
12:01:41.963004 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
...
```

As you can see, the BLKDISCARD ioctl takes about 30 seconds (in this case for a 20 GB disk). It does not look like qemu skips the discard commands. Tracing qemu at that time gives tons of these:

```
34195 12:11:09.063477 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 12231626752 <unfinished ...>
34192 12:11:09.089665 <... pwrite resumed> ) = 16777216
34193 12:11:09.090354 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 16593698816 <unfinished ...>
34194 12:11:09.104689 <... pwrite resumed> ) = 16777216
34192 12:11:09.105191 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 14496550912 <unfinished ...>
34195 12:11:09.143977 <... pwrite resumed> ) = 16777216
34194 12:11:09.144939 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 12248403968 <unfinished ...>
34193 12:11:09.145499 <... pwrite resumed> ) = 16777216
34191 12:11:09.145783 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 16610476032 <unfinished ...>
34192 12:11:09.172102 <... pwrite resumed> ) = 16777216
34193 12:11:09.172661 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 14513328128 <unfinished ...>
34194 12:11:09.173719 <... pwrite resumed> ) = 16777216
34192 12:11:09.174230 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 12265181184 <unfinished ...>
34191 12:11:09.179010 <... pwrite resumed> ) = 16777216
34194 12:11:09.179477 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 16627253248 <unfinished ...>
34193 12:11:09.237510 <... pwrite resumed> ) = 16777216
34191 12:11:09.238988 pwrite(24, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16777216, 14530105344 <unfinished ...>
34192 12:11:09.276079 <... pwrite resumed> ) = 16777216
```

Why should it write zeros if discard is disabled for the disk?

---

> As you can see, the BLKDISCARD ioctl takes about 30 seconds (in this case
> for a 20 GB disk). It does not look like qemu skips the discard commands.
You need to differentiate between two things here:
1. Whether the UNMAP command is called or not (by the guest OS).
2. Whether qemu passes it to the underlying storage.
What you just saw is the UNMAP command called from the guest, which is ok, since you did not use the -K flag of mkfs.xfs.
What you can't see here is whether qemu passed it on to the underlying storage.
You can check it by examining the thinly provisioned underlying storage lun - its free space should not grow right after the UNMAP command is sent from the guest.
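For file-backed storage, the suggestion above has a simple analogue: if qemu really passes a discard through to an NFS-backed raw image, the image file's allocated block count drops while its apparent size stays the same. A minimal local sketch (the file name `disk.img` is arbitrary, and an ordinary local file stands in for the NFS-backed image):

```shell
# Fill an image file with 8 MiB of data: every block is now allocated.
dd if=/dev/urandom of=disk.img bs=1M count=8 status=none
echo "before discard: $(du -k disk.img | cut -f1) KiB allocated"

# Punch a hole over the whole range: this deallocates the blocks but keeps
# the apparent file size, which is what a passed-through discard does to a
# file-backed image.
fallocate --punch-hole --offset 0 --length 8M disk.img
echo "after discard:  $(du -k disk.img | cut -f1) KiB allocated"
echo "apparent size:  $(stat -c %s disk.img) bytes"

rm -f disk.img
```

If the allocated count does not drop while the guest is discarding, the discard is being swallowed (or rewritten as something else) somewhere along the chain.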
See my comment 12. From my current point of view, qemu wraps it into write zeros.

---

(In reply to Markus Stockhausen from comment #4)
> Even if discard is NOT enabled for a disk in oVirt, and qemu is therefore
> NOT started with -drive file=...discard=unmap..., the disk inside the VM
> still advertises a discard granularity:
>
> # cat /sys/block/sde/queue/discard_granularity
> 4096

Correct, but discard can also be ignored silently at any layer of the chain, and the higher layers are aware of that. I think with some sg_inq querying we can see whether discard is supported or not:

```
sudo sg_inq -p 0xb0 /dev/sde
```

could give you the information on unmap (or perhaps `sudo sg_vpd -p 0xb2 /dev/sde`, with the LBPU flag?).

---

> Why should it write zeros if discard is disabled for the disk?
1. Do these "pwrite"s also occur when using the -K flag?
I suspect that they are executed as part of the same process that calls discard and can be skipped by using -K.
2. Since you are using NFS V4.0 and discard is supported only from NFS V4.2, why don't you use the -K flag?
From what I understand, there is no reason to use "Enable Discard" nor to intentionally generate UNMAP calls that will not be used anyway, since it only causes performance degradation.
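Since discard only reaches the server on NFS 4.2 and later, it is worth checking the negotiated `vers=` of the mount explicitly. A small sketch that parses it out of a mount-options line (here a shortened copy of the line quoted earlier in this report; on a live host you would read the relevant line from `/proc/mounts` instead):

```shell
# Extract the negotiated NFS version from a mount-options string.
mount_line='100.64.251.1:/var/data/nas1/OVirtIB on /rhev/data-center/mnt/... type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576)'

vers=$(printf '%s\n' "$mount_line" | grep -o 'vers=[0-9.]*' | cut -d= -f2)
echo "negotiated NFS version: $vers"

# Hole punching only exists in NFS 4.2 and later.
case "$vers" in
    4.2) echo "discard can reach the server" ;;
    *)   echo "discard cannot reach the server" ;;
esac
```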
(In reply to Idan Shaby from comment #15)
> From what I understand, there is no reason to use "Enable Discard" nor to
> intentionally generate UNMAP calls that will not be used anyway, since it
> only causes performance degradation.

My problem is that I do NOT have "Enable Discard" active - the checkbox is empty. So I would expect qemu to drop the discard command coming from the VM and not to write gigabytes of zeros. The root of it all is the installation of an OS inside the VM: if you use the usual partitioning dialogue, you normally have no way of passing the -K flag.

A cross check with an NFS 4.2 mount shows the expected behaviour for the discard command.

---

I did several checks with NFS 4.0, 4.1 and 4.2. The discard flag has no effect at all. Checked or unchecked, the behaviour is always as follows:

- NFS 4.0 + NFS 4.1: discard inside the VM always results in writing zeros to the NFS image.
- NFS 4.2: discard inside the VM always results in a discard on the NFS image.

---

> NFS 4.0 + NFS 4.1
> Discard inside VM will always result in writing zeros to NFS image
Can you explain what exactly you mean by "Discard"? Which command are we talking about?
Sorry for mixing this up - in detail:

I'm talking about the qemu discard option:

- `discard=unmap` -> qemu should hand the unmap command over to the lower layers. This is the same as the oVirt per-disk option "Enable Discard" -> start qemu with `discard=unmap` for the disk.

Inside the VM we are talking about the unmap command:

- `unmap()` -> issued inside the VM by mkfs.xfs.

From my analysis, the oVirt (and thus qemu) option does not have any effect when using NFS. With `unmap()` inside the VM, qemu either writes zeroes (NFS 4.0 & 4.1) or unmaps data (NFS 4.2). From the comments above I understand that qemu should ignore any `unmap()` command from inside the VM if we are using NFS 4.0 or 4.1, because those protocol versions do not support unmap. To my surprise it is not dropped; instead qemu writes zeroes to empty the data on the NFS storage. Sounds like "does not work as expected".

---

(In reply to Markus Stockhausen from comment #18)
> I did several checks with NFS 4.0, 4.1 and 4.2. The discard flag has no
> effect at all. Checked or unchecked, the behaviour is always as follows:
>
> NFS 4.0 + NFS 4.1
> Discard inside VM will always result in writing zeros to NFS image
>
> NFS 4.2
> Discard inside VM will always result in discard to NFS image

Kevin - does it make sense that QEMU would translate a discard into writing zeros if the underlying FS does not support discard? (I remember the other way around - a zero write can become a discard if the underlying storage supports it.)

---

(In reply to Yaniv Kaul from comment #21)
> Kevin - does it make sense that QEMU would translate a discard into
> writing zeros if the underlying FS does not support discard?

No, it doesn't. If QEMU can't discard, it ignores the request, because discard is only a hint.

> (I remember the other way around - a zero write can become a discard if the
> underlying storage supports it)

What does the guest OS actually request from QEMU? If it sends a WRITE SAME SCSI command with the unmap flag set, then that's not a discard request but a write-zeros request, which QEMU is allowed (but not required) to fulfill by doing a discard instead. If it can't use discard, it must write explicit zeros. (This is the case that you remember.) If the guest sends an UNMAP command, however, then QEMU can ignore the request if discarding isn't possible.

---

See comment 10. The guest runs mkfs.xfs. strace shows the following output, with a 30-second delay during the BLKDISCARD ioctl:

```
...
12:01:08.166192 ioctl(4, BLKSSZGET, 512) = 0
12:01:08.166215 chdir("/root") = 0
12:01:08.166293 close(3)
12:01:08.166338 ioctl(4, BLKDISCARD, {0, 0}) = 0
12:01:41.963004 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
...
```

---

So essentially, if the underlying FS (NFS earlier than v4.2) does not support DISCARD, it passes this information on to the guest, which then uses zero writes instead of DISCARD? Makes some sense to me... It'd be interesting to see what the guest sees in both cases, using:

```
sudo sg_inq -p 0xb0 /dev/sde
```

which could give you the information on unmap (or perhaps `sudo sg_vpd -p 0xb2 /dev/sde`, with the LBPU flag?).

---

(In reply to Yaniv Kaul from comment #24)
> So essentially, if the underlying FS (NFS earlier than v4.2) does not
> support DISCARD, it passes this information on to the guest, which then uses
> zero writes instead of DISCARD? Makes some sense to me...

But the outcome of discarding a block of data and writing zeroes to it is not the same, so how can that be? And what do you mean by saying that the FS passes the information to the guest? Can you clarify?

I still need to investigate this issue, but from a quick glance at mkfs.xfs I do see that it may write zeros (128M) to "the beginning and end of the device to remove traces of other filesystems, raid superblocks, etc" (xfsprogs-dev/mkfs/xfs_mkfs.c, WHACK_SIZE). Maybe if we can discard we do, and if we can't we zero those areas manually? Worth checking.
Can you please attach the full command that you execute inside the vm along with its trace and qemu's trace on NFS4.1 and 4.2? Had to use a new machine. VMs /dev/sde is now /dev/sdb 1) NFS 4.2 + Ovirt disk discard-flag unchecked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 Write same (16) with unmap bit supported (LBWS): 1 Write same (10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 2) NFS 4.2 + Ovirt disk discard-flag checked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 Write same (16) with unmap bit supported (LBWS): 1 Write same 
(10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 3) NFS 4.1 + Ovirt disk discard-flag unchecked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 Write same (16) with unmap bit supported (LBWS): 1 Write same (10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 4) NFS 4.1 + Ovirt disk discard-flag checked # sg_inq -p 0xb0 /dev/sdb VPD INQUIRY: Block limits page (SBC) Maximum compare and write length: 0 blocks Optimal transfer length granularity: 0 blocks Maximum transfer length: 4194303 blocks Optimal transfer length: 0 blocks Maximum prefetch transfer length: 0 blocks Maximum unmap LBA count: 2097152 Maximum unmap block descriptor count: 255 Optimal unmap granularity: 8 Unmap granularity alignment valid: 0 Unmap granularity alignment: 0 Maximum write same length: 0x3fffff blocks Maximum atomic transfer length: 0 Atomic alignment: 0 Atomic transfer length granularity: 0 # sg_vpd -p 0xb2 /dev/sdb Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 
Write same (16) with unmap bit supported (LBWS): 1 Write same (10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 0 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Minimum percentage: 0 Provisioning type: 2 Threshold percentage: 0 Created attachment 1291730 [details]
vm + hypervisor logs
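The `sg_vpd` captures above can also be checked mechanically. This sketch extracts the LBPU flag from captured output (the sample lines are copied from this report; on a live guest you would pipe `sg_vpd -p 0xb2 /dev/sdb` in directly):

```shell
# sg_vpd -p 0xb2 reports whether the device advertises UNMAP support.
# Here we parse captured output instead of querying a real device.
vpd_output='Logical block provisioning VPD page (SBC):
  Unmap command supported (LBPU): 1
  Write same (16) with unmap bit supported (LBWS): 1'

lbpu=$(printf '%s\n' "$vpd_output" | sed -n 's/.*(LBPU): *\([0-9]\).*/\1/p')
echo "LBPU=$lbpu"

# LBPU is 1 in all four scenarios above: the guest is always told UNMAP is
# supported, regardless of the oVirt discard flag or the NFS version.
```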
From the logs we see that mkfs inside the VM executes the BLKDISCARD ioctl, and this translates into:

- qemu on NFS 4.2 executing fallocate calls;
- qemu on NFS 4.1 executing zero writes.

---

(In reply to Markus Stockhausen from comment #29)
> mkfs inside VM executes ioctl BLKDISCARD
>
> this translates into:
>
> qemu on nfs 4.2 executing fallocate calls

OK, this makes sense to me - I forgot this is RAW and not qcow2. I assume it's FALLOC_FL_PUNCH_HOLE (https://github.com/qemu/qemu/blob/0748b3526e8cb78b9cd64208426bfc3d54a72b04/block/file-posix.c#L1396 perhaps?). The file is still sparse, right?

> qemu on nfs 4.1 executing write zero calls

I think this is what fallocate does on NFS before 4.2 (http://thread.gmane.org/gmane.linux.nfs/59563), which should translate to essentially the same size, but without the real sparsification (writing all zeros may or may not allocate disk space for real, depending on the storage backend).

---

That leaves two questions open:

1. Does qemu present raw disk images stored on NFS 4.1 / 4.2 to the client in the right fashion?
2. As the discard option does not make sense for images on NFS, should it be disabled or automatically set to the right value (matching the NFS version)?

---

(In reply to Markus Stockhausen from comment #31)
> 1. Does qemu present raw disk images stored on NFS 4.1 / 4.2 to the client
> in the right fashion?

oVirt always uses raw-sparse disks on file-based storage.

> 2. As the discard option does not make sense for images on NFS, should it
> be disabled or automatically set to the right value (matching the NFS
> version)?

I'm not sure. It still leaves sparse files on NFS, and I'd imagine that on 4.2 it can pass the discard to the underlying storage, so it looks like it's better to leave it as is. Note that some underlying storage knows how to dedup/compress zero-filled blocks intelligently, so it does make sense to bother and write zeros.

Looks like it's not a bug. Please reopen if you have further questions.
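The zero-write path seen in the qemu traces can be reproduced on a local filesystem. This is a sketch with an arbitrary file name, and plain local storage standing in for the NFS image: a raw-sparse image starts with no blocks allocated, and the zero-write fallback used on NFS 4.1 and older allocates the full virtual size, which is exactly the `du -m` jump from 8 to 20481 MB reported above.

```shell
# A raw-sparse image: full apparent size, essentially zero blocks allocated.
truncate -s 16M img.raw
echo "fresh image: $(du -k img.raw | cut -f1) KiB allocated"

# The NFS <= 4.1 fallback seen in the traces: "discarding" by writing
# zeros. On an ordinary filesystem, zero-filled blocks stay allocated,
# so the image balloons to its full virtual size.
dd if=/dev/zero of=img.raw bs=1M count=16 conv=notrunc status=none
echo "after zero writes: $(du -k img.raw | cut -f1) KiB allocated"
echo "apparent size: $(stat -c %s img.raw) bytes"

rm -f img.raw
```

(As noted in the closing comment, some storage backends dedup or compress zero blocks, so on such backends the allocation may not actually grow.)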