Bug 1850660 - Image files in the XFS filesystem are getting heavily fragmented
Summary: Image files in the XFS filesystem are getting heavily fragmented
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.2
Hardware: All
OS: Linux
Severity: high
Priority: high
Target Milestone: rc
Target Release: 8.3
Assignee: Kevin Wolf
QA Contact: Tingting Mao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-24 16:03 UTC by nijin ashok
Modified: 2023-10-06 20:48 UTC (History)
14 users

Fixed In Version: qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-17 17:49:34 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description nijin ashok 2020-06-24 16:03:51 UTC
Description of problem:

The I/O operations within the VM are creating a heavily fragmented image in the XFS filesystem. I created a RHEL VM on new XFS storage from RHV, and after installation, filefrag showed 57320 extents for just 1.4 GB of data.

===
Storage size before installation.
-----------

du -sch /home/vgpu_storage/
396K	/home/vgpu_storage/
396K	total

Image info and filefrag size
-----------

[root@dell-r740-3 home]# qemu-img info /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af
image: /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af
file format: raw
virtual size: 40G (42949672960 bytes)
disk size: 1.4G


[root@dell-r740-3 home]# filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af: 57320 extents found
===

Then, after writing 15 GB of data from the VM (dd'ed /dev/urandom to a file inside the VM), the extent count increased to 286956.

===
[root@dell-r740-3 home]# filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af: 286956 extents found
===

However, the free space in the filesystem was not fragmented at all, yet the image file is heavily fragmented.

===
for AGNO in `seq 0 3`;do /usr/sbin/xfs_db -r -c "freesp -s -a $AGNO" /dev/mapper/rhel-home ;done |egrep -i "total free blocks|total free extents|average free extent size"
total free extents 5
total free blocks 9155060
average free extent size 1.83101e+06
total free extents 9
total free blocks 9155048
average free extent size 1.01723e+06
total free extents 1336
total free blocks 5130570
average free extent size 3840.25
total free extents 9
total free blocks 9155060
average free extent size 1.01723e+06
===

Then a snapshot was created and another 35 GB of data was written from the VM; below is the filefrag output of the snapshot image.

===
[root@dell-r740-3 home]# filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651: 127524 extents found

[root@dell-r740-3 home]# du -sch /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651
35G	/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651
35G	total
===

Since the image file is heavily fragmented, all the "qemu-img" commands take a very long time to complete, which also creates extreme load on the host. A simple measure command took 15 minutes to complete.

===
[root@dell-r740-3 home]# time qemu-img measure -O qcow2 /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651 -U

required size: 37842124800
fully allocated size: 42956488704

real	14m39.385s
user	0m0.173s
sys	14m35.241s
===

The same applies to most storage operations, such as map, commit, and live snapshot deletion: due to the high fragmentation, all of these commands take far too long to complete.

As per the strace output, most of the time is spent in "lseek" calls.
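For context, the allocation probing that qemu-img performs boils down to an alternating SEEK_DATA/SEEK_HOLE walk over the file. A minimal Python sketch of that loop (illustrative only, not QEMU's actual code; the file path is made up):

```python
import os

def data_segments(path):
    """Walk a file with alternating SEEK_DATA/SEEK_HOLE, the same
    pattern qemu-img uses to probe allocation, and return the
    (start, end) byte ranges that hold data."""
    segments = []
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                data = f.seek(offset, os.SEEK_DATA)  # next allocated byte
            except OSError:                          # ENXIO: no data past offset
                break
            hole = f.seek(data, os.SEEK_HOLE)        # end of this data run
            segments.append((data, hole))
            offset = hole
    return segments

# Demo: a sparse file with 4 KiB of data followed by a ~1 MiB hole.
with open("/tmp/sparse_demo.img", "wb") as f:
    f.write(b"x" * 4096)
    f.truncate(1 << 20)
print(data_segments("/tmp/sparse_demo.img"))
```

On a heavily fragmented file, each of these lseek calls has to consult a much larger extent tree in the kernel, which is where the strace output suggests the time is going.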

As mentioned, free space in the filesystem is not fragmented at all, and writing files directly on the host doesn't produce a fragmented file.

===
[root@dell-r740-3 home]# dd if=/dev/urandom bs=1M of=vgpu_storage/test_1.out count=20000 
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB) copied, 152.096 s, 138 MB/s


[root@dell-r740-3 home]# filefrag vgpu_storage/test_1.out
vgpu_storage/test_1.out: 1 extent found
===

Also, I was able to defragment the image files down to far fewer extents.

===
[root@dell-r740-3 home]# xfs_fsr -v /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af
extents before:286982 after:50      /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-becb-8dda497f91af

[root@dell-r740-3 home]# xfs_fsr -v /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651
extents before:127525 after:7      /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651
===

The measure command completes in less than a second after defragmentation.

===
home]# time qemu-img measure -O qcow2 /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/45381766-c5b1-4200-9b19-ea53702e4651

required size: 37842124800
fully allocated size: 42956488704

real	0m0.818s
user	0m0.130s
sys	0m0.364s
===


Version-Release number of selected component (if applicable):

qemu-kvm-rhev-2.12.0-44.el7_8.2.x86_64
kernel-3.10.0-1127.8.2.el7.x86_64
Red Hat Enterprise Linux Server release 7.8 (Maipo)


How reproducible:

100%

Steps to Reproduce:

Create a RAW thin-provisioned image and write data from the VM. Check the fragmentation of the image file and the time taken for the "qemu-img" commands to complete.

Actual results:

Image files in the XFS storage are getting heavily fragmented. Because of this, all storage operations take a very long time to complete, which causes timeouts in RHV and leads to major critical issues on the RHV side. These operations also occupy the CPU for long periods, causing issues for the running VMs.

Expected results:

The image file should not get highly fragmented since free space in XFS is not fragmented.

Additional info:

Comment 5 Kevin Wolf 2020-06-25 16:14:40 UTC
This seems to have some overlap with bug 1666864. Let's involve Brian and Dave from the start, in particular because in the other bug Dave said that, in theory, a large number of extents doesn't necessarily mean there is a problem.

(In reply to nijin ashok from comment #0)
> Since the image file is heavily fragmented, all the "qemu-img" commands are
> taking too much time to complete which also creates an extreme load on the
> host. A simple measure command took 15 mins to complete.

Which exact qemu-img subcommands do you mean here? "qemu-img measure" and particularly "qemu-img convert" are different from normal VM operation in that they query the allocation status of blocks a lot, which translates to lseek(SEEK_HOLE/DATA)...

> As per the strace output, it's spending most of the time on "lseek" calls.

...which I assume is what you're seeing here, right? Can you provide the strace output, both for the fragmented and the defragmented case for comparison?

Newer QEMU versions just trust the qcow2 metadata instead of asking the filesystem for finer grained information. We could probably backport that. It probably wouldn't make a difference for raw images, though, because the filesystem is the only source for this information there.

Comment 6 Brian Foster 2020-06-25 18:14:06 UTC
(In reply to nijin ashok from comment #0)
...
> file format: raw
> virtual size: 40G (42949672960 bytes)
> disk size: 1.4G
> 
> 
> [root@dell-r740-3 home]# filefrag
...
> becb-8dda497f91af: 57320 extents found
> ===
> 
> Then after writing 15 GB of data from the VM (dd'ed urandom to a file from
> VM), the extents increased to 286956.
> 
> ===
...
> becb-8dda497f91af: 286956 extents found
> ===
> 
...

So you started off with 57320 extents for ~1.4GB of data, which is ~25kB per extent. The 15GB write resulted in the addition of 229636 extents, so that operation averaged under 70kB per extent.
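(For the record, the per-extent arithmetic above can be checked with a throwaway snippet, using the numbers from the report:)

```python
GiB = 1 << 30
print(1.4 * GiB / 57320)            # ~26 KB per extent after installation
print(15 * GiB / (286956 - 57320))  # ~70 KB per extent for the 15 GB write
```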

...
> 
> Also, I was able to defragment the image files to a much less extends.
> 
> ===
> [root@dell-r740-3 home]# xfs_fsr -v
> /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-
> 361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-
> becb-8dda497f91af
> /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-
> 361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-
> becb-8dda497f91af
> extents before:286982 after:50     
> /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-
> 361f6eadbbcd/images/d878fe60-8472-49d3-a75b-7fc250d72e5c/2d53c195-248f-4670-
> becb-8dda497f91af
> 
...

Then this was reduced to 50 extents by xfs_fsr. That suggests that at least contiguous ranges of the image file are being allocated and that the host fs has sufficient contiguous free space to accommodate, but contiguous allocations aren't happening for whatever reason. Do we know the I/O mode of the associated vdisk? For example, are writes via the guest cached/transformed into buffered writes in the host (which would facilitate larger allocations) or direct I/O (which would result in smaller allocations)?

Note that in general it's recommended to use extent size hints when using sparse files on XFS as raw vdisk images. That forces a minimum allocation granularity on the file and so mitigates fragmentation due to small random writes over time. Given the data here, I suspect an extent size hint as small as 1MB (xfs_io -c "extsize 1m" <imgfile>) would have significantly reduced the extent count over the course of this workload.

Comment 7 nijin ashok 2020-06-26 03:21:14 UTC
(In reply to Kevin Wolf from comment #5)

> Which are the exact qemu-img subcommands you mean here? "qemu-img measure"
> and partically "qemu-img convert" are different from normal VM operation in
> that they query the allocation status of blocks a lot, which translates to
> lseek(SEEK_HOLE/DATA)...

The commands I tried are measure, map, and commit. Also, live merge was very slow.

> 
> > As per the strace output, it's spending most of the time on "lseek" calls.
> ...which I assume is what you're seeing here, right? Can you provide the
> strace output, both for the fragmented and the defragmented case for
> comparison?

Attaching strace outputs of measure command for the below image file.

==
qemu-img info /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/cc818403-e102-42a9-a5be-2d48cc0f5c64/8c259d2d-171e-4b2d-81e0-ea2d44edb59f -U
image: /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/cc818403-e102-42a9-a5be-2d48cc0f5c64/8c259d2d-171e-4b2d-81e0-ea2d44edb59f
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 20G
cluster_size: 65536
backing file: 517fdf98-d8d5-4b9c-900d-7521d4007c59 (actual path: /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/cc818403-e102-42a9-a5be-2d48cc0f5c64/517fdf98-d8d5-4b9c-900d-7521d4007c59)
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
===

Both the images were heavily fragmented.

===
filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/cc818403-e102-42a9-a5be-2d48cc0f5c64/8c259d2d-171e-4b2d-81e0-ea2d44edb59f
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/cc818403-e102-42a9-a5be-2d48cc0f5c64/8c259d2d-171e-4b2d-81e0-ea2d44edb59f: 76048 extents found

filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/cc818403-e102-42a9-a5be-2d48cc0f5c64/517fdf98-d8d5-4b9c-900d-7521d4007c59
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/cc818403-e102-42a9-a5be-2d48cc0f5c64/517fdf98-d8d5-4b9c-900d-7521d4007c59: 184203 extents found
===

It took 5 minutes for the map command to complete, and it spends most of its time in "lseek" calls. There are 60879 calls, and only 130-140 of them were completing per second.

===
grep "lseek" /var/tmp/strace_measure_frag|wc -l
60879


grep "22:32:19 lseek" /var/tmp/strace_measure_frag|wc -l
134

grep "22:32:20 lseek" /var/tmp/strace_measure_frag|wc -l
134

grep "22:32:25 lseek" /var/tmp/strace_measure_frag|wc -l
136
===

Although the defragmented image involves about the same number of lseek calls, the 60860 calls took only 3 seconds to complete.

===
grep "lseek" /var/tmp/strace_measure_defrag|wc -l
60860

grep "22:46:27 lseek" /var/tmp/strace_measure_defrag |wc -l
24345

grep "22:46:28 lseek" /var/tmp/strace_measure_defrag |wc -l
30274

grep "22:46:29 lseek" /var/tmp/strace_measure_defrag |wc -l
6241
===

> 
> Newer QEMU versions just trust the qcow2 metadata instead of asking the
> filesystem for finer grained information. We could probably backport that.
> It probably wouldn't make a difference for raw images, though, because the
> filesystem is the only source for this information there.

Comment 9 nijin ashok 2020-06-26 03:45:29 UTC
(In reply to Brian Foster from comment #6)
> Then this was reduced to 50 extents by xfs_fsr. That suggests that at least
> contiguous ranges of the image file are being allocated and that the host fs
> has sufficient contiguous free space to accommodate, but contiguous
> allocations aren't happening for whatever reason. Do we know the I/O mode of
> the associated vdisk? For example, are writes via the guest
> cached/transformed into buffered writes in the host (which would facilitate
> larger allocations) or direct I/O (which would result in smaller
> allocations)?

RHV doesn't use cache for disk I/O by default. I changed this to writeback cache and got interesting results: the 20 GB image file had only 18 extents.

===
filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/f5b872f7-83b1-46e4-ab52-0b6831604677/26ea58af-18c5-4c98-b1bf-ea82574a7dfc
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/f5b872f7-83b1-46e4-ab52-0b6831604677/26ea58af-18c5-4c98-b1bf-ea82574a7dfc: 18 extents found
===

I believe we can use this as a workaround for the customer.

> 
> Note that in general it's recommended to use extent size hints when using
> sparse files on XFS as raw vdisk images. That forces a minimum allocation
> granularity on the file and so mitigates fragmentation due to small random
> writes over time. Given the data here, I suspect an extent size hint as
> small as 1MB (xfs_io -c "extsize 1m" <imgfile>) would have significantly
> reduced the extent count over the course of this workload.

In RHV, the image files are not created by the user. The user creates the disk from the GUI, and the image files are created in the backend as required. The location of an image file and information about it are not exposed to the user. Images are removed and added as the user performs tasks (like snapshot operations) from the GUI. So setting this manually is not a feasible option in RHV.

Comment 10 Brian Foster 2020-06-26 13:01:51 UTC
(In reply to nijin ashok from comment #9)
...
> In RHV, the image files are not created by the user. The user creates the
> disk from GUI and the images files are created in the backend as per the
> requirement. The location of the image file and information about it is not
> exposed to the user. They are removed and added as when the user does tasks
> (like snapshot operation) from the GUI. So setting this manually is not a
> feasible option in RHV.

Sure, but could whatever backend component implements raw vdisk image creation set a default extent size hint if 1.) caching is not enabled and 2.) the underlying fs is XFS? It might be worth opening up a broader discussion on the appropriate default value, but 1MB seems fairly conservative to me. Also, couldn't the user override it manually before the image is used (XFS allows this only until extents are allocated in the file)? Or perhaps a future GUI could expose an option+value as an image creation parameter.

Also note that an extent size hint can be set on a directory. If set, the hint is automatically inherited by files created in the directory. However, this would change allocation patterns for all images created in the directory, not just those configured for uncached I/O, and thus may not be appropriate for this use case.

Comment 11 Kevin Wolf 2020-06-26 13:35:55 UTC
(In reply to Brian Foster from comment #10)
> Sure, but can whatever backend component that implements raw vdisk image
> creation implement a default extent size hint if 1.) caching is not enabled
> and 2.) the underlying fs is XFS?

The image file is created in QEMU code, and we can certainly set an extent size hint during image creation if we know the right value. We can't, however, know whether the image will later be used with buffered or direct I/O, because this doesn't depend on the image file but on the options with which a VM is started. So I'm afraid the best we can offer is making the extent size hint an option; the management layer can then set it according to how it intends to use the image later (if it even knows - the user might change their decision later). This also means that the default, as long as management tools don't know about this option yet, must work for both cases.

In case of doubt, direct I/O is the much more important case, though.

> It might be worth opening up a broader discussion on the appropriate
> default value, but 1MB seems fairly conservative to me.

By conservative you mean, you would normally go for something larger even?

Just to check that I understand the effect of the extent size hint correctly: If I write some data somewhere and an allocation is necessary, the extent start and end are rounded to the next multiple of the extent size hint? So the tradeoff is that you get less fragmentation, but you potentially waste some space if the rest of the extent is never written to. Or is the mechanism more complex than this?
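To make sure we're talking about the same thing, here is the model I have in mind as a tiny Python sketch (my assumption of the semantics, not XFS code):

```python
def hinted_allocation(offset, length, hint):
    """Model of an extent size hint: round the allocated byte range
    outward to multiples of the hint (simplified; real XFS also
    considers neighbouring extents and available free space)."""
    start = (offset // hint) * hint                      # round down
    end = ((offset + length + hint - 1) // hint) * hint  # round up
    return start, end

# A 4 KiB write at offset 10 KiB with a 1 MiB hint would allocate the
# whole first megabyte: (0, 1048576).
print(hinted_allocation(10 * 1024, 4 * 1024, 1024 * 1024))
```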

> Also note that an extent size hint can be set on a directory. If set, the
> hint is automatically inherited by files created in the directory. However,
> this would change allocation patterns for all images created in the
> directory, not just those configured for uncached I/O, and thus may not be
> appropriate for this use case.

We could maybe suggest this as a workaround until the QEMU code is changed.

Brian, I wonder, even on the fragmented file, does the time the SEEK_HOLE/DATA calls take look expected to you? Even if each lseek() required a disk access to fetch metadata, just under 200 lseeks() per second doesn't seem like a lot for an SSD.

Comment 12 Brian Foster 2020-06-26 14:57:37 UTC
(In reply to Kevin Wolf from comment #11)
> (In reply to Brian Foster from comment #10)
> > Sure, but can whatever backend component that implements raw vdisk image
> > creation implement a default extent size hint if 1.) caching is not enabled
> > and 2.) the underlying fs is XFS?
> 
> The image file is created in QEMU code and we can certainly set an extent
> size hint during image creation if we know the right value. We can't however
> know whether the image will later be used with buffered or direct I/O
> because this doesn't depend on the image file, but on the options with which
> a VM is started. So I'm afraid the best we can offer is making the extent
> size hint an option and then the management layer can set it according to
> how it intends to use the image later (if it even knows - the user might
> change their decision later). This also means that the default, while
> management tools don't know this option yet, must work for both cases.
> 
> In case of doubt, direct I/O is the much more important case, though.
> 

I wouldn't expect qemu-img itself to implement this policy, if that's what you're referring to. I suppose it could grow the mechanism to set a hint on an image file and then some higher level code (for example, wherever the user currently specifies image format) could specify whether to set a hint as well. That said, I'm not familiar enough with the full stack involved here to know how feasible something like that is.

> > It might be worth opening up a broader discussion on the appropriate
> > default value, but 1MB seems fairly conservative to me.
> 
> By conservative you mean, you would normally go for something larger even?
> 

I think it depends on image size, use case, and probably some degree of testing. I just mean that to me a 1MB allocation granularity for image sizes in the tens of GB seems like a fairly reasonable tradeoff of space consumption for efficiency. Perhaps it doesn't (or shouldn't) matter as much for SSDs.

> Just to check that I understand the effect of the extent size hint
> correctly: If I write some data somewhere and an allocation is necessary,
> the extent start and end are rounded to the next multiple of the extent size
> hint? So the tradeoff is that you get less fragmentation, but you
> potentially waste some space if the rest of the extent is never written to.
> Or is the mechanism more complex than this?
> 

That's pretty much it. It's easy to test by setting a hint on an empty file via 'xfs_io -c "extsize <size>" <file>', doing some random writes and evaluating the result with fiemap (xfs_io -c fiemap <file>).

> > Also note that an extent size hint can be set on a directory. If set, the
> > hint is automatically inherited by files created in the directory. However,
> > this would change allocation patterns for all images created in the
> > directory, not just those configured for uncached I/O, and thus may not be
> > appropriate for this use case.
> 
> We could maybe suggest this as a workaround while we haven't changed the
> QEMU code yet.
> 
> Brian, I wonder, even on the fragmented file, does the time the
> SEEK_HOLE/DATA calls take look expected to you? Even if each lseek()
> required a disk access to fetch metadata, just under 200 lseeks() per second
> doesn't seem like a lot for an SSD.

Hmm.. it's hard to say based on just the raw strace output, particularly since it doesn't look like it includes syscall timing. Also after the first lookup the entire extent tree should be read from disk into memory and all subsequent lseeks served from that in-memory tree, so I'm not sure that the storage device will make much of a difference for the lseek calls themselves vs. what the application might be doing with the associated data. Is the qemu-img tool that demonstrates this slowdown on the fragmented file actually reading data or just processing extent metadata?

Regardless, strace output with timing data might be more informative. Could somebody attach the extent list output ('xfs_io -c "fiemap -v" <file>' redirected to a file and compressed) of the fragmented variant of the image file and the specific qemu-img command that's being traced and demonstrates the slowdown?

Comment 13 Kevin Wolf 2020-06-26 15:57:55 UTC
(In reply to Brian Foster from comment #12)
> I wouldn't expect qemu-img itself to implement this policy, if that's what
> you're referring to. I suppose it could grow the mechanism to set a hint on
> an image file and then some higher level code (for example, wherever the
> user currently specifies image format) could specify whether to set a hint
> as well. That said, I'm not familiar enough with the full stack involved
> here to know how feasible something like that is.

Yes, the mechanism needs to be in qemu-img because, depending on the given options, it can already preallocate data. And for non-raw image formats it has to write something to the file anyway. The actual policy can't be in qemu-img because we can't know what the image will be used for, so RHV and possibly libvirt will need to implement support for the new option.

What we do need to do in qemu-img is choosing a default that is used when the option isn't specified. We could just leave it disabled by default, but maybe we can pick a default that makes sense for most people.

> Hmm.. it's hard to say based on just the raw strace output, particularly
> since it doesn't look like it includes syscall timing.

Yeah, it doesn't tell the syscall timing explicitly, but I think the comparison with the defragmented case (which contains more or less the same syscalls with the same results, just faster) indicates that the user space logic is probably doing the same as before and the difference must be in the syscalls.

But maybe we can get another strace of the fragmented case that actually contains the syscall timing information? (Oh, and sub-second precision timestamps would have been nice, too.)

> Also after the first
> lookup the entire extent tree should be read from disk into memory and all
> subsequent lseeks served from that in-memory tree, so I'm not sure that the
> storage device will make much of a difference for the lseek calls themselves
> vs. what the application might be doing with the associated data. Is the
> qemu-img tool that demonstrates this slowdown on the fragmented file
> actually reading data or just processing extent metadata?

Ok, having everything (or a good part of it) in memory is what I expected and hoped.

The examples above use "qemu-img measure", which prints the amount of space needed to convert a given image. On raw images, it's basically just a simulated sparse copy. As you can see from the strace, apart from the initialisation, it's really only processing extent metadata, with alternating SEEK_HOLE and SEEK_DATA.

Comment 14 Brian Foster 2020-06-26 18:21:45 UTC
(In reply to Kevin Wolf from comment #13)
...
> 
> Yeah, it doesn't tell the syscall timing explicitly, but I think the
> comparison with the defragmented case (which contains more or less the same
> syscalls with the same results, just faster) indicates that the user space
> logic is probably doing the same as before and the difference must be in the
> syscalls.
> 
...

Yeah, I guess we'd also see reads/writes in the trace if that were going on.

> The examples above use "qemu-img measure", which prints the amount of space
> needed to convert a given image. On raw images, it's basically just a
> simulated sparse copy. As you can see from the strace, apart from the
> initialisation, it's really only processing extent metadata, with
> alternating SEEK_HOLE and SEEK_DATA.

In taking a closer look at the traces (and being unfamiliar with the associated tool), I see a sequence of SEEK_DATA/SEEK_HOLE call pairs at matching offsets. The SEEK_HOLE calls all return the same offset, which corresponds to the return from SEEK_END so appears to be EOF. Can somebody familiar with this algorithm elaborate on why there are so many SEEK_DATA calls if the SEEK_HOLE calls all seem to point to EOF?

Comment 15 Eric Sandeen 2020-06-26 22:29:33 UTC
Ademar asked me to chime in on this one (Ademar: in the future please just set needinfo in the bug if you need me to look at one; the round trip for gchat notifications via email is ~24 hrs)

It looks like Brian's on top of it, and I agree that in general the extent size hint is recommended if we're filling in a sparse file like this.  I sort of thought we'd already adopted that as best practice in this situation...

Comment 16 Kevin Wolf 2020-06-29 11:24:46 UTC
(In reply to Brian Foster from comment #14)
> In taking a closer look at the traces (and being unfamiliar with the
> associated tool), I see a sequence of SEEK_DATA/SEEK_HOLE call pairs at
> matching offsets. The SEEK_HOLE calls all return the same offset, which
> corresponds to the return from SEEK_END so appears to be EOF. Can somebody
> familiar with this algorithm elaborate on why there are so many SEEK_DATA
> calls if the SEEK_HOLE calls all seem to point to EOF?

Basically an artifact of querying the block status for each fragment on the qcow2 layer without caching anything. You wouldn't see this in current RHEL 8 versions, and the possible backport I mentioned in comment 5 would get rid of these calls because we would trust that if the qcow2 metadata says something is allocated, it will be so on the filesystem level, too.

Of course, for raw images we'll still issue SEEK_DATA/HOLE a lot because we still depend on the filesystem information there, but the pattern won't look as redundant then. Maybe Nijin can test how raw images behave in his case? For a more theoretical case, it should be easy enough to construct an artificial raw image that is very fragmented, where data blocks and sparse blocks alternate, so you get a lot of lseek() calls that can't be optimised away by QEMU.

Anyway, regardless of whether the lseek() calls are actually necessary in every case: if everything needed is in memory, how can it be so slow?

(In reply to Eric Sandeen from comment #15)
> It looks like Brian's on top of it, and I agree that in general the extent
> size hint is recommended if we're filling in a sparse file like this.  I
> sort of thought we'd already adopted that as best practice in this
> situation...

Yes, I'll have a look.

Though, to digress a bit, adding filesystem specific code in QEMU isn't our favourite activity. If something is generally useful and even considered a best practice rather than an obscure hack, why is there no generic kernel interface? Another example of something filesystem specific we use, to digress even more, is XFS_IOC_DIOINFO. This is something that the kernel should provide for any file with a generic interface. Second guessing in userspace isn't fun and can't work reliably, and even if every filesystem supported some interface, adding special code for each of them wouldn't scale. If userspace is expected to use something as a best practice (or to make things even work in the first place like with O_DIRECT), please make it a first class interface rather than a filesystem specific ioctl. (End of rant. ;-))

Comment 17 Brian Foster 2020-06-29 13:59:47 UTC
(In reply to Kevin Wolf from comment #16)
> (In reply to Brian Foster from comment #14)
> > In taking a closer look at the traces (and being unfamiliar with the
> > associated tool), I see a sequence of SEEK_DATA/SEEK_HOLE call pairs at
> > matching offsets. The SEEK_HOLE calls all return the same offset, which
> > corresponds to the return from SEEK_END so appears to be EOF. Can somebody
> > familiar with this algorithm elaborate on why there are so many SEEK_DATA
> > calls if the SEEK_HOLE calls all seem to point to EOF?
> 
> Basically an artifact of querying the block status for each fragment on the
> qcow2 layer without caching anything. You wouldn't see this in current RHEL
> 8 versions, and the possible backport I mentioned in comment 5 would get rid
> of these calls because we would trust that if the qcow2 metadata says
> something is allocated, it will be so on the filesystem level, too.
> 

Ok, thanks. I was more asking from the angle of whether qemu-img was doing something odd or unexpected that was consuming time. It sounds like this is expected, so we can disregard this for now.

> Of course, for raw images we'll still issue SEEK_DATA/HOLE a lot because we
> still depend on the filesystem information there, but the pattern won't look
> as redundant then. Maybe Nijin can test how raw images behave in his case?
> For a more theoretical case, it should be easy enough to construct an
> artificial raw image that is very fragmented and where data blocks and
> sparse blocks alternate so you get a lot of lseek() calls that can't be
> optimised away by QEMU.
> 
> Anyway, no matter if the lseek() calls are actually necessary in every case,
> if everything needed is in memory, how can it be so slow?
> 
...

That's not clear to me. All we really know so far is that the utility is taking quite a bit of time to execute thousands of seek calls. I just created a 20GB file, punched out every other 64k block and ran the qemu-img measure command on the resulting file. It completes in just under 1s (but requires ~30s with strace). The command itself makes ~480k lseek() calls, so clearly I'm not reproducing the problem with this test.
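
For reference, the kind of test file described above can be reconstructed with a short sketch like the following (sizes scaled way down; the path, block size, and block count are illustrative). Instead of punching holes out of a fully written file, it writes only every other 64k block, so the skipped blocks are never allocated, which yields the same alternating data/hole layout:

```python
import os
import tempfile

BLOCK = 64 * 1024    # 64k granularity, as in the test described above
NBLOCKS = 256        # scaled down from 20GB to ~16MB for illustration

def make_alternating_file(path):
    """Write every other 64k block; the skipped blocks are never
    allocated, so the layout alternates data extent / hole."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        data = b"\xab" * BLOCK
        for i in range(0, NBLOCKS, 2):      # blocks 0, 2, 4, ... get data
            os.pwrite(fd, data, i * BLOCK)
        os.ftruncate(fd, NBLOCKS * BLOCK)   # end the file in a hole
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "frag.img")
make_alternating_file(path)
st = os.stat(path)
print("size:", st.st_size, "allocated:", st.st_blocks * 512)
```

On a hole-supporting filesystem this gives roughly NBLOCKS/2 separate data extents; scaling NBLOCKS back up reproduces a seek-heavy measure workload.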

That was on a Fedora vm. I went to repeat on RHEL7 (kernel 3.10.0-1153.el7.x86_64), but it appears that the 'measure' command is not supported on the qemu-img tool in RHEL7 (version 1.5.3). What version is being used to reproduce the problem? I ran 'qemu-img convert -O qcow2 ...' in lieu of measure as it appears to do a lot of seeks (though around ~20k, not nearly as many as the 'measure' test). That requires a few minutes, but it's not clear to me if that's analogous to the original test because it's not a read-only operation.

Comment 18 Kevin Wolf 2020-06-29 14:55:57 UTC
This is reported with qemu-kvm-rhev, so you won't find the version in the normal RHEL repositories, but only in layered products like RHV. Maybe the easiest way to get the package is from Brew: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1165508 (the link should be for the exact build Nijin reported). I just checked the code, this one does indeed support measure.

I agree that convert is probably a bad comparison because it does lots of actual I/O, though it is an interesting observation that it seems to use fewer seeks. I guess I should have a closer look to check why this is.

Comment 19 Brian Foster 2020-06-29 15:36:10 UTC
(In reply to Kevin Wolf from comment #18)
> This is reported with qemu-kvm-rhev, so you won't find the version in the
> normal RHEL repositories, but only in layered products like RHV. Maybe the
> easiest way to get the package is from Brew:
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1165508 (the
> link should be for the exact build Nijin reported). I just checked the code,
> this one does indeed support measure.
> 

Thanks. I can install/run this version...

> I agree that convert is probably a bad comparison because it does lots of
> actual I/O, though it is an interesting observation that it seems to use
> fewer seeks. I guess I should have a closer look to check why this is.

Ok, I figured. In any event, the rhev variant of qemu-img still runs in < 1s (though actually slightly faster than upstream qemu-img on Fedora) on the same 160k extent/160k hole test file. strace still shows ~480k lseek calls.

I think the next step is to try with the original image file and filesystem where this is reproduced. Since the measure command doesn't seem to look at file data, we should be able to accomplish this with a metadump of the fs that contains the associated file image. Can somebody create a metadump image of this fs (xfs_metadump -go <dev> <output>), compress it and attach or upload it to somewhere otherwise accessible? We'll also need the pathname of the target image file in the fs.

Note that metadump captures only metadata and ignores file data, but the '-o' option disables obfuscation so that directory and filenames pass through unmodified. If you didn't want to send a metadump of the original fs for whatever reason, you could also create a temporary fs for the purpose of reconstructing a vdisk image that similarly reproduces the problem and create a metadump of that.

Comment 20 nijin ashok 2020-06-30 03:52:01 UTC
(In reply to Kevin Wolf from comment #13)

> But maybe we can get another strace of the fragmented case that actually
> contains the syscall timing information? (Oh, and sub-second precision
> timestamps would have been nice, too.)
> 

I executed strace with syscall timing and I can see that SEEK_HOLE is taking much more time compared to SEEK_DATA.

===
150630 23:15:03 lseek(11, 2457272320, SEEK_DATA) = 2457272320 <0.000021>
150630 23:15:03 lseek(11, 2457272320, SEEK_HOLE) = 14973468672 <0.071897>
150630 23:15:03 lseek(11, 2429812736, SEEK_DATA) = 2429812736 <0.000020>
150630 23:15:03 lseek(11, 2429812736, SEEK_HOLE) = 14973468672 <0.072158>
150630 23:15:04 lseek(11, 2456616960, SEEK_DATA) = 2456616960 <0.000019>
150630 23:15:04 lseek(11, 2456616960, SEEK_HOLE) = 14973468672 <0.071904>
150630 23:15:04 lseek(11, 2430140416, SEEK_DATA) = 2430140416 <0.000020>
150630 23:15:04 lseek(11, 2430140416, SEEK_HOLE) = 14973468672 <0.072293>
150630 23:15:04 lseek(11, 2454061056, SEEK_DATA) = 2454061056 <0.000020>
150630 23:15:04 lseek(11, 2454061056, SEEK_HOLE) = 14973468672 <0.071898>
===

Will attach the strace output to the bug.


(In reply to Kevin Wolf from comment #16)
> Of course, for raw images we'll still issue SEEK_DATA/HOLE a lot because we
> still depend on the filesystem information there, but the pattern won't look
> as redundant then. Maybe Nijin can test how raw images behave in his case?
> For a more theoretical case, it should be easy enough to construct an
> artificial raw image that is very fragmented and where data blocks and
> sparse blocks alternate so you get a lots of lseek() calls that can't be
> optimised away by QEMU.

In my test, "measure" on the RAW image completes in well under a second even though it's fragmented.

===
filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723: 99891 extents found

time qemu-img measure -O raw  /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723
required size: 42949672960
fully allocated size: 42949672960

real	0m0.087s
user	0m0.079s
sys	0m0.009s
===

And I can't see any lseek calls in my test. Also, the "required size" in the output seems to be wrong, since it's a 19 GB image file?

===
du -sch /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723
19G	/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723
19G	total
===


To reproduce this issue, I have to first create a RAW thin-provisioned image, then create a qcow2 snapshot on top of it, and then write data within the VM to make the image fragmented. Sorry, this was not clear in my initial bug description.


(In reply to Brian Foster from comment #19)
> I think the next step is to try with the original image file and filesystem
> where this is reproduced. Since the measure command doesn't seem to look at
> file data, we should be able to accomplish this with a metadump of the fs
> that contains the associated file image. Can somebody create a metadump
> image of this fs (xfs_metadump -go <dev> <output>), compress it and attach
> or upload it to somewhere otherwise accessible? We'll also need the pathname
> of the target image file in the fs.
> 

The image files are below. The e76077fe is a qcow2 image pointing to RAW image 8b2ace38.

===
qemu-img info /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c -U
image: /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 14G
cluster_size: 65536
backing file: 8b2ace38-c59f-4bff-aeb9-b7a7f392f723 (actual path: /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723)
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-aeb9-b7a7f392f723: 99891 extents found

filefrag /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c
/rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c: 386006 extents found
====

I will attach the metadump here.

Comment 23 Eric Sandeen 2020-06-30 04:37:15 UTC
Can you explain exactly how to reproduce it using those 2 files?  Running  qemu-img measure -O qcow2 <filename> -U against either e76077fe or 8b2ace38 completes very quickly here.

Maybe I'm doing something wrong, qemu-img info doesn't seem to think it has a backing file:

# qemu-img info ./vgpu_storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c
image: ./vgpu_storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c
file format: raw
virtual size: 14G (14973468672 bytes)
disk size: 14G
#

Comment 24 Kevin Wolf 2020-06-30 09:56:24 UTC
(In reply to nijin ashok from comment #20)
> time qemu-img measure -O raw 
> /rhev/data-center/mnt/_home_vgpu__storage/71dafb20-d15f-46ca-b8ea-
> 361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/8b2ace38-c59f-4bff-
> aeb9-b7a7f392f723
> [...]
> And I can't see any lseek calls in my test.

Sorry, my question was unclear. What I meant was using a raw input file, but still measuring with -O qcow2 (i.e. what size the image would become if converted to qcow2) because only that will do the lseek calls.

> And "required size" in the output seems to be wrong as it's a 19 GB image file?

It prints the required file size, not the space that will actually be allocated, so you get the 40G that is also reported as the virtual size. This is why we don't have to search for holes when using raw output.

(In reply to Eric Sandeen from comment #23)
> Can you explain exactly how to reproduce it using those 2 files?  Running 
> qemu-img measure -O qcow2 <filename> -U against either e76077fe or 8b2ace38
> completes very quickly here.
> 
> Maybe I'm doing something wrong, qemu-img info doesn't seem to think it has
> a backing file

I assume you took the files from the metadump? With a file that contains only zeros, qemu-img obviously can't see a qcow2 image any more. 

So what I tried now is doing something similar as qemu-img measure would do with the original data present, just with a stupid small C program:

#define _GNU_SOURCE  /* for SEEK_DATA/SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd, i;

    fd = open("/mnt/vgpu_storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* one SEEK_DATA/SEEK_HOLE pair per 64k offset, as qemu-img measure does */
    for (i = 0; i < 1024; i++) {
        lseek(fd, i * 0x10000, SEEK_DATA);
        lseek(fd, i * 0x10000, SEEK_HOLE);
    }
    return 0;
}

Running this does take quite a while, and it's only 1024 pairs of lseek calls:

$ time ./a.out 

real  0m56,274s
user  0m0,003s
sys 0m56,118s

Starting the lseeks from offsets much closer to the end of the file makes it run a lot faster. I guess this means that it goes through every single extent to check its allocation status and with 386005 extents that just does take some time.

If this is the case, I guess fragmentation does matter after all, even on SSDs.

Comment 25 Brian Foster 2020-06-30 12:20:07 UTC
(In reply to Kevin Wolf from comment #24)
...
> (In reply to Eric Sandeen from comment #23)
> > Can you explain exactly how to reproduce it using those 2 files?  Running 
> > qemu-img measure -O qcow2 <filename> -U against either e76077fe or 8b2ace38
> > completes very quickly here.
> > 
> > Maybe I'm doing something wrong, qemu-img info doesn't seem to think it has
> > a backing file
> 
> I assume you took the files from the metadump? With a file that contains
> only zeros, qemu-img obviously can't see a qcow2 image any more. 
> 

Ok, it wasn't clear that the underlying file was particularly formatted (i.e. qcow2). I thought we were just operating on an underlying raw image and the qemu tool was calculating how to convert to qcow2 format. Instead, it sounds like there's a qcow2 snapshot linked to an underlying raw image and the tool is run on the snapshot. Therefore the content of the snapshot file is relevant because it contains qcow2 metadata. Am I following that correctly?

> So what I tried now is doing something similar as qemu-img measure would do
> with the original data present, just with a stupid small C program:
> 
...
> Running this does take quite a while, and it's only 1024 pairs of lseek
> calls:
> 
> $ time ./a.out 
> 
> real  0m56,274s
> user  0m0,003s
> sys 0m56,118s
> 
> Starting the lseeks from offsets much closer to the end of the file makes it
> run a lot faster. I guess this means that it goes through every single
> extent to check its allocation status and with 386005 extents that just does
> take some time.
> 
> If this is the case, I guess fragmentation does matter after all, even on
> SSDs.

ISTM that comment #20 suggests that extra time is being consumed in SEEK_HOLE requests because we have to iterate extents from the specified offset until an extent is found that is not logically contiguous with the previous (i.e. a hole). E.g., if I modify your program to skip SEEK_HOLE requests, it returns nearly instantly. If I do the opposite and only execute SEEK_HOLE, it takes ~2m.

Alternatively, if I just use xfs_io to seek through the entire file:

# time xfs_io -c "seek -a -r 0" /mnt/vgpu_storage/71dafb20-d15f-46ca-b8ea-361f6eadbbcd/images/ee36addb-a945-4f4a-8d17-d16653338be4/e76077fe-825b-47a6-bf26-5830210c130c
Whence  Result
DATA    0
HOLE    200704
DATA    262144
HOLE    14973468672

real    0m0.123s
user    0m0.000s
sys     0m0.123s

It completes much faster because it uses the data/hole offsets to efficiently map the file. IOW, once SEEK_HOLE passes over a large number of logically contiguous extents, there's no reason to search over it again because we can seek further data from the offset of the hole. The fact that this file has ~386k extents and only a couple holes probably explains why pushing the start offset toward the end of the file tends to speed up the test.
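
The efficient mapping pattern that xfs_io uses can be sketched roughly like this (a sketch, not qemu-img's actual code; os.SEEK_DATA/os.SEEK_HOLE are Linux-only): each SEEK_HOLE result becomes the starting offset of the next SEEK_DATA, so every extent range is visited once and the whole file is mapped in a single forward pass instead of rescanning the tail for every query:

```python
import errno
import os

def map_segments(path):
    """Map a file's data regions in one forward pass, like
    `xfs_io -c "seek -a -r 0"`: resume each SEEK_DATA from the
    previous SEEK_HOLE result so no range is scanned twice."""
    segments = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        off = 0
        while off < size:
            try:
                data = os.lseek(fd, off, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # only a trailing hole remains
                    break
                raise
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            segments.append((data, hole))   # [data, hole) holds data
            off = hole                      # resume after this region
    finally:
        os.close(fd)
    return segments
```

Run against a sparse file, this returns the same DATA/HOLE boundaries that xfs_io prints above.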

Comment 26 Eric Sandeen 2020-06-30 13:19:45 UTC
> I assume you took the files from the metadump? With a file that contains only zeros, qemu-img obviously can't see a qcow2 image any more. 

Oh of course, sorry for the thinko.

Comment 27 Brian Foster 2020-06-30 14:01:38 UTC
Eric raised the point offline about how fragmentation mitigation might improve the situation. Just for reference, it looks like over 260k of the extents on the original image file are single block extents. If I recreate a 14G image made up of ~14k 1MB extents (to simulate a 1MB extent size hint), the original seek.c test program completes in 3.5s (instead of ~2m on the original). If I repeat the same test with ~229k 64k extents, it's slower at around 1m.

Comment 28 Yanhui Ma 2020-07-01 06:14:51 UTC
Hi Nijin,

I tried to reproduce the issue with following steps, but didn't reproduce it. Could you please help check them?

Package version:
qemu-kvm-rhev-2.12.0-44.el7_8.2.x86_64
kernel-3.10.0-1127.8.2.el7.x86_64

1. on rhel7.8 host, run qemu-img create -f raw /home/kvm_autotest_root/images/rhel78-64-virtio.raw 40G

2. Install a RHEL7.8 guest

3. After installation 
# qemu-img info rhel78-64-virtio.raw
 image: rhel78-64-virtio.raw
 file format: raw
 virtual size: 40G (42949672960 bytes)
 disk size: 4.5G

# filefrag rhel78-64-virtio.raw 
 rhel78-64-virtio.raw: 86468 extents found

# time qemu-img measure -O qcow2 rhel78-64-virtio.raw -U
 required size: 4799201280
 fully allocated size: 42956488704
 real 0m0.038s
 user 0m0.012s
 sys 0m0.026s 

4. on the guest, run dd several times
#  dd if=/dev/urandom bs=1M of=test count=30000
30000+0 records in
30000+0 records out
31457280000 bytes (31 GB) copied, 186.694 s, 168 MB/s

5. 
# filefrag rhel78-64-virtio.raw 
rhel78-64-virtio.raw: 136348 extents found

# qemu-img info rhel78-64-virtio.raw 
image: rhel78-64-virtio.raw
file format: raw
virtual size: 40G (42949672960 bytes)
disk size: 34G
 
# time qemu-img measure -O qcow2 rhel78-64-virtio.raw -U
required size: 36100702208
fully allocated size: 42956488704

real	0m0.049s
user	0m0.012s
sys	0m0.038s


6. qemu-img create -f qcow2  sn1.qcow2  -b rhel78-64-virtio.raw  -F raw

7. on the snapshot sn1.qcow2
# dd if=/dev/urandom bs=1M of=test count=30000
30000+0 records in
30000+0 records out
31457280000 bytes (31 GB) copied, 233.001 s, 135 MB/s

8. # filefrag sn1.qcow2 
sn1.qcow2: 35869 extents found

# time qemu-img measure -O qcow2 sn1.qcow2 -U
required size: 36100702208
fully allocated size: 42956488704

real	1m47.165s
user	0m0.045s
sys	1m46.764s

It takes more time than base image rhel78-64-virtio.raw, but doesn't take 15 mins.

Comment 29 Dave Chinner 2020-07-02 04:20:52 UTC
(In reply to nijin ashok from comment #20)
> (In reply to Kevin Wolf from comment #13)
> 
> > But maybe we can get another strace of the fragmented case that actually
> > contains the syscall timing information? (Oh, and sub-second precision
> > timestamps would have been nice, too.)
> > 
> 
> I executed strace with syscall timing and I can see that SEEK_HOLE is taking
> much more time compared to SEEK_DATA.
> 
> ===
> 150630 23:15:03 lseek(11, 2457272320, SEEK_DATA) = 2457272320 <0.000021>
> 150630 23:15:03 lseek(11, 2457272320, SEEK_HOLE) = 14973468672 <0.071897>
> 150630 23:15:03 lseek(11, 2429812736, SEEK_DATA) = 2429812736 <0.000020>
> 150630 23:15:03 lseek(11, 2429812736, SEEK_HOLE) = 14973468672 <0.072158>
> 150630 23:15:04 lseek(11, 2456616960, SEEK_DATA) = 2456616960 <0.000019>
> 150630 23:15:04 lseek(11, 2456616960, SEEK_HOLE) = 14973468672 <0.071904>
> 150630 23:15:04 lseek(11, 2430140416, SEEK_DATA) = 2430140416 <0.000020>
> 150630 23:15:04 lseek(11, 2430140416, SEEK_HOLE) = 14973468672 <0.072293>
> 150630 23:15:04 lseek(11, 2454061056, SEEK_DATA) = 2454061056 <0.000020>
> 150630 23:15:04 lseek(11, 2454061056, SEEK_HOLE) = 14973468672 <0.071898>
> ===

Ok, so what is this seek pattern trying to achieve?

The SEEK_DATA is telling you that there is data at this exact offset, and the next SEEK_HOLE is telling you that there is data all the way from the current offset to the end of the file. Why are you going to another slightly higher offset where you already know there is data and repeating the pattern? I don't understand what this tells you that the previous pair of seeks hasn't already told you about the data layout in the file, so this looks more like an application issue than a filesystem problem from this level...

-Dave.

Comment 30 Kevin Wolf 2020-07-03 14:24:15 UTC
Nijin, just to confirm: The customer is currently happy with the workaround of enabling the "writeback" cache mode? Did we also suggest the option of setting an extent size hint on the directories containing disk images so that this option can be disabled again?

(In reply to Dave Chinner from comment #29)
> I don't understand what this tells you that the previous pair
> of seeks hasn't already told you about the data layout in the file, so this
> looks more like an application issue than a filesystem problem from this
> level...

As stated in previous comments, yes, this sequence is silly, does not happen in current RHEL 8 versions and I know what to backport to stop it.

But it made us look at the time that SEEK_HOLE takes, which is (even without the silly seek pattern) still something that seems to potentially grow linearly with the number of extents. So assuming that XFS can't do this with less than linear complexity, I think the conclusion is that we need to do something about fragmentation and avoid images with multiple hundred thousand extents - and this something might be, or at least include, setting an extent size hint.

Comment 31 Dave Chinner 2020-07-05 23:51:15 UTC
(In reply to Kevin Wolf from comment #30)
> So assuming that XFS can't do this with
> less than linear complexity,

I think you are assigning blame without actually understanding the required behaviour of the SEEK_HOLE API and the limitations that places on filesytem implementations.

By definition, finding the next "hole" (defined as a consecutive run of zeros) in the file requires walking all the extents from the current offset to the end of the file, because this definition means an unwritten extent is a HOLE. i.e. preallocated, unwritten space is not DATA, it is a HOLE. Hence we have to check every extent for UNWRITTEN state to correctly detect holes in the data, and hence it's an O(n) search algorithm that cannot be otherwise reduced.

This is not unique to XFS. Both Ext4 and gfs2 use *exactly the same code* as XFS to implement SEEK_HOLE, whilst btrfs, ocfs2 and f2fs all use their own internal "walk all higher offset extents/blocks checking unwritten bits until we find the next hole" linear search algorithms. IOWs, every single filesystem implements SEEK_HOLE exactly the same way, so they all have O(n) behaviour.
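
The unwritten-extent behaviour is observable from userspace with a small sketch (sizes arbitrary). Whether preallocated-but-unwritten space shows up as a hole is filesystem-dependent: XFS and ext4 report it as a hole, while filesystems that materialize fallocate as written pages (e.g. tmpfs) report it as data, so this only prints where the first hole lands:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "prealloc.img")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.pwrite(fd, b"x" * 65536, 0)        # written data in [0, 64k)
os.posix_fallocate(fd, 65536, 65536)  # unwritten preallocation in [64k, 128k)

hole = os.lseek(fd, 0, os.SEEK_HOLE)
# XFS/ext4 treat the unwritten extent as a hole, so hole == 64k there;
# filesystems that back fallocate with written pages report data up to
# EOF, so hole == 128k instead.
print("first hole at", hole)
os.close(fd)
```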

> I think the conclusion is that we need to do
> something about fragmentation and avoid images with multiple hundred
> thousand extents - and this something might be, or at least include, setting
> an extent size hint.

That's what we've been telling application developers for the last couple of decades. :(

For an IO intensive application, minimising potential file fragmentation is -good application design practice-. It doesn't matter what filesystem you are targeting, if you don't take steps to mitigate the potential causes of fragmentation then it's only a matter of time before the underlying filesystem will be unable to hide the effects of excessive fragmentation from the application and IO performance will suffer.

-Dave.

Comment 33 Kevin Wolf 2020-07-06 10:08:35 UTC
(In reply to Dave Chinner from comment #31)
> That's what we've been telling application developers for the last couple of
> decades. :(

I don't think any application developer would disagree that avoiding fragmentation is a good goal. But sometimes the tools to achieve things are less than perfect and their existence is not obvious. I see that FS_IOC_FSGETXATTR is now a generic interface (thanks!), but it's not yet available in RHEL 7.

If you want features to be used by applications, it's probably the most important thing that they are accessible with a generic (i.e. filesystem independent) interface that applications can use to solve the problem with a single implementation for every filesystem that supports the feature. Otherwise they remain obscure hacks that applications are hesitant to use - if people even know about them in the first place. Filesystem specific ioctls may be fine for the initial implementation of something new, but it's not sufficient if widespread use is the goal.

As you agreed before that AIO+DIO is the right way to achieve performance, this seems to be another case that should be made easier to use. Could you try to turn XFS_IOC_DIOINFO into a generic interface, too? Without an interface like this, it's really hard to know what you're supposed to do as an application.

Comment 35 nijin ashok 2020-07-14 14:18:35 UTC
(In reply to Kevin Wolf from comment #30)
> Nijin, just to confirm: The customer is currently happy with the workaround
> of enabling the "writeback" cache mode? Did we also suggest the option of
> setting an extent size hint on the directories containing disk images so
> that this option can be disabled again?
> 

The customer is good with the current workaround of enabling "writeback" cache mode. Their major issues are solved after enabling this.

Comment 37 Eric Sandeen 2020-07-16 03:01:48 UTC
Kevin - after conversations w/ Brian & Dave, and thinking more about common interfaces that applications could use rather than xfs-specific ioctls, we wondered -

Any chance you already set fadvise FADV_RANDOM on these files?

Right now FADV_RANDOM seems to be used for read access patterns, but setting it to indicate write access, and setting an extent size hint on sparse files automatically, might be possible.  We could see if ext4 could make use of this as well.
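
Issuing the hint from an application is already a one-liner; today it only influences readahead, and the idea above is that filesystems could additionally interpret it as a random-write hint. A minimal sketch (the helper name is illustrative):

```python
import os

def advise_random(path):
    """Hint that access to `path` will be random; True on success.

    Currently POSIX_FADV_RANDOM only disables readahead; the proposal
    here is that filesystems could also use it to apply an extent size
    hint to sparse files being filled in."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```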

Comment 38 Kevin Wolf 2020-07-16 10:16:58 UTC
No, we don't. We also don't really know for sure whether the I/O will be mostly sequential or random, though I guess the assumption that it will be mostly random in the common case is plausible enough that we would consider it if you tell us that it will help. The current use seems to influence only the page cache readahead, which is not that interesting given that we recommend O_DIRECT anyway. But if you're considering using it for more, why not.

QEMU 5.1 will set an extent size hint for newly created image files (which thankfully already is a filesystem-independent ioctl), defaulting to 1 MB. Of course, this doesn't cover existing images or images that weren't created with QEMU but with, for example, dd, or that were copied with cp. I'm not sure if much can be done about that, though. Maybe cp could be taught to copy the extent size hint, too, but that's probably the maximum that could be done.
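
For reference, a rough sketch of setting a 1 MB extent size hint through the generic fsxattr ioctl pair. The ioctl numbers, flag value, and 28-byte struct layout below are taken from Linux <linux/fs.h>; filesystems that don't support the hint just fail the ioctl, which the helper reports as "unsupported" (this is not QEMU's actual implementation, only an illustration of the interface):

```python
import fcntl
import os
import struct

# From <linux/fs.h>: _IOR('X', 31, struct fsxattr) / _IOW('X', 32, ...)
FS_IOC_FSGETXATTR = 0x801c581f
FS_IOC_FSSETXATTR = 0x401c5820
FS_XFLAG_EXTSIZE = 0x00000800
FSXATTR_FMT = "=IIIII8s"  # xflags, extsize, nextents, projid, cowextsize, pad

def set_extent_size_hint(path, hint_bytes):
    """Try to set an extent size hint; False if the fs doesn't support it."""
    fd = os.open(path, os.O_RDWR)
    try:
        buf = bytearray(struct.calcsize(FSXATTR_FMT))
        fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)  # read current attributes
        xflags, _, nextents, projid, cowext, pad = struct.unpack(FSXATTR_FMT, buf)
        newbuf = struct.pack(FSXATTR_FMT, xflags | FS_XFLAG_EXTSIZE,
                             hint_bytes, nextents, projid, cowext, pad)
        fcntl.ioctl(fd, FS_IOC_FSSETXATTR, newbuf)
        return True
    except OSError:
        return False  # e.g. ENOTTY/EOPNOTSUPP on filesystems without the hint
    finally:
        os.close(fd)
```

On XFS, `xfs_io -c "extsize" <file>` would afterwards report the 1 MB hint; on other filesystems the helper simply returns False.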

Comment 39 Kevin Wolf 2020-07-20 08:17:56 UTC
Nir, on second thoughts, I wonder, does RHV even use qemu-img create/convert to create/copy raw image files? Because if it doesn't, we would obviously not get the extent size hint for them, and the qemu-img fix would only help for qcow2 images. I seem to remember that there are cases where you don't use qemu-img, though I'm not sure if that was just block devices or also regular files. If that is the case for regular files, we would need to clone the bug for RHV to make use of qemu-img create (or to set an extent size hint itself, independent of QEMU).

Comment 40 Nir Soffer 2020-07-20 10:58:21 UTC
(In reply to Kevin Wolf from comment #39)
> Nir, on second thoughts, I wonder, does RHV even use qemu-img create/convert
> to create/copy raw image files?

Yes, our approach is that this is the best tool for handling images, so duplicating
its functionality in RHV is the wrong thing. We do this sometimes as a temporary
solution until we can fix qemu-img itself.

Since 3.6 we always convert images using qemu-img convert.

Since 4.4 we always create images using qemu-img create.

One exception: since 4.4.2 (next release), we don't use -o preallocation
for raw images on file storage, since qemu-img uses posix_fallocate(),
which is inefficient and causes trouble with legacy NFS, so we use our own
fallocate helper. It tries to use fallocate() and, if that is not supported,
falls back to writing zeroes.
https://github.com/oVirt/vdsm/blob/master/helpers/fallocate

This works much better compared to posix_fallocate.
https://github.com/oVirt/vdsm/commit/9533b644336636f42807a78054eae3c03da5fa4a

I think qemu-img should use the same approach.

Comment 41 Kevin Wolf 2020-07-20 13:41:31 UTC
(In reply to Nir Soffer from comment #40)
> Yes, our approach is the this is the best tool for handling images so duplicating
> it's functionallity in RHV is the wrong thing. We do this sometimes as a temporary
> solution until we can fix qemu-img itself.

Great, so making this change in QEMU should indeed be enough.

> One exception, since 4.4.2 (next release), we don't use -o preallocation
> for raw images on file storage

Ok. As long as you still create the image with qemu-img create, this doesn't make a difference for this one.

We should talk about your suggestion for improving file-posix preallocation, but probably not here. Would you like to either create a new BZ or just write a mail to qemu-block about it?

Comment 42 Dave Chinner 2020-07-20 22:56:48 UTC
(In reply to Nir Soffer from comment #40)
> One exception, since 4.4.2 (next release), we don't use -o preallocation
> for raw images on file storage, since qemu-img is using posix_fallocate()
> which is inefficient and cause trouble with legacy NFS, so we use our own
> fallocate helper. It tries to use fallocate() and if it is not supported fall
> back to writing zeroes.
> https://github.com/oVirt/vdsm/blob/master/helpers/fallocate

/me goes and looks....

> This works much better compared to posix_fallocate.
> https://github.com/oVirt/vdsm/commit/9533b644336636f42807a78054eae3c03da5fa4a

Ahhhhh, you do understand that posix_fallocate() != overwrite_everything_with_zeroes()?

The speedup over glibc's posix_fallocate() emulation is no surprise because write_zeroes() in the "fallocate helper" is not emulating posix_fallocate(). It's emulating fallocate(FALLOC_FL_ZERO_RANGE) instead.  posix_fallocate() does not -destroy existing data in the file-, while write_zeroes() destroys all the pre-existing data in the file.

i.e. posix_fallocate() emulation needs to read the data first to see if it can write zeroes (i.e. it can only write zeroes if the read is all zeroes), and hence it's a read-latency-bound operation, not a streaming write operation.

So, yeah, that fallocate() preallocation helper is actually dangerous: anyone who thinks "this is a fast posix_fallocate() method" is going to *lose their data*, because the emulation code is actually emulating fallocate(FALLOC_FL_ZERO_RANGE), not fallocate(0) like the name and the "native_fallocate()" and posix_fallocate() functions imply.

I'd strongly suggest reverting that change because it is clearly implementing incorrect behaviour.
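
The semantic difference is easy to demonstrate from userspace (paths and sizes arbitrary): posix_fallocate() must leave pre-existing bytes in the range intact, while a zero-writing "emulation" clobbers them:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "img")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.pwrite(fd, b"precious", 0)

# posix_fallocate reserves space but preserves existing data
os.posix_fallocate(fd, 0, 1 << 20)
kept = os.pread(fd, 8, 0)

# a zero-writing "fallocate emulation" overwrites the data instead
for off in range(0, 1 << 20, 65536):
    os.pwrite(fd, b"\0" * 65536, off)
lost = os.pread(fd, 8, 0)

print(kept, lost)
os.close(fd)
```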

Just the premise of "we can do emulation faster than glibc" is a big red flag, because if there was a faster mechanism for emulation, glibc would be using it. Indeed, the glibc code special-cases NFS because... well, it just looks like a poor implementation:

https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html

.....
 /* Minimize data transfer for network file systems, by issuing
     single-byte write requests spaced by the file system block size.
     (Most local file systems have fallocate support, so this fallback
     code is not used there.)  */
  unsigned increment;
  {
    struct statfs64 f;
    if (__fstatfs64 (fd, &f) != 0)
      return errno;
    if (f.f_bsize == 0)
      increment = 512;
    else if (f.f_bsize < 4096)
      increment = f.f_bsize;
    else
      /* NFS does not propagate the block size of the underlying
         storage and may report a much larger value which would still
         leave holes after the loop below, so we cap the increment at
         4096.  */
      increment = 4096;
  }
.....

i.e. they crippled the IO sizes for NFS to 4kB (they could be up to 1MB) because the read/check/write() loop doesn't handle the final partial EOF region correctly. i.e. the glibc emulation code needs fixing for large IO sizes, and then a large part of the emulation overhead will go away....

-Dave.

Comment 43 Kevin Wolf 2020-07-21 09:39:25 UTC
I think I have to defend the oVirt guys here. While you're right that this is not a full replacement for posix_fallocate(), nobody suggests changing glibc to do the same, and Nir clearly mentioned the script together with its very specific context: preallocation of a new raw image. I might have called the script "preallocate-raw" rather than "fallocate", but honestly, that's a detail that doesn't invalidate the approach.

The code in 'qemu-img create' calls posix_fallocate() for ranges after EOF, so glibc will be clever enough not to issue those reads - because they are not needed in our case.

The big difference is just that glibc uses tons of one-byte pwrite() calls, one for each block, while the oVirt code uses large buffers. As the glibc comment explains, this is a tradeoff meant to reduce network traffic. If oVirt comes to the conclusion that a different tradeoff (fewer requests, but more data transferred) is better for the typical customer case, there is no reason not to implement it this way.

(In reply to Dave Chinner from comment #42)
> i.e. they crippled the IO sizes for NFS to 4kB (could be up to 1MB) because
> the read/check/write() loop doesn't handle the final partial EOF region
> correctly. i.e. the glibc emulation code needs fixing for large IO sizes,
> and a large part of the emulation overhead will go away....

The comment in the glibc code says otherwise, but if you think the comment is wrong, feel free to discuss it with the glibc people.

Comment 44 Nir Soffer 2020-07-21 10:02:07 UTC
Dave, thanks for looking in oVirt code!

As Kevin said, our helper is not a generic replacement for posix_fallocate()
or /usr/bin/fallocate. It lives in /usr/libexec/vdsm/fallocate and is used in
two cases:
- creating new raw preallocated images
- extending raw preallocated images

In both cases we write after EOF, so there is no data to destroy.

The write_zeroes() fallback is used only when fallocate() is not
available, for example on NFS < 4.2.

Can you suggest a better way to allocate space when fallocate() is not
available that works with any NFS server you don't control?

Comment 45 Yanhui Ma 2020-07-23 10:25:47 UTC
Hi Kevin, 

I tried to reproduce the bug with the steps in commit ffa244c84a, but still can't reproduce it. Could you please take a look?

qemu-kvm-5.1.0-0.scrmod+el8.3.0+7384+2e5aeafb.wrb200716.x86_64
kernel-4.18.0-193.8.1.el8_2.x86_64

[root@ibm-x3850x6-02 no_extent]# qemu-img create -f raw -o extent_size_hint=0 test.raw 10G
Formatting 'test.raw', fmt=raw size=10737418240 extent_size_hint=0

[root@ibm-x3850x6-02 no_extent]# filefrag test.raw 
test.raw: 1 extent found

[root@ibm-x3850x6-02 no_extent]# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 431.390 seconds.

[root@ibm-x3850x6-02 no_extent]# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 323.155 seconds.

[root@ibm-x3850x6-02 no_extent]# filefrag test.raw
test.raw: 2000000 extents found

[root@ibm-x3850x6-02 no_extent]# time qemu-img map test.raw
Offset          Length          Mapped to       File
0               0x1e8480000     0               test.raw

real	0m0.366s
user	0m0.005s
sys	0m0.350s

Comment 46 Kevin Wolf 2020-07-23 12:05:31 UTC
(In reply to Yanhui Ma from comment #45)
> [root@ibm-x3850x6-02 no_extent]# filefrag test.raw
> test.raw: 2000000 extents found

I would say that you have in fact reproduced the problem. The fragmentation is what we're addressing.

> [root@ibm-x3850x6-02 no_extent]# time qemu-img map test.raw
> Offset          Length          Mapped to       File
> 0               0x1e8480000     0               test.raw
> 
> real	0m0.366s
> user	0m0.005s
> sys	0m0.350s

This is faster than what I described in the commit message, but still slower than what I got after the fix. So I think this is just differences between the test environments.

Comment 47 Xueqiang Wei 2020-07-23 18:31:50 UTC
Yanhui,

I think I reproduced the bug with the steps in commit ffa244c84a.


Versions:
kernel-4.18.0-224.el8.x86_64
qemu-kvm-5.1.0-0.scrmod+el8.3.0+7384+2e5aeafb.wrb200716


Without an extent size hint:
# qemu-img create -f raw -o extent_size_hint=0 test.raw 10G
Formatting 'test.raw', fmt=raw size=10737418240 extent_size_hint=0

# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 448.901 seconds.

# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 388.021 seconds.

# filefrag test.raw
test.raw: 2000000 extents found

# time qemu-img map test.raw
Offset          Length          Mapped to       File
0               0x1e8480000     0               test.raw

real    0m0.414s
user    0m0.009s
sys     0m0.397s


With the new default extent size hint of 1 MB:
# qemu-img create -f raw -o extent_size_hint=1M test.raw 10G
Formatting 'test.raw', fmt=raw size=10737418240 extent_size_hint=1048576

# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 489.152 seconds.

# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 385.999 seconds.

# filefrag test.raw
test.raw: 6511 extents found

# time qemu-img map test.raw
Offset          Length          Mapped to       File
0               0x1e8480000     0               test.raw

real    0m0.029s
user    0m0.011s
sys     0m0.011s

Comment 48 Yanhui Ma 2020-07-24 03:48:41 UTC
(In reply to Kevin Wolf from comment #46)
> (In reply to Yanhui Ma from comment #45)
> > [root@ibm-x3850x6-02 no_extent]# filefrag test.raw
> > test.raw: 2000000 extents found
> 
> I would say that you have in fact reproduced the problem. The fragmentation
> is what we're addressing.
> 
> > [root@ibm-x3850x6-02 no_extent]# time qemu-img map test.raw
> > Offset          Length          Mapped to       File
> > 0               0x1e8480000     0               test.raw
> > 
> > real	0m0.366s
> > user	0m0.005s
> > sys	0m0.350s
> 
> This is faster than what I described in the commit message, but still slower
> than what I got after the fix. So I think this is just differences between
> the test environments.

Here are test results with extent_size_hint=1M:

[root@ibm-x3850x6-02 home]# qemu-img create -f raw -o extent_size_hint=1M test1.raw 10G
Formatting 'test1.raw', fmt=raw size=10737418240 extent_size_hint=1048576
[root@ibm-x3850x6-02 home]# qemu-img bench -f raw -t none -n -w test1.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 459.346 seconds.
[root@ibm-x3850x6-02 home]# qemu-img bench -f raw -t none -n -w test1.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 370.200 seconds.
[root@ibm-x3850x6-02 home]# filefrag test1.raw 
test1.raw: 7811 extents found
[root@ibm-x3850x6-02 home]# 
[root@ibm-x3850x6-02 home]# time qemu-img map test1.raw 
Offset          Length          Mapped to       File
0               0x1e8480000     0               test1.raw

real	0m0.025s
user	0m0.005s
sys	0m0.003s

Yes, 0.025s is indeed much faster than 0.366s.

Hi Kevin, because this is a customer bug and we don't want to miss it, the test case needs to be added to our test plan. We will use your reproduction steps as the test case steps. But now I have some questions:

1) From my test results, the qemu-img map time difference is not as obvious as in yours. To make it easy to identify whether the bug is present, we will check the number of fragments and use it as the checkpoint. Do you think that is OK?

2) If we take the fragment count as the checkpoint, we may need a range of values to judge whether the bug is present later. My test result is 7811 extents; do you have any suggestions about the acceptable range?

3) For the parameter extent_size_hint=1M, do we need to specify it explicitly when creating an image with QEMU 5.1 and later, or is it the default?

4) How can we check the value of extent_size_hint after setting it?

Thanks,
Yanhui

Comment 49 Kevin Wolf 2020-07-24 08:14:10 UTC
(In reply to Yanhui Ma from comment #48)
> 1) For checkpoints, from my test results, the time difference of qemu-img is
> not as obvious as yours. To easily identify whether there is a bug, we will
> check the fragments and take it as checkpoint. Do you think it is ok?

Yes, the number of extents is the primary thing to check.
 
> 2) If we take fragments as checkpoint, we may need a range of fragments to
> judge whether it is a bug later. My test results is 7811 extents, about the
> range of value, do you have any suggestions?

With 1 MiB extents, the theoretical maximum for a 10 GiB image is 10240 extents (10240 * 1 MiB = 10 GiB).

> 3) For the parameter extent_size_hint=1M, do we need to explicitly specify
> it when creating a image after qemu5.1 or it is default?

1 MB is the default in QEMU 5.1. Disabling it will require an explicit extent_size_hint=0 then.

> 4) How could we check the value of extent_size_hint after setting it?

On XFS, you can use something like 'xfs_io -c extsize test.img'. We may later add this information to qemu-img info, though it's too late for upstream 5.1. If you need this in 8.3.0, I'd suggest opening an RFE BZ so we can backport it after the rebase.

Comment 50 Dave Chinner 2020-07-27 22:08:45 UTC
(In reply to Nir Soffer from comment #44)
> Dave, thanks for looking in oVirt code!
> 
> As Kevin said, our helper is not a generic replacement for posix_falloate()
> or /usr/bin/fallocate. It lives in /usr/libexec/vdsm/fallocate and used in
> 2 cases:
> - create new raw preallocated images
> - extending raw preallocated images
> 
> In both cases you we write after EOF so there is no data to destroy.

Then please don't call it "fallocate", because that has a specific behavioural meaning to a large number of developers and users, both for the OS-provided CLI utility and for the kernel-provided syscall. Naming matters, and overloading the same name with incompatible behaviours is very user-unfriendly....

> The write_zeroes() fallback is used only when the fallocate() is not 
> available, for example NFS < 4.2.
> 
> Can suggest a better way to allocate space when fallocate() is not
> available, that works with any NFS server that you don't control?

If you can't measure the fragmentation on a remote NFS server via the local client or application, then you can't actually tell if fragmentation is a problem. Even writing zeroes can -cause- fragmentation: if the NFS server is doing inline de-dupe and so doesn't actually write gigabytes of zeroes to disk, it will fragment when you start writing real data, just as if it were a sparse file. Further, if the NFS server is using a write-anywhere style filesystem (WAFL, ZFS, btrfs, etc.), then writing zeroes will do nothing to prevent fragmentation, as the copy-on-overwrite mechanisms will fragment the files.

IOWs, if you don't control the NFS server, the best advice is "don't try to do something smart", because there are a good number of situations where the "smart" thing to do for one server is exactly the wrong thing to do for another server. Leave it to the sysadmin to optimise access methods to their NFS storage if the client/server does not have native fallocate() support.

There is no "one size fits all" magic solution to this problem....

-Dave.

Comment 53 Yanhui Ma 2020-08-17 03:27:31 UTC
Verified the bug on qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901.x86_64 with the following steps:

# qemu-img create -f raw test.raw 10G
Formatting 'test.raw', fmt=raw size=10737418240

# filefrag test.raw 
test.raw: 1 extent found

# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 132.517 seconds.

# qemu-img bench -f raw -t none -n -w test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 131.507 seconds.

# filefrag test.raw
test.raw: 778 extents found

# time qemu-img map test.raw
Offset          Length          Mapped to       File
0               0x1e8480000     0               test.raw

real	0m0.018s
user	0m0.002s
sys	0m0.006s

Based on the above results, setting the bug to VERIFIED.

Comment 58 errata-xmlrpc 2020-11-17 17:49:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

