Bug 1851097
| Summary: | qemu-img create and qemu-img convert with -o preallocation=falloc can cause severe impact on NFS <v4.2 file systems | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Gordon Watson <gwatson> |
| Component: | kernel | Assignee: | nfs-maint |
| kernel sub component: | NFS | QA Contact: | Filesystem QE <fs-qe> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | bfields, coli, dwysocha, jinzhao, juzhang, kwolf, mkalinin, nsoffer, xuwei, xzhou, yoyang |
| Version: | 7.8 | ||
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-10-07 17:52:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Gordon Watson
2020-06-25 15:30:33 UTC
Xueqiang,

Well, yes and no. With NFS v4.2, fallocate() is supported and should execute successfully; hence it's significantly faster. The issue that I'm reporting, though, is that with NFS v3, 'qemu-img create -f raw -o preallocation=falloc' is much slower than 'fallocate --posix'. In my testing, it's almost twice as slow. In the customer's RHV environment, it causes severe problems on the SPM host, e.g. I/O timeouts, storage domains going offline, etc. Their NFS server is a Netapp. We have looked at, and are still looking at, that being a factor for the customer.

However, just in my test environment, and as you showed in comment #7, 'qemu-img create -f raw -o preallocation=falloc' is much slower than 'fallocate --posix'. Both try to execute fallocate(), which fails on NFS v3. Then pwrite() system calls are issued to write a single zero byte to each block. However, this is the same for both commands. The only difference I see is that 'qemu-img create' uses OFD locking. It sets locks before the pwrites and unlocks them after.

In my tests here, when I run 'qemu-img create -f raw -o preallocation=falloc', it impacts other I/O to that storage domain/NFS server. In the customer's environment, when they run 'qemu-img create -f raw -o preallocation=falloc', it causes VDSM to report read timeouts and storage domains going offline, just the same as when operations that use 'qemu-img create/convert' with '-o preallocation=falloc' are executed by VDSM. So, they can replicate the problem outside of RHV/VDSM in a way that still causes RHV/VDSM problems.

I hope this makes sense and answers your question. If not, please let me know.

Regards, GFW.

Let's test whether turning the locking off makes a difference. Unfortunately, this can't be done during image creation, but only when opening an image.
For debugging, we can work around it, though, by creating an empty image and then resizing it:

$ qemu-img create -f raw /tmp/test.raw 0
Formatting '/tmp/test.raw', fmt=raw size=0
$ qemu-img resize --preallocation=falloc --image-opts driver=file,filename=/tmp/test.raw,locking=off 1G
Image resized.

For comparison, you can also try locking=on first to verify that it causes the same problems as 'qemu-img create'.

Kevin,
Thanks for the suggestion. I ran some specific tests (see below). Essentially, this time using 'locking=on' and 'locking=off' resulted in the 'qemu-img resize' taking the same amount of elapsed time, but the impact inside the guest was significant. With 'locking=on', the 'ls -R' took as long as the 'resize' (about 55 seconds), and with 'locking=off', I was able to run it twice within the same period of time (24 seconds each).
I ran the first test with a 5GB file, as that is what I had been using before. When I then ran it under 'strace', I used a 1GB file.
Regards, GFW.
- Basic test with and without 'locking';
# qemu-img create -f raw gfwfile 0
Formatting 'gfwfile', fmt=raw size=0
# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 5G
Image resized.
real 0m55.230s
user 0m0.753s
sys 0m3.787s
# qemu-img create -f raw gfwfile2 0
Formatting 'gfwfile2', fmt=raw size=0
# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 5G
Image resized.
real 0m54.471s
user 0m1.336s
sys 0m2.897s
- Repeated this under 'strace' (with 1GB file);
# time strace -fttTv -o /tmp/lockingon.str qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 1G
Image resized.
real 0m15.502s
user 0m3.425s
sys 0m10.932s
# time strace -fttTv -o /tmp/lockingoff.str qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 1G
Image resized.
real 0m14.072s
user 0m3.392s
sys 0m10.596s
# grep fallocate /tmp/lockingon.str
20592 10:59:19.170731 fallocate(11, 0, 0, 1073741824) = -1 EOPNOTSUPP (Operation not supported) <0.000012>
# grep fallocate /tmp/lockingoff.str
20855 10:59:56.485677 fallocate(11, 0, 0, 1073741824) = -1 EOPNOTSUPP (Operation not supported) <0.000009>
# grep pwrite /tmp/lockingon.str |wc -l
262144
# grep pwrite /tmp/lockingoff.str |wc -l
262144
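The counts line up: a 1 GiB preallocation at one single-byte pwrite() per 4 KiB block is 1073741824 / 4096 = 262144 writes, exactly what both traces show. The fallback behaviour can be sketched as follows (an illustration, not qemu's actual code; os.posix_fallocate() stands in for the raw fallocate(2) call):

```python
import errno
import os
import tempfile

def preallocate(fd, size, block=4096):
    """Try fallocate-style preallocation; on EOPNOTSUPP (the NFS v3 case
    in the strace output above) fall back to writing one zero byte per
    block, as qemu-img does. Returns the number of fallback pwrites."""
    try:
        # os.posix_fallocate() stands in for qemu's raw fallocate(2) call.
        os.posix_fallocate(fd, 0, size)
        return 0
    except OSError as e:
        if e.errno != errno.EOPNOTSUPP:
            raise
        os.truncate(fd, size)
        writes = 0
        for off in range(0, size, block):
            os.pwrite(fd, b"\0", off)   # one zero byte per 4 KiB block
            writes += 1
        return writes

fd, path = tempfile.mkstemp()
try:
    preallocate(fd, 1 << 20)            # 1 MiB for the example
    print(os.fstat(fd).st_size)         # 1048576
finally:
    os.close(fd)
    os.unlink(path)

# For the traced 1 GiB image, the fallback issues 1073741824 / 4096 pwrites:
print(1073741824 // 4096)               # 262144, matching the grep counts
```

Either path leaves every block of the file allocated, but the fallback turns one syscall into hundreds of thousands of small writes, which is where NFS v3 pays the cost.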
# grep OFD /tmp/lockingon.str |wc -l
10
# grep OFD /tmp/lockingoff.str |wc -l
0
# grep OFD /tmp/lockingon.str
20590 10:59:19.168866 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0, l_pid=0}) = 0 <0.000010>
20590 10:59:19.169609 fcntl(11, F_OFD_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=100, l_len=1}) = 0 <0.000012>
20590 10:59:19.169649 fcntl(11, F_OFD_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=101, l_len=1}) = 0 <0.000009>
20590 10:59:19.169686 fcntl(11, F_OFD_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=103, l_len=1}) = 0 <0.000009>
20590 10:59:19.169721 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=200, l_len=1, l_pid=0}) = 0 <0.000008>
20590 10:59:19.169758 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=201, l_len=1, l_pid=0}) = 0 <0.000008>
20590 10:59:19.169793 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=203, l_len=1, l_pid=0}) = 0 <0.000008>
20590 10:59:33.175535 fcntl(11, F_OFD_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=100, l_len=1}) = 0 <1.324324>
20590 10:59:34.499934 fcntl(11, F_OFD_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=101, l_len=1}) = 0 <0.000017>
20590 10:59:34.500007 fcntl(11, F_OFD_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=103, l_len=1}) = 0 <0.000010>
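The trace above is qemu-img's image locking: shared OFD locks on a few individual bytes (offsets 100, 101, 103), taken once right after open and released only before close; note the 1.3 seconds spent in the first unlock. The byte-range pattern can be reproduced with a hypothetical sketch (using classic POSIX record locks via Python's lockf(), which the standard library exposes portably; qemu itself uses F_OFD_SETLK open-file-description locks, which differ mainly in ownership and lifetime semantics):

```python
import fcntl
import os
import tempfile

# Reproduce the byte-range pattern from the trace: shared locks on single
# bytes at offsets 100, 101 and 103, held across all the I/O, then
# released at the end. fcntl.lockf() issues classic POSIX (F_SETLK)
# locks here as a stand-in for qemu's F_OFD_SETLK calls.
fd, path = tempfile.mkstemp()
try:
    for off in (100, 101, 103):
        fcntl.lockf(fd, fcntl.LOCK_SH, 1, off, os.SEEK_SET)  # lock one byte
    # ... the preallocation pwrites happen here, between lock and unlock;
    # on NFS, taking a lock revalidates the client's caches and releasing
    # it flushes dirty data (hence the 1.3 s first unlock in the trace) ...
    os.pwrite(fd, b"\0", 0)
    for off in (100, 101, 103):
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, off, os.SEEK_SET)  # release
    print("locked and released bytes 100, 101, 103")
finally:
    os.close(fd)
    os.unlink(path)
```

The point of single-byte locks at fixed offsets is to encode capabilities (which lock bytes are held or probed) rather than to protect a data range, which is why the locked region is tiny compared to the image.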
- Repeated first test and ran 'ls' within guest at the same time;
- on host;
# qemu-img create -f raw gfwfile 0
Formatting 'gfwfile', fmt=raw size=0
# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 5G
Image resized.
real 0m56.907s
user 0m0.669s
sys 0m3.874s
- in guest;
# echo 3 > /proc/sys/vm/drop_caches ; time ls -R /
real 0m56.907s
user 0m0.669s
sys 0m3.874s
- on host;
# qemu-img create -f raw gfwfile2 0
Formatting 'gfwfile2', fmt=raw size=0
# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 5G
Image resized.
real 0m53.973s
user 0m0.971s
sys 0m3.429s
- in guest (was able to run this twice during execution of 'resize' on host);
# echo 3 > /proc/sys/vm/drop_caches ; time ls -R /
real 0m24.548s
user 0m0.776s
sys 0m2.355s
Another test, this time on the same host as the one where 'qemu-img create' is being run, is to 'dd' the 'metadata' file from the storage domain, which is what VDSM does cyclically. The path is;

/rhev/data-center/mnt/netapp-fas2240-1-ctrl1.gsslab.rdu2.redhat.com:_vol_vol__rhev20170207_amar-rhv41_test/16f69cb1-8c60-4071-87a7-5b53ea1f0016/dom_md/metadata

- With locking;
# pwd
/rhev/data-center/mnt/netapp-fas2240-1-ctrl1.gsslab.rdu2.redhat.com:_vol_vol__rhev20170207_amar-rhv41_test
# qemu-img create -f raw gfwfile 0
Formatting 'gfwfile', fmt=raw size=0
# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 1G
Image resized.
real 0m10.491s
user 0m0.281s
sys 0m0.793s
At the same time, in another window;
726 bytes (726 B) copied, 0.000397695 s, 1.8 MB/s
real 0m0.002s <<<< qemu-img create not running yet
user 0m0.001s
sys 0m0.000s
726 bytes (726 B) copied, 0.079152 s, 9.2 kB/s
real 0m0.191s <<<< qemu-img create started
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.101975 s, 7.1 kB/s
real 0m0.222s
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.112243 s, 6.5 kB/s
real 0m0.203s
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.0857111 s, 8.5 kB/s
real 0m0.192s
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.170831 s, 4.2 kB/s
real 0m0.323s
user 0m0.000s
sys 0m0.002s
726 bytes (726 B) copied, 0.000467613 s, 1.6 MB/s
real 0m0.002s <<<< qemu-img create completed
user 0m0.001s
sys 0m0.000s
- Without locking;
# qemu-img create -f raw gfwfile2 0
Formatting 'gfwfile2', fmt=raw size=0
# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 1G
Image resized.
real 0m10.911s
user 0m0.290s
sys 0m0.740s
At the same time, in another window;
726 bytes (726 B) copied, 0.00037985 s, 1.9 MB/s
real 0m0.002s <<<< qemu-img create not running yet
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.0146004 s, 49.7 kB/s
real 0m0.030s <<<< qemu-img create started
user 0m0.002s
sys 0m0.000s
726 bytes (726 B) copied, 0.0148508 s, 48.9 kB/s
real 0m0.034s
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.0134097 s, 54.1 kB/s
real 0m0.029s
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.00939221 s, 77.3 kB/s
real 0m0.022s
user 0m0.001s
sys 0m0.000s
726 bytes (726 B) copied, 0.00860009 s, 84.4 kB/s
real 0m0.023s
user 0m0.000s
sys 0m0.001s
726 bytes (726 B) copied, 0.0005528 s, 1.3 MB/s
real 0m0.002s <<<< qemu-img create completed
user 0m0.000s
sys 0m0.002s
- Assessment;
With locking enabled, the direct-I/O 'dd' times went from thousandths of a second to tenths of a second. With locking disabled, they went to hundredths of a second. So, a factor of 10 difference. In the customer's environment, they see these same 'dd' commands taking 2.5 seconds or more at times.

I don't think advisory locking (which OFD locking is) should have any impact on the I/O path in theory, so this looks like a kernel problem to me. Let's ask the NFS people to have a look.

I haven't looked at the rest of the bug, just responding to this one point:

(In reply to Kevin Wolf from comment #12)
> I don't think advisory locking (which OFD locking is) should have any impact
> on the I/O path in theory, so this looks like a kernel problem to me.

The client revalidates its caches on lock, and waits for any dirty data to be committed before returning from an unlock. That could definitely have a performance impact depending on how often you're doing it.
(This is for cache consistency; normally, if multiple clients are concurrently accessing the same file, they don't really get any guarantees--unless they use locks around their I/O, in which case they're supposed to see up-to-date results.)

(In reply to J. Bruce Fields from comment #14)
> (In reply to Kevin Wolf from comment #12)
> > I don't think advisory locking (which OFD locking is) should have any impact
> > on the I/O path in theory, so this looks like a kernel problem to me.
>
> The client revalidates its caches on lock, and waits for any dirty data to
> be committed before returning from an unlock. That could definitely have a
> performance impact depending on how often you're doing it.

If you mean that the lock and unlock operations can be somewhat expensive, yes, this is totally expected. qemu-img takes the lock only once, right after opening the file, and unlocks only before closing. The part that I expect to be unaffected is all the I/O that it does between those two calls, and the I/O of other processes to other files on the same filesystem. If I understand correctly, what Gordon reports is that even I/O to other files in different processes is impacted.

What might be relevant is that it doesn't lock the whole file, but only single bytes.

(In reply to Kevin Wolf from comment #15)
> If you mean that the lock and unlock operations can be somewhat expensive,
> yes, this is totally expected. qemu-img takes the lock only once, right
> after opening the file, and unlocks only before closing. The part that I
> expect to be unaffected is all the I/O that it does between those two calls,
> and I/O of other processes to other files on the same filesystem. If I
> understand correctly, what Gordon reports is that even I/O to other files in
> different processes is impacted.

Understood, that sounds weird.

> What might be relevant is that it doesn't lock the whole file, but only
> single bytes.

I wouldn't expect that to make much difference.
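The lock-around-I/O consistency pattern described above can be sketched as follows (a hypothetical illustration using POSIX lockf(); the same semantics apply to the OFD locks qemu uses): a reader that takes a lock gets revalidated caches, and a writer's unlock does not return until its dirty data is committed, so locked readers see up-to-date contents even across NFS clients.

```python
import fcntl
import os
import tempfile

def locked_write(path, data):
    """Write under an exclusive lock; on NFS, releasing the lock waits
    for the dirty data to be committed to the server."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)   # exclusive lock around the write
        os.pwrite(fd, data, 0)
        fcntl.lockf(fd, fcntl.LOCK_UN)   # unlock: dirty data committed first
    finally:
        os.close(fd)

def locked_read(path):
    """Read under a shared lock; on NFS, taking the lock revalidates the
    client's caches, so the read sees the latest committed data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH)   # shared lock: caches revalidated
        data = os.pread(fd, os.fstat(fd).st_size, 0)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return data
    finally:
        os.close(fd)
```

This is exactly why locks are not free on NFS: each lock/unlock implies cache revalidation or a commit, which is cheap when done once per file open (as qemu-img does) but costly if done around every I/O.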
I don't see this as a high enough priority for a RHEL7 accelerated fix. This requires very specific environmental conditions and it does not appear widespread.

On Aug 6, 2020 Red Hat Enterprise Linux 7 entered the Maintenance Support 2 Phase: https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase

That means only "Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released". This BZ does not appear to meet Maintenance Support 2 Phase criteria, so it is being closed WONTFIX. If this is critical for your environment, please provide a thorough business justification and ask that the BZ be re-opened for consideration.