Bug 1851097 - qemu-create and qemu-img convert with -o preallocation=falloc can cause severe impact on NFS <v4.2 file systems
Summary: qemu-create and qemu-img convert with -o preallocation=falloc can cause severe impact on NFS <v4.2 file systems
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: nfs-maint
QA Contact: Filesystem QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-25 15:30 UTC by Gordon Watson
Modified: 2023-10-06 20:50 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-07 17:52:11 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System: Red Hat Knowledge Base (Solution)
ID: 5178081
Last Updated: 2020-06-25 17:20:12 UTC

Description Gordon Watson 2020-06-25 15:30:33 UTC
Description of problem:

This is a follow-on from RHV BZ 1850267. In a RHV environment, we have seen that 'qemu-img convert' and 'qemu-img create' with '-o preallocation=falloc' can cause delays, timeouts, etc., on NFS storage domains using NFS v3.

These commands issue a posix_fallocate() call. The fallocate() system call fails because NFS v3 does not support it. In fact, this is an issue prior to NFS v4.2, which is the first version to support fallocate().

When the fallocate() fails, pwrite() calls are issued to write a single zero byte to each block of the target volume. This by itself does not cause a problem: when we tested it by running 'fallocate --posix' on the customer's system, we did not see the problem. It only happens when 'qemu-img' is used. Looking at 'strace' output, the main difference seems to be that 'qemu-img' uses OFD locking.
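
For illustration, the fallback pattern described above looks roughly like the following. This is a minimal sketch only, assuming a 4 KiB block granularity; it is not QEMU's or glibc's actual preallocation code.

/* Minimal sketch of the preallocation fallback described above (illustration
 * only, not QEMU's implementation). Assumes a 4 KiB block granularity. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int prealloc_fallback(int fd, off_t length)
{
    /* Fast path: NFS v4.2 (and local filesystems) support fallocate(). */
    if (fallocate(fd, 0, 0, length) == 0)
        return 0;

    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return -1;                      /* real failure, not "unsupported" */

    /* Fallback for NFS v3 (and anything before v4.2): write one zero byte
     * per block so the server allocates the whole range. For a 1 GiB image
     * at 4 KiB intervals this is 262144 pwrite() calls. */
    const off_t block = 4096;
    const char zero = 0;
    for (off_t off = 0; off < length; off += block) {
        if (pwrite(fd, &zero, 1, off) != 1)
            return -1;
    }
    return ftruncate(fd, length);       /* make the final size exact */
}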


Version-Release number of selected component (if applicable):

RHV 4.3.9
RHVH 4.3.9 host (RHEL 7.8);
  vdsm-4.30.44-1.el7ev.x86_64  
  libvirt-4.5.0-33.el7.x86_64                
  qemu-kvm-rhev-2.12.0-44.el7.x86_64 
  ioprocess-1.3.1-1.el7ev.x86_64 


How reproducible:

100% in the customer's RHV environment. In-house, I can't reproduce the same symptoms, but I can reproduce a negative impact when 'qemu-img create' is run with '-o preallocation=falloc'.


Steps to Reproduce (not customer's repro steps, just mine);

1. On a RHV host, in an NFS v3 storage domain's root directory, run;

# qemu-img create -f raw -o preallocation=falloc outfile1 5g

2. In a guest using a vdisk in the same storage domain, run;

# echo 3 > /proc/sys/vm/drop_caches ; time ls -R /

3. As a comparison, repeat the above, but instead of 'qemu-img create' run;

# fallocate --posix --length 5g outfile2



Actual results:

'ls -R /' takes around 3.5 seconds in this guest when nothing else is running.
It takes about 22 seconds when 'fallocate --posix' is run at the same time.
It takes about 55 seconds when 'qemu-img create -o preallocation=falloc' is run at the same time.


Expected results:


Additional info:

Comment 8 Gordon Watson 2020-06-29 13:40:46 UTC
Xueqiang,

Well, yes and no. With NFS v4.2, fallocate() is supported and should be successfully executed. Hence, it's significantly faster. The issue that I'm reporting, though, is that with NFS v3, 'qemu-img create -f raw -o preallocation=falloc' is much slower than 'fallocate --posix'. In my testing, it's almost twice as slow. In the customer's RHV environment, it causes severe problems on the SPM host, e.g. I/O timeouts, storage domains going offline, etc. Their NFS server is a NetApp. We have looked at, and are still looking at, that being a factor for the customer. However, just in my test environment, and as you showed in comment #7, 'qemu-img create -f raw -o preallocation=falloc' is much slower than 'fallocate --posix'.

They both try to execute fallocate(), which fails on NFS v3. Then pwrite() system calls are issued to write a single zero byte to each block. However, this is the same for both commands. The only difference I see is that 'qemu-img create' uses OFD locking. It sets locks before and unlocks them after the pwrites. 
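
As an illustration only (not QEMU's actual code), the OFD byte-range locking seen in the strace output further below boils down to fcntl() calls like this; the single-byte offsets follow the trace, everything else is a sketch.

/* Sketch of an open-file-description (OFD) byte-range lock of the kind
 * qemu-img uses for image locking. Illustration only; the single-byte
 * offsets (100, 101, 103, ...) mirror the strace output in this bug. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int ofd_lock_byte(int fd, short type, off_t offset)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));         /* l_pid must be 0 for OFD locks */
    fl.l_type = type;                   /* F_RDLCK to take, F_UNLCK to release */
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = 1;                       /* a single byte, as in the trace */
    return fcntl(fd, F_OFD_SETLK, &fl);
}

/* e.g. ofd_lock_byte(fd, F_RDLCK, 100) right after open(), all the pwrites
 * in between, then ofd_lock_byte(fd, F_UNLCK, 100) before close(). */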

In my tests here, when I run 'qemu-img create -f raw -o preallocation=falloc', it impacts other I/O to that storage domain/NFS server. In the customer's environment, when they run 'qemu-img create -f raw -o preallocation=falloc' manually, it causes VDSM to report read timeouts and storage domains going offline, just the same as when operations that use 'qemu-img create/convert' with '-o preallocation=falloc' are executed by VDSM itself. So, they can replicate the problem outside of RHV/VDSM, and it ends up causing RHV/VDSM problems.

I hope this makes sense and answers your question. If not, please let me know.

Regards, GFW.

Comment 9 Kevin Wolf 2020-06-29 13:56:53 UTC
Let's test whether turning the locking off makes a difference. Unfortunately, this can't be done during image creation, but only when opening an image. For debugging, we can work around it, though, by creating an empty image and then resizing it:

$ qemu-img create -f raw /tmp/test.raw 0
Formatting '/tmp/test.raw', fmt=raw size=0
$ qemu-img resize --preallocation=falloc --image-opts driver=file,filename=/tmp/test.raw,locking=off 1G
Image resized.

For comparison, you can also try locking=on first to verify that it causes the same problems as 'qemu-img create'.

Comment 10 Gordon Watson 2020-06-29 15:32:11 UTC
Kevin,

Thanks for the suggestion. I ran some specific tests (see below). Essentially, 'locking=on' and 'locking=off' resulted in the 'qemu-img resize' taking the same amount of elapsed time, but the impact inside the guest was significantly different. With 'locking=on', the 'ls -R' took as long as the 'resize' (about 55 seconds), and with 'locking=off', I was able to run it twice within the same period of time (24 seconds each).

I ran the first test with a 5GB file, as that is what I had been using before. When I then ran it under 'strace', I used a 1GB file.

Regards, GFW.



-  Basic test with and without 'locking';

# qemu-img create -f raw gfwfile 0
Formatting 'gfwfile', fmt=raw size=0

# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 5G
Image resized.

real	0m55.230s
user	0m0.753s
sys	0m3.787s


# qemu-img create -f raw gfwfile2 0
Formatting 'gfwfile2', fmt=raw size=0

# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 5G
Image resized.

real	0m54.471s
user	0m1.336s
sys	0m2.897s




-  Repeated this under 'strace' (with 1GB file);

# time strace -fttTv -o /tmp/lockingon.str qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 1G
Image resized.

real	0m15.502s
user	0m3.425s
sys	0m10.932s


# time strace -fttTv -o /tmp/lockingoff.str qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 1G
Image resized.

real	0m14.072s
user	0m3.392s
sys	0m10.596s


# grep fallocate /tmp/lockingon.str 
20592 10:59:19.170731 fallocate(11, 0, 0, 1073741824) = -1 EOPNOTSUPP (Operation not supported) <0.000012>

# grep fallocate /tmp/lockingoff.str 
20855 10:59:56.485677 fallocate(11, 0, 0, 1073741824) = -1 EOPNOTSUPP (Operation not supported) <0.000009>


# grep pwrite /tmp/lockingon.str |wc -l
262144

# grep pwrite /tmp/lockingoff.str |wc -l
262144


# grep OFD /tmp/lockingon.str |wc -l
10

# grep OFD /tmp/lockingoff.str |wc -l
0


# grep OFD /tmp/lockingon.str 
20590 10:59:19.168866 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0, l_pid=0}) = 0 <0.000010>
20590 10:59:19.169609 fcntl(11, F_OFD_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=100, l_len=1}) = 0 <0.000012>
20590 10:59:19.169649 fcntl(11, F_OFD_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=101, l_len=1}) = 0 <0.000009>
20590 10:59:19.169686 fcntl(11, F_OFD_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=103, l_len=1}) = 0 <0.000009>
20590 10:59:19.169721 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=200, l_len=1, l_pid=0}) = 0 <0.000008>
20590 10:59:19.169758 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=201, l_len=1, l_pid=0}) = 0 <0.000008>
20590 10:59:19.169793 fcntl(11, F_OFD_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=203, l_len=1, l_pid=0}) = 0 <0.000008>
20590 10:59:33.175535 fcntl(11, F_OFD_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=100, l_len=1}) = 0 <1.324324>
20590 10:59:34.499934 fcntl(11, F_OFD_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=101, l_len=1}) = 0 <0.000017>
20590 10:59:34.500007 fcntl(11, F_OFD_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=103, l_len=1}) = 0 <0.000010>
 



-  Repeated first test and ran 'ls' within guest at the same time;
 
   - on host;

# qemu-img create -f raw gfwfile 0
Formatting 'gfwfile', fmt=raw size=0

# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 5G
Image resized.

real	0m56.907s
user	0m0.669s
sys	0m3.874s


   - in guest;

# echo 3 > /proc/sys/vm/drop_caches ; time ls -R /

real	0m56.907s
user	0m0.669s
sys	0m3.874s



   - on host;

# qemu-img create -f raw gfwfile2 0
Formatting 'gfwfile2', fmt=raw size=0

# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 5G
Image resized.

real	0m53.973s
user	0m0.971s
sys	0m3.429s


   - in guest (was able to run this twice during execution of 'resize' on host);

# echo 3 > /proc/sys/vm/drop_caches ; time ls -R /			

real	0m24.548s
user	0m0.776s
sys	0m2.355s

Comment 11 Gordon Watson 2020-06-29 17:15:47 UTC
Another test, this time on the same host where 'qemu-img create' is being run, is to 'dd' the 'metadata' file from the storage domain, which is what VDSM does cyclically. The path is;

  /rhev/data-center/mnt/netapp-fas2240-1-ctrl1.gsslab.rdu2.redhat.com:_vol_vol__rhev20170207_amar-rhv41_test/16f69cb1-8c60-4071-87a7-5b53ea1f0016/dom_md/metadata


-  With locking;

# pwd
/rhev/data-center/mnt/netapp-fas2240-1-ctrl1.gsslab.rdu2.redhat.com:_vol_vol__rhev20170207_amar-rhv41_test

# qemu-img create -f raw gfwfile 0
Formatting 'gfwfile', fmt=raw size=0

# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile,locking=on 1G
Image resized.

real	0m10.491s
user	0m0.281s
sys	0m0.793s


At the same time, in another window;

726 bytes (726 B) copied, 0.000397695 s, 1.8 MB/s

real	0m0.002s                                      <<<< qemu-img create not running yet
user	0m0.001s
sys	0m0.000s

726 bytes (726 B) copied, 0.079152 s, 9.2 kB/s

real	0m0.191s                                      <<<< qemu-img create started
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.101975 s, 7.1 kB/s

real	0m0.222s
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.112243 s, 6.5 kB/s

real	0m0.203s
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.0857111 s, 8.5 kB/s

real	0m0.192s
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.170831 s, 4.2 kB/s

real	0m0.323s
user	0m0.000s
sys	0m0.002s

726 bytes (726 B) copied, 0.000467613 s, 1.6 MB/s

real	0m0.002s                                      <<<< qemu-img create completed
user	0m0.001s
sys	0m0.000s




-  Without locking;

# qemu-img create -f raw gfwfile2 0
Formatting 'gfwfile2', fmt=raw size=0

# time qemu-img resize --preallocation=falloc --image-opts driver=file,filename=gfwfile2,locking=off 1G
Image resized.

real	0m10.911s
user	0m0.290s
sys	0m0.740s



At the same time, in another window;

726 bytes (726 B) copied, 0.00037985 s, 1.9 MB/s

real	0m0.002s                                      <<<< qemu-img create not running yet
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.0146004 s, 49.7 kB/s

real	0m0.030s                                      <<<< qemu-img create started
user	0m0.002s
sys	0m0.000s

726 bytes (726 B) copied, 0.0148508 s, 48.9 kB/s

real	0m0.034s
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.0134097 s, 54.1 kB/s

real	0m0.029s
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.00939221 s, 77.3 kB/s

real	0m0.022s
user	0m0.001s
sys	0m0.000s

726 bytes (726 B) copied, 0.00860009 s, 84.4 kB/s

real	0m0.023s
user	0m0.000s
sys	0m0.001s

726 bytes (726 B) copied, 0.0005528 s, 1.3 MB/s

real	0m0.002s                                      <<<< qemu-img create completed
user	0m0.000s
sys	0m0.002s




-  Assessment;

With locking enabled, the direct I/O 'dd' times went from thousandths of a second to tenths of a second. With locking disabled, they only went to hundredths of a second. So, roughly a factor of 10 difference.

In the customer's environment, they see these same 'dd' commands taking 2.5 seconds or more at times.

Comment 12 Kevin Wolf 2020-06-30 11:02:16 UTC
I don't think advisory locking (which OFD locking is) should have any impact on the I/O path in theory, so this looks like a kernel problem to me.

Let's ask the NFS people to have a look.

Comment 14 J. Bruce Fields 2020-07-01 01:50:11 UTC
I haven't looked at the rest of the bug, just responding to this one point:

(In reply to Kevin Wolf from comment #12)
> I don't think advisory locking (which OFD locking is) should have any impact
> on the I/O path in theory, so this looks like a kernel problem to me.

The client revalidates its caches on lock, and waits for any dirty data to be committed before returning from an unlock.  That could definitely have a performance impact depending on how often you're doing it.

(This is for cache consistency; normally if multiple clients are concurrently accessing the same file, they don't really get any guarantees--unless they use locks around their IO, in which case they're supposed to see up-to-date results.)
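
To make the unlock cost visible in isolation, a standalone test along the following lines could be used. This is a hedged sketch, not taken from this bug: the path /mnt/nfs/testfile and the 64 MiB write size are arbitrary assumptions; the point is that the F_UNLCK only returns once the dirty data has been committed to the server.

/* Sketch: time an OFD unlock after dirtying data on an NFS mount.
 * Path and sizes are arbitrary; illustration of the behaviour described
 * above, not a test from this bug. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = { .l_type = F_RDLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    fcntl(fd, F_OFD_SETLK, &fl);            /* lock: client revalidates caches */

    char buf[4096] = { 0 };
    for (off_t off = 0; off < 64 * 1024 * 1024; off += (off_t)sizeof(buf))
        pwrite(fd, buf, sizeof(buf), off);  /* dirty 64 MiB of page cache */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fl.l_type = F_UNLCK;
    fcntl(fd, F_OFD_SETLK, &fl);            /* unlock: waits for dirty data to be committed */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("unlock took %.3f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    close(fd);
    return 0;
}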

Comment 15 Kevin Wolf 2020-07-01 09:10:40 UTC
(In reply to J. Bruce Fields from comment #14)
> (In reply to Kevin Wolf from comment #12)
> > I don't think advisory locking (which OFD locking is) should have any impact
> > on the I/O path in theory, so this looks like a kernel problem to me.
> 
> The client revalidates its caches on lock, and waits for any dirty data to
> be committed before returning from an unlock.  That could definitely have a
> performance impact depending on how often you're doing it.

If you mean that the lock and unlock operations can be somewhat expensive, yes, this is totally expected. qemu-img takes the lock only once, right after opening the file, and unlocks only before closing. The part that I expect to be unaffected is all the I/O that it does between those two calls, and I/O of other processes to other files on the same filesystem. If I understand correctly, what Gordon reports is that even I/O to other files in different processes is impacted.

What might be relevant is that it doesn't lock the whole file, but only single bytes.

Comment 16 J. Bruce Fields 2020-07-01 21:25:27 UTC
(In reply to Kevin Wolf from comment #15)
> If you mean that the lock and unlock operations can be somewhat expensive,
> yes, this is totally expected. qemu-img takes the lock only once, right
> after opening the file, and unlocks only before closing. The part that I
> expect to be unaffected is all the I/O that it does between those two calls,
> and I/O of other processes to other files on the same filesystem. If I
> understand correctly, what Gordon reports is that even I/O to other files in
> different processes is impacted.

Understood, that sounds weird.

> What might be relevant is that it doesn't lock the whole file, but only
> single bytes.

I wouldn't expect that to make much difference.

Comment 25 Dave Wysochanski 2020-10-07 17:52:11 UTC
I don't see this as a high enough priority for a RHEL7 accelerated fix. This requires very specific environmental conditions and does not appear to be widespread.

On Aug 6, 2020 Red Hat Enterprise Linux 7 entered Maintenance Support 2 Phase: https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
That means only "Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released". This BZ does not appear to meet the Maintenance Support 2 Phase criteria, so it is being closed WONTFIX. If this is critical for your environment, please provide a thorough business justification and ask that the BZ be re-opened for consideration.

